For the impatient: what should I choose?
DESeq2 and edgeR are the two most widely used tools for differential expression analysis of RNA-seq count data. Both use Negative Binomial generalized linear models, and both are mature, well-validated, and actively maintained.
They often agree on strong signals, but they can diverge on borderline genes, especially with small sample sizes, strong composition effects, or outliers, because dispersion estimation and hypothesis testing differ. This guide covers both the statistical machinery and the experimental design specification, which is where most real-world errors originate.
1. The Shared Statistical Foundation
For each gene g in each sample i, DESeq2 and edgeR treat the read count as:
- a count value that varies across samples, and
- more variable than a Poisson model would allow (because biology adds extra noise).
So they both use a negative binomial model:
- μgi (mu) = the expected/average count for that gene in that sample
- αg (alpha) = the extra variability for that gene (dispersion)
They then say the expected count is determined by two things:
- How deep that sample was sequenced, captured by the size factor si (normalization)
- What condition/batch/covariates the sample belongs to, captured by the design matrix xi and the gene's coefficients βg
βg are basically the log fold changes for that gene under your model.
Bottom line: both packages use the same core idea:
\[K_{gi} \sim \mathrm{NB}(\mu_{gi},\ \alpha_g), \qquad \mu_{gi} = s_i \, q_{gi}, \qquad \log_2 q_{gi} = x_i \beta_g\]
In words: counts scatter, with gene-specific dispersion αg, around an expected value set by the sample's sequencing depth (si) and the condition/batch effects (xi βg). That's why the "DESeq2 vs edgeR" differences are mostly about implementation details, especially:
- how they estimate dispersion (αg),
- how they test for differences in βg (differential expression),
- and how they shrink/report fold changes and p-values.
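To see why a plain Poisson model is too rigid, recall the negative binomial variance is μ + αμ², while Poisson variance is just μ. A quick base-R simulation (rnbinom parameterizes size = 1/α) makes the extra-Poisson spread concrete:

```r
# Negative binomial variance: Var = mu + alpha * mu^2 (Poisson would give just mu)
set.seed(1)
mu    <- 100   # expected count for this gene
alpha <- 0.2   # gene-wise dispersion

x <- rnbinom(2e5, mu = mu, size = 1 / alpha)

mean(x)   # close to 100
var(x)    # close to 100 + 0.2 * 100^2 = 2100, far above Poisson's 100
```

The biological replicates in a real experiment play the role of these simulated draws: their spread around the mean is what the dispersion parameter has to capture.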
2. Normalization
Both methods assume that most genes are not differentially expressed. Both can struggle under global transcriptome shifts (e.g., transcriptome-wide up-regulation) unless external controls are introduced.
| Feature | DESeq2 | edgeR |
|---|---|---|
| Normalization | Median-of-ratios (size factors) | TMM (trimmed mean of M-values) |
| Dispersion | Trend + empirical Bayes shrinkage | Trend + shrinkage + quasi-likelihood layer |
| Recommended test | Wald (pairwise) / LRT (omnibus) | Quasi-likelihood F-test |
| LFC shrinkage | Central feature (apeglm / ashr) | Available, less central |
| Filtering | Automatic independent filtering | User-driven: filterByExpr() |
| Outlier handling | Cook's distance (needs adequate n) | Robust QL estimation |
| UX philosophy | Safe defaults, wrapped steps | Modular, explicit, flexible |
When assumptions break: If you suspect global shifts, consider spike-in normalization, a set of empirically stable reference genes, or at minimum examine MA plots of normalized counts for asymmetry. These are escape hatches, not defaults.
In routine bulk RNA-seq, normalization is rarely the deciding factor between these packages. Design and dispersion estimation matter more.
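For intuition, DESeq2's median-of-ratios size factors can be sketched in a few lines of base R. This is a simplified version: the real estimateSizeFactors() handles edge cases (all-zero genes, alternative estimators) more carefully.

```r
# Median-of-ratios (DESeq2-style) size factors, simplified sketch.
# Reference = per-gene geometric mean across samples; each sample's size
# factor is the median ratio of its counts to that reference.
median_of_ratios <- function(counts) {
  log_geo_means <- rowMeans(log(counts))        # -Inf for genes with any zero
  apply(counts, 2, function(col) {
    use <- is.finite(log_geo_means) & col > 0   # drop genes unusable for ratios
    exp(median(log(col[use]) - log_geo_means[use]))
  })
}

# Toy check: sample 2 is the same library sequenced twice as deep
counts <- matrix(c(10, 20, 30,
                   20, 40, 60), ncol = 2)
median_of_ratios(counts)   # the two size factors differ by a factor of 2
```

Because the median ratio is taken over all genes, a handful of strongly differential genes cannot drag the size factor around; this is where the "most genes are not DE" assumption enters.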
3. Dispersion Estimation
Estimating per-gene dispersion from small sample sizes is inherently noisy. Both packages borrow information across genes (empirical Bayes shrinkage), but they do so differently.
DESeq2 fits a mean-dispersion trend and shrinks each gene's estimate toward it. edgeR's QL pipeline does the same, but adds a quasi-likelihood variance layer on top: an additional gene-specific parameter that provides better control of false positive rates, particularly with small n or heterogeneous variance.
Key distinction: edgeR's QL framework is often more conservative and better calibrated for Type I error control in challenging settings. DESeq2's dispersion estimation is stable and well-validated but does not include this additional QL layer.
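The idea of borrowing strength can be caricatured as a weighted average on the log scale. This is a conceptual sketch only, not either package's actual estimator: in the real methods the weight comes from an empirical Bayes prior and the number of replicates.

```r
# Caricature of dispersion shrinkage: pull a noisy gene-wise estimate toward
# the fitted mean-dispersion trend on the log scale. NOT the real DESeq2 or
# edgeR estimator; their weights are derived from data, not fixed.
shrink_dispersion <- function(genewise, trend, weight = 0.7) {
  exp(weight * log(genewise) + (1 - weight) * log(trend))
}

shrink_dispersion(0.4, 0.1, weight = 0.5)   # 0.2: geometric mean of estimate and trend
```

With more replicates the gene-wise estimate is trusted more (weight rises toward 1); with few replicates the trend dominates.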
4. Hypothesis Testing
Each package offers multiple testing procedures. Comparing results across packages without accounting for the test used is a common source of confusion.
DESeq2
- Wald test (default): tests whether a single coefficient ≠ 0. Fast, standard for pairwise comparisons.
- LRT: full vs reduced model. For omnibus questions (any effect across ≥ 3 levels).
edgeR
- QL F-test (recommended): robust error control, especially with small n.
- LRT options exist but QL is the standard recommendation.
Warning: Comparing DESeq2 Wald test results to edgeR QL F-test results is comparing different inferential procedures, not "packages." Be explicit about which test you used in your methods.
5. Log Fold-Change Shrinkage
DESeq2 provides a polished workflow for shrinking noisy log2FC estimates toward zero, preventing low-count genes from producing extreme fold changes that dominate visualizations and rankings. edgeR can moderate estimates through its modeling framework, but LFC shrinkage is not as prominently integrated into the default workflow.
Common misconception: Shrinkage primarily affects effect size estimates, ranking, and visualization. Your p-values come from the model fit; shrinkage is typically applied post hoc for interpretability, not for significance testing.
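The mechanics can be illustrated with the simplest possible shrinkage estimator: a posterior mean under a normal prior centered at zero. This is a pedagogical stand-in; apeglm and ashr use adaptive, heavier-tailed priors that shrink large well-supported effects much less.

```r
# Toy LFC shrinkage: posterior mean under a Normal(0, prior_sd^2) prior.
# Pedagogical only; apeglm/ashr fit the prior to the data instead.
shrink_lfc <- function(lfc, se, prior_sd = 1) {
  lfc * prior_sd^2 / (prior_sd^2 + se^2)
}

shrink_lfc(lfc = 4, se = 2)     # 0.8   : noisy low-count gene, shrunk hard
shrink_lfc(lfc = 4, se = 0.1)   # ~3.96 : well-estimated gene, barely moved
```

The pattern to notice: shrinkage is driven by the standard error, so precisely the low-count genes that produce absurd raw fold changes are the ones pulled toward zero.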
6. Filtering and Outliers
Genes with near-zero counts across samples cannot support reliable dispersion or fold-change estimation and inflate the multiple-testing burden. Filtering them is essential, not optional.
DESeq2
Independent filtering: automatically selects a threshold to maximize discoveries at the target FDR. Cook's distance flags outlier samples per gene, but only works reliably with adequate replicates. With very small n, treat outliers as a QC/design issue.
edgeR
User-driven pre-filtering: filterByExpr() is preferred over hand-rolled CPM thresholds because it accounts for group structure. Outlier robustness comes from the QL framework.
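For intuition only, a bare-bones CPM filter looks like the sketch below. Prefer filterByExpr() in practice: it additionally adapts the count threshold to the library sizes rather than using a fixed CPM cutoff.

```r
# Simplified expression filter: keep genes with >= min_cpm in at least as many
# samples as the smallest group. filterByExpr() is the real, smarter version.
keep_expressed <- function(counts, group, min_cpm = 1) {
  cpm <- t(t(counts) / colSums(counts)) * 1e6   # counts per million per sample
  rowSums(cpm >= min_cpm) >= min(table(group))  # must pass in >= smallest group
}

counts <- rbind(expressed = c(100, 120, 90, 110),
                silent    = c(0,   0,   0,  1))
keep_expressed(counts, group = c("ctrl", "ctrl", "treat", "treat"))
```

Requiring expression in at least the smallest group's worth of samples keeps genes that are switched on in only one condition, which a naive "expressed in all samples" rule would discard.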
7. Recommended Default Pipelines
For a standard pairwise comparison, the unpaired pipelines in the next section are the recommended defaults: DESeq2 with the Wald test plus apeglm shrinkage, or edgeR with filterByExpr() plus the QL F-test.
8. Experimental Design Specification
This is where most real-world errors originate: not in the choice of package, but in the choice of model formula.
Unpaired (independent samples)
Samples in each group are independent biological replicates. No pairing, blocking, or additional covariates.
Example: 3 control vs 3 treated mice, all different animals.
Design formula: ~ condition
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","treated","control"))
res <- lfcShrink(dds, coef = "condition_treated_vs_control",
type = "apeglm")
edgeR:
design <- model.matrix(~ condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontreated")
Paired design
The same subject is measured under two conditions. Including the subject as a blocking factor absorbs inter-subject variability, so the condition test is based on within-subject differences.
Example: 4 patients, each contributing a tumor and a normal sample.
Design formula: ~ patient + condition
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ patient + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","tumor","normal"))
res <- lfcShrink(dds, coef = "condition_tumor_vs_normal",
type = "apeglm")
edgeR:
design <- model.matrix(~ patient + condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontumor")
Note: Without the pairing term, inter-patient variability inflates the error estimate and reduces power. In strongly paired data, this can be the difference between finding 50 vs 500 DE genes.
Batch correction (blocking)
Batch effects are systematic technical variation. If batch is not confounded with condition, include it as a blocking factor.
Example: Samples processed in 2 batches, with conditions represented in both.
Design formula: ~ batch + condition
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","treat","ctrl"))
edgeR:
design <- model.matrix(~ batch + condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontreat")
Note: If batch and condition are perfectly confounded (all controls in batch 1, all treated in batch 2), no method can separate the effects. This is a design failure.
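You can catch this before fitting anything: a fully confounded design matrix is rank-deficient. A quick base-R check, using a deliberately confounded toy layout:

```r
# All controls in batch 1, all treated in batch 2: batch and condition collide
coldata <- data.frame(batch     = factor(c(1, 1, 2, 2)),
                      condition = factor(c("ctrl", "ctrl", "treat", "treat")))
design <- model.matrix(~ batch + condition, data = coldata)

qr(design)$rank < ncol(design)   # TRUE: rank-deficient, effects inseparable
```

Both packages will error out on such a design; running this check on your own coldata before fitting makes the failure obvious at the planning stage.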
Interaction (treatment × genotype)
Tests whether the treatment effect depends on genotype. The interaction term captures this difference.
Example: 2 genotypes (WT, KO) × 2 treatments (ctrl, treat), with replicates.
Design formula: ~ genotype * treatment
Interpreting interactions:
- treatment coefficient = treatment effect in the reference genotype (WT).
- interaction = difference in treatment effect between KO and WT.
- Treatment effect in KO = treatment + interaction.
A significant interaction does not mean "no effect"; it means the effect differs between genotypes.
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ genotype * treatment)
dds <- DESeq(dds)
# Interaction: does treatment effect differ by genotype?
res_int <- results(dds, name = "genotypeKO.treatmenttreat")
# Treatment effect in WT (reference genotype):
res_wt <- results(dds, contrast = c("treatment","treat","ctrl"))
edgeR:
design <- model.matrix(~ genotype * treatment, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
# Interaction term:
qlf_int <- glmQLFTest(fit, coef = "genotypeKO:treatmenttreat")
# Treatment effect in WT:
qlf_wt <- glmQLFTest(fit, coef = "treatmenttreat")
# Treatment effect in KO (treatment + interaction):
# makeContrasts() requires syntactically valid names, which the ":" in
# interaction column names is not, so use a numeric contrast over the
# design columns: (Intercept), genotypeKO, treatmenttreat, genotypeKO:treatmenttreat
qlf_ko <- glmQLFTest(fit, contrast = c(0, 0, 1, 1))
Time-course (omnibus test)
With ≥ 3 time points, test whether any change occurs across time. This is an omnibus test: it identifies genes with any temporal pattern, not the direction.
Example: Samples at 0h, 6h, 12h, 24h with 3 replicates each.
Design formula: ~ time (tested via LRT / omnibus QL)
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ time)
# LRT: full model vs intercept-only (any time effect?)
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)
res <- results(dds)
edgeR:
design <- model.matrix(~ time, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
# Test all time coefficients simultaneously (ANOVA-like)
qlf <- glmQLFTest(fit, coef = 2:ncol(design))
Note: Omnibus significance tells you some difference exists across time, not which time points differ or in which direction. Follow up with planned contrasts or clustering to interpret temporal patterns.
Continuous covariates (age, RIN)
Continuous variables can be included directly in the formula to adjust for technical or biological confounders.
Example: Adjusting for age and RNA integrity (RIN) while testing treatment.
Design formula: ~ age + RIN + condition
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ age + RIN + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","treat","ctrl"))
edgeR:
design <- model.matrix(~ age + RIN + condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontreat")
On covariates and formula order: Include covariates to reduce confounding and improve precision. In GLMs, coefficient estimates do not depend on term order (unlike Type I sums of squares in ANOVA). What matters is reference levels and contrasts β be explicit about them, as those choices affect interpretation. Only include covariates that are biologically or technically justified; irrelevant covariates waste degrees of freedom.
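The order-invariance claim is easy to verify with an ordinary linear model in base R; the same property holds for the GLM fits in both packages.

```r
# Coefficient estimates do not depend on the order of terms in the formula
set.seed(1)
d <- data.frame(a = rnorm(20), b = rnorm(20))
d$y <- 1 + 2 * d$a - d$b + rnorm(20)

f1 <- coef(lm(y ~ a + b, data = d))
f2 <- coef(lm(y ~ b + a, data = d))
all.equal(f1[c("a", "b")], f2[c("a", "b")])   # TRUE
```

What does change with modeling choices is the meaning of each coefficient, which is set by factor reference levels and the contrasts you request.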
9. QC and Diagnostics
No results section is complete without these checks.
Pre-alignment
- Library sizes across samples: flag outlier libraries. Large imbalances may indicate failed preps.
- Mapping rates & duplication: low mapping or high duplication can indicate degraded RNA or contamination.
Post-normalization
- PCA / sample distance heatmap: check for batch effects, outliers, and whether conditions separate. The single most informative plot.
- MA plot of normalized counts: should show balanced fold changes around zero. Asymmetry suggests normalization failure.
Model diagnostics
- Dispersion plot (mean vs dispersion): gene-wise estimates should scatter around the fitted trend. Wild outliers or flat trends indicate problems.
- Check for confounded batch + condition: inspect the design matrix. If any factor is perfectly correlated with the variable of interest, the model is unidentifiable.
Results-level
- P-value histogram: should be uniform with a spike near 0 (true positives). U-shapes or spikes at 1 indicate model misspecification.
- Volcano / MA plot of results: sanity-check effect sizes and significance patterns. Inspect the top hits manually against raw counts.
10. Common Failure Modes
| Issue | Problem |
|---|---|
| n = 1 per group | Cannot estimate biological variability. DE is not possible. |
| Batch ≡ condition | Perfectly confounded. No method can separate the effects. |
| Unmodeled pairing | Treating paired samples as independent inflates error and kills power. |
| Overfitting covariates | Adding irrelevant covariates wastes degrees of freedom with small n. |
| No pre-filtering | Inflates multiple-testing burden and distorts FDR. |
| Intersecting two tools | Not validation: both use NB GLMs. Reduces sensitivity for false comfort. |
Rules of Thumb
- If unsure and doing standard bulk RNA-seq, DESeq2 is the lowest-risk default.
- If sample sizes are small or you need conservative inference, edgeR + QL is a strong choice.
- Do not intersect results from both packages and call it validation. Use diagnostics and validate the biology.
- Spend more time on design, contrasts, and QC than on package selection. The formula you write matters more than the function you call.
Covers bulk RNA-seq and pseudo-bulk DE as of current best practices. Consult DESeq2 and edgeR vignettes for the latest recommendations.