DESeq2 vs edgeR: A Practical Guide for RNA-seq Differential Expression

A practical, under-the-hood guide for choosing between DESeq2 and edgeR for RNA-seq differential expression analysis, covering statistical foundations, design specification, and best practices.

For the impatient: what should I choose?

Decision flow:

  • What type of RNA-seq data?
    • Single-cell (cell-level): use dedicated scRNA-seq tools, or aggregate to pseudo-bulk.
    • Bulk RNA-seq or pseudo-bulk (scRNA-seq): how many biological replicates?
      • n = 1 (no replicates): STOP. DE is not reliable; biological variability cannot be estimated.
      • n = 2–3 (small): edgeR QL pipeline, for better Type I error control with small sample sizes.
      • n ≥ 4 (moderate or more): what matters most?
        • Safe defaults + LFC shrinkage: DESeq2.
        • Maximum control + QL calibration: edgeR QL.
        • Already in the limma ecosystem: limma-voom.
Figure 1. Decision flowchart for choosing between DESeq2, edgeR, and limma-voom based on data type, sample size, and analysis priorities.

DESeq2 and edgeR are the two most widely used tools for differential expression analysis of RNA-seq count data. Both use Negative Binomial generalized linear models, and both are mature, well-validated, and actively maintained.

They often agree on strong signals, but they can diverge on borderline genes (especially with small sample sizes, strong composition effects, or outliers) because dispersion estimation and hypothesis testing differ. This guide covers both the statistical machinery and the experimental design specification, which is where most real-world errors originate.

Figure 2. Overview of DESeq2 and edgeR for RNA-seq differential expression analysis.

1. The Shared Statistical Foundation


For each gene g in each sample i, DESeq2 and edgeR treat the read count as:

  • a count value that varies across samples, and
  • more variable than a Poisson model would allow (because biology adds extra noise).

So they both use a negative binomial model:

  • μgi (mu) = the expected/average count for that gene in that sample
  • αg (alpha) = the extra variability for that gene (dispersion)

They then say the expected count is determined by two things:

  1. How deeply that sample was sequenced, captured by the size factor si (normalization)
  2. Which condition/batch/covariates the sample belongs to, captured by the design matrix row xi and the gene's coefficients βg

The entries of βg are essentially the log fold changes for that gene under your model.

Bottom line: both packages fit the same core model, on the log scale:

\[\text{counts}_{gi} \sim \mathrm{NB}(\mu_{gi},\, \alpha_g), \qquad \log \mu_{gi} = \log s_i + x_i^\top \beta_g\]

In words: (log expected count) = (normalization for sample) + (effects of condition/batch/etc.), with the dispersion αg capturing the extra biological variability.

And that’s why the β€œDESeq2 vs edgeR” differences are mostly about implementation details, especially:

  • how they estimate dispersion (αg),
  • how they test for differences in βg (differential expression),
  • and how they shrink/report fold changes and p-values.
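A quick simulation makes the shared model concrete. In R's rnbinom() parameterization, size = 1/α, so the variance is μ + αμ², noticeably wider than Poisson. This is an illustrative sketch, not package code:

```r
set.seed(1)
mu    <- 100       # expected count for one gene
alpha <- 0.2       # gene-wise dispersion
pois  <- rpois(1e5, lambda = mu)
nb    <- rnbinom(1e5, mu = mu, size = 1 / alpha)
var(pois)   # close to mu (about 100)
var(nb)     # close to mu + alpha * mu^2 (about 2100)
```

The extra αμ² term is the biological noise that both packages spend most of their machinery estimating.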

2. Normalization

Both methods assume that most genes are not differentially expressed. Both can struggle under global transcriptome shifts (e.g., transcriptome-wide up-regulation) unless external controls are introduced.

Feature comparison:

  • Normalization: DESeq2 uses median-of-ratios size factors; edgeR uses TMM (trimmed mean of M-values).
  • Dispersion: DESeq2 uses a trend plus empirical Bayes shrinkage; edgeR adds a quasi-likelihood layer on top of trend and shrinkage.
  • Recommended test: DESeq2, Wald (pairwise) or LRT (omnibus); edgeR, quasi-likelihood F-test.
  • LFC shrinkage: a central feature in DESeq2 (apeglm / ashr); available in edgeR but less central.
  • Filtering: automatic independent filtering in DESeq2; user-driven filterByExpr() in edgeR.
  • Outlier handling: Cook's distance in DESeq2 (needs adequate n); robust QL estimation in edgeR.
  • UX philosophy: DESeq2 favors safe defaults and wrapped steps; edgeR is modular, explicit, and flexible.

When assumptions break: If you suspect global shifts, consider spike-in normalization, a set of empirically stable reference genes, or at minimum examine MA plots of normalized counts for asymmetry. These are escape hatches, not defaults.

In routine bulk RNA-seq, normalization is rarely the deciding factor between these packages. Design and dispersion estimation matter more.
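To make the normalization idea concrete, here is a minimal sketch of DESeq2-style median-of-ratios size factors in base R (illustrative only; the real estimateSizeFactors() handles zero counts and other edge cases):

```r
# Toy count matrix: 3 genes (rows) x 3 samples (columns)
counts <- matrix(c(10, 20,  30,
                   100, 210, 290,
                   8,  15,  25),
                 nrow = 3, byrow = TRUE)
ref    <- exp(rowMeans(log(counts)))   # per-gene geometric mean ("reference sample")
ratios <- counts / ref                 # each sample compared to the reference
size_factors <- apply(ratios, 2, median)
size_factors                           # one scaling factor per sample
```

TMM follows a similar most-genes-unchanged logic but trims extreme M-values (log-ratios) and A-values (abundances) before averaging.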


3. Dispersion Estimation

Estimating per-gene dispersion from small sample sizes is inherently noisy. Both packages borrow information across genes (empirical Bayes shrinkage), but they do so differently.

DESeq2 fits a mean–dispersion trend and shrinks each gene's estimate toward it. edgeR's QL pipeline does the same, but adds a quasi-likelihood variance layer on top: an additional gene-specific parameter that provides better control of false positive rates, particularly with small n or heterogeneous variance.

Key distinction: edgeR’s QL framework is often more conservative and better calibrated for Type I error control in challenging settings. DESeq2’s dispersion estimation is stable and well-validated but does not include this additional QL layer.
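Both packages ship diagnostic plots for these estimates. The calls below assume the fitted objects (dds, y, fit) produced by the pipelines later in this guide:

```r
plotDispEsts(dds)   # DESeq2: gene-wise, fitted trend, and final shrunken values
plotBCV(y)          # edgeR: biological coefficient of variation vs abundance
plotQLDisp(fit)     # edgeR QL: quasi-likelihood dispersions after glmQLFit()
```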


4. Hypothesis Testing

Each package offers multiple testing procedures. Comparing results across packages without accounting for the test used is a common source of confusion.

DESeq2

  • Wald test (default): tests whether a single coefficient ≠ 0. Fast, standard for pairwise comparisons.
  • LRT: full vs reduced model. For omnibus questions (any effect across ≥ 3 levels).

edgeR

  • QL F-test (recommended): robust error control, especially with small n.
  • LRT options exist but QL is the standard recommendation.

Warning: Comparing DESeq2 Wald test results to edgeR QL F-test results is comparing different inferential procedures, not "packages." Be explicit about which test you used in your methods.


5. Log Fold-Change Shrinkage

DESeq2 provides a polished workflow for shrinking noisy log2FC estimates toward zero, preventing low-count genes from producing extreme fold changes that dominate visualizations and rankings. edgeR can moderate estimates through its modeling framework, but LFC shrinkage is not as prominently integrated into the default workflow.

Common misconception: Shrinkage primarily affects effect size estimates, ranking, and visualization. Your p-values come from the model fit; shrinkage is typically applied post hoc for interpretability, not for significance testing.
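One way to see this is to compare shrunken and unshrunken estimates side by side (a sketch, assuming a fitted dds with a condition_treated_vs_control coefficient as in the unpaired example below):

```r
res_raw    <- results(dds, name = "condition_treated_vs_control")
res_shrunk <- lfcShrink(dds, coef = "condition_treated_vs_control",
                        type = "apeglm")
# Same genes, same p-values; only the LFC estimates (and their SEs) change,
# most visibly for low-count genes:
plot(res_raw$log2FoldChange, res_shrunk$log2FoldChange,
     xlab = "unshrunken log2FC", ylab = "apeglm-shrunken log2FC")
abline(0, 1, lty = 2)
```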


6. Filtering and Outliers

Genes with near-zero counts across samples cannot support reliable dispersion or fold-change estimation and inflate the multiple-testing burden. Filtering them is essential, not optional.

DESeq2

Independent filtering: automatically selects a threshold to maximize discoveries at the target FDR. Cook's distance flags per-gene outlier counts, but it only works reliably with adequate replicates. With very small n, treat outliers as a QC/design issue.

edgeR

User-driven pre-filtering: filterByExpr() is preferred over hand-rolled CPM thresholds because it accounts for group structure. Outlier robustness comes from the QL framework.


7. Recommended Default Pipelines

These are the recommended defaults for a standard pairwise comparison.

DESeq2 (steps 1–3 are wrapped with safe defaults; few decisions needed for a standard analysis):

  1. DESeqDataSetFromMatrix(): counts + coldata + design formula
  2. DESeq(): normalization + dispersion + GLM fit
  3. results(): Wald or LRT + independent filtering
  4. lfcShrink(type = "apeglm"): shrink fold changes for ranking/plots; does not change p-values

edgeR QL pipeline (each step explicit: more decisions, maximum control and flexibility):

  1. DGEList() + calcNormFactors(): create object + TMM normalization
  2. filterByExpr(): group-aware low-count removal
  3. estimateDisp(): common + trended + gene-wise dispersion
  4. glmQLFit(): QL GLM with the extra variance layer for calibration
  5. glmQLFTest(): QL F-test on a coefficient or contrast

8. Experimental Design Specification

This is where most real-world errors originate: not in the choice of package, but in the choice of model formula.

Unpaired (independent samples)

Samples in each group are independent biological replicates. No pairing, blocking, or additional covariates.

Example: 3 control vs 3 treated mice, all different animals.

Design formula: ~ condition

DESeq2:

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","treated","control"))
res <- lfcShrink(dds, coef = "condition_treated_vs_control",
                 type = "apeglm")

edgeR:

design <- model.matrix(~ condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontreated")

Paired design

The same subject is measured under two conditions. Including the subject as a blocking factor absorbs inter-subject variability, so the condition test is based on within-subject differences.

Example: 4 patients, each contributing a tumor and a normal sample.

Design formula: ~ patient + condition

DESeq2:

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ patient + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","tumor","normal"))
res <- lfcShrink(dds, coef = "condition_tumor_vs_normal",
                 type = "apeglm")

edgeR:

design <- model.matrix(~ patient + condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontumor")

Note: Without the pairing term, inter-patient variability inflates the error estimate and reduces power. In strongly paired data, this can be the difference between finding 50 vs 500 DE genes.

Batch correction (blocking)

Batch effects are systematic technical variation. If batch is not confounded with condition, include it as a blocking factor.

Example: Samples processed in 2 batches, with conditions represented in both.

Design formula: ~ batch + condition

DESeq2:

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","treat","ctrl"))

edgeR:

design <- model.matrix(~ batch + condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontreat")

Note: If batch and condition are perfectly confounded (all controls in batch 1, all treated in batch 2), no method can separate the effects. This is a design failure.
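A quick cross-tabulation of batch against condition catches this before any model is fit:

```r
table(coldata$batch, coldata$condition)
# A diagonal-only table (e.g. all ctrl in batch1, all treat in batch2)
# means batch and condition are confounded and ~ batch + condition
# cannot be fit; fix the design, not the model.
```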

Interaction (treatment × genotype)

Tests whether the treatment effect depends on genotype. The interaction term captures this difference.

Example: 2 genotypes (WT, KO) × 2 treatments (ctrl, treat), with replicates.

Design formula: ~ genotype * treatment

Interpreting interactions: the treatment coefficient is the treatment effect in the reference genotype (WT); the interaction term is the difference in treatment effect between KO and WT; the treatment effect in KO is treatment + interaction. A significant interaction does not mean "no effect": it means the effect differs between genotypes.

DESeq2:

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ genotype * treatment)
dds <- DESeq(dds)

# Interaction: does treatment effect differ by genotype?
res_int <- results(dds, name = "genotypeKO.treatmenttreat")

# Treatment effect in WT (reference genotype):
res_wt  <- results(dds, contrast = c("treatment","treat","ctrl"))

edgeR:

design <- model.matrix(~ genotype * treatment, data = coldata)
# makeContrasts() below cannot parse ":" in column names, so sanitize them
# once up front ("genotypeKO:treatmenttreat" becomes "genotypeKO.treatmenttreat")
colnames(design) <- make.names(colnames(design))
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Interaction term:
qlf_int <- glmQLFTest(fit, coef = "genotypeKO.treatmenttreat")

# Treatment effect in WT:
qlf_wt  <- glmQLFTest(fit, coef = "treatmenttreat")

# Treatment effect in KO (treatment + interaction):
con <- makeContrasts(
  treatmenttreat + genotypeKO.treatmenttreat,
  levels = design)
qlf_ko  <- glmQLFTest(fit, contrast = con)

Time-course (omnibus test)

With ≥ 3 time points, test whether any change occurs across time. This is an omnibus test: it identifies genes with any temporal pattern, not the direction.

Example: Samples at 0h, 6h, 12h, 24h with 3 replicates each.

Design formula: ~ time, with time encoded as a factor (tested via LRT / omnibus QL)

DESeq2:

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ time)

# LRT: full model vs intercept-only (any time effect?)
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)
res <- results(dds)

edgeR:

design <- model.matrix(~ time, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Test all time coefficients simultaneously (ANOVA-like)
qlf <- glmQLFTest(fit, coef = 2:ncol(design))

Note: Omnibus significance tells you some difference exists across time, not which time points differ or in which direction. Follow up with planned contrasts or clustering to interpret temporal patterns.

Continuous covariates (age, RIN)

Continuous variables can be included directly in the formula to adjust for technical or biological confounders.

Example: Adjusting for age and RNA integrity (RIN) while testing treatment.

Design formula: ~ age + RIN + condition

DESeq2:

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ age + RIN + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","treat","ctrl"))

edgeR:

design <- model.matrix(~ age + RIN + condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontreat")

On covariates and formula order: Include covariates to reduce confounding and improve precision. In GLMs, coefficient estimates do not depend on term order (unlike Type I sums of squares in ANOVA). What matters is reference levels and contrasts β€” be explicit about them, as those choices affect interpretation. Only include covariates that are biologically or technically justified; irrelevant covariates waste degrees of freedom.
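Because coefficient names and interpretations follow from the reference level, it is worth setting it explicitly rather than relying on alphabetical order (a sketch, assuming a coldata data frame as in the examples above):

```r
# Make "ctrl" the reference so coefficients read as treat-vs-ctrl
coldata$condition <- relevel(factor(coldata$condition), ref = "ctrl")

# Verify which coefficients the design actually contains
design <- model.matrix(~ age + RIN + condition, data = coldata)
colnames(design)   # for edgeR: the names glmQLFTest(coef = ...) expects
# For DESeq2, the analogous check after DESeq() is resultsNames(dds)
```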


9. QC and Diagnostics

No results section is complete without these checks.

Pre-alignment

  • Library sizes across samples: flag outlier libraries. Large imbalances may indicate failed preps.
  • Mapping rates and duplication: low mapping or high duplication can indicate degraded RNA or contamination.

Post-normalization

  • PCA / sample distance heatmap: check for batch effects, outliers, and whether conditions separate. The single most informative plot.
  • MA plot of normalized counts: should show balanced fold changes around zero. Asymmetry suggests normalization failure.

Model diagnostics

  • Dispersion plot (mean vs dispersion): gene-wise estimates should scatter around the fitted trend. Wild outliers or flat trends indicate problems.
  • Check for confounded batch + condition: inspect the design matrix. If any factor is perfectly correlated with the variable of interest, the model is unidentifiable.

Results-level

  • P-value histogram: should be uniform with a spike near 0 (true positives). U-shapes or spikes at 1 indicate model misspecification.
  • Volcano / MA plot of results: sanity-check effect sizes and significance patterns. Inspect the top hits manually against raw counts.
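Most of these checks take a few lines in either package. A DESeq2-flavored sketch, assuming a fitted dds as in the pipelines above:

```r
# Post-normalization: sample-level structure
vsd <- vst(dds, blind = TRUE)           # variance-stabilizing transform
plotPCA(vsd, intgroup = "condition")    # batch effects, outliers, separation

# Model diagnostics
plotDispEsts(dds)                       # gene-wise dispersions vs fitted trend

# Results-level
res <- results(dds)
hist(res$pvalue, breaks = 50)           # uniform plus a spike near 0 is healthy
plotMA(res)                             # effect size vs mean expression
```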

10. Common Failure Modes

  • n = 1 per group: cannot estimate biological variability; DE is not possible.
  • Batch ≡ condition: perfectly confounded; no method can separate the effects.
  • Unmodeled pairing: treating paired samples as independent inflates error and kills power.
  • Overfitting covariates: adding irrelevant covariates wastes degrees of freedom with small n.
  • No pre-filtering: inflates the multiple-testing burden and distorts FDR.
  • Intersecting two tools: not validation; both use NB GLMs. It reduces sensitivity in exchange for false comfort.

Rules of Thumb

  1. If unsure and doing standard bulk RNA-seq, DESeq2 is the lowest-risk default.
  2. If sample sizes are small or you need conservative inference, edgeR + QL is a strong choice.
  3. Do not intersect results from both packages and call it validation. Use diagnostics and validate the biology.
  4. Spend more time on design, contrasts, and QC than on package selection. The formula you write matters more than the function you call.





Covers bulk RNA-seq and pseudo-bulk DE as of current best practices. Consult DESeq2 and edgeR vignettes for the latest recommendations.
