For the impatient: what should I choose?
DESeq2 and edgeR are the two most widely used tools for differential expression analysis of RNA-seq count data. Both use Negative Binomial generalized linear models, and both are mature, well-validated, and actively maintained.
They often agree on strong signals, but they can diverge on borderline genes, especially with small sample sizes, strong composition effects, or outliers, because dispersion estimation and hypothesis testing differ. This guide covers both the statistical machinery and the experimental design specification, which is where most real-world errors originate.
1. The Shared Statistical Foundation
For each gene g in each sample i, DESeq2 and edgeR treat the read count as:
- a count value that varies across samples, and
- more variable than a Poisson model would allow (because biology adds extra noise).
So they both use a negative binomial model:
- μgi (mu) = the expected/average count for that gene in that sample
- αg (alpha) = the extra variability for that gene (dispersion)
They then say the expected count is determined by two things:
- How deep that sample was sequenced, captured by the size factor si (normalization)
- What condition/batch/covariates the sample belongs to, captured by the design matrix xi and the gene's coefficients βg
βg are basically the log fold changes for that gene under your model.
Bottom line: both packages use the same core idea:
\[K_{gi} \sim \mathrm{NB}(\mu_{gi},\ \alpha_g), \qquad \mu_{gi} = s_i \, q_{gi}, \qquad \log_2 q_{gi} = x_i \beta_g\]
In words: counts scatter, with gene-specific dispersion αg, around an expected value set by the sample's sequencing depth (si) and the condition/batch effects (xi βg). That's why the "DESeq2 vs edgeR" differences are mostly about implementation details, especially:
- how they estimate dispersion (αg),
- how they test for differences in βg (differential expression),
- and how they shrink/report fold changes and p-values.
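To see why a plain Poisson model is too rigid, recall the negative binomial variance is μ + αμ², while Poisson variance is just μ. A quick base-R simulation (rnbinom parameterizes size = 1/α) makes the extra-Poisson spread concrete:

```r
# Negative binomial variance: Var = mu + alpha * mu^2 (Poisson would give just mu)
set.seed(1)
mu    <- 100   # expected count for this gene
alpha <- 0.2   # gene-wise dispersion

x <- rnbinom(2e5, mu = mu, size = 1 / alpha)

mean(x)   # close to 100
var(x)    # close to 100 + 0.2 * 100^2 = 2100, far above Poisson's 100
```

The biological replicates in a real experiment play the role of these simulated draws: their spread around the mean is what the dispersion parameter has to capture.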
2. Normalization
Both methods assume that most genes are not differentially expressed. Both can struggle under global transcriptome shifts (e.g., transcriptome-wide up-regulation) unless external controls are introduced.
| Feature | DESeq2 | edgeR |
|---|---|---|
| Normalization | Median-of-ratios (size factors) | TMM (trimmed mean of M-values) |
| Dispersion | Trend + empirical Bayes shrinkage | Trend + shrinkage + quasi-likelihood layer |
| Recommended test | Wald (pairwise) / LRT (omnibus) | Quasi-likelihood F-test |
| LFC shrinkage | Central feature (apeglm / ashr) | Available, less central |
| Filtering | Automatic independent filtering | User-driven: filterByExpr() |
| Outlier handling | Cook's distance (needs adequate n) | Robust QL estimation |
| UX philosophy | Safe defaults, wrapped steps | Modular, explicit, flexible |
When assumptions break: If you suspect global shifts, consider spike-in normalization, a set of empirically stable reference genes, or at minimum examine MA plots of normalized counts for asymmetry. These are escape hatches, not defaults.
In routine bulk RNA-seq, normalization is rarely the deciding factor between these packages. Design and dispersion estimation matter more.
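For intuition, DESeq2's median-of-ratios size factors can be sketched in a few lines of base R. This is a simplified version: the real estimateSizeFactors() handles edge cases (all-zero genes, alternative estimators) more carefully.

```r
# Median-of-ratios (DESeq2-style) size factors, simplified sketch.
# Reference = per-gene geometric mean across samples; each sample's size
# factor is the median ratio of its counts to that reference.
median_of_ratios <- function(counts) {
  log_geo_means <- rowMeans(log(counts))        # -Inf for genes with any zero
  apply(counts, 2, function(col) {
    use <- is.finite(log_geo_means) & col > 0   # drop genes unusable for ratios
    exp(median(log(col[use]) - log_geo_means[use]))
  })
}

# Toy check: sample 2 is the same library sequenced twice as deep
counts <- matrix(c(10, 20, 30,
                   20, 40, 60), ncol = 2)
median_of_ratios(counts)   # the two size factors differ by a factor of 2
```

Because the median ratio is taken over all genes, a handful of strongly differential genes cannot drag the size factor around; this is where the "most genes are not DE" assumption enters.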
3. Dispersion Estimation
Estimating per-gene dispersion from small sample sizes is inherently noisy. Both packages borrow information across genes (empirical Bayes shrinkage), but they do so differently.
DESeq2 fits a mean-dispersion trend and shrinks each gene's estimate toward it. edgeR's QL pipeline does the same, but adds a quasi-likelihood variance layer on top: an additional gene-specific parameter that provides better control of false positive rates, particularly with small n or heterogeneous variance.
Key distinction: edgeR's QL framework is often more conservative and better calibrated for Type I error control in challenging settings. DESeq2's dispersion estimation is stable and well-validated but does not include this additional QL layer.
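The idea of borrowing strength can be caricatured as a weighted average on the log scale. This is a conceptual sketch only, not either package's actual estimator: in the real methods the weight comes from an empirical Bayes prior and the number of replicates.

```r
# Caricature of dispersion shrinkage: pull a noisy gene-wise estimate toward
# the fitted mean-dispersion trend on the log scale. NOT the real DESeq2 or
# edgeR estimator; their weights are derived from data, not fixed.
shrink_dispersion <- function(genewise, trend, weight = 0.7) {
  exp(weight * log(genewise) + (1 - weight) * log(trend))
}

shrink_dispersion(0.4, 0.1, weight = 0.5)   # 0.2: geometric mean of estimate and trend
```

With more replicates the gene-wise estimate is trusted more (weight rises toward 1); with few replicates the trend dominates.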
4. Hypothesis Testing
Each package offers multiple testing procedures. Comparing results across packages without accounting for the test used is a common source of confusion.
DESeq2
- Wald test (default): tests whether a single coefficient ≠ 0. Fast, standard for pairwise comparisons.
- LRT: full vs reduced model. For omnibus questions (any effect across ≥ 3 levels).
edgeR
- QL F-test (recommended): robust error control, especially with small n.
- LRT options exist but QL is the standard recommendation.
Warning: Comparing DESeq2 Wald test results to edgeR QL F-test results is comparing different inferential procedures, not "packages." Be explicit about which test you used in your methods.
5. Log Fold-Change Shrinkage
DESeq2 provides a polished workflow for shrinking noisy log2FC estimates toward zero, preventing low-count genes from producing extreme fold changes that dominate visualizations and rankings. edgeR can moderate estimates through its modeling framework, but LFC shrinkage is not as prominently integrated into the default workflow.
Common misconception: Shrinkage primarily affects effect size estimates, ranking, and visualization. Your p-values come from the model fit; shrinkage is typically applied post hoc for interpretability, not for significance testing.
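The mechanics can be illustrated with the simplest possible shrinkage estimator: a posterior mean under a normal prior centered at zero. This is a pedagogical stand-in; apeglm and ashr use adaptive, heavier-tailed priors that shrink large well-supported effects much less.

```r
# Toy LFC shrinkage: posterior mean under a Normal(0, prior_sd^2) prior.
# Pedagogical only; apeglm/ashr fit the prior to the data instead.
shrink_lfc <- function(lfc, se, prior_sd = 1) {
  lfc * prior_sd^2 / (prior_sd^2 + se^2)
}

shrink_lfc(lfc = 4, se = 2)     # 0.8   : noisy low-count gene, shrunk hard
shrink_lfc(lfc = 4, se = 0.1)   # ~3.96 : well-estimated gene, barely moved
```

The pattern to notice: shrinkage is driven by the standard error, so precisely the low-count genes that produce absurd raw fold changes are the ones pulled toward zero.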
6. Filtering and Outliers
Genes with near-zero counts across samples cannot support reliable dispersion or fold-change estimation and inflate the multiple-testing burden. Filtering them is essential, not optional.
DESeq2
Independent filtering: automatically selects a threshold to maximize discoveries at the target FDR. Cook's distance flags outlier samples per gene, but only works reliably with adequate replicates. With very small n, treat outliers as a QC/design issue.
edgeR
User-driven pre-filtering: filterByExpr() is preferred over hand-rolled CPM thresholds because it accounts for group structure. Outlier robustness comes from the QL framework.
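For intuition only, a bare-bones CPM filter looks like the sketch below. Prefer filterByExpr() in practice: it additionally adapts the count threshold to the library sizes rather than using a fixed CPM cutoff.

```r
# Simplified expression filter: keep genes with >= min_cpm in at least as many
# samples as the smallest group. filterByExpr() is the real, smarter version.
keep_expressed <- function(counts, group, min_cpm = 1) {
  cpm <- t(t(counts) / colSums(counts)) * 1e6   # counts per million per sample
  rowSums(cpm >= min_cpm) >= min(table(group))  # must pass in >= smallest group
}

counts <- rbind(expressed = c(100, 120, 90, 110),
                silent    = c(0,   0,   0,  1))
keep_expressed(counts, group = c("ctrl", "ctrl", "treat", "treat"))
```

Requiring expression in at least the smallest group's worth of samples keeps genes that are switched on in only one condition, which a naive "expressed in all samples" rule would discard.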
7. Recommended Default Pipelines
For a standard pairwise comparison, the unpaired pipelines in the next section are the recommended defaults: DESeq2 with the Wald test plus apeglm shrinkage, or edgeR with filterByExpr() plus the QL F-test.
8. Experimental Design Specification
This is where most real-world errors originate: not in the choice of package, but in the choice of model formula.
Unpaired (independent samples)
Samples in each group are independent biological replicates. No pairing, blocking, or additional covariates.
Example: 3 control vs 3 treated mice, all different animals.
Design formula: ~ condition
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","treated","control"))
res <- lfcShrink(dds, coef = "condition_treated_vs_control",
type = "apeglm")
edgeR:
design <- model.matrix(~ condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontreated")
Paired design
The same subject is measured under two conditions. Including the subject as a blocking factor absorbs inter-subject variability, so the condition test is based on within-subject differences.
Example: 4 patients, each contributing a tumor and a normal sample.
Design formula: ~ patient + condition
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ patient + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","tumor","normal"))
res <- lfcShrink(dds, coef = "condition_tumor_vs_normal",
type = "apeglm")
edgeR:
design <- model.matrix(~ patient + condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontumor")
Note: Without the pairing term, inter-patient variability inflates the error estimate and reduces power. In strongly paired data, this can be the difference between finding 50 vs 500 DE genes.
Batch correction (blocking)
Batch effects are systematic technical variation. If batch is not confounded with condition, include it as a blocking factor.
Example: Samples processed in 2 batches, with conditions represented in both.
Design formula: ~ batch + condition
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","treat","ctrl"))
edgeR:
design <- model.matrix(~ batch + condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontreat")
Note: If batch and condition are perfectly confounded (all controls in batch 1, all treated in batch 2), no method can separate the effects. This is a design failure.
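You can catch this before fitting anything: a fully confounded design matrix is rank-deficient. A quick base-R check, using a deliberately confounded toy layout:

```r
# All controls in batch 1, all treated in batch 2: batch and condition collide
coldata <- data.frame(batch     = factor(c(1, 1, 2, 2)),
                      condition = factor(c("ctrl", "ctrl", "treat", "treat")))
design <- model.matrix(~ batch + condition, data = coldata)

qr(design)$rank < ncol(design)   # TRUE: rank-deficient, effects inseparable
```

Both packages will error out on such a design; running this check on your own coldata before fitting makes the failure obvious at the planning stage.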
Interaction (treatment × genotype)
Tests whether the treatment effect depends on genotype. The interaction term captures this difference.
Example: 2 genotypes (WT, KO) × 2 treatments (ctrl, treat), with replicates.
Design formula: ~ genotype * treatment
Interpreting interactions:
- treatment coefficient = treatment effect in the reference genotype (WT).
- interaction = difference in treatment effect between KO and WT.
- Treatment effect in KO = treatment + interaction.
A significant interaction does not mean "no effect"; it means the effect differs between genotypes.
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ genotype * treatment)
dds <- DESeq(dds)
# Interaction: does treatment effect differ by genotype?
res_int <- results(dds, name = "genotypeKO.treatmenttreat")
# Treatment effect in WT (reference genotype):
res_wt <- results(dds, contrast = c("treatment","treat","ctrl"))
edgeR:
design <- model.matrix(~ genotype * treatment, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
# Interaction term:
qlf_int <- glmQLFTest(fit, coef = "genotypeKO:treatmenttreat")
# Treatment effect in WT:
qlf_wt <- glmQLFTest(fit, coef = "treatmenttreat")
# Treatment effect in KO (treatment + interaction):
# makeContrasts() requires syntactically valid names, which the ":" in
# interaction column names is not, so use a numeric contrast over the
# design columns: (Intercept), genotypeKO, treatmenttreat, genotypeKO:treatmenttreat
qlf_ko <- glmQLFTest(fit, contrast = c(0, 0, 1, 1))
Time-course (omnibus test)
With ≥ 3 time points, test whether any change occurs across time. This is an omnibus test: it identifies genes with any temporal pattern, not the direction.
Example: Samples at 0h, 6h, 12h, 24h with 3 replicates each.
Design formula: ~ time (tested via LRT / omnibus QL)
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ time)
# LRT: full model vs intercept-only (any time effect?)
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)
res <- results(dds)
edgeR:
design <- model.matrix(~ time, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
# Test all time coefficients simultaneously (ANOVA-like)
qlf <- glmQLFTest(fit, coef = 2:ncol(design))
Note: Omnibus significance tells you some difference exists across time, not which time points differ or in which direction. Follow up with planned contrasts or clustering to interpret temporal patterns.
Continuous covariates (age, RIN)
Continuous variables can be included directly in the formula to adjust for technical or biological confounders.
Example: Adjusting for age and RNA integrity (RIN) while testing treatment.
Design formula: ~ age + RIN + condition
DESeq2:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ age + RIN + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition","treat","ctrl"))
edgeR:
design <- model.matrix(~ age + RIN + condition, data = coldata)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
keep <- filterByExpr(y, design = design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "conditiontreat")
On covariates and formula order: Include covariates to reduce confounding and improve precision. In GLMs, coefficient estimates do not depend on term order (unlike Type I sums of squares in ANOVA). What matters is reference levels and contrasts β be explicit about them, as those choices affect interpretation. Only include covariates that are biologically or technically justified; irrelevant covariates waste degrees of freedom.
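The order-invariance claim is easy to verify with an ordinary linear model in base R; the same property holds for the GLM fits in both packages.

```r
# Coefficient estimates do not depend on the order of terms in the formula
set.seed(1)
d <- data.frame(a = rnorm(20), b = rnorm(20))
d$y <- 1 + 2 * d$a - d$b + rnorm(20)

f1 <- coef(lm(y ~ a + b, data = d))
f2 <- coef(lm(y ~ b + a, data = d))
all.equal(f1[c("a", "b")], f2[c("a", "b")])   # TRUE
```

What does change with modeling choices is the meaning of each coefficient, which is set by factor reference levels and the contrasts you request.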
9. QC and Diagnostics
No results section is complete without these checks.
Pre-alignment
- Library sizes across samples: flag outlier libraries. Large imbalances may indicate failed preps.
- Mapping rates & duplication: low mapping or high duplication can indicate degraded RNA or contamination.
Post-normalization
- PCA / sample distance heatmap: check for batch effects, outliers, and whether conditions separate. The single most informative plot.
- MA plot of normalized counts: should show balanced fold changes around zero. Asymmetry suggests normalization failure.
Model diagnostics
- Dispersion plot (mean vs dispersion): gene-wise estimates should scatter around the fitted trend. Wild outliers or flat trends indicate problems.
- Check for confounded batch + condition: inspect the design matrix. If any factor is perfectly correlated with the variable of interest, the model is unidentifiable.
Results-level
- P-value histogram: should be uniform with a spike near 0 (true positives). U-shapes or spikes at 1 indicate model misspecification.
- Volcano / MA plot of results: sanity-check effect sizes and significance patterns. Inspect the top hits manually against raw counts.
10. Common Failure Modes
| Issue | Problem |
|---|---|
| n = 1 per group | Cannot estimate biological variability. DE is not possible. |
| Batch ≡ condition | Perfectly confounded. No method can separate the effects. |
| Unmodeled pairing | Treating paired samples as independent inflates error and kills power. |
| Overfitting covariates | Adding irrelevant covariates wastes degrees of freedom with small n. |
| No pre-filtering | Inflates multiple-testing burden and distorts FDR. |
| Intersecting two tools | Not validation: both use NB GLMs. Reduces sensitivity for false comfort. |
Rules of Thumb
- If unsure and doing standard bulk RNA-seq, DESeq2 is the lowest-risk default.
- If sample sizes are small or you need conservative inference, edgeR + QL is a strong choice.
- Do not intersect results from both packages and call it validation. Use diagnostics and validate the biology.
- Spend more time on design, contrasts, and QC than on package selection. The formula you write matters more than the function you call.
Covers bulk RNA-seq and pseudo-bulk DE as of current best practices. Consult DESeq2 and edgeR vignettes for the latest recommendations.