After years of working with single-cell data, I’ve accumulated some tricks that have significantly improved my workflow. Here are five Scanpy tips that I wish someone had told me when I started.
1. Use Backed Mode for Large Datasets
When working with datasets that don’t fit in memory:
adata = sc.read_h5ad('large_dataset.h5ad', backed='r')
This loads the data lazily, only reading what you need.
2. Cache Your Preprocessing
Save intermediate results to avoid recomputing:
import os

cache_file = 'preprocessed.h5ad'
if os.path.exists(cache_file):
    adata = sc.read_h5ad(cache_file)
else:
    # preprocessing steps
    sc.pp.normalize_total(adata)
    sc.pp.log1p(adata)
    adata.write(cache_file)
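One refinement I like (a pure-Python sketch; the parameter names in the dict are just illustrative): derive the cache filename from a hash of the preprocessing parameters, so changing any parameter automatically invalidates the old cache instead of silently reusing it:

```python
import hashlib
import json

def cache_path(params, prefix='preprocessed'):
    """Build a cache filename that changes whenever params change."""
    digest = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:12]
    return f'{prefix}_{digest}.h5ad'

# Changing any parameter yields a new filename, so a stale cache
# from an earlier parameter choice is never picked up by mistake.
print(cache_path({'target_sum': 1e4, 'n_top_genes': 2000}))
```

Then use `cache_path(params)` in place of the hard-coded `cache_file` above.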
3. Parallel Processing with Scanpy
A common gotcha: sc.pp.neighbors does not accept an n_jobs argument. Instead, set the global default, and pass n_jobs directly only to functions whose signatures support it, such as sc.pp.regress_out:
sc.settings.n_jobs = 8
sc.pp.regress_out(adata, ['total_counts'], n_jobs=8)
4. Better UMAP Reproducibility
Set the random state everywhere:
sc.tl.umap(adata, random_state=42)
sc.tl.leiden(adata, random_state=42)
One caveat: fixing random_state forces umap-learn to run single-threaded, so the embedding is reproducible but slower to compute.
5. Memory-Efficient Gene Selection
Instead of keeping all genes in memory:
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()
The .copy() matters: without it, the subset is only a lazy view that still holds a reference to the full matrix.
What are your favorite Scanpy tips? Share them in the comments below!