Highly variable genes - best practice?

Can someone clarify what current best practice is regarding highly variable genes? Specifically, should I be filtering my dataset by highly variable genes prior to UMAP etc.?


You can find our current best practices in the recent publication here. Typically, the pre-processing steps in an analysis workflow would be:

  1. Cell & Gene QC
  2. Normalization
  3. Batch correction (or data integration)
  4. HVG selection
  5. Dimensionality reduction (including visualization)

Depending on the method, HVG selection is sometimes needed before data integration as well. Ideally, though, you’d do it afterwards, as you would then avoid selecting genes that are highly variable only because of the batch effect.
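As a rough illustration of that ordering, here is a minimal scanpy sketch. The function name `preprocess`, the QC thresholds, and `n_top_genes=2000` are assumptions made for the example, not recommendations from the publication, and the integration step is left as a placeholder because it depends entirely on the method you pick.

```python
def preprocess(adata, batch_key=None, n_top_genes=2000):
    """Run the steps above, in order, on an AnnData object of raw counts (in place)."""
    # Imported lazily so the sketch can be loaded without scanpy installed.
    import scanpy as sc

    # 1. Cell & gene QC (thresholds are illustrative assumptions)
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    # 2. Normalization: per-cell depth scaling followed by log1p
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # 3. Batch correction / data integration would go here (method-dependent, omitted).

    # 4. HVG selection; this flags genes in adata.var["highly_variable"]
    #    rather than dropping the other genes from the object.
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, batch_key=batch_key)

    # 5. Dimensionality reduction and visualization; PCA picks up the HVG flag
    #    from adata.var when it is present.
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)
    return adata
```

Because the genes are only flagged, the full expression matrix stays available for later steps such as differential expression.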

What are your thoughts on finding HVGs in each batch separately, then keeping those that are common to at least “x” batches?

That was done here: https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/scanpy/scanpy_03_integration.html

Hi @jayypaul,

That’s more or less what we do in scanpy with the batch_key parameter of sc.pp.highly_variable_genes. We also have this implemented in the single-cell integration benchmarking preprocessing functions here: https://github.com/theislab/scib/blob/master/scIB/preprocessing.py#L275
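To make the idea concrete, here is a small library-free sketch of the "HVG in at least x batches" logic. It is not scanpy's or scIB's actual implementation: ranking by plain variance is a simplifying assumption (scanpy uses normalized dispersion or similar, depending on the flavor), and the function names and toy data are made up for illustration.

```python
from collections import Counter
from statistics import pvariance


def top_variable_genes(expr, n_top):
    """Return the n_top genes with the largest variance within one batch.

    expr maps gene name -> list of expression values for the cells of that batch.
    """
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return set(ranked[:n_top])


def hvgs_shared_across_batches(batches, n_top, min_batches):
    """Keep genes that are among the n_top most variable in at least min_batches batches."""
    votes = Counter()
    for expr in batches.values():
        votes.update(top_variable_genes(expr, n_top))
    return sorted(gene for gene, n in votes.items() if n >= min_batches)


# Toy data: three batches, three genes, three cells each (hypothetical values).
batches = {
    "batch1": {"geneA": [0, 5, 10], "geneB": [1, 1, 1], "geneC": [0, 2, 4]},
    "batch2": {"geneA": [1, 6, 11], "geneB": [0, 8, 16], "geneC": [3, 3, 3]},
    "batch3": {"geneA": [2, 2, 9], "geneB": [4, 4, 5], "geneC": [0, 3, 6]},
}

shared = hvgs_shared_across_batches(batches, n_top=2, min_batches=2)
print(shared)  # ['geneA', 'geneC']
```

In this toy data geneB is highly variable only in batch2, so it is dropped, which is the behaviour you want for genes driven by a single batch. In scanpy, calling sc.pp.highly_variable_genes with batch_key set stores a comparable per-gene count in adata.var["highly_variable_nbatches"], so you can apply your own “x” threshold to that column.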