Highly variable genes - best practice?

Can someone clarify what the current best practice is regarding highly variable genes? Specifically, should I filter my dataset down to the highly variable genes before running UMAP and other downstream steps?


You can find our current best practices in the recent publication here. Typically, the pre-processing steps in an analysis workflow would be:

  1. Cell & Gene QC
  2. Normalization
  3. Batch correction (or data integration)
  4. HVG selection
  5. Dimensionality reduction (including visualization)

Depending on the method, HVG selection is sometimes needed before data integration as well. Ideally, though, you would do it afterwards, since you then avoid selecting genes that are highly variable only because of the batch effect.
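To make the ordering concrete, here is a minimal NumPy sketch of steps 1, 2, 4 and 5 on a synthetic count matrix (step 3, batch correction, is left out). It is not scanpy's implementation: the function name `preprocess`, the thresholds, and the plain variance ranking are illustrative assumptions. In scanpy the corresponding calls would be along the lines of `sc.pp.filter_cells`, `sc.pp.filter_genes`, `sc.pp.normalize_total`, `sc.pp.log1p`, `sc.pp.highly_variable_genes` and `sc.pp.pca`.

```python
import numpy as np


def preprocess(counts, min_genes=3, min_cells=3, target_sum=1e4,
               n_top_genes=50, n_pcs=10):
    """Toy version of: QC -> normalization -> HVG selection -> PCA.

    counts: (cells x genes) array of raw counts.
    Returns (pcs, kept_cells, hvg_idx); indices refer to the input matrix.
    """
    counts = np.asarray(counts, dtype=float)

    # 1. Cell & gene QC: drop cells expressing few genes,
    #    then genes detected in few of the remaining cells.
    kept_cells = np.flatnonzero((counts > 0).sum(axis=1) >= min_genes)
    x = counts[kept_cells]
    kept_genes = np.flatnonzero((x > 0).sum(axis=0) >= min_cells)
    x = x[:, kept_genes]

    # 2. Normalization: scale each cell to the same total, then log1p.
    totals = x.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against empty cells
    x = np.log1p(x / totals * target_sum)

    # 3. Batch correction / integration would go here (omitted).

    # 4. HVG selection: rank genes by variance of the normalized values.
    #    (Assumption: plain variance, not scanpy's dispersion-based flavors.)
    n_top = min(n_top_genes, x.shape[1])
    order = np.argsort(x.var(axis=0))[::-1][:n_top]
    hvg_idx = kept_genes[order]
    x = x[:, order]

    # 5. Dimensionality reduction: PCA via SVD of the centered matrix.
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    n_pcs = min(n_pcs, s.size)
    pcs = u[:, :n_pcs] * s[:n_pcs]
    return pcs, kept_cells, hvg_idx


# Synthetic example: 200 cells x 300 genes of Poisson counts.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 300))
pcs, kept_cells, hvg_idx = preprocess(counts)
print(pcs.shape, hvg_idx[:5])
```

The point of the sketch is only the order of operations: the PCA input is restricted to the selected genes, while `kept_cells` and `hvg_idx` keep the mapping back to the full matrix.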

What are your thoughts on finding HVGs in each batch separately, then keeping those that are common to at least “x” batches?

That was done here: https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/scanpy/scanpy_03_integration.html

Hi @jayypaul,

That’s more or less what we do in scanpy with the batch_key parameter. And we also have this implemented in the single-cell integration benchmarking preprocessing functions here: https://github.com/theislab/scib/blob/master/scIB/preprocessing.py#L275
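For illustration, the per-batch idea can be sketched in plain NumPy as below. This is a simplified stand-in, not the scanpy or scIB code: the function name `hvgs_shared_across_batches`, its parameters, and the variance-based ranking are assumptions made for the example. In scanpy itself, passing `batch_key` to `sc.pp.highly_variable_genes` records the per-gene batch count in `adata.var['highly_variable_nbatches']`.

```python
import numpy as np


def hvgs_shared_across_batches(x, batches, n_top_genes=100, min_batches=2):
    """Select HVGs per batch, keep genes selected in >= min_batches batches.

    x: (cells x genes) log-normalized expression; batches: per-cell labels.
    Returns (sorted gene indices, number of batches selecting each gene).
    """
    x = np.asarray(x, dtype=float)
    batches = np.asarray(batches)
    n_batches_hv = np.zeros(x.shape[1], dtype=int)

    for b in np.unique(batches):
        xb = x[batches == b]
        # Rank genes by variance within this batch only, so shifts
        # between batches cannot make a gene look variable.
        top = np.argsort(xb.var(axis=0))[::-1][:n_top_genes]
        n_batches_hv[top] += 1

    return np.flatnonzero(n_batches_hv >= min_batches), n_batches_hv


# Synthetic example: 3 batches, 60 cells each, 200 genes.
rng = np.random.default_rng(0)
n_cells, n_genes = 60, 200
batches = np.repeat(["a", "b", "c"], n_cells)
x = rng.normal(0.0, 0.1, size=(3 * n_cells, n_genes))
# Genes 0-9: on/off signal present within every batch.
x[:, :10] += rng.choice([0.0, 3.0], size=(3 * n_cells, 10))
# Genes 10-19: constant within a batch, shifted between batches.
x[:, 10:20] += np.repeat([0.0, 2.0, 4.0], n_cells)[:, None]

shared, n_batches_hv = hvgs_shared_across_batches(
    x, batches, n_top_genes=10, min_batches=3
)
# For comparison: top genes by variance when batches are pooled.
pooled = np.sort(np.argsort(x.var(axis=0))[::-1][:10])
print("per-batch, shared:", shared)
print("pooled:", pooled)
```

In this toy data the pooled ranking is dominated by the batch-shifted genes, whereas the per-batch ranking recovers the genes that vary within every batch. Raising `min_batches` makes the selection stricter; setting it to 1 gives the union over batches.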


I am relatively new to scRNA-seq analysis and scanpy. I just found this thread, and I have a similar question: should I filter my data by selecting the highly variable genes before any further clustering?

My current project is trying to identify genes that can be used as markers for a specific anatomical structure. I’ve been using scanpy to analyze a mouse forelimb dataset (p.s. thank you so much for establishing scanpy! It is amazingly useful, and the scanpy forum is very informative and helpful for newcomers like me :)), and I noticed that if I do the highly_variable_genes feature selection, some of our candidate genes are dropped.

I’ve been digging into the literature on this, and some papers mention that selecting highly variable genes may exclude genes that are useful for identifying rare cell populations.

So I would like to know: must I perform highly_variable_genes feature selection? If not, will the subsequent results (any candidate genes we find) still be considered valid? (I will of course perform other experiments to confirm, but I would like to know whether this information is solid from a bioinformatics point of view too!)

Sorry for the long question!

Bumping this thread to see if anyone has thoughts on @irislee1106's question; I have the same issue.