How to calculate the average gene expression within each cluster?

Hello everyone.

I’m a new in Scanpy and impressed by its speed and user-friendly. Since I used to be a big fan of Seurat, the most popular R package for snRNA-seq analysis, I don’t know how to do some operations I often do in Seurat with Scanpy. The next is an example.

In Seurat, I could get the average gene expression of each cluster easily by the code showed in the picture.

I think Scanpy can do the same thing as well, but I don’t know how to do right now. I would be grateful if anyone could offer me some help.


If you just want the average expression within each cluster, you could do:

import pandas as pd                                                                                                                                                                         
import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()

res = pd.DataFrame(columns=adata.var_names, index=adata.obs['bulk_labels'].cat.categories)                                                                                                 

for clust in 
    res.loc[clust] = adata[adata.obs['bulk_labels'].isin([clust]),:].X.mean(0)

The average expression values are then stored in res. Just replace 'bulk_labels' with your cluster label.

If you want the average z-scores of the expression values you can use sc.pp.scale(adata) before running this.

1 Like

Got it! Thank you for your help!

Hi @LuckyMD, how would you get the number of cells that have expression for a gene per cluster, rather than the average?

Hi @sgmccalla,

You should be able to do that via:
(adata[adata.obs['bulk_labels'].isin([clust]),:][:,'GeneA'].X > 0).mean(0)

However the dotplot already gives you that value. So you if you want to be sure you could check the code for that plotting function.

Hi @LuckyMD,

Thank you for posting, this is very helpful. :slight_smile:
Is it possible to calculate p value instead of z score for the average expression of genes within a cluster?

Hey @Edd

You can get p-values with the tl.rank_genes_groups function. Docs on that are here:

e.g.:, 'cluster_label', method='t-test')
pval_table = pd.DataFrame(
            {group + '_' + key[:2]: result[key][group]
            for group in groups for key in ['names', 'pvals_adj']})

Hi @Edd,
A p-value necessitates a hypothesis test. What hypothesis would you be interested in testing? the rank_genes_groups test that @alexcwsmith pointed out would test whether the gene is expressed higher in a particular cluster than in all other cells. If you want to test a different hypothesis, you’ll need a different test.

Hi @LuckyMD & @alexcwsmith,

Thank you both for getting back to me ! :slight_smile:
I am familiar with retrieving logfoldchanges and pvals from
I may be trying to do something different, that may not be possible.

I was wondering is it possible to test for average gene expression for genes within a cluster with an associated p value. Or is this only possible with logfoldchange?

Thanks again for all your time and attention.