Question: average expression of normalized linear counts?


I would like to calculate the average gene expression per category (e.g. clusters).
My question is whether the correct way to do this is to average the normalized linear (not log-transformed) counts and then transform the result back to log scale. I would therefore like to know whether the following code (taken from the dotplot function) does that, or whether it calculates the mean directly from the log values:

obs = adata.raw[:,gene_ids].X.toarray()
obs = pd.DataFrame(obs, columns=gene_ids, index=adata.obs['louvain'])
average_obs = obs.groupby(level=0).mean()

Thanks a lot.

Hi @llumdi,

The code you posted just takes the mean, for the selected gene_ids, of whatever form of the data is stored in adata.raw.X. In Scanpy, you control what is stored there by freezing a version of your data via adata.raw = adata. You could do the same for the data stored in adata.X or adata.layers['counts'] (if you chose to store something in that layer) by running either of:

obs = adata[:,gene_ids].X.toarray()
obs = adata[:,gene_ids].layers['counts'].toarray()

instead of the obs assignment you posted (.toarray() assumes the matrix is stored as sparse). It is important to keep track of what you do to adata.X when you are running a Scanpy pipeline, because I cannot tell from your post what you decided to store in adata.X (or adata.raw.X, for that matter).
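Since the only thing that changes between these options is which matrix you pass in, the averaging step can be wrapped once and reused. The following is a minimal sketch, not Scanpy's own implementation: the helper names to_dense and group_means are made up for illustration, and the commented usage assumes an AnnData object called adata with a 'louvain' column in adata.obs and a 'counts' layer.

```python
import numpy as np
import pandas as pd

def to_dense(matrix):
    """Return a dense ndarray whether `matrix` is sparse or already dense."""
    # Sparse matrices expose .toarray(); plain ndarrays do not.
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

def group_means(matrix, gene_ids, groups):
    """Per-group mean of each gene; `groups` holds one label per cell."""
    index = pd.Index(groups, name="group")
    obs = pd.DataFrame(to_dense(matrix), columns=gene_ids, index=index)
    return obs.groupby(level=0).mean()

# Hypothetical usage (assumes `adata`, `gene_ids`, and the names above exist):
# sub = adata[:, gene_ids]
# group_means(sub.X, gene_ids, adata.obs["louvain"])
# group_means(sub.layers["counts"], gene_ids, adata.obs["louvain"])
# group_means(adata.raw[:, gene_ids].X, gene_ids, adata.obs["louvain"])
```

Whatever unit is stored in the matrix you pass (raw counts, normalized, log-normalized) is the unit the returned means are in.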

With regard to your other question: there is no standard that determines how the mean expression should be calculated (from log-normalized or just normalized values). We typically work with log-normalized gene expression. However, because the log is not a linear transformation, a function f() like the mean will give you different results depending on whether you compute f(log(Data)) or log(f(Data)). In most cases that I can recall, people pick one unit of expression to work with and then compute means or other functions on that unit. If you decide to work on log-normalized data, you would run mean(log(Data)).

I guess one might regard log-normalized expression as a more meaningful biological unit for expression data (given that differences are fold changes on this scale) and therefore work directly on this scale. I guess it would technically be better to take the mean of the linear normalized values and then logarithmize that mean, although I have not come across this in practice.
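To make the difference between the two orders concrete, here is a small sketch on toy numbers rather than a real AnnData object. It assumes the stored values are log1p-transformed, so the way back to the linear scale is expm1; the gene names, cluster labels, and the helpers mean_of_log and log_of_mean are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy normalized (linear) expression: 4 cells x 2 genes, hypothetical values.
linear = np.array([[0.0, 10.0],
                   [8.0, 10.0],
                   [1.0, 3.0],
                   [1.0, 5.0]])
clusters = pd.Index(["0", "0", "1", "1"], name="louvain")

# Assumption: this is what is stored, i.e. log1p of the normalized values.
obs = pd.DataFrame(np.log1p(linear), columns=["geneA", "geneB"], index=clusters)

def mean_of_log(df):
    """Average directly on the log scale: mean(log(Data))."""
    return df.groupby(level=0).mean()

def log_of_mean(df):
    """Undo log1p, average on the linear scale, then log again: log(mean(Data))."""
    return np.log1p(np.expm1(df).groupby(level=0).mean())

print(mean_of_log(obs))
print(log_of_mean(obs))
```

The two results agree only where all cells in a group have the same value (geneB in cluster "0" here). Everywhere else log_of_mean is larger, because the log is concave, and the gap grows with the spread of values within the group.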

Hi @LuckyMD,
Thanks a lot for your reply. This is very helpful.