Create a metainfo for obs based on vars

Hi all,

I’m pretty new to scanpy and anndata objects.

I have counts for 3 species (human, mouse and dog). I determined the species for each gene based on the gene ID (which is the “var” I loaded):

adata.var["Species"] = adata.var_names.str[0:4]  # ENSG / ENSM / ENSC

Then I’d like to determine the species of each cell based on the percentage of its reads assigned to each species’ genes: a cell with > 70% of its reads assigned to one species would be labelled with that species.
My ugly solution for now is:

for c in adata.obs.index:
    for s in adata.var["Species"].unique():
        adata.obs.loc[c, s + "_count"] = adata[c, adata.var["Species"] == s].X.sum()
        adata.obs.loc[c, s + "_perc"] = adata[c, adata.var["Species"] == s].X.sum() / adata[c, :].X.sum() * 100
        if adata.obs.loc[c, s + "_perc"] > 70:
            adata.obs.loc[c, "Species"] = s[0:4]   # ENSG / ENSM / ENSC

I’ve tried different things, but I feel really far from the anndata philosophy and I don’t have a clear view of all this for now.
Can someone help me please?

Also I have side questions:

  • I’m not sure I understand the difference between “var” and “vars”? The former is some kind of index for the latter?
  • Is there an easy hands-on tutorial that you could recommend?

Cheers,
Mathieu

This seems mostly right to me. The one thing I would say is that this will be significantly faster if you don’t loop over the observations. I would probably write:

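import numpy as np
import pandas as pd
species_fracs = pd.DataFrame(index=adata.obs_names)
# Sum the counts of each cell over the genes of each species
for species, var_indices in adata.var.groupby("Species").indices.items():
    species_fracs[species] = np.asarray(adata[:, var_indices].X.sum(axis=1)).ravel()
# Divide by the total counts per cell to get per-species fractions
species_fracs = species_fracs.div(np.asarray(adata.X.sum(axis=1)).ravel(), axis=0)
adata.obs["Species"] = species_fracs.idxmax(axis=1)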
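
Note that idxmax assigns every cell to its top species, even when no species clearly dominates. To apply the > 70% cutoff from the question, one option is to keep the label only where the top fraction exceeds the threshold. Here is a minimal sketch on a small made-up species_fracs table; the values, the cell names and the "Unassigned" label are assumptions for illustration, not something from your data:

```python
import pandas as pd

# Hypothetical per-cell fractions of counts per species prefix (each row sums to 1).
species_fracs = pd.DataFrame(
    {
        "ENSG": [0.95, 0.10, 0.50],
        "ENSM": [0.03, 0.85, 0.45],
        "ENSC": [0.02, 0.05, 0.05],
    },
    index=["cell_1", "cell_2", "cell_3"],
)

threshold = 0.7  # cutoff from the question: > 70% of counts from one species

top_species = species_fracs.idxmax(axis=1)  # best-scoring species per cell
top_frac = species_fracs.max(axis=1)        # fraction of counts for that species

# Keep the label only where the top fraction exceeds the threshold;
# other cells get the placeholder label "Unassigned" (an arbitrary choice).
labels = top_species.where(top_frac > threshold, "Unassigned")
print(labels)
```

With a real object, the result could then be stored with something like adata.obs["Species"] = labels, since labels shares its index with adata.obs_names.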

For your other questions:

  • I’m not sure what you mean by vars. Could you elaborate?
  • @LuckyMD would your “Best Practices” notebooks be useful here?

I haven’t covered this in the analysis best practices as it’s not a standard analysis pipeline component.

Hi,

Thanks for your answer, I could easily adapt your code to my context!

Sorry about the vars, that’s something I came across two days ago but I can’t find it anymore…
Where can I find these analysis best practices? It would probably be great for me to take a look.

Cheers,
Mathieu

Hey @mbahin,

Check out www.github.com/theislab/single-cell-tutorial or the accompanying paper