Getting Cells for list of genes ~ covid

Hi,
I have 10X scRNA data from human + SARSCov2.
After alignment with the correct genome, and loading in scanpy, I can view the counts for viral genes.
I am interested in cells that have viral counts aka infected cells.
For retrieving cells with viral genes, I have a list of genes which belong to virus.

I need to make a new adata but with only infected cells.
How I retrieve the adata.obs only for certain adata.var_names?

Specific instructions will be super helpful !!!

So far i have this :
covid_genes = adata.var_names.str.startswith((‘ORF1a’,‘S’, ‘ORF3a’, ‘E’, ‘M’, ‘ORF6’, ‘ORF7a’, ‘ORF7b’, ‘ORF8’, ‘N’, ‘ORF10’))

print(sum(covid_genes))
I added the columns counting the percentage of viral read expression
adata.obs[‘covid’] = np.sum(
adata[:, covid_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1

One option could be to calculate a score for all the COVID genes, then create a new dataset with a threshold:

covid_genes = ['ORF1a','S', 'ORF3a', 'E', 'M', 'ORF6', 'ORF7a', 'ORF7b', 'ORF8', 'N', 'ORF10']
sc.tl.score_genes(adata, gene_list=covid_genes, score_name='COVID_score')

Here’s a hypothetical of what your scores may look like:

import seaborn as sns
sns.distplot(adata.obs['COVID_score'])

covid_score

At this point, you can create a new AnnData object with a threshold filter:

adata_covid = adata[adata.obs['COVID_score'] > 0.25]

There are many ways to go about it. Envious that you have an exciting dataset to explore! Good luck!

Hi @preetiaahir,

I assume you want the expression values only for those genes? In that case it’s not the adata.obs you want, but the adata.X. You can get that via:
adata[:,covid_genes].X.

However, the way you select your covid genes is currenly incorrect. There are a lot of genes that start with e.g., “S”. Thus, you will have many more genes than the viral genome in your list. You could instead just use a list as was suggested by @jpreall below.

If you want to use only infected cells (assuming no other cells have any ambient reads of the SARS-CoV-2 genes), then you should be able to do something like:

covid_genes = [‘ORF1a’,‘S’, ‘ORF3a’, ‘E’, ‘M’, ‘ORF6’, ‘ORF7a’, ‘ORF7b’, ‘ORF8’, ‘N’, ‘ORF10’]
adata_infected = adata[((adata[:,covid_genes].X > 0).mean(0) > 0),:]

Remember that you only have a view of the object when subsetting without using .copy() at the end.

Thank you for your response.
With above method i get the index error.
IndexError Traceback (most recent call last)
in
1 covid_genes = [‘ORF1ab’,‘S’, ‘ORF3a’, ‘M’, ‘ORF6’, ‘ORF7a’,‘ORF8’, ‘N’, ‘ORF10’]
----> 2 adata_infected = adata[((adata[:,covid_genes].X > 0).mean(0) > 0),:]

~/my_virtualenv/lib/python3.7/site-packages/anndata/_core/index.py in _normalize_index(indexer, index)
88 if indexer.shape != index.shape:
89 raise IndexError(
—> 90 f"Boolean index does not match AnnData’s shape along this "
91 f"dimension. Boolean index has shape {indexer.shape} while "
92 f"AnnData index has shape {index.shape}."

IndexError: Boolean index does not match AnnData’s shape along this dimension. Boolean index has shape (9,) while AnnData index has shape (6650,).

You may need to do sth like:

covid_genes = [‘ORF1ab’,‘S’, ‘ORF3a’, ‘M’, ‘ORF6’, ‘ORF7a’,‘ORF8’, ‘N’, ‘ORF10’]
adata_infected = adata[((adata[:,[g in covid_genes for g in adata.var_names]].X > 0).mean(0) > 0),:]