Getting Cells for list of genes ~ covid

I have 10X scRNA data from human + SARSCov2.
After alignment with the correct genome, and loading in scanpy, I can view the counts for viral genes.
I am interested in cells that have viral counts aka infected cells.
For retrieving cells with viral genes, I have a list of genes which belong to virus.

I need to make a new adata but with only infected cells.
How I retrieve the adata.obs only for certain adata.var_names?

Specific instructions will be super helpful !!!

So far i have this :
covid_genes = adata.var_names.str.startswith((‘ORF1a’,‘S’, ‘ORF3a’, ‘E’, ‘M’, ‘ORF6’, ‘ORF7a’, ‘ORF7b’, ‘ORF8’, ‘N’, ‘ORF10’))

I added the columns counting the percentage of viral read expression
adata.obs[‘covid’] = np.sum(
adata[:, covid_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1

One option could be to calculate a score for all the COVID genes, then create a new dataset with a threshold:

covid_genes = ['ORF1a','S', 'ORF3a', 'E', 'M', 'ORF6', 'ORF7a', 'ORF7b', 'ORF8', 'N', 'ORF10'], gene_list=covid_genes, score_name='COVID_score')

Here’s a hypothetical of what your scores may look like:

import seaborn as sns


At this point, you can create a new AnnData object with a threshold filter:

adata_covid = adata[adata.obs['COVID_score'] > 0.25]

There are many ways to go about it. Envious that you have an exciting dataset to explore! Good luck!

Hi @preetiaahir,

I assume you want the expression values only for those genes? In that case it’s not the adata.obs you want, but the adata.X. You can get that via:

However, the way you select your covid genes is currenly incorrect. There are a lot of genes that start with e.g., “S”. Thus, you will have many more genes than the viral genome in your list. You could instead just use a list as was suggested by @jpreall below.

If you want to use only infected cells (assuming no other cells have any ambient reads of the SARS-CoV-2 genes), then you should be able to do something like:

covid_genes = [‘ORF1a’,‘S’, ‘ORF3a’, ‘E’, ‘M’, ‘ORF6’, ‘ORF7a’, ‘ORF7b’, ‘ORF8’, ‘N’, ‘ORF10’]
adata_infected = adata[((adata[:,covid_genes].X > 0).mean(0) > 0),:]

Remember that you only have a view of the object when subsetting without using .copy() at the end.

Thank you for your response.
With above method i get the index error.
IndexError Traceback (most recent call last)
1 covid_genes = [‘ORF1ab’,‘S’, ‘ORF3a’, ‘M’, ‘ORF6’, ‘ORF7a’,‘ORF8’, ‘N’, ‘ORF10’]
----> 2 adata_infected = adata[((adata[:,covid_genes].X > 0).mean(0) > 0),:]

~/my_virtualenv/lib/python3.7/site-packages/anndata/_core/ in _normalize_index(indexer, index)
88 if indexer.shape != index.shape:
89 raise IndexError(
—> 90 f"Boolean index does not match AnnData’s shape along this "
91 f"dimension. Boolean index has shape {indexer.shape} while "
92 f"AnnData index has shape {index.shape}."

IndexError: Boolean index does not match AnnData’s shape along this dimension. Boolean index has shape (9,) while AnnData index has shape (6650,).

You may need to do sth like:

covid_genes = [‘ORF1ab’,‘S’, ‘ORF3a’, ‘M’, ‘ORF6’, ‘ORF7a’,‘ORF8’, ‘N’, ‘ORF10’]
adata_infected = adata[((adata[:,[g in covid_genes for g in adata.var_names]].X > 0).mean(0) > 0),:]

Hi, I have a similar question as to how to filter cells based on expression of a list of genes. I have a list of about 900 genes I am interested in, and would like to subset my adata object to just contain cells that express any combination of these genes.

I have tried both of the code suggestions provided by @LuckyMD , but they both give me the same error: IndexError: Boolean index does not match AnnData’s shape along this dimension. I am not sure why this is not working (especially @LuckyMD’s second suggestion). I am working with scanpy v1.8.1.

I also tried @jpreall’s suggestion as an alternative, but my scores range from -0.04 to 0.00 and I am not quite sure how to interpret this or use it to set a threshold.

Ultimately, what I am trying to do is determine how many of my cells that express some combination of my 900 genes express these genes in a predominant way (i.e., we expect that in this group of genes, cells should have preferentially expression of just a few, and likely just one, of these genes) and above a certain threshold (how can I determine the expression is “real” and not just background?). And similarly, I want to know how many cells express multiple genes from my list of genes at a roughly equivalent level (and what is that expression level).

Any thoughts on the best way to approach these questions would be really appreciated! I figured the first step would be to subset my data to get the cells expressing my genes of interest.

EDIT: I have found a clunky workaround for subsetting my adata object. If anyone has a more elegant/efficient way to do this, that would be great!

adata_sub = adata[:,[g in genes_list for g in adata.var_names]].copy()
sc.pp.filter_cells(adata_sub, min_genes=1)
barcodes = list(adata_sub.obs.index)
adata_sub = adata[barcodes].copy()