Getting rid of bottom .vars

Hi all,

I have a 10x dataset where someone listed gene programs at the bottom of the genelist. When I try to find markers these gene programs get in the way. I can manually remove them like a data frame, but then when I save/load the anndata object I get an error because the matrix isn’t even. I was wondering how do I go about excluding the bottom 203 rows of my .vars during analysis? Or how do I split it into a different .var? I hope that makes sense.

Here is a picture of what my .var list looks like. The bottom 203 gene-ids are gene programs that shouldn’t be there! Also should my gene_ids be split into two columns gene_names and gene_ids? How do I do that?

You can split the anndata using standard indexing. For example:

top_adata = adata[:, :-203].copy()  # everything except the last 203 entries (the genes)
bot_adata = adata[:, -203:].copy()  # only the last 203 entries (the gene programs)

The gene names can be accessed with adata.var_names (an attribute, not a method call), which returns the index of the .var dataframe. The gene IDs can be returned with column indexing, adata.var['gene_ids'], so they should be fine the way they are.

I hope that answers the questions you were asking.

Hi Chuck,

Thank you so much for your help. The first line of code worked perfectly. For the second part about calling var names: what I was trying to say is that the original genes.tsv file has two columns (one with gene names and one with Ensembl gene IDs), but when I read in my 10x mtx file, it creates a .var with one column. I guess I’m not sure if this is normal. In the picture above it looks like the gene_ids got mashed up with the gene names into one column. Does that make sense? Do you think it looks normal as one column? Will I be able to run Enrichr and other gene analyses later on?

I think I understand your question better now. It looks to you like the two naming conventions have been combined into one column, and you are worried this might affect downstream analysis?

It may look like only one column, but on read-in scanpy places the gene symbols into the index of the var dataframe (df), which makes it a lot cleaner to reference rows of the df. This is the default for sc.read_10x_mtx (var_names='gene_symbols'); passing var_names='gene_ids' puts the Ensembl IDs in the index instead. The only other entry in the df is the gene_ids column. When you reference a gene in any of scanpy’s functions, it will by default expect the naming convention used in the index of var. Scanpy handles the data the same way regardless of which naming is used; the only difference is how you refer to the genes you are interested in, for example when overlaying them on a UMAP plot.


Thank you for the explanation, Chuck! The gseapy.enrichr code is giving me an error ‘cannot read gene list’ but I will have to look into the code a little bit more to fully understand what the problem is.