Renaming genes and subsetting by new names, python crashes


I need to change some gene names in my AnnData object to enable data integration between two different species. However, whenever I try to subset the renamed data set, python dies.

Here’s what I do:

  1. import the query data set from an .h5ad file. Here’s the data set structure:
    AnnData object with n_obs × n_vars = 19759 × 1280
    obs: ‘n_counts’, ‘n_genes’, ‘percent_mito’, ‘sample’, ‘scale_gene0’, … ‘scale_gene1279’, ‘leiden’
    var: ‘highly_variable’, ‘means’, ‘dispersions’, ‘dispersions_norm’
    uns: ‘leiden’, ‘leiden_colors’, ‘neighbors’, ‘pca’, ‘umap’
    obsm: ‘X_pca’, ‘X_scvi’, ‘X_umap’
    varm: ‘PCs’

  2. import a list of orthologs from a text file

  3. loop through adata_query.var_names; if adata_query.var_names[i] is in the list of orthologs, I assign adata_query.var_names.values[i] = ortholog

  4. sometimes different genes have the same predicted ortholog, so I identify duplicates in adata_query.var_names.values and rename them

  5. import the reference and intersect the gene lists in the query and the reference data set:
    common_genes = adata_reference.var_names.intersection(adata_query.var_names)

  6. subset the reference data set:
    adata_reference = adata_reference[:,common_genes]

  7. try to subset the query data set:
    adata_query = adata_query[:,common_genes]

  8. python crashes.

I would be grateful for any help!

Here’s a reproducible example which results in python crashing with the following error message:

[IPKernelApp] WARNING | No such comm: 52941e088e3211eaac1bacde48001122

The downsampled files are here:

The code:

import pandas as pd
import scanpy as sc
from collections import Counter
from itertools import tee, count

# import files
adata_reference ='/Users/bsierieb/Desktop/adata_reference.h5ad')
adata_query ='/Users/bsierieb/Desktop/adata_query.h5ad')
orthology = pd.read_csv('/Users/bsierieb/Desktop/orthologs.txt',

# rename genes in adata_query (takes a couple of minutes)
for i in range(0,adata_query.var_names.shape[0]):
    for line in range(0,orthology.iloc[:,0].size):
        if adata_query.var_names[i] in orthology.iloc[line,0]:
            ortholog = orthology.iloc[line,1]
            ortholog = (ortholog.split(sep='|')[1]).split(sep='-')[0]
            adata_query.var_names.values[i] = ortholog

# identify duplicate gene names and make them unique
def uniquify(seq, suffs = count(1)):
    """Make all the items unique by adding a suffix (1, 2, etc).

    `seq` is mutable sequence of strings.
    `suffs` is an optional alternative suffix iterable.
    not_unique = [k for k,v in Counter(seq).items() if v>1] # so we have: ['name', 'zip']
    # suffix generator dict - e.g., {'name': <my_gen>, 'zip': <my_gen>}
    suff_gens = dict(zip(not_unique, tee(suffs, len(not_unique))))  
    for idx,s in enumerate(seq):
            suffix = str(next(suff_gens[s]))
        except KeyError:
            # s was unique
            seq[idx] += suffix


# intersect gene names between data sets
common_genes = adata_reference.var_names.intersection(adata_query.var_names)

# subset both data sets
adata_reference = adata_reference[:,common_genes]
adata_query = adata_query[:,common_genes]