Issues with very large data set

The following script attempts to import the 1.3M neuron data set downloaded from 10X, but I was forced to kill it when all 128 GB of RAM (and a 256 GB swap file) were almost exhausted. How can I fix this?

import numpy as np
import pandas as pd
import scanpy as sc
import gzip
import matplotlib.pyplot as plt

sc.settings.verbosity = 3
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

# # pbmc - sc
tenx_fn = "data/BRAIN-LARGE/1M_neurons_filtered_gene_bc_matrices_h5.h5"
adata = sc.read_10x_h5(tenx_fn)
adata.var_names_make_unique()

# # remove the ercc spike and gfp signal from xin
gene_names = adata.var_names.tolist()
ERCC_hits = list(filter(lambda x: 'ERCC' in x, gene_names))
adata = adata[:, [x for x in gene_names if not (x in ERCC_hits)]]

gene_names = adata.var_names.tolist()
eGFP_hits = list(filter(lambda x: 'eGFP' in x, gene_names))
adata = adata[:, [x for x in gene_names if not (x in eGFP_hits)]]

sc.pp.filter_cells(adata, min_counts=10)
sc.pp.filter_genes(adata, min_cells=10)
sc.pp.filter_cells(adata, min_genes=100)

adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

# adata = adata[adata.obs.n_genes_by_counts < 8000, :]
# adata = adata[adata.obs.pct_counts_mt < 5, :]

adata.raw = adata
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)

sc.pl.highly_variable_genes(adata)

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts'],
             jitter=0.4, multi_panel=True)

sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

adata.layers["counts_raw"] = adata.raw.X.toarray() # preserve counts
adata.layers["counts"] = adata.X.toarray() # preserve counts
adata.X = adata.X.toarray()
adata.write("data/saves/pbmc68k.h5ad")

Could you track down where exactly you are running into memory issues?

My assumption is that it’s happening here:

adata.layers["counts_raw"] = adata.raw.X.toarray() # preserve counts
adata.layers["counts"] = adata.X.toarray() # preserve counts
adata.X = adata.X.toarray()
adata.write("data/saves/pbmc68k.h5ad")

I would consider this expected. If it’s happening before this, it would be really good to know where. These dense arrays should be roughly:

\frac{4 \text{ bytes per value} \cdot 1.4 \text{ million samples} \cdot 20 \text{ thousand genes}}{1024^3 \text{ bytes per GiB}}

Or >100 GB each. Probably good not to have these in memory. First, can you avoid this at all? (There's a sketch of that below.) If not, we do allow you to write .X and .raw.X as dense if they're stored as sparse in memory:

adata.write_h5ad("path/to/file.h5ad", compression="lzf", as_dense=["X", "raw/X"])
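For the first option, a rough, untested sketch would be to simply drop the .toarray() calls and keep the layers as sparse copies (this assumes adata.X and adata.raw.X are scipy sparse matrices, which is what read_10x_h5 gives you):

# Untested sketch: keep the layers sparse instead of densifying
adata.layers["counts_raw"] = adata.raw.X.copy()  # sparse copy, no .toarray()
adata.layers["counts"] = adata.X.copy()          # sparse copy, no .toarray()
adata.write("data/saves/pbmc68k.h5ad")           # sparse matrices are written to h5ad natively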

Thanks for the quick reply, @ivirshup.

Removing the .toarray() calls fixed the issue.

Can you provide some advice on how to deal with the resulting scipy.sparse.csr_matrix in PyTorch?
It seems that the PyTorch sparse (COO) tensor constructor cannot be called on scipy CSR matrices directly.

Thanks, Matt

I’m not sure I can. My best guess would be to transform the matrix to COO (e.g. X.tocoo()), then pass the coordinate and data arrays from that?
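Untested, but something along these lines should work (the csr_to_torch_coo helper name is just for illustration):

import numpy as np
import scipy.sparse as sp
import torch

def csr_to_torch_coo(X: sp.csr_matrix) -> torch.Tensor:
    # COO exposes explicit row/column coordinate arrays, which is what torch expects
    coo = X.tocoo()
    indices = np.vstack([coo.row, coo.col])  # shape (2, nnz)
    return torch.sparse_coo_tensor(
        torch.from_numpy(indices).long(),
        torch.from_numpy(coo.data),
        size=coo.shape,
    )

# e.g. sparse_X = csr_to_torch_coo(adata.X)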

I think @Koncopd or @adamgayoso may be better able to help here.


For the most part, densifying each minibatch during stochastic inference has worked well at large scale for scRNA-seq. We haven’t explicitly explored using PyTorch sparse objects.
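For example (just a sketch, assuming the counts live in a scipy CSR matrix; the class and variable names are illustrative):

import scipy.sparse as sp
import torch
from torch.utils.data import Dataset, DataLoader

class SparseCountsDataset(Dataset):
    """Keeps the full matrix sparse; only the rows pulled into each minibatch get densified."""

    def __init__(self, X: sp.csr_matrix):
        self.X = X

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        # densifying a single row is cheap; the full matrix stays sparse
        return torch.from_numpy(self.X[idx].toarray().squeeze(0))

# e.g. loader = DataLoader(SparseCountsDataset(adata.layers["counts"]), batch_size=128, shuffle=True)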

This sounds reasonable to me. But I haven’t looked too deeply into PyTorch sparse.