```python
import numpy as np
import pandas as pd
import scanpy as sc
import gzip
import matplotlib.pyplot as plt

sc.settings.verbosity = 3
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

# pbmc - sc
tenx_fn = "data/BRAIN-LARGE/1M_neurons_filtered_gene_bc_matrices_h5.h5"
adata = sc.read_10x_h5(tenx_fn)
adata.var_names_make_unique()

# remove the ERCC spike-ins and GFP signal from xin
gene_names = adata.var_names.tolist()
ERCC_hits = list(filter(lambda x: 'ERCC' in x, gene_names))
adata = adata[:, [x for x in gene_names if x not in ERCC_hits]]
gene_names = adata.var_names.tolist()
eGFP_hits = list(filter(lambda x: 'eGFP' in x, gene_names))
adata = adata[:, [x for x in gene_names if x not in eGFP_hits]]  # was mistakenly filtering on ERCC_hits a second time

sc.pp.filter_cells(adata, min_counts=10)
sc.pp.filter_genes(adata, min_cells=10)
sc.pp.filter_cells(adata, min_genes=100)

adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

# adata = adata[adata.obs.n_genes_by_counts < 8000, :]
# adata = adata[adata.obs.pct_counts_mt < 5, :]

adata.raw = adata
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pl.highly_variable_genes(adata)
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

adata.layers["counts_raw"] = adata.raw.X.toarray()  # preserve counts
adata.layers["counts"] = adata.X.toarray()  # preserve counts
adata.X = adata.X.toarray()
adata.write("data/saves/pbmc68k.h5ad")
```
Could you track down where exactly you are running into memory issues?
My assumption is that it’s happening here:
```python
adata.layers["counts_raw"] = adata.raw.X.toarray()  # preserve counts
adata.layers["counts"] = adata.X.toarray()  # preserve counts
adata.X = adata.X.toarray()
adata.write("data/saves/pbmc68k.h5ad")
```
I would consider this expected. If it's happening before this, it would be really good to know where. These dense arrays should be roughly `n_obs * n_var * itemsize` bytes, or >100 GB each, so it's probably good not to have them in memory. First: can you avoid this? If not, we do allow you to write `X` and `.raw.X` as dense on disk while they're stored as sparse in memory:
```python
adata.write_h5ad("path/to/file.h5ad", compression="lzf", as_dense=["X", "raw/X"])
```
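For context, a back-of-the-envelope sketch of that size estimate (the shapes below are for the public 10x 1M_neurons dataset, roughly 1.3M cells by 28k genes in float32 — assumed here, not taken from your run):

```python
# hypothetical shapes for the 1M_neurons dataset: ~1.3M cells x ~28k genes
n_obs, n_var = 1_306_127, 27_998
itemsize = 4  # bytes per float32 entry

dense_gb = n_obs * n_var * itemsize / 1e9  # size of ONE dense copy of X
print(f"{dense_gb:.0f} GB per dense array")
```

With three dense copies (`layers["counts_raw"]`, `layers["counts"]`, and the densified `X`), peak memory multiplies accordingly.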
Thanks for the quick reply, @ivirshup.
Removing the `.toarray()` calls fixed the issue.
Can you provide some advice on how to deal with the resulting `scipy.sparse.csr_matrix` in PyTorch?
It seems that the PyTorch sparse (COO) tensor constructor cannot be called on scipy CSR matrices directly.
I'm not sure I can. My best guess would be to convert the matrix to COO (e.g. `X.tocoo()`), then pass the coordinate and data arrays from that.
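A sketch of that conversion, assuming only scipy (the helper name is mine; the `torch.sparse_coo_tensor` call is left commented so the snippet runs without PyTorch):

```python
import numpy as np
import scipy.sparse as sp

def csr_to_coo_parts(X):
    """Extract the (indices, values, shape) triple that
    torch.sparse_coo_tensor expects from a scipy CSR matrix."""
    coo = X.tocoo()
    indices = np.vstack([coo.row, coo.col])  # shape (2, nnz)
    return indices, coo.data, coo.shape

X = sp.random(5, 4, density=0.3, format="csr", random_state=0)
indices, values, shape = csr_to_coo_parts(X)
# With PyTorch installed, this becomes a sparse tensor:
# t = torch.sparse_coo_tensor(indices, values, size=shape)
```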
For the most part, densifying each minibatch during stochastic inference has worked in large-scale scRNA-seq cases. We haven't explicitly explored using PyTorch sparse objects.
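To illustrate the minibatch-densification approach (a minimal scipy-only sketch; the generator name is mine, not from any library):

```python
import numpy as np
import scipy.sparse as sp

def dense_minibatches(X, batch_size):
    """Yield dense row blocks from a sparse matrix, so only one
    small dense minibatch exists in memory at a time."""
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size].toarray()

X = sp.random(10, 6, density=0.2, format="csr", random_state=0)
batches = list(dense_minibatches(X, 4))  # row blocks of 4, 4, 2
assert np.allclose(np.vstack(batches), X.toarray())
```

Each yielded block would typically be wrapped with `torch.from_numpy` inside a DataLoader, keeping the full matrix sparse.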
This sounds reasonable to me. But I haven’t looked too deeply into PyTorch sparse.