Comparing samples within a merged dataset


I have a dataset composed of 2 samples, one is control and the other is experimental. I am having trouble figuring out how to use to compare the samples with respect to the Louvain clustering. For example, in Cluster 1, I want to determine DEGs from experimental with respect to control… for cluster 2, I want to determine DEGs from experimental with respect to control. When I try to edit the ‘reference’ criteria, it does not allow me to pass an adata.obs variable. Could someone give some advice? Thanks!

Here’s my solution to the problem you had. It’s maybe not the prettiest (a lot of loops and conditionals), but it is not that slow on a moderate dataset (~20k cells).

adata.obs['sample_louvain'] = adata.obs['louvain']
adata.obs['sample_louvain'] = adata.obs['sample_louvain'].cat.add_categories(['control_Cluster1', 'experimental_Cluster1'])
for i in range(len(adata.obs)):
    if adata.obs['louvain'][i] == 'Cluster1':
        if adata.obs['sample'][i] == 'control':
            adata.obs['sample_louvain'][i] = "control_Cluster1"
            adata.obs['sample_louvain'][i] = "experimental_Cluster1"
adata.obs['sample_louvain'] = adata.obs['sample_louvain'].values.remove_unused_categories(), groupby='sample_louvain', method='wilcoxon', groups=['experimental_Cluster1'], reference='control_Cluster1')

Now, this is the code to do it only for one particular cluster. If you’d like to do it for all the clusters at once, then I guess the easiest would be add one more loop going over all the clusters (instead of the first if conditional).

Another way that I’ve gone about this is to subset the data by each cluster into its own AnnData object, then running rank_genes_groups between samples on each object iteratively. This assumes you have an .obs key that annotates the sample id (eg. ‘Sample’):

for i in adata.obs['louvain'].unique():
    temp = adata[adata.obs['louvain'] == i], groupby='Sample')

You can also throw in a couple lines into the loop to save the list of differentially expressed genes using sc.get.rank_genes_groups_df(temp, ...).

I had a similar question, and the code provided above works great for me, so thank you @GMaciag and @jpreall !

I had a related follow-up question. I also have an integrated dataset that is composed of a WT and mutant sample where I want to know what genes are differentially expressed between the WT and mutant cells in a particular cluster. Based on discussion I’ve seen on scanpy’s Github, it seems like is probably sufficient to get at least a general idea of DEGs, but it also seems like using diffxpy is a (potentially better?) option as well. Curious as to what other people have been doing for data like this, and if one approach is considered better (and why).