# PAGA initialization

Hello everyone,

My team and I have been using scanpy to study a cell differentiation process, and we have some questions about how some of the tools in scanpy work.

1. What is meant by PAGA initialization of UMAP? Is it done at the level of the similarity matrix, or later in the optimization process and the cost function of UMAP?

2. What is the difference between Spectral initialization and PAGA initialization in UMAP?

3. What is the difference between scanpy PAGA and the two versions of PAGA in dynverse?

To put it simply I don’t understand the links between Leiden, PAGA and UMAP except maybe the fact that they use the same KNN graph in the first place…
I have looked at the UMAP and PAGA papers, but I honestly have a hard time understanding everything.

Would it be possible to give me the outline I need to better understand the logic behind these tools?

Thank you

Best Regards

Nicolas

Hi,

I can answer the first two questions about UMAP initialization, and have pinged a `dynverse` author about the third.

1. What is meant by PAGA initialization of UMAP? Is it done at the level of the similarity matrix, or later in the optimization process and the cost function of UMAP?

It’s just the initial positions in the embedding space. If you’ve built the coarse-grained PAGA graph from the neighbors graph, you can then compute a force-directed layout of that PAGA graph. Calling `sc.tl.umap(adata, init_pos="paga", ...)` will just use the coordinates from that layout as the initial positions for all points in each of the reference clusters.
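To make the idea concrete, here is a minimal NumPy sketch of what "initial positions from a coarse layout" means. It is not scanpy's code: the function name `init_from_coarse_layout`, the `jitter` parameter, and the choice of a small uniform offset are assumptions made for illustration only.

```python
import numpy as np

def init_from_coarse_layout(labels, node_pos, jitter=0.1, seed=0):
    """Return one 2D start position per cell, centred on its cluster's node."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    node_pos = np.asarray(node_pos, dtype=float)
    # Every cell starts at the layout position of the cluster it belongs to ...
    init = node_pos[labels]
    # ... plus a small uniform offset (an assumption of this sketch) so that
    # the cells of one cluster do not all sit on exactly the same point.
    init += rng.uniform(-jitter, jitter, size=init.shape)
    return init

# Six cells in three clusters; node_pos holds the coarse layout coordinates.
labels = [0, 0, 1, 2, 2, 2]
node_pos = [[0.0, 0.0], [5.0, 0.0], [5.0, 5.0]]
init = init_from_coarse_layout(labels, node_pos)
```

The similarity graph and the cost function are untouched here; only the starting coordinates handed to the optimizer change.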

2. What is the difference between Spectral initialization and PAGA initialization in UMAP?

Just the initial position of the points before optimizing the layout. If you use `spectral`, a spectral decomposition of the nearest neighbor graph is used to get the initial positions before UMAP optimizes the 2d embedding.
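For intuition, a spectral initialization can be sketched in a few lines of NumPy. This is a simplified stand-in, not UMAP's implementation: it assumes a small, dense, connected graph with no isolated nodes and uses a full eigendecomposition, and the function name `spectral_init` is made up for the example.

```python
import numpy as np

def spectral_init(adjacency, dim=2):
    """Initial coordinates from the low eigenvectors of the normalized graph Laplacian."""
    A = np.asarray(adjacency, dtype=float)
    degree = A.sum(axis=1)  # assumes every node has at least one edge
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    laplacian = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; the first eigenvector is
    # the trivial one, so the next `dim` eigenvectors are used as coordinates.
    _, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs[:, 1:dim + 1]

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
coords = spectral_init(A)
```

On this toy graph the first coordinate places the two triangles on opposite sides of zero, which is the kind of global structure a spectral start hands to the optimizer.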

To put it simply I don’t understand the links between Leiden, PAGA and UMAP except maybe the fact that they use the same KNN graph in the first place…
I have looked at the UMAP and PAGA papers, but I honestly have a hard time understanding everything.

These can be quite dense – especially the UMAP publication – but all you should need is an intuition. I highly recommend this description from the UMAP documentation for a high level understanding. This should give you an understanding of how UMAP weights the KNN graph and uses it to generate a 2d layout.

Leiden is a clustering algorithm which partitions the nodes in the KNN graph into communities/clusters. PAGA will build a “coarse grained” representation of the full KNN graph by summarizing each of the clusters into a single node. Edges between PAGA nodes are summarized from connectivities between the groups of nodes on the full graph.
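The coarse-graining step can be illustrated with a small NumPy sketch. Note the simplification: PAGA's actual connectivity is a statistical measure, whereas this example just sums the edge weights between clusters. The function name `coarse_grain` and the toy data are assumptions for illustration.

```python
import numpy as np

def coarse_grain(connectivities, labels):
    """Collapse a cell-level graph to a cluster-level graph by summing edge weights."""
    W = np.asarray(connectivities, dtype=float)
    labels = np.asarray(labels)
    n_clusters = labels.max() + 1
    # One-hot membership matrix: M[i, c] = 1 if cell i belongs to cluster c.
    M = np.zeros((len(labels), n_clusters))
    M[np.arange(len(labels)), labels] = 1.0
    # (M^T W M)[a, b] is the total weight of edges between clusters a and b.
    coarse = M.T @ W @ M
    np.fill_diagonal(coarse, 0.0)  # drop within-cluster weight
    return coarse

# Six cells on a chain, two cells per cluster.
labels = [0, 0, 1, 1, 2, 2]
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 1.0), (3, 4, 0.25), (4, 5, 1.0)]:
    W[i, j] = W[j, i] = w
coarse = coarse_grain(W, labels)
```

Clusters 0 and 1 end up linked with weight 0.5, clusters 1 and 2 with weight 0.25, and clusters 0 and 2 are not linked at all.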

Does this help?

Thank you !

Okay, so

1. We build the KNN graph (by the same method used by UMAP): `scanpy.pp.neighbors`
2. We segment the KNN graph (no matter the method): `sc.tl.leiden(adata)`
3. We calculate the connectivity between clusters with PAGA: `sc.tl.paga(adata, groups='clusters')`
4. We display this PAGA graph with, for example, ForceAtlas: `sc.pl.paga(adata, groups='clusters')`.
   All points of a cluster are assigned the ForceAtlas coordinates of the centroid. The coordinates are saved
   in the adata. (Is `sc.pl.paga` a necessary step?)
5. We initialize UMAP with this representation: `sc.tl.umap(adata)`
6. Pseudotime can be computed with `scanpy.tl.dpt` on the KNN graph computed in 1.

Is this correct?
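For reference, the steps listed above can be sketched as scanpy calls roughly as follows. This is a hedged outline rather than a verified recipe: the wrapper `paga_initialized_umap`, the `cluster_key` and `root_cell` parameters are hypothetical, and scanpy is imported inside the function only so the sketch can be defined without scanpy installed. Two points differ from the list: UMAP is given `init_pos="paga"` (as in the reply above), and, as far as I know, `sc.pl.paga` has to run first because it is the call that stores the layout positions.

```python
def paga_initialized_umap(adata, cluster_key="leiden", root_cell=0):
    """Run neighbors -> Leiden -> PAGA -> PAGA-initialized UMAP -> DPT in place."""
    import scanpy as sc  # local import: keeps this sketch definable without scanpy

    sc.pp.neighbors(adata)                      # 1. KNN graph shared by all later steps
    sc.tl.leiden(adata, key_added=cluster_key)  # 2. clusters in adata.obs[cluster_key]
    sc.tl.paga(adata, groups=cluster_key)       # 3. cluster-level connectivities in adata.uns["paga"]
    sc.pl.paga(adata, show=False)               # 4. lays out the coarse graph and stores the positions
    sc.tl.umap(adata, init_pos="paga")          # 5. UMAP started from the PAGA layout
    adata.uns["iroot"] = root_cell              # 6. DPT needs a root cell (index chosen by the user)
    sc.tl.diffmap(adata)
    sc.tl.dpt(adata)
    return adata
```

Without `init_pos="paga"`, `sc.tl.umap` falls back to its default initialization and the PAGA layout is not used.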

Nicolas

I would like some clarification regarding the initialization step. As indicated in the PAGA paper, the positions of the nodes of the fine structure graph that belong to a group corresponding to a node of the coarse structure graph are randomly distributed in a rectangle located around the region of this node.
Why this choice of a rectangle? Why not, for example, choose a circle or even the same coordinates of a centroid for all the spectra present in a group?

Thank you

Nicolas

Hey @colapili !

For all the TI methods we included in dynverse, the output they generate eventually gets translated into two objects (or three, for probabilistic trajectories): the `milestone_network` and the `progressions`. The milestone network is the graph topology of the trajectory, represented by a data frame containing the columns “from”, “to”, “length”. The progressions represent a mapping of each cell to a location in between two milestones, stored as a data frame containing the columns “cell_id”, “from”, “to”, and “percentage” (percentage along the edge).

All of the TI methods produce very different types of output, and we wrote 7 different types of output wrappers which translate the output produced by a method into the data structures mentioned above. An overview of the different types of output is shown here: https://raw.githubusercontent.com/dynverse/dynwrap/master/man/figures/overview_wrapping_v3.png . (The first one assumes that the method produces a milestone network and progressions.)

In the ti_paga `run.py` (ti_paga/run.py at master · dynverse/ti_paga · GitHub), we run PAGA on the given dataset and extract the relevant data structures from the anndata object by setting a cutoff on the connectivities object.
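As a rough illustration of the cutoff idea (not the actual `run.py` code), a cluster-level connectivity matrix can be turned into milestone-network rows like this. The function name, the default `cutoff` value, and the use of the connectivity itself as the “length” are assumptions of this sketch.

```python
def milestone_network_from_connectivities(connectivities, cutoff=0.05):
    """Keep cluster pairs whose connectivity reaches the cutoff, as from/to/length rows."""
    n = len(connectivities)
    rows = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: each undirected edge once
            weight = connectivities[i][j]
            if weight >= cutoff:
                # Using the connectivity as "length" is an assumption of this sketch.
                rows.append({"from": str(i), "to": str(j), "length": weight})
    return rows

# Three clusters: 0-1 and 1-2 are well connected, 0-2 falls below the cutoff.
conn = [
    [0.0, 0.8, 0.01],
    [0.8, 0.0, 0.3],
    [0.01, 0.3, 0.0],
]
network = milestone_network_from_connectivities(conn)
```

Here `network` contains two edges, 0 to 1 and 1 to 2; the weak 0 to 2 link is dropped by the cutoff.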

The `ti_paga_tree` uses the `connectivities_tree` object instead (ti_paga_tree/run.py at master · dynverse/ti_paga_tree · GitHub). We found that `ti_paga_tree` is less flexible (since it can’t, for instance, create trajectories with cycles), but its performance is sometimes a bit better, so we decided to leave both in.

Hope this answers your question.

(Edit: Previous response was split up into two posts because I can only post 2 URLs in a single post because I’m a new user )

Hi,
Thank you for your answer, it is indeed much more understandable for me.
Nicolas