plot

Collection of plotting functions.

See the big picture

These functions are intended to help you find your way around the data. I use them a lot when generating reports for collaborators, to set the stage for the more detailed analyses. In particular, I want to highlight which clusters we are comparing and where they are in the context of both species as well as the comparison.

source

highlighted_dimplot

 highlighted_dimplot (adata:anndata._core.anndata.AnnData, species:str,
                      clustering:str, cluster:str, embedding:str='X_umap',
                      highlight:str='red', figsize:tuple=(10, 10),
                      save:Optional[str]=None)

Plot a low-dimensional embedding and highlight a chosen cluster with a superimposed circle.

	Type	Default	Details
adata	AnnData		AnnData object to plot.
species	str		Species name. Will be used in the title, and removed from the cluster names if present.
clustering	str		Clustering to plot. Must be present in `adata.obs`.
cluster	str		Cluster to highlight.
embedding	str	X_umap	Embedding to plot (default: “X_umap”).
highlight	str	red	Color of the circle (default: “red”).
figsize	tuple	(10, 10)	Figure size (default: (10, 10)).
save	typing.Optional[str]	None	Path to save the figure (default: None).
Returns	None

planarian = sc.read_h5ad(os.environ["EXAMPLE_DATA_PATH"] + "planarian.h5ad")
planarian.var.index = "pl_" + planarian.var.index

highlighted_dimplot(planarian, "Smed", "tissue", "Pharynx", figsize=(6, 6))

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:1234: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Of course, if a cluster is too spread out, the plot will not work as well, but it will still be useful to see the general location of the cluster.

highlighted_dimplot(planarian, "Smed", "tissue", "Neural", figsize=(6, 6))

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:1234: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Another case that might not work so well is if the shape of a cluster is not Gaussian. Still, it should help us spot the cluster, especially if the color contrasts enough.

highlighted_dimplot(
    planarian, "Smed", "tissue", "Muscle", figsize=(6, 6), highlight="lightgreen"
)

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:1234: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

source

highlighted_heatmap

 highlighted_heatmap (to_plot, celltype_from, celltype_to, figheight=20,
                      save=None)

Plot a heatmap of pairwise similarities between cell types, with a red box highlighting the query cell type.

	Type	Default	Details
to_plot	pd.DataFrame		A dataframe of pairwise similarities between cell types.
celltype_from	str		Cell type of the query species to highlight. Must be in `to_plot.columns`.
celltype_to	str		Cell type of the target species to highlight. Must be in `to_plot.index`.
figheight	int	20	Height of the resulting plot in inches. Width will be calculated automatically (default: 20).
save	NoneType	None	Path to result figure; if None, the figure will be plotted but not saved (default: None).
Returns	None

hysc = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "hysc_similarity_table.csv",
    index_col=0,
)

highlighted_heatmap(
    hysc, celltype_from="hy_i_SC", celltype_to="sc_Neoblast: 0", figheight=12
)

source

annotated_heatmap

 annotated_heatmap (sm:samap.mapping.SAMAP,
                    similarity:pandas.core.frame.DataFrame,
                    query_species:str, target_species:str,
                    query_clustering:str, target_clustering:str,
                    query_coarse:Optional[str]=None,
                    target_coarse:Optional[str]=None,
                    interactive:bool=False,
                    figsize:Optional[Tuple[float,float]]=None,
                    save:Optional[str]=None, **kwargs:Any)

Plot the similarity matrix as an annotated heatmap.

	Type	Default	Details
sm	SAMAP		SAMAP object
similarity	DataFrame		Similarity matrix. Contains query species clusters as columns and target species clusters as rows.
query_species	str		Query species ID. Will be used in the title. Should prepend the similarity matrix column names.
target_species	str		Target species ID. Will be used in the title. Should prepend the similarity matrix row names.
query_clustering	str		Query species clustering. Must be present in `sm.sams[query_species].adata.obs`.
target_clustering	str		Target species clustering. Must be present in `sm.sams[target_species].adata.obs`.
query_coarse	typing.Optional[str]	None
target_coarse	typing.Optional[str]	None	Query species coarse clustering. Must be present in `sm.sams[query_species].adata.obs`. If None, will be set to `query_clustering` (default: None).
interactive	bool	False	If True, will return a plotly figure. Otherwise, will return a matplotlib figure (default: False).
figsize	typing.Optional[typing.Tuple[float, float]]	None
save	typing.Optional[str]	None	Figure size. If None, will be guessed from the size of the similarity matrix (default: None).
kwargs	typing.Any
Returns	typing.Optional[plotly.graph_objs._figure.Figure]		If not None, will save the figure to the specified path (default: None). Additional arguments to pass to `seaborn.heatmap` (matplotlib) or `plotly.graph_objects.Figure` (plotly). Among them: dpi (int), which is only used if `interactive=True` to set the figure size in pixels.

Read the requisite files: the SAMap object and the pairwise cluster similarity matrix.

file = open(os.environ["EXAMPLE_DATA_PATH"] + "hypl.pkl", "rb")
sm = pickle.load(file)

hypl = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "hypl_similarity_table.csv",
    index_col=0,
)

First improve the coarse level assignments for the data we have here:

sm.sams["hy"].adata.obs["coarse"] = (
    sm.sams["hy"].adata.obs["Cluster"].str.split("_").str[0]
)
sm.sams["pl"].adata.obs["coarse"] = (
    sm.sams["pl"].adata.obs["tissue_smedwi"].str.split("_").str[0]
)

annotated_heatmap(
    sm,
    hypl,
    "hy",
    "pl",
    query_clustering="Cluster",
    target_clustering="cluster",
    query_coarse="coarse",
    target_coarse="coarse",
    interactive=False,
    save="hypl.pdf",
)

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:1234: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:1234: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

This sort of plot does a much better job of showing the relationships between clusters than the Sankey diagram. Clear patches of similarity emerge that would be otherwise invisible. This is a result of the hierarchical relationship between cell types (see Arendt/Musser/Wagner in 2016 and 2019). By combining maps like this with the per-species cell type family trees we can start to get a better idea of how cell types may have evolved over time.

assert os.path.exists("hypl.pdf")

We can make the same plot in interactive form. Unfortunately, the interactive version is a bit more finicky with color usage, so for now I’ll let plotly choose the colors, but any plotly wizards out there are welcome to improve this. Right now hovering over a point will show the cluster names and SAMap score; hovering over the colored bars at the margins should show the coarse cluster.

plotly_overview = annotated_heatmap(
    sm,
    hypl,
    "hy",
    "pl",
    query_clustering="Cluster",
    target_clustering="cluster",
    query_coarse="coarse",
    target_coarse="coarse",
    figsize=(10, 10),
    interactive=True,
    dpi=100,
    save="hypl",
)

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:1234: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:1234: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/var/folders/md/d6lwwbv97xb6g6ddypntnprh0000gp/T/ipykernel_73405/1709870291.py:41: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
  query_map[query_coarse] = query_map[query_coarse].replace(to_replace=query_lut)
/var/folders/md/d6lwwbv97xb6g6ddypntnprh0000gp/T/ipykernel_73405/1709870291.py:42: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
  target_map[target_coarse] = target_map[target_coarse].replace(to_replace=target_lut)

assert os.path.exists("hypl.json")
assert os.path.exists("hypl.html")

You can visualize the plot in a notebook with:

# not run
plotly_overview.show()

Pairwise comparisons

When comparing two clusters, it is most informative to have a look at the genes they are expressing. If the clusters are in the same species we can use a simple dotplot that combines the two clusters’ marker genes; things are not so simple when comparing between species. By plotting two dotplots side by side, and connecting homologous genes across the dotplots, we can get a better sense of how the clusters are connected.

A lot of the heavy lifting for this idea was already done by the scanpy team, who write excellent, well-documented code. I’ve adapted their code to produce the paired dotplots, fiddled with it to make sure dot size and color are consistent across plots, and added the homology lines. The latter part is rather tricky, and I would be thankful to anyone who can improve it; currently it involves a lot of heuristics to translate the coordinates from one side of the plot to the other.

source

paired_dotplot

 paired_dotplot (query:anndata._core.anndata.AnnData,
                 target:anndata._core.anndata.AnnData, connections:<built-
                 infunctionarray>, query_clustering:str,
                 target_clustering:str, query_species:str,
                 target_species:str, query_cluster:Optional[str]=None,
                 target_cluster:Optional[str]=None, pad:bool=True,
                 x_offset:float=1, y_offset:float=0, grid_offset:float=30,
                 query_gene_names:Optional[str]=None,
                 target_gene_names:Optional[str]=None,
                 output:str='./paired_dotplot.png', center:bool=True,
                 title:Optional[str]=None, title_font_size:float=16,
                 scale:bool=False,
                 cmap:matplotlib.colors.Colormap='magma_r',
                 layer:Optional[str]=None)

	Type	Default	Details
query	AnnData		query species AnnData object
target	AnnData		target species AnnData object
connections	array		array of connected genes. Each row has at least two columns containing the query species gene and corresponding target species gene, and optionally their connection strength. Genes are allowed to be repeated on both sides.
query_clustering	str		`.obs` column in the query AnnData object containing the query species clustering.
target_clustering	str		`.obs` column in the target AnnData object containing the target species clustering.
query_species	str		query species name. Will only be used in the title, so does not have to conform with the query species ID in the similarity matrix/SAMap object.
target_species	str		target species name. Will only be used in the title, so does not have to conform with the target species ID in the similarity matrix/SAMap object.
query_cluster	typing.Optional[str]	None
target_cluster	typing.Optional[str]	None	the cell type/cluster of the query species that is being compared (default: None).
pad	bool	True	whether to pad the gene names with spaces to make them all of a similar length (default: True).
x_offset	float	1	Number of inches to add to the horizontal size of the canvas (default: 1).
y_offset	float	0	Number of inches to add to the vertical size of the canvas (default: 0).
grid_offset	float	30	Grid segments to add between the two dotplots. Might be useful if the gene names are not legible/lines overlap (default: 30).
query_gene_names	typing.Optional[str]	None
target_gene_names	typing.Optional[str]	None	`.var` column that holds unique gene names for the query species (default: None).
output	str	./paired_dotplot.png	path to save the plot to (default: “./paired_dotplot.png”).
center	bool	True	whether to center the dotplot (default: True).
title	typing.Optional[str]	None	overall title of the plot (default: None).
title_font_size	float	16	font size of the overall plot title (default: 16).
scale	bool	False	whether to scale the expression values to be between 0 and 1 for each gene (default: False).
cmap	Colormap	magma_r	colormap to use for the dotplot (default: “viridis”).
layer	typing.Optional[str]	None	layer : Union[str, None], optional The layer to use for the average expression calculation. If not specified, it will use the `.X` slot of the `AnnData` objects. It is vital to set this correctly to avoid calculating average expression on log1p-transformed data (default: None).
Returns	None

Recall that scanpy dotplots need three inputs: the AnnData object, the names of genes to plot, and the name of the clustering. For paired dotplots we need this information for both sides of the plot.

However, genes are input via the connections slot; an array where each row contains two genes and (optionally) the strength of their connection. The function will reorder the genes in order to plot the densely connected gene groups first, and will also color-code each group to make reading the plot easier.

Additionally, since plots like this may become busy very fast, the function also accepts the names of the two clusters to be compared; it will then highlight the cluster names in the plot.

We are going to prepare the inputs first:

hydra = sm.sams["hy"].adata  # the query dataset
planarian = sm.sams["pl"].adata  # the target dataset
planarian.raw.var.index = "pl_" + planarian.raw.var.index

# the .obs column that contains the cluster labels
query_clustering = "Cluster"
target_clustering = "cluster"

# the species IDs
query_species = "hy"
target_species = "pl"

# the cluster labels to highlight
cluster_from = "ecEp_SC2"
cluster_to = "Epidermal: 2"

The connections option is provided in this form deliberately, in order to allow the user to decide which genes to connect and how. This may cost a little extra effort for the most common cases, but it means that the function is more flexible.

The function currently supports three types of connections, corresponding to connection strength of [0, 1, 2], accordingly. It’s up to the user to assign meaning to the levels. The linestyles that correspond to these are

dotted for 0 (linestyle = (0, (1, 5)))
dashed for 1 (linestyle = (0, (5, 5)))
solid for 2

In the future I would like to let the user specify the linestyles, but for now this is what we have.

Click here for some possible applications

when comparing a dataset to itself, we could connect a gene to all its paralogs and visualize patterns of paralog substitution in cell type complement of an organism.

hydra_genes = np.array(
    ["hy_t13309aep", "hy_t14973aep", "hy_t10876aep", "hy_t38670aep", "hy_t19278aep"]
)
plan_genes = np.array(
    [
        "pl_dd_Smed_v4_361_0_1",
        "pl_dd_Smed_v4_361_0_1",
        "pl_dd_Smed_v4_1215_0_1",
        "pl_dd_Smed_v4_1131_0_1",
        "pl_dd_Smed_v4_1971_0_1",
    ]
)
orth_scores = np.array([0, 0, 1, 2, 2])

# connections without orthology scores
connections_plain = np.array([hydra_genes, plan_genes]).T

# connections with orthology scores
connections_orth = np.array([hydra_genes, plan_genes, orth_scores], dtype=object).T

Let’s have a look at what we feed into the function:

pd.DataFrame(connections_plain)

	0	1
0	hy_t13309aep	pl_dd_Smed_v4_361_0_1
1	hy_t14973aep	pl_dd_Smed_v4_361_0_1
2	hy_t10876aep	pl_dd_Smed_v4_1215_0_1
3	hy_t38670aep	pl_dd_Smed_v4_1131_0_1
4	hy_t19278aep	pl_dd_Smed_v4_1971_0_1

paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections_plain,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster=cluster_from,
    target_cluster=cluster_to,
    pad=False,
)

Gene expression can be very different between species, and plotting on the same scale can therefore be misleading. To address this, the function has a scale parameter that will normalize each row (gene) to a range of [0, 1]. This will make the plots more comparable, but it will also obscure the absolute expression levels. Use with caution.

Another thing to pay attention to: the scale function is not aware of log-transformed data, so if you applied log1p to your data it will calculate the an average-of-log instead of the log-of-average. This is a problem, and to avoid it please always use a layer that holds raw or normalised counts when using scale=True.

hydra.X.data = np.exp(hydra.X.data) - 1
hydra.layers["norm_counts"] = hydra.X.copy()
sc.pp.log1p(hydra)

planarian.X.data = np.exp(planarian.X.data) - 1
planarian.layers["norm_counts"] = planarian.X.copy()
sc.pp.log1p(planarian)

paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections_plain,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster=cluster_from,
    target_cluster=cluster_to,
    pad=False,
    scale=True,
    cmap="RdYlBu_r",
    layer="norm_counts",
)

Since we didn’t supply any weights for the connections, the function uses a weight of 0, and plots all connections we provided as dotted lines. Let’s try again, this time with weights:

WARNING - Gene name padding:

We used pad=False here because we used the gene IDs instead of gene symbols. Gene symbols tend to have the same length, and so pad doesn’t help here. In fact pad=True will throw an issue here, indicating possibly sloppy code. Unfortunately fixing this is not a priority.

pd.DataFrame(connections_orth)

	0	1	2
0	hy_t13309aep	pl_dd_Smed_v4_361_0_1	0
1	hy_t14973aep	pl_dd_Smed_v4_361_0_1	0
2	hy_t10876aep	pl_dd_Smed_v4_1215_0_1	1
3	hy_t38670aep	pl_dd_Smed_v4_1131_0_1	2
4	hy_t19278aep	pl_dd_Smed_v4_1971_0_1	2

paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections_orth,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster=cluster_from,
    target_cluster=cluster_to,
    pad=False,
    scale=True,
    cmap="RdYlBu_r",
    layer="norm_counts",
)

With weighting we can add an additional layer of information to the plot, such as orthology.

The next level would be to include the gene names, making this plot actually useful. For this to work reliably you need to have a column in the .var slot of your AnnData object that contains unique gene names. This is easily achieved by prepending the gene ID to the gene name/description.

The following code block is how I usually do this with EggNOG-mapper annotations; feel free to use whatever works for you.

Code

query = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/hydra.tsv", sep="\t", engine="python"
)

query = query[["Unnamed: 0", "5", "eggNOG_OGs", "21"]].copy()
query.columns = ["gene_id", "gene_name", "eggNOG_OGs", "description"]
query["gene_id"] = "hy_" + query["gene_id"].astype(str)
query.set_index("gene_id", inplace=True)
query["name"] = (
    query.index.astype(str)
    + " | "
    + query["gene_name"].fillna("")
    + " | "
    + query["description"].fillna("")
)

hydra.var["name"] = hydra.var.index
hydra.var["name"] = hydra.var["name"].replace(query["name"].to_dict())

hy_t33417aep = "hy_t33417aep | EPT1 | diacylglycerol cholinephosphotransferase activity"
assert hydra.var.loc["hy_t33417aep"]["name"] == hy_t33417aep

I have now created a column called name in the .var slot of the AnnData object. I can use this to plot the gene names instead of the gene IDs:

paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections_orth,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster=cluster_from,
    target_cluster=cluster_to,
    query_gene_names="name",
    scale=True,
    cmap="RdYlBu_r",
    layer="norm_counts",
)

The right side of the plot is left as an exercise to the reader :) Notice how the lines on the left side are not starting from the same spot. This is where the util.procrustes function tries to guess how to pad using space characters so that the plotted strings have approximately the same length and the lines don’t go over some longer gene names. It’s not perfect, but it’s better than nothing.

From here on I will hide the lines of code that stay the same to make the notebook more readable. If you are trying to reproduce the plots, make sure you copy the code from the blocks above and just add the new lines.

The other options of the function should be self-explanatory, but I’ll go over them anyway:

x_offset and y_offset control the canvas size beyond the size absolutely necessary for the plots to be rendered correctly. Increase x_offset to give the plots more x-axis space, and y_offset to give them more space on the vertical side. This is not an exact science, and you will find combinations that distort the plot to the point of making it unreadable.

paired_dotplot(
    query=hydra,  # |hide_line
    target=planarian,  # |hide_line
    connections=connections_orth,  # |hide_line
    query_clustering="Cluster",  # |hide_line
    target_clustering="cluster",  # |hide_line
    query_species="hy",  # |hide_line
    target_species="pl",  # |hide_line
    query_cluster=cluster_from,  # |hide_line
    target_cluster=cluster_to,  # |hide_line
    query_gene_names="name",  # |hide_line
    x_offset=21,
    y_offset=5,
    scale=True,
    cmap="RdYlBu_r",
)

The grid_offset parameter tries to control the space between the dotplots. Play with it if the lines connecting the genes are too short or too long. Try to stick to odd numbers, for some reason they work better. This gives you an additional lever for when increasing x_offset doesn’t work as intended.

I will also stop including the scale=True option, but will keep showing different color maps.

paired_dotplot(
    query=hydra,  # |hide_line
    target=planarian,  # |hide_line
    connections=connections_orth,  # |hide_line
    query_clustering="Cluster",  # |hide_line
    target_clustering="cluster",  # |hide_line
    query_species="hy",  # |hide_line
    target_species="pl",  # |hide_line
    query_cluster=cluster_from,  # |hide_line
    target_cluster=cluster_to,  # |hide_line
    query_gene_names="name",  # |hide_line
    grid_offset=131,
    x_offset=21,
    cmap="viridis"
)

paired_dotplot(
    query=hydra,  # |hide_line
    target=planarian,  # |hide_line
    connections=connections_orth,  # |hide_line
    query_clustering="Cluster",  # |hide_line
    target_clustering="cluster",  # |hide_line
    query_species="hy",  # |hide_line
    target_species="pl",  # |hide_line
    query_cluster=cluster_from,  # |hide_line
    target_cluster=cluster_to,  # |hide_line
    query_gene_names="name",  # |hide_line
    grid_offset=13,
    x_offset=21,
    cmap="cividis",
)

The parameters title and title_font_size do what you expect them to do.

paired_dotplot(
    query=hydra,  # |hide_line
    target=planarian,  # |hide_line
    connections=connections_orth,  # |hide_line
    query_clustering="Cluster",  # |hide_line
    target_clustering="cluster",  # |hide_line
    query_species="hy",  # |hide_line
    target_species="pl",  # |hide_line
    query_cluster=cluster_from,  # |hide_line
    target_cluster=cluster_to,  # |hide_line
    query_gene_names="name",  # |hide_line
    x_offset=31,
    y_offset=2,
    title="If my title is long and in big font size I need to use the function\nparameters to make the plot bigger and push the plots to the sides",
    title_font_size=40,
    grid_offset=71,
    cmap="plasma"
)

Finally, output controls where the plot is saved, and center=True will try to arrange the dotplots so that their centers on the y-axis are aligned. You probably want center=False if you need to print a title.

Real-life applications

When comparing cell types between species we are mostly interested in figuring out which cell types are homologous, and we want to find conserved gene expression patterns as evidence.

One way to get started with this is to look at the top marker genes of a cluster in one species and their homologs in the other animal.

For DEG detection I will use logistic regression, since I am interested in finding genes that are specifically expressed in my cluster of interest. I will take the "enEp_foot" cluster.

sc.tl.rank_genes_groups(hydra, "Cluster", method="logreg")
sc.tl.rank_genes_groups(planarian, "cluster", method="logreg")

For the sake of illustration let’s have a look at the top 30 genes:

top30 = hydra.uns["rank_genes_groups"]["names"]["ecEp_SC3"][:30]

sc.pl.dotplot(
    hydra,
    hydra.var["name"].loc[top30],
    gene_symbols="name",
    groupby="Cluster",
    cmap="magma_r",
    swap_axes=True,
    use_raw=False,
)

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_dotplot.py:168: FutureWarning:

The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_dotplot.py:178: FutureWarning:

The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_dotplot.py:747: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored

This might be too much information on the plot - many of the true marker genes are only expresed in the endodermal clusters anyway. I would like to collapse the other coarse clusters to make it more legible. I have a function for that:

util.collapse_unrelated_clusters(hydra, "ecEp_SC3", "Cluster", "coarse")

sc.pl.dotplot(
    hydra,
    hydra.var["name"].loc[top30],
    gene_symbols="name",
    groupby="Cluster_collapsed",
    cmap="magma_r",
    swap_axes=True,
    use_raw=False,
)

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_dotplot.py:168: FutureWarning:

The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_dotplot.py:178: FutureWarning:

The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

/opt/homebrew/Caskroom/mambaforge/base/envs/ascc24/lib/python3.9/site-packages/scanpy/plotting/_dotplot.py:747: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored

This is much more legible. Use this at your own risk though, as it may hide important information.

Now, let’s find the orthology information for these genes:

orthology = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "hypl_orthology.tsv", sep="\t", index_col=0
)

keep = np.intersect1d(
    top30, orthology.index
)  # for how many genes do we have orthology information?
available = orthology.loc[keep]  # subset to only include those
available_long = available.reset_index().melt(
    "gene_id"
)  # pivot the table to long format
available_long

	gene_id	variable	value
0	hy_t12447aep	pl_dd_Smed_v4_10001_0_1	0
1	hy_t12561aep	pl_dd_Smed_v4_10001_0_1	0
2	hy_t12614aep	pl_dd_Smed_v4_10001_0_1	0
3	hy_t12863aep	pl_dd_Smed_v4_10001_0_1	0
4	hy_t14554aep	pl_dd_Smed_v4_10001_0_1	0
...	...	...	...
189970	hy_t25822aep	pl_dd_Smed_v4_99_0_1	0
189971	hy_t27670aep	pl_dd_Smed_v4_99_0_1	0
189972	hy_t3105aep	pl_dd_Smed_v4_99_0_1	0
189973	hy_t3139aep	pl_dd_Smed_v4_99_0_1	0
189974	hy_t3141aep	pl_dd_Smed_v4_99_0_1	0

189975 rows × 3 columns

Now we can only keep the pairs with a score greater than zero:

connections = available_long[available_long["value"] > 0].to_numpy()

We will now also name the planarian genes so that the plot makes more sense:

Code

target = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/planarian.tsv", sep="\t", engine="python"
)

target = target[["Unnamed: 0", "5", "eggNOG_OGs", "21"]].copy()
target.columns = ["gene_id", "gene_name", "eggNOG_OGs", "description"]
target["gene_id"] = "pl_" + target["gene_id"].astype(str)
target.set_index("gene_id", inplace=True)
target["name"] = (
    target.index.astype(str)
    + " | "
    + target["gene_name"].fillna("")
    + " | "
    + target["description"].fillna("")
)

planarian.var["name"] = planarian.var.index
planarian.var["name"] = planarian.var["name"].replace(target["name"].to_dict())

paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections,
    query_clustering="Cluster_collapsed",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster="ecEp_SC3",
    target_cluster="Epidermal: 2",
    query_gene_names="name",
    target_gene_names="name",
    x_offset=15,
)

This approach will be sensitive but not very specific, as we will be plotting entire gene families on the target side. A more specific approach would be to look for genes that are markers on both sides and also homologous:

hydra_top100 = hydra.uns["rank_genes_groups"]["names"]["ecEp_SC3"][:100]
planarian_top100 = planarian.uns["rank_genes_groups"]["names"]["Epidermal: 2"][:100]

connections = genes.get_orthologs_overlap(
    hydra_top100, planarian_top100, hydra, planarian, orthology
)

connections

array([['hy_t10711aep', 'pl_dd_Smed_v4_299_0_1', '1'],
       ['hy_t10711aep', 'pl_dd_Smed_v4_414_0_1', '1'],
       ['hy_t22760aep', 'pl_dd_Smed_v4_433_0_1', '1'],
       ['hy_t22760aep', 'pl_dd_Smed_v4_7131_0_1', '1'],
       ['hy_t22760aep', 'pl_dd_Smed_v4_762_0_1', '1']], dtype='<U22')

paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster="ecEp_SC3",
    target_cluster="Epidermal: 2",
    query_gene_names="name",
    target_gene_names="name",
    pad=False,
    x_offset=-5,
)

I clearly chose the wrong example here, but I hope you can see the idea behind the method, and how that could be useful when analysing your own data. Hopefully by comparing two species that are somewhat closer related than Hydra and Schmidtea.

Or, since the methods I describe here are so flexible, you could use them to manually compile lists of marker genes and visualize those, to much greater effect than this half-baked attempt at finding needles in a haystack. You probably know your favorite species much better than I know Hydra and Schmidtea, after all!

Or, and hear me out here, cnidarians might have a fundamentally different way of building their cell types compared to other metazoans, and so SAMap can’t find bonafide homologs. If we do all the pairwise comparisons and can’t find compelling evidence for homology, we can start to think about alternatives.

DISCLAIMER

Yes, I know that cnidarians and the rest of the metazoans had a common origin and that there is compelling evidence that many of their cell types are probably, in fact, homologous. I also know your in-situ data probably shows stuff that slightly disagrees with the single-cell data. I am just using hyperbole to make a point. Please don’t sue me.