plot

Collection of plotting functions.

See the big picture

These functions are intended to help you find your way around the data. I use them a lot when generating reports for collaborators, to set the stage for the more detailed analyses. In particular, I want to highlight which clusters we are comparing and where they are in the context of both species as well as the comparison.


source

highlighted_dimplot


def highlighted_dimplot(
    adata:AnnData, # AnnData object to plot.
    species:str, # Species name. Will be used in the title, and removed from the cluster names if present.
    clustering:str, # Clustering to plot. Must be present in `adata.obs`.
    cluster:str, # Cluster to highlight.
    embedding:str='X_umap', # Embedding to plot (default: "X_umap").
    highlight:str='red', # Color of the circle (default: "red").
    figsize:tuple=(10, 10), # Figure size (default: (10, 10)).
    save:Optional=None, # Path to save the figure (default: None).
):

Plot a low-dimensional embedding and highlight a chosen cluster with a superimposed circle.

planarian = sc.read_h5ad(os.environ["EXAMPLE_DATA_PATH"] + "planarian.h5ad")
planarian.var.index = "pl_" + planarian.var.index
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 1
----> 1 planarian = sc.read_h5ad(os.environ["EXAMPLE_DATA_PATH"] + "planarian.h5ad")
      2 planarian.var.index = "pl_" + planarian.var.index

File /opt/homebrew/Caskroom/miniforge/base/envs/comandos_dev/lib/python3.14/site-packages/anndata/_io/h5ad.py:263, in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    257         raise NotImplementedError(msg)
    259 rdasp = partial(
    260     read_dense_as_sparse, sparse_format=as_sparse_fmt, axis_chunk=chunk_size
    261 )
--> 263 with h5py.File(filename, "r") as f:
    265     def callback(read_func, elem_name: str, elem: StorageType, iospec: IOSpec):
    266         if iospec.encoding_type == "anndata" or elem_name.endswith("/"):

File /opt/homebrew/Caskroom/miniforge/base/envs/comandos_dev/lib/python3.14/site-packages/h5py/_hl/files.py:566, in File.__init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, meta_block_size, track_times, **kwds)
    557     fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0,
    558                      locking, page_buf_size, min_meta_keep, min_raw_keep,
    559                      alignment_threshold=alignment_threshold,
    560                      alignment_interval=alignment_interval,
    561                      meta_block_size=meta_block_size,
    562                      **kwds)
    563     fcpl = make_fcpl(track_order=track_order, track_times=track_times,
    564                      fs_strategy=fs_strategy, fs_persist=fs_persist,
    565                      fs_threshold=fs_threshold, fs_page_size=fs_page_size)
--> 566     fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
    568 if isinstance(libver, tuple):
    569     self._libver = libver

File /opt/homebrew/Caskroom/miniforge/base/envs/comandos_dev/lib/python3.14/site-packages/h5py/_hl/files.py:241, in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    239     if swmr and swmr_support:
    240         flags |= h5f.ACC_SWMR_READ
--> 241     fid = h5f.open(name, flags, fapl=fapl)
    242 elif mode == 'r+':
    243     fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File h5py/h5f.pyx:104, in h5py.h5f.open()

FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = '/Users/npapadop/Documents/repos/comandos/example_data/planarian.h5ad', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
highlighted_dimplot(planarian, "Smed", "tissue", "Pharynx", figsize=(6, 6))

Of course, if a cluster is too spread out, the plot will not work as well, but it will still be useful to see the general location of the cluster.

highlighted_dimplot(planarian, "Smed", "tissue", "Neural", figsize=(6, 6))

Another case that might not work so well is if the shape of a cluster is not Gaussian. Still, it should help us spot the cluster, especially if the color contrasts enough.

highlighted_dimplot(
    planarian, "Smed", "tissue", "Muscle", figsize=(6, 6), highlight="lightgreen"
)

source

highlighted_heatmap


def highlighted_heatmap(
    to_plot, # A dataframe of pairwise similarities between cell types.
    celltype_from, # Cell type of the query species to highlight. Must be in `to_plot.columns`.
    celltype_to, # Cell type of the target species to highlight. Must be in `to_plot.index`.
    figheight:int=20, # Height of the resulting plot in inches. Width will be calculated automatically (default:
20).
    save:NoneType=None, # Path to result figure; if None, the figure will be plotted but not saved (default: None).
):

Plot a heatmap of pairwise similarities between cell types, with a red box highlighting the query cell type.

hysc = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "hysc_similarity_table.csv",
    index_col=0,
)
highlighted_heatmap(
    hysc, celltype_from="hy_i_SC", celltype_to="sc_Neoblast: 0", figheight=12
)

source

annotated_heatmap


def annotated_heatmap(
    sm:SAMAP, # SAMAP object
    similarity:DataFrame, # Similarity matrix. Contains query species clusters as columns and target species clusters as rows.
    query_species:str, # Query species ID. Will be used in the title. Should prepend the similarity matrix column names.
    target_species:str, # Target species ID. Will be used in the title. Should prepend the similarity matrix row names.
    query_clustering:str, # Query species clustering. Must be present in `sm.sams[query_species].adata.obs`.
    target_clustering:str, # Target species clustering. Must be present in `sm.sams[target_species].adata.obs`.
    query_coarse:Optional=None,
    target_coarse:Optional=None, # Query species coarse clustering. Must be present in `sm.sams[query_species].adata.obs`. If None, will be set to `query_clustering` (default: None).
    interactive:bool=False, # If True, will return a plotly figure. Otherwise, will return a matplotlib figure (default: False).
    figsize:Optional=None,
    save:Optional=None, # Figure size. If None, will be guessed from the size of the similarity matrix (default: None).
    kwargs:Any
)->Optional: # If not None, will save the figure to the specified path (default: None).
Additional arguments to pass to `seaborn.heatmap` (matplotlib) or `plotly.graph_objects.Figure` (plotly). Among them: dpi (int), which is only used if `interactive=True` to set the figure size in pixels.

Plot the similarity matrix as an annotated heatmap.

Read the requisite files: the SAMap object and the pairwise cluster similarity matrix.

file = open(os.environ["EXAMPLE_DATA_PATH"] + "hypl.pkl", "rb")
sm = pickle.load(file)

hypl = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "hypl_similarity_table.csv",
    index_col=0,
)

First improve the coarse level assignments for the data we have here:

sm.sams["hy"].adata.obs["coarse"] = (
    sm.sams["hy"].adata.obs["Cluster"].str.split("_").str[0]
)
sm.sams["pl"].adata.obs["coarse"] = (
    sm.sams["pl"].adata.obs["tissue_smedwi"].str.split("_").str[0]
)
annotated_heatmap(
    sm,
    hypl,
    "hy",
    "pl",
    query_clustering="Cluster",
    target_clustering="cluster",
    query_coarse="coarse",
    target_coarse="coarse",
    interactive=False,
    save="hypl.pdf",
)

This sort of plot does a much better job of showing the relationships between clusters than the Sankey diagram. Clear patches of similarity emerge that would be otherwise invisible. This is a result of the hierarchical relationship between cell types (see Arendt/Musser/Wagner in 2016 and 2019). By combining maps like this with the per-species cell type family trees we can start to get a better idea of how cell types may have evolved over time.

assert os.path.exists("hypl.pdf")

We can make the same plot in interactive form. Unfortunately, the interactive version is a bit more finicky with color usage, so for now I’ll let plotly choose the colors, but any plotly wizards out there are welcome to improve this. Right now hovering over a point will show the cluster names and SAMap score; hovering over the colored bars at the margins should show the coarse cluster.

plotly_overview = annotated_heatmap(
    sm,
    hypl,
    "hy",
    "pl",
    query_clustering="Cluster",
    target_clustering="cluster",
    query_coarse="coarse",
    target_coarse="coarse",
    figsize=(10, 10),
    interactive=True,
    dpi=100,
    save="hypl",
)
assert os.path.exists("hypl.json")
assert os.path.exists("hypl.html")

You can visualize the plot in a notebook with:

# not run
plotly_overview.show()

Pairwise comparisons

When comparing two clusters, it is most informative to have a look at the genes they are expressing. If the clusters are in the same species we can use a simple dotplot that combines the two clusters’ marker genes; things are not so simple when comparing between species. By plotting two dotplots side by side, and connecting homologous genes across the dotplots, we can get a better sense of how the clusters are connected.

A lot of the heavy lifting for this idea was already done by the scanpy team, who write excellent, well-documented code. I’ve adapted their code to produce the paired dotplots, fiddled with it to make sure dot size and color are consistent across plots, and added the homology lines. The latter part is rather tricky, and I would be thankful to anyone who can improve it; currently it involves a lot of heuristics to translate the coordinates from one side of the plot to the other.


source

paired_dotplot


def paired_dotplot(
    query:AnnData, # query species AnnData object
    target:AnnData, # target species AnnData object
    connections:array, # array of connected genes. Each row has at least two columns containing the query species gene and corresponding target species gene, and optionally their connection strength. Genes are allowed to be repeated on both sides.
    query_clustering:str, # `.obs` column in the query AnnData object containing the query species clustering.
    target_clustering:str, # `.obs` column in the target AnnData object containing the target species clustering.
    query_species:str, # query species name. Will only be used in the title, so does not have to conform with the query species ID in the similarity matrix/SAMap object.
    target_species:str, # target species name. Will only be used in the title, so does not have to conform with the target species ID in the similarity matrix/SAMap object.
    query_cluster:Optional=None,
    target_cluster:Optional=None, # the cell type/cluster of the query species that is being compared (default: None).
    pad:bool=True, # whether to pad the gene names with spaces to make them all of a similar length (default: True).
    x_offset:float=1, # Number of inches to add to the horizontal size of the canvas (default: 1).
    y_offset:float=0, # Number of inches to add to the vertical size of the canvas (default: 0).
    grid_offset:float=30, # Grid segments to add between the two dotplots. Might be useful if the gene names are not legible/lines overlap (default: 30).
    query_gene_names:Optional=None,
    target_gene_names:Optional=None, # `.var` column that holds unique gene names for the query species (default: None).
    output:str='./paired_dotplot.png', # path to save the plot to (default: "./paired_dotplot.png").
    center:bool=True, # whether to center the dotplot (default: True).
    title:Optional=None, # overall title of the plot (default: None).
    title_font_size:float=16, # font size of the overall plot title (default: 16).
    scale:bool=False, # whether to scale the expression values to be between 0 and 1 for each gene (default: False).
    cmap:Colormap='magma_r', # colormap to use for the dotplot (default: "viridis").
    layer:Optional=None, # layer : Union[str, None], optional The layer to use for the average expression calculation. If not specified, it will use the `.X` slot of the `AnnData` objects. It is vital to set this correctly to avoid calculating average expression on log1p-transformed data (default: None).
)->None:

Recall that scanpy dotplots need three inputs: the AnnData object, the names of genes to plot, and the name of the clustering. For paired dotplots we need this information for both sides of the plot.

However, genes are input via the connections slot; an array where each row contains two genes and (optionally) the strength of their connection. The function will reorder the genes in order to plot the densely connected gene groups first, and will also color-code each group to make reading the plot easier.

Additionally, since plots like this may become busy very fast, the function also accepts the names of the two clusters to be compared; it will then highlight the cluster names in the plot.

We are going to prepare the inputs first:

hydra = sm.sams["hy"].adata  # the query dataset
planarian = sm.sams["pl"].adata  # the target dataset
planarian.raw.var.index = "pl_" + planarian.raw.var.index

# the .obs column that contains the cluster labels
query_clustering = "Cluster"
target_clustering = "cluster"

# the species IDs
query_species = "hy"
target_species = "pl"

# the cluster labels to highlight
cluster_from = "ecEp_SC2"
cluster_to = "Epidermal: 2"

The connections option is provided in this form deliberately, in order to allow the user to decide which genes to connect and how. This may cost a little extra effort for the most common cases, but it means that the function is more flexible.

The function currently supports three types of connections, corresponding to connection strength of [0, 1, 2], accordingly. It’s up to the user to assign meaning to the levels. The linestyles that correspond to these are

  • dotted for 0 (linestyle = (0, (1, 5)))
  • dashed for 1 (linestyle = (0, (5, 5)))
  • solid for 2

In the future I would like to let the user specify the linestyles, but for now this is what we have.

Click here for some possible applications
  • when comparing a dataset to itself, we could connect a gene to all its paralogs and visualize patterns of paralog substitution in cell type complement of an organism.
hydra_genes = np.array(
    ["hy_t13309aep", "hy_t14973aep", "hy_t10876aep", "hy_t38670aep", "hy_t19278aep"]
)
plan_genes = np.array(
    [
        "pl_dd_Smed_v4_361_0_1",
        "pl_dd_Smed_v4_361_0_1",
        "pl_dd_Smed_v4_1215_0_1",
        "pl_dd_Smed_v4_1131_0_1",
        "pl_dd_Smed_v4_1971_0_1",
    ]
)
orth_scores = np.array([0, 0, 1, 2, 2])

# connections without orthology scores
connections_plain = np.array([hydra_genes, plan_genes]).T

# connections with orthology scores
connections_orth = np.array([hydra_genes, plan_genes, orth_scores], dtype=object).T

Let’s have a look at what we feed into the function:

pd.DataFrame(connections_plain)
paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections_plain,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster=cluster_from,
    target_cluster=cluster_to,
    pad=False,
)

Gene expression can be very different between species, and plotting on the same scale can therefore be misleading. To address this, the function has a scale parameter that will normalize each row (gene) to a range of [0, 1]. This will make the plots more comparable, but it will also obscure the absolute expression levels. Use with caution.

Another thing to pay attention to: the scale function is not aware of log-transformed data, so if you applied log1p to your data it will calculate the an average-of-log instead of the log-of-average. This is a problem, and to avoid it please always use a layer that holds raw or normalised counts when using scale=True.

hydra.X.data = np.exp(hydra.X.data) - 1
hydra.layers["norm_counts"] = hydra.X.copy()
sc.pp.log1p(hydra)

planarian.X.data = np.exp(planarian.X.data) - 1
planarian.layers["norm_counts"] = planarian.X.copy()
sc.pp.log1p(planarian)
paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections_plain,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster=cluster_from,
    target_cluster=cluster_to,
    pad=False,
    scale=True,
    cmap="RdYlBu_r",
    layer="norm_counts",
)

Since we didn’t supply any weights for the connections, the function uses a weight of 0, and plots all connections we provided as dotted lines. Let’s try again, this time with weights:

WARNING - Gene name padding:

We used pad=False here because we used the gene IDs instead of gene symbols. Gene symbols tend to have the same length, and so pad doesn’t help here. In fact pad=True will throw an issue here, indicating possibly sloppy code. Unfortunately fixing this is not a priority.

pd.DataFrame(connections_orth)
paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections_orth,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster=cluster_from,
    target_cluster=cluster_to,
    pad=False,
    scale=True,
    cmap="RdYlBu_r",
    layer="norm_counts",
)

With weighting we can add an additional layer of information to the plot, such as orthology.

The next level would be to include the gene names, making this plot actually useful. For this to work reliably you need to have a column in the .var slot of your AnnData object that contains unique gene names. This is easily achieved by prepending the gene ID to the gene name/description.

The following code block is how I usually do this with EggNOG-mapper annotations; feel free to use whatever works for you.

Code
query = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/hydra.tsv", sep="\t", engine="python"
)

query = query[["Unnamed: 0", "5", "eggNOG_OGs", "21"]].copy()
query.columns = ["gene_id", "gene_name", "eggNOG_OGs", "description"]
query["gene_id"] = "hy_" + query["gene_id"].astype(str)
query.set_index("gene_id", inplace=True)
query["name"] = (
    query.index.astype(str)
    + " | "
    + query["gene_name"].fillna("")
    + " | "
    + query["description"].fillna("")
)

hydra.var["name"] = hydra.var.index
hydra.var["name"] = hydra.var["name"].replace(query["name"].to_dict())

hy_t33417aep = "hy_t33417aep | EPT1 | diacylglycerol cholinephosphotransferase activity"
assert hydra.var.loc["hy_t33417aep"]["name"] == hy_t33417aep

I have now created a column called name in the .var slot of the AnnData object. I can use this to plot the gene names instead of the gene IDs:

paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections_orth,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster=cluster_from,
    target_cluster=cluster_to,
    query_gene_names="name",
    scale=True,
    cmap="RdYlBu_r",
    layer="norm_counts",
)

The right side of the plot is left as an exercise to the reader :) Notice how the lines on the left side are not starting from the same spot. This is where the util.procrustes function tries to guess how to pad using space characters so that the plotted strings have approximately the same length and the lines don’t go over some longer gene names. It’s not perfect, but it’s better than nothing.

From here on I will hide the lines of code that stay the same to make the notebook more readable. If you are trying to reproduce the plots, make sure you copy the code from the blocks above and just add the new lines.

The other options of the function should be self-explanatory, but I’ll go over them anyway:

  • x_offset and y_offset control the canvas size beyond the size absolutely necessary for the plots to be rendered correctly. Increase x_offset to give the plots more x-axis space, and y_offset to give them more space on the vertical side. This is not an exact science, and you will find combinations that distort the plot to the point of making it unreadable.
paired_dotplot(
    query=hydra,  # |hide_line
    target=planarian,  # |hide_line
    connections=connections_orth,  # |hide_line
    query_clustering="Cluster",  # |hide_line
    target_clustering="cluster",  # |hide_line
    query_species="hy",  # |hide_line
    target_species="pl",  # |hide_line
    query_cluster=cluster_from,  # |hide_line
    target_cluster=cluster_to,  # |hide_line
    query_gene_names="name",  # |hide_line
    x_offset=21,
    y_offset=5,
    scale=True,
    cmap="RdYlBu_r",
)

The grid_offset parameter tries to control the space between the dotplots. Play with it if the lines connecting the genes are too short or too long. Try to stick to odd numbers, for some reason they work better. This gives you an additional lever for when increasing x_offset doesn’t work as intended.

I will also stop including the scale=True option, but will keep showing different color maps.

paired_dotplot(
    query=hydra,  # |hide_line
    target=planarian,  # |hide_line
    connections=connections_orth,  # |hide_line
    query_clustering="Cluster",  # |hide_line
    target_clustering="cluster",  # |hide_line
    query_species="hy",  # |hide_line
    target_species="pl",  # |hide_line
    query_cluster=cluster_from,  # |hide_line
    target_cluster=cluster_to,  # |hide_line
    query_gene_names="name",  # |hide_line
    grid_offset=131,
    x_offset=21,
    cmap="viridis"
)
paired_dotplot(
    query=hydra,  # |hide_line
    target=planarian,  # |hide_line
    connections=connections_orth,  # |hide_line
    query_clustering="Cluster",  # |hide_line
    target_clustering="cluster",  # |hide_line
    query_species="hy",  # |hide_line
    target_species="pl",  # |hide_line
    query_cluster=cluster_from,  # |hide_line
    target_cluster=cluster_to,  # |hide_line
    query_gene_names="name",  # |hide_line
    grid_offset=13,
    x_offset=21,
    cmap="cividis",
)

The parameters title and title_font_size do what you expect them to do.

paired_dotplot(
    query=hydra,  # |hide_line
    target=planarian,  # |hide_line
    connections=connections_orth,  # |hide_line
    query_clustering="Cluster",  # |hide_line
    target_clustering="cluster",  # |hide_line
    query_species="hy",  # |hide_line
    target_species="pl",  # |hide_line
    query_cluster=cluster_from,  # |hide_line
    target_cluster=cluster_to,  # |hide_line
    query_gene_names="name",  # |hide_line
    x_offset=31,
    y_offset=2,
    title="If my title is long and in big font size I need to use the function\nparameters to make the plot bigger and push the plots to the sides",
    title_font_size=40,
    grid_offset=71,
    cmap="plasma"
)

Finally, output controls where the plot is saved, and center=True will try to arrange the dotplots so that their centers on the y-axis are aligned. You probably want center=False if you need to print a title.

Real-life applications

When comparing cell types between species we are mostly interested in figuring out which cell types are homologous, and we want to find conserved gene expression patterns as evidence.

One way to get started with this is to look at the top marker genes of a cluster in one species and their homologs in the other animal.

For DEG detection I will use logistic regression, since I am interested in finding genes that are specifically expressed in my cluster of interest. I will take the "enEp_foot" cluster.

sc.tl.rank_genes_groups(hydra, "Cluster", method="logreg")
sc.tl.rank_genes_groups(planarian, "cluster", method="logreg")

For the sake of illustration let’s have a look at the top 30 genes:

top30 = hydra.uns["rank_genes_groups"]["names"]["ecEp_SC3"][:30]
sc.pl.dotplot(
    hydra,
    hydra.var["name"].loc[top30],
    gene_symbols="name",
    groupby="Cluster",
    cmap="magma_r",
    swap_axes=True,
    use_raw=False,
)

This might be too much information on the plot - many of the true marker genes are only expresed in the endodermal clusters anyway. I would like to collapse the other coarse clusters to make it more legible. I have a function for that:

util.collapse_unrelated_clusters(hydra, "ecEp_SC3", "Cluster", "coarse")
sc.pl.dotplot(
    hydra,
    hydra.var["name"].loc[top30],
    gene_symbols="name",
    groupby="Cluster_collapsed",
    cmap="magma_r",
    swap_axes=True,
    use_raw=False,
)

This is much more legible. Use this at your own risk though, as it may hide important information.

Now, let’s find the orthology information for these genes:

orthology = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "hypl_orthology.tsv", sep="\t", index_col=0
)
keep = np.intersect1d(
    top30, orthology.index
)  # for how many genes do we have orthology information?
available = orthology.loc[keep]  # subset to only include those
available_long = available.reset_index().melt(
    "gene_id"
)  # pivot the table to long format
available_long

Now we can only keep the pairs with a score greater than zero:

connections = available_long[available_long["value"] > 0].to_numpy()

We will now also name the planarian genes so that the plot makes more sense:

Code
target = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/planarian.tsv", sep="\t", engine="python"
)

target = target[["Unnamed: 0", "5", "eggNOG_OGs", "21"]].copy()
target.columns = ["gene_id", "gene_name", "eggNOG_OGs", "description"]
target["gene_id"] = "pl_" + target["gene_id"].astype(str)
target.set_index("gene_id", inplace=True)
target["name"] = (
    target.index.astype(str)
    + " | "
    + target["gene_name"].fillna("")
    + " | "
    + target["description"].fillna("")
)

planarian.var["name"] = planarian.var.index
planarian.var["name"] = planarian.var["name"].replace(target["name"].to_dict())
paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections,
    query_clustering="Cluster_collapsed",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster="ecEp_SC3",
    target_cluster="Epidermal: 2",
    query_gene_names="name",
    target_gene_names="name",
    x_offset=15,
)

This approach will be sensitive but not very specific, as we will be plotting entire gene families on the target side. A more specific approach would be to look for genes that are markers on both sides and also homologous:

hydra_top100 = hydra.uns["rank_genes_groups"]["names"]["ecEp_SC3"][:100]
planarian_top100 = planarian.uns["rank_genes_groups"]["names"]["Epidermal: 2"][:100]
connections = genes.get_orthologs_overlap(
    hydra_top100, planarian_top100, hydra, planarian, orthology
)
connections
paired_dotplot(
    query=hydra,
    target=planarian,
    connections=connections,
    query_clustering="Cluster",
    target_clustering="cluster",
    query_species="hy",
    target_species="pl",
    query_cluster="ecEp_SC3",
    target_cluster="Epidermal: 2",
    query_gene_names="name",
    target_gene_names="name",
    pad=False,
    x_offset=-5,
)

I clearly chose the wrong example here, but I hope you can see the idea behind the method, and how that could be useful when analysing your own data. Hopefully by comparing two species that are somewhat closer related than Hydra and Schmidtea.

Or, since the methods I describe here are so flexible, you could use them to manually compile lists of marker genes and visualize those, to much greater effect than this half-baked attempt at finding needles in a haystack. You probably know your favorite species much better than I know Hydra and Schmidtea, after all!

Or, and hear me out here, cnidarians might have a fundamentally different way of building their cell types compared to other metazoans, and so SAMap can’t find bonafide homologs. If we do all the pairwise comparisons and can’t find compelling evidence for homology, we can start to think about alternatives.

DISCLAIMER

Yes, I know that cnidarians and the rest of the metazoans had a common origin and that there is compelling evidence that many of their cell types are probably, in fact, homologous. I also know your in-situ data probably shows stuff that slightly disagrees with the single-cell data. I am just using hyperbole to make a point. Please don’t sue me.