# util

``` python
too_short = "Niko"
just_right = "Theseus"
too_tall = "The Mountain"
```

## procrustes
``` python
def procrustes(
    x: str,                        # input string
    appropriate_length: int = 50,  # desired length
    pad_with: str = ' ',           # character to pad with
    side: str = 'right',           # which side to pad on ("left", "right")
) -> str:                          # string with desired length
```
A function to regulate string length.
We will primarily be working with non-model species, so gene names will usually be of the form

    XLOC_123456 | emapper-name-or-description-if-we're-lucky

or something similar. This means that the actual length of a gene "name" can vary wildly, which makes it very hard to put gene names on axes without distorting figure sizes. I wrote a function to either trim or pad strings; even though axis labels are not in monospace fonts, it is much easier to visually reconcile strings whose lengths are in the same order of magnitude.
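Since the function body is not shown here, a minimal sketch of the trim-or-pad logic (a hypothetical `procrustes_sketch`, not the package's actual implementation) might look like:

``` python
def procrustes_sketch(x, appropriate_length=50, pad_with=' ', side='right'):
    """Trim or pad `x` so it is exactly `appropriate_length` characters long."""
    x = x[:appropriate_length]  # trim if too long
    if side == 'right':
        return x.ljust(appropriate_length, pad_with)  # pad on the right
    return x.rjust(appropriate_length, pad_with)      # pad on the left
```

`str.ljust`/`str.rjust` are no-ops for strings already at the target length, so the three cases (too short, just right, too long) fall out of two lines.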
``` python
assert procrustes(just_right, appropriate_length=7) == "Theseus"
assert procrustes(too_short, appropriate_length=7) == "Niko   "
assert procrustes(too_tall, appropriate_length=7) == "The Mou"
```

## grouped_obs_mean
``` python
def grouped_obs_mean(
    adata: AnnData,          # AnnData object to analyse
    group_key: str,          # `.obs` category to group by
    layer: Optional = None,  # layer to use. If None, use `.X`
) -> DataFrame:              # a genes$\times$groups dataframe with the average expression
```
Helper function to calculate average expression per group in an AnnData object.
Many tasks in single-cell analysis require the average expression of a gene in a certain group of cells. While scanpy performs this computation behind the scenes for, e.g., dotplots, the functionality is not exposed to users. This is an implementation based on ivirshup's answer to a scanpy issue.
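The idea behind that answer can be sketched with plain numpy/pandas, leaving out sparse-matrix handling and the `layer` argument; `grouped_mean_sketch` is a hypothetical stand-in, not the package code:

``` python
import numpy as np
import pandas as pd

def grouped_mean_sketch(X, groups, var_names):
    # X: cells x genes matrix; groups: one label per cell
    out = {g: X[np.asarray(groups) == g].mean(axis=0)
           for g in pd.unique(np.asarray(groups))}
    # genes as rows, groups as columns
    return pd.DataFrame(out, index=var_names)
```

Each group's rows are selected with a boolean mask and averaged column-wise, so the result has one column per group and one row per gene.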
``` python
adata = sc.read_h5ad(os.environ["EXAMPLE_DATA_PATH"] + "hydra.h5ad")
cluster_means = grouped_obs_mean(adata, group_key="Cluster")
```

If \(G\) is the number of genes and \(C\) the number of unique clusters in the `group_key`, the returned array should have the shape \(G \times C\):
``` python
no_genes = adata.shape[1]
no_clusters = len(np.unique(adata.obs["Cluster"]))
assert cluster_means.shape == (no_genes, no_clusters)
```

Additionally, each column of the array should contain the average detected expression for cells in that cluster:
``` python
belong_to_ecEp_SC2 = adata.obs["Cluster"] == "ecEp_SC2"
ecEp_SC2_average = np.mean(adata[belong_to_ecEp_SC2].X, axis=0)
ecEp_SC2_average = np.array(ecEp_SC2_average)[0]
assert all(np.isclose(cluster_means["ecEp_SC2"], ecEp_SC2_average))
```

## grouped_obs_present
``` python
def grouped_obs_present(
    adata,                   # AnnData object to analyse.
    group_key,               # `.obs` category to group by.
    layer: Optional = None,  # Layer to use. If None, use `.X`.
):                           # A genes$\times$clusters dataframe with the number of expressing cells per cluster.
```
Helper function to calculate how many cells express each gene per group in an AnnData object.
Another critical value for dotplots is the fraction of cells expressing a gene in a certain cluster. Again, scanpy performs this task without exposing it to users. Similar to [grouped_obs_mean](https://galicae.github.io/comandos/util.html#grouped_obs_mean), this is an implementation based on ivirshup's answer to a scanpy issue. Here we calculate the number of cells expressing each gene, a table we can use to calculate the fraction later.
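The counting step can be sketched the same way as the mean, replacing the column-wise average with a count of non-zero entries (again a hypothetical stand-in, ignoring sparse matrices and the `layer` argument):

``` python
import numpy as np
import pandas as pd

def grouped_present_sketch(X, groups, var_names):
    # count cells with non-zero expression per group
    out = {g: (X[np.asarray(groups) == g] > 0).sum(axis=0)
           for g in pd.unique(np.asarray(groups))}
    return pd.DataFrame(out, index=var_names)
```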
``` python
num_expressing = grouped_obs_present(adata, group_key="Cluster")
```

If \(G\) is the number of genes and \(C\) the number of unique clusters in the `group_key`, the returned array should have the shape \(G \times C\):

``` python
assert num_expressing.shape == (no_genes, no_clusters)
```

Additionally, each column of the array should contain the number of cells expressing each gene in that cluster:
``` python
belong_to_ecEp_SC2 = adata.obs["Cluster"] == "ecEp_SC2"
ecEp_SC2_expr = np.sum(adata[belong_to_ecEp_SC2].X > 0, axis=0)
ecEp_SC2_expr = np.array(ecEp_SC2_expr)[0]
assert all(num_expressing["ecEp_SC2"] == ecEp_SC2_expr)
```

## grouped_obs_percent
``` python
def grouped_obs_percent(
    adata,                   # AnnData object to analyse.
    group_key,               # `.obs` category to group by.
    layer: Optional = None,  # Layer to use. If None, use `.X`.
):                           # A genes$\times$clusters dataframe with the fraction of expressing cells per cluster.
```
Helper function to calculate what percentage of cells express each gene per group in an AnnData object.
Calculating the fraction of expressing cells is straightforward once we have counted both the number of cells that express each gene and the total number of cells per cluster.
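Assuming a genes-by-clusters count table like the one `grouped_obs_present` returns, the division can be sketched as (hypothetical `grouped_percent_sketch`, relying on pandas aligning the Series of group sizes with the DataFrame's columns):

``` python
import pandas as pd

def grouped_percent_sketch(num_expressing, groups):
    # divide each group's counts by that group's number of cells
    sizes = pd.Series(groups).value_counts()
    return num_expressing / sizes
```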
``` python
frac_expressing = grouped_obs_percent(adata, group_key="Cluster")
# use the counts and number of cells we calculated before
frac_ecEp_SC2 = ecEp_SC2_expr / np.sum(belong_to_ecEp_SC2)
assert all(np.isclose(frac_expressing["ecEp_SC2"], frac_ecEp_SC2))
```

When visualising gene expression in cross-species single-cell RNA-seq comparisons, absolute expression values are meaningless, as too many factors we cannot account for determine them. It is often more informative to look at the relative expression of a gene within the organism, i.e. to look at it in terms of its min/max expression.
Scanpy does not have a provision for this, so here is a small utility function that does that:
## rescale

``` python
def rescale(
    x: DataFrame,  # A two-dimensional array to rescale; preferably the output of [`grouped_obs_mean`](https://galicae.github.io/comandos/util.html#grouped_obs_mean).
) -> DataFrame:    # the rescaled dataframe.
```

Rescale a dataframe so that row values are in the range [0, 1].
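A row-wise min/max rescaling can be sketched in a few lines of pandas (hypothetical `rescale_sketch`; constant rows, which would divide by zero, are not handled here):

``` python
import pandas as pd

def rescale_sketch(df):
    # shift each row to start at 0, then divide by its range
    mins = df.min(axis=1)
    maxs = df.max(axis=1)
    return df.sub(mins, axis=0).div(maxs - mins, axis=0)
```

After this, each row's minimum maps to 0 and its maximum to 1, making expression patterns comparable across genes and species.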
## Highlighting clusters
Dimensionality reduction plots can often be rather busy, and searching for the correct cluster can be a bit of a hassle. It would be great if we could highlight the cluster of interest without losing the rest of the clustering information; for instance by drawing a circle around the cluster to highlight.
## find_center

``` python
def find_center(
    coords,  # A 2D array of X, Y coordinates.
):           # The X, Y coordinates of the mode.
```

A function that estimates a Gaussian probability density for the input data and returns its mode. From https://stackoverflow.com/a/60185876.
A heuristic to achieve this is to pretend the cluster points are a Gaussian cloud in UMAP/tSNE/PCA/<your favorite embedding> space and take the position with the highest density (the mode of the 2D distribution). This function is inspired by a StackOverflow answer and is mostly a wrapper around the Gaussian KDE function from scikit-learn.
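As an illustration of the density-mode heuristic, here is a sketch using scipy's `gaussian_kde` rather than scikit-learn, with a hypothetical `find_center_sketch` standing in for the real function:

``` python
import numpy as np
from scipy.stats import gaussian_kde

def find_center_sketch(coords):
    # fit a KDE to the points and return the observed point with highest density
    kde = gaussian_kde(coords.T)  # gaussian_kde expects (n_dims, n_points)
    densities = kde(coords.T)     # density evaluated at each observed point
    return coords[np.argmax(densities)]
```

Evaluating the density only at the observed points keeps the sketch simple; a grid search would locate the mode more precisely, but for placing a label on a plot this is accurate enough.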
``` python
# create a random number generator
rng = np.random.default_rng(42)
# create a 2D Gaussian dataset
x = rng.normal(loc=2, scale=1, size=2000)
y = rng.normal(loc=-2, scale=1, size=2000)
coords = np.array([x, y]).T
coords_center = find_center(coords)
```

We'd expect the mode of the kernel density estimate to be very close to the true mean of the data. Since this is for plotting purposes we don't need it to be extremely precise. See plot:highlighted_dimplot for a demonstration of the function.

``` python
assert np.isclose(coords_center[0], 2, rtol=0.2)
assert np.isclose(coords_center[1], -2, rtol=0.2)
```

## map_fine_to_coarse
``` python
def map_fine_to_coarse(
    sm,                            # SAMAP object to process.
    species,                       # Species ID of the correct SAM object.
    fine,                          # Fine clustering slot name.
    coarse: NoneType = None,       # Coarse clustering slot name. If None, use the same as `fine`, mapping each cluster to itself. (default: `None`).
    plot: function = umap,         # Plotting function to use; this will correctly set the colors (default: `sc.pl.umap`).
    include_coarse: bool = False,  # If True, preface the fine cluster names with the coarse cluster names (default: `False`).
):                                 # A dataframe with the mapping of fine to coarse clusters.
```
Extract the mapping of fine to coarse clusters from a SAMap object.
This function collapses unrelated clusters by identifying a major cluster category based on the `coarse` column.