util

Collection of helper functions to facilitate plotting

source

procrustes

 procrustes (x:str, appropriate_length:int=50, pad_with:str=' ',
             side:str='right')

A function to regulate string length.

Type Default Details
x str input string
appropriate_length int 50 desired length
pad_with str character to pad with
side str right which side to pad on (“left”, “right”)
Returns str string with desired length

We are primarily going to be working with non-model species, so the gene names will always be of the form

XLOC_123456 | emapper-name-or-description-if-we're-lucky

or something similar. This means that we could have extreme variation in the actual length of a gene “name”; this will make it very hard to put gene names on axes as it will distort figure sizes. I wrote a function to either trim or pad strings; even though axis labels are not in monospace fonts, it is much easier to visually reconcile strings with lengths in the same order of magnitude.

too_short = "Niko"
just_right = "Theseus"
too_tall = "The Mountain"
assert procrustes(just_right, appropriate_length=7) == "Theseus"
assert procrustes(too_short, appropriate_length=7) == "Niko   "
assert procrustes(too_tall, appropriate_length=7) == "The Mou"

source

grouped_obs_mean

 grouped_obs_mean (adata:anndata._core.anndata.AnnData, group_key:str,
                   layer:Optional[str]=None)

Helper function to calculate average expression per group in an AnnData object.

Type Default Details
adata AnnData AnnData object to analyse
group_key str .obs category to group by
layer typing.Optional[str] None layer to use. If none, use .X
Returns DataFrame a groups\(\times\)genes dataframe with the average expression

Many tasks in single-cell analysis require us to know the average expression of a gene in a certain group of cells. While scanpy does perform that task behind the scenes for, e.g. dotplots, this is not functionality that is exposed to the users. This is an implementation based on ivirshup’s answer to a scanpy issue.

adata = sc.read_h5ad(os.environ["EXAMPLE_DATA_PATH"] + "hydra.h5ad")

cluster_means = grouped_obs_mean(adata, group_key="Cluster")

If \(G\) is the number of genes and \(C\) the number of unique clusters in the group_key, the returned array should have the shape \(G \times C\):

no_genes = adata.shape[1]
no_clusters = len(np.unique(adata.obs["Cluster"]))
assert cluster_means.shape == (no_genes, no_clusters)

Additionally, each column of the array should contain the average detected expression for cells in that cluster:

belong_to_ecEp_SC2 = adata.obs["Cluster"] == "ecEp_SC2"
ecEp_SC2_average = np.mean(adata[belong_to_ecEp_SC2].X, axis=0)
ecEp_SC2_average = np.array(ecEp_SC2_average)[0]

assert all(np.isclose(cluster_means["ecEp_SC2"], ecEp_SC2_average))

source

grouped_obs_present

 grouped_obs_present (adata, group_key, layer:Optional[str]=None)

Helper function to calculate how many cells express each gene per group in an AnnData object.

Type Default Details
adata AnnData AnnData object to analyse.
group_key str .obs category to group by.
layer typing.Optional[str] None Layer to use. If none, use .X.
Returns pd.DataFrame A clusters$ imes$genes dataframe with the number of expressing cells per cluster.

Another critical value to know when making dotplots is the fraction of cells expressing a gene in a certain cluster. Again, scanpy performs that task without exposing it to the users. Similar to [grouped_obs_mean](https://galicae.github.io/comandos/util.html#grouped_obs_mean) this is an implementation based on ivirshup’s answer to a scanpy issue. Here we calculate the sum of cells expressing a gene, a table we can use to calculate the fraction later.

num_expressing = grouped_obs_present(adata, group_key="Cluster")

If \(G\) is the number of genes and \(C\) the number of unique clusters in the group_key, the returned array should have the shape \(G \times C\):

assert num_expressing.shape == (no_genes, no_clusters)

Additionally, each column of the array should contain the percentage of cells expressing each gene in that cluster:

belong_to_ecEp_SC2 = adata.obs["Cluster"] == "ecEp_SC2"
ecEp_SC2_expr = np.sum(adata[belong_to_ecEp_SC2].X > 0, axis=0)
ecEp_SC2_expr = np.array(ecEp_SC2_expr)[0]

assert all(num_expressing["ecEp_SC2"] == ecEp_SC2_expr)

source

grouped_obs_percent

 grouped_obs_percent (adata, group_key, layer:Optional[str]=None)

Helper function to calculate what percentage of cells express each gene per group in an AnnData object.

Type Default Details
adata AnnData AnnData object to analyse.
group_key str .obs category to group by.
layer typing.Optional[str] None Layer to use. If none, use .X.
Returns pd.DataFrame A clusters$ imes$genes dataframe with the percentage of expressing cells per cluster.

Calculating the fraction of cells is of course very straightforward once we have counted the number of cells that express the gene as well as the total number of cells in a cluster.

frac_expressing = grouped_obs_percent(adata, group_key="Cluster")
# use the counts and number of cells we calculated before
frac_ecEp_SC2 = ecEp_SC2_expr / np.sum(belong_to_ecEp_SC2)

assert all(np.isclose(frac_expressing["ecEp_SC2"], frac_ecEp_SC2))

When visualising gene expression in the context of single-cell RNA-seq comparisons between species, absolute expression values are meaningless, as there are too many factors that determine them that we can not account for. It is sometimes more informative to look at the relative expression of a gene within the organism, i.e. to look at it in terms of min/max expression.

Scanpy does not have a provision for this, so here is a small utility function that does that:


source

rescale

 rescale (x:pandas.core.frame.DataFrame)

Rescale a dataframe so that row values are in the range [0, 1].

Type Details
x DataFrame A two-dimensional array to rescale; preferably the output of grouped_obs_mean.
Returns DataFrame the rescaled dataframe.

Highlighting clusters

Dimensionality reduction plots can often be rather busy, and searching for the correct cluster can be a bit of a hassle. It would be great if we could highlight the cluster of interest without losing the rest of the clustering information; for instance by drawing a circle around the cluster to highlight.


source

find_center

 find_center (coords)

A function that estimates a Gaussian probability density for the input data and returns the mode. From https://stackoverflow.com/a/60185876.

Type Details
coords np.ndarray A 2D array with X, Y-coordinates from xs, ys.
Returns float The X-coordinate of the mode.

A heuristic to achieve this is to pretend the cluster points are a Gaussian cloud on UMAP/tSNE/PCA/<your favorite embedding> space, and take the position with the highest density (the mode of the 2D distribution). This function is inspired from a StackOverflow answer, and mostly a wrapper around the Gaussian KDE function from scikit-learn.

# create a random number generator
rng = np.random.default_rng(42)
# create a 2D Gaussian dataset
x = rng.normal(loc=2, scale=1, size=2000)
y = rng.normal(loc=-2, scale=1, size=2000)

coords = np.array([x, y]).T
coords_center = find_center(coords)

We’d expect the mode of the kernel density estimate to be very close to the true mean of the data. Since this is for plotting purposes we don’t need to be extremely specific. See plot:highlighted_dimplot for a demonstration of the function.

assert np.isclose(coords_center[0], 2, rtol=0.2)
assert np.isclose(coords_center[1], -2, rtol=0.2)

source

map_fine_to_coarse

 map_fine_to_coarse (sm, species, fine, coarse=None, plot=<function umap>,
                     include_coarse=False)

Extract the mapping of fine to coarse clusters from a SAMap object.

Type Default Details
sm sm.maps.SAMAP SAMAP object to process.
species str Species ID of the correct SAM object.
fine str Fine clustering slot name.
coarse NoneType None Coarse clustering slot name. If None, use the same as fine, mapping each cluster to
itself. (default: None).
plot function umap Plotting function to use; this will correctly set the colors (default: sc.pl.umap).
include_coarse bool False If True, preface the fine cluster names with the coarse cluster names (default: False).
Returns fine_to_coarse: pd.DataFrame A dataframe with the mapping of fine to coarse clusters.
/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/fastcore/docscrape.py:225: UserWarning: potentially wrong underline length... 
Modifies 
------- in 
This function collapses unrelated clusters by identifying a major cluster category based on the
`coarse` column. It then assigns the major cluster category to all unrelated clusters in the...
  else: warn(msg)

source

collapse_unrelated_clusters

 collapse_unrelated_clusters (adata:anndata._core.anndata.AnnData,
                              cluster:str, fine:str, coarse:str)

This function collapses unrelated clusters by identifying a major cluster category based on the coarse column. It then assigns the major cluster category to all unrelated clusters in the fine column. The resulting collapsed cluster information is stored in a new column named fine + "_collapsed".

Type Details
adata AnnData The annotation data containing the clusters to be collapsed.
cluster str The cluster identifier within the fine column that needs to be collapsed.
fine str The column name representing the fine-grained clustering.
coarse str The column name representing the coarse-grained clustering.
Returns None
file = open(os.environ["EXAMPLE_DATA_PATH"] + "hypl.pkl", "rb")
sm = pickle.load(file)

sm.sams["hy"].adata.obs["coarse"] = (
    sm.sams["hy"].adata.obs["Cluster"].str.split("_").str[0]
)
collapse_unrelated_clusters(sm.sams["hy"].adata, "ecEp_SC2", "Cluster", "coarse")
sm.sams["hy"].adata.obs["Cluster"].cat.categories
Index(['ecEp_SC2', 'ecEp_SC3', 'ecEp_bat1(mp)', 'ecEp_bat2(mp)', 'ecEp_bd',
       'ecEp_head', 'ecEp_nem1(pd)', 'ecEp_nem2(id)', 'enEp_SC1', 'enEp_SC2',
       'enEp_SC3', 'enEp_foot', 'enEp_head', 'enEp_nem1(pd)', 'enEp_nem2(pd)',
       'enEp_tent', 'enEp_tent(pd)', 'i_SC', 'i_fmgl1', 'i_fmgl2_nurse',
       'i_gmgc', 'i_mgl', 'i_nb1', 'i_nb2', 'i_nb3', 'i_nb4', 'i_nc1', 'i_nc2',
       'i_nc3', 'i_nc4', 'i_nc5', 'i_nc6', 'i_nc7', 'i_nc8', 'i_nc_gc_prog',
       'i_nc_prog', 'i_nem', 'i_smgc1', 'i_smgc2', 'i_zmg1', 'i_zmg2', 'nan'],
      dtype='object')
sm.sams["hy"].adata.obs["Cluster_collapsed"].cat.categories
Index(['ecEp_SC2', 'ecEp_SC3', 'ecEp_bat1(mp)', 'ecEp_bat2(mp)', 'ecEp_bd',
       'ecEp_head', 'ecEp_nem1(pd)', 'ecEp_nem2(id)', 'enEp', 'i', 'nan'],
      dtype='object')

Notice that all subclusters in the "ecEP" group are still visible, but all other clusters are subsumed by their coarse cluster.