# util

``` python
too_short = "Niko"
just_right = "Theseus"
too_tall = "The Mountain"
```

## procrustes
``` python
def procrustes(
    x: str,                        # input string
    appropriate_length: int = 50,  # desired length
    pad_with: str = ' ',           # character to pad with
    side: str = 'right',           # which side to pad on ("left", "right")
) -> str:                          # string with desired length
```
A function to regulate string length.
We will primarily be working with non-model species, so gene names will usually be of the form

    XLOC_123456 | emapper-name-or-description-if-we're-lucky

or something similar. This means that the actual length of a gene "name" can vary wildly, which makes it very hard to put gene names on axes without distorting figure sizes. I wrote a function to either trim or pad strings; even though axis labels are not in monospace fonts, it is much easier to visually reconcile strings whose lengths are in the same order of magnitude.
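Since the function body is not shown here, a minimal sketch of the trim-or-pad logic (a hypothetical `procrustes_sketch`, not the package's actual implementation) might look like:

``` python
def procrustes_sketch(x, appropriate_length=50, pad_with=' ', side='right'):
    """Trim or pad `x` so it is exactly `appropriate_length` characters long."""
    x = x[:appropriate_length]  # trim if too long
    if side == 'right':
        return x.ljust(appropriate_length, pad_with)  # pad on the right
    return x.rjust(appropriate_length, pad_with)      # pad on the left
```

`str.ljust`/`str.rjust` are no-ops for strings already at the target length, so the three cases (too short, just right, too long) fall out of two lines.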
``` python
assert procrustes(just_right, appropriate_length=7) == "Theseus"
assert procrustes(too_short, appropriate_length=7) == "Niko   "
assert procrustes(too_tall, appropriate_length=7) == "The Mou"
```

## grouped_obs_mean
``` python
def grouped_obs_mean(
    adata: AnnData,          # AnnData object to analyse
    group_key: str,          # `.obs` category to group by
    layer: Optional = None,  # layer to use. If None, use `.X`
) -> DataFrame:              # a genes$\times$groups dataframe with the average expression
```
Helper function to calculate average expression per group in an AnnData object.
Many tasks in single-cell analysis require the average expression of a gene in a certain group of cells. While scanpy performs this computation behind the scenes for, e.g., dotplots, the functionality is not exposed to users. This is an implementation based on ivirshup's answer to a scanpy issue.
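The idea behind that answer can be sketched with plain numpy/pandas, leaving out sparse-matrix handling and the `layer` argument; `grouped_mean_sketch` is a hypothetical stand-in, not the package code:

``` python
import numpy as np
import pandas as pd

def grouped_mean_sketch(X, groups, var_names):
    # X: cells x genes matrix; groups: one label per cell
    out = {g: X[np.asarray(groups) == g].mean(axis=0)
           for g in pd.unique(np.asarray(groups))}
    # genes as rows, groups as columns
    return pd.DataFrame(out, index=var_names)
```

Each group's rows are selected with a boolean mask and averaged column-wise, so the result has one column per group and one row per gene.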
``` python
adata = sc.read_h5ad(os.environ["EXAMPLE_DATA_PATH"] + "hydra.h5ad")
cluster_means = grouped_obs_mean(adata, group_key="Cluster")
```

If \(G\) is the number of genes and \(C\) the number of unique clusters in the `group_key`, the returned array should have the shape \(G \times C\):
``` python
no_genes = adata.shape[1]
no_clusters = len(np.unique(adata.obs["Cluster"]))
assert cluster_means.shape == (no_genes, no_clusters)
```

Additionally, each column of the array should contain the average detected expression for cells in that cluster:
``` python
belong_to_ecEp_SC2 = adata.obs["Cluster"] == "ecEp_SC2"
ecEp_SC2_average = np.mean(adata[belong_to_ecEp_SC2].X, axis=0)
ecEp_SC2_average = np.array(ecEp_SC2_average)[0]
assert all(np.isclose(cluster_means["ecEp_SC2"], ecEp_SC2_average))
```

## grouped_obs_present
``` python
def grouped_obs_present(
    adata,                   # AnnData object to analyse.
    group_key,               # `.obs` category to group by.
    layer: Optional = None,  # Layer to use. If None, use `.X`.
):                           # A genes$\times$clusters dataframe with the number of expressing cells per cluster.
```
Helper function to calculate how many cells express each gene per group in an AnnData object.
Another critical value for dotplots is the fraction of cells expressing a gene in a certain cluster. Again, scanpy performs this task without exposing it to users. Similar to [grouped_obs_mean](https://galicae.github.io/comandos/util.html#grouped_obs_mean), this is an implementation based on ivirshup's answer to a scanpy issue. Here we calculate the number of cells expressing each gene, a table we can use to calculate the fraction later.
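The counting step can be sketched the same way as the mean, replacing the column-wise average with a count of non-zero entries (again a hypothetical stand-in, ignoring sparse matrices and the `layer` argument):

``` python
import numpy as np
import pandas as pd

def grouped_present_sketch(X, groups, var_names):
    # count cells with non-zero expression per group
    out = {g: (X[np.asarray(groups) == g] > 0).sum(axis=0)
           for g in pd.unique(np.asarray(groups))}
    return pd.DataFrame(out, index=var_names)
```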
``` python
num_expressing = grouped_obs_present(adata, group_key="Cluster")
```

If \(G\) is the number of genes and \(C\) the number of unique clusters in the `group_key`, the returned array should have the shape \(G \times C\):

``` python
assert num_expressing.shape == (no_genes, no_clusters)
```

Additionally, each column of the array should contain the number of cells expressing each gene in that cluster:
``` python
belong_to_ecEp_SC2 = adata.obs["Cluster"] == "ecEp_SC2"
ecEp_SC2_expr = np.sum(adata[belong_to_ecEp_SC2].X > 0, axis=0)
ecEp_SC2_expr = np.array(ecEp_SC2_expr)[0]
assert all(num_expressing["ecEp_SC2"] == ecEp_SC2_expr)
```

## grouped_obs_percent
``` python
def grouped_obs_percent(
    adata,                   # AnnData object to analyse.
    group_key,               # `.obs` category to group by.
    layer: Optional = None,  # Layer to use. If None, use `.X`.
):                           # A genes$\times$clusters dataframe with the fraction of expressing cells per cluster.
```
Helper function to calculate what percentage of cells express each gene per group in an AnnData object.
Calculating the fraction of expressing cells is straightforward once we have counted both the number of cells that express each gene and the total number of cells per cluster.
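Assuming a genes-by-clusters count table like the one `grouped_obs_present` returns, the division can be sketched as (hypothetical `grouped_percent_sketch`, relying on pandas aligning the Series of group sizes with the DataFrame's columns):

``` python
import pandas as pd

def grouped_percent_sketch(num_expressing, groups):
    # divide each group's counts by that group's number of cells
    sizes = pd.Series(groups).value_counts()
    return num_expressing / sizes
```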
``` python
frac_expressing = grouped_obs_percent(adata, group_key="Cluster")
# use the counts and number of cells we calculated before
frac_ecEp_SC2 = ecEp_SC2_expr / np.sum(belong_to_ecEp_SC2)
assert all(np.isclose(frac_expressing["ecEp_SC2"], frac_ecEp_SC2))
```

When visualising gene expression in cross-species single-cell RNA-seq comparisons, absolute expression values are meaningless, as too many factors we cannot account for determine them. It is often more informative to look at the relative expression of a gene within the organism, i.e. to look at it in terms of its min/max expression.
Scanpy does not have a provision for this, so here is a small utility function that does that:
## rescale

``` python
def rescale(
    x: DataFrame,  # A two-dimensional array to rescale; preferably the output of [`grouped_obs_mean`](https://galicae.github.io/comandos/util.html#grouped_obs_mean).
) -> DataFrame:    # the rescaled dataframe.
```

Rescale a dataframe so that row values are in the range [0, 1].
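A row-wise min/max rescaling can be sketched in a few lines of pandas (hypothetical `rescale_sketch`; constant rows, which would divide by zero, are not handled here):

``` python
import pandas as pd

def rescale_sketch(df):
    # shift each row to start at 0, then divide by its range
    mins = df.min(axis=1)
    maxs = df.max(axis=1)
    return df.sub(mins, axis=0).div(maxs - mins, axis=0)
```

After this, each row's minimum maps to 0 and its maximum to 1, making expression patterns comparable across genes and species.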
## Highlighting clusters
Dimensionality reduction plots can often be rather busy, and searching for the correct cluster can be a bit of a hassle. It would be great if we could highlight the cluster of interest without losing the rest of the clustering information; for instance by drawing a circle around the cluster to highlight.
## find_center

``` python
def find_center(
    coords,  # A 2D array of X, Y coordinates.
):           # The X, Y coordinates of the mode.
```

A function that estimates a Gaussian probability density for the input data and returns its mode. From https://stackoverflow.com/a/60185876.
A heuristic to achieve this is to pretend the cluster points are a Gaussian cloud in UMAP/tSNE/PCA/<your favorite embedding> space and take the position with the highest density (the mode of the 2D distribution). This function is inspired by a StackOverflow answer and is mostly a wrapper around the Gaussian KDE function from scikit-learn.
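As an illustration of the density-mode heuristic, here is a sketch using scipy's `gaussian_kde` rather than scikit-learn, with a hypothetical `find_center_sketch` standing in for the real function:

``` python
import numpy as np
from scipy.stats import gaussian_kde

def find_center_sketch(coords):
    # fit a KDE to the points and return the observed point with highest density
    kde = gaussian_kde(coords.T)  # gaussian_kde expects (n_dims, n_points)
    densities = kde(coords.T)     # density evaluated at each observed point
    return coords[np.argmax(densities)]
```

Evaluating the density only at the observed points keeps the sketch simple; a grid search would locate the mode more precisely, but for placing a label on a plot this is accurate enough.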
``` python
# create a random number generator
rng = np.random.default_rng(42)
# create a 2D Gaussian dataset
x = rng.normal(loc=2, scale=1, size=2000)
y = rng.normal(loc=-2, scale=1, size=2000)
coords = np.array([x, y]).T
coords_center = find_center(coords)
```

We'd expect the mode of the kernel density estimate to be very close to the true mean of the data. Since this is for plotting purposes we don't need it to be extremely precise. See plot:highlighted_dimplot for a demonstration of the function.

``` python
assert np.isclose(coords_center[0], 2, rtol=0.2)
assert np.isclose(coords_center[1], -2, rtol=0.2)
```

## map_fine_to_coarse
``` python
def map_fine_to_coarse(
    sm,                            # SAMAP object to process.
    species,                       # Species ID of the correct SAM object.
    fine,                          # Fine clustering slot name.
    coarse: NoneType = None,       # Coarse clustering slot name. If None, use the same as `fine`, mapping each cluster to itself. (default: `None`).
    plot: function = umap,         # Plotting function to use; this will correctly set the colors (default: `sc.pl.umap`).
    include_coarse: bool = False,  # If True, preface the fine cluster names with the coarse cluster names (default: `False`).
):                                 # A dataframe with the mapping of fine to coarse clusters.
```
Extract the mapping of fine to coarse clusters from a SAMap object.
This function collapses unrelated clusters by identifying a major cluster category based on the `coarse` column.