= "Niko"
too_short = "Theseus"
just_right = "The Mountain" too_tall
util
procrustes
procrustes (x:str, appropriate_length:int=50, pad_with:str=' ', side:str='right')
A function to regulate string length.
Type | Default | Details | |
---|---|---|---|
x | str | input string | |
appropriate_length | int | 50 | desired length |
pad_with | str | character to pad with | |
side | str | right | which side to pad on (“left”, “right”) |
Returns | str | string with desired length |
We are primarily going to be working with non-model species, so the gene names will always be of the form
XLOC_123456 | emapper-name-or-description-if-we're-lucky
or something similar. This means that we could have extreme variation in the actual length of a gene “name”; this will make it very hard to put gene names on axes as it will distort figure sizes. I wrote a function to either trim or pad strings; even though axis labels are not in monospace fonts, it is much easier to visually reconcile strings with lengths in the same order of magnitude.
assert procrustes(just_right, appropriate_length=7) == "Theseus"
assert procrustes(too_short, appropriate_length=7) == "Niko "
assert procrustes(too_tall, appropriate_length=7) == "The Mou"
grouped_obs_mean
grouped_obs_mean (adata:anndata._core.anndata.AnnData, group_key:str, layer:Optional[str]=None)
Helper function to calculate average expression per group in an AnnData
object.
Type | Default | Details | |
---|---|---|---|
adata | AnnData | AnnData object to analyse | |
group_key | str | .obs category to group by |
|
layer | typing.Optional[str] | None | layer to use. If none, use .X |
Returns | DataFrame | a groups\(\times\)genes dataframe with the average expression |
Many tasks in single-cell analysis require us to know the average expression of a gene in a certain group of cells. While scanpy
does perform that task behind the scenes for, e.g. dotplots, this is not functionality that is exposed to the users. This is an implementation based on ivirshup’s answer to a scanpy issue.
= sc.read_h5ad(os.environ["EXAMPLE_DATA_PATH"] + "hydra.h5ad")
adata
= grouped_obs_mean(adata, group_key="Cluster") cluster_means
If \(G\) is the number of genes and \(C\) the number of unique clusters in the group_key
, the returned array should have the shape \(G \times C\):
= adata.shape[1]
no_genes = len(np.unique(adata.obs["Cluster"]))
no_clusters assert cluster_means.shape == (no_genes, no_clusters)
Additionally, each column of the array should contain the average detected expression for cells in that cluster:
= adata.obs["Cluster"] == "ecEp_SC2"
belong_to_ecEp_SC2 = np.mean(adata[belong_to_ecEp_SC2].X, axis=0)
ecEp_SC2_average = np.array(ecEp_SC2_average)[0]
ecEp_SC2_average
assert all(np.isclose(cluster_means["ecEp_SC2"], ecEp_SC2_average))
grouped_obs_present
grouped_obs_present (adata, group_key, layer:Optional[str]=None)
Helper function to calculate how many cells express each gene per group in an AnnData
object.
Type | Default | Details | |
---|---|---|---|
adata | AnnData | AnnData object to analyse. | |
group_key | str | .obs category to group by. |
|
layer | typing.Optional[str] | None | Layer to use. If none, use .X . |
Returns | pd.DataFrame | A clusters$ imes$genes dataframe with the number of expressing cells per cluster. |
Another critical value to know when making dotplots is the fraction of cells expressing a gene in a certain cluster. Again, scanpy
performs that task without exposing it to the users. Similar to [grouped_obs_mean
](https://galicae.github.io/comandos/util.html#grouped_obs_mean) this is an implementation based on ivirshup’s answer to a scanpy issue. Here we calculate the sum of cells expressing a gene, a table we can use to calculate the fraction later.
= grouped_obs_present(adata, group_key="Cluster") num_expressing
If \(G\) is the number of genes and \(C\) the number of unique clusters in the group_key
, the returned array should have the shape \(G \times C\):
assert num_expressing.shape == (no_genes, no_clusters)
Additionally, each column of the array should contain the percentage of cells expressing each gene in that cluster:
= adata.obs["Cluster"] == "ecEp_SC2"
belong_to_ecEp_SC2 = np.sum(adata[belong_to_ecEp_SC2].X > 0, axis=0)
ecEp_SC2_expr = np.array(ecEp_SC2_expr)[0]
ecEp_SC2_expr
assert all(num_expressing["ecEp_SC2"] == ecEp_SC2_expr)
grouped_obs_percent
grouped_obs_percent (adata, group_key, layer:Optional[str]=None)
Helper function to calculate what percentage of cells express each gene per group in an AnnData
object.
Type | Default | Details | |
---|---|---|---|
adata | AnnData | AnnData object to analyse. | |
group_key | str | .obs category to group by. |
|
layer | typing.Optional[str] | None | Layer to use. If none, use .X . |
Returns | pd.DataFrame | A clusters$ imes$genes dataframe with the percentage of expressing cells per cluster. |
Calculating the fraction of cells is of course very straightforward once we have counted the number of cells that express the gene as well as the total number of cells in a cluster.
= grouped_obs_percent(adata, group_key="Cluster")
frac_expressing # use the counts and number of cells we calculated before
= ecEp_SC2_expr / np.sum(belong_to_ecEp_SC2)
frac_ecEp_SC2
assert all(np.isclose(frac_expressing["ecEp_SC2"], frac_ecEp_SC2))
When visualising gene expression in the context of single-cell RNA-seq comparisons between species, absolute expression values are meaningless, as there are too many factors that determine them that we can not account for. It is sometimes more informative to look at the relative expression of a gene within the organism, i.e. to look at it in terms of min/max expression.
Scanpy does not have a provision for this, so here is a small utility function that does that:
rescale
rescale (x:pandas.core.frame.DataFrame)
Rescale a dataframe so that row values are in the range [0, 1].
Type | Details | |
---|---|---|
x | DataFrame | A two-dimensional array to rescale; preferably the output of grouped_obs_mean . |
Returns | DataFrame | the rescaled dataframe. |
Highlighting clusters
Dimensionality reduction plots can often be rather busy, and searching for the correct cluster can be a bit of a hassle. It would be great if we could highlight the cluster of interest without losing the rest of the clustering information; for instance by drawing a circle around the cluster to highlight.
find_center
find_center (coords)
A function that estimates a Gaussian probability density for the input data and returns the mode. From https://stackoverflow.com/a/60185876.
Type | Details | |
---|---|---|
coords | np.ndarray | A 2D array with X, Y-coordinates from xs, ys. |
Returns | float | The X-coordinate of the mode. |
A heuristic to achieve this is to pretend the cluster points are a Gaussian cloud on UMAP/tSNE/PCA/<your favorite embedding> space, and take the position with the highest density (the mode of the 2D distribution). This function is inspired from a StackOverflow answer, and mostly a wrapper around the Gaussian KDE function from scikit-learn
.
# create a random number generator
= np.random.default_rng(42)
rng # create a 2D Gaussian dataset
= rng.normal(loc=2, scale=1, size=2000)
x = rng.normal(loc=-2, scale=1, size=2000)
y
= np.array([x, y]).T coords
= find_center(coords) coords_center
We’d expect the mode of the kernel density estimate to be very close to the true mean of the data. Since this is for plotting purposes we don’t need to be extremely specific. See plot:highlighted_dimplot
for a demonstration of the function.
assert np.isclose(coords_center[0], 2, rtol=0.2)
assert np.isclose(coords_center[1], -2, rtol=0.2)
map_fine_to_coarse
map_fine_to_coarse (sm, species, fine, coarse=None, plot=<function umap>, include_coarse=False)
Extract the mapping of fine to coarse clusters from a SAMap object.
Type | Default | Details | |
---|---|---|---|
sm | sm.maps.SAMAP | SAMAP object to process. | |
species | str | Species ID of the correct SAM object. | |
fine | str | Fine clustering slot name. | |
coarse | NoneType | None | Coarse clustering slot name. If None, use the same as fine , mapping each cluster toitself. (default: None ). |
plot | function | umap | Plotting function to use; this will correctly set the colors (default: sc.pl.umap ). |
include_coarse | bool | False | If True, preface the fine cluster names with the coarse cluster names (default: False ). |
Returns | fine_to_coarse: pd.DataFrame | A dataframe with the mapping of fine to coarse clusters. |
/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/fastcore/docscrape.py:225: UserWarning: potentially wrong underline length...
Modifies
------- in
This function collapses unrelated clusters by identifying a major cluster category based on the
`coarse` column. It then assigns the major cluster category to all unrelated clusters in the...
else: warn(msg)