genes

Prepare and annotate genes and gene sets.

SAMap is built on pairwise gene similarity. For cross-species comparisons it is often very interesting to know the level of conservation of two genes - mostly, whether they are orthologs or paralogs. I will provide functions to annotate genes with orthology information from EggNOG-mapper, but you can also use your own orthology annotation, provided it returns the same thing: a \(G_1 \times G_2\) table where \(G_1\) and \(G_2\) are the genes in the two species you are comparing, and each cell \((g_1, g_2)\) contains the orthology relationship between \(g_1\) and \(g_2\): 2, if they are orthologs, 1 if they are paralogs/in the same gene family, 0 if they are unrelated.

This is an extremely inefficient way of saving this matrix, and it would be relatively easy to code this as a sparse matrix, but I don’t think it’s worth the effort. The matrices are not too large (especially after the gene filtering that is so common to scRNA-seq analysis), and even home computers commonly pack 16GB of RAM these days. Unless you are working with Frankenstein’d genomes with tens of thousands of “genes” you will be fine.

First we will need to read in the EggNOG-mapper result file.

WARNING - EggNOG format:

Depending on the version of EggNOG you may get a slightly different file; you will need to filter it down to the two columns we need: the query gene ID and the orthogroup assignments. These are a comma-separated string in the format orthogroup_ID@taxonomic_level. The taxonomic level will determine whether genes are orthologs or paralogs, so choose wisely. I am using "Eukaryota" and "Bilateria" as defaults, but it may well be that your version of EggNOG is using NCBI taxonomic IDs instead of verbose names. Please check before applying!

WARNING - Index matching:

For the entries of the EggNOG table to match to the gene names in the SAMap object we need to make sure the index of query matches to the index of sm.sams[query_species].adata.var. If you created the files in the scheme that I am following, this means that you prepended the species ID to the gene IDs; we would need to do the same here.

query = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/hydra.tsv",
    sep="\t",
    engine="python",
)
query = query[
    ["Unnamed: 0", "eggNOG_OGs"]
].copy()  # I am only keeping the columns I need
query.columns = ["gene_id", "eggNOG_OGs"]  # rename so that it is easier to work with
query["gene_id"] = "hy_" + query["gene_id"].astype(str)

query

	gene_id	eggNOG_OGs
0	hy_t33417aep	38ERC@33154,3NUD8@4751,3QR2W@4890,3RR2D@4891,C...
1	hy_t33418aep	2CN11@1,2QT83@2759,38D3X@33154,3BDRF@33208,3CX...
2	hy_t37645aep	38CHP@33154,3B9GY@33208,COG2124@1,KOG0157@2759
3	hy_t31628aep	KOG1075@1,KOG1075@2759,KOG4475@1,KOG4475@2759
4	hy_t33265aep	3QB9P@4776,COG3145@1,KOG2731@2759
...	...	...
18265	hy_t24932aep	3A1BV@33154,3BQ1H@33208,KOG1121@1,KOG1121@2759
18266	hy_t24930aep	2E9XV@1,2SG7Z@2759
18267	hy_t24940aep	KOG0118@1,KOG0118@2759
18268	hy_t29557aep	2CZA0@1,2S9AH@2759,3ABW2@33154,3BVK1@33208
18269	hy_t29564aep	COG2340@1,KOG3017@2759

18270 rows × 2 columns

Now we will filter the EggNOG_OGs column and only keep the two levels that we’re interested in. Since we are comparing Hydra to a planarian we should be using the Metazoa level for orthologs. We can see from the table visualization that this table uses NCBI tax IDs, so we should look up the tax ID for Metazoa (33208). Similarly, if we would like to use the Eukaryota level for paralogs, we need its tax ID (2759).

We also need a function that will filter the orthology table to only keep the OGs that belong to the specified levels:

source

filter_OGs

 filter_OGs (x:Union[list,str], paralog:str='Eukaryota',
             ortholog:str='Bilateria')

Find the EggNOG OGs at the the paralog and ortholog level.

	Type	Default	Details
x	typing.Union[list, str]
paralog	str	Eukaryota	the level of the paralog OG
ortholog	str	Bilateria	the level of the ortholog OG
Returns	list		the paralog OG and ortholog OG

This function will filter one EggNOG string (or list) to the specified levels.

query["eggNOG_OGs"].loc[0]

'38ERC@33154,3NUD8@4751,3QR2W@4890,3RR2D@4891,COG5050@1,KOG2877@2759'

input_str = query["eggNOG_OGs"].loc[0]
paralog_str, ortholog_str = filter_OGs(input_str, paralog="2759", ortholog="33208")
assert paralog_str == "KOG2877@2759"
assert ortholog_str == ""

input_list = query["eggNOG_OGs"].loc[0].split(",")
paralog_list, ortholog_list = filter_OGs(input_list, paralog="2759", ortholog="33208")
assert paralog_list == "KOG2877@2759"
assert ortholog_list == ""

source

assign_homology

 assign_homology (species_OGs, paralog:str='Eukaryota',
                  ortholog:str='Bilateria')

Get the taxonomy of the genes.

	Type	Default	Details
species_OGs			the dataframe with the gene_id and the EggNOG OGs
paralog	str	Eukaryota	the level of the paralog OG
ortholog	str	Bilateria	the level of the ortholog OG)
Returns	DataFrame		the dataframe with the gene_id, paralog OG and ortholog OG

hydra_genes = assign_homology(query, paralog="2759", ortholog="33208")

hydra_genes

	ortholog	paralog
gene_id
hy_t10003aep	3BCY5@33208	KOG3599@2759
hy_t10008aep	None	KOG1075@2759
hy_t10009aep	3BA48@33208	KOG1545@2759
hy_t10011aep	3BFGW@33208	KOG1136@2759
hy_t10012aep	3BKRE@33208	KOG2527@2759
...	...	...
hy_t998aep	3C06S@33208	2S418@2759
hy_t9990aep	None	KOG0490@2759
hy_t9992aep	3B9JS@33208	KOG0573@2759
hy_t999aep	None	2RZ1Z@2759
hy_t99aep	3BH2P@33208	KOG2861@2759

18270 rows × 2 columns

assert hydra_genes.loc["hy_t10003aep"]["paralog"] == "KOG3599@2759"
assert hydra_genes.loc["hy_t10003aep"]["ortholog"] == "3BCY5@33208"

Repeat for the target species (planarian):

target = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/planarian.tsv",
    sep="\t",
    engine="python",
)
target = target[
    ["Unnamed: 0", "eggNOG_OGs"]
].copy()  # I am only keeping the columns I need
target.columns = ["gene_id", "eggNOG_OGs"]  # rename so that it is easier to work with
target["gene_id"] = "pl_" + target["gene_id"].astype(str)

planarian_genes = assign_homology(target, paralog="2759", ortholog="33208")

Given the orthology assignments (orthogroup membership), it is now very easy to calculate which cross-species genes are orthologs or paralogs; we just need to compare the columns of the orthology tables and keep score.

source

calculate_orthology_score

 calculate_orthology_score (query:pandas.core.frame.DataFrame,
                            target:pandas.core.frame.DataFrame)

	Type	Details
query	DataFrame	the dataframe with the gene_id, paralog OG and ortholog OG for the query species
target	DataFrame
Returns	DataFrame

orthology_score = calculate_orthology_score(hydra_genes, planarian_genes)

gene1 = "pl_dd_Smed_v4_10002_0_1"
gene2 = "hy_t25984aep"

annot1 = target.set_index("gene_id").loc[gene1]
annot2 = query.set_index("gene_id").loc[gene2]

print(annot1["eggNOG_OGs"])
print(annot2["eggNOG_OGs"])

38GXK@33154,3BHNW@33208,3CZKV@33213,47ZE5@7711,48WIG@7742,4CF02@8459,KOG2397@1,KOG2397@2759
38GXK@33154,3BHNW@33208,KOG2397@1,KOG2397@2759

These genes are orthologs, we should therefore expect them to have an orthology score of 2:

assert orthology_score[gene1].loc[gene2] == 2

We can save the orthology table to disk for later use:

# not run
orthology_score.to_csv("path/to/hypl_orthology.tsv", sep="\t")

source

get_orthologs_overlap

 get_orthologs_overlap (genes1, genes2, query, target, orthology)

Returns a DataFrame of homologous gene pairs between two sets of genes based on their presence in an orthology table.

	Type	Details
genes1	numpy.ndarray	A series of gene names.
genes2	numpy.ndarray	A series of gene names.
query	anndata.AnnData	An AnnData object containing the query genes as indices of the `.var` slot.
target	anndata.AnnData	An AnnData object containing the target genes as indices of the `.var` slot.
orthology	pandas.core.frame.DataFrame	A DataFrame containing the orthology information.
Returns	pandas.core.frame.DataFrame	A DataFrame of homologous gene pairs and their degree of conservation. The array has three columns: ‘query’, ‘target’, and ‘degree’, where ‘query’ and ‘target’ are the gene names, and ‘degree’ is the degree of conservation, which can be either 1 or 2.

source

get_orthologs

 get_orthologs (genes:numpy.ndarray,
                orthology:pandas.core.frame.DataFrame,
                target:anndata._core.anndata.AnnData, celltype_to:str)

Get orthologous and paralogous gene connections based on the given genes and orthology information.

	Type	Details
genes	ndarray	Array of gene names.
orthology	DataFrame	Data frame representing the orthology information. The index should contain the query genes, the columns should overlap with the index of target.var, and the values should be 1 for paralogs and 2 for orthologs.
target	AnnData	Target annotation data.
celltype_to	str	The target cell type. Must be a key in `target.uns["rank_genes_groups"]["names"]`.
Returns	ndarray	Array of connections between genes, including orthologous and paralogous connections. Columns are (query, target, degree), where degree is 1 for paralogs and 2 for orthologs.