genes

Prepare and annotate genes and gene sets.

SAMap is built on pairwise gene similarity. For cross-species comparisons it is often very interesting to know the level of conservation of two genes - mostly, whether they are orthologs or paralogs. I will provide functions to annotate genes with orthology information from EggNOG-mapper, but you can also use your own orthology annotation, provided it returns the same thing: a \(G_1 \times G_2\) table where \(G_1\) and \(G_2\) are the genes in the two species you are comparing, and each cell \((g_1, g_2)\) contains the orthology relationship between \(g_1\) and \(g_2\): 2, if they are orthologs, 1 if they are paralogs/in the same gene family, 0 if they are unrelated.

This is an extremely inefficient way of saving this matrix, and it would be relatively easy to code this as a sparse matrix, but I don’t think it’s worth the effort. The matrices are not too large (especially after the gene filtering that is so common to scRNA-seq analysis), and even home computers commonly pack 16GB of RAM these days. Unless you are working with Frankenstein’d genomes with tens of thousands of “genes” you will be fine.

First we will need to read in the EggNOG-mapper result file.

WARNING - EggNOG format:

Depending on the version of EggNOG you may get a slightly different file; you will need to filter it down to the two columns we need: the query gene ID and the orthogroup assignments. These are a comma-separated string in the format orthogroup_ID@taxonomic_level. The taxonomic level will determine whether genes are orthologs or paralogs, so choose wisely. I am using "Eukaryota" and "Bilateria" as defaults, but it may well be that your version of EggNOG is using NCBI taxonomic IDs instead of verbose names. Please check before applying!

WARNING - Index matching:

For the entries of the EggNOG table to match to the gene names in the SAMap object we need to make sure the index of query matches to the index of sm.sams[query_species].adata.var. If you created the files in the scheme that I am following, this means that you prepended the species ID to the gene IDs; we would need to do the same here.

query = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/hydra.tsv",
    sep="\t",
    engine="python",
)
query = query[
    ["Unnamed: 0", "eggNOG_OGs"]
].copy()  # I am only keeping the columns I need
query.columns = ["gene_id", "eggNOG_OGs"]  # rename so that it is easier to work with
query["gene_id"] = "hy_" + query["gene_id"].astype(str)
query
gene_id eggNOG_OGs
0 hy_t33417aep 38ERC@33154,3NUD8@4751,3QR2W@4890,3RR2D@4891,C...
1 hy_t33418aep 2CN11@1,2QT83@2759,38D3X@33154,3BDRF@33208,3CX...
2 hy_t37645aep 38CHP@33154,3B9GY@33208,COG2124@1,KOG0157@2759
3 hy_t31628aep KOG1075@1,KOG1075@2759,KOG4475@1,KOG4475@2759
4 hy_t33265aep 3QB9P@4776,COG3145@1,KOG2731@2759
... ... ...
18265 hy_t24932aep 3A1BV@33154,3BQ1H@33208,KOG1121@1,KOG1121@2759
18266 hy_t24930aep 2E9XV@1,2SG7Z@2759
18267 hy_t24940aep KOG0118@1,KOG0118@2759
18268 hy_t29557aep 2CZA0@1,2S9AH@2759,3ABW2@33154,3BVK1@33208
18269 hy_t29564aep COG2340@1,KOG3017@2759

18270 rows × 2 columns

Now we will filter the EggNOG_OGs column and only keep the two levels that we’re interested in. Since we are comparing Hydra to a planarian we should be using the Metazoa level for orthologs. We can see from the table visualization that this table uses NCBI tax IDs, so we should look up the tax ID for Metazoa (33208). Similarly, if we would like to use the Eukaryota level for paralogs, we need its tax ID (2759).

We also need a function that will filter the orthology table to only keep the OGs that belong to the specified levels:


source

filter_OGs

 filter_OGs (x:Union[list,str], paralog:str='Eukaryota',
             ortholog:str='Bilateria')

Find the EggNOG OGs at the the paralog and ortholog level.

Type Default Details
x typing.Union[list, str]
paralog str Eukaryota the level of the paralog OG
ortholog str Bilateria the level of the ortholog OG
Returns list the paralog OG and ortholog OG

This function will filter one EggNOG string (or list) to the specified levels.

query["eggNOG_OGs"].loc[0]
'38ERC@33154,3NUD8@4751,3QR2W@4890,3RR2D@4891,COG5050@1,KOG2877@2759'
input_str = query["eggNOG_OGs"].loc[0]
paralog_str, ortholog_str = filter_OGs(input_str, paralog="2759", ortholog="33208")
assert paralog_str == "KOG2877@2759"
assert ortholog_str == ""
input_list = query["eggNOG_OGs"].loc[0].split(",")
paralog_list, ortholog_list = filter_OGs(input_list, paralog="2759", ortholog="33208")
assert paralog_list == "KOG2877@2759"
assert ortholog_list == ""

source

assign_homology

 assign_homology (species_OGs, paralog:str='Eukaryota',
                  ortholog:str='Bilateria')

Get the taxonomy of the genes.

Type Default Details
species_OGs the dataframe with the gene_id and the EggNOG OGs
paralog str Eukaryota the level of the paralog OG
ortholog str Bilateria the level of the ortholog OG)
Returns DataFrame the dataframe with the gene_id, paralog OG and ortholog OG
hydra_genes = assign_homology(query, paralog="2759", ortholog="33208")
hydra_genes
ortholog paralog
gene_id
hy_t10003aep 3BCY5@33208 KOG3599@2759
hy_t10008aep None KOG1075@2759
hy_t10009aep 3BA48@33208 KOG1545@2759
hy_t10011aep 3BFGW@33208 KOG1136@2759
hy_t10012aep 3BKRE@33208 KOG2527@2759
... ... ...
hy_t998aep 3C06S@33208 2S418@2759
hy_t9990aep None KOG0490@2759
hy_t9992aep 3B9JS@33208 KOG0573@2759
hy_t999aep None 2RZ1Z@2759
hy_t99aep 3BH2P@33208 KOG2861@2759

18270 rows × 2 columns

assert hydra_genes.loc["hy_t10003aep"]["paralog"] == "KOG3599@2759"
assert hydra_genes.loc["hy_t10003aep"]["ortholog"] == "3BCY5@33208"

Repeat for the target species (planarian):

target = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/planarian.tsv",
    sep="\t",
    engine="python",
)
target = target[
    ["Unnamed: 0", "eggNOG_OGs"]
].copy()  # I am only keeping the columns I need
target.columns = ["gene_id", "eggNOG_OGs"]  # rename so that it is easier to work with
target["gene_id"] = "pl_" + target["gene_id"].astype(str)

planarian_genes = assign_homology(target, paralog="2759", ortholog="33208")

Given the orthology assignments (orthogroup membership), it is now very easy to calculate which cross-species genes are orthologs or paralogs; we just need to compare the columns of the orthology tables and keep score.


source

calculate_orthology_score

 calculate_orthology_score (query:pandas.core.frame.DataFrame,
                            target:pandas.core.frame.DataFrame)
Type Details
query DataFrame the dataframe with the gene_id, paralog OG and ortholog OG for the query species
target DataFrame
Returns DataFrame
orthology_score = calculate_orthology_score(hydra_genes, planarian_genes)
gene1 = "pl_dd_Smed_v4_10002_0_1"
gene2 = "hy_t25984aep"

annot1 = target.set_index("gene_id").loc[gene1]
annot2 = query.set_index("gene_id").loc[gene2]

print(annot1["eggNOG_OGs"])
print(annot2["eggNOG_OGs"])
38GXK@33154,3BHNW@33208,3CZKV@33213,47ZE5@7711,48WIG@7742,4CF02@8459,KOG2397@1,KOG2397@2759
38GXK@33154,3BHNW@33208,KOG2397@1,KOG2397@2759

These genes are orthologs, we should therefore expect them to have an orthology score of 2:

assert orthology_score[gene1].loc[gene2] == 2

We can save the orthology table to disk for later use:

# not run
orthology_score.to_csv("path/to/hypl_orthology.tsv", sep="\t")

source

get_orthologs_overlap

 get_orthologs_overlap (genes1, genes2, query, target, orthology)

Returns a DataFrame of homologous gene pairs between two sets of genes based on their presence in an orthology table.

Type Details
genes1 numpy.ndarray A series of gene names.
genes2 numpy.ndarray A series of gene names.
query anndata.AnnData An AnnData object containing the query genes as indices of the .var slot.
target anndata.AnnData An AnnData object containing the target genes as indices of the .var slot.
orthology pandas.core.frame.DataFrame A DataFrame containing the orthology information.
Returns pandas.core.frame.DataFrame A DataFrame of homologous gene pairs and their degree of conservation. The array has
three columns: ‘query’, ‘target’, and ‘degree’, where ‘query’ and ‘target’ are the gene
names, and ‘degree’ is the degree of conservation, which can be either 1 or 2.

source

get_orthologs

 get_orthologs (genes:numpy.ndarray,
                orthology:pandas.core.frame.DataFrame,
                target:anndata._core.anndata.AnnData, celltype_to:str)

Get orthologous and paralogous gene connections based on the given genes and orthology information.

Type Details
genes ndarray Array of gene names.
orthology DataFrame Data frame representing the orthology information. The index should contain the query genes,
the columns should overlap with the index of target.var, and the values should be 1 for
paralogs and 2 for orthologs.
target AnnData Target annotation data.
celltype_to str The target cell type. Must be a key in target.uns["rank_genes_groups"]["names"].
Returns ndarray Array of connections between genes, including orthologous and paralogous connections.
Columns are (query, target, degree), where degree is 1 for paralogs and 2 for orthologs.