= pd.read_csv(
query "EXAMPLE_DATA_PATH"] + "eggnog/hydra.tsv",
os.environ[="\t",
sep="python",
engine
)= query[
query "Unnamed: 0", "eggNOG_OGs"]
[# I am only keeping the columns I need
].copy() = ["gene_id", "eggNOG_OGs"] # rename so that it is easier to work with
query.columns "gene_id"] = "hy_" + query["gene_id"].astype(str) query[
genes
SAMap is built on pairwise gene similarity. For cross-species comparisons it is often very interesting to know the level of conservation of two genes - mostly, whether they are orthologs or paralogs. I will provide functions to annotate genes with orthology information from EggNOG-mapper, but you can also use your own orthology annotation, provided it returns the same thing: a \(G_1 \times G_2\) table where \(G_1\) and \(G_2\) are the genes in the two species you are comparing, and each cell \((g_1, g_2)\) contains the orthology relationship between \(g_1\) and \(g_2\): 2, if they are orthologs, 1 if they are paralogs/in the same gene family, 0 if they are unrelated.
This is an extremely inefficient way of saving this matrix, and it would be relatively easy to code this as a sparse matrix, but I don’t think it’s worth the effort. The matrices are not too large (especially after the gene filtering that is so common to scRNA-seq analysis), and even home computers commonly pack 16GB of RAM these days. Unless you are working with Frankenstein’d genomes with tens of thousands of “genes” you will be fine.
First we will need to read in the EggNOG-mapper result file.
WARNING - EggNOG format:
Depending on the version of EggNOG you may get a slightly different file; you will need to filter it down to the two columns we need: the query gene ID and the orthogroup assignments. These are a comma-separated string in the format orthogroup_ID@taxonomic_level
. The taxonomic level will determine whether genes are orthologs or paralogs, so choose wisely. I am using "Eukaryota"
and "Bilateria"
as defaults, but it may well be that your version of EggNOG is using NCBI taxonomic IDs instead of verbose names. Please check before applying!
WARNING - Index matching:
For the entries of the EggNOG table to match to the gene names in the SAMap object we need to make sure the index of query
matches to the index of sm.sams[query_species].adata.var
. If you created the files in the scheme that I am following, this means that you prepended the species ID to the gene IDs; we would need to do the same here.
query
gene_id | eggNOG_OGs | |
---|---|---|
0 | hy_t33417aep | 38ERC@33154,3NUD8@4751,3QR2W@4890,3RR2D@4891,C... |
1 | hy_t33418aep | 2CN11@1,2QT83@2759,38D3X@33154,3BDRF@33208,3CX... |
2 | hy_t37645aep | 38CHP@33154,3B9GY@33208,COG2124@1,KOG0157@2759 |
3 | hy_t31628aep | KOG1075@1,KOG1075@2759,KOG4475@1,KOG4475@2759 |
4 | hy_t33265aep | 3QB9P@4776,COG3145@1,KOG2731@2759 |
... | ... | ... |
18265 | hy_t24932aep | 3A1BV@33154,3BQ1H@33208,KOG1121@1,KOG1121@2759 |
18266 | hy_t24930aep | 2E9XV@1,2SG7Z@2759 |
18267 | hy_t24940aep | KOG0118@1,KOG0118@2759 |
18268 | hy_t29557aep | 2CZA0@1,2S9AH@2759,3ABW2@33154,3BVK1@33208 |
18269 | hy_t29564aep | COG2340@1,KOG3017@2759 |
18270 rows × 2 columns
Now we will filter the EggNOG_OGs
column and only keep the two levels that we’re interested in. Since we are comparing Hydra to a planarian we should be using the Metazoa level for orthologs. We can see from the table visualization that this table uses NCBI tax IDs, so we should look up the tax ID for Metazoa (33208). Similarly, if we would like to use the Eukaryota level for paralogs, we need its tax ID (2759).
We also need a function that will filter the orthology table to only keep the OGs that belong to the specified levels:
filter_OGs
filter_OGs (x:Union[list,str], paralog:str='Eukaryota', ortholog:str='Bilateria')
Find the EggNOG OGs at the the paralog and ortholog level.
Type | Default | Details | |
---|---|---|---|
x | typing.Union[list, str] | ||
paralog | str | Eukaryota | the level of the paralog OG |
ortholog | str | Bilateria | the level of the ortholog OG |
Returns | list | the paralog OG and ortholog OG |
This function will filter one EggNOG string (or list) to the specified levels.
"eggNOG_OGs"].loc[0] query[
'38ERC@33154,3NUD8@4751,3QR2W@4890,3RR2D@4891,COG5050@1,KOG2877@2759'
= query["eggNOG_OGs"].loc[0]
input_str = filter_OGs(input_str, paralog="2759", ortholog="33208")
paralog_str, ortholog_str assert paralog_str == "KOG2877@2759"
assert ortholog_str == ""
= query["eggNOG_OGs"].loc[0].split(",")
input_list = filter_OGs(input_list, paralog="2759", ortholog="33208")
paralog_list, ortholog_list assert paralog_list == "KOG2877@2759"
assert ortholog_list == ""
assign_homology
assign_homology (species_OGs, paralog:str='Eukaryota', ortholog:str='Bilateria')
Get the taxonomy of the genes.
Type | Default | Details | |
---|---|---|---|
species_OGs | the dataframe with the gene_id and the EggNOG OGs | ||
paralog | str | Eukaryota | the level of the paralog OG |
ortholog | str | Bilateria | the level of the ortholog OG) |
Returns | DataFrame | the dataframe with the gene_id, paralog OG and ortholog OG |
= assign_homology(query, paralog="2759", ortholog="33208") hydra_genes
hydra_genes
ortholog | paralog | |
---|---|---|
gene_id | ||
hy_t10003aep | 3BCY5@33208 | KOG3599@2759 |
hy_t10008aep | None | KOG1075@2759 |
hy_t10009aep | 3BA48@33208 | KOG1545@2759 |
hy_t10011aep | 3BFGW@33208 | KOG1136@2759 |
hy_t10012aep | 3BKRE@33208 | KOG2527@2759 |
... | ... | ... |
hy_t998aep | 3C06S@33208 | 2S418@2759 |
hy_t9990aep | None | KOG0490@2759 |
hy_t9992aep | 3B9JS@33208 | KOG0573@2759 |
hy_t999aep | None | 2RZ1Z@2759 |
hy_t99aep | 3BH2P@33208 | KOG2861@2759 |
18270 rows × 2 columns
assert hydra_genes.loc["hy_t10003aep"]["paralog"] == "KOG3599@2759"
assert hydra_genes.loc["hy_t10003aep"]["ortholog"] == "3BCY5@33208"
Repeat for the target species (planarian):
= pd.read_csv(
target "EXAMPLE_DATA_PATH"] + "eggnog/planarian.tsv",
os.environ[="\t",
sep="python",
engine
)= target[
target "Unnamed: 0", "eggNOG_OGs"]
[# I am only keeping the columns I need
].copy() = ["gene_id", "eggNOG_OGs"] # rename so that it is easier to work with
target.columns "gene_id"] = "pl_" + target["gene_id"].astype(str)
target[
= assign_homology(target, paralog="2759", ortholog="33208") planarian_genes
Given the orthology assignments (orthogroup membership), it is now very easy to calculate which cross-species genes are orthologs or paralogs; we just need to compare the columns of the orthology tables and keep score.
calculate_orthology_score
calculate_orthology_score (query:pandas.core.frame.DataFrame, target:pandas.core.frame.DataFrame)
Type | Details | |
---|---|---|
query | DataFrame | the dataframe with the gene_id, paralog OG and ortholog OG for the query species |
target | DataFrame | |
Returns | DataFrame |
= calculate_orthology_score(hydra_genes, planarian_genes) orthology_score
= "pl_dd_Smed_v4_10002_0_1"
gene1 = "hy_t25984aep"
gene2
= target.set_index("gene_id").loc[gene1]
annot1 = query.set_index("gene_id").loc[gene2]
annot2
print(annot1["eggNOG_OGs"])
print(annot2["eggNOG_OGs"])
38GXK@33154,3BHNW@33208,3CZKV@33213,47ZE5@7711,48WIG@7742,4CF02@8459,KOG2397@1,KOG2397@2759
38GXK@33154,3BHNW@33208,KOG2397@1,KOG2397@2759
These genes are orthologs, we should therefore expect them to have an orthology score of 2:
assert orthology_score[gene1].loc[gene2] == 2
We can save the orthology table to disk for later use:
# not run
"path/to/hypl_orthology.tsv", sep="\t") orthology_score.to_csv(
get_orthologs_overlap
get_orthologs_overlap (genes1, genes2, query, target, orthology)
Returns a DataFrame of homologous gene pairs between two sets of genes based on their presence in an orthology table.
Type | Details | |
---|---|---|
genes1 | numpy.ndarray | A series of gene names. |
genes2 | numpy.ndarray | A series of gene names. |
query | anndata.AnnData | An AnnData object containing the query genes as indices of the .var slot. |
target | anndata.AnnData | An AnnData object containing the target genes as indices of the .var slot. |
orthology | pandas.core.frame.DataFrame | A DataFrame containing the orthology information. |
Returns | pandas.core.frame.DataFrame | A DataFrame of homologous gene pairs and their degree of conservation. The array has three columns: ‘query’, ‘target’, and ‘degree’, where ‘query’ and ‘target’ are the gene names, and ‘degree’ is the degree of conservation, which can be either 1 or 2. |
get_orthologs
get_orthologs (genes:numpy.ndarray, orthology:pandas.core.frame.DataFrame, target:anndata._core.anndata.AnnData, celltype_to:str)
Get orthologous and paralogous gene connections based on the given genes and orthology information.
Type | Details | |
---|---|---|
genes | ndarray | Array of gene names. |
orthology | DataFrame | Data frame representing the orthology information. The index should contain the query genes, the columns should overlap with the index of target.var, and the values should be 1 for paralogs and 2 for orthologs. |
target | AnnData | Target annotation data. |
celltype_to | str | The target cell type. Must be a key in target.uns["rank_genes_groups"]["names"] . |
Returns | ndarray | Array of connections between genes, including orthologous and paralogous connections. Columns are (query, target, degree), where degree is 1 for paralogs and 2 for orthologs. |