genes

Prepare and annotate genes and gene sets.

SAMap is built on pairwise gene similarity. For cross-species comparisons it is often very interesting to know the level of conservation of two genes - mostly, whether they are orthologs or paralogs. I will provide functions to annotate genes with orthology information from EggNOG-mapper, but you can also use your own orthology annotation, provided it returns the same thing: a \(G_1 \times G_2\) table where \(G_1\) and \(G_2\) are the genes in the two species you are comparing, and each cell \((g_1, g_2)\) contains the orthology relationship between \(g_1\) and \(g_2\): 2, if they are orthologs, 1 if they are paralogs/in the same gene family, 0 if they are unrelated.

This is an extremely inefficient way of saving this matrix, and it would be relatively easy to code this as a sparse matrix, but I don’t think it’s worth the effort. The matrices are not too large (especially after the gene filtering that is so common to scRNA-seq analysis), and even home computers commonly pack 16GB of RAM these days. Unless you are working with Frankenstein’d genomes with tens of thousands of “genes” you will be fine.

First we will need to read in the EggNOG-mapper result file.

WARNING - EggNOG format:

Depending on the version of EggNOG you may get a slightly different file; you will need to filter it down to the two columns we need: the query gene ID and the orthogroup assignments. These are a comma-separated string in the format orthogroup_ID@taxonomic_level. The taxonomic level will determine whether genes are orthologs or paralogs, so choose wisely. I am using "Eukaryota" and "Bilateria" as defaults, but it may well be that your version of EggNOG is using NCBI taxonomic IDs instead of verbose names. Please check before applying!

WARNING - Index matching:

For the entries of the EggNOG table to match to the gene names in the SAMap object we need to make sure the index of query matches to the index of sm.sams[query_species].adata.var. If you created the files in the scheme that I am following, this means that you prepended the species ID to the gene IDs; we would need to do the same here.

query = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/hydra.tsv",
    sep="\t",
    engine="python",
)
query = query[
    ["Unnamed: 0", "eggNOG_OGs"]
].copy()  # I am only keeping the columns I need
query.columns = ["gene_id", "eggNOG_OGs"]  # rename so that it is easier to work with
query["gene_id"] = "hy_" + query["gene_id"].astype(str)
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[5], line 1
----> 1 query = pd.read_csv(
      2     os.environ["EXAMPLE_DATA_PATH"] + "eggnog/hydra.tsv",
      3     sep="\t",
      4     engine="python",
      5 )
      6 query = query[
      7     ["Unnamed: 0", "eggNOG_OGs"]
      8 ].copy()  # I am only keeping the columns I need
      9 query.columns = ["gene_id", "eggNOG_OGs"]  # rename so that it is easier to work with

File /opt/homebrew/Caskroom/miniforge/base/envs/comandos_dev/lib/python3.14/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1013 kwds_defaults = _refine_defaults_read(
   1014     dialect,
   1015     delimiter,
   (...)   1022     dtype_backend=dtype_backend,
   1023 )
   1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File /opt/homebrew/Caskroom/miniforge/base/envs/comandos_dev/lib/python3.14/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
    617 _validate_names(kwds.get("names", None))
    619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
    622 if chunksize or iterator:
    623     return parser

File /opt/homebrew/Caskroom/miniforge/base/envs/comandos_dev/lib/python3.14/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   1617     self.options["has_index_names"] = kwds["has_index_names"]
   1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File /opt/homebrew/Caskroom/miniforge/base/envs/comandos_dev/lib/python3.14/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   1878     if "b" not in mode:
   1879         mode += "b"
-> 1880 self.handles = get_handle(
   1881     f,
   1882     mode,
   1883     encoding=self.options.get("encoding", None),
   1884     compression=self.options.get("compression", None),
   1885     memory_map=self.options.get("memory_map", False),
   1886     is_text=is_text,
   1887     errors=self.options.get("encoding_errors", "strict"),
   1888     storage_options=self.options.get("storage_options", None),
   1889 )
   1890 assert self.handles is not None
   1891 f = self.handles.handle

File /opt/homebrew/Caskroom/miniforge/base/envs/comandos_dev/lib/python3.14/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    868 elif isinstance(handle, str):
    869     # Check whether the filename is to be opened in binary mode.
    870     # Binary mode does not support 'encoding' and 'newline'.
    871     if ioargs.encoding and "b" not in ioargs.mode:
    872         # Encoding
--> 873         handle = open(
    874             handle,
    875             ioargs.mode,
    876             encoding=ioargs.encoding,
    877             errors=errors,
    878             newline="",
    879         )
    880     else:
    881         # Binary mode
    882         handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/npapadop/Documents/repos/comandos/example_data/eggnog/hydra.tsv'
query
gene_id eggNOG_OGs
0 hy_t33417aep 38ERC@33154,3NUD8@4751,3QR2W@4890,3RR2D@4891,C...
1 hy_t33418aep 2CN11@1,2QT83@2759,38D3X@33154,3BDRF@33208,3CX...
2 hy_t37645aep 38CHP@33154,3B9GY@33208,COG2124@1,KOG0157@2759
3 hy_t31628aep KOG1075@1,KOG1075@2759,KOG4475@1,KOG4475@2759
4 hy_t33265aep 3QB9P@4776,COG3145@1,KOG2731@2759
... ... ...
18265 hy_t24932aep 3A1BV@33154,3BQ1H@33208,KOG1121@1,KOG1121@2759
18266 hy_t24930aep 2E9XV@1,2SG7Z@2759
18267 hy_t24940aep KOG0118@1,KOG0118@2759
18268 hy_t29557aep 2CZA0@1,2S9AH@2759,3ABW2@33154,3BVK1@33208
18269 hy_t29564aep COG2340@1,KOG3017@2759

18270 rows × 2 columns

Now we will filter the EggNOG_OGs column and only keep the two levels that we’re interested in. Since we are comparing Hydra to a planarian we should be using the Metazoa level for orthologs. We can see from the table visualization that this table uses NCBI tax IDs, so we should look up the tax ID for Metazoa (33208). Similarly, if we would like to use the Eukaryota level for paralogs, we need its tax ID (2759).

We also need a function that will filter the orthology table to only keep the OGs that belong to the specified levels:


source

filter_OGs


def filter_OGs(
    x:Union, paralog:str='Eukaryota', # the level of the paralog OG
    ortholog:str='Bilateria', # the level of the ortholog OG
)->list: # the paralog OG and ortholog OG

Find the EggNOG OGs at the the paralog and ortholog level.

This function will filter one EggNOG string (or list) to the specified levels.

query["eggNOG_OGs"].loc[0]
'38ERC@33154,3NUD8@4751,3QR2W@4890,3RR2D@4891,COG5050@1,KOG2877@2759'
input_str = query["eggNOG_OGs"].loc[0]
paralog_str, ortholog_str = filter_OGs(input_str, paralog="2759", ortholog="33208")
assert paralog_str == "KOG2877@2759"
assert ortholog_str == ""
input_list = query["eggNOG_OGs"].loc[0].split(",")
paralog_list, ortholog_list = filter_OGs(input_list, paralog="2759", ortholog="33208")
assert paralog_list == "KOG2877@2759"
assert ortholog_list == ""

source

assign_homology


def assign_homology(
    species_OGs, # the dataframe with the gene_id and the EggNOG OGs
    paralog:str='Eukaryota', # the level of the paralog OG
    ortholog:str='Bilateria', # the level of the ortholog OG)
)->DataFrame: # the dataframe with the gene_id, paralog OG and ortholog OG

Get the taxonomy of the genes.

hydra_genes = assign_homology(query, paralog="2759", ortholog="33208")
hydra_genes
ortholog paralog
gene_id
hy_t10003aep 3BCY5@33208 KOG3599@2759
hy_t10008aep None KOG1075@2759
hy_t10009aep 3BA48@33208 KOG1545@2759
hy_t10011aep 3BFGW@33208 KOG1136@2759
hy_t10012aep 3BKRE@33208 KOG2527@2759
... ... ...
hy_t998aep 3C06S@33208 2S418@2759
hy_t9990aep None KOG0490@2759
hy_t9992aep 3B9JS@33208 KOG0573@2759
hy_t999aep None 2RZ1Z@2759
hy_t99aep 3BH2P@33208 KOG2861@2759

18270 rows × 2 columns

assert hydra_genes.at["hy_t10003aep", "paralog"] == "KOG3599@2759"
assert hydra_genes.at["hy_t10003aep", "ortholog"] == "3BCY5@33208"

Repeat for the target species (planarian):

target = pd.read_csv(
    os.environ["EXAMPLE_DATA_PATH"] + "eggnog/planarian.tsv",
    sep="\t",
    engine="python",
)
target = target[
    ["Unnamed: 0", "eggNOG_OGs"]
].copy()  # I am only keeping the columns I need
target.columns = ["gene_id", "eggNOG_OGs"]  # rename so that it is easier to work with
target["gene_id"] = "pl_" + target["gene_id"].astype(str)

planarian_genes = assign_homology(target, paralog="2759", ortholog="33208")

Given the orthology assignments (orthogroup membership), it is now very easy to calculate which cross-species genes are orthologs or paralogs; we just need to compare the columns of the orthology tables and keep score.


source

calculate_orthology_score


def calculate_orthology_score(
    query:DataFrame, # the dataframe with the gene_id, paralog OG and ortholog OG for the query species
    target:DataFrame
)->DataFrame:
orthology_score = calculate_orthology_score(hydra_genes, planarian_genes)
gene1 = "pl_dd_Smed_v4_10002_0_1"
gene2 = "hy_t25984aep"

annot1 = target.set_index("gene_id").loc[gene1]
annot2 = query.set_index("gene_id").loc[gene2]

print(annot1["eggNOG_OGs"])
print(annot2["eggNOG_OGs"])
38GXK@33154,3BHNW@33208,3CZKV@33213,47ZE5@7711,48WIG@7742,4CF02@8459,KOG2397@1,KOG2397@2759
38GXK@33154,3BHNW@33208,KOG2397@1,KOG2397@2759

These genes are orthologs, we should therefore expect them to have an orthology score of 2:

assert orthology_score[gene1].loc[gene2] == 2

We can save the orthology table to disk for later use:

# not run
orthology_score.to_csv("path/to/hypl_orthology.tsv", sep="\t")

source

get_orthologs_overlap


def get_orthologs_overlap(
    genes1, # A series of gene names.
    genes2, # A series of gene names.
    query, # An AnnData object containing the query genes as indices of the `.var` slot.
    target, # An AnnData object containing the target genes as indices of the `.var` slot.
    orthology, # A DataFrame containing the orthology information.
): # A DataFrame of homologous gene pairs and their degree of conservation. The array has
three columns: 'query', 'target', and 'degree', where 'query' and 'target' are the gene
names, and 'degree' is the degree of conservation, which can be either 1 or 2.

Returns a DataFrame of homologous gene pairs between two sets of genes based on their presence in an orthology table.


source

get_orthologs


def get_orthologs(
    genes:ndarray, # Array of gene names.
    orthology:DataFrame, # Data frame representing the orthology information. The index should contain the query genes,
the columns should overlap with the index of target.var, and the values should be 1 for
paralogs and 2 for orthologs.
    target:AnnData, # Target annotation data.
    celltype_to:str, # The target cell type. Must be a key in `target.uns["rank_genes_groups"]["names"]`.
)->ndarray: # Array of connections between genes, including orthologous and paralogous connections.
Columns are (query, target, degree), where degree is 1 for paralogs and 2 for orthologs.

Get orthologous and paralogous gene connections based on the given genes and orthology information.