util

I/O functions

The information about the location of genes in a genome can come from anywhere, but it is typically stored in a GFF3 file. While this is far from a standardized format, there exist some guidelines about how a (syntactically) valid GFF3 file should look like; they can be found here. Such a typical, 9-column, tab-separated file is what I have mostly been working with. If your GFF3 looks different you will have to figure out how to read it in yourself - or open an issue and I can try to help you.


source

read_gff

 read_gff (loc:str|pathlib.Path, gff_columns:list=['seqid', 'source',
           'type', 'start', 'end', 'score', 'strand', 'phase',
           'attributes'], skiprows:int=1, header:Union[int,collections.abc
           .Sequence[int],NoneType,Literal['infer']]=None, sep:str='\t',
           **kwargs)

A function to read a GFF3 file. Expects 9 tab-separated fields, and will try to name the columns according to https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Type Default Details
loc str | pathlib.Path input filepath
gff_columns list [‘seqid’, ‘source’, ‘type’, ‘start’, ‘end’, ‘score’, ‘strand’, ‘phase’, ‘attributes’]
skiprows int 1 how many rows to skip in the beginning
header Union None
sep str separator for the table
kwargs VAR_KEYWORD
Returns DataFrame the GFF3 file in DataFrame form

For instance, let’s try and read an example. This is a fake GFF3 file based on the Pycnogonum litorale Hox gene cluster; for the real one, please refer to our work on the sea spider.

gff = read_gff(os.environ["EXAMPLE_DATA_PATH"] + "plit.gff3")
gff
seqid source type start end score strand phase attributes
0 pseudochrom_56 PacBio gene 1927066 1936157 . - . ID=PB.8615;function=Homeobox domain;gene=Hox1-...
1 pseudochrom_56 PacBio mRNA 1927066 1936157 . - . ID=PB.8615.1;Parent=PB.8615;function=Homeobox ...
2 pseudochrom_56 PacBio exon 1927066 1928028 . - . ID=PB.8615.1.exon1;Parent=PB.8615.1;function=H...
3 pseudochrom_56 PacBio exon 1935229 1936157 . - . ID=PB.8615.1.exon2;Parent=PB.8615.1;function=H...
4 pseudochrom_56 PacBio CDS 1927066 1928028 . - 1 ID=PB.8615.1.CDS1;Parent=PB.8615.1;function=Ho...
... ... ... ... ... ... ... ... ... ...
134 scaffold_44 AUGUSTUS CDS 1998922 2000654 0.72 - 2 ID=g13061.t1.CDS1;Parent=g13061.t1;function=se...
135 scaffold_44 AUGUSTUS CDS 2023821 2024148 0.53 - 0 ID=g13061.t1.CDS2;Parent=g13061.t1;function=se...
136 scaffold_44 GeneMark.hmm3 mRNA 2023807 2024148 . - . ID=g13061.t2;Parent=g13061;function=sequence-s...
137 scaffold_44 GeneMark.hmm3 exon 2023807 2024148 . - 0 ID=g13061.t2.exon1;Parent=g13061.t2;function=s...
138 scaffold_44 GeneMark.hmm3 CDS 2023807 2024148 . - 0 ID=g13061.t2.CDS1;Parent=g13061.t2;function=se...

139 rows × 9 columns

To begin with, we would like to visualize the ‘real’ Hox genes. We are also only interested in plotting entire genes (no exon structure), so we can filter the GFF rows based on that. We should also extract the gene IDs from the table to help us filter. Furthermore, we should be extracting gene names, in case we want to plot them too.


source

gff_attribute_selector

 gff_attribute_selector (line, sep:str=';', select='ID')

A function that extracts the value of a specified field from the attributes of a GFF3 line. Should be used on the attributes field of the corresponding pandas DataFrame.

Type Default Details
line the attributes field of a GFF3 line
sep str ; the field separator. Should be a semicolon for a GFF3 file.
select str ID the field ID. Should be one that is included in the GFF3 file. Refer to https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for possibilities, or choose a manually defined tag that you know is present in the file.
Returns str the value of field select
gff["gene_id"] = gff["attributes"].apply(
    lambda x: gff_attribute_selector(x, select="ID")
)
gff["gene_name"] = gff["attributes"].apply(
    lambda x: gff_attribute_selector(x, select="gene")
)
hox_genes = [
    "PB.8615",
    "g9718",
    "PB.8616",
    "g9720",
    "g9721",
    "PB.8617",
    "g9723",
    "g9724",
    "g9725",
]
is_hox = gff["gene_id"].isin(hox_genes)
is_gene = gff["type"] == "gene"

slim = gff[is_gene & is_hox]
slim
seqid source type start end score strand phase attributes gene_id gene_name
0 pseudochrom_56 PacBio gene 1927066 1936157 . - . ID=PB.8615;function=Homeobox domain;gene=Hox1-... PB.8615 Hox1-A
6 pseudochrom_56 AUGUSTUS gene 1998922 2024148 . - . ID=g9718;function=sequence-specific DNA bindin... g9718 Hox2-A
15 pseudochrom_56 PacBio gene 2058396 2065953 . - . ID=PB.8616;function=homeobox protein;gene=Hox3... PB.8616 Hox3-A
56 pseudochrom_56 AUGUSTUS gene 2195412 2206712 . - . ID=g9720;function=sequence-specific DNA bindin... g9720 Hox4-A
62 pseudochrom_56 AUGUSTUS gene 2351936 2354374 . - . ID=g9721;function=sequence-specific DNA bindin... g9721 Hox5-A
68 pseudochrom_56 PacBio gene 2373415 2375678 . - . ID=PB.8617;function=sequence-specific DNA bind... PB.8617 Hox6-A
79 pseudochrom_56 AUGUSTUS gene 2565196 2594468 . - . ID=g9723;function=sequence-specific DNA bindin... g9723 Hox7-A
85 pseudochrom_56 AUGUSTUS gene 2916314 2926445 . - . ID=g9724;function=sequence-specific DNA bindin... g9724 Hox8-A
91 pseudochrom_56 AUGUSTUS gene 2986021 2996225 . - . ID=g9725;function=sequence-specific DNA bindin... g9725 Hox10-A

This table basically already contains all the information we need: * the name of the chromosome * the location of each gene * the strand of each gene * the directionality of each gene (given by relative start/end positions) * the name of each gene (hidden in the attributes)

The only thing that’s left is to extract this information in the way that is needed for the plotter:

hox = slim[["seqid", "gene_name", "gene_id", "start", "end"]].reset_index(drop=True)
hox
seqid gene_name gene_id start end
0 pseudochrom_56 Hox1-A PB.8615 1927066 1936157
1 pseudochrom_56 Hox2-A g9718 1998922 2024148
2 pseudochrom_56 Hox3-A PB.8616 2058396 2065953
3 pseudochrom_56 Hox4-A g9720 2195412 2206712
4 pseudochrom_56 Hox5-A g9721 2351936 2354374
5 pseudochrom_56 Hox6-A PB.8617 2373415 2375678
6 pseudochrom_56 Hox7-A g9723 2565196 2594468
7 pseudochrom_56 Hox8-A g9724 2916314 2926445
8 pseudochrom_56 Hox10-A g9725 2986021 2996225

source

filter

 filter (gff, filter_by_type=True, filter_type='gene',
         filter_by_field=True, field='gene_id', field_values=None)

source

decorate

 decorate (gff:pandas.core.frame.DataFrame, attributes:dict={'gene_id':
           'ID', 'gene_name': 'gene'})

A function that

Type Default Details
gff DataFrame a GFF file in Pandas dataframe form
attributes dict {‘gene_id’: ‘ID’, ‘gene_name’: ‘gene’}

source

syntenic_block_borders

 syntenic_block_borders (gff:pandas.core.frame.DataFrame,
                         flank_length:int=None, start:str='start',
                         end:str='end')

A function to calculate the boundaries of a syntenic block. It automatically pads the boundaries by an additional 5% of total length on both ends.

Type Default Details
gff DataFrame a GFF in Pandas dataframe form. Only includes the genes of the syntenic block in question.
flank_length int None the amount of space to be granted on both sides of the syntenic region, in basepairs. If unspecified, it will be set to 5% of the syntenic block length.
start str start the GFF column with the start position of the gene (“start”).
end str end the GFF column with the end position of the gene (“end”).
Returns (<class ‘int’>, <class ‘int’>)
test_fail(
    syntenic_block_borders,
    contains="The parameter `flank_length` has to be an integer",
    args=(gff,),
    kwargs=dict(flank_length=5.2),
)

The process of reading in a GFF3 file can be expedited with the


source

read_aln

 read_aln (m8:str, id_sep:str|None=None, **kwargs)

Reads

Type Default Details
m8 str the path to the MMseqs2 alignment table file
id_sep str | None None
kwargs VAR_KEYWORD
Returns DataFrame the tabulated form of the alignment results.

source

estimate_plot_size

 estimate_plot_size (gff, width_factor:int=3, height:int=2)
assert estimate_plot_size(gff[gff["type"] == "gene"]) == (45, 2)

source

insert_gap

 insert_gap (gff:pandas.core.frame.DataFrame, locus1=None, locus2=None,
             identifier='gene_id', purge_columns=None, no_gaps=1)

This function inserts a number of dummy entries between two loci (lines) in the GFF DataFrame.

test_fail(
    insert_gap,
    contains="The two loci are not consecutive;",
    args=(hox,),
    kwargs={"locus1": "PB.8615", "locus2": "g9720", "identifier": "gene_id"},
)
hox = insert_gap(
    hox,
    locus1="PB.8615",
    locus2="g9718",
    identifier="gene_id",
    no_gaps=4,
    purge_columns=["gene_name"],
)
hox["strand"] = "-"
hox
seqid gene_name gene_id start end strand
0 pseudochrom_56 Hox1-A PB.8615 1927066 1936157 -
1 pseudochrom_56 gap_PB.8615-0 1936158 1936159 -
2 pseudochrom_56 gap_PB.8615-1 1936160 1936161 -
3 pseudochrom_56 gap_PB.8615-2 1936162 1936163 -
4 pseudochrom_56 gap_PB.8615-3 1936164 1936165 -
5 pseudochrom_56 Hox2-A g9718 1998922 2024148 -
6 pseudochrom_56 Hox3-A PB.8616 2058396 2065953 -
7 pseudochrom_56 Hox4-A g9720 2195412 2206712 -
8 pseudochrom_56 Hox5-A g9721 2351936 2354374 -
9 pseudochrom_56 Hox6-A PB.8617 2373415 2375678 -
10 pseudochrom_56 Hox7-A g9723 2565196 2594468 -
11 pseudochrom_56 Hox8-A g9724 2916314 2926445 -
12 pseudochrom_56 Hox10-A g9725 2986021 2996225 -
hox = insert_gap(hox, locus2="PB.8615", no_gaps=1, purge_columns=["gene_name"])
hox = insert_gap(hox, locus1="g9725", no_gaps=1, purge_columns=["gene_name"])

source

flip

 flip (gff)
flip(hox)
seqid source type start end score strand phase attributes gene_id gene_name
0 pseudochrom_56 PacBio gene -1927066 -1936157 . + . ID=PB.8615;function=Homeobox domain;gene=Hox1-... PB.8615 Hox1-A
1 pseudochrom_56 AUGUSTUS gene -1998922 -2024148 . + . ID=g9718;function=sequence-specific DNA bindin... g9718 Hox2-A
2 pseudochrom_56 PacBio gene -2058396 -2065953 . + . ID=PB.8616;function=homeobox protein;gene=Hox3... PB.8616 Hox3-A
3 pseudochrom_56 AUGUSTUS gene -2195412 -2206712 . + . ID=g9720;function=sequence-specific DNA bindin... g9720 Hox4-A
4 pseudochrom_56 AUGUSTUS gene -2351936 -2354374 . + . ID=g9721;function=sequence-specific DNA bindin... g9721 Hox5-A
5 pseudochrom_56 PacBio gene -2373415 -2375678 . + . ID=PB.8617;function=sequence-specific DNA bind... PB.8617 Hox6-A
6 pseudochrom_56 AUGUSTUS gene -2565196 -2594468 . + . ID=g9723;function=sequence-specific DNA bindin... g9723 Hox7-A
7 pseudochrom_56 AUGUSTUS gene -2916314 -2926445 . + . ID=g9724;function=sequence-specific DNA bindin... g9724 Hox8-A
8 pseudochrom_56 AUGUSTUS gene -2986021 -2996225 . + . ID=g9725;function=sequence-specific DNA bindin... g9725 Hox10-A

source

insert_break

 insert_break (gff:pandas.core.frame.DataFrame, locus1=None, locus2=None,
               identifier='gene_id')

This function inserts a molecule break between two loci (lines) in the GFF DataFrame.

keep = gff["gene_id"].isin(hox_genes)
hox = gff[keep].reset_index(drop=True)
decorate(hox)

on_scaff44 = gff["seqid"] == "scaffold_44"
is_gene = gff["type"] == "gene"
hoxc = gff[is_gene & on_scaff44].reset_index(drop=True)
decorate(hoxc)
interrupted = pd.concat((hox.loc[6:], hoxc)).reset_index(drop=True)
interrupted
seqid source type start end score strand phase attributes gene_id gene_name
0 pseudochrom_56 AUGUSTUS gene 2565196 2594468 . - . ID=g9723;function=sequence-specific DNA bindin... g9723 Hox7-A
1 pseudochrom_56 AUGUSTUS gene 2916314 2926445 . - . ID=g9724;function=sequence-specific DNA bindin... g9724 Hox8-A
2 pseudochrom_56 AUGUSTUS gene 2986021 2996225 . - . ID=g9725;function=sequence-specific DNA bindin... g9725 Hox10-A
3 scaffold_44 PacBio gene 1927066 1936157 . - . ID=PB.1762;function=Homeobox domain;gene=Hox11... PB.1762 Hox11
4 scaffold_44 AUGUSTUS gene 1998922 2024148 . - . ID=g13061;function=sequence-specific DNA bindi... g13061 Hox12
interrupted = insert_break(interrupted, locus1="g9725", locus2="PB.1762")
interrupted
seqid source type start end score strand phase attributes gene_id gene_name
0 pseudochrom_56 AUGUSTUS gene 2565196 2594468 . - . ID=g9723;function=sequence-specific DNA bindin... g9723 Hox7-A
1 pseudochrom_56 AUGUSTUS gene 2916314 2926445 . - . ID=g9724;function=sequence-specific DNA bindin... g9724 Hox8-A
2 pseudochrom_56 AUGUSTUS gene 2986021 2996225 . - . ID=g9725;function=sequence-specific DNA bindin... g9725 Hox10-A
3 break
4 scaffold_44 PacBio gene 1927066 1936157 . - . ID=PB.1762;function=Homeobox domain;gene=Hox11... PB.1762 Hox11
5 scaffold_44 AUGUSTUS gene 1998922 2024148 . - . ID=g13061;function=sequence-specific DNA bindi... g13061 Hox12