util

I/O functions

The information about the location of genes in a genome can come from anywhere, but it is typically stored in a GFF3 file. While this is far from a standardized format, there exist some guidelines about how a (syntactically) valid GFF3 file should look like; they can be found here. Such a typical, 9-column, tab-separated file is what I have mostly been working with. If your GFF3 looks different you will have to figure out how to read it in yourself - or open an issue and I can try to help you.

source

read_gff

 read_gff (loc:str|pathlib.Path, gff_columns:list=['seqid', 'source',
           'type', 'start', 'end', 'score', 'strand', 'phase',
           'attributes'], skiprows:int=1, header:Union[int,collections.abc
           .Sequence[int],NoneType,Literal['infer']]=None, sep:str='\t',
           **kwargs)

A function to read a GFF3 file. Expects 9 tab-separated fields, and will try to name the columns according to https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

	Type	Default	Details
loc	str \| pathlib.Path		input filepath
gff_columns	list	[‘seqid’, ‘source’, ‘type’, ‘start’, ‘end’, ‘score’, ‘strand’, ‘phase’, ‘attributes’]
skiprows	int	1	how many rows to skip in the beginning
header	Union	None
sep	str		separator for the table
kwargs	VAR_KEYWORD
Returns	DataFrame		the GFF3 file in DataFrame form

For instance, let’s try and read an example. This is a fake GFF3 file based on the Pycnogonum litorale Hox gene cluster; for the real one, please refer to our work on the sea spider.

gff = read_gff(os.environ["EXAMPLE_DATA_PATH"] + "plit.gff3")
gff

	seqid	source	type	start	end	score	strand	phase	attributes
0	pseudochrom_56	PacBio	gene	1927066	1936157	.	-	.	ID=PB.8615;function=Homeobox domain;gene=Hox1-...
1	pseudochrom_56	PacBio	mRNA	1927066	1936157	.	-	.	ID=PB.8615.1;Parent=PB.8615;function=Homeobox ...
2	pseudochrom_56	PacBio	exon	1927066	1928028	.	-	.	ID=PB.8615.1.exon1;Parent=PB.8615.1;function=H...
3	pseudochrom_56	PacBio	exon	1935229	1936157	.	-	.	ID=PB.8615.1.exon2;Parent=PB.8615.1;function=H...
4	pseudochrom_56	PacBio	CDS	1927066	1928028	.	-	1	ID=PB.8615.1.CDS1;Parent=PB.8615.1;function=Ho...
...	...	...	...	...	...	...	...	...	...
134	scaffold_44	AUGUSTUS	CDS	1998922	2000654	0.72	-	2	ID=g13061.t1.CDS1;Parent=g13061.t1;function=se...
135	scaffold_44	AUGUSTUS	CDS	2023821	2024148	0.53	-	0	ID=g13061.t1.CDS2;Parent=g13061.t1;function=se...
136	scaffold_44	GeneMark.hmm3	mRNA	2023807	2024148	.	-	.	ID=g13061.t2;Parent=g13061;function=sequence-s...
137	scaffold_44	GeneMark.hmm3	exon	2023807	2024148	.	-	0	ID=g13061.t2.exon1;Parent=g13061.t2;function=s...
138	scaffold_44	GeneMark.hmm3	CDS	2023807	2024148	.	-	0	ID=g13061.t2.CDS1;Parent=g13061.t2;function=se...

139 rows × 9 columns

To begin with, we would like to visualize the ‘real’ Hox genes. We are also only interested in plotting entire genes (no exon structure), so we can filter the GFF rows based on that. We should also extract the gene IDs from the table to help us filter. Furthermore, we should be extracting gene names, in case we want to plot them too.

source

gff_attribute_selector

 gff_attribute_selector (line, sep:str=';', select='ID')

A function that extracts the value of a specified field from the attributes of a GFF3 line. Should be used on the attributes field of the corresponding pandas DataFrame.

	Type	Default	Details
line			the attributes field of a GFF3 line
sep	str	;	the field separator. Should be a semicolon for a GFF3 file.
select	str	ID	the field ID. Should be one that is included in the GFF3 file. Refer to https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for possibilities, or choose a manually defined tag that you know is present in the file.
Returns	str		the value of field `select`

gff["gene_id"] = gff["attributes"].apply(
    lambda x: gff_attribute_selector(x, select="ID")
)
gff["gene_name"] = gff["attributes"].apply(
    lambda x: gff_attribute_selector(x, select="gene")
)

hox_genes = [
    "PB.8615",
    "g9718",
    "PB.8616",
    "g9720",
    "g9721",
    "PB.8617",
    "g9723",
    "g9724",
    "g9725",
]
is_hox = gff["gene_id"].isin(hox_genes)
is_gene = gff["type"] == "gene"

slim = gff[is_gene & is_hox]
slim

	seqid	source	type	start	end	score	strand	phase	attributes	gene_id	gene_name
0	pseudochrom_56	PacBio	gene	1927066	1936157	.	-	.	ID=PB.8615;function=Homeobox domain;gene=Hox1-...	PB.8615	Hox1-A
6	pseudochrom_56	AUGUSTUS	gene	1998922	2024148	.	-	.	ID=g9718;function=sequence-specific DNA bindin...	g9718	Hox2-A
15	pseudochrom_56	PacBio	gene	2058396	2065953	.	-	.	ID=PB.8616;function=homeobox protein;gene=Hox3...	PB.8616	Hox3-A
56	pseudochrom_56	AUGUSTUS	gene	2195412	2206712	.	-	.	ID=g9720;function=sequence-specific DNA bindin...	g9720	Hox4-A
62	pseudochrom_56	AUGUSTUS	gene	2351936	2354374	.	-	.	ID=g9721;function=sequence-specific DNA bindin...	g9721	Hox5-A
68	pseudochrom_56	PacBio	gene	2373415	2375678	.	-	.	ID=PB.8617;function=sequence-specific DNA bind...	PB.8617	Hox6-A
79	pseudochrom_56	AUGUSTUS	gene	2565196	2594468	.	-	.	ID=g9723;function=sequence-specific DNA bindin...	g9723	Hox7-A
85	pseudochrom_56	AUGUSTUS	gene	2916314	2926445	.	-	.	ID=g9724;function=sequence-specific DNA bindin...	g9724	Hox8-A
91	pseudochrom_56	AUGUSTUS	gene	2986021	2996225	.	-	.	ID=g9725;function=sequence-specific DNA bindin...	g9725	Hox10-A

This table basically already contains all the information we need: * the name of the chromosome * the location of each gene * the strand of each gene * the directionality of each gene (given by relative start/end positions) * the name of each gene (hidden in the attributes)

The only thing that’s left is to extract this information in the way that is needed for the plotter:

hox = slim[["seqid", "gene_name", "gene_id", "start", "end"]].reset_index(drop=True)
hox

	seqid	gene_name	gene_id	start	end
0	pseudochrom_56	Hox1-A	PB.8615	1927066	1936157
1	pseudochrom_56	Hox2-A	g9718	1998922	2024148
2	pseudochrom_56	Hox3-A	PB.8616	2058396	2065953
3	pseudochrom_56	Hox4-A	g9720	2195412	2206712
4	pseudochrom_56	Hox5-A	g9721	2351936	2354374
5	pseudochrom_56	Hox6-A	PB.8617	2373415	2375678
6	pseudochrom_56	Hox7-A	g9723	2565196	2594468
7	pseudochrom_56	Hox8-A	g9724	2916314	2926445
8	pseudochrom_56	Hox10-A	g9725	2986021	2996225

source

filter

 filter (gff, filter_by_type=True, filter_type='gene',
         filter_by_field=True, field='gene_id', field_values=None)

source

decorate

 decorate (gff:pandas.core.frame.DataFrame, attributes:dict={'gene_id':
           'ID', 'gene_name': 'gene'})

A function that

	Type	Default	Details
gff	DataFrame		a GFF file in Pandas dataframe form
attributes	dict	{‘gene_id’: ‘ID’, ‘gene_name’: ‘gene’}

source

syntenic_block_borders

 syntenic_block_borders (gff:pandas.core.frame.DataFrame,
                         flank_length:int=None, start:str='start',
                         end:str='end')

A function to calculate the boundaries of a syntenic block. It automatically pads the boundaries by an additional 5% of total length on both ends.

	Type	Default	Details
gff	DataFrame		a GFF in Pandas dataframe form. Only includes the genes of the syntenic block in question.
flank_length	int	None	the amount of space to be granted on both sides of the syntenic region, in basepairs. If unspecified, it will be set to 5% of the syntenic block length.
start	str	start	the GFF column with the start position of the gene (“start”).
end	str	end	the GFF column with the end position of the gene (“end”).
Returns	(<class ‘int’>, <class ‘int’>)

test_fail(
    syntenic_block_borders,
    contains="The parameter `flank_length` has to be an integer",
    args=(gff,),
    kwargs=dict(flank_length=5.2),
)

The process of reading in a GFF3 file can be expedited with the

source

read_aln

 read_aln (m8:str, id_sep:str|None=None, **kwargs)

Reads

	Type	Default	Details
m8	str		the path to the MMseqs2 alignment table file
id_sep	str \| None	None
kwargs	VAR_KEYWORD
Returns	DataFrame		the tabulated form of the alignment results.

source

estimate_plot_size

 estimate_plot_size (gff, width_factor:int=3, height:int=2)

assert estimate_plot_size(gff[gff["type"] == "gene"]) == (45, 2)

source

insert_gap

 insert_gap (gff:pandas.core.frame.DataFrame, locus1=None, locus2=None,
             identifier='gene_id', purge_columns=None, no_gaps=1)

This function inserts a number of dummy entries between two loci (lines) in the GFF DataFrame.

test_fail(
    insert_gap,
    contains="The two loci are not consecutive;",
    args=(hox,),
    kwargs={"locus1": "PB.8615", "locus2": "g9720", "identifier": "gene_id"},
)

hox = insert_gap(
    hox,
    locus1="PB.8615",
    locus2="g9718",
    identifier="gene_id",
    no_gaps=4,
    purge_columns=["gene_name"],
)

hox["strand"] = "-"

hox

	seqid	gene_name	gene_id	start	end	strand
0	pseudochrom_56	Hox1-A	PB.8615	1927066	1936157	-
1	pseudochrom_56		gap_PB.8615-0	1936158	1936159	-
2	pseudochrom_56		gap_PB.8615-1	1936160	1936161	-
3	pseudochrom_56		gap_PB.8615-2	1936162	1936163	-
4	pseudochrom_56		gap_PB.8615-3	1936164	1936165	-
5	pseudochrom_56	Hox2-A	g9718	1998922	2024148	-
6	pseudochrom_56	Hox3-A	PB.8616	2058396	2065953	-
7	pseudochrom_56	Hox4-A	g9720	2195412	2206712	-
8	pseudochrom_56	Hox5-A	g9721	2351936	2354374	-
9	pseudochrom_56	Hox6-A	PB.8617	2373415	2375678	-
10	pseudochrom_56	Hox7-A	g9723	2565196	2594468	-
11	pseudochrom_56	Hox8-A	g9724	2916314	2926445	-
12	pseudochrom_56	Hox10-A	g9725	2986021	2996225	-

hox = insert_gap(hox, locus2="PB.8615", no_gaps=1, purge_columns=["gene_name"])
hox = insert_gap(hox, locus1="g9725", no_gaps=1, purge_columns=["gene_name"])

source

flip

 flip (gff)

flip(hox)

	seqid	source	type	start	end	score	strand	phase	attributes	gene_id	gene_name
0	pseudochrom_56	PacBio	gene	-1927066	-1936157	.	+	.	ID=PB.8615;function=Homeobox domain;gene=Hox1-...	PB.8615	Hox1-A
1	pseudochrom_56	AUGUSTUS	gene	-1998922	-2024148	.	+	.	ID=g9718;function=sequence-specific DNA bindin...	g9718	Hox2-A
2	pseudochrom_56	PacBio	gene	-2058396	-2065953	.	+	.	ID=PB.8616;function=homeobox protein;gene=Hox3...	PB.8616	Hox3-A
3	pseudochrom_56	AUGUSTUS	gene	-2195412	-2206712	.	+	.	ID=g9720;function=sequence-specific DNA bindin...	g9720	Hox4-A
4	pseudochrom_56	AUGUSTUS	gene	-2351936	-2354374	.	+	.	ID=g9721;function=sequence-specific DNA bindin...	g9721	Hox5-A
5	pseudochrom_56	PacBio	gene	-2373415	-2375678	.	+	.	ID=PB.8617;function=sequence-specific DNA bind...	PB.8617	Hox6-A
6	pseudochrom_56	AUGUSTUS	gene	-2565196	-2594468	.	+	.	ID=g9723;function=sequence-specific DNA bindin...	g9723	Hox7-A
7	pseudochrom_56	AUGUSTUS	gene	-2916314	-2926445	.	+	.	ID=g9724;function=sequence-specific DNA bindin...	g9724	Hox8-A
8	pseudochrom_56	AUGUSTUS	gene	-2986021	-2996225	.	+	.	ID=g9725;function=sequence-specific DNA bindin...	g9725	Hox10-A

source

insert_break

 insert_break (gff:pandas.core.frame.DataFrame, locus1=None, locus2=None,
               identifier='gene_id')

This function inserts a molecule break between two loci (lines) in the GFF DataFrame.

keep = gff["gene_id"].isin(hox_genes)
hox = gff[keep].reset_index(drop=True)
decorate(hox)

on_scaff44 = gff["seqid"] == "scaffold_44"
is_gene = gff["type"] == "gene"
hoxc = gff[is_gene & on_scaff44].reset_index(drop=True)
decorate(hoxc)

interrupted = pd.concat((hox.loc[6:], hoxc)).reset_index(drop=True)

interrupted

	seqid	source	type	start	end	score	strand	phase	attributes	gene_id	gene_name
0	pseudochrom_56	AUGUSTUS	gene	2565196	2594468	.	-	.	ID=g9723;function=sequence-specific DNA bindin...	g9723	Hox7-A
1	pseudochrom_56	AUGUSTUS	gene	2916314	2926445	.	-	.	ID=g9724;function=sequence-specific DNA bindin...	g9724	Hox8-A
2	pseudochrom_56	AUGUSTUS	gene	2986021	2996225	.	-	.	ID=g9725;function=sequence-specific DNA bindin...	g9725	Hox10-A
3	scaffold_44	PacBio	gene	1927066	1936157	.	-	.	ID=PB.1762;function=Homeobox domain;gene=Hox11...	PB.1762	Hox11
4	scaffold_44	AUGUSTUS	gene	1998922	2024148	.	-	.	ID=g13061;function=sequence-specific DNA bindi...	g13061	Hox12

interrupted = insert_break(interrupted, locus1="g9725", locus2="PB.1762")

interrupted

	seqid	source	type	start	end	score	strand	phase	attributes	gene_id	gene_name
0	pseudochrom_56	AUGUSTUS	gene	2565196	2594468	.	-	.	ID=g9723;function=sequence-specific DNA bindin...	g9723	Hox7-A
1	pseudochrom_56	AUGUSTUS	gene	2916314	2926445	.	-	.	ID=g9724;function=sequence-specific DNA bindin...	g9724	Hox8-A
2	pseudochrom_56	AUGUSTUS	gene	2986021	2996225	.	-	.	ID=g9725;function=sequence-specific DNA bindin...	g9725	Hox10-A
3										break
4	scaffold_44	PacBio	gene	1927066	1936157	.	-	.	ID=PB.1762;function=Homeobox domain;gene=Hox11...	PB.1762	Hox11
5	scaffold_44	AUGUSTUS	gene	1998922	2024148	.	-	.	ID=g13061;function=sequence-specific DNA bindi...	g13061	Hox12