# util The information about the location of genes in a genome can come from anywhere, but it is typically stored in a GFF3 file. While this is far from a standardized format, there exist some *guidelines* about how a (syntactically) valid GFF3 file should look like; they can be found [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). Such a typical, 9-column, tab-separated file is what I have mostly been working with. If your GFF3 looks different you will have to figure out how to read it in yourself - or [open an issue](https://github.com/galicae/geneorder/issues/new) and I can try to help you. ------------------------------------------------------------------------ source ### read_gff > read_gff (loc:str|pathlib.Path, gff_columns:list=['seqid', 'source', > 'type', 'start', 'end', 'score', 'strand', 'phase', > 'attributes'], skiprows:int=1, header:Union[int,collections.abc > .Sequence[int],NoneType,Literal['infer']]=None, sep:str='\t', > **kwargs) *A function to read a GFF3 file. Expects 9 tab-separated fields, and will try to name the columns according to https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md*

	Type	Default	Details
loc	str \| pathlib.Path		input filepath
gff_columns	list	[‘seqid’, ‘source’, ‘type’, ‘start’, ‘end’, ‘score’, ‘strand’, ‘phase’, ‘attributes’]
skiprows	int	1	how many rows to skip in the beginning
header	Union	None
sep	str		separator for the table
kwargs	VAR_KEYWORD
Returns	DataFrame		the GFF3 file in DataFrame form

For instance, let’s try and read an example. This is a fake GFF3 file based on the *Pycnogonum litorale* Hox gene cluster; for the real one, please refer to our [work on the sea spider](https://doi.org/10.1101/2024.11.20.624475). ``` python gff = read_gff(os.environ["EXAMPLE_DATA_PATH"] + "plit.gff3") gff ```

	seqid	source	type	start	end	score	strand	phase	attributes
0	pseudochrom_56	PacBio	gene	1927066	1936157	.	-	.	ID=PB.8615;function=Homeobox domain;gene=Hox1-...
1	pseudochrom_56	PacBio	mRNA	1927066	1936157	.	-	.	ID=PB.8615.1;Parent=PB.8615;function=Homeobox ...
2	pseudochrom_56	PacBio	exon	1927066	1928028	.	-	.	ID=PB.8615.1.exon1;Parent=PB.8615.1;function=H...
3	pseudochrom_56	PacBio	exon	1935229	1936157	.	-	.	ID=PB.8615.1.exon2;Parent=PB.8615.1;function=H...
4	pseudochrom_56	PacBio	CDS	1927066	1928028	.	-	1	ID=PB.8615.1.CDS1;Parent=PB.8615.1;function=Ho...
...	...	...	...	...	...	...	...	...	...
134	scaffold_44	AUGUSTUS	CDS	1998922	2000654	0.72	-	2	ID=g13061.t1.CDS1;Parent=g13061.t1;function=se...
135	scaffold_44	AUGUSTUS	CDS	2023821	2024148	0.53	-	0	ID=g13061.t1.CDS2;Parent=g13061.t1;function=se...
136	scaffold_44	GeneMark.hmm3	mRNA	2023807	2024148	.	-	.	ID=g13061.t2;Parent=g13061;function=sequence-s...
137	scaffold_44	GeneMark.hmm3	exon	2023807	2024148	.	-	0	ID=g13061.t2.exon1;Parent=g13061.t2;function=s...
138	scaffold_44	GeneMark.hmm3	CDS	2023807	2024148	.	-	0	ID=g13061.t2.CDS1;Parent=g13061.t2;function=se...

139 rows × 9 columns

To begin with, we would like to visualize the ‘real’ Hox genes. We are also only interested in plotting entire genes (no exon structure), so we can filter the GFF rows based on that. We should also extract the gene IDs from the table to help us filter. Furthermore, we should be extracting gene names, in case we want to plot them too. ------------------------------------------------------------------------ source ### gff_attribute_selector > gff_attribute_selector (line, sep:str=';', select='ID') *A function that extracts the value of a specified field from the attributes of a GFF3 line. Should be used on the `attributes` field of the corresponding pandas DataFrame.*

	Type	Default	Details
line			the attributes field of a GFF3 line
sep	str	;	the field separator. Should be a semicolon for a GFF3 file.
select	str	ID	the field ID. Should be one that is included in the GFF3 file. Refer to https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for possibilities, or choose a manually defined tag that you know is present in the file.
Returns	str		the value of field `select`

``` python gff["gene_id"] = gff["attributes"].apply( lambda x: gff_attribute_selector(x, select="ID") ) gff["gene_name"] = gff["attributes"].apply( lambda x: gff_attribute_selector(x, select="gene") ) ``` ``` python hox_genes = [ "PB.8615", "g9718", "PB.8616", "g9720", "g9721", "PB.8617", "g9723", "g9724", "g9725", ] is_hox = gff["gene_id"].isin(hox_genes) is_gene = gff["type"] == "gene" slim = gff[is_gene & is_hox] slim ```

	seqid	source	type	start	end	score	strand	phase	attributes	gene_id	gene_name
0	pseudochrom_56	PacBio	gene	1927066	1936157	.	-	.	ID=PB.8615;function=Homeobox domain;gene=Hox1-...	PB.8615	Hox1-A
6	pseudochrom_56	AUGUSTUS	gene	1998922	2024148	.	-	.	ID=g9718;function=sequence-specific DNA bindin...	g9718	Hox2-A
15	pseudochrom_56	PacBio	gene	2058396	2065953	.	-	.	ID=PB.8616;function=homeobox protein;gene=Hox3...	PB.8616	Hox3-A
56	pseudochrom_56	AUGUSTUS	gene	2195412	2206712	.	-	.	ID=g9720;function=sequence-specific DNA bindin...	g9720	Hox4-A
62	pseudochrom_56	AUGUSTUS	gene	2351936	2354374	.	-	.	ID=g9721;function=sequence-specific DNA bindin...	g9721	Hox5-A
68	pseudochrom_56	PacBio	gene	2373415	2375678	.	-	.	ID=PB.8617;function=sequence-specific DNA bind...	PB.8617	Hox6-A
79	pseudochrom_56	AUGUSTUS	gene	2565196	2594468	.	-	.	ID=g9723;function=sequence-specific DNA bindin...	g9723	Hox7-A
85	pseudochrom_56	AUGUSTUS	gene	2916314	2926445	.	-	.	ID=g9724;function=sequence-specific DNA bindin...	g9724	Hox8-A
91	pseudochrom_56	AUGUSTUS	gene	2986021	2996225	.	-	.	ID=g9725;function=sequence-specific DNA bindin...	g9725	Hox10-A

This table basically already contains all the information we need: \* the name of the chromosome \* the location of each gene \* the strand of each gene \* the directionality of each gene (given by relative start/end positions) \* the name of each gene (hidden in the attributes) The only thing that’s left is to extract this information in the way that is needed for the plotter: ``` python hox = slim[["seqid", "gene_name", "gene_id", "start", "end"]].reset_index(drop=True) hox ```

	seqid	gene_name	gene_id	start	end
0	pseudochrom_56	Hox1-A	PB.8615	1927066	1936157
1	pseudochrom_56	Hox2-A	g9718	1998922	2024148
2	pseudochrom_56	Hox3-A	PB.8616	2058396	2065953
3	pseudochrom_56	Hox4-A	g9720	2195412	2206712
4	pseudochrom_56	Hox5-A	g9721	2351936	2354374
5	pseudochrom_56	Hox6-A	PB.8617	2373415	2375678
6	pseudochrom_56	Hox7-A	g9723	2565196	2594468
7	pseudochrom_56	Hox8-A	g9724	2916314	2926445
8	pseudochrom_56	Hox10-A	g9725	2986021	2996225

------------------------------------------------------------------------ source ### filter > filter (gff, filter_by_type=True, filter_type='gene', > filter_by_field=True, field='gene_id', field_values=None) ------------------------------------------------------------------------ source ### decorate > decorate (gff:pandas.core.frame.DataFrame, attributes:dict={'gene_id': > 'ID', 'gene_name': 'gene'}) *A function that*

	Type	Default	Details
gff	DataFrame		a GFF file in Pandas dataframe form
attributes	dict	{‘gene_id’: ‘ID’, ‘gene_name’: ‘gene’}

------------------------------------------------------------------------ source ### syntenic_block_borders > syntenic_block_borders (gff:pandas.core.frame.DataFrame, > flank_length:int=None, start:str='start', > end:str='end') *A function to calculate the boundaries of a syntenic block. It automatically pads the boundaries by an additional 5% of total length on both ends.*

	Type	Default	Details
gff	DataFrame		a GFF in Pandas dataframe form. Only includes the genes of the syntenic block in question.
flank_length	int	None	the amount of space to be granted on both sides of the syntenic region, in basepairs. If unspecified, it will be set to 5% of the syntenic block length.
start	str	start	the GFF column with the start position of the gene (“start”).
end	str	end	the GFF column with the end position of the gene (“end”).
Returns	(<class ‘int’>, <class ‘int’>)

``` python test_fail( syntenic_block_borders, contains="The parameter `flank_length` has to be an integer", args=(gff,), kwargs=dict(flank_length=5.2), ) ``` The process of reading in a GFF3 file can be expedited with the ------------------------------------------------------------------------ source ### read_aln > read_aln (m8:str, id_sep:str|None=None, **kwargs) *Reads*

	Type	Default	Details
m8	str		the path to the MMseqs2 alignment table file
id_sep	str \| None	None
kwargs	VAR_KEYWORD
Returns	DataFrame		the tabulated form of the alignment results.

------------------------------------------------------------------------ source ### estimate_plot_size > estimate_plot_size (gff, width_factor:int=3, height:int=2) ``` python assert estimate_plot_size(gff[gff["type"] == "gene"]) == (45, 2) ``` ------------------------------------------------------------------------ source ### insert_gap > insert_gap (gff:pandas.core.frame.DataFrame, locus1=None, locus2=None, > identifier='gene_id', purge_columns=None, no_gaps=1) *This function inserts a number of dummy entries between two loci (lines) in the GFF DataFrame.* ``` python test_fail( insert_gap, contains="The two loci are not consecutive;", args=(hox,), kwargs={"locus1": "PB.8615", "locus2": "g9720", "identifier": "gene_id"}, ) ``` ``` python hox = insert_gap( hox, locus1="PB.8615", locus2="g9718", identifier="gene_id", no_gaps=4, purge_columns=["gene_name"], ) ``` ``` python hox["strand"] = "-" ``` ``` python hox ```

	seqid	gene_name	gene_id	start	end	strand
0	pseudochrom_56	Hox1-A	PB.8615	1927066	1936157	-
1	pseudochrom_56		gap_PB.8615-0	1936158	1936159	-
2	pseudochrom_56		gap_PB.8615-1	1936160	1936161	-
3	pseudochrom_56		gap_PB.8615-2	1936162	1936163	-
4	pseudochrom_56		gap_PB.8615-3	1936164	1936165	-
5	pseudochrom_56	Hox2-A	g9718	1998922	2024148	-
6	pseudochrom_56	Hox3-A	PB.8616	2058396	2065953	-
7	pseudochrom_56	Hox4-A	g9720	2195412	2206712	-
8	pseudochrom_56	Hox5-A	g9721	2351936	2354374	-
9	pseudochrom_56	Hox6-A	PB.8617	2373415	2375678	-
10	pseudochrom_56	Hox7-A	g9723	2565196	2594468	-
11	pseudochrom_56	Hox8-A	g9724	2916314	2926445	-
12	pseudochrom_56	Hox10-A	g9725	2986021	2996225	-

``` python hox = insert_gap(hox, locus2="PB.8615", no_gaps=1, purge_columns=["gene_name"]) hox = insert_gap(hox, locus1="g9725", no_gaps=1, purge_columns=["gene_name"]) ``` ------------------------------------------------------------------------ source ### flip > flip (gff) ``` python flip(hox) ```

	seqid	source	type	start	end	score	strand	phase	attributes	gene_id	gene_name
0	pseudochrom_56	PacBio	gene	-1927066	-1936157	.	+	.	ID=PB.8615;function=Homeobox domain;gene=Hox1-...	PB.8615	Hox1-A
1	pseudochrom_56	AUGUSTUS	gene	-1998922	-2024148	.	+	.	ID=g9718;function=sequence-specific DNA bindin...	g9718	Hox2-A
2	pseudochrom_56	PacBio	gene	-2058396	-2065953	.	+	.	ID=PB.8616;function=homeobox protein;gene=Hox3...	PB.8616	Hox3-A
3	pseudochrom_56	AUGUSTUS	gene	-2195412	-2206712	.	+	.	ID=g9720;function=sequence-specific DNA bindin...	g9720	Hox4-A
4	pseudochrom_56	AUGUSTUS	gene	-2351936	-2354374	.	+	.	ID=g9721;function=sequence-specific DNA bindin...	g9721	Hox5-A
5	pseudochrom_56	PacBio	gene	-2373415	-2375678	.	+	.	ID=PB.8617;function=sequence-specific DNA bind...	PB.8617	Hox6-A
6	pseudochrom_56	AUGUSTUS	gene	-2565196	-2594468	.	+	.	ID=g9723;function=sequence-specific DNA bindin...	g9723	Hox7-A
7	pseudochrom_56	AUGUSTUS	gene	-2916314	-2926445	.	+	.	ID=g9724;function=sequence-specific DNA bindin...	g9724	Hox8-A
8	pseudochrom_56	AUGUSTUS	gene	-2986021	-2996225	.	+	.	ID=g9725;function=sequence-specific DNA bindin...	g9725	Hox10-A

------------------------------------------------------------------------ source ### insert_break > insert_break (gff:pandas.core.frame.DataFrame, locus1=None, locus2=None, > identifier='gene_id') *This function inserts a molecule break between two loci (lines) in the GFF DataFrame.* ``` python keep = gff["gene_id"].isin(hox_genes) hox = gff[keep].reset_index(drop=True) decorate(hox) on_scaff44 = gff["seqid"] == "scaffold_44" is_gene = gff["type"] == "gene" hoxc = gff[is_gene & on_scaff44].reset_index(drop=True) decorate(hoxc) ``` ``` python interrupted = pd.concat((hox.loc[6:], hoxc)).reset_index(drop=True) ``` ``` python interrupted ```

	seqid	source	type	start	end	score	strand	phase	attributes	gene_id	gene_name
0	pseudochrom_56	AUGUSTUS	gene	2565196	2594468	.	-	.	ID=g9723;function=sequence-specific DNA bindin...	g9723	Hox7-A
1	pseudochrom_56	AUGUSTUS	gene	2916314	2926445	.	-	.	ID=g9724;function=sequence-specific DNA bindin...	g9724	Hox8-A
2	pseudochrom_56	AUGUSTUS	gene	2986021	2996225	.	-	.	ID=g9725;function=sequence-specific DNA bindin...	g9725	Hox10-A
3	scaffold_44	PacBio	gene	1927066	1936157	.	-	.	ID=PB.1762;function=Homeobox domain;gene=Hox11...	PB.1762	Hox11
4	scaffold_44	AUGUSTUS	gene	1998922	2024148	.	-	.	ID=g13061;function=sequence-specific DNA bindi...	g13061	Hox12

``` python interrupted = insert_break(interrupted, locus1="g9725", locus2="PB.1762") ``` ``` python interrupted ```

	seqid	source	type	start	end	score	strand	phase	attributes	gene_id	gene_name
0	pseudochrom_56	AUGUSTUS	gene	2565196	2594468	.	-	.	ID=g9723;function=sequence-specific DNA bindin...	g9723	Hox7-A
1	pseudochrom_56	AUGUSTUS	gene	2916314	2926445	.	-	.	ID=g9724;function=sequence-specific DNA bindin...	g9724	Hox8-A
2	pseudochrom_56	AUGUSTUS	gene	2986021	2996225	.	-	.	ID=g9725;function=sequence-specific DNA bindin...	g9725	Hox10-A
3										break
4	scaffold_44	PacBio	gene	1927066	1936157	.	-	.	ID=PB.1762;function=Homeobox domain;gene=Hox11...	PB.1762	Hox11
5	scaffold_44	AUGUSTUS	gene	1998922	2024148	.	-	.	ID=g13061;function=sequence-specific DNA bindi...	g13061	Hox12