The information about the location of genes in a genome can come from anywhere, but it is typically stored in a GFF3 file. While this is far from a standardized format, there are guidelines on what a (syntactically) valid GFF3 file should look like, such as the Sequence Ontology specification (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). Such a typical, 9-column, tab-separated file is what I have mostly been working with. If your GFF3 looks different you will have to figure out how to read it in yourself - or open an issue and I can try to help you.
A function to read a GFF3 file. Expects 9 tab-separated fields, and will try to name the columns according to https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
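As a rough illustration of what such a reader can look like, here is a minimal sketch using `pandas.read_csv`. The helper name `read_gff_sketch` and the in-memory example are hypothetical and are not part of the package; the column names are the ones listed in the GFF3 specification. Using `comment="#"` is a simplifying assumption: it skips header and directive lines, but it would also truncate any data line that contains a literal `#`.

```python
import io

import pandas as pd

# Column names as listed in the GFF3 specification.
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def read_gff_sketch(path_or_buffer):
    """Read a 9-column, tab-separated GFF3 file (hypothetical helper)."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        comment="#",      # assumption: '#' only appears in header/directive lines
        header=None,      # GFF3 files have no column header row
        names=GFF3_COLUMNS,
    )


# A tiny made-up GFF3 file, held in memory for the sake of the example.
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tsketch\tgene\t100\t500\t.\t+\t.\tID=gene1;Name=Hox1\n"
    "chr1\tsketch\tgene\t900\t1400\t.\t-\t.\tID=gene2;Name=Hox2\n"
)
gff_example = read_gff_sketch(example)
```

The same call should work with a file path instead of the `StringIO` buffer.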
For instance, let’s try and read an example. This is a fake GFF3 file based on the Pycnogonum litorale Hox gene cluster; for the real one, please refer to our work on the sea spider.
To begin with, we would like to visualize the ‘real’ Hox genes. We are also only interested in plotting entire genes (no exon structure), so we can filter the GFF rows based on that. We should also extract the gene IDs from the table to help us filter. Furthermore, we should be extracting gene names, in case we want to plot them too.
A function that extracts the value of a specified field from the attributes of a GFF3 line. Should be used on the attributes field of the corresponding pandas DataFrame.
| | Type | Default | Details |
|---|---|---|---|
| line | | | the attributes field of a GFF3 line |
| sep | str | ; | the field separator. Should be a semicolon for a GFF3 file. |
| select | str | ID | the field ID. Should be one that is included in the GFF3 file. Refer to https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for possibilities, or choose a manually defined tag that you know is present in the file. |
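To make the idea concrete, the sketch below shows one way such an extractor could work, and how it might be combined with filtering for whole genes. The name `extract_attribute_sketch` and the toy table are hypothetical; returning `None` for a missing tag is an assumption, not necessarily what the package does.

```python
import pandas as pd


def extract_attribute_sketch(line, sep=";", select="ID"):
    """Return the value of tag `select` in a GFF3 attributes string.

    Hypothetical helper; returns None if the tag is absent (an assumption).
    """
    for field in line.split(sep):
        # Each field is expected to look like "tag=value".
        key, _, value = field.strip().partition("=")
        if key == select:
            return value
    return None


# A made-up table with only the two columns needed for the example.
toy = pd.DataFrame(
    {
        "type": ["gene", "exon"],
        "attributes": ["ID=gene1;Name=Hox1", "ID=exon1;Parent=gene1"],
    }
)

# Keep whole genes only, then pull the ID and name out of the attributes.
genes = toy[toy["type"] == "gene"].copy()
genes["gene_id"] = genes["attributes"].apply(extract_attribute_sketch)
genes["name"] = genes["attributes"].apply(
    lambda attrs: extract_attribute_sketch(attrs, select="Name")
)
```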
This table basically already contains all the information we need:

* the name of the chromosome
* the location of each gene
* the strand of each gene
* the directionality of each gene (given by relative start/end positions)
* the name of each gene (hidden in the attributes)
The only thing that’s left is to extract this information in the way that is needed for the plotter:
A function to calculate the boundaries of a syntenic block. It automatically pads the boundaries by an additional 5% of total length on both ends.
| | Type | Default | Details |
|---|---|---|---|
| gff | DataFrame | | a GFF in Pandas dataframe form. Only includes the genes of the syntenic block in question. |
| flank_length | int | None | the amount of space to be granted on both sides of the syntenic region, in basepairs. If unspecified, it will be set to 5% of the syntenic block length. |
| start | str | start | the GFF column with the start position of the gene (“start”). |
| end | str | end | the GFF column with the end position of the gene (“end”). |
| Returns | (int, int) | | |
test_fail( syntenic_block_borders, contains="The parameter `flank_length` has to be an integer", args=(gff,), kwargs=dict(flank_length=5.2),)
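The padding logic described above can be sketched as follows. This is not the package's implementation: `block_borders_sketch` is a hypothetical name, taking the minimum and maximum over both coordinate columns is an assumption, the exception type is a guess (only the message text comes from the test above), and the 5% default is computed with integer division so that the result stays an integer.

```python
import pandas as pd


def block_borders_sketch(gff, flank_length=None, start="start", end="end"):
    """Return padded (left, right) borders of a syntenic block (hypothetical helper)."""
    # Assumption: look at both columns, in case start > end for some genes.
    block_start = int(gff[[start, end]].min().min())
    block_end = int(gff[[start, end]].max().max())

    if flank_length is None:
        # 5% of the block length, using integer division to stay in integers.
        flank_length = (block_end - block_start) // 20
    if not isinstance(flank_length, int):
        # Exception type is an assumption; the message mirrors the test above.
        raise TypeError("The parameter `flank_length` has to be an integer")

    return block_start - flank_length, block_end + flank_length


# Two made-up genes spanning positions 1000 to 5000.
demo = pd.DataFrame({"start": [1000, 3000], "end": [2000, 5000]})
borders_default = block_borders_sketch(demo)
borders_fixed = block_borders_sketch(demo, flank_length=100)
```

With the toy table, the block is 4000 bp long, so the default flank is 200 bp on each side.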
The process of reading in a GFF3 file can be expedited with the
This function inserts a number of dummy entries between two loci (lines) in the GFF DataFrame.
test_fail( insert_gap, contains="The two loci are not consecutive;", args=(hox,), kwargs={"locus1": "PB.8615", "locus2": "g9720", "identifier": "gene_id"},)
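For orientation, here is a minimal sketch of how dummy rows could be inserted between two consecutive loci. Everything about it is an assumption made for illustration: the name `insert_gap_sketch`, the `n_dummies` parameter, the `gap_<i>` identifiers, the exception type, and the wording of the error message after the semicolon. Dummy rows only carry the identifier column, so all other columns are left missing.

```python
import pandas as pd


def insert_gap_sketch(gff, locus1, locus2, identifier="gene_id", n_dummies=1):
    """Insert `n_dummies` placeholder rows between two adjacent loci (hypothetical helper)."""
    gff = gff.reset_index(drop=True)

    # Row positions of the two loci (first match each).
    pos1 = gff.index[gff[identifier] == locus1][0]
    pos2 = gff.index[gff[identifier] == locus2][0]
    if abs(pos1 - pos2) != 1:
        raise ValueError(
            "The two loci are not consecutive; cannot insert a gap between them."
        )

    upper = min(pos1, pos2)
    # Placeholder rows: only the identifier is filled in.
    dummies = pd.DataFrame({identifier: [f"gap_{i}" for i in range(n_dummies)]})

    return pd.concat(
        [gff.iloc[: upper + 1], dummies, gff.iloc[upper + 1 :]],
        ignore_index=True,
    )


# Three made-up loci in genomic order.
loci = pd.DataFrame({"gene_id": ["a", "b", "c"], "start": [10, 50, 90]})
with_gap = insert_gap_sketch(loci, "a", "b", n_dummies=2)
```

Because the placeholder rows have no coordinates, numeric columns such as `start` will contain missing values in the result and are upcast to floats by pandas.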