# util


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

Information about the location of genes in a genome can come from
anywhere, but it is typically stored in a GFF3 file. While the format is
not strictly standardized, there are *guidelines* for what a
(syntactically) valid GFF3 file should look like; they can be found
[here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md).
A typical 9-column, tab-separated file of this kind is what I have mostly
been working with. If your GFF3 looks different, you will have to figure
out how to read it in yourself, or [open an
issue](https://github.com/galicae/geneorder/issues/new) and I can try to
help you.

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L19"
target="_blank" style="float:right; font-size:smaller">source</a>

### read_gff

>  read_gff (loc:str|pathlib.Path, gff_columns:list=['seqid', 'source',
>                'type', 'start', 'end', 'score', 'strand', 'phase',
>                'attributes'], skiprows:int=1, header:Union[int,collections.abc
>                .Sequence[int],NoneType,Literal['infer']]=None, sep:str='\t',
>                **kwargs)

*A function to read a GFF3 file. Expects 9 tab-separated fields, and
will try to name the columns according to
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md*

<table>
<colgroup>
<col style="width: 6%" />
<col style="width: 25%" />
<col style="width: 34%" />
<col style="width: 34%" />
</colgroup>
<thead>
<tr>
<th></th>
<th><strong>Type</strong></th>
<th><strong>Default</strong></th>
<th><strong>Details</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>loc</td>
<td>str | pathlib.Path</td>
<td></td>
<td>input filepath</td>
</tr>
<tr>
<td>gff_columns</td>
<td>list</td>
<td>[‘seqid’, ‘source’, ‘type’, ‘start’, ‘end’, ‘score’, ‘strand’,
‘phase’, ‘attributes’]</td>
<td>names to assign to the nine GFF3 columns</td>
</tr>
<tr>
<td>skiprows</td>
<td>int</td>
<td>1</td>
<td>how many rows to skip in the beginning</td>
</tr>
<tr>
<td>header</td>
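<td>int | Sequence[int] | None | 'infer'</td>
<td>None</td>
<td>row number(s) holding the column names; <code>None</code> means the
file has no header row</td>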
</tr>
<tr>
<td>sep</td>
<td>str</td>
<td><code>\t</code></td>
<td>separator for the table</td>
</tr>
<tr>
<td>kwargs</td>
<td>VAR_KEYWORD</td>
<td></td>
<td></td>
</tr>
<tr>
<td><strong>Returns</strong></td>
<td><strong>DataFrame</strong></td>
<td></td>
<td><strong>the GFF3 file in DataFrame form</strong></td>
</tr>
</tbody>
</table>
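As a rough illustration of what reading such a file involves, here is a minimal sketch that uses plain pandas rather than `read_gff` itself. The toy file content and the variable names are made up for this example; only the column names and the defaults (`skiprows=1`, `header=None`, tab separator) are taken from the signature above.

``` python
import io

import pandas as pd

# Column names as listed in the read_gff signature.
GFF_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]

# A made-up, two-record GFF3. The first line is the version pragma,
# which skiprows=1 drops.
toy_gff3 = (
    "##gff-version 3\n"
    "chr1\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1;gene=toyA\n"
    "chr1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n"
)

toy = pd.read_csv(
    io.StringIO(toy_gff3),  # a file path works the same way
    sep="\t",
    names=GFF_COLUMNS,
    skiprows=1,
    header=None,
)
print(toy[["seqid", "type", "start", "end"]])
```

If your file starts with more than one comment line, `skiprows` needs to be adjusted accordingly.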

For instance, let’s try and read an example. This is a fake GFF3 file
based on the *Pycnogonum litorale* Hox gene cluster; for the real one,
please refer to our [work on the sea
spider](https://doi.org/10.1101/2024.11.20.624475).

``` python
gff = read_gff(os.environ["EXAMPLE_DATA_PATH"] + "plit.gff3")
gff
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">seqid</th>
<th data-quarto-table-cell-role="th">source</th>
<th data-quarto-table-cell-role="th">type</th>
<th data-quarto-table-cell-role="th">start</th>
<th data-quarto-table-cell-role="th">end</th>
<th data-quarto-table-cell-role="th">score</th>
<th data-quarto-table-cell-role="th">strand</th>
<th data-quarto-table-cell-role="th">phase</th>
<th data-quarto-table-cell-role="th">attributes</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>gene</td>
<td>1927066</td>
<td>1936157</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=PB.8615;function=Homeobox domain;gene=Hox1-...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>mRNA</td>
<td>1927066</td>
<td>1936157</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=PB.8615.1;Parent=PB.8615;function=Homeobox ...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>exon</td>
<td>1927066</td>
<td>1928028</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=PB.8615.1.exon1;Parent=PB.8615.1;function=H...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>exon</td>
<td>1935229</td>
<td>1936157</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=PB.8615.1.exon2;Parent=PB.8615.1;function=H...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>CDS</td>
<td>1927066</td>
<td>1928028</td>
<td>.</td>
<td>-</td>
<td>1</td>
<td>ID=PB.8615.1.CDS1;Parent=PB.8615.1;function=Ho...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">134</td>
<td>scaffold_44</td>
<td>AUGUSTUS</td>
<td>CDS</td>
<td>1998922</td>
<td>2000654</td>
<td>0.72</td>
<td>-</td>
<td>2</td>
<td>ID=g13061.t1.CDS1;Parent=g13061.t1;function=se...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">135</td>
<td>scaffold_44</td>
<td>AUGUSTUS</td>
<td>CDS</td>
<td>2023821</td>
<td>2024148</td>
<td>0.53</td>
<td>-</td>
<td>0</td>
<td>ID=g13061.t1.CDS2;Parent=g13061.t1;function=se...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">136</td>
<td>scaffold_44</td>
<td>GeneMark.hmm3</td>
<td>mRNA</td>
<td>2023807</td>
<td>2024148</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g13061.t2;Parent=g13061;function=sequence-s...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">137</td>
<td>scaffold_44</td>
<td>GeneMark.hmm3</td>
<td>exon</td>
<td>2023807</td>
<td>2024148</td>
<td>.</td>
<td>-</td>
<td>0</td>
<td>ID=g13061.t2.exon1;Parent=g13061.t2;function=s...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">138</td>
<td>scaffold_44</td>
<td>GeneMark.hmm3</td>
<td>CDS</td>
<td>2023807</td>
<td>2024148</td>
<td>.</td>
<td>-</td>
<td>0</td>
<td>ID=g13061.t2.CDS1;Parent=g13061.t2;function=se...</td>
</tr>
</tbody>
</table>

<p>139 rows × 9 columns</p>
</div>

To begin with, we would like to visualize the ‘real’ Hox genes. We only
want to plot entire genes (no exon structure), so we can filter the GFF
rows by feature type. To filter by gene, we first need to extract the
gene IDs from the attributes; we also extract the gene names, in case we
want to plot them too.

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L45"
target="_blank" style="float:right; font-size:smaller">source</a>

### gff_attribute_selector

>  gff_attribute_selector (line, sep:str=';', select='ID')

*A function that extracts the value of a specified field from the
attributes of a GFF3 line. Should be used on the `attributes` field of
the corresponding pandas DataFrame.*

<table>
<colgroup>
<col style="width: 6%" />
<col style="width: 25%" />
<col style="width: 34%" />
<col style="width: 34%" />
</colgroup>
<thead>
<tr>
<th></th>
<th><strong>Type</strong></th>
<th><strong>Default</strong></th>
<th><strong>Details</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>line</td>
<td></td>
<td></td>
<td>the attributes field of a GFF3 line</td>
</tr>
<tr>
<td>sep</td>
<td>str</td>
<td>;</td>
<td>the field separator. Should be a semicolon for a GFF3 file.</td>
</tr>
<tr>
<td>select</td>
<td>str</td>
<td>ID</td>
<td>the field ID. Should be one that is included in the GFF3 file. Refer
to
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
for possibilities, or choose a manually defined tag that you know is
present in the file.</td>
</tr>
<tr>
<td><strong>Returns</strong></td>
<td><strong>str</strong></td>
<td></td>
<td><strong>the value of field <code>select</code></strong></td>
</tr>
</tbody>
</table>
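To make the idea concrete, here is a small sketch of how a `tag=value` pair can be pulled out of an attributes string. `select_attribute` is a hypothetical stand-in written for this page, not the library's implementation, and the attributes string is a toy example.

``` python
def select_attribute(attributes: str, sep: str = ";", select: str = "ID"):
    """Return the value of tag `select` in a GFF3 attributes string.

    Hypothetical helper for illustration; returns None if the tag is absent.
    """
    for field in attributes.split(sep):
        # Each field looks like "tag=value"; partition splits on the first "=".
        key, _, value = field.strip().partition("=")
        if key == select:
            return value
    return None


toy_attributes = "ID=g1;function=toy domain;gene=toyA"
print(select_attribute(toy_attributes, select="ID"))    # g1
print(select_attribute(toy_attributes, select="gene"))  # toyA
```

How the real function treats a missing tag is not documented here, so returning `None` is an assumption of this sketch.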

``` python
gff["gene_id"] = gff["attributes"].apply(
    lambda x: gff_attribute_selector(x, select="ID")
)
gff["gene_name"] = gff["attributes"].apply(
    lambda x: gff_attribute_selector(x, select="gene")
)
```

``` python
hox_genes = [
    "PB.8615",
    "g9718",
    "PB.8616",
    "g9720",
    "g9721",
    "PB.8617",
    "g9723",
    "g9724",
    "g9725",
]
is_hox = gff["gene_id"].isin(hox_genes)
is_gene = gff["type"] == "gene"

slim = gff[is_gene & is_hox]
slim
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">seqid</th>
<th data-quarto-table-cell-role="th">source</th>
<th data-quarto-table-cell-role="th">type</th>
<th data-quarto-table-cell-role="th">start</th>
<th data-quarto-table-cell-role="th">end</th>
<th data-quarto-table-cell-role="th">score</th>
<th data-quarto-table-cell-role="th">strand</th>
<th data-quarto-table-cell-role="th">phase</th>
<th data-quarto-table-cell-role="th">attributes</th>
<th data-quarto-table-cell-role="th">gene_id</th>
<th data-quarto-table-cell-role="th">gene_name</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>gene</td>
<td>1927066</td>
<td>1936157</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=PB.8615;function=Homeobox domain;gene=Hox1-...</td>
<td>PB.8615</td>
<td>Hox1-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">6</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>1998922</td>
<td>2024148</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9718;function=sequence-specific DNA bindin...</td>
<td>g9718</td>
<td>Hox2-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">15</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>gene</td>
<td>2058396</td>
<td>2065953</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=PB.8616;function=homeobox protein;gene=Hox3...</td>
<td>PB.8616</td>
<td>Hox3-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">56</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2195412</td>
<td>2206712</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9720;function=sequence-specific DNA bindin...</td>
<td>g9720</td>
<td>Hox4-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">62</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2351936</td>
<td>2354374</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9721;function=sequence-specific DNA bindin...</td>
<td>g9721</td>
<td>Hox5-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">68</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>gene</td>
<td>2373415</td>
<td>2375678</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=PB.8617;function=sequence-specific DNA bind...</td>
<td>PB.8617</td>
<td>Hox6-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">79</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2565196</td>
<td>2594468</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9723;function=sequence-specific DNA bindin...</td>
<td>g9723</td>
<td>Hox7-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">85</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2916314</td>
<td>2926445</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9724;function=sequence-specific DNA bindin...</td>
<td>g9724</td>
<td>Hox8-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">91</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2986021</td>
<td>2996225</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9725;function=sequence-specific DNA bindin...</td>
<td>g9725</td>
<td>Hox10-A</td>
</tr>
</tbody>
</table>

</div>

This table basically already contains all the information we need:

* the name of the chromosome
* the location of each gene
* the strand of each gene
* the directionality of each gene (given by relative start/end positions)
* the name of each gene (hidden in the attributes)

The only thing that’s left is to extract this information in the way
that is needed for the plotter:

``` python
hox = slim[["seqid", "gene_name", "gene_id", "start", "end"]].reset_index(drop=True)
hox
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">seqid</th>
<th data-quarto-table-cell-role="th">gene_name</th>
<th data-quarto-table-cell-role="th">gene_id</th>
<th data-quarto-table-cell-role="th">start</th>
<th data-quarto-table-cell-role="th">end</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>pseudochrom_56</td>
<td>Hox1-A</td>
<td>PB.8615</td>
<td>1927066</td>
<td>1936157</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>pseudochrom_56</td>
<td>Hox2-A</td>
<td>g9718</td>
<td>1998922</td>
<td>2024148</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>pseudochrom_56</td>
<td>Hox3-A</td>
<td>PB.8616</td>
<td>2058396</td>
<td>2065953</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>pseudochrom_56</td>
<td>Hox4-A</td>
<td>g9720</td>
<td>2195412</td>
<td>2206712</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>pseudochrom_56</td>
<td>Hox5-A</td>
<td>g9721</td>
<td>2351936</td>
<td>2354374</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">5</td>
<td>pseudochrom_56</td>
<td>Hox6-A</td>
<td>PB.8617</td>
<td>2373415</td>
<td>2375678</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">6</td>
<td>pseudochrom_56</td>
<td>Hox7-A</td>
<td>g9723</td>
<td>2565196</td>
<td>2594468</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">7</td>
<td>pseudochrom_56</td>
<td>Hox8-A</td>
<td>g9724</td>
<td>2916314</td>
<td>2926445</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">8</td>
<td>pseudochrom_56</td>
<td>Hox10-A</td>
<td>g9725</td>
<td>2986021</td>
<td>2996225</td>
</tr>
</tbody>
</table>

</div>

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L75"
target="_blank" style="float:right; font-size:smaller">source</a>

### filter

>  filter (gff, filter_by_type=True, filter_type='gene',
>              filter_by_field=True, field='gene_id', field_values=None)
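
The signature carries no description, so the following is only one plausible reading of its parameters, modelled on the manual `is_gene & is_hox` filtering shown earlier. `filter_rows` is a hypothetical name, and the behaviour when `field_values` is `None` (no field filtering) is an assumption.

``` python
import pandas as pd


def filter_rows(gff, filter_by_type=True, filter_type="gene",
                filter_by_field=True, field="gene_id", field_values=None):
    """Keep rows of a given feature type and/or with given field values.

    Hypothetical sketch; not the library's implementation.
    """
    keep = pd.Series(True, index=gff.index)
    if filter_by_type:
        keep &= gff["type"] == filter_type
    if filter_by_field and field_values is not None:
        keep &= gff[field].isin(field_values)
    return gff[keep]


toy = pd.DataFrame({
    "type": ["gene", "mRNA", "gene"],
    "gene_id": ["g1", "g1.t1", "g2"],
})
print(filter_rows(toy, field_values=["g1"]))
```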

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L58"
target="_blank" style="float:right; font-size:smaller">source</a>

### decorate

>  decorate (gff:pandas.core.frame.DataFrame, attributes:dict={'gene_id':
>                'ID', 'gene_name': 'gene'})

*A function that adds new columns to a GFF DataFrame by extracting tags
from its `attributes` column.*

<table>
<thead>
<tr>
<th></th>
<th><strong>Type</strong></th>
<th><strong>Default</strong></th>
<th><strong>Details</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>gff</td>
<td>DataFrame</td>
<td></td>
<td>a GFF file in Pandas dataframe form</td>
</tr>
<tr>
<td>attributes</td>
<td>dict</td>
<td>{‘gene_id’: ‘ID’, ‘gene_name’: ‘gene’}</td>
<td>maps each new column name to the attribute tag it is extracted
from</td>
</tr>
</tbody>
</table>
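
Judging from the default `attributes` dictionary and from how `decorate` is called further down this page (without assigning its result), it appears to add one column per dictionary entry in place. A minimal sketch of that idea, using pandas' `str.extract`; `add_attribute_columns` is a hypothetical name and the in-place behaviour is an assumption:

``` python
import re

import pandas as pd


def add_attribute_columns(gff, attributes=None):
    """Add one column per entry of `attributes`, modifying `gff` in place.

    Hypothetical sketch; rows that lack a tag get NaN in that column.
    """
    if attributes is None:
        attributes = {"gene_id": "ID", "gene_name": "gene"}
    for column, tag in attributes.items():
        # Match "tag=value" at the start of the string or after a semicolon.
        pattern = rf"(?:^|;){re.escape(tag)}=([^;]*)"
        gff[column] = gff["attributes"].str.extract(pattern, expand=False)


toy = pd.DataFrame({"attributes": ["ID=g1;gene=toyA", "ID=g1.t1;Parent=g1"]})
add_attribute_columns(toy)
print(toy)
```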

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L106"
target="_blank" style="float:right; font-size:smaller">source</a>

### syntenic_block_borders

>  syntenic_block_borders (gff:pandas.core.frame.DataFrame,
>                              flank_length:int=None, start:str='start',
>                              end:str='end')

*A function to calculate the boundaries of a syntenic block. Unless
`flank_length` is given, it pads the boundaries by an additional 5% of
the block length on both ends.*

<table>
<colgroup>
<col style="width: 6%" />
<col style="width: 25%" />
<col style="width: 34%" />
<col style="width: 34%" />
</colgroup>
<thead>
<tr>
<th></th>
<th><strong>Type</strong></th>
<th><strong>Default</strong></th>
<th><strong>Details</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>gff</td>
<td>DataFrame</td>
<td></td>
<td>a GFF in Pandas dataframe form. Only includes the genes of the
syntenic block in question.</td>
</tr>
<tr>
<td>flank_length</td>
<td>int</td>
<td>None</td>
<td>the amount of space to be granted on both sides of the syntenic
region, in basepairs. If unspecified, it will be set to 5% of the
syntenic block length.</td>
</tr>
<tr>
<td>start</td>
<td>str</td>
<td>start</td>
<td>the GFF column with the start position of the gene (“start”).</td>
</tr>
<tr>
<td>end</td>
<td>str</td>
<td>end</td>
<td>the GFF column with the end position of the gene (“end”).</td>
</tr>
<tr>
<td><strong>Returns</strong></td>
<td><strong>(int, int)</strong></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
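
The arithmetic behind this can be sketched as follows. `block_borders` is a hypothetical helper; the rounding of the 5% flank (integer division here) and the exception type are assumptions, while the error message is the one checked in the test on this page.

``` python
import pandas as pd


def block_borders(gff, flank_length=None, start="start", end="end"):
    """Return (left, right) borders of a block, padded on both sides.

    Hypothetical sketch; not the library's implementation.
    """
    left = int(gff[start].min())
    right = int(gff[end].max())
    if flank_length is None:
        # 5% of the block length, i.e. one twentieth, rounded down.
        flank_length = (right - left) // 20
    elif not isinstance(flank_length, int):
        raise ValueError("The parameter `flank_length` has to be an integer")
    return left - flank_length, right + flank_length


toy = pd.DataFrame({"start": [100, 500], "end": [200, 1100]})
print(block_borders(toy))                   # (50, 1150)
print(block_borders(toy, flank_length=10))  # (90, 1110)
```

With a block spanning 100 to 1100, the length is 1000, so the default flank is 50 basepairs on each side.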

``` python
test_fail(
    syntenic_block_borders,
    contains="The parameter `flank_length` has to be an integer",
    args=(gff,),
    kwargs=dict(flank_length=5.2),
)
```

The process of reading in a GFF3 file can be expedited with the

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L125"
target="_blank" style="float:right; font-size:smaller">source</a>

### read_aln

>  read_aln (m8:str, id_sep:str|None=None, **kwargs)

*Reads an MMseqs2 alignment table (m8 file) into a DataFrame.*

<table>
<colgroup>
<col style="width: 6%" />
<col style="width: 25%" />
<col style="width: 34%" />
<col style="width: 34%" />
</colgroup>
<thead>
<tr>
<th></th>
<th><strong>Type</strong></th>
<th><strong>Default</strong></th>
<th><strong>Details</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>m8</td>
<td>str</td>
<td></td>
<td>the path to the MMseqs2 alignment table file</td>
</tr>
<tr>
<td>id_sep</td>
<td>str | None</td>
<td>None</td>
<td></td>
</tr>
<tr>
<td>kwargs</td>
<td>VAR_KEYWORD</td>
<td></td>
<td></td>
</tr>
<tr>
<td><strong>Returns</strong></td>
<td><strong>DataFrame</strong></td>
<td></td>
<td><strong>the tabulated form of the alignment results.</strong></td>
</tr>
</tbody>
</table>
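
For orientation, this is roughly what reading such a table looks like with plain pandas. The column names follow the default MMseqs2 tabular output as I understand it, so check them against your own file; the toy rows are made up, and the `id_sep` parameter is not covered because its behaviour is not described here.

``` python
import io

import pandas as pd

# Assumed default MMseqs2 m8 columns; adjust if you used --format-output.
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]

# Two made-up alignment records.
toy_m8 = (
    "geneA\tgeneX\t0.950\t200\t10\t0\t1\t200\t5\t204\t1e-50\t350\n"
    "geneB\tgeneY\t0.610\t150\t58\t1\t3\t152\t1\t150\t2e-12\t90\n"
)

aln = pd.read_csv(io.StringIO(toy_m8), sep="\t", names=M8_COLUMNS, header=None)
print(aln[["query", "target", "fident", "evalue"]])
```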

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L155"
target="_blank" style="float:right; font-size:smaller">source</a>

### estimate_plot_size

>  estimate_plot_size (gff, width_factor:int=3, height:int=2)

``` python
assert estimate_plot_size(gff[gff["type"] == "gene"]) == (45, 2)
```
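
The function has no description, so the sketch below is a guess. One reading consistent with the assertion above is that the width is the number of rows times `width_factor`, which would imply 15 gene rows in the example file (the truncated table on this page does not confirm that). `plot_size` is a hypothetical name.

``` python
import pandas as pd


def plot_size(gff, width_factor=3, height=2):
    """Guess a figure size: width grows with the number of rows.

    Hypothetical sketch; the real formula may differ.
    """
    return len(gff) * width_factor, height


toy = pd.DataFrame({"gene_id": [f"g{i}" for i in range(15)]})
print(plot_size(toy))  # (45, 2)
```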

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L164"
target="_blank" style="float:right; font-size:smaller">source</a>

### insert_gap

>  insert_gap (gff:pandas.core.frame.DataFrame, locus1=None, locus2=None,
>                  identifier='gene_id', purge_columns=None, no_gaps=1)

*This function inserts a number of dummy entries between two loci
(lines) in the GFF DataFrame.*

``` python
test_fail(
    insert_gap,
    contains="The two loci are not consecutive;",
    args=(hox,),
    kwargs={"locus1": "PB.8615", "locus2": "g9720", "identifier": "gene_id"},
)
```

``` python
hox = insert_gap(
    hox,
    locus1="PB.8615",
    locus2="g9718",
    identifier="gene_id",
    no_gaps=4,
    purge_columns=["gene_name"],
)
```

``` python
hox["strand"] = "-"
```

``` python
hox
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">seqid</th>
<th data-quarto-table-cell-role="th">gene_name</th>
<th data-quarto-table-cell-role="th">gene_id</th>
<th data-quarto-table-cell-role="th">start</th>
<th data-quarto-table-cell-role="th">end</th>
<th data-quarto-table-cell-role="th">strand</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>pseudochrom_56</td>
<td>Hox1-A</td>
<td>PB.8615</td>
<td>1927066</td>
<td>1936157</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>pseudochrom_56</td>
<td></td>
<td>gap_PB.8615-0</td>
<td>1936158</td>
<td>1936159</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>pseudochrom_56</td>
<td></td>
<td>gap_PB.8615-1</td>
<td>1936160</td>
<td>1936161</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>pseudochrom_56</td>
<td></td>
<td>gap_PB.8615-2</td>
<td>1936162</td>
<td>1936163</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>pseudochrom_56</td>
<td></td>
<td>gap_PB.8615-3</td>
<td>1936164</td>
<td>1936165</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">5</td>
<td>pseudochrom_56</td>
<td>Hox2-A</td>
<td>g9718</td>
<td>1998922</td>
<td>2024148</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">6</td>
<td>pseudochrom_56</td>
<td>Hox3-A</td>
<td>PB.8616</td>
<td>2058396</td>
<td>2065953</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">7</td>
<td>pseudochrom_56</td>
<td>Hox4-A</td>
<td>g9720</td>
<td>2195412</td>
<td>2206712</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">8</td>
<td>pseudochrom_56</td>
<td>Hox5-A</td>
<td>g9721</td>
<td>2351936</td>
<td>2354374</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">9</td>
<td>pseudochrom_56</td>
<td>Hox6-A</td>
<td>PB.8617</td>
<td>2373415</td>
<td>2375678</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">10</td>
<td>pseudochrom_56</td>
<td>Hox7-A</td>
<td>g9723</td>
<td>2565196</td>
<td>2594468</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">11</td>
<td>pseudochrom_56</td>
<td>Hox8-A</td>
<td>g9724</td>
<td>2916314</td>
<td>2926445</td>
<td>-</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">12</td>
<td>pseudochrom_56</td>
<td>Hox10-A</td>
<td>g9725</td>
<td>2986021</td>
<td>2996225</td>
<td>-</td>
</tr>
</tbody>
</table>

</div>
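
The gap rows in the table above follow a simple pattern: each one copies the row of `locus1`, gets an ID of the form `gap_<locus1>-<i>`, occupies two basepairs right after the previous entry, and has the purged columns blanked. A minimal sketch that reproduces this pattern is shown below. `add_gap_rows` is a hypothetical helper with a simplified interface; it skips the consecutiveness check and assumes a default 0..n-1 index with unique IDs.

``` python
import pandas as pd


def add_gap_rows(gff, after_id, n_gaps=1, identifier="gene_id", purge_columns=()):
    """Insert `n_gaps` dummy rows right after the row whose ID is `after_id`.

    Hypothetical sketch; not the library's implementation.
    """
    pos = gff.index[gff[identifier] == after_id][0]
    template = gff.loc[pos].to_dict()
    gaps = []
    for i in range(n_gaps):
        gap = dict(template)
        gap[identifier] = f"gap_{after_id}-{i}"
        # Each gap is two basepairs long and directly follows the previous one.
        gap["start"] = template["end"] + 1 + 2 * i
        gap["end"] = gap["start"] + 1
        for column in purge_columns:
            gap[column] = ""
        gaps.append(gap)
    pieces = [gff.loc[:pos], pd.DataFrame(gaps), gff.loc[pos + 1:]]
    return pd.concat(pieces).reset_index(drop=True)


toy = pd.DataFrame({
    "seqid": ["chr1", "chr1"],
    "gene_name": ["A", "B"],
    "gene_id": ["g1", "g2"],
    "start": [100, 500],
    "end": [200, 600],
})
print(add_gap_rows(toy, "g1", n_gaps=2, purge_columns=["gene_name"]))
```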

``` python
hox = insert_gap(hox, locus2="PB.8615", no_gaps=1, purge_columns=["gene_name"])
hox = insert_gap(hox, locus1="g9725", no_gaps=1, purge_columns=["gene_name"])
```

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L221"
target="_blank" style="float:right; font-size:smaller">source</a>

### flip

>  flip (gff)

``` python
flip(hox)
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">seqid</th>
<th data-quarto-table-cell-role="th">source</th>
<th data-quarto-table-cell-role="th">type</th>
<th data-quarto-table-cell-role="th">start</th>
<th data-quarto-table-cell-role="th">end</th>
<th data-quarto-table-cell-role="th">score</th>
<th data-quarto-table-cell-role="th">strand</th>
<th data-quarto-table-cell-role="th">phase</th>
<th data-quarto-table-cell-role="th">attributes</th>
<th data-quarto-table-cell-role="th">gene_id</th>
<th data-quarto-table-cell-role="th">gene_name</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>gene</td>
<td>-1927066</td>
<td>-1936157</td>
<td>.</td>
<td>+</td>
<td>.</td>
<td>ID=PB.8615;function=Homeobox domain;gene=Hox1-...</td>
<td>PB.8615</td>
<td>Hox1-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>-1998922</td>
<td>-2024148</td>
<td>.</td>
<td>+</td>
<td>.</td>
<td>ID=g9718;function=sequence-specific DNA bindin...</td>
<td>g9718</td>
<td>Hox2-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>gene</td>
<td>-2058396</td>
<td>-2065953</td>
<td>.</td>
<td>+</td>
<td>.</td>
<td>ID=PB.8616;function=homeobox protein;gene=Hox3...</td>
<td>PB.8616</td>
<td>Hox3-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>-2195412</td>
<td>-2206712</td>
<td>.</td>
<td>+</td>
<td>.</td>
<td>ID=g9720;function=sequence-specific DNA bindin...</td>
<td>g9720</td>
<td>Hox4-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>-2351936</td>
<td>-2354374</td>
<td>.</td>
<td>+</td>
<td>.</td>
<td>ID=g9721;function=sequence-specific DNA bindin...</td>
<td>g9721</td>
<td>Hox5-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">5</td>
<td>pseudochrom_56</td>
<td>PacBio</td>
<td>gene</td>
<td>-2373415</td>
<td>-2375678</td>
<td>.</td>
<td>+</td>
<td>.</td>
<td>ID=PB.8617;function=sequence-specific DNA bind...</td>
<td>PB.8617</td>
<td>Hox6-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">6</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>-2565196</td>
<td>-2594468</td>
<td>.</td>
<td>+</td>
<td>.</td>
<td>ID=g9723;function=sequence-specific DNA bindin...</td>
<td>g9723</td>
<td>Hox7-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">7</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>-2916314</td>
<td>-2926445</td>
<td>.</td>
<td>+</td>
<td>.</td>
<td>ID=g9724;function=sequence-specific DNA bindin...</td>
<td>g9724</td>
<td>Hox8-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">8</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>-2986021</td>
<td>-2996225</td>
<td>.</td>
<td>+</td>
<td>.</td>
<td>ID=g9725;function=sequence-specific DNA bindin...</td>
<td>g9725</td>
<td>Hox10-A</td>
</tr>
</tbody>
</table>

</div>
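
Going by the output above, flipping negates the `start` and `end` coordinates and swaps the strand while keeping the row order. A minimal sketch of that transformation, with `mirror` as a hypothetical name and the handling of unstranded (`.`) features as an assumption:

``` python
import pandas as pd


def mirror(gff):
    """Return a copy with negated coordinates and swapped strands.

    Hypothetical sketch; strands other than + and - are left unchanged.
    """
    flipped = gff.copy()
    flipped["start"] = -flipped["start"]
    flipped["end"] = -flipped["end"]
    swapped = flipped["strand"].map({"+": "-", "-": "+"})
    flipped["strand"] = swapped.fillna(flipped["strand"])
    return flipped


toy = pd.DataFrame({
    "start": [100, 500, 700],
    "end": [200, 600, 800],
    "strand": ["-", "+", "."],
})
print(mirror(toy))
```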

------------------------------------------------------------------------

<a
href="https://github.com/galicae/geneorder/blob/main/geneorder/util.py#L229"
target="_blank" style="float:right; font-size:smaller">source</a>

### insert_break

>  insert_break (gff:pandas.core.frame.DataFrame, locus1=None, locus2=None,
>                    identifier='gene_id')

*This function inserts a molecule break between two loci (lines) in the
GFF DataFrame.*

``` python
keep = gff["gene_id"].isin(hox_genes)
hox = gff[keep].reset_index(drop=True)
decorate(hox)

on_scaff44 = gff["seqid"] == "scaffold_44"
is_gene = gff["type"] == "gene"
hoxc = gff[is_gene & on_scaff44].reset_index(drop=True)
decorate(hoxc)
```

``` python
interrupted = pd.concat((hox.loc[6:], hoxc)).reset_index(drop=True)
```

``` python
interrupted
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">seqid</th>
<th data-quarto-table-cell-role="th">source</th>
<th data-quarto-table-cell-role="th">type</th>
<th data-quarto-table-cell-role="th">start</th>
<th data-quarto-table-cell-role="th">end</th>
<th data-quarto-table-cell-role="th">score</th>
<th data-quarto-table-cell-role="th">strand</th>
<th data-quarto-table-cell-role="th">phase</th>
<th data-quarto-table-cell-role="th">attributes</th>
<th data-quarto-table-cell-role="th">gene_id</th>
<th data-quarto-table-cell-role="th">gene_name</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2565196</td>
<td>2594468</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9723;function=sequence-specific DNA bindin...</td>
<td>g9723</td>
<td>Hox7-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2916314</td>
<td>2926445</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9724;function=sequence-specific DNA bindin...</td>
<td>g9724</td>
<td>Hox8-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2986021</td>
<td>2996225</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9725;function=sequence-specific DNA bindin...</td>
<td>g9725</td>
<td>Hox10-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>scaffold_44</td>
<td>PacBio</td>
<td>gene</td>
<td>1927066</td>
<td>1936157</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=PB.1762;function=Homeobox domain;gene=Hox11...</td>
<td>PB.1762</td>
<td>Hox11</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>scaffold_44</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>1998922</td>
<td>2024148</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g13061;function=sequence-specific DNA bindi...</td>
<td>g13061</td>
<td>Hox12</td>
</tr>
</tbody>
</table>

</div>

``` python
interrupted = insert_break(interrupted, locus1="g9725", locus2="PB.1762")
```

``` python
interrupted
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">seqid</th>
<th data-quarto-table-cell-role="th">source</th>
<th data-quarto-table-cell-role="th">type</th>
<th data-quarto-table-cell-role="th">start</th>
<th data-quarto-table-cell-role="th">end</th>
<th data-quarto-table-cell-role="th">score</th>
<th data-quarto-table-cell-role="th">strand</th>
<th data-quarto-table-cell-role="th">phase</th>
<th data-quarto-table-cell-role="th">attributes</th>
<th data-quarto-table-cell-role="th">gene_id</th>
<th data-quarto-table-cell-role="th">gene_name</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2565196</td>
<td>2594468</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9723;function=sequence-specific DNA bindin...</td>
<td>g9723</td>
<td>Hox7-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2916314</td>
<td>2926445</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9724;function=sequence-specific DNA bindin...</td>
<td>g9724</td>
<td>Hox8-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>pseudochrom_56</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>2986021</td>
<td>2996225</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g9725;function=sequence-specific DNA bindin...</td>
<td>g9725</td>
<td>Hox10-A</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>break</td>
<td></td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>scaffold_44</td>
<td>PacBio</td>
<td>gene</td>
<td>1927066</td>
<td>1936157</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=PB.1762;function=Homeobox domain;gene=Hox11...</td>
<td>PB.1762</td>
<td>Hox11</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">5</td>
<td>scaffold_44</td>
<td>AUGUSTUS</td>
<td>gene</td>
<td>1998922</td>
<td>2024148</td>
<td>.</td>
<td>-</td>
<td>.</td>
<td>ID=g13061;function=sequence-specific DNA bindi...</td>
<td>g13061</td>
<td>Hox12</td>
</tr>
</tbody>
</table>

</div>
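
As the table above shows, the break is represented by a row whose `gene_id` is `break` and whose other columns are empty. A minimal sketch of inserting such a row is given below. `add_break_row` is a hypothetical helper with a simplified interface (it takes only the ID of the row before the break) and assumes a default 0..n-1 index.

``` python
import pandas as pd


def add_break_row(gff, after_id, identifier="gene_id"):
    """Insert an otherwise empty 'break' row after the row with ID `after_id`.

    Hypothetical sketch; not the library's implementation.
    """
    pos = gff.index[gff[identifier] == after_id][0]
    blank = {column: "" for column in gff.columns}
    blank[identifier] = "break"
    pieces = [gff.loc[:pos], pd.DataFrame([blank]), gff.loc[pos + 1:]]
    return pd.concat(pieces).reset_index(drop=True)


toy = pd.DataFrame({
    "seqid": ["chr1", "scaffold_2"],
    "gene_id": ["g1", "g2"],
    "start": [100, 50],
})
print(add_break_row(toy, "g1"))
```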
