Commit 13d91f2

Add biopython_utilities image
1 parent 9c66716 commit 13d91f2

6 files changed

Lines changed: 180 additions & 0 deletions

biopython_utilities/Dockerfile

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
ARG biopython_version

FROM pegi3s/biopython:${biopython_version}

ADD scripts /opt/scripts

RUN chmod 777 /opt/scripts/*

ENV PATH="/opt/scripts:${PATH}"

biopython_utilities/README.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# This image belongs to a larger project called Bioinformatics Docker Images Project (http://pegi3s.github.io/dockerfiles)

# Biopython utilities

This Docker image contains Biopython-based scripts to perform different tasks.

# `plot_gene_distribution.py`

The `plot_gene_distribution.py` script represents a list of genes in a *GenomeDiagram*. The input data must be a TSV file with four columns: (1) the group to which the gene belongs (each group is drawn in a different color), (2) the name of the gene, (3) the start coordinate, and (4) the end coordinate. For instance, the test data available [here](https://github.com/pegi3s/dockerfiles/tree/master/biopython_utilities/test_data/test_plot_gene_distribution.tsv) contains the following 9 genes:

```
F-Box	Fbox1	320276	321550
F-Box	Fbox2	363707	364915
F-Box	Fbox3	473425	472151
F-Box	Fbox4	805518	807710
F-Box	Fbox5	812394	813713
F-Box	Fbox6	1542754	1541522
F-Box	Fbox7	1551260	1550496
F-Box	Fbox8	3672240	3673466
SRNase	SRNase	1545618	1547318
```

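As a rough illustration of the four-column format described above, the following sketch reads such rows into tuples. The helper name `read_genes` is hypothetical and is not part of the image. Skipping comment lines mirrors the script's behaviour of ignoring any line that contains `#`; skipping blank lines is an extra assumption of this sketch.

```python
import csv
import io


def read_genes(tsv_text):
    """Parse TSV text into (group, name, start, end) tuples.

    Hypothetical helper for illustration only; it is not shipped in the image.
    """
    genes = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank lines (assumption) and lines containing '#' (as the script does).
        if not row or any("#" in field for field in row):
            continue
        group, name, start, end = row[:4]
        genes.append((group, name, int(start), int(end)))
    return genes


sample = (
    "# group\tname\tstart\tend\n"
    "F-Box\tFbox1\t320276\t321550\n"
    "F-Box\tFbox3\t473425\t472151\n"
    "SRNase\tSRNase\t1545618\t1547318\n"
)
genes = read_genes(sample)
print(genes)
```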
By default, the script:
- Represents all genes in a single horizontal line. It is possible to set the number of axis breaks with the `--breaks` parameter (default is 20).
- Determines the start and end positions by taking the minimum and maximum coordinates from all the genes, extended by a margin of 1000 positions on each side. Nevertheless, it is possible to use the `--start` and `--end` parameters to define a custom interval.
- Represents all genes above the horizontal line. To preserve the gene strand and draw genes with *start > end* below the line, use the `--preserve-strand` parameter.
- Saves the figure in PDF. Use the `--format` parameter to specify a different one.

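The defaults listed above can be sketched in a few lines of plain Python. This is a simplified reading of the script in this commit, not the script itself: the names `normalize_gene`, `drawing_interval` and `PADDING` are hypothetical, and the 1000-position margin is the value the script applies when `--start` or `--end` is not given.

```python
PADDING = 1000  # margin applied on each side when no explicit bound is given


def normalize_gene(start, end, preserve_strand=False):
    """Return (start, end, strand) with start <= end.

    Genes given as start > end are swapped; they get strand -1 only when
    strands are preserved, otherwise every gene is drawn on strand 1.
    """
    if start > end:
        return end, start, (-1 if preserve_strand else 1)
    return start, end, 1


def drawing_interval(features, start=None, end=None):
    """Compute the interval to draw from (start, end, strand) tuples."""
    lo = min(f[0] for f in features) - PADDING if start is None else start
    hi = max(f[1] for f in features) + PADDING if end is None else end
    if lo > hi:
        raise ValueError("start must be less than end")
    return lo, hi


raw = [(320276, 321550), (473425, 472151), (3672240, 3673466)]
features = [normalize_gene(s, e, preserve_strand=True) for s, e in raw]
interval = drawing_interval(features)
print(features)
print(interval)
```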
Run the following command to show the script help: `docker run --rm pegi3s/biopython_utilities:1.78_0.1.0 plot_gene_distribution.py -h`

You should adapt and run the following command: `docker run --rm -v /your/data/dir:/data pegi3s/biopython_utilities:1.78_0.1.0 plot_gene_distribution.py /data/<input_TSV> -o /data/<output_image>`

In this command, you should replace:
- `/your/data/dir` with the directory that contains the input TSV file you want to process.
- `<input_TSV>` with the actual name of your input TSV file.
- `<output_image>` with the actual name of your output file (without the extension, which is automatically added by the script).

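When the command is launched from a pipeline rather than typed by hand, the three replacements above can be filled in programmatically. The sketch below only assembles the argument list; `build_command` is a hypothetical helper, and the directory, file names and extra options passed to it are placeholders. To actually run the container, the resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine with Docker available.

```python
import shlex

IMAGE = "pegi3s/biopython_utilities:1.78_0.1.0"


def build_command(data_dir, input_tsv, output_name, extra_args=()):
    """Assemble the `docker run` invocation as an argument list.

    Hypothetical helper: it substitutes the host data directory, the input
    TSV name and the output name (without extension) into the command.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/data",          # mount the host directory at /data
        IMAGE,
        "plot_gene_distribution.py",
        f"/data/{input_tsv}",
        "-o", f"/data/{output_name}",
        *extra_args,                         # e.g. --preserve-strand, --format
    ]


cmd = build_command(
    "/your/data/dir",
    "test_plot_gene_distribution.tsv",
    "gene_distribution",
    ["--preserve-strand", "--format", "svg"],
)
print(shlex.join(cmd))
```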
# Test data

To test this utility, the input TSV file is available [here](https://github.com/pegi3s/dockerfiles/tree/master/biopython_utilities/test_data/test_plot_gene_distribution.tsv).

# Changelog

The `latest` tag always contains the most recent version.

## [0.1.0] - 16/02/2021
- Initial `biopython_utilities` image containing the `plot_gene_distribution.py` utility.

# Building the image

To build this image, the version of the `pegi3s/biopython` image to use as base must be provided. When building from the command line, use `--build-arg biopython_version=1.78`.

biopython_utilities/hooks/build

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
#!/bin/bash

docker build --build-arg biopython_version="$biopython_version" -f "$DOCKERFILE_PATH" -t "$IMAGE_NAME" .
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
#!/usr/bin/python3

import math
import sys
import argparse

from random import seed
from random import random

from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio.SeqFeature import SeqFeature, FeatureLocation

colors_list = [colors.red, colors.blue, colors.yellow, colors.brown, colors.green, colors.darkmagenta, colors.greenyellow, colors.lavender, colors.purple, colors.navy, colors.orange, colors.paleturquoise, colors.powderblue, colors.rosybrown, colors.turquoise, colors.black]

parser = argparse.ArgumentParser(description='Create an image with the distribution of a list of genes. Lines in the input TSV file must have four columns: (1) the group to which the gene belongs (each group is drawn in a different color), (2) the name of the gene, (3) the start coordinate, and (4) the end coordinate.')

parser.add_argument('INPUT_FILE', help='input TSV file with gene names and coordinates')
parser.add_argument('-o', '--output', default='image', help='output image name (without file extension)')
parser.add_argument('-f', '--format', default='pdf', help='output image format: PS, PDF, SVG, JPG, BMP, GIF, PNG or TIFF (default is PDF)')
parser.add_argument('-s', '--start', help='start coordinate to draw (if not provided, it is automatically determined from the input data)')
parser.add_argument('-e', '--end', help='end coordinate to draw (if not provided, it is automatically determined from the input data)')
parser.add_argument('-p', '--preserve-strand', help='use this flag to preserve gene strands', action='store_true')
parser.add_argument('-b', '--breaks', default='20', help='number of axis breaks to represent the scale')
parser.add_argument('-r', '--random-seed', default='1', help='random seed for color generation')

arg = parser.parse_args()

# Read all input lines; the context manager closes the file afterwards
with open(arg.INPUT_FILE) as fh:
    lines = fh.readlines()

series_features = {}
series_indexes = {}

series_index = 0
for line in lines:
    # Lines containing '#' are treated as comments and skipped
    if '#' not in line:
        temp_line = line.upper().replace('\n', '')
        sp_line = temp_line.split(sep='\t')
        series = sp_line[0]
        gene_name = sp_line[1]

        # Genes with start > end are stored with swapped coordinates
        if int(sp_line[2]) > int(sp_line[3]):
            start = int(sp_line[3])
            end = int(sp_line[2])
            if arg.preserve_strand:
                strand = -1
            else:
                strand = 1
        else:
            start = int(sp_line[2])
            end = int(sp_line[3])
            strand = 1

        if series in series_features:
            series_features[series].append((gene_name, strand, start, end))
        else:
            series_features[series] = [(gene_name, strand, start, end)]
            series_indexes[series] = series_index
            series_index = series_index + 1

start = sys.maxsize
end = -1

gdd = GenomeDiagram.Diagram("diagram", tracklines = False, y = 0.4)
gd_track_for_features = gdd.new_track(1, scale = True, height = 1, scale_smallticks = 0)
gds_features = gd_track_for_features.new_set()

seed(int(arg.random_seed))

for series in series_features.keys():
    # Use the predefined colors first; fall back to random colors for extra groups
    if series_indexes[series] < len(colors_list):
        current_color = colors_list[series_indexes[series]]
    else:
        current_color = colors.Color(random(), random(), random())

    for i in range(0, len(series_features[series])):
        current_feature = series_features[series][i]
        feature = SeqFeature(FeatureLocation(int(current_feature[2]), int(current_feature[3])), strand=current_feature[1])
        gds_features.add_feature(feature, name="{}".format(current_feature[0]), label=True, color=current_color)

        # Track the minimum start and maximum end over all genes
        if int(current_feature[2]) < start:
            start = int(current_feature[2])
        if int(current_feature[3]) > end:
            end = int(current_feature[3])


if arg.start is not None:
    start = int(arg.start)
else:
    start = start - 1000

if arg.end is not None:
    end = int(arg.end)
else:
    end = end + 1000

if start > end:
    sys.exit("start must be less than end")

gd_track_for_features.start = start
gd_track_for_features.end = end
gd_track_for_features.scale_largetick_interval = (end-start) / int(arg.breaks)

gdd.draw(format="linear", fragments=1, start=start, end=end)
gdd.write(arg.output+'.'+arg.format, arg.format)
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
F-Box	Fbox1	320276	321550
F-Box	Fbox2	363707	364915
F-Box	Fbox3	473425	472151
F-Box	Fbox4	805518	807710
F-Box	Fbox5	812394	813713
F-Box	Fbox6	1542754	1541522
F-Box	Fbox7	1551260	1550496
F-Box	Fbox8	3672240	3673466
SRNase	SRNase	1545618	1547318

docs/index.html

Lines changed: 1 addition & 0 deletions
@@ -297,6 +297,7 @@ <h5>Additional images:</h5>
 <ul style="font-size:20px">
 <li><a href="https://hub.docker.com/r/pegi3s/biopython/" target="_blank"><b>biopython</b></a>
 <a href="http://biopython.org/DIST/docs/tutorial/Tutorial.html" target="_blank">[doc]</a> - Multipurpose Python tools for computational molecular biology</li>
+<li><a href="https://hub.docker.com/r/pegi3s/biopython_utilities/" target="_blank"><b>biopython_utilities</b></a> - Biopython utilities</li>
 <li><a href="https://hub.docker.com/r/pegi3s/formfind/" target="_blank"><b>formfind</b></a>
 <a href="https://github.com/VR51/formfind" target="_blank">[doc]</a> - HTML processing</li>
 <li><a href="https://hub.docker.com/r/pegi3s/r_project/" target="_blank"><b>r_project</b></a>
