
high-performance BED-to-GTF converter written in Rust

License: MIT License


bed2gtf

A high-performance bed-to-gtf converter written in Rust.

translates

chr27 17266469 17281218 ENST00000541931.8 1000 + 17266469 17281218 0,0,200 2 103,74, 0,14675,

into

chr27 bed2gtf gene 17266470 17285418 . + . gene_id "ENSG00000151743";

chr27 bed2gtf transcript 17266470 17281218 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8";

chr27 bed2gtf exon 17266470 17266572 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8"; exon_number "1"; exon_id "ENST00000541931.8.1";

...

in a few seconds.

Converts

Homo sapiens GRCh38 GENCODE 44 (252,835 transcripts) in 3.25 seconds.
Mus musculus GRCm39 GENCODE 44 (149,547 transcripts) in 1.99 seconds.
Canis lupus familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.20 seconds.
Gallus gallus bGalGal1 Ensembl 110 (72,689 transcripts) in 1.36 seconds.

What's new in v1.9.2

Adds the --no-gene flag to run the conversion without an isoforms file!

Makes -i required unless --no-gene is set.

Refactors BedRecord.

Isolates the CLI into its own module.

Usage

Usage: bed2gtf[EXE] --bed/-b <BED> --isoforms/-i <ISOFORMS> --output/-o <OUTPUT>

Arguments:
    -b, --bed <BED>: a .bed file
    -i, --isoforms <ISOFORMS>: a tab-delimited file (required unless --no-gene is set)
    -o, --output <OUTPUT>: path to output file
    -g, --gz[=<FLAG>]: compress output .gtf [default: false] [possible values: true, false]
    -n, --no-gene[=<FLAG>]: disable the gene_id feature [default: false] [possible values: true, false]

Options:
    --help: print help
    --version: print version
    --threads/-t: number of threads (default: max ncpus)

Warning

All the transcripts in .bed file should appear in the isoforms file.

crate: https://crates.io/crates/bed2gtf

Detailed formats

bed2gtf just needs two files:

a .bed file

tab-delimited files with 3 required and 9 optional fields:

chrom   chromStart  chromEnd      name    ...
  |         |           |           |
chr20   50222035    50222038    ENST00000595977    ...

see BED format for more information

a tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in .bed file should appear in the isoforms file):
```
> cat isoforms.txt

ENSG00000198888 ENST00000361390
ENSG00000198763 ENST00000361453
ENSG00000198804 ENST00000361624
ENSG00000188868 ENST00000595977
```
you can build a custom file for your preferred species using Ensembl BioMart.
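The isoforms file described above can be read into a transcript-to-gene lookup, and that lookup can be used to check that every transcript in the .bed file is covered. The sketch below is illustrative only, not bed2gtf's actual code: the function names `parse_isoforms` and `missing_transcripts` are hypothetical, and it assumes the gene comes first and the transcript second on each line, as in the example.

```rust
use std::collections::HashMap;

/// Parse the two-column isoforms table (gene, then transcript) into a
/// transcript -> gene lookup. Lines with fewer than two fields are skipped.
/// Splitting on any whitespace accepts tabs as well as spaces.
fn parse_isoforms(text: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in text.lines() {
        let mut fields = line.split_whitespace();
        if let (Some(gene), Some(transcript)) = (fields.next(), fields.next()) {
            map.insert(transcript.to_string(), gene.to_string());
        }
    }
    map
}

/// Return the BED transcript names (column 4) that have no gene in the lookup.
fn missing_transcripts<'a>(
    bed_names: &[&'a str],
    map: &HashMap<String, String>,
) -> Vec<&'a str> {
    bed_names
        .iter()
        .copied()
        .filter(|name| !map.contains_key(*name))
        .collect()
}
```

An empty result from `missing_transcripts` means every transcript in the .bed file has a gene assigned.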

Installation

to install bed2gtf on your system follow these steps:

get rust: curl https://sh.rustup.rs -sSf | sh on unix, or go here for other options
run cargo install bed2gtf (make sure ~/.cargo/bin is in your $PATH before running it)
use bed2gtf with the required arguments
enjoy!

Build

to build bed2gtf from this repo, do:

get rust (as described above)
run git clone https://github.com/alejandrogzi/bed2gtf.git && cd bed2gtf
run cargo run --release -- -b <BED> -i <ISOFORMS> -o <OUTPUT>

Container image

to build the development container image:

run git clone https://github.com/alejandrogzi/bed2gtf.git && cd bed2gtf
initialize docker with start docker or systemctl start docker
build the image docker image build --tag bed2gtf .
run docker run --rm -v "[dir_where_your_gtf_is]:/dir" bed2gtf -b /dir/<BED> -i /dir/<ISOFORMS> -o /dir/<OUTPUT>

Conda

to use bed2gtf through Conda, run:

conda install bed2gtf -c bioconda or conda create -n bed2gtf -c bioconda bed2gtf

Output

bed2gtf writes the output to the path given with -o/--output, which can be the same directory as the .bed file:

bed2gtf -b annotation.bed -i isoforms.txt -o output.gtf

.
├── ...
├── isoforms.txt
├── annotation.bed
└── output.gtf

where output.gtf is the result.

FAQ

Why?

UCSC offers a fast way to convert BED files into GTF through KentUtils or specific binaries (1), and several other bioinformaticians have shared scripts that try to replicate a similar solution (2,3,4).

A GTF file is a 9-column tab-delimited file that holds gene annotation data for a specific assembly (5). The 9th column defines the attributes of each entry. This field matters because some post-processing tools that handle GTF files (e.g. STAR, arriba) rely on it to extract gene information. An incomplete attribute field will likely lead to annotation-related errors in these tools.
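To make the 9-column layout concrete, here is a minimal sketch of how one GTF record could be assembled. The function name `gtf_line` and its parameters are hypothetical, the score column is fixed to "." for simplicity, and this is not claimed to be how bed2gtf writes its lines.

```rust
/// Format one GTF record: eight tab-separated fixed columns followed by
/// the attribute column, written as `key "value";` pairs separated by spaces.
fn gtf_line(
    chrom: &str,
    source: &str,
    feature: &str,
    start: u64,
    end: u64,
    strand: char,
    frame: &str,
    attributes: &[(&str, &str)],
) -> String {
    let attrs: Vec<String> = attributes
        .iter()
        .map(|(key, value)| format!("{} \"{}\";", key, value))
        .collect();
    // Column 6 (score) is always "." in this sketch.
    format!(
        "{}\t{}\t{}\t{}\t{}\t.\t{}\t{}\t{}",
        chrom,
        source,
        feature,
        start,
        end,
        strand,
        frame,
        attrs.join(" ")
    )
}
```

Called with the transcript from the introduction, this yields a line whose 9th column carries both `gene_id` and `transcript_id`, which is the information downstream tools look for.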

Of the available tools and scripts mentioned above, none produces a GTF file with a fully functional attribute field. (1) uses a two-step approach (bedToGenePred | genePredToGtf) written in C, which is extremely fast. Since a .bed file does not preserve any gene-related information, this approach fails to a) include correct gene_id attributes if no refTable is supplied (the transcript_id is duplicated as the gene_id) and b) append gene features in the 3rd column.

This is an example:

chr27 stdin transcript 17266470 17281218 . + . gene_id "ENST00000541931.8"; transcript_id "ENST00000541931.8";

chr27 stdin exon 17266470 17266572 . + . gene_id "ENST00000541931.8"; transcript_id "ENST00000541931.8"; exon_number "1"; exon_id "ENST00000541931.8.1";

On the other hand, the available scripts (2,3,4) produce badly formatted outputs that cannot be used as input to other tools. Some of them use a highly customized format, far from a complete GTF file (2):

chr20 ---- peak 50222035 50222038 . + . peak_id "chr20_50222035_50222038";

chr20 ---- peak 50188548 50189130 . + . peak_id "chr20_50188548_50189130";

and others (4) just provide exon-related information:

chr20 ensembl exon 50222035 50222038 . + . gene_id "ENST00000595977.1735"; transcript_id "ENST00000595977.1735"; exon_number "0

chr20 ensembl exon 50188548 50188930 . + . gene_id "ENST00000595977.3403"; transcript_id "ENST00000595977.3403"; exon_number "0

This is where bed2gtf comes in: a fast and memory-efficient BED-to-GTF converter written in Rust. In ~4 seconds this tool produces a fully functional GTF file with all the features needed by post-processing tools.

How?

bed2gtf is basically a reimplementation of the C binaries merged into one step. This tool evaluates the position of k exons in transcript j, calculates start/stop codon and UTR positions while preserving reading frames, and adjusts the index by +1 (BED starts are 0-based, GTF starts are 1-based). The isoforms file works like the refTable in the C binaries, mapping each transcript to its gene; however, bed2gtf takes advantage of this and adds an additional "gene" line (to be compatible with other tools).
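The exon positions and the +1 adjustment can be sketched from the BED12 block columns alone. The function below is a hypothetical illustration (`exon_coords` is not a bed2gtf function) and covers only exon coordinates, not codons, UTRs, or reading frames.

```rust
/// Convert BED12 block fields into 1-based, inclusive GTF exon coordinates.
/// `chrom_start` is the 0-based BED start (column 2); `block_sizes` and
/// `block_starts` are the comma-separated BED columns 11 and 12.
/// A trailing comma is tolerated; unparsable entries are skipped.
fn exon_coords(chrom_start: u64, block_sizes: &str, block_starts: &str) -> Vec<(u64, u64)> {
    let parse = |s: &str| -> Vec<u64> {
        s.split(',')
            .filter_map(|x| x.trim().parse().ok())
            .collect()
    };
    let sizes = parse(block_sizes);
    let starts = parse(block_starts);
    sizes
        .iter()
        .zip(starts.iter())
        .map(|(size, offset)| {
            // BED starts are 0-based, so the GTF start is shifted by +1.
            let start = chrom_start + offset + 1;
            // BED ends are exclusive, so start0 + size is already the inclusive GTF end.
            let end = chrom_start + offset + size;
            (start, end)
        })
        .collect()
}
```

For the BED line shown in the introduction (start 17266469, sizes `103,74,`, offsets `0,14675,`), the first exon comes out as 17266470 to 17266572, matching the exon line in the example output, and the last exon ends at 17281218, matching the transcript end.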

References


bed2gtf's Issues

Output is not sorted

Even though gtfsort exists, users may want the output to come out already sorted. The computational cost is small (~0.2 seconds).

How to do it: Implement an early ordering layer before converting lines into BedRecords.
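A minimal sketch of such an ordering layer, working on raw BED lines before any record is built. The names `sort_bed_lines` and `sort_key` are hypothetical, and chromosome names are compared as plain strings here (so chr10 sorts before chr2), which may not be the ordering the final implementation wants.

```rust
/// Sort key for a BED line: chromosome name, then numeric start (column 2).
/// A missing or unparsable start is treated as 0.
fn sort_key(line: &str) -> (&str, u64) {
    let mut fields = line.split_whitespace();
    let chrom = fields.next().unwrap_or("");
    let start = fields.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    (chrom, start)
}

/// Sort BED lines in place by chromosome and start position.
/// The sort is stable, so lines with equal keys keep their input order.
fn sort_bed_lines(lines: &mut [String]) {
    lines.sort_by(|a, b| sort_key(a).cmp(&sort_key(b)));
}
```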

Verbose does not include time units

In any given bed2gtf run, the logging info lacks time units:

2023-10-05T17:20:03.860Z INFO  [bed2gtf] Memory usage: 10.5582695 MB
2023-10-05T17:20:03.860Z INFO  [bed2gtf] Elapsed: 9.9935

should be:

2023-10-05T17:20:03.860Z INFO  [bed2gtf] Memory usage: 10.5582695 MB
2023-10-05T17:20:03.860Z INFO  [bed2gtf] Elapsed: 9.9935 secs

How to fix it: modify the formatted String in the logging function.
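The change could look roughly like this, assuming the elapsed time is available as a `std::time::Duration`. The helper name `elapsed_message` is hypothetical and the actual logging call in bed2gtf may be shaped differently.

```rust
use std::time::Duration;

/// Build the log message with an explicit unit, keeping four decimals.
fn elapsed_message(elapsed: Duration) -> String {
    format!("Elapsed: {:.4} secs", elapsed.as_secs_f64())
}
```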

TO DO: compressed input files

To be implemented in the next version.

*Also include version badge in README

Assignation of gene coordinates

Hello Alejandro

Thank you for developing this tool to convert the TOGA bed output to GTF. I need a GTF file to perform several downstream analyses, so this tool is exactly what I need! I appreciate your effort for creating this.
I annotated a mammal genome using TOGA and used your tool to get the GTF file, but I noticed the gene coordinates are weird. The exon and CDS coordinates are correct, but the gene feature coordinates are not assigned correctly; see my example below:

The original bed file:

mirAng1_1 12119959 12159135 ENST00000381995.28 1000 - 12119959 12159135 0,0,200 7 136,158,163,146,203,133,85, 0,2829,3372,7380,11124,17472,39091,
The GTF converted file is:

mirAng1_1 bed2gtf gene 9847 12159135 . - . gene_id "ENSG00000159128";
mirAng1_1 bed2gtf transcript 12119960 12159135 . - . gene_id "ENSG00000159128";
mirAng1_1 bed2gtf exon 12159051 12159135 . - . gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000381995.28.1";
mirAng1_1 bed2gtf CDS 12159051 12159135 . - 0 gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000381995.28.1";
mirAng1_1 bed2gtf start_codon 12159133 12159135 . - 0 gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000381995.28.1";
mirAng1_1 bed2gtf exon 12137432 12137564 . - . gene_id "ENSG00000159128"; exon_number "2"; transcript_id "ENST00000381995.28.2";
mirAng1_1 bed2gtf CDS 12137432 12137564 . - 2 gene_id "ENSG00000159128"; exon_number "2"; transcript_id "ENST00000381995.28.2";
mirAng1_1 bed2gtf exon 12131084 12131286 . - . gene_id "ENSG00000159128"; exon_number "3"; transcript_id "ENST00000381995.28.3";
mirAng1_1 bed2gtf CDS 12131084 12131286 . - 1 gene_id "ENSG00000159128"; exon_number "3"; transcript_id "ENST00000381995.28.3";
mirAng1_1 bed2gtf exon 12127340 12127485 . - . gene_id "ENSG00000159128"; exon_number "4"; transcript_id "ENST00000381995.28.4";
mirAng1_1 bed2gtf CDS 12127340 12127485 . - 2 gene_id "ENSG00000159128"; exon_number "4"; transcript_id "ENST00000381995.28.4";
mirAng1_1 bed2gtf exon 12123332 12123494 . - . gene_id "ENSG00000159128"; exon_number "5"; transcript_id "ENST00000381995.28.5";
mirAng1_1 bed2gtf CDS 12123332 12123494 . - 0 gene_id "ENSG00000159128"; exon_number "5"; transcript_id "ENST00000381995.28.5";
mirAng1_1 bed2gtf exon 12122789 12122946 . - . gene_id "ENSG00000159128"; exon_number "6"; transcript_id "ENST00000381995.28.6";
mirAng1_1 bed2gtf CDS 12122789 12122946 . - 2 gene_id "ENSG00000159128"; exon_number "6"; transcript_id "ENST00000381995.28.6";
mirAng1_1 bed2gtf exon 12119960 12120095 . - . gene_id "ENSG00000159128"; exon_number "7"; transcript_id "ENST00000381995.28.7";
mirAng1_1 bed2gtf CDS 12119960 12120095 . - 0 gene_id "ENSG00000159128"; exon_number "7"; transcript_id "ENST00000381995.28.7";
mirAng1_1 bed2gtf transcript 12119960 12159135 . - . gene_id "ENSG00000159128";
mirAng1_1 bed2gtf exon 12159051 12159135 . - . gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000290219.28.1";
mirAng1_1 bed2gtf CDS 12159051 12159135 . - 0 gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000290219.28.1";
mirAng1_1 bed2gtf start_codon 12159133 12159135 . - 0 gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000290219.28.1";
mirAng1_1 bed2gtf exon 12137432 12137564 . - . gene_id "ENSG00000159128"; exon_number "2"; transcript_id "ENST00000290219.28.2";
mirAng1_1 bed2gtf CDS 12137432 12137564 . - 2 gene_id "ENSG00000159128"; exon_number "2"; transcript_id "ENST00000290219.28.2";
mirAng1_1 bed2gtf exon 12131084 12131286 . - . gene_id "ENSG00000159128"; exon_number "3"; transcript_id "ENST00000290219.28.3";
mirAng1_1 bed2gtf CDS 12131084 12131286 . - 1 gene_id "ENSG00000159128"; exon_number "3"; transcript_id "ENST00000290219.28.3";
mirAng1_1 bed2gtf exon 12127340 12127485 . - . gene_id "ENSG00000159128"; exon_number "4"; transcript_id "ENST00000290219.28.4";
mirAng1_1 bed2gtf CDS 12127340 12127485 . - 2 gene_id "ENSG00000159128"; exon_number "4"; transcript_id "ENST00000290219.28.4";
mirAng1_1 bed2gtf exon 12123332 12123494 . - . gene_id "ENSG00000159128"; exon_number "5"; transcript_id "ENST00000290219.28.5";
mirAng1_1 bed2gtf CDS 12123332 12123494 . - 0 gene_id "ENSG00000159128"; exon_number "5"; transcript_id "ENST00000290219.28.5";
mirAng1_1 bed2gtf exon 12122789 12122946 . - . gene_id "ENSG00000159128"; exon_number "6"; transcript_id "ENST00000290219.28.6";
mirAng1_1 bed2gtf CDS 12122789 12122946 . - 2 gene_id "ENSG00000159128"; exon_number "6"; transcript_id "ENST00000290219.28.6";
mirAng1_1 bed2gtf exon 12119960 12120095 . - . gene_id "ENSG00000159128"; exon_number "7"; transcript_id "ENST00000290219.28.7";
mirAng1_1 bed2gtf CDS 12119960 12120095 . - 0 gene_id "ENSG00000159128"; exon_number "7"; transcript_id "ENST00000290219.28.7";
mirAng1_1 bed2gtf transcript 12137432 12137564 . - . gene_id "ENSG00000159128";
mirAng1_1 bed2gtf exon 12137432 12137564 . - . gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000290219.30683.1";
mirAng1_1 bed2gtf CDS 12137432 12137564 . - 0 gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000290219.30683.1";
mirAng1_1 bed2gtf start_codon 12137562 12137564 . - 0 gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000290219.30683.1";
mirAng1_1 bed2gtf transcript 12137432 12137564 . - . gene_id "ENSG00000159128";
mirAng1_1 bed2gtf exon 12137432 12137564 . - . gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000381995.30683.1";
mirAng1_1 bed2gtf CDS 12137432 12137564 . - 0 gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000381995.30683.1";
mirAng1_1 bed2gtf start_codon 12137562 12137564 . - 0 gene_id "ENSG00000159128"; exon_number "1"; transcript_id "ENST00000381995.30683.1";

As you can see, the gene feature coordinates go from 9847 to 12159135. Have you seen this error before?

Also, I noticed that not all the transcripts have a gene feature, could you please explain to me how the gene feature is assigned? I wonder if this is an issue with the pipeline or my evidence files.

I followed your suggestion to create the isoform file in TOGA issue #91.
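For reference, one common way to derive a gene feature is to take the smallest start and the largest end over all transcripts mapped to that gene. The sketch below shows that rule only; it is an assumption about the general approach, not a description of bed2gtf's internals or a diagnosis of the output above, and `gene_spans` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Compute one (start, end) span per gene as the minimum start and maximum
/// end over its transcripts. Input tuples are (gene_id, start, end).
/// Note: this groups by gene_id only; if the same gene_id can occur on
/// several chromosomes or scaffolds, the key should include the chromosome.
fn gene_spans(transcripts: &[(&str, u64, u64)]) -> HashMap<String, (u64, u64)> {
    let mut spans: HashMap<String, (u64, u64)> = HashMap::new();
    for &(gene, start, end) in transcripts {
        let span = spans.entry(gene.to_string()).or_insert((start, end));
        span.0 = span.0.min(start);
        span.1 = span.1.max(end);
    }
    spans
}
```

Under this rule, the transcripts listed above (12119960 to 12159135 and 12137432 to 12137564) would give a gene span of 12119960 to 12159135.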

isoform file not available

I don't have an isoform.txt file, can I still convert bed to gtf?

Why is exon_number 0 generated in the GTF?

Hi, this is a very useful tool, but I ran into an issue.
I'm running on CentOS 7 with zsh:
bed2gtf --bed test.bed12 --isoforms test.isoform --output test.gtf
Many "exon_number 0" entries are generated in the output GTF.
Could you explain why this happens?

-> % cat ./test.bed12
chrX 119786505 119791643 ENST00000361575.3 0 - 119786683 119791576 0 3 227,104,70, 0,3402,5068,
chrX 119786505 119791595 ENST00000468844.1 0 - 119791595 119791595 0 2 227,1688, 0,3402,
chrX 119786503 119787456 ENST00000477403.1 0 - 119787456 119787456 0 2 229,176, 0,777,

-> % cat test.isoform
ENSG00000198918.7 ENST00000477403.1
ENSG00000198918.7 ENST00000468844.1
ENSG00000198918.7 ENST00000361575.3

-> % cat test.gtf
#provider: bed2gtf
#version: 1.8.0
#contact: github.com/alejandrogzi/bed2gtf
#date: 2024-1-3
chrX bed2gtf gene 119786504 119791643 . - . gene_id "ENSG00000198918.7";
chrX bed2gtf transcript 119786504 119787456 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000477403.1";
chrX bed2gtf exon 119786504 119786732 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000477403.1"; exon_number "0"; exon_id "ENST00000477403.1.2";
chrX bed2gtf transcript 119786506 119791643 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3";
chrX bed2gtf exon 119786506 119786732 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3"; exon_number "0"; exon_id "ENST00000361575.3.3";
chrX bed2gtf three_prime_utr 119786506 119786683 . - 1 gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3";
chrX bed2gtf transcript 119786506 119791595 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000468844.1";
chrX bed2gtf exon 119786506 119786732 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000468844.1"; exon_number "0"; exon_id "ENST00000468844.1.2";
chrX bed2gtf stop_codon 119786684 119786686 . - 0 gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3"; exon_number "0"; exon_id "ENST00000361575.3.3";
chrX bed2gtf CDS 119786687 119786732 . - 1 gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3"; exon_number "0"; exon_id "ENST00000361575.3.3";
chrX bed2gtf exon 119787281 119787456 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000477403.1"; exon_number "1"; exon_id "ENST00000477403.1.1";
chrX bed2gtf exon 119789908 119790011 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3"; exon_number "1"; exon_id "ENST00000361575.3.2";
chrX bed2gtf CDS 119789908 119790011 . - 0 gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3"; exon_number "1"; exon_id "ENST00000361575.3.2";
chrX bed2gtf exon 119789908 119791595 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000468844.1"; exon_number "1"; exon_id "ENST00000468844.1.1";
chrX bed2gtf exon 119791574 119791643 . - . gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3"; exon_number "2"; exon_id "ENST00000361575.3.1";
chrX bed2gtf CDS 119791574 119791576 . - 0 gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3"; exon_number "2"; exon_id "ENST00000361575.3.1";
chrX bed2gtf start_codon 119791574 119791576 . - 0 gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3"; exon_number "2"; exon_id "ENST00000361575.3.1";
chrX bed2gtf five_prime_utr 119791577 119791643 . - 0 gene_id "ENSG00000198918.7"; transcript_id "ENST00000361575.3";
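For comparison, a 1-based numbering in transcription order can be sketched as follows. This assumes the intended convention is the one the exon_id suffixes above already follow (on the minus strand, the exon with the highest coordinates is exon 1); the function `exon_number` is hypothetical and is not the fix applied in bed2gtf.

```rust
/// Number exons 1..=exon_count in transcription order.
/// `index` is the 0-based position of the exon when exons are sorted by
/// genomic start; callers must ensure `index < exon_count`.
fn exon_number(index: usize, exon_count: usize, strand: char) -> usize {
    match strand {
        // Minus strand: the last exon by coordinate is transcribed first.
        '-' => exon_count - index,
        // Plus strand (or unknown): genomic order equals transcription order.
        _ => index + 1,
    }
}
```

With this rule, the three exons of ENST00000361575.3 would be numbered 3, 2, 1 from the lowest to the highest coordinate, and no exon would receive the number 0.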