🍊 💫 Trim, circularise and orient long read bacterial genome assemblies

License: GNU General Public License v3.0

Perl 96.86% Makefile 3.14%
circular-genome bioinformatics genomics genome-assembly long-read-sequencing

berokka

Trim, circularise, orient & filter long read bacterial genome assemblies

Introduction

There is already a good piece of software to trim/circularise and orient genome assemblies called Circlator. Please try that first!

You should only try Berokka if:

  1. You only have the contig files and do not have the corrected reads anymore
  2. Your contigs are simple cases with clear overhang and could be done manually with BLAST
  3. Circlator fails on your data even after troubleshooting

NOTE: orientation to dnaA or rep genes is not yet implemented.

Installation

Homebrew

Using Homebrew (Linux or macOS) will install all the dependencies for you:

brew install brewsci/bio/berokka

Conda

Using Bioconda will take care of everything:

conda install -c conda-forge -c bioconda -c defaults berokka

Source

git clone https://github.com/tseemann/berokka.git
./berokka/bin/berokka -h

You will need to install all the dependencies manually:

  • BioPerl >= 1.6 (for Bio::SeqIO and Bio::SearchIO)
  • BLAST+ >= 2.3.0 (for blastn)

Usage

Input

Input should be completed long-read assemblies in FASTA format, such as those from Canu or HGAP.

Usage

% berokka --outdir trimdir canu.contigs.fasta
<snip>
Did you know? berokka is a play on the concept of overhang vs hangover

% ls trimdir/
01.input.fa
02.trimmed.fa
03.results.tab

% cat trimdir/03.results.tab

#sequence       status  old_len new_len trimmed
tig00000000     trimmed 5461026 5448790 12236
tig00000002     trimmed 138825  113601  25224
tig00000003     trimmed 57075   43297   13778
tig00000004     kept    24900   24900   0
tig00000006     trimmed 1620    1320    300
tig00000007     removed 2380    0       0
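
The summary table is easy to sanity-check with a few lines of Python. The following is a minimal sketch, not part of Berokka: it assumes the file is tab-separated with a header line starting with `#`, as in the example above, and the function name `read_results` is hypothetical. It confirms that `old_len - new_len` equals `trimmed` for every contig with status `trimmed`.

```python
import csv
import io

# Rows copied from the example 03.results.tab above, written with real tabs
# (assumption: the real file is tab-separated, as its TSV format implies).
RESULTS_TAB = (
    "#sequence\tstatus\told_len\tnew_len\ttrimmed\n"
    "tig00000000\ttrimmed\t5461026\t5448790\t12236\n"
    "tig00000002\ttrimmed\t138825\t113601\t25224\n"
    "tig00000003\ttrimmed\t57075\t43297\t13778\n"
    "tig00000004\tkept\t24900\t24900\t0\n"
    "tig00000006\ttrimmed\t1620\t1320\t300\n"
    "tig00000007\tremoved\t2380\t0\t0\n"
)


def read_results(handle):
    """Parse a results table into a list of dicts with integer lengths."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        row = dict(zip(header, fields))
        for key in ("old_len", "new_len", "trimmed"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


rows = read_results(io.StringIO(RESULTS_TAB))

# For trimmed contigs, the three numeric columns should be consistent.
for row in rows:
    if row["status"] == "trimmed":
        assert row["old_len"] - row["new_len"] == row["trimmed"]

total_trimmed = sum(row["trimmed"] for row in rows)
removed = [row["sequence"] for row in rows if row["status"] == "removed"]
print(total_trimmed, removed)
```

To read a real file, replace the `io.StringIO(...)` argument with `open("trimdir/03.results.tab", newline="")`.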

Output

Filename        Format  Description
01.input.fa     FASTA   All the input sequences
02.trimmed.fa   FASTA   The (possibly) trimmed sequences
03.results.tab  TSV     Summary of results

The 02.trimmed.fa output is augmented with new header data (unless --noanno is used):

  • circular=true - marks this as a circular sequence (Rebaler uses this)
  • overhang=N - reports that N bp were trimmed off
  • len=N - updated to the new contig length, if a len= tag was already present (Canu adds this)
  • suggestCircular=yes - replaces suggestCircular=no, if that tag was present (Canu adds this)
  • class=replicon - replaces class=contig, if that tag was present and the contig was circularised
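
Downstream scripts may want to read these tags back. Here is a minimal sketch, assuming the tags appear as whitespace-separated `key=value` tokens after the sequence ID. The function name `parse_annotations` and the example header are hypothetical; the header is assembled from the tags listed above and is not real Berokka output.

```python
def parse_annotations(header_line):
    """Split a FASTA header into (sequence id, dict of key=value tags)."""
    text = header_line.lstrip(">").strip()
    parts = text.split()
    seq_id = parts[0] if parts else ""
    tags = {}
    for token in parts[1:]:
        key, sep, value = token.partition("=")
        if sep:  # ignore free-text words that are not key=value pairs
            tags[key] = value
    return seq_id, tags


# Hypothetical header built from the tags described above.
example = (
    ">tig00000003 len=43297 class=replicon "
    "suggestCircular=yes circular=true overhang=13778"
)

seq_id, tags = parse_annotations(example)
is_circular = tags.get("circular") == "true"
print(seq_id, is_circular, tags.get("overhang"))
```

Values are kept as strings; convert `len` and `overhang` with `int()` where a number is needed.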

Options

  • --filter <FASTA> allows you to remove contigs which match 50% of sequences in this file. Berokka comes with the standard PacBio control sequence. You can provide your own FASTA file using this option. If you want to disable filtering, use --filter 0.

  • --readlen LENGTH can help with datasets that fail to circularise. It changes the length of the match that Berokka attempts to make using BLAST.

  • --noanno will ensure that the FASTA descriptions are not altered between the input and output FASTA files.

  • --keepfiles and --debug are primarily for use by the developer.
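
To see why the length of the attempted match matters for --readlen, a toy model of overhang detection may help. This is a simplified sketch only: Berokka uses blastn, which tolerates mismatches and partial alignments, whereas this sketch uses exact string matching. The function name `find_exact_overhang` and the sequences are hypothetical.

```python
def find_exact_overhang(contig, probe_len):
    """Return the overhang length if the first probe_len bases recur later
    and the contig's tail repeats its head from that point; otherwise 0."""
    if probe_len <= 0 or probe_len >= len(contig):
        return 0
    probe = contig[:probe_len]
    pos = contig.find(probe, 1)  # start at 1 to skip the trivial self-match
    while pos != -1:
        tail = contig[pos:]
        if contig.startswith(tail):
            return len(tail)
        pos = contig.find(probe, pos + 1)
    return 0


# Toy circular genome, "assembled" with the first 5 bases repeated at the end.
genome = "ACGTTGCAAGGCTTAC"
contig = genome + genome[:5]

overhang = find_exact_overhang(contig, probe_len=4)
trimmed = contig[: len(contig) - overhang]
print(overhang, trimmed == genome)

# A probe longer than the overhang finds nothing in this exact-match toy.
print(find_exact_overhang(contig, probe_len=6))
```

In this toy the probe must be no longer than the overhang. A local aligner like BLAST does not have that limitation, so treat the sketch as an illustration of the idea rather than of Berokka's behaviour.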

Etymology

Berocca is a brand of effervescent drink and vitamin tablets containing vitamin B and C. It is a popular cure for a hangover. A key role of the berokka tool is to remove the "overhang" that occurs at the ends of long-read assemblies of circular genomes.

Feedback

Please file questions, bugs or ideas on the Issue Tracker.

License

GPLv3

Citation

Not published yet.

Authors

  • Torsten Seemann

berokka's Issues

Error: NCBI C++ Exception:

Hello,
I used conda to install berokka, but ran into this problem:

Error: NCBI C++ Exception:
T0 "/opt/conda/conda-bld/blast_1537407096784/work/c++/src/objtools/readers/fasta.cpp", line 2178: Error: ncbi::objects::CFastaReader::PostWarning() - CFastaReader: Near line 7, there's a line that doesn't look like plausible data, but it's not marked as defline or comment. (m_Pos = 7

Thanks!!!!

Berokka conda installer is broken?

Just reporting that the conda installer for berokka has a Perl issue.

conda create -n berokka_env berokka
conda activate berokka_env
berokka
Can't locate Bio/SeqIO.pm in @INC (you may need to install the Bio::SeqIO module) 

Orient and rotate mitochondrial genome

I have an animal mitochondrial genome assembled by Unicycler (so no overlap). I'd like to orient and rotate the genome to agree with its closest relative in Genbank. I have a FASTA file of the related genome. Is this task possible with Berokka?

Small contigs cause reverse blast hit results and trim fail

Usually the beginning hits the end, which we handle:

Running: blastn -query 4.head.fa -subject 4.fa -out 4.bls -evalue 1E-6 -dust no
blastn: 1..13766/20000 aligns to 43298..57075/57075
tig00000003 keep 1..43297/57075 (remove 13779 bp)
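
The coordinate arithmetic behind the handled case can be sketched as follows. This is an illustration under stated assumptions, not Berokka's actual code: the function name `trim_from_hit` is hypothetical, and the strict requirement that the hit starts at query position 1 and ends at the last base of the contig is an assumption made for this sketch.

```python
def trim_from_hit(q_start, q_end, s_start, s_end, contig_len):
    """Return (keep_end, removed_bp) when the contig head aligns to the
    contig tail, or None when the hit does not look like a terminal overhang.
    Coordinates are 1-based and inclusive, as in the blastn log lines."""
    is_self_hit = q_start == s_start and q_end == s_end
    head_anchored = q_start == 1           # assumption: strict anchoring
    tail_anchored = s_end == contig_len    # assumption: strict anchoring
    if is_self_hit or not (head_anchored and tail_anchored) or s_start <= 1:
        return None
    keep_end = s_start - 1
    return keep_end, contig_len - keep_end


# Coordinates from the tig00000003 log above.
print(trim_from_hit(1, 13766, 43298, 57075, 57075))

# Coordinates from the two failing logs further down this page.
print(trim_from_hit(781, 900, 1, 120, 900))     # end hits the beginning
print(trim_from_hit(2, 3959, 2, 3959, 3959))    # head aligns to itself
```

For tig00000003 the sketch keeps 1..43297 and removes 13778 bp, which agrees with the `trimmed` column in the example results table. The log line above reports 13779 bp, so the tool's own message may count differently by one.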

On these smaller ones, the end hits the beginning, which we DO NOT HANDLE.

*** [7] tig00000006 ***
Using first 900 bp to BLAST
Writing tig00000006 ( 900 bp ) to 7.fa
Writing tig00000006 ( 900 bp ) to 7.head.fa
Running: blastn -query 7.head.fa -subject 7.fa -out 7.bls -evalue 1E-6 -dust no
blastn: 781..900/900 aligns to 1..120/900
tig00000006 - COULD NOT TRIM

This should return the opposite

$ ! berokka --doesnotexist
Unknown option: doesnotexist
SYNOPSIS
  Filter, trim, circularise & orient long read assemblies
USAGE
  berokka [options] canu.contigs.fasta [another.fasta ...]
OPTIONS
  --help          This help.
  --debug         Debug info (default '0').
  --version       Print version and exit.
  --check         Check dependencies and exit.
  --test          Run a small test and exit.
  --force         Force overwite of existing (default '0').
  --outdir [X]    Output folder (default '').
  --readlen [N]   Approximate read length (default '30000').
  --keepfiles     Keep intermediate files (default '0').
  --noanno        Don't annotate FASTA with circular=true (default '0').
AUTHOR
  Torsten Seemann | https://github.com/tseemann/berokka
The command "! berokka --doesnotexist" exited with 1.

MSG: trunc start,end -- there was no end for 1

*** [5] tig00003742 ***
Using first 3959 bp to BLAST
Writing tig00003742 ( 3959 bp ) to 5.fa
Writing tig00003742 ( 3959 bp ) to 5.head.fa
Running: blastn -query 5.head.fa -subject 5.fa -out 5.bls -evalue 1E-6 -dust no
blastn: 2..3959/3959 aligns to 2..3959/3959 at 98.5 %id
tig00003742 keep 1..0/3959 (remove 3960 bp)

------------- EXCEPTION -------------
MSG: trunc start,end -- there was no end for 1
STACK Bio::PrimarySeqI::trunc /home/linuxbrew/.linuxbrew/Cellar/perl/5.26.1_1/lib/perl5/site_perl/5.26.1/Bio/PrimarySeqI.pm:447
STACK main::check_overhang /home/tseemann/git/berokka/bin/berokka:149
STACK toplevel /home/tseemann/git/berokka/bin/berokka:74
-------------------------------------

Error on undefined value

Can't call method "start" on an undefined value at /home/tseemann/git/berokka/bin/berokka line 134, line 123.

berokka conda install has an issue with the Bio::SeqIO perl module

My conda install produces the following error:

berokka Can't locate Bio/SeqIO.pm in @INC (you may need to install the Bio::SeqIO module) (@INC contains: /home/kvandelannoo/miniconda3/envs/berokka_env/lib/site_perl/5.26.2/x86_64-linux-thread-multi /home/kvandelannoo/miniconda3/envs/berokka_env/lib/site_perl/5.26.2 /home/kvandelannoo/miniconda3/envs/berokka_env/lib/5.26.2/x86_64-linux-thread-multi /home/kvandelannoo/miniconda3/envs/berokka_env/lib/5.26.2 .) at /home/kvandelannoo/miniconda3/envs/berokka_env/bin/berokka line 4. BEGIN failed--compilation aborted at /home/kvandelannoo/miniconda3/envs/berokka_env/bin/berokka line 4.

I installed berokka using:
conda install -c conda-forge -c bioconda -c defaults berokka

I tried the following things without success:
1/ updating conda
2/ creating a separate conda env

My install looks OK to me:

which berokka
~/miniconda3/envs/berokka_env/bin/berokka

which perl
~/miniconda3/envs/berokka_env/bin/perl

echo $PATH | tr ":" "\n" | nl
     1  /home/kvandelannoo/miniconda3/envs/berokka_env/bin
     2  /home/kvandelannoo/miniconda3/condabin
     3  /usr/local/showq/0.15/bin
     4  /usr/local/slurm/latest/bin
     5  /usr/lib64/qt-3.3/bin
     6  /usr/local/bin
     7  /usr/bin
     8  /usr/local/sbin
     9  /usr/sbin
    10  /opt/ibutils/bin
    11  /opt/puppetlabs/bin
    12  /opt/dell/srvadmin/bin
    13  /home/kvandelannoo/.local/bin
    14  /home/kvandelannoo/bin

perl -e "print qq(@INC)"
/home/kvandelannoo/miniconda3/envs/berokka_env/lib/site_perl/5.26.2/x86_64-linux-thread-multi /home/kvandelannoo/miniconda3/envs/berokka_env/lib/site_perl/5.26.2 /home/kvandelannoo/miniconda3/envs/berokka_env/lib/5.26.2/x86_64-linux-thread-multi /home/kvandelannoo/miniconda3/envs/berokka_env/lib/5.26.2

Any help with this would be much appreciated.

KV

Output unchanged despite clear overlaps

Hello,

Thank you for this tool. I tried to run it with a bacterial genome and it returned the same input sequence despite having clear overlaps at the beginning and end. It did not output any error. Do you know what could be the problem?

Thanks

Removing files messages should be --debug only

Removing temporary files: 1.fa 1.head.fa 1.bls
Removing temporary files: 2.fa 2.head.fa 2.bls
Removing temporary files: 3.fa 3.head.fa 3.bls
Removing temporary files: 4.fa 4.head.fa 4.bls

Doesn't quite align to end and circ fails

tig00000001     dna     4736005

 Score = 55308 bits (29950),  Expect = 0.0
 Identities = 29995/30013 (99%), Gaps = 17/30013 (0%)
 Strand=Plus/Plus

Query  1        CGCTGTCGGCAAGAATATAGCGGCTTGATGCCAAAG-CGCCT-GGTCATTTCGACAAAAA  58
                |||||||||||||||||||||||||||||||||||| ||||| |||||||||||||||||
Sbjct  4704147  CGCTGTCGGCAAGAATATAGCGGCTTGATGCCAAAGGCGCCTGGGTCATTTCGACAAAAA  4704206

<snip>

Query  29988    ACGGTTTTTCAGT  30000
                |||||||||||||
Sbjct  4734143  ACGGTTTTTCAGT  4734155
