csiro-crop-informatics / biokanga

An integrated high performance bioinformatics toolkit

License: Other

C++ 57.78% Makefile 0.10% Objective-C 0.53% C 41.47% Python 0.09% Shell 0.01% M4 0.01% Assembly 0.02% Dockerfile 0.01%

biokanga's Introduction

BioKanga

BioKanga is an integrated toolkit of high performance bioinformatics subprocesses targeting the challenges of next generation sequencing analytics. Kanga is an acronym standing for 'K-mer Adaptive Next Generation Aligner'.

Why YAL (Yet Another Aligner)

Compared with other widely used aligners, BioKanga provides substantial gains in both the proportion and quality of aligned sequence reads at competitive or increased computational efficiency. Unlike most other aligners, BioKanga uses the Hamming distances between a read's putative alignments to the targeted genome assembly as the discriminative acceptance criterion, rather than relying on sequencer-generated quality scores.
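
As a rough illustration of this idea, the sketch below computes Hamming distances between a read and candidate target subsequences, and accepts the best candidate only when the runner-up is worse by a minimum margin. The function names, the `min_delta` parameter and the exact acceptance rule are assumptions made for illustration; they are not taken from BioKanga's source.

```shell
# Hamming distance: number of positions at which two equal-length strings differ.
# The function body runs in a subshell, so its variables stay local.
hamming() (
  a=$1; b=$2; d=0
  if [ "${#a}" -ne "${#b}" ]; then
    echo "hamming: sequences must have equal length" >&2
    return 1
  fi
  while [ -n "$a" ]; do
    rest_a=${a#?}; rest_b=${b#?}
    # ${a%"$rest_a"} strips the tail, leaving only the first character.
    if [ "${a%"$rest_a"}" != "${b%"$rest_b"}" ]; then d=$(( d + 1 )); fi
    a=$rest_a; b=$rest_b
  done
  echo "$d"
)

# Hypothetical acceptance rule: print "<best candidate> <distance>" only if the
# second-best candidate has at least min_delta more mismatches than the best.
accept_best() (
  seq_read=$1; min_delta=$2; shift 2
  best=""; best_d=-1; second_d=-1
  for cand in "$@"; do
    d=$(hamming "$seq_read" "$cand") || return 2
    if [ "$best_d" -lt 0 ] || [ "$d" -lt "$best_d" ]; then
      second_d=$best_d; best_d=$d; best=$cand
    elif [ "$second_d" -lt 0 ] || [ "$d" -lt "$second_d" ]; then
      second_d=$d
    fi
  done
  [ "$best_d" -ge 0 ] || return 1   # no candidates supplied
  if [ "$second_d" -lt 0 ] || [ $(( second_d - best_d )) -ge "$min_delta" ]; then
    echo "$best $best_d"
    return 0
  fi
  return 1                          # best hit is not distinct enough
)

accept_best ACGTACGT 2 ACGTACGA TTTTACGT   # prints "ACGTACGA 1"
```

With two equally good candidates (for example `ACGTACGA` and `ACGTACGC`), the function returns a non-zero status, which mirrors the idea of rejecting reads whose best alignment is not clearly separated from the alternatives.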

Another primary differentiator for BioKanga is that this toolkit can process billions of reads against targeted genomes containing 100 million contigs and totalling up to 100Gbp of sequence.

Toolset Components

The BioKanga toolset contains a number of subprocesses, each of which targets a specific bioinformatics analytics task. The primary subprocesses provide functionality to:

  • Generate simulated NGS datasets
  • Quality check the raw NGS reads to identify potential processing issues
  • Filter NGS reads for sequencer errors and/or exact duplicates
  • de Novo assemble filtered reads into contigs
  • Scaffold de Novo assembled contigs
  • Blitz local alignments
  • Generate index over genome assembly or sequences
  • NGS reads alignment-less K-mer derived marker sequences generation
  • NGS reads alignment-less prefix K-mer derived marker sequences generation
  • Concatenate sequences to create pseudo-genome assembly
  • Align NGS reads to indexed genome assembly or sequences
  • Scaffold assembly contigs using PE read alignments
  • Identify SSRs in multifasta sequences
  • Map aligned reads loci to known features
  • RNA-seq differential expression analyser with optional Pearsons generation
  • Generate tab delimited counts file for input to DESeq or EdgeR
  • Extract fasta sequences from multifasta file
  • Merge PE short insert overlap reads
  • SNP alignment derived marker sequences identification
  • Remap alignment loci
  • Locate and report regions of interest
  • Generate marker sequences from SNP loci
  • Generate SQLite Marker Database from SNP markers
  • Generate SQLite SNP Database from aligner identified SNPs
  • Generate SQLite DE Database from RNA-seq DE
  • Generate SQLite Blat alignment PSL database

Build and installation

Linux

To build on Linux, clone this repository, then run autoreconf, configure and make. The following example will install the biokanga toolkit to a bin directory underneath the user's home directory.

git clone https://github.com/csiro-crop-informatics/biokanga.git
cd biokanga
autoreconf -f -i
./configure --prefix=$HOME
make install

Alternatively, the binary built for the appropriate platform can be used directly.

Windows

To build on Windows, the current version requires Visual Studio 2015 or 2017 with build tools v140.

  1. Open the biokanga.sln file in Visual Studio.
  2. Under the Build menu, select Configuration Manager.
  3. For Active solution platform, select x64.
  4. The project can then be built. By default, executables will be copied into the Win64 directory.

Alternatively, the Windows binaries can be used directly.

Documentation

Documentation for the core functionality of biokanga and pacbiokanga is available under the Docs directory.

Contributing

BioKanga is maintained by the Crop Bioinformatics and Data Science team at CSIRO in Canberra, Australia.

Contributions are most welcome. To contribute, follow these steps.

  1. Fork biokanga into your own repository
  2. Clone the repository to your development machine and enter it
  3. Checkout the dev branch
  4. Make and checkout a new branch for your work (git checkout -b great-new-feature)
  5. Make regular commits on your new branch
  6. Push your branch back to your github repository (git push origin great-new-feature)
  7. Create a pull request to the dev branch of the csiro-crop-informatics/biokanga repository
  8. If your work is related to an existing issue, refer to the issue in the pull request comment

Issues

Please report issues on the github project.

Authors

BioKanga has been developed by Dr Stuart Stephen, with contributions from other team members at CSIRO.

biokanga's People

Contributors

alexwhan, rsuchecki

biokanga's Issues

genGenomeFromAGP help needed

I have a draft assembly in scaffolds (61)
With the help of genetics markers I have ordered these scaffolds.
The result generated by the script I used is an AGP file.

I want to use genGenomeFromAGP to construct the draft genome in linkage groups (chromosomes).

But I have a doubt about the input files needed as described in genGenomeFromAGP --help

Is this command line correct ? :

genGenomeFromAGP -f 4 -F mylogfile.log -m 1 -i scaffolds.fasta -I ordered_scaffolds.agp -o chomosome_assembly.fasta

Thanks

Version Tags

Any chance you can add version tags back in, or did you lose/remove the revision history with the migration to GitHub?

Here's a todo list for the sake of record keeping:

TODO List

  • Add in git tags for prior versions
  • Add github releases for prior versions
  • Update EasyBuild EasyConfigs for BioKanga releases, including URL and more recent versions

Segfault with pemode 3

Version: 4.4.0+

[Apr 11 16:01:37.367 2019](biokanga) Paired end association and partner alignment processing started..
[Apr 11 16:01:37.381 2019](biokanga) Generating paired reads index over 246 paired reads
[Apr 11 16:01:37.384 2019](biokanga) Starting to associate Paired End reads to be within insert size range ...
[Apr 11 16:01:37.387 2019](biokanga) Processed putative 0 pairs, accepted 0Segmentation fault (core dumped)

Steps to reproduce:

curl ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.chromosome.10.fa.gz \
| gunzip --stdout > O_sativa_IRGSP-1.0_Chr10.fasta

biokanga index \
  --threads 2 \
  -i O_sativa_IRGSP-1.0_Chr10.fasta \
  -o O_sativa_IRGSP-1.0_Chr10.fasta.sfx \
  --ref O_sativa_IRGSP-1.0_Chr10.fasta

biokanga align   \
  --sfx O_sativa_IRGSP-1.0_Chr10.fasta.sfx   \
  --mode 0   \
  --format 5   \
  --pemode 3  \
  --in O_sativa_IRGSP-1.0_Chr10_ArtIllumina_reads.1.fq.gz   \
  --pair O_sativa_IRGSP-1.0_Chr10_ArtIllumina_reads.2.fq.gz  \
  --out out.bam   \
  --threads 2  \
  --substitutions 5

Reads:
O_sativa_IRGSP-1.0_Chr10_ArtIllumina_reads.1.fq.gz
O_sativa_IRGSP-1.0_Chr10_ArtIllumina_reads.2.fq.gz

Zenodo DOI author list

Zenodo DOI author list needs fixing, it was generated automatically and does not reflect reality.

Bug: MREVERSE flag not set

I don't think BioKanga always correctly sets the MREVERSE SAM flag (0x20). It may only affect read pairs from small inserts such that the start/end coordinates are the same for both reads in a pair.

I think there is an issue when READ1 has REVERSE (0x10) set but then READ2 does not get MREVERSE (0x20) set. An example is this:

D00615:71:CC1WCANXX:2:2301:2669:90742   131     chr6D_part1     100001827       255     80M     =       100001827       80
D00615:71:CC1WCANXX:2:2301:2669:90742   83      chr6D_part1     100001827       255     80M     =       100001827       80

I think the relevant code resides here:

switch(ReadIs) {
and involves cSAMFlgAS and cSAMFlgMateAS
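
To make the two FLAG values in the example easier to read, here is a small sketch that decodes a SAM FLAG into bit names and checks whether a pair's strand bits agree. The bit values come from the SAM specification; the function names are made up for this illustration and are unrelated to BioKanga's code.

```shell
# Print the names of the SAM flag bits set in a decimal FLAG value.
# The function body runs in a subshell, so its variables stay local.
sam_flag_names() (
  flag=$1; names=""
  for pair in 1:PAIRED 2:PROPER_PAIR 4:UNMAP 8:MUNMAP \
              16:REVERSE 32:MREVERSE 64:READ1 128:READ2; do
    bit=${pair%%:*}; name=${pair#*:}
    if [ $(( flag & bit )) -ne 0 ]; then names="$names${names:+,}$name"; fi
  done
  echo "$names"
)

# Succeed only if each read's REVERSE bit (0x10) matches its mate's
# MREVERSE bit (0x20), in both directions.
mate_flags_consistent() (
  a=$1; b=$2
  [ $(( (a & 16) != 0 )) -eq $(( (b & 32) != 0 )) ] &&
  [ $(( (b & 16) != 0 )) -eq $(( (a & 32) != 0 )) ]
)

sam_flag_names 131   # prints "PAIRED,PROPER_PAIR,READ2"
sam_flag_names 83    # prints "PAIRED,PROPER_PAIR,REVERSE,READ1"
mate_flags_consistent 83 131 || echo "strand flags disagree between mates"
```

For comparison, the pair 83 and 163 (131 plus 0x20) passes the consistency check.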

Bug: Not generating BAM/CSI when ref sequences are long

When a reference sequence is longer than the (2^29)-1 bp limit imposed by the BAI index, BioKanga attempts to generate the newer CSI index. However, BioKanga hangs and doesn't generate BAM/CSI files with content. This has happened with both the latest version (4.3.4) and a past version (3.4.5) I've used.

Basically, things hang and the last bit of the log is something like:

[Apr  5 06:17:31.322 2017](biokanga) Read nonalignment reason summary:
[Apr  5 06:17:31.323 2017](biokanga)    0 (NA) Not processed for alignment
[Apr  5 06:17:31.324 2017](biokanga)    65026874 (AA) Alignment accepted
[Apr  5 06:17:31.324 2017](biokanga)    5663 (EN) Excessive indeterminate (Ns) bases
[Apr  5 06:17:31.325 2017](biokanga)    12105612 (NL) No potential alignment loci
[Apr  5 06:17:31.325 2017](biokanga)    0 (MH) Mismatch delta (minimum Hamming) criteria not met
[Apr  5 06:17:31.326 2017](biokanga)    26812906 (ML) Aligned to multiloci
[Apr  5 06:17:31.326 2017](biokanga)    0 (ET) Excessively end trimmed
[Apr  5 06:17:31.327 2017](biokanga)    0 (OJ) Aligned as orphaned splice junction
[Apr  5 06:17:31.327 2017](biokanga)    0 (OM) Aligned as orphaned microInDel
[Apr  5 06:17:31.328 2017](biokanga)    0 (DP) Duplicate PCR
[Apr  5 06:17:31.328 2017](biokanga)    0 (DS) Duplicate read sequence
[Apr  5 06:17:31.328 2017](biokanga)    0 (FC) Aligned to filtered target sequence
[Apr  5 06:17:31.329 2017](biokanga)    0 (PR) Aligned to a priority region
[Apr  5 06:17:31.329 2017](biokanga)    816554 (UI) PE under minimum insert size
[Apr  5 06:17:31.330 2017](biokanga)    47092 (OI) PE over maximum insert size
[Apr  5 06:17:31.330 2017](biokanga)    6063353 (UP) PE partner not aligned
[Apr  5 06:17:31.331 2017](biokanga)    104096 (IS) PE partner aligned to inconsistent strand
[Apr  5 06:17:31.331 2017](biokanga)    1309794 (IT) PE partner aligned to different target sequence
[Apr  5 06:17:31.332 2017](biokanga)    0 (NP) PE alignment not accepted
[Apr  5 06:17:31.332 2017](biokanga)    0 (LC) Alignment violated loci base constraints
[Apr  5 06:17:31.333 2017](biokanga) Reporting of aligned result set started...
[Apr  5 06:17:31.337 2017](biokanga) Sorting alignments by ascending chrom.loci
[Apr  5 06:17:45.885 2017](biokanga) StartAlignments: Generating CSI instead of SAI index file as alignments to sequence lengths (max 830829764) more than SAI 512Mbp limit
[Apr  5 06:17:45.892 2017](biokanga) Header written with references to 22 sequences of which 22 have at least 1 alignments
[Apr  5 06:17:45.893 2017](biokanga) Reported BAM 0 read alignments
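
For anyone wanting to check in advance which reference sequences trigger the CSI path, the sketch below compares sequence lengths against the (2^29)-1 bp limit quoted above. It assumes tab-separated `name<TAB>length` input, such as the first two columns of a `.fai` index; the function name and the sequence names in the demo are made up.

```shell
# Largest reference length the report above gives for a BAI index: (2^29) - 1 bp.
BAI_MAX_LEN=$(( (1 << 29) - 1 ))

# Read "name<TAB>length" lines from files or stdin and print the names of
# sequences that are too long for a BAI index.
refs_needing_csi() {
  awk -F '\t' -v max="$BAI_MAX_LEN" '$2 + 0 > max + 0 { print $1 }' "$@"
}

# Demo: chrA and chrB are hypothetical names; 830829764 is the maximum
# sequence length reported in the log above.
printf 'chrA\t830829764\nchrB\t1000000\n' | refs_needing_csi   # prints "chrA"
```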

Contributions

There should be a CONTRIBUTING.md file so people know how contributions should be made, the process involved etc. Is the intention to try and have a stable master and have development work take place on a develop branch or to have development take place on master?

GitHub has some guidelines for these here:
https://github.com/blog/1184-contributing-guidelines

Incorrect exit status

command:

biokanga index     -i 161010_Chinese_Spring_v1.0_pseudomolecules_parts.fasta     -o kangadb     --ref 161010_Chinese_Spring_v1.0_pseudomolecules_parts.fasta

The underlying issue is with insufficient memory requested for the batch job. However, as biokanga tries and fails to allocate more memory it should not return Exit code: 0

[Jul 26 04:23:36.930 2018](biokanga) Subprocess index Version 4.3.9 starting
[Jul 26 04:23:36.930 2018](biokanga) Resources: cores: 20 physmem: 126 (GB) virtual mem: Unlimited (cur) Unlimited (max) locked mem: Unlimited (cur) Unlimited (max) data seg size: Unlimited (cur) Unlimited (max) threads: 514763 (cur) 514763 (max) stack size: Unlimited (cur) Unlimited (max)
[Jul 26 04:23:36.930 2018](biokanga) Processing parameters:
        Accepting for indexing sequences of length at least: 50bp
        Process input files as: 'standard'
        Input source file spec: '161010_Chinese_Spring_v1.0_pseudomolecules_parts.fasta'
        Output to suffix array file: 'kangadb'
        Reference species: '161010_Chinese_Spring_v1.0_pseudomolecules_parts.fasta'
        Title text: '161010_Chinese_Spring_v1.0_pseudomolecules_parts.fasta'
        Descriptive text: '161010_Chinese_Spring_v1.0_pseudomolecules_parts.fasta'
        Number of threads : 20
[Jul 26 04:23:36.931 2018](biokanga) Will process in this order: 1 '161010_Chinese_Spring_v1.0_pseudomolecules_parts.fasta
[Jul 26 04:23:36.984 2018](biokanga) ProcessFastaFile:- Adding 161010_Chinese_Spring_v1.0_pseudomolecules_parts.fasta..
[Jul 26 04:29:55.462 2018](biokanga) CreateBioseqSuffixFile: sorting suffix array...
[Jul 26 04:29:55.470 2018](biokanga) LoadReads: SfxBlock memory re-allocation to 87283569669 bytes - Cannot allocate memory
[Jul 26 04:29:55.479 2018](biokanga) LoadReads: SfxBlock memory re-allocation to 87283569669 bytes - Cannot allocate memory
[Jul 26 04:29:55.663 2018](biokanga) CreateBioseqSuffixFile: completed...
[Jul 26 04:29:55.664 2018](biokanga) Exit code: 0 Total processing time:     00:06:18.740 seconds
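
Until the exit status reflects such failures, a batch script could scan the log itself. The following is a workaround sketch only: it assumes the log has been captured to a file (how that is done is left open here) and that the message text matches the lines quoted above. The function name is made up.

```shell
# Succeed (status 0) if the given log file contains an allocation failure.
log_reports_failure() {
  grep -q 'Cannot allocate memory' "$1"
}

# Demo using two log lines quoted in the report above.
log=$(mktemp)
cat > "$log" <<'EOF'
[Jul 26 04:29:55.470 2018](biokanga) LoadReads: SfxBlock memory re-allocation to 87283569669 bytes - Cannot allocate memory
[Jul 26 04:29:55.664 2018](biokanga) Exit code: 0 Total processing time:     00:06:18.740 seconds
EOF

if log_reports_failure "$log"; then
  echo "log reports an allocation failure despite exit code 0"
fi
rm -f "$log"
```

In a real job script, the `echo` would typically be replaced by `exit 1` so that the scheduler marks the job as failed.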

Bug: First alignment on header line

I've also noticed that the first alignment (not sure if this is in every case) appears on the last line of the BAM header, as in the example below. It is probably just a missing newline character at the end of the @PG header line.

@PG     ID:biokanga     VN:4.3.5HWI-ST226:209:D04H2ACXX:4:1107:8119:199917      99      chr1A   293     255     62M     =       484     269     CTCGAGCT
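
One way to spot this in SAM text output is to look for header lines that carry fields which are not `TAG:VALUE` pairs. The sketch below does that with awk; the function name and the `@HD` line in the demo are made up, and the check assumes tab-separated header fields as in the SAM format.

```shell
# Print the line numbers of SAM header lines (other than @CO comments) that
# contain a field not shaped like a two-character TAG followed by ':'.
fused_header_lines() {
  awk -F '\t' '
    /^@/ && !/^@CO/ {
      for (i = 2; i <= NF; i++)
        if ($i !~ /^[A-Za-z][A-Za-z0-9]:/) { print NR; next }
    }' "$@"
}

# Demo: line 2 reproduces the start of the fused @PG line shown above.
printf '@HD\tVN:1.0\n@PG\tID:biokanga\tVN:4.3.5HWI-ST226:209:D04H2ACXX:4:1107:8119:199917\t99\tchr1A\t293\n' \
  | fused_header_lines   # prints "2"
```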

Installation error

I am trying to install biokanga on Ubuntu 18.04 and getting the following error at the make install stage:

waqas@waqas-Inspiron-5521:~/biokanga$ make install
Making install in libbiokanga
make[1]: Entering directory '/home/waqas/biokanga/libbiokanga'
make[2]: Entering directory '/home/waqas/biokanga/libbiokanga'
make[2]: Nothing to be done for 'install-exec-am'.
make[2]: Nothing to be done for 'install-data-am'.
make[2]: Leaving directory '/home/waqas/biokanga/libbiokanga'
make[1]: Leaving directory '/home/waqas/biokanga/libbiokanga'
Making install in libBKPLPlot
make[1]: Entering directory '/home/waqas/biokanga/libBKPLPlot'
make[2]: Entering directory '/home/waqas/biokanga/libBKPLPlot'
make[2]: Nothing to be done for 'install-exec-am'.
make[2]: Nothing to be done for 'install-data-am'.
make[2]: Leaving directory '/home/waqas/biokanga/libBKPLPlot'
make[1]: Leaving directory '/home/waqas/biokanga/libBKPLPlot'
Making install in genhyperconserved
make[1]: Entering directory '/home/waqas/biokanga/genhyperconserved'
g++ -g -O2 -o genhyperconserved genhyperconserved.o ../libbiokanga/libbiokanga.a ../libbiokanga/zlib/libz.a -lrt -ldl -lpthread
/usr/bin/ld: ../libbiokanga/zlib/libz.a(crc32.o): relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: ../libbiokanga/zlib/libz.a(deflate.o): relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: ../libbiokanga/zlib/libz.a(inflate.o): relocation R_X86_64_32S against hidden symbol `zcfree' can not be used when making a PIE object
/usr/bin/ld: ../libbiokanga/zlib/libz.a(inftrees.o): relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: ../libbiokanga/zlib/libz.a(trees.o): relocation R_X86_64_32S against hidden symbol `_length_code' can not be used when making a PIE object
/usr/bin/ld: ../libbiokanga/zlib/libz.a(zutil.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: ../libbiokanga/zlib/libz.a(compress.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: ../libbiokanga/zlib/libz.a(gzlib.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: ../libbiokanga/zlib/libz.a(gzread.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: ../libbiokanga/zlib/libz.a(gzwrite.o): relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: ../libbiokanga/zlib/libz.a(inffast.o): relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
Makefile:338: recipe for target 'genhyperconserved' failed
make[1]: *** [genhyperconserved] Error 1
make[1]: Leaving directory '/home/waqas/biokanga/genhyperconserved'
Makefile:371: recipe for target 'install-recursive' failed
make: *** [install-recursive] Error 1

Any help will be highly appreciated.
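
The linker messages indicate that the bundled `libz.a` was compiled without `-fPIC` while the toolchain links position-independent executables (PIE) by default. One possible workaround, not verified against this repository, is to ask the compiler driver for a non-PIE link via GCC's `-no-pie` option when configuring. The alternative suggested by the linker itself is to rebuild the bundled zlib with `-fPIC`.

```shell
# Untested sketch: reconfigure so the final link does not request a PIE,
# then rebuild from a clean tree. Run from the top of the biokanga checkout.
./configure --prefix=$HOME LDFLAGS="-no-pie"
make clean
make install
```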

Bug: Blitz alignment outside of target sequence length

I've stumbled upon an interesting bug in a BioKanga Blitz result (Version 4.3.9): an alignment block is reported that sits wholly beyond the end of the target sequence.

strand qName qSize qStart qEnd tName tSize tStart tEnd blockCount blockSizes qStarts tStarts
-  query  1345  44  1290  target  55537  20104  81702  2  337,781,  55,520,  20104,80921,

Note that the target sequence is only 55,537 bp (confirmed), but the second alignment block starts at 80,921 bp and overall alignment goes to 81,702 bp.

A pairwise BLAST confirms the presence and location of the first alignment block, but of course not the second.

I have at least 6 more examples of the same issue having occurred. However, testing a one-to-one alignment of this query to this target does not replicate the issue; it only happens as part of a larger scale process.
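
Affected records can be found by checking that every block ends within the target. The sketch below assumes the `tStarts` values are 0-based offsets on the target and that `blockSizes` and `tStarts` are comma-separated lists of equal length, as in the row above; the function name is made up.

```shell
# Succeed only if every alignment block ends at or before the target size.
# Arguments: tSize, blockSizes, tStarts (comma-separated, trailing comma allowed).
# The function body runs in a subshell, so IFS and variables stay local.
blocks_within_target() (
  tsize=$1; sizes=$2; starts=$3
  IFS=,
  set -- $sizes                 # block sizes become $1, $2, ...
  for start in $starts; do
    size=$1; shift
    if [ $(( start + size )) -gt "$tsize" ]; then
      echo "block at $start (+$size) ends beyond target size $tsize" >&2
      return 1
    fi
  done
)

# Values taken from the row reported above: the second block ends at 81702,
# well past the 55537 bp target.
blocks_within_target 55537 337,781, 20104,80921, || echo "out-of-range block found"
```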
