
NucMerge manual



1 Introduction

NucMerge improves genome assembly accuracy by incorporating information derived from an alternative assembly and paired-end Illumina reads from the same genome. It corrects insertion, deletion, substitution, and inversion errors and locates inter- and intra-chromosomal rearrangement errors. The tool is described in the manuscript mentioned in Section 6.



2 Prerequisites

NucMerge can be run on Linux and Mac OS.

Tools that should be preinstalled and added to the PATH before running NucMerge:

NucBreak (https://github.com/uio-bmi/NucBreak) is provided together with NucMerge.

NucMerge was tested using Python 2.7, Pilon v1.22, NucDiff v2.0.2, NucBreak v1.0, bwa v0.7.5, samtools v1.3.1, bowtie2 v2.2.9, and MUMmer v3.23.
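
Before the first run it can help to confirm that the external tools are visible on the PATH. The following is a minimal sketch, not part of NucMerge: it is written for Python 3 (shutil.which does not exist in Python 2.7), and the executable names in ASSUMED_EXECUTABLES are assumptions that may differ on your system. Pilon and NucDiff are left out because the way they are invoked depends on how they were installed.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
# "nucmer" is the aligner shipped with MUMmer.
ASSUMED_EXECUTABLES = ["bwa", "samtools", "bowtie2", "nucmer"]


def find_missing_tools(executables):
    """Return the names from `executables` that cannot be found on the PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```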



3 Installation

Clone the NucMerge GitHub repository using the following command:

git clone --recursive https://github.com/uio-bmi/NucMerge.git
 



4 Running

4.1 Command line syntax and input arguments

To run NucMerge, run nucmerge.py with valid input arguments:

python nucmerge.py [-h] [--proc [int]] [--version]
                   Target_assembly.fasta Query_assembly.fasta PE_reads_1.fastq PE_reads_2.fastq Output_dir Prefix

Positional arguments:

  • Target_assembly.fasta - Fasta file with the target assembly
  • Query_assembly.fasta - Fasta file with the query assembly
  • PE_reads_1.fastq - Fastq file with the first reads of the paired-end data. The reads are expected to be forward-oriented.
  • PE_reads_2.fastq - Fastq file with the second reads of the paired-end data. The reads are expected to be reverse-oriented.
  • Output_dir - Path to the directory where all intermediate and final results will be stored
  • Prefix - Name that will be added to all generated files

Optional arguments:

  • -h, --help - show this help message and exit
  • --proc - Number of processes to use. Using 5 processes is recommended. [default: 5]
  • --version - show program's version number and exit
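
When NucMerge is launched from a script, the syntax above can be assembled as an argument list. This is a hedged sketch: build_nucmerge_command is a hypothetical helper, not something NucMerge provides, and the file names are placeholders. It only places the optional --proc flag ahead of the six positional arguments in the documented order.

```python
import subprocess


def build_nucmerge_command(target, query, reads_1, reads_2, output_dir, prefix, proc=None):
    """Assemble the argument list in the order given by the command line syntax."""
    command = ["python", "nucmerge.py"]
    if proc is not None:
        command += ["--proc", str(proc)]
    command += [target, query, reads_1, reads_2, output_dir, prefix]
    return command


command = build_nucmerge_command(
    "my_target_asmb.fasta", "my_query_asmb.fasta",
    "my_pe_reads_1.fastq", "my_pe_reads_2.fastq",
    "my_output_dir", "my_prefix", proc=5,
)
print(" ".join(command))
# To launch the run, uncomment the next line (requires nucmerge.py in the working directory):
# subprocess.check_call(command)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.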

4.2 Running examples

An example run with the default NucMerge parameter values:

python nucmerge.py my_target_asmb.fasta my_query_asmb.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix

An example run with an explicitly set --proc value:

python nucmerge.py --proc 1 my_target_asmb.fasta my_query_asmb.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix



5 NucMerge output

NucMerge stores the output results produced by NucDiff, NucBreak, and Pilon in the following directories:

  • NucDiff - <output_dir>/NucDiff
  • NucBreak run with the target assembly - <output_dir>/NucBreak_1
  • NucBreak run with the query assembly - <output_dir>/NucBreak_2
  • Pilon run with the target assembly - <output_dir>/Pilon_1
  • Pilon run with the query assembly - <output_dir>/Pilon_2

NucMerge produces the following files stored in <output_dir>:

  • ‹Prefix›_local_differences.gff
  • ‹Prefix›_structural_differences.gff
  • ‹Prefix›_nucmerge_asmb.fasta
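
For downstream scripts, the layout listed above can be turned into paths programmatically. The helper below is a sketch with a hypothetical name (expected_outputs); it only joins the directory and file names given in this section and does not check that the files exist.

```python
import os


def expected_outputs(output_dir, prefix):
    """Map a short label to each directory and file named in this section."""
    subdirs = ["NucDiff", "NucBreak_1", "NucBreak_2", "Pilon_1", "Pilon_2"]
    suffixes = ["local_differences.gff", "structural_differences.gff", "nucmerge_asmb.fasta"]
    paths = {name: os.path.join(output_dir, name) for name in subdirs}
    for suffix in suffixes:
        paths[suffix] = os.path.join(output_dir, prefix + "_" + suffix)
    return paths


paths = expected_outputs("my_output_dir", "my_prefix")
for label in sorted(paths):
    print(label, "->", paths[label])
```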

5.1 ‹Prefix›_local_differences.gff

The file contains information about the different types of insertion, deletion, and substitution errors detected in the target assembly.

The following information is contained in the file:

  • column 1 - Name of the target assembly sequence
  • column 2 - NucMerge version used
  • column 3 - Sequence Ontology accession number
  • column 4 - Error start
  • column 5 - Error end
  • column 6,7,8 - Score/strand/phase fields are not used
  • column 9, ID - Identification name of an error
  • column 9, ID_nucdiff - Error's ID assigned by NucDiff. If ID_nucdiff starts with SNP, information about the error can be found in query_snps.gff; otherwise it can be found in query_struct.gff.
  • column 9, Name - Error type as detected by NucDiff in the comparison with the query assembly
  • column 9, old_len - Length of the erroneous fragment in the target assembly
  • column 9, new_len - Length of the erroneous fragment after correction in the resulting assembly
  • column 9, old_seq - Sequence of the erroneous fragment in the target assembly
  • column 9, new_seq - Sequence of the fragment after correction in the resulting assembly

The description of the query_snps.gff and query_struct.gff files produced by NucDiff and all possible error types can be found at https://github.com/uio-cels/NucDiff/wiki.

The ‹Prefix›_local_differences.gff file example:

##gff-version 3
##sequence-region	NODE_1	1	273095
NODE_1	NucMerge_v1.0	SO:1000002	27951	27951	.	.	.	ID=LD_1;ID_nucdiff=SNP_4;Name=substitution;old_len=1;new_len=1;old_seq=C;new_seq=G;color=#42C042
NODE_1	NucMerge_v1.0	SO:0000667	129759	129759	.	.	.	ID=LD_2;ID_nucdiff=SNP_11;Name=insertion;old_len=1;new_len=0;old_seq=G;new_seq=.;color=#EE0000
NODE_1	NucMerge_v1.0	SO:0000667	233592	233601	.	.	.	ID=LD_3;ID_nucdiff=SNP_27;Name=inserted_gap;old_len=10;new_len=0;old_seq=NNNNNNNNNN;new_seq=.;color=#EE0000
##sequence-region	NODE_2	1	211125
NODE_2	NucMerge_v1.0	SO:1000035	139350	139382	.	.	.	ID=LD_4;ID_nucdiff=SV_21;Name=duplication;old_len=33;new_len=0;old_seq=CCCGGGAGCATAGATAACTATGTGACCGGGGTG;new_seq=.;color=#EE0000
NODE_2	NucMerge_v1.0	SO:0000159	173435	173435	.	.	.	ID=LD_5;ID_nucdiff=SV_33;Name=collapsed_tandem_repeat;old_len=0;new_len=20;old_seq=.;new_seq=AGCCAGCGGCTGTTTGTCAG;color=#0000EE
...
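
The records above are tab-separated, with column 9 holding key=value pairs separated by semicolons. A minimal parsing sketch, assuming exactly this layout, could look as follows; parse_gff_line and the dictionary keys are illustrative names, not part of NucMerge.

```python
def parse_gff_line(line):
    """Split one tab-separated GFF3 record; column 9 becomes a nested dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    attributes = dict(item.split("=", 1) for item in fields[8].split(";") if item)
    return {
        "seq_name": fields[0],
        "source": fields[1],
        "so_accession": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "attributes": attributes,
    }


# First record from the example file above.
example = ("NODE_1\tNucMerge_v1.0\tSO:1000002\t27951\t27951\t.\t.\t.\t"
           "ID=LD_1;ID_nucdiff=SNP_4;Name=substitution;old_len=1;new_len=1;"
           "old_seq=C;new_seq=G;color=#42C042")
record = parse_gff_line(example)
print(record["attributes"]["Name"], record["start"], record["attributes"]["new_seq"])
# prints: substitution 27951 G
```

Note that old_len and new_len remain strings in this sketch; convert them with int() if lengths are needed.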

5.2 ‹Prefix›_structural_differences.gff

The file contains information about inversion errors and structural breakpoints corresponding to inter- and intra-chromosomal rearrangement errors detected in the target assembly.

The following information is contained in the file:

  • column 1 - Name of the target assembly sequence
  • column 2 - NucMerge version used
  • column 3 - Sequence Ontology accession number
  • column 4 - Error start
  • column 5 - Error end
  • column 6,7,8 - Score/strand/phase fields are not used
  • column 9, ID - Identification name of an error
  • column 9, Name - Inversion or breakpoint
  • column 9, ID_nucdiff - Error's ID assigned by NucDiff. Information about the error can be found in query_struct.gff.
  • column 9, Type_nucdiff - The type of an error detected by NucDiff. The real error type can differ from the given one.

The description of the query_struct.gff file produced by NucDiff and all possible error types can be found at https://github.com/uio-cels/NucDiff/wiki.

The ‹Prefix›_structural_differences.gff file example:

##gff-version 3
##sequence-region	NODE_1	1	617
NODE_1	NucMerge_v1.0	SO:0000699	331	430	.	.	.	ID=SD_1;Name=breakpoint;ID_nucdiff=SV_149;Type_nucdiff=translocation-inserted_gap;color=#0000EE
##sequence-region	NODE_2	1	4763
NODE_2	NucMerge_v1.0	SO:0000699	4478	4478	.	.	.	ID=SD_2;Name=breakpoint;ID_nucdiff=SV_174;Type_nucdiff=reshuffling-part_1_gr_0;color=#0000EE
##sequence-region	NODE_3	1	208973
NODE_3	NucMerge_v1.0	SO:1000036	418	1022	.	.	.	ID=SD_3;Name=inversion;ID_nucdiff=SV_317;Type_nucdiff=inversion;color=#EE0000
NODE_3	NucMerge_v1.0	SO:0000699	71741	71926	.	.	.	ID=SD_4;Name=breakpoint;ID_nucdiff=SV_2577;Type_nucdiff=translocation-inserted_gap;color=#0000EE
NODE_3	NucMerge_v1.0	SO:0000699	110857	110857	.	.	.	ID=SD_5;Name=breakpoint;ID_nucdiff=SV_2629;Type_nucdiff=reshuffling-part_2_gr_1;color=#0000EE
NODE_3	NucMerge_v1.0	SO:0000699	110857	110857	.	.	.	ID=SD_6;Name=breakpoint;ID_nucdiff=SV_2630;Type_nucdiff=inversion;color=#0000EE
...
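
To see where a target sequence would be split, the breakpoint records can be grouped per sequence. The sketch below assumes the tab-separated layout shown in the example and uses a hypothetical function name; lines starting with # and lines without nine columns are skipped.

```python
from collections import defaultdict


def breakpoints_by_sequence(gff_lines):
    """Collect (start, end) pairs of records whose Name attribute is 'breakpoint'."""
    result = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip pragmas such as ##gff-version and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip anything that is not a full record
        attributes = dict(item.split("=", 1) for item in fields[8].split(";") if item)
        if attributes.get("Name") == "breakpoint":
            result[fields[0]].append((int(fields[3]), int(fields[4])))
    return dict(result)


# Three records taken from the example file above.
lines = [
    "##gff-version 3",
    "NODE_1\tNucMerge_v1.0\tSO:0000699\t331\t430\t.\t.\t.\tID=SD_1;Name=breakpoint;ID_nucdiff=SV_149;Type_nucdiff=translocation-inserted_gap;color=#0000EE",
    "NODE_3\tNucMerge_v1.0\tSO:1000036\t418\t1022\t.\t.\t.\tID=SD_3;Name=inversion;ID_nucdiff=SV_317;Type_nucdiff=inversion;color=#EE0000",
    "NODE_3\tNucMerge_v1.0\tSO:0000699\t71741\t71926\t.\t.\t.\tID=SD_4;Name=breakpoint;ID_nucdiff=SV_2577;Type_nucdiff=translocation-inserted_gap;color=#0000EE",
]
grouped = breakpoints_by_sequence(lines)
print(grouped)
```

To read a real file, pass an open file handle instead of the list, since iterating over a handle yields lines.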

5.3 ‹Prefix›_nucmerge_asmb.fasta

The file contains the resulting assembly obtained from the target assembly by (1) correcting inversion errors and the errors listed in ‹Prefix›_local_differences.gff and (2) splitting target assembly sequences in the regions containing breakpoints from ‹Prefix›_structural_differences.gff.
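
The splitting step can be illustrated with a small sketch. This is not NucMerge's implementation: the manual does not say what happens to the bases inside a breakpoint region, so the sketch assumes they are discarded, and it treats coordinates as 1-based and inclusive, as in GFF. The function name is hypothetical.

```python
def split_at_breakpoints(sequence, breakpoints):
    """Split `sequence` around 1-based, inclusive (start, end) breakpoint intervals.

    Assumption: bases inside each interval are dropped; NucMerge may handle them differently.
    """
    pieces = []
    previous_end = 0  # 0-based index where the next piece starts
    for start, end in sorted(breakpoints):
        piece = sequence[previous_end:start - 1]
        if piece:
            pieces.append(piece)
        # max() keeps the cursor correct when intervals overlap
        previous_end = max(previous_end, end)
    tail = sequence[previous_end:]
    if tail:
        pieces.append(tail)
    return pieces


print(split_at_breakpoints("AAAACCGGGG", [(5, 6)]))
# prints: ['AAAA', 'GGGG']
```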

6 Citing NucMerge

To cite NucMerge in your publication:

Khelik K., et al. NucMerge: Genome assembly quality improvement assisted by alternative assemblies and paired-end Illumina reads. (in preparation)

nucmerge's Issues

Program stopped after Bowtie process

Hello,
I cloned the repo and ran NucMerge on my data with the following command:
nucmerge.py --proc 50 Contigs1.fasta Contigs2.fasta fw.fastq rv.fastq TEST

But the program stalled after a few minutes, and the prompt has not changed for hours.
I'm on Ubuntu 16.04 (64-bit OS, 125 GB RAM) and I have all the required software in my $PATH.

Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 6286577
    bwtLen: 6286578
    sz: 1571645
    bwtSz: 1571645
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 392912
    offsSz: 1571648
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 32743
    numLines: 32743
    ebwtTotLen: 2095552
    ebwtTotSz: 2095552
    color: 0
    reverse: 1
Total time for backward call to driver() for mirror index: 00:00:02
908765 reads; of these:
  908765 (100.00%) were unpaired; of these:
    356 (0.04%) aligned 0 times
    881953 (97.05%) aligned exactly 1 time
    26456 (2.91%) aligned >1 times
99.96% overall alignment rate
908765 reads; of these:
  908765 (100.00%) were unpaired; of these:
    9772 (1.08%) aligned 0 times
    867596 (95.47%) aligned exactly 1 time
    31397 (3.45%) aligned >1 times
98.92% overall alignment rate
908765 reads; of these:
  908765 (100.00%) were unpaired; of these:
    12195 (1.34%) aligned 0 times
    871690 (95.92%) aligned exactly 1 time
    24880 (2.74%) aligned >1 times
98.66% overall alignment rate
908765 reads; of these:
  908765 (100.00%) were unpaired; of these:
    21603 (2.38%) aligned 0 times
    857190 (94.32%) aligned exactly 1 time
    29972 (3.30%) aligned >1 times
97.62% overall alignment rate
min_frag_size 36
max_frag_size 1244
read_length 251
min_frag_size 36
max_frag_size 1217
read_length 251

Thanks in advance

TypeError: __init__() takes at least 3 arguments (1 given)

Hello,
I ran the following command: python nucmerge.py asb-1 asb-2 pe-1 pe-2 outdir prefix
But I got these errors.
What do they mean?

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.19 seconds elapse.
[bwa_index] Update BWT... 0.01 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.10 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -p /mnt/data/liyunxia/4-project/mito/velvet/Pilon_1/bwa/yyl-nucmer_1 /mnt/data/liyunxia/4-project/mito/velvet/YYL_Mtctgs-zhu2-uniq.fasta
[main] Real time: 0.311 sec; CPU: 0.310 sec
[E::bwa_set_rg] the read group line contained literal <tab> characters -- replace with escaped tabs: \t
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/mnt/data/liyunxia/anaconda3/envs/py2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/mnt/data/liyunxia/anaconda3/envs/py2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mnt/data/liyunxia/anaconda3/envs/py2/lib/python2.7/multiprocessing/pool.py", line 392, in _handle_results
    task = get()
TypeError: __init__() takes at least 3 arguments (1 given)

Hope for your reply
yun

NucDiff process seems not running correctly

Hi, I was trying to merge two assemblies produced by different assemblers.
The run got through Pilon, and it seems to have completed the MUMmer step.
However, the NucDiff/results directory is empty.
I've checked the log file, but I could not figure out what the problem was.

$ ls -hal
total 114M
163 Mar 23 01:39 .
191 Mar 22 14:48 ..
38M Mar 23 01:39 nucmerge.coords
37M Mar 23 01:39 nucmerge.delta
17M Mar 23 01:39 nucmerge.filter
0 Mar 23 01:39 nucmerge_filtered.snps
0 Mar 23 01:29 results

partial_run.log

Multiple contigs and fasta files

Hello, can I use more than 2 assemblies and more than one pair of PE fastq files? For example, 3 assemblies and 3 pairs of PE fastq files?

Or should I just concatenate the reads into R1 and R2?
