NucBreak manual



1 Introduction

NucBreak detects structural errors in assemblies and structural variants between pairs of genomes when only the genome of one organism and Illumina paired-end reads from another organism are available. It can detect insertions, deletions, inter- and intra-chromosomal translocations, and inversions. However, it reports only breakpoint locations and does not specify the type of each breakpoint. It is written in Python and uses Bowtie2 for read alignment. The tool is described in the manuscript listed in Section 5.



2 Prerequisites

NucBreak runs on Linux and Mac OS. It uses Python 2.7, Bowtie2 v2.2.9 and the SAMtools utilities v1.3.1. Bowtie2 and SAMtools must be installed and available in the PATH before NucBreak is run.

Bowtie2 can be downloaded at https://sourceforge.net/projects/bowtie-bio/files/bowtie2/ . SAMtools can be downloaded at https://github.com/samtools/samtools .
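
The PATH requirement above can be checked before a long run is started. The following is a minimal sketch and not part of NucBreak; the function name missing_tools is hypothetical, and the check relies on shutil.which, which needs Python 3.3 or later (NucBreak itself uses Python 2.7).

```python
import shutil

# Executables named in the prerequisites above.
REQUIRED_TOOLS = ("bowtie2", "samtools")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found in the PATH.

    `which` is injectable so that the check can be exercised without
    the real executables being installed.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling missing_tools() with no arguments returns an empty list when both executables are found. It does not verify the installed versions.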



3 Running NucBreak

3.1 Command line syntax and input arguments

To run NucBreak, run the nucbreak.py script with valid input arguments:

python nucbreak.py [-h] [--min_frag_size [MIN_FRAG_SIZE]]
                        [--max_frag_size [MAX_FRAG_SIZE]]
                        [--sam_1 [SAM_1]]
                        [--sam_2 [SAM_2]] 
                        [--bam_pos [{yes,no}]] 
                        [--version]
                        Genome.fasta PE_reads_1.fastq PE_reads_2.fastq Output_dir Prefix

Positional arguments:

  • Genome.fasta - Fasta file with genome sequences
  • PE_reads_1.fastq - Fastq file with the first reads of the paired-end read pairs. These reads are expected to be forward-oriented
  • PE_reads_2.fastq - Fastq file with the second reads of the paired-end read pairs. These reads are expected to be reverse-oriented
  • Output_dir - Path to the directory where all intermediate and final results will be stored
  • Prefix - Name that will be added to all generated files including the ones created by Bowtie2

Optional arguments:

  • -h, --help - show this help message and exit
  • --min_frag_size - minimum fragment size used to choose perfectly mapped read pairs
  • --max_frag_size - maximum fragment size used to choose perfectly mapped read pairs
  • --sam_1 - Path to an existing Bowtie2 sam file containing alignment results for the first reads of the paired-end read pairs
  • --sam_2 - Path to an existing Bowtie2 sam file containing alignment results for the second reads of the paired-end read pairs
  • --bam_pos - Generate bam files with entries sorted by position, together with their index files (yes/no)
  • --version - show program's version number and exit
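
The syntax above maps naturally onto Python's argparse module. The sketch below shows one way such an interface could be declared; it is not the actual nucbreak.py source. The destination names, the default values (None for the fragment sizes and sam files, "no" for --bam_pos) and the version string are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Declare a command line interface shaped like the usage shown above."""
    parser = argparse.ArgumentParser(prog="nucbreak.py")

    # Optional arguments; nargs="?" reproduces the "[--opt [VALUE]]" form.
    parser.add_argument("--min_frag_size", nargs="?", type=int, default=None)
    parser.add_argument("--max_frag_size", nargs="?", type=int, default=None)
    parser.add_argument("--sam_1", nargs="?", default=None)
    parser.add_argument("--sam_2", nargs="?", default=None)
    # Assumption: bam files sorted by position are not generated by default.
    parser.add_argument("--bam_pos", nargs="?", choices=["yes", "no"], default="no")
    # Placeholder version string; the real version number is not known here.
    parser.add_argument("--version", action="version", version="%(prog)s (unknown)")

    # Positional arguments, in the order given in the syntax above.
    parser.add_argument("genome", metavar="Genome.fasta")
    parser.add_argument("reads_1", metavar="PE_reads_1.fastq")
    parser.add_argument("reads_2", metavar="PE_reads_2.fastq")
    parser.add_argument("output_dir", metavar="Output_dir")
    parser.add_argument("prefix", metavar="Prefix")
    return parser
```

With this declaration, build_parser().parse_args() fails with a usage message when any of the five positional arguments is missing, which is the behaviour the syntax line implies.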

3.2 Running examples

A running example with the NucBreak default parameter values:

python nucbreak.py my_genome.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix

A running example with existing Bowtie2 sam files. Each read file must be aligned independently of the other read file. Bowtie2 should be run with the "--sensitive-local --ma 1 -a" parameter settings. The output sam files should be sorted by read name.

python nucbreak.py --sam_1 my_sam_1 --sam_2 my_sam_2 my_genome.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix
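
The sam files for --sam_1 and --sam_2 could be prepared with commands along the following lines. This is a sketch under assumptions, not the procedure NucBreak runs internally: the helper names and file names are hypothetical, and whether NucBreak expects exactly the ordering produced by "samtools sort -n" is not confirmed by this manual. Note that Bowtie2 spells the preset with a hyphen, --sensitive-local.

```python
def index_command(genome_fasta, index_prefix):
    """Command that builds a Bowtie2 index for the genome."""
    return ["bowtie2-build", genome_fasta, index_prefix]


def bowtie2_command(index_prefix, reads_fastq, out_sam):
    """Command that aligns ONE read file as unpaired reads (-U).

    The settings are the ones named in the text above: the
    --sensitive-local preset, a match bonus of 1 and all alignments (-a).
    """
    return ["bowtie2", "--sensitive-local", "--ma", "1", "-a",
            "-x", index_prefix, "-U", reads_fastq, "-S", out_sam]


def name_sort_command(in_sam, out_sam):
    """Command that sorts a sam file by read name and keeps sam output."""
    return ["samtools", "sort", "-n", "-O", "sam", "-o", out_sam, in_sam]
```

Each helper only returns an argument list. To execute one, pass it to subprocess.check_call, for example subprocess.check_call(bowtie2_command("my_index", "my_pe_reads_1.fastq", "my_sam_1")), and repeat the alignment and sorting steps for the second read file.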

A running example with user-defined minimum and maximum fragment sizes. Set your own minimum and maximum fragment sizes only when you disagree with the automatically detected ones.

python nucbreak.py --min_frag_size 50 --max_frag_size 1150 my_genome.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix

To visualize read alignments in a genome browser, use the --bam_pos option. Bam files with alignments sorted by position are then generated automatically, together with their index files:

python nucbreak.py --bam_pos yes my_genome.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix



4 NucBreak output

NucBreak puts the Bowtie2 output in the <output_dir>/bowtie2 directory. The file with the fragment size distribution and the file with detected breakpoints are located in <output_dir>.

4.1 Fragment_size_distr.txt

The file contains the minimum and maximum fragment sizes and the read length used by NucBreak, followed by the fragment size distribution. In the distribution, the first column shows a fragment size and the second column shows the number of read pairs with that fragment size.

The Fragment_size_distr.txt file example:

min_frag_size=35
max_frag_size=1129
read_length=251

Fragment size distribution
250	200
251	287
252	357
253	344
254	317
255	351
256	369
257	397
258	426
...
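
A file laid out like the example above can be read with a few lines of Python. The parser below is a minimal sketch based only on that example; the function name is hypothetical, and it assumes that every "key=value" line holds an integer and that every distribution row has exactly two whitespace-separated integer columns.

```python
def parse_fragment_size_distr(text):
    """Split the file content into header parameters and the distribution.

    Returns (params, distribution), where params maps names such as
    "min_frag_size" to integers and distribution maps a fragment size
    to the number of read pairs with that size.
    """
    params = {}
    distribution = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            params[key.strip()] = int(value)
            continue
        fields = line.split()
        # Rows of the distribution; the title line and "..." are skipped.
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            distribution[int(fields[0])] = int(fields[1])
    return params, distribution
```

The most frequent fragment size can then be obtained with max(distribution, key=distribution.get).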

4.2 prefix_breakpoints.bedgraph

The file lists all detected assembly errors or structural variations in a genome. The first column contains the genome sequence name. The second and third columns give the start and end coordinates of the region where a breakpoint was detected. The fourth column is used for visualizing the results in a genome browser and is always equal to 1.

The prefix_breakpoints.bedgraph file example:

track type=bedGraph name=breakpoints description="BedGraph format" visibility=full color=0,0,0 graphType=bar autoScale=on
NODE_44	9866	9873	1
NODE_136	352	369	1
NODE_136	537	589	1
NODE_136	1047	1064	1
NODE_150	2533	2541	1
NODE_649	506	526	1
...
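
For downstream analysis, the breakpoint file can be loaded into Python. The reader below is a minimal sketch based on the example above; the names Breakpoint and parse_breakpoints are hypothetical, and it assumes four whitespace-separated columns per data line and a header line that starts with "track".

```python
from collections import namedtuple

# One row of the bedgraph file.
Breakpoint = namedtuple("Breakpoint", ["sequence", "start", "end", "value"])


def parse_breakpoints(lines):
    """Return the breakpoints found in an iterable of bedgraph lines."""
    breakpoints = []
    for line in lines:
        line = line.strip()
        # Skip empty lines, the track definition line and comments.
        if not line or line.startswith("track") or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 4:
            continue  # not a data row, e.g. a "..." placeholder
        sequence, start, end, value = fields
        breakpoints.append(Breakpoint(sequence, int(start), int(end), float(value)))
    return breakpoints
```

A file can be passed directly, for example parse_breakpoints(open("my_prefix_breakpoints.bedgraph")), where the file name follows the prefix_breakpoints.bedgraph pattern described above.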



5 Citing NucBreak

To cite NucBreak in your publication:

Khelik K., et al. NucBreak: Location of structural errors in a genome assembly and structural variations between a pair of genomes using Illumina paired-end reads. (in preparation)

nucbreak's Issues

ERROR: unknown simbol in the beginning of the sigar string

Hi,

I'm testing NucBreak. I'm separately aligning my pairs with BWA with the following command line:

bwa mem -t $THREADS $ASSEMBLY.fasta $GENOMICR1 | samtools view -@ $THREADS -Sb - | samtools sort -@ $THREADS - | samtools view -@ $THREADS -h > $ASSEMBLY.DNA.Aligned.P1.sortedByCoord.out.bam

and then I execute NucBreak with this command:

python $NucBREAK --min_frag_size 180 --sam_1 $ASSEMBLY.DNA.Aligned.P1.sortedByCoord.out_.bam --sam_2 $ASSEMBLY.DNA.Aligned.P2.sortedByCoord.out_.bam $ASSEMBLY.fasta $GENOMICR1 $GENOMICR2 $ASSEMBLY.NucBREAK Test.NucBREAK

Unfortunately I get this error.

H
ERROR: unknown simbol in the beginning of the sigar string
Traceback (most recent call last):
  File "/home/fc464/software/NucBreak/nucbreak.py", line 92, in <module>
    main()
  File "/home/fc464/software/NucBreak/nucbreak.py", line 90, in main
    START(args)
  File "/home/fc464/software/NucBreak/nucbreak.py", line 55, in START
    min_ins_size, max_ins_size, read_length,read_groups_dict, asmb_seq_dict=insert_size.FIND_INSERT_SIZE_VALUES(pe_sam_1,pe_sam_2,working_dir+'Fragment_size_distr.txt',min_frag_size, max_frag_size)
  File "/home/fc464/software/NucBreak/insert_size.py", line 342, in FIND_INSERT_SIZE_VALUES
    insertion_size_dict, read_length, read_groups_dict, asmb_seq_dict=PARSE_SAM_FILES(pe_sam_1,pe_sam_2)
  File "/home/fc464/software/NucBreak/insert_size.py", line 132, in PARSE_SAM_FILES
    append_line=general.FIND_LINE(temp1)
  File "/home/fc464/software/NucBreak/general.py", line 118, in FIND_LINE
    return [map_flag_dict[map_flag],ref_name,ref_st,ref_end,read_st, read_end, cigar,nm,qual_ascii_33]
KeyError: 2064

Any help?

Thanks
F

Preprint

Interesting preprint. Aren't you worried about upgrading to Python 3? Python 2.7 will only be supported for about 16 months from now. At that point there is going to be a lot of obsolete software.

cheers

Incomplete output after NucBreak run

Hi,
I want to find structural variants in my samples, so I'm running NucBreak as follows (abbreviated for ease of understanding):

python nucbreak.py reference.fasta R1.fastq.gz R2.fastq.gz out sp_polecat

NucBreak runs for about 10 days and finishes without any errors.
I have the following output files:

$ ls  out/Results/bowtie2/
sp_polecat_1.sam
sp_polecat_tree.1.bt2
sp_polecat_tree.2.bt2
sp_polecat_tree.3.bt2
sp_polecat_tree.4.bt2
sp_polecat_tree.rev.1.bt2
sp_polecat_tree.rev.2.bt2

I have no other output files in the directory structure.
I was expecting to see sp_polecat_2.sam in the bowtie2 directory as well as Fragment_size_distr.txt and sp_polecat_breakpoints.bedgraph somewhere in the output, but they don't exist. I've attached my stdout so you can see what's happening.
I've run NucBreak on three different samples and the above applies in all three instances.
nucbreak_stdout.txt

Graham

Bowtie vs BWA

Hi! I just discovered this promising tool when I found its latest article.

I have a question regarding the mapping tool used: Is there a specific reason to use Bowtie2 over BWA? And if there is not, would it be possible to use BWA somehow?

I see in the paper that the --rna option is used when mapping with Bowtie2. Is that the reason? If an RNA mapping tool is needed, wouldn't it be better to use Hisat2 or magicblast?

Thanks in advance!
Sivico
