Coder Social home page Coder Social logo

brentp / bwa-meth Goto Github PK

View Code? Open in Web Editor NEW
139.0 13.0 53.0 8.83 MB

fast and accurate alignment of BS-Seq reads using bwa-mem and a 3-letter genome

Home Page: https://arxiv.org/abs/1401.1129

License: MIT License

Python 48.96% Shell 18.00% R 6.00% TeX 26.59% Makefile 0.44%

bwa-meth's Introduction

bwa-meth

Fast and accurante alignment of BS-Seq reads.

NOTE!!!

As of 2016-08-18, bwa-meth now outputs sam to stdout. It is up to the user to convert to bam. This means that the --prefix and --calmd flags are gone.

Update 2016

bwa-meth is still among (if not the) best aligners for BS-Seq. While it is fairly stable, I will continue to support the alignment part of bwa-meth--fixing any bugs or updating as needed.

There are now several (likely better) alternatives for tabulation and SNP calling than provided here so I will not develop those further.

For tabulation, bias, and plotting, use MethylDackel

For SNP calling (a more modern BisSNP), use biscuit

Intro

This works for single-end reads and for paired-end reads from the directional protocol (most common).

Uses the method employed by methylcoder and Bismark of in silico conversion of all C's to T's in both reference and reads.

Recovers the original read (needed to tabulate methylation) by attaching it as a comment which bwa appends as a tag to the read.

Performs favorably to existing aligners gauged by number of on and off-target reads for a capture method that targets CpG-rich region. Some off-target regions may be enriched, but all aligners are be subject to the same assumptions. See manuscript: http://arxiv.org/abs/1401.1129 for details. Optimal alignment is the upper-left corner. Curves are drawn by varying the mapping quality cutoff for alingers that use it.

This image is on real reads and represents an attempt to find good parameters for all aligners tested.

Untrimmed reads comparison

Note that bwa-meth and Last perform well without trimming.

run.sh scripts for each method are here: https://github.com/brentp/bwa-meth/tree/master/compare I have done my best to have each method perform optimally, but no doubt there could be improvements.

QuickStart

Without installation, you can use as python bwameth.py with install, the command is bwameth.py.

The commands:

bwameth.py index $REF #Indexes with BWA-MEM (default)
    #OR
bwameth.py index-mem2 $REF #Indexes with BWA-MEM2

bwameth.py --reference $REF some_R1.fastq.gz some_R2.fastq.gz > some.output.sam

will create some.output.bam and some.output.bam.bai. To align single end-reads, specify only 1 file.

See the full example at: https://github.com/brentp/bwa-meth/tree/master/example/

Installation

The following snippet should work for most systems that have samtools and bwa installed and the ability to install python packages. (Or, you can send this to your sys-admin). See the dependencies section below for further instructions:

    # these 4 lines are only needed if you don't have toolshed installed
    wget https://pypi.python.org/packages/source/t/toolshed/toolshed-0.4.0.tar.gz
    tar xzvf toolshed-0.4.0.tar.gz
    cd toolshed-0.4.0
    sudo python setup.py install

    wget https://github.com/brentp/bwa-meth/archive/master.zip
    unzip master.zip
    cd bwa-meth-master/
    sudo python setup.py install

After this, you should be able to run: bwameth.py and see the help.

Dependencies

bwa-meth depends on

  • python 2.7+ (including python3)

    • toolshed library. can be installed with:

      • easy_install toolshed or
      • pip install toolshed
    • if you don't have root or sudo priviledges, you can run python setup.py install --user from this directory and the bwameth.py executable will be at: ~/.local/bin/bwameth.py

    • if you do have root or sudo run: [sudo] python setup.py install from this directory

    • users unaccustomed to installing their own python packages should download anaconda: https://store.continuum.io/cshop/anaconda/ and then install the toolshed module with pip as described above.

  • samtools command on the $PATH (https://github.com/samtools/samtools)

  • bwa mem from: https://github.com/lh3/bwa OR bwa-mem2 from: https://github.com/bwa-mem2/bwa-mem2

usage

Index

One time only, you need to index a reference sequence.

bwameth.py index $REF #Indexes with BWA-MEM (default)
#OR
bwameth.py index-mem2 $REF #Indexes with BWA-MEM2

If your reference is some.fasta, this will create some.c2t.fasta and all of the bwa indexes associated with it.

Align

bwameth.py --threads 16 \
     --reference $REFERENCE \
     $FQ1 $FQ2 > some.sam

The output will pass will have the reads in the correct location (flipped from G => A reference).

Handles clipped alignments and indels correctly. Fastqs can be gzipped or not.

The command above will be sent to BWA-MEM or BWA-MEM2 to do the work as something like:

bwa mem -L 25 -pCM -t 15  $REFERENCE.c2t.fa \
            '<python bwameth.py c2t $FQ1 $FQ2'

              #OR

bwa-mem2 mem -L 25 -pCM -t 15  $REFERENCE.c2t.fa \
            '<python bwameth.py c2t $FQ1 $FQ2'

Index from BWA-MEM or BWA-MEM2 is auto detected and the corresponding aligner is chosen.

So the converted reads are streamed directly to bwa and never written to disk. The output from that is modified by bwa-meth and streamed straight to a bam file.

bwa-meth's People

Contributors

astatham avatar boegel avatar brentp avatar bwlang avatar chris-cheshire avatar dpryan79 avatar grahamgower avatar iromeo avatar jdidion avatar nchernia avatar nsoranzo avatar paulmenzel avatar poisonalien avatar swingingsimian avatar tsy19900929 avatar ttriche avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

bwa-meth's Issues

Relax Python version restrictions

Is it possible to support >=Python2.6?

Update: bellow is a list of Python2.7 features I've spotted in the code

  1. The use of argparse, which is only available in Python2.7. This can be fixed by adding a conditional dependency on argparse in setup.py.
  2. toolshed/files.py and toolshed/pool.py assume sys.version_info is a namedtuple. This is easily fixed by using plain indexing, eg. sys.version_info[0].

Mapq

I’ve noticed that running bwa meth results in a higher rate of MAPQ 0 reads than vanilla bwa (I did a test on non-bisulfite converted data). Is there any way to ameliorate this or is it just a natural consequence of needing to distinguish between methylated and nonmethylated loci?

error with bwa-mem -R

Hi Brent,

I am aware of tseemann/snippy#100
I am using bwa/0.7.10.
but I checked https://github.com/brentp/bwa-meth/blob/master/bwameth.py, it seems you already took care of it https://github.com/brentp/bwa-meth/blob/master/bwameth.py#L293 by
escaping the tab rg = '@RG\\tID:{rg}\\tSM:{rg}'.format(rg=rg)

what's wrong here?

python /scratch/genomic_med/apps/bwa-meth/bwa-meth-0.2.0/bwameth.py --threads 8 --reference /scratch/genomic_med/apps/annot/bwa_meth_index/UCSC_hg19_genome.fa 01trimmed_fastqs/Panc1_Exo1_trimmed.fq                     --read-group '@RG   ID:Panc1_Exo1   SM:Panc1_Exo1'
running: bwa mem -T 40 -B 2 -L 10 -CM -R '@RG   ID:Panc1_Exo1   SM:Panc1_Exo1' -t 8  /scratch/genomic_med/apps/annot/bwa_meth_index/UCSC_hg19_genome.fa.bwameth.c2t '</scratch/genomic_med/apps/python/anaconda/default/bin/python /scratch/genomic_med/apps/bwa-meth/bwa-meth-0.2.0/bwameth.py c2t 01trimmed_fastqs/Panc1_Exo1_trimmed.fq NA'
[E::bwa_set_rg] no ID at the read group line
Traceback (most recent call last):
  File "/scratch/genomic_med/apps/bwa-meth/bwa-meth-0.2.0/bwameth.py", line 450, in <module>
    main(sys.argv[1:])
  File "/scratch/genomic_med/apps/bwa-meth/bwa-meth-0.2.0/bwameth.py", line 447, in main
    set_as_failed=args.set_as_failed)
  File "/scratch/genomic_med/apps/bwa-meth/bwa-meth-0.2.0/bwameth.py", line 273, in bwa_mem
    as_bam(cmd, fa, set_as_failed)
  File "/scratch/genomic_med/apps/bwa-meth/bwa-meth-0.2.0/bwameth.py", line 290, in as_bam
    raise Exception("bad or empty fastqs")
Exception: bad or empty fastqs

thanks.

[fputs] Broken pipe

I got this several error when run BWA-METH.

/haifengc/panfs/bireads/SRR1171540_1_1000000.fastq NA'
writing to:
samtools view -bS - | samtools sort - bwa-meth
[M::bwa_idx_load_from_disk] read 0 ALT contigs
converting reads in /home/rcf-40/haifengc/panfs/bireads/SRR1171540_1_1000000.fastq,NA
WARNING: running bwameth in single-end mode
[M::process] read 666668 sequences (60000120 bp)...
[M::process] read 333332 sequences (29999880 bp)...
[M::mem_process_seqs] Processed 666668 reads in 316.005 CPU sec, 319.250 real sec
Traceback (most recent call last):
File "bwameth.py", line 601, in
main(sys.argv[1:])
File "bwameth.py", line 586, in main
set_as_failed=args.set_as_failed)
File "bwameth.py", line 259, in bwa_mem
as_bam(cmd, fa, prefix, calmd, set_as_failed)
File "bwameth.py", line 299, in as_bam
out.write(str(aln) + '\n')
IOError: [Errno 32] Broken pipe
[fputs] Broken pipe

WARNING: 5355 reads with length < 80: this program is designed for long reads.

Hello Brent,

I make simulation fastq files and then use bwa-meth to calculate methylation rate. First, i simulate RRBS reads. Second, i trim those RRBS reads using trim_galore. Then i use bwa-meth to calculate the methylation rate. But after running bwameth.py, it says

WARNING: 5355 reads with length < 80
: this program is designed for long reads

Could you be kind enough to help me solve this problem ?

Thanks

paired reads not recoqnized as such

Dear brentp,

Here's my SLURM script:

!/bin/bash

SBATCH --job-name=bwa-meth ### Job Name

SBATCH --output=bwa-meth24hr.out ### File in which to store job output

SBATCH --error=bwa-meth24hr.err ### File in which to store job error messages

SBATCH --qos=normal ### Quality of Service (like a queue in PBS)

SBATCH --time=0-24:00:00 ### Wall clock time limit in Days-HH:MM:SS

SBATCH --nodes=1 ### Node count required for the job

SBATCH --ntasks-per-node=20 ### Nuber of tasks to be launched per Node

echo pwd=pwd
module load python/2.7.11
module load samtools
REFERENCE=/lustre/project/hpcstaff/user/OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.bgz
echo Generating index for $REFERENCE
time bwameth.py index $REFERENCE
echo Analyzing methylation..
time bwameth.py --reference $REFERENCE ../sra-toolkit/SRR2050320_1.fastq.gz ../sra-toolkit/SRR2050320_2.fastq.gz -t 20 --read-group $'@rg\tID:SRR2050320\tSM:SRR2050320'
echo Analysis succeeded.

...and here's the result of my alignment check...
37067603 + 3256037 in total (QC-passed reads + QC-failed reads)
63598 + 967992 secondary
0 + 0 supplimentary
0 + 0 duplicates
35373511 + 3256037 mapped (95.43%:100.00%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Q: This says zero reads paired, true? Is this correct?

Please let me know if you need more info, thanks.

Best,
CB

Test fail: both release 0.9 and git cloned

index worked, align failed see below.
python 2.7.6, Centos 6.2, setuptools 3.3,

note: lib/python2.7/site-packages/
contains only a single file: bwameth-0.09-py2.7.egg

It does not contain a directory so named, etc:
bwameth-0.09-py2.7.egg/EGG-INFO/scripts/bwameth.py

ERROR Report:

bwameth.py --reference ref.fa t_R1.fastq.gz t_R2.fastq.gz -t 12
running: bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@rg ID:t_R SM:t_R' -t 12 ref.fa.bwameth.c2t '</home/pi/vogelw/bin/python /home/pi/vogelw/lib/python2.7/site-packages/bwameth-0.09-py2.7.egg/EGG-INFO/scripts/bwameth.py c2t t_R1.fastq.gz t_R2.fastq.gz'
writing to:
samtools view -bS - | samtools sort - bwa-meth
/home/pi/vogelw/bin/python: can't find 'main' module in '/home/pi/vogelw/lib/python2.7/site-packages/bwameth-0.09-py2.7.egg/EGG-INFO/scripts/bwameth.py'
[gzclose] buffer error
cmd was:bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@rg ID:t_R SM:t_R' -t 12 ref.fa.bwameth.c2t '</home/pi/vogelw/bin/python /home/pi/vogelw/lib/python2.7/site-packages/bwameth-0.09-py2.7.egg/EGG-INFO/scripts/bwameth.py c2t t_R1.fastq.gz t_R2.fastq.gz'
Traceback (most recent call last):
File "/home/pi/vogelw/bin/bwameth.py", line 5, in
pkg_resources.run_script('bwameth==0.09', 'bwameth.py')
File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 528, in run_script
File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1401, in run_script
File "/home/pi/vogelw/lib/python2.7/site-packages/bwameth-0.09-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 582, in

File "/home/pi/vogelw/lib/python2.7/site-packages/bwameth-0.09-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 579, in main

File "/home/pi/vogelw/lib/python2.7/site-packages/bwameth-0.09-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 258, in bwa_mem

File "/home/pi/vogelw/lib/python2.7/site-packages/bwameth-0.09-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 286, in as_bam

File "/home/pi/vogelw/lib/python2.7/site-packages/toolshed-0.3.6-py2.7.egg/toolshed/files.py", line 252, in reader
for toks in line_gen:
File "/home/pi/vogelw/lib/python2.7/site-packages/toolshed-0.3.6-py2.7.egg/toolshed/files.py", line 53, in process_iter
raise ProcessException(cmd)
toolshed.files.ProcessException: bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@rg ID:t_R SM:t_R' -t 12 ref.fa.bwameth.c2t '</home/pi/vogelw/bin/python /home/pi/vogelw/lib/python2.7/site-packages/bwameth-0.09-py2.7.egg/EGG-INFO/scripts/bwameth.py c2t t_R1.fastq.gz t_R2.fastq.gz'

'samtools sort' memory limit

It's currently not possible to configure the memory limit for samtools sort. Can you please add a command line option?

Error in tabulation of example

I have tried running the example through bwa-meth, but I fail at the tabulation step, any idea what goes wrong ?

jvh@host:/location_on_disk/bwa-meth/example$ ../bwameth.py index ref.fa
converting c2t in ref.fa to ref.fa.bwameth.c2t
indexing: ref.fa.bwameth.c2t
jvh@host:/location_on_disk/bwa-meth/example$ ../bwameth.py --reference ref.fa t_R1.fastq.gz t_R2.fastq.gz -t 12
running: bwa mem -T 40 -B 3 -L 25 -CM -U 100 -p -R '@RG ID:t_R  SM:t_R' -t 12  ref.fa.bwameth.c2t '</usr/bin/python ../bwameth.py c2t t_R1.fastq.gz t_R2.fastq.gz'
writing to:
samtools view -bS - | samtools sort -@3 - bwa-meth
converting reads in t_R1.fastq.gz,t_R2.fastq.gz
[M::main_mem] read 92726 sequences (9365326 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 44884, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (109, 137, 175)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 307)
[M::mem_pestat] mean and std.dev: (145.08, 48.71)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 373)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[main] Version: 0.7.6a-dev-r434
[main] CMD: bwa mem -T 40 -B 3 -L 25 -CM -U 100 -p -R @RG       ID:t_R  SM:t_R -t 12 ref.fa.bwameth.c2t </usr/bin/python ../bwameth.py c2t t_R1.fastq.gz t_R2.fastq.gz
[main] Real time: 9.615 sec; CPU: 16.204 sec
running: samtools index bwa-meth.bam
[E::hts_idx_push] chromosome blocks not continuous
jvh@host:/location_on_disk/bwa-meth/example$ samtools flagstat bwa-meth.bam
92791 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
92723 + 0 mapped (99.93%:-nan%)
92791 + 0 paired in sequencing
46399 + 0 read1
46392 + 0 read2
92276 + 0 properly paired (99.44%:-nan%)
92652 + 0 with itself and mate mapped
71 + 0 singletons (0.08%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
jvh@host:/location_on_disk/bwa-meth/example$ ../bwameth.py tabulate  --trim 3,3 --map-q 60 --bissnp $BISSNP --prefix ex -t 12 --reference ref.fa bwa-meth.bam
Traceback (most recent call last):
  File "../bwameth.py", line 500, in <module>
    main(sys.argv[1:])
  File "../bwameth.py", line 475, in main
    sys.exit(tabulate_main(args[1:]))
  File "../bwameth.py", line 407, in tabulate_main
    % a.reference)
AssertionError: run samtools faidx ref.fa

samtools not found

Hi,

I have installed bwa-meth and was running the install version on the test examples on the Git page. When I try to create the reference index files I get the following error: Traceback (most recent call last): File "/usr/local/bin/bwameth.py", line 5, in <module> pkg_resources.run_script('bwameth==0.10', 'bwameth.py') File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 487, in run_script File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1337, in run_script File "/usr/local/lib/python2.7/dist-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 45, in <module> checkX('samtools') File "/usr/local/lib/python2.7/dist-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 43, in checkX raise Exception("executable for '%s' not found" % cmd)

I have samtools1.3.1 installed and included in the path and I am wondering how to fix the error?

Thanks in advance for your help.

aligning to CTOT+CTOB -- equivalent of bismark --pbat

I understand that bwa-meth aligns to the c2t converted reference. Would it be possible to change bwa-meth also to align to the g2a converted reference?

I believe this is what bismark calls "--pbat" option.

If not, are there any other tools out there that will wrap up "bwa mem" for bisulfite sequencing that may have this option added?

-p flag

When paired end files are sent in as

bwameth.py --reference hg19.fasta test_R1.fastq test_R2.fastq > test.sam

The output starts with:
running: bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p ...

The -p flag is incorrect. According to BWA:

If mates.fq file is absent and option -p is not set, this command regards input reads are single-end. If mates.fq is present, this command assumes the i-th read in reads.fq and the i-th read in mates.fq constitute a read pair. If -p is used, the command assumes the 2i-th and the (2i+1)-th read in reads.fq constitute a read pair (such input file is said to be interleaved). In this case, mates.fq is ignored. In the paired-end mode, the mem command will infer the read orientation and the insert size distribution from a batch of reads.

README update, confused about output

Hi everyone,

Thanks for the excellent tool!

(1) Could you update the README so the --prefix flag isn't shown? I missed the note above following the examples; it would help productivity if you just created a "legacy code" section or eliminate all references to --prefix all together

(2) I'm a bit confused, as it appears that I do not have an output for bwameth.py --reference ref.fa t_R1.fastq.gz t_R2.fastq.gz -t X .

When running this command, I see the output using bwa mem with the last item as NA. Normally, if I were to use bwa mem alone, I would pipe this to an output. When running bwameth.py --reference ref.fa t_R1.fastq.gz t_R2.fastq.gz -t X as in the example https://github.com/brentp/bwa-meth/tree/master/example/, how exactly do I access the output?

bed file

What does "pct" mean?

how can I tell which site is methylated?

adding passing options to bwa mem

When I run bwa-meth, it calls bwa-mem with these parameter set:

bwa mem -T 40 -B 2 -L 10 -CM

Would it be possible to pass-through the -T/B/L options to the script to tune it to the desired options by the user?

Better modularity

I think it would be nice to split bwa-meth into separate subcommands, e.g.

+ bwameth
|
+-- index.py
+-- c2t.py
+-- tabulate.py
+-- cnvs.py
+-- __main__.py

Then each subcommand can be called with python -m bwameth.<cmd>.

Push out a new release?

It'd be nice to add this to bioconda and Galaxy. For that, it'd be nice to include a version that's not from 2014 :) If you take care of the new release, I'll see to it getting added to bioconda and Galaxy.

Disallow RC alignments

Bowtie has --norc option

If --norc is specified, bowtie will not attempt to align against the reverse-complement reference strand.

I'm not aware of a similar option for BWA, but I think it makes sense as a default strategy for bs-seq alignments, because the reads from the reverse strand don't add any extra information in terms of methylation status.

What do you think?

Methylation data

Thanks for the tool. One question - is there an easy way to detect from the read itself whether or not it is methylated? MethylDackel calculates on a per-cytosine basis but I'm also looking for essentially the inverse. Bismark does this with the XM tag.

SNP Question

Just wondering if the bam file generated by bwa-meth file can be used directly in other SNP callers like biscuit? My (naive) self is just curious since bwa-meth performs the C -> T conversion to speed up alignment.

Installation problem

Hi,

I'm trying to install bwa-meth but I get the following error when trying to run
sudo python setup.py install

Traceback (most recent call last):
File "setup.py", line 5, in
import bwameth
File "/home/ilkka/biologia/bwa-meth-0.09/bwameth.py", line 42, in
checkX('samtools')
File "/home/ilkka/biologia/bwa-meth-0.09/bwameth.py", line 40, in checkX
raise Exception("executable for '%s' not found" % cmd)
Exception: executable for 'samtools' not found

It seems that the installer cannot find samtools. However, I have samtools in my $PATH. Any ideas?

Cheers,
Ilkka

add support for interleaved PE fastq

It would be nice for bwameth to accept interleaved reads like bwa, so you can adapter trim on a pipe (like bwakit). Would you entertain a PR to that effect?

Samtools 1.3 support

In version 1.3 samtools deleted some deprecated CLI options for "samtools sort" command (see http://www.htslib.org/doc/samtools-1.3.html) :

Historically samtools sort also accepted a less flexible way of specifying the final and temporary output filenames:

samtools sort [-f] [-o] in.bam out.prefix

This has now been removed. The previous out.prefix argument (and -f option, if any) should be changed to an appropriate combination of -T PREFIX and -o FILE. The previous -o option should be removed, as output defaults to standard output

So bwa-meth.py doesn't work with latest samtools. It can fail with different kinds of errors:

  1. AssertionError
  2. Exception: bad or empty fastqs
  3. IOError: [Errno 32] Broken pipe (most likely same problem as in issue #18)

Details:

  1. AssertionError
    $ bwameth.py --reference /indexes/hg38.fa --threads 20 --prefix issue18 ./foo.fastq :
unning: bwa mem -T 40 -B 2 -L 10 -CM -R '@RG    ID:bm   SM:bm' -t 20  /indexes/hg38.fa.bwameth.c2t '</home/user/anaconda/bin/python /home/user/anaconda/bin/bwameth.py c2t ./foo.fastq NA'
writing to:
samtools view -bS - | samtools sort -m 2415919104 - issue18
converting reads in ./foo.fastq,NA
WARNING: running bwameth in single-end mode
[M::main_mem] read 10 sequences (900 bp)...
[M::mem_process_seqs] Processed 10 reads in 0.006 CPU sec, 0.002 real sec
[main] Version: 0.7.9a-r786
[main] CMD: bwa mem -T 40 -B 2 -L 10 -CM -R @RG ID:bm   SM:bm -t 20 /indexes/hg38.fa.bwameth.c2t </home/user/anaconda/bin/python /home/user/anaconda/bin/bwameth.py c2t ./foo.fastq NA
[main] Real time: 13.531 sec; CPU: 12.509 sec
Traceback (most recent call last):
File "/home/user/anaconda/bin/bwameth.py", line 548, in <module>
  main(sys.argv[1:])
File "/home/user/anaconda/bin/bwameth.py", line 533, in main
  set_as_failed=args.set_as_failed)
File "/home/user/anaconda/bin/bwameth.py", line 198, in bwa_mem
  as_bam(cmd, fa, prefix, calmd, set_as_failed)
File "/home/user/anaconda/bin/bwameth.py", line 243, in as_bam
  assert p.wait() == 0
AssertionError

Where foo.fastq is fastq content suggested in #18
2. Exception: bad or empty fastqs
bwameth.py --reference /indexes/hg38.fa --threads 20 --prefix SRR033942 ./GSM491349/sra/SRX015763/SRR033942/SRR033942.sra

running: bwa mem -T 40 -B 2 -L 10 -CM -R '@RG   ID:bm   SM:bm' -t 20  /indexes/hg38.fa.bwameth.c2t '</home/user/anaconda/bin/python /home/user/anaconda/bin/bwameth.py c2t ./GSM491349/sra/SRX015763/SRR033942/SRR033942.sra NA'
writing to:
samtools view -bS - | samtools sort -m 2415919104 - SRR033942
converting reads in ./GSM491349/sra/SRX015763/SRR033942/SRR033942.sra,NA
WARNING: running bwameth in single-end mode
[gzclose] buffer error
Traceback (most recent call last):
File "/home/user/anaconda/bin/bwameth.py", line 548, in <module>
  main(sys.argv[1:])
File "/home/user/anaconda/bin/bwameth.py", line 533, in main
  set_as_failed=args.set_as_failed)
File "/home/user/anaconda/bin/bwameth.py", line 198, in bwa_mem
  as_bam(cmd, fa, prefix, calmd, set_as_failed)
File "/home/user/anaconda/bin/bwameth.py", line 232, in as_bam
  raise Exception("bad or empty fastqs")
Exception: bad or empty fastqs
  1. IOError: [Errno 32] Broken pipe
    bwameth.py --reference /indexes/hg38.fa --threads 20 --prefix //GSE19418_tmp/GSM491349/SRR033987_hg38 //GSE19418_tmp/GSM491349/tmp/SRR033987.fastq.gz

    running: bwa mem -T 40 -B 2 -L 10 -CM -R '@RG   ID:bm   SM:bm' -t 20  //indexes/hg38.fa.bwameth.c2t '</home/user/anaconda/bin/python /home/user/anaconda/bin/bwameth.py c2t /GSE19418_tmp/GSM491349/tmp/SRR033987.fastq.gz NA'
    writing to:
    samtools view -bS - | samtools sort -m 2415919104 - /GSE19418_tmp/GSM491349/SRR033987_hg38
    converting reads in /GSE19418_tmp/GSM491349/tmp/SRR033987.fastq.gz,NA
    WARNING: running bwameth in single-end mode
    [M::main_mem] read 2000000 sequences (200000000 bp)...
    [M::mem_process_seqs] Processed 2000000 reads in 7131.578 CPU sec, 896.316 real sec
    Traceback (most recent call last):
    File "/home/user/anaconda/bin/bwameth.py", line 548, in <module>
      main(sys.argv[1:])
    File "/home/user/anaconda/bin/bwameth.py", line 533, in main
      set_as_failed=args.set_as_failed)
    File "/home/user/anaconda/bin/bwameth.py", line 198, in bwa_mem
      as_bam(cmd, fa, prefix, calmd, set_as_failed)
    File "/home/user/anaconda/bin/bwameth.py", line 238, in as_bam
      out.write(str(aln) + '\n')
    IOError: [Errno 32] Broken pipe
    [fputs] Broken pipe
    

Solution, just use new API, available at least since 1.0 (http://www.htslib.org/doc/samtools-1.0.html)

My fixes: iromeo@c45210d and iromeo@0a1703b

unique bam?

Hi
many thanks for developing this tools, it is super fast! I have a question. Do bwa-meth output unique mapping reads or should I add an additional filter?
Many thanks for your help
Alice

could not make reference genome

Hi,
I am trying to create reference genome from hg38 and getting following error:

/bwameth.py index hg38_no_alt.fa
converting c2t in hg38_no_alt.fa to hg38_no_alt.fa.bwameth.c2t
indexing: hg38_no_alt.fa.bwameth.c2t
[bwa_index] Pack FASTA... 53.61 sec
[bwa_index] Reverse the packed sequence... 19.25 sec
[bwa_index] Construct BWT for the packed sequence...
TextLengthFromBytePacked(): text length > 2^32!
cmd was:bwa index -a bwtsw hg38_no_alt.fa.bwameth.c2t
Traceback (most recent call last):
File "/bwa-meth-0.10/bwameth.py", line 601, in
main(sys.argv[1:])
File "/bwa-meth-0.10/bwameth.py", line 548, in main
sys.exit(bwa_index(convert_fasta(args[1])))
File "/bwa-meth-0.10/bwameth.py", line 151, in bwa_index
run("bwa index -a bwtsw %s" % fa)
File "/bwa-meth-0.10/bwameth.py", line 60, in run
list(nopen("|%s" % cmd.lstrip("|")))
File "/anaconda/lib/python2.7/site-packages/toolshed-0.4.0-py2.7.egg/toolshed/files.py", line 53, in process_iter
raise ProcessException(cmd)
toolshed.files.ProcessException: bwa index -a bwtsw hg38_no_alt.fa.bwameth.c2t

Any idea on resolving the issue?
Thanks,
Ali

samtools goes awol

Hey Brent :D

So it seems after bwa-meth's bias-plot has finished running, writes it's output and closes, "samtools view -F4" continues to hang around reading the file (presumably until the end) and not close?

I guess the bias plot takes a sample of the data and not all the data when creating it's plot (because it's so quick) - so i guess samtools should close down too once its done :)

Perhaps there could be an --all flag when running the bias-plot to check all the reads? I'm a little nervous about subsamples, particularly on sorted data. All the best, and again thank you so much for this awesome mapper :)

Broken pipe at samtools index

Hi,

Can you help me udnerstand this error:

...
running: samtools index Sample_CGATGT_AC4190ANXX_L004_001.bam
[fputs] Broken pipe

Seems like the BAM is ok and indexing just failed but want to be absolutely sure. Any ideas?

thanks for your help.

bwameth set_as_failed=args.set_as_failed

I cannot get bwameth to work.

I tried to run the following:
bwameth.py --threads 16 --reference ESR1_nmasked.fasta.bwameth.c2t DMA11392-22973-TSF-13-05-POM_S84_L001_R1_001_val_1.fq DMA11392-22973-TSF-13-05-POM_S84_L001_R2_001_val_2.fq -t 12 | samtools view -b - > bwa-meth.bam

Regardless of whether I am in the folder where these files are located or I specify the full path for the reference genome and the for/rev reads, I get the following error:
Traceback (most recent call last): File "/nv/hp10/bin/bwa-meth-0.10/bwameth.py", line 601, in <module> main(sys.argv[1:]) File "/nv/hp10/bin/bwa-meth-0.10/bwameth.py", line 586, in main set_as_failed=args.set_as_failed) File "/nv/hp10/bin/bwa-meth-0.10/bwameth.py", line 246, in bwa_mem raise BWAMethException("first run bwameth.py index %s" % fa) __main__.BWAMethException: first run bwameth.py index /nv/hp10/scratch/nmasked.fasta.bwameth.c2t

I'm using bwa-0.7.16a and bwa-meth-0.10 on a computing cluster running x86_64 x86_64 x86_64 GNU/Linux.

'samtools' errors aren't reported in 'as_bam'

Errors, raised by samtools view ... | samtools sort ... aren't reported, due to the buffering of sys.stderr. In my case the error was:

[samopen] SAM header is present: 25 sequences.
Parse error at line 31: sequence and quality are inconsistent

Instead of outputting the error bwa-meth printed the following traceback:

Traceback (most recent call last):
  File "/home/user/.virtualenvs/env/bin/bwameth.py", line 10, in <module>
    execfile(__file__)
  File "/home/user/.virtualenvs/env/src/bwameth-head/bwameth.py", line 596, in <module>
    main(sys.argv[1:])
  File "/home/user/.virtualenvs/env/src/bwameth-head/bwameth.py", line 581, in main
    set_as_failed=args.set_as_failed)
  File "/home/user/.virtualenvs/env/src/bwameth-head/bwameth.py", line 259, in bwa_mem
    as_bam(cmd, fa, prefix, calmd, set_as_failed)
  File "/home/user/.virtualenvs/env/src/bwameth-head/bwameth.py", line 296, in as_bam
    out.write(str(aln) + '\n')
IOError: [Errno 32] Broken pipe

The traceback makes perfect sense, because after reporting the error samtools exited and the file descriptor behind out became stale. However, the original error is much more helpful.

I'm ready to submit a patch for this.

Mapping Rate From bwa-meth

Hey Brent,

It's been a while since I used bwa-meth, but I've recently generated a bunch of WGBS data and I was comparing alignment stats from mapping with Bismark vs bwa-meth on our real data (some other mappers too, but I'm unlikely to proceed with those).

For one of the samples Bismark reported an alignment rate of 58.8% in its output log file. My understanding is that Bismark only considers an alignment for downstream mC calling if it is unique and both reads of a pair are aligned.

I aligned the same sample with bwa-meth and the samtools flagstat output indicates that 96.92% are properly paired (both mates mapped to the same chromosome, etc.). This seemed unrealistically high to me. I wasn't sure if samtools flagstat was including multimapped reads so I went ahead and ran samtools view -c -F 0xF00 -f 66. Which should equate to counting the alignments which are NOT supplemental, secondary, QC-failed, or duplicates (MethylDackel uses those same ignore sam flags), and ARE the first mate of a pair, and are aligned in a proper pair. This resulted in an alignment rate of 96.5%. Still suspiciously high.

How would you recommend calculating the alignment rate after mapping with bwa-meth? Or is it really that much better?!

Best,
Jeff

Does bwa-meth aligned bam files need to be duplicates marked for downstream methylDackel work properly?

Hi Brent,

I am using following to map single end RRBS-seq data with bwa-meth

python {config[bwmeth_path]} {params.custom} --reference {config[ref_fa]} {input} \
                    --read-group '{params.rg}' 2> {log.bwa} \
                    | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}

samtools index {output[0]}

for RRBS, one expects to see many duplicates at the same CpG (exact restriction enzyme cut site).
MethylDakel has an option --keepDups to remain the duplicates. Do I need to mark the bam files from bwa-meth and then go with MethylDakel?

Thanks!
Tommy

unmapped reads in c2t converted

Hi,
it seems that bwameth.py only reverts the mapped reads back to their original state, while unmapped reads are reported in their c2t converted form in the aligned bam file.
Is this the desired behavior? For my application it would be crucial to get the original reads from the bam file also for the unmapped reads.
Is there a way to do this?
Thanks!

running bwameth.py tabulate on Ecoli data

Hi,

I am running bwameth.py tabulate on an Ecoli bisulfite treated sample aligned with bismark/bowtie2.

I was getting no methylation calls with the default parameters, so I changed to --context all and ran a parameter sweep on the --map-q option. If I run this on the first 1000bp of the Ecoli genome, I get no calls for mapq values ranging from 60-43, but then I get 138 lines in the meth.vcf file from map-q 42 downwards.

Any ideas what would be a good way of running bwameth/BisSNP on Ecoli data like this?

I read a paper that says Ecoli methylation is found in around 1-2 cases every 1000bp for the CmCWGG motif, but I am unsure how to specify that in BisSNP. Is it possible?

bwa-meth and samtools flagstat

Hi,
I used bwa-meth to align rust infected wheat samples to the rust genome. However, when I ran samtools flagstat I get the follow output:

22806328 + 4861610 in total (QC-passed reads + QC-failed reads)
154024 + 1050636 secondary
0 + 0 supplementary
0 + 0 duplicates
22634029 + 4846318 mapped (99.24% : 99.69%)
22652304 + 3810974 paired in sequencing
11326152 + 1905487 read1
11326152 + 1905487 read2
22435338 + 0 properly paired (99.04% : 0.00%)
22467696 + 3780390 with itself and mate mapped
12309 + 15292 singletons (0.05% : 0.40%)
30982 + 232010 with mate mapped to a different chr
13721 + 50746 with mate mapped to a different chr (mapQ>=5)

I didn't know BWA-meth gave secondary alignments. Also, the total number of QC-passed reads + QC-failed reads (27480347) is greater than the total of the pair-end reads (26463278).

toolshed issue

Hi Brent,
I was trying to have a testrun with the example from bwa-meth and after running bwameth.py --reference ref.fa t_R1.fastq.gz t_R2.fastq.gz -t 12 | samtools view -b - > bwa-meth.bam I get the following error:

Traceback (most recent call last):
File "/home/superste/.local/bin/bwameth.py", line 4, in <module>
__import__('pkg_resources').run_script('bwameth==0.2.1', 'bwameth.py')
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 750, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1527, in run_script
exec(code, namespace, namespace)
File "/home/superste/.local/lib/python2.7/site-packages/bwameth-0.2.1-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 509, in <module>
main(sys.argv[1:])
File "/home/superste/.local/lib/python2.7/site-packages/bwameth-0.2.1-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 506, in main
set_as_failed=args.set_as_failed)
File "/home/superste/.local/lib/python2.7/site-packages/bwameth-0.2.1-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 331, in bwa_mem
as_bam(cmd, fa, set_as_failed)
File "/home/superste/.local/lib/python2.7/site-packages/bwameth-0.2.1-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 341, in as_bam
sam_iter = nopen_keep_parent_stdin(pfile, 'r')
File "/home/superste/.local/lib/python2.7/site-packages/bwameth-0.2.1-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 49, in nopen_keep_parent_stdin
preexec_fn=toolshed.files.prefunc,
AttributeError: 'module' object has no attribute 'prefunc'

I tried to reinstall both bwa-meth and toolshed but the error is still there.

Thank you in advance for the help!

Stefan

Why is there a problem?

[yubau@CNUH-i7 ~]$

python /local_work/bin/bwa-meth/bwameth.py
-t 12
--reference /DataShare/Transcriptome/Homo_sapiens/NCBI/ GRCh38/Sequence/BWAIndex/version0.6.0/genome.fa.bwameth.c2t
/home/yubau/Methylation/26-meth_S44_R1_001.fastq.gz
/home/yubau/Methylation/26-meth_S44_R2_001.fastq.gz > /home/yubau/26-meth_S44.sam

Traceback (most recent call last):
File "/local_work/bin/bwa-meth/bwameth.py", line 509, in
main(sys.argv[1:])
File "/local_work/bin/bwa-meth/bwameth.py", line 506, in main
set_as_failed=args.set_as_failed)
File "/local_work/bin/bwa-meth/bwameth.py", line 315, in bwa_mem
raise BWAMethException("first run bwameth.py index %s" % fa)
main.BWAMethException: first run bwameth.py index /DataShare/Transcriptome/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/v ersion0.6.0/genome.fa.bwameth.c2t

yes i already execute "bwameth.py index $REF"

[yubau@CNUH-i7 ~]$ ls /DataShare/Transcriptome/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/
genome.fa
genome.fa.ann
genome.fa.bwameth.c2t.amb
genome.fa.bwameth.c2t.bwt
genome.fa.bwameth.c2t.sa
genome.fa.fai genome.fa.sa
genome.fa.amb
genome.fa.bwameth.c2t
genome.fa.bwameth.c2t.ann
genome.fa.bwameth.c2t.pac
genome.fa.bwt
genome.fa.pac

[yubau@CNUH-i7 bwa-meth]$ ls
bam-merge-pairs.py
bwameth.py
compare
example
ez_setup.py
LICENSE
paper
README.md
requirements.txt
scripts
setup.py
[yubau@CNUH-i7 bwa-meth]$ pwd
/local_work/bin/bwa-meth

[yubau@CNUH-i7 bwa-meth]$ python -V
Python 2.7.15 :: Anaconda, Inc.
[yubau@CNUH-i7 bwa-meth]$ samtools --version
samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.

[yubau@CNUH-i7 bwa-meth]$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188

Why is there a problem?

IOError: Paired End Reads

Hi Brent,

I am using bwa-meth to align paired end reads with the alignment failing due to an IO error,

I am using
toolshed 0.4.6
bwa-meth-0.10
bwa 0.7.15-r1140
samtools 1.3.1

See the cmd line output below,

Kind Regards

Nicola

running: bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@RG ID:BP_1207.BPR.150912.HiSeq2500.FCB.lane7.R SM:BP_1207.BPR.150912.HiSeq2500.FCB.lane7.R' -t 8  hg19_bwameth/human_g1k_v37.fasta.bwameth.c2t '</usr/bin/python /Library/Python/2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py c2t BP_1207.BPR.150912.HiSeq2500.FCB.lane7.R1.fastq BP_1207.BPR.150912.HiSeq2500.FCB.lane7.R2.fastq'
writing to:
samtools view -bS - | samtools sort - BP_1207.lane7
[M::bwa_idx_load_from_disk] read 0 ALT contigs
converting reads in BP_1207.BPR.150912.HiSeq2500.FCB.lane7.R1.fastq,BP_1207.BPR.150912.HiSeq2500.FCB.lane7.R2.fastq
[M::process] read 634922 sequences (80000172 bp)...
[M::process] 0 single-end sequences; 634922 paired-end sequences
[M::process] read 634922 sequences (80000172 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 101430, 14, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (105, 126, 155)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (5, 255)
[M::mem_pestat] mean and std.dev: (131.80, 37.94)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 305)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (198, 445, 1766)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 4902)
[M::mem_pestat] mean and std.dev: (824.00, 985.05)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 6470)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation RF
[M::mem_process_seqs] Processed 634922 reads in 953.682 CPU sec, 119.192 real sec
Traceback (most recent call last):
  File "/usr/local/bin/bwameth.py", line 5, in <module>
    pkg_resources.run_script('bwameth==0.10', 'bwameth.py')
  File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 487, in run_script
    ns.clear()
  File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 1337, in run_script
    raise AssertionError(
  File "/Library/Python/2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 601, in <module>
    main(sys.argv[1:])
  File "/Library/Python/2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 586, in main
    set_as_failed=args.set_as_failed)
  File "/Library/Python/2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 259, in bwa_mem
    as_bam(cmd, fa, prefix, calmd, set_as_failed)
  File "/Library/Python/2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 299, in as_bam
    out.write(str(aln) + '\n')
IOError: [Errno 32] Broken pipe
[fputs] Broken pipe
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 601, in <module>
    main(sys.argv[1:])
  File "/Library/Python/2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 551, in main
    sys.exit(convert_reads(args[1], args[2]))
  File "/Library/Python/2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 102, in convert_reads
    out.write("".join((name, seq, "\n+\n", qual)))
IOError: [Errno 32] Broken pipe

tabulate uses RGSM, but doesn't escape "/"

It seems that when running tabulate like so:

/home/john/toolshed-0.4.0/bwa-meth-0.10/bwameth.py tabulate --map-q 60 --bissnp /ex/BisSNP-0.82.2.jar --prefix Mm01.WGBS -t 10 --reference /ref/mm10.fa --trim 3,1 /media/john/DATA1/WGBS/44_Mm01_WEAd_C2_WGBS_E_no_dupes.bam

I get the following output:

    java -Xmx24g -jar /ex/BisSNP-0.82.2.jar \
        -R /ref/mm10.fa \
        -I /media/john/DATA1/WGBS/44_Mm01_WEAd_C2_WGBS_E_no_dupes.bam \
        -T BisulfiteGenotyper \
        --trim_5_end_bp 3 \
        --trim_3_end_bp 1 \
        -vfn1 Mm01.WGBS.meth.vcf -vfn2 Mm01.WGBS.snp.vcf \
        --non_directional_protocol \
        -mbq 12 \
        -minConv 0 \
        -toCoverage 1000 \
        -mmq 60   \
        -nt 10
0   T*  C
0   A*  G
0   A*  G
0   C*  T
0   T*  C
0   C*  T
0   C*  T
0   C*  T
0   T*  C
0   G*  A
0   T*  C
0   T*  C
0   C*  T
0   G*  A
0   A*  G
0   T*  C
0   C*  T
Mm01.WGBS.meth.vcf
Traceback (most recent call last):
  File "/home/john/toolshed-0.4.0/bwa-meth-0.10/bwameth.py", line 601, in <module>
    main(sys.argv[1:])
  File "/home/john/toolshed-0.4.0/bwa-meth-0.10/bwameth.py", line 554, in main
    sys.exit(tabulate_main(args[1:]))
  File "/home/john/toolshed-0.4.0/bwa-meth-0.10/bwameth.py", line 506, in tabulate_main
    .format(prefix=a.prefix, sample=sample), "w")
IOError: [Errno 2] No such file or directory: 'Mm01.WGBS.C57BL/6J.meth.bed'

I think the error is about there not being a directory called Mm01.WGBS.C57BL - but it only wants to write to that directory because my RGIDs look like this:


@RG ID:HWI-ST552.C2J56ACXX.1.NA SM:C57BL/6J LB:Mm01.WGBS    PL:illumina CN:Essen    DS:Circadian Day
@RG ID:HWI-ST552.C2J56ACXX.2.NA SM:C57BL/6J LB:Mm01.WGBS    PL:illumina CN:Essen    DS:Circadian Day
@RG ID:HWI-ST552.C2J56ACXX.3.NA SM:C57BL/6J LB:Mm01.WGBS    PL:illumina CN:Essen    DS:Circadian Day
@RG ID:HWI-ST552.D2FWTACXX.1.CAGATCA    SM:C57BL/6J LB:Mm01.WGBS    PL:illumina CN:Essen    DS:Circadian Day

and I think the "/" i the sample name is not being escaped/deleted when writing out the new file. Also, i'm not sure using data from the RG is a good idea regardless, since 1 BAM file can contain many RGIDs (it just so happens that in my file they all have the same SM/LB) I did however get the following outputs:

Mm01.WGBS.meth.vcf
Mm01.WGBS.meth.vcf.MethySummarizeList.txt
Mm01.WGBS.snp.vcf

All the best! :)

Multiple sets of FASTQ files

The current documentation states that it's possible to pass a bunch of paired reads, eg

$ bwameth.py --reference $REF foo_1.fastq,foo_2.fastq bar_1.fastq,bar_2.fastq
Traceback (most recent call last):
  File "/home/user/.virtualenvs/env/bin/bwameth.py", line 581, in <module>
    main(sys.argv[1:])
  File "/home/user/.virtualenvs/env/bin/bwameth.py", line 576, in main
    rname(*args.fastqs), calmd=args.calmd,
TypeError: rname() takes at most 2 arguments (3 given)

However, args.fastqs are handled in way which makes this use-case impossible.

YC and YD tags

Hi Brent,

Thanks for maintaining and answering questions on bwa-meth. I am not sure I understand the YD and YC tags that are set by bwa-meth. My understanding so far:

YD tag makes note of the direction of mapping: whether it maps to the forward (f) or reverse (r) strand of the reference genome.

YC tag tells you whether or not your read mapped back to the C to T or G to A converted genome.

As bwa-meth assumes directional libraries, shouldn't the YD / YC tags always be f + CT and r + GA? When parsing through some bwa-meth aligned files, I noticed that some reads contain r + CT as well as f + GA.

Thanks for your help!

Broken pipe, no alignments written to bam file.

Hi,
I get the following error when trying to align my reads:

[M::main_mem] read 100000 sequences (10000000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 13941, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (87, 118, 163)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 315)
[M::mem_pestat] mean and std.dev: (128.17, 55.20)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 391)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 100000 reads in 171.495 CPU sec, 171.414 real sec
Traceback (most recent call last):
  File "/home/markuss/bin/bwameth.py", line 5, in <module>
    pkg_resources.run_script('bwameth==0.10', 'bwameth.py')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 488, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1338, in run_script
  File "/pica/h1/markuss/.local/lib/python2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 601, in <module>
    main(sys.argv[1:])
  File "/pica/h1/markuss/.local/lib/python2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 586, in main
    set_as_failed=args.set_as_failed)
  File "/pica/h1/markuss/.local/lib/python2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 259, in bwa_mem
    as_bam(cmd, fa, prefix, calmd, set_as_failed)
  File "/pica/h1/markuss/.local/lib/python2.7/site-packages/bwameth-0.10-py2.7.egg/EGG-INFO/scripts/bwameth.py", line 299, in as_bam
    out.write(str(aln) + '\n')
IOError: [Errno 32] Broken pipe
[fputs] Broken pipe

I am using bwa-meth version 0.10 and toolshed version 0.4.2 with python 2.7.6. Your provided example data works great. The bam file is created with a header, but no alignments are written.

/Markus

Do not add RG by default

Problem

I have encountered a problem with bwameth that pops up when the FASTQ comment contains a read group. In this case, bwameth only outputs the SAM header without any reads.

Details

This is the command I run:

> bwameth.py --reference ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta -t 4 data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq > data/test/test-line_A.mapped.sam

The stdout/stderr output is here:

running: /home/oender/anaconda3/envs/population-epigenetics/bin/python /home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py c2t data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@RG\tID:test-line_A-R.classified.qc\tSM:test-line_A-R.classified.qc' -t 4  ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta.bwameth.c2t -
converting reads in data/test/test-line_A-R1.classified.qc.fastq,data/test/test-line_A-R2.classified.qc.fastq
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 320080 sequences (40000212 bp)...
[M::process] 0 single-end sequences; 320080 paired-end sequences
WARNING: 1709 reads with length < 80
       : this program is designed for long reads
[M::process] read 121626 sequences (15199052 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 97487, 4, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 215, 277)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 493)
[M::mem_pestat] mean and std.dev: (227.57, 79.20)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 601)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 320080 reads in 245.362 CPU sec, 61.338 real sec

['NM:i:21', 'MD:Z:40^GGAATTGTTGATTTGGATTT80G5', 'MC:Z:126M', 'AS:i:97', 'XS:i:83', 'RG:Z:test-line_A-R.classified.qc', 'XA:Z:f3,+14193782,40S86M,1;f3,+14204191,40S86M,1;', 'RG:Z:CB0L6ANXX:1:ATTCCT YS:Z:TTTGGATTTGGAATTGTTGAGAAAAGTTTATCGGGTTTGAGGAATTGTTGAGAAAAGTTTATTGGGTTTGAGGATTTGTTGATTAGGAGTGGAAATTGTTGAGAAAAATTTATTGGGTTTTAGGAA', 'YC:Z:CT']
700523F:121:CB0L6ANXX:1:1103:2712:2482
Traceback (most recent call last):
  File "/home/oender/anaconda3/envs/population-epigenetics/bin/bwameth.py", line 4, in <module>
    __import__('pkg_resources').run_script('bwameth==0.2.2', 'bwameth.py')
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 664, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1444, in run_script
    exec(code, namespace, namespace)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 509, in <module>
    main(sys.argv[1:])
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 506, in main
    set_as_failed=args.set_as_failed)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 331, in bwa_mem
    as_bam(cmd, fa, set_as_failed)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 353, in as_bam
    for aln in handle_reads(pair_list, set_as_failed):
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 376, in handle_reads
    orig_seq = aln.original_seq
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 284, in original_seq
    return next(x for x in self.other if x.startswith("YS:Z:"))[5:]
StopIteration
[M::process] 0 single-end sequences; 121626 paired-end sequences

As you can see, RG:Z:CB0L6ANXX:1:ATTCCT is the RG that was part of the FASTQ input:

> head -n1 data/test/test-line_A-R{1,2}.classified.qc.fastq
==> data/test/test-line_A-R1.classified.qc.fastq <==
@700523F:121:CB0L6ANXX:1:1103:2712:2482	RG:Z:CB0L6ANXX:1:ATTCCT

==> data/test/test-line_A-R2.classified.qc.fastq <==
@700523F:121:CB0L6ANXX:1:1103:2712:2482	RG:Z:CB0L6ANXX:1:ATTCCT

I think it is a bug that bwameth adds RG:Z:test-line_A-R.classified.qc although I did not supply any read group parameter and actually want to pass through the RGs in the FASTQs. Indeed, when I run the command

bwameth.py c2t data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -t 4  ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta.bwameth.c2t -

(i.e., explicitly removing -R '...') everything works, although the SAM has to be converted back.

Suggestion

As I see it, the problem arises because of the way in which the read group argument is handled. Probably, you can leave the function bwa_mem as it is but change how it is called. It is not quite clear but I guess in the call of bwa_mem,

rg=args.read_group or rname(*args.fastqs)

causes the trouble if I do not supply a read group parameter on the command line. Or you have to disentangle the addition of RG to the header from RGs for individual reads.

--prefix spec in docs breaks bwameth due to new opts pass through functionality

Morning Brent

I think the README for v0.2.0/master is a little out of date. I'm assuming you dropped the --prefix and bai indexing support from bwameth.py , as when I specify --prefix as the README suggests, this is no longer recognised by bwameth.py. And due to the new opt passthrough functionality it's passed through to bwa mem, as which point I get the following error:

mem: invalid option -- '-'

Also the installation instructions are a little out of date, reference v0.10.0 instead of v0.2.0.

I would have simply issued a pull request on the README, but I wasn't sure about the wider considerations with the latest release, so thought it best to leave it with you.

When I get time, I will try and tighten up the option pass through to make it more explicit using something like --bwa_mem_opts.

Thanks

Nathan

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.