drl / blobtools

Modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets

License: GNU General Public License v3.0

Python 99.64% Dockerfile 0.36%

blobtools contamination genome-assembly quality-control visualisation

blobtools's Issues

Which version of the NCBI nt database is recommended for a plant genome?

Hi Dom
I am working on a plant genome and intend to use blobtools as one of the filtering steps in my pipeline.
So far I have done some BLAST runs against the NCBI nt database and produced blobplots from them.
I used nt.26, picked at random, as the database for my BLAST searches.

There are 57 nt database versions listed at ftp://ftp.ncbi.nlm.nih.gov/blast/db/

Which version would you suggest to get better results with blobtools?

Thanks in advance

Ashuttosh

blobplot with very small 'blobs'

Hello,

Thanks for creating such a nice tool. I am running blobtools on a new genome and for some reason the blobplot is outputting blobs that are quite small, leaving a lot of blank space in the coverage x GC proportion plot. Any idea what might be causing this? Or what I might do to fix it? I had run it on the same data a few months ago and didn't have this issue. Any insight would be great.

installation

Hi Dom,

This is not an issue, but for the Virtual environment:
virtualenv -p python2.7 ~/virtualenvs/blobtools

It would be good to specify Python 2.7 explicitly.

And this command in the installation:
/blobtools create
-i test_files/assembly.fna
-c test_files/mapping_1.bam.cov
-t test_files/blast.out
--names names.dmp
--nodes nodes.dmp

Shouldn't this be ./blobtools create?

Pete

blobtools plot: __init__() got an unexpected keyword argument 'fontsize'

Hello, thanks for making this great software! I am using it for the first time, and I tried plotting my BlobDB.json file using this command:

blobtools plot -i /Users/cg449/Desktop/Tripsacum/transcriptome_assembly/Tdactyloides/Trinity_assembly_no_root_libraries/Tdactyloides2_no_roots_no_normalization_min_kmer_cov_2_greater_than_499.BlobDB.json -p 14 --out Tdactyloides2_no_roots_no_normalization_min_kmer_cov_2_greater_than_499

Here is the error message I received:

[STATUS] : Reading BlobDb /Users/cg449/Desktop/Tripsacum/transcriptome_assembly/Tdactyloides/Trinity_assembly_no_root_libraries/Tdactyloides2_no_roots_no_normalization_min_kmer_cov_2_greater_than_499.BlobDB.json
[INFO] : no-hit : sequences = 156,883, span = 181.24 MB, N50 = 1,333 nt
Traceback (most recent call last):
  File "/Users/cg449/Applications/blobtools/plot.py", line 183, in <module>
    plotObj.plotBlobs(cov_lib, info_flag)
  File "/Users/cg449/Applications/blobtools/lib/BtPlot.py", line 606, in plotBlobs
    plot_ref_legend(axScatter)
  File "/Users/cg449/Applications/blobtools/lib/BtPlot.py", line 191, in plot_ref_legend
    axScatter.legend([ref_1,ref_2,ref_3], ["1,000nt", "5,000nt", "10,000nt"], numpoints=1, loc = 4, fontsize=FONTSIZE)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 4519, in legend
    self.legend_ = mlegend.Legend(self, handles, labels, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'fontsize'

Can you tell me what is wrong? I noticed that "fontsize=FONTSIZE" appears many times in the BtPlot.py script; should it perhaps be fontsize=35 or some other number? Thanks.

comparecov broken

(blob_env)lapalmejo@QCSTJE669854L:~/prog/blobtools$

 ./blobtools comparecov -i test_blob.BlobDB.json -c ~/data_blobtools/TG-300.vs.nHd.2.3.abv500.bam.cov --log -r superkingdom
[STATUS]    : Reading BlobDb test_blob.BlobDB.json
Traceback (most recent call last):
  File "./comparecov.py", line 150, in <module>
    data_dict, max_cov, cov_libs, cov_libs_total_reads = blobDB.getPlotData(rank, min_length, hide_nohits, taxrule, False, False)
ValueError: need more than 3 values to unpack

Feature Request: Using Kraken for hits

Hi, I'm not sure if I have missed something, but is there a way to use the output from Kraken as a hits file? The main thing missing is some way of inferring a score for the third column of the blobtools hits file. The format of the Kraken output files is described here

Any advice would be appreciated!

Thanks,

~Nick
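
One possible workaround, sketched here under stated assumptions rather than taken from blobtools: if the Kraken output has the usual five tab-separated columns (C/U flag, sequence ID, assigned taxID, length, and a space-separated list of taxid:kmer-count pairs), a score could be derived from the k-mer counts. The function name kraken_to_hits and the choice of score (k-mers that hit the assigned taxID) are hypothetical; nothing in blobtools defines them.

```python
def kraken_to_hits(kraken_lines):
    """Yield (seq_id, tax_id, score) for classified sequences.

    Assumes Kraken's five-column per-sequence output. The score is a
    hypothetical stand-in: the number of k-mers assigned to the same
    taxID that Kraken reported for the sequence.
    """
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] != "C":
            continue  # skip unclassified or malformed records
        seq_id, tax_id, lca = fields[1], fields[2], fields[4]
        score = 0
        for pair in lca.split():
            taxid, _, count = pair.partition(":")
            if taxid == tax_id and count.isdigit():
                score += int(count)
        yield seq_id, tax_id, score


# Made-up example records, for illustration only.
example = [
    "C\tcontig_1\t562\t300\t562:13 561:4 0:20 562:2\n",
    "U\tcontig_2\t0\t150\t0:120\n",
]
hits = list(kraken_to_hits(example))
for row in hits:
    print("\t".join(str(value) for value in row))
```

One caveat with this choice: when Kraken assigns a sequence to an ancestor taxon that no single k-mer hit directly, the score comes out as 0, so a different scoring rule may suit some data better.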

Mapping file advice

Hi,

First I would like to produce a mapping file for coverage estimation and then plot a blobplot.

To remove contamination, I will select the contigs that have been assigned to the desired organisms and map my initial reads to those contigs.

Do you have any advice for the mapping file (parameter etc.)?

I'm using bwa mem but I'm wondering whether I should use the -m option (discard reads mapping to multiple locations)

no taxonomy hits

Hi Dom,

I put my data into blobtools and generated a plot. However, using the blobtools view option the output json file only has no-hit taxonomy matches. I'm not sure why this is happening as a blastn search using my contig Fasta file against the nt/nr database produced taxonomic data;

blastn -task megablast -query GQ13_assembly -db nt -outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' -culling_limit 5 -evalue 1e-25 -o out

Producing taxonomy hits;

NODE_42_length_38542_cov_0.151908_ID_83 N/A 1393 gi|928489081|gb|CP012685.1| 86.77 1255 161 5 13027 14278 476961 475709 0.0 N/A N/A Serratia marcescens strain SmUNAM836, complete genome
NODE_42_length_38542_cov_0.151908_ID_83 N/A 5376 gi|926475601|gb|CP012639.1| 84.48 5515 784 49 16184 21679 940678 946139 0.0 N/A N/A Serratia marcescens strain RSC-14, complete genome
N

But when I ran blobtools view I get no-hits at all taxonomic levels in the json file, using this command;

blobtools create -i GQ13_assembly -y spades -o GQ13_blob --nodes taxdump/nodes.dmp --names taxdump/names.dmp -b 1_sorted.bam -t GQ13_blast_plus --title fa

producing a lot of;

"species": {"score": 0.0, "tax": "no-hit", "c_index": null}}}

I get a plot but as you would expect there is only no-hits.

I'm not quite sure why the taxonomy annotations aren't making it into blobtools view output?

Thanks,

James

Blobtools downloaded 2 days ago from this site
Blast+ v2.2.31
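
A quick check that may help narrow this down: in the BLAST rows quoted above, the second column (staxids) reads N/A, which would leave nothing to look up in the taxonomy. The sketch below counts how many hit rows carry a usable taxID. It is not part of blobtools; the function name taxid_summary is made up, and it assumes whitespace-separated rows with the taxID in column 2.

```python
def taxid_summary(hits_lines):
    """Return (rows_with_taxid, rows_without_taxid) for a hits file.

    Assumes whitespace-separated rows whose second column is the taxID,
    as requested by '6 qseqid staxids bitscore ...'.
    """
    with_taxid = 0
    missing = 0
    for line in hits_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[1].upper() == "N/A":
            missing += 1
        else:
            with_taxid += 1
    return with_taxid, missing


# First columns of the two rows quoted in the post.
rows = [
    "NODE_42_length_38542_cov_0.151908_ID_83 N/A 1393 gi|928489081|gb|CP012685.1| 86.77",
    "NODE_42_length_38542_cov_0.151908_ID_83 N/A 5376 gi|926475601|gb|CP012639.1| 84.48",
]
print(taxid_summary(rows))
```

If most rows land in the second count, the BLAST database probably lacks taxonomy information, and the hits would need taxIDs attached some other way before blobtools can use them.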

Missing "module named req"

I'm installing BlobTools v1.0 and I keep getting this error thrown back at me after running the ./install command.

[+] Checking dependencies...
[+] [wget] /usr/local/bin/wget
[+] [tar] /usr/bin/tar
[+] [pip] pip
[+] [python2.7] /Users/kdowney/anaconda3/envs/blob_tools/bin/python
[+] Installing python dependencies...
Traceback (most recent call last):
  File "setup.py", line 4, in <module>
    from pip.req import parse_requirements
ImportError: No module named req
FAIL.

What can I do to fix this? There are some tips online, but when I modify your code as they recommend, other pieces start failing.
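
For context, pip 10 moved pip.req into pip's private internals, so setup scripts that import it break with newer pip. A common way around this is to read the requirements file directly instead of going through pip. The sketch below is one such reader, not the fix adopted by blobtools; the function name read_requirements and the example requirement strings are hypothetical.

```python
def read_requirements(lines):
    """Return requirement strings from the lines of a requirements file.

    Drops comments, blank lines and pip options such as '-r other.txt'.
    Simplification: anything after '#' is treated as a comment, so URLs
    containing '#egg=' fragments are not handled.
    """
    requirements = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if line and not line.startswith("-"):
            requirements.append(line)
    return requirements


# Hypothetical file contents, for illustration only.
example = [
    "docopt==0.6.2\n",
    "# plotting\n",
    "\n",
    "matplotlib>=2.0  # pinned loosely\n",
    "-r other.txt\n",
]
reqs = read_requirements(example)
print(reqs)
```

The resulting list can be passed to install_requires in setup.py, which avoids importing anything from pip.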

Read_cov plots error

Read_cov plots for multiple libraries are wrong

blobtools create with coverage file.

Hi,

Based on that documentation:

https://blobtools.readme.io/docs/create

the -c option takes a TAB-separated file as input (seqID\tcoverage).

I have run bam2cov and ended up with a file that has 3 columns:

contig_id read_cov base_cov

Should I parse this output, and if so, which form should the final file take:

contig_id read_cov

or

contig_id base_cov
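
Whichever column turns out to be the right one, cutting the three-column file down to two is straightforward. The sketch below shows either choice; it is not from blobtools, the name cov_to_two_columns is made up, and it assumes header lines start with '#'.

```python
def cov_to_two_columns(cov_lines, column="base_cov"):
    """Return (contig_id, value) pairs from a three-column COV file.

    `column` selects 'read_cov' (2nd column) or 'base_cov' (3rd column).
    Lines starting with '#' are treated as headers and skipped.
    """
    index = {"read_cov": 1, "base_cov": 2}[column]
    pairs = []
    for line in cov_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        pairs.append((fields[0], fields[index]))
    return pairs


# Made-up rows, for illustration only.
example = [
    "# contig_id\tread_cov\tbase_cov\n",
    "contig_1\t8\t4.88\n",
    "contig_2\t47\t1.72\n",
]
for contig_id, value in cov_to_two_columns(example, "base_cov"):
    print(contig_id + "\t" + value)
```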

TypeError: not all arguments converted during string formatting

Hello,
I've been trying to run Blobtools a number of times recently, on various sets of input data, and regardless of the input I keep getting the same error, given in the title of this post. More details are shown here:

Is there something wrong with my version of the tools, or are the input files in the wrong format somehow? I'm aware that this might be a beginner-level mistake on my part, but I'm a biologist with very little experience with anything related to bioinformatics.
Thanks for your support!
KM

Feature request: Access to ggplot2 R object

It would be nice if blobtools plot and blobtools covplot had flags that gave the user access to the underlying R plot objects so that they could be edited further, for example to add lines that show filter criteria, or to hide certain groups without having to filter the BAM file.

This would be fairly easy to achieve by just outputting an Rds file, I'd be happy to implement in a pull request if you're game.

Recent commit breaks the example blobplot with log(0)

Straight from installation I ran into a showstopper bug. It looks like self.max_cov and possibly others failed to initialize:

$ ./blobtools plot  -i example/blobDB.json  -o example/
[+] Reading BlobDB example/blobDB.json
[+]     Loading BlobDB into memory ...
[+]     Deserialising BlobDB (using 'ujson' module) (this may take a while) ...
[+]     Finished in 0.00183987617493s
[+] Extracting data for plots ...
/python2.7/site-packages/matplotlib/axes/_base.py:3193: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=100.0, top=100.0
  'bottom=%s, top=%s') % (bottom, top))
Traceback (most recent call last):
  File "./blobtools/lib/blobplot.py", line 183, in <module>
    main()
  File "./blobtools/lib/blobplot.py", line 175, in main
    plotObj.plotScatter(cov_lib, info_flag, out_f)
  File "./blobtools/lib/BtPlot.py", line 598, in plotScatter
    fig, axScatter, axHistx, axHisty, axLegend, top_bins, right_bins = self.setupPlot(self.plot)
  File "./blobtools/lib/BtPlot.py", line 495, in setupPlot
    right_bins = logspace(0, (int(math.log(self.max_cov)) + 1), 200, base=10.0)
ValueError: math domain error

I deduced that 58633de introduced this bug because it changes some relevant code and reverting it fixes the problem.

$ git checkout 2a236ea194e1129e97e9f85cb64f81030c70a856
$ ./blobtools plot  -i example/blobDB.json  -o example/
[+] Reading BlobDB example/blobDB.json
[+]     Loading BlobDB into memory ...
[+]     Deserialising BlobDB (using 'ujson' module) (this may take a while) ...
[+]     Finished in 0.00210118293762s
[+] Extracting data for plots ...
[I]     no-hit : sequences = 1, span = 0.01 MB, N50 = 6,273 nt
[I]     Nematoda : sequences = 3, span = 0.01 MB, N50 = 4,060 nt
[I]     Actinobacteria : sequences = 4, span = 0.0 MB, N50 = 951 nt
[I]     unresolved : sequences = 1, span = 0.0 MB, N50 = 2,346 nt
[I]     Tardigrada : sequences = 1, span = 0.0 MB, N50 = 216 nt
[+] Plotting example/blobDB.json.bestsum.phylum.p7.span.100.blobplot.bam0.png
[+] Plotting example/blobDB.json.bestsum.phylum.p7.span.100.blobplot.read_cov.bam0.png
[+] Writing example/blobDB.json.bestsum.phylum.p7.span.100.blobplot.stats.txt

Good luck and thanks for the great tool.
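
For anyone hitting the same traceback: "math domain error" is what math.log raises for zero or negative input, so a max_cov of 0 would produce exactly this failure. The sketch below shows one way to guard the exponent calculation. It is not the fix adopted in the repository; the function name and the floor value are hypothetical, and it uses math.log10 rather than the natural log seen in the traceback.

```python
import math


def log10_upper_exponent(max_cov, floor=1.0):
    """Return an integer e such that 10**e >= max(max_cov, floor).

    Clamping to `floor` keeps the logarithm defined when max_cov is 0.
    """
    safe = max(max_cov, floor)
    return int(math.floor(math.log10(safe))) + 1


try:
    math.log(0)
except ValueError as error:
    print("math.log(0) raises:", error)

print(log10_upper_exponent(0))      # clamped to the floor
print(log10_upper_exponent(150.0))
```

A guard like this only prevents the crash. If max_cov is 0 for the example data, the underlying question is still why no coverage was loaded.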

Distinguishing 'BLASTed and no match' and 'not blasted' in blob plots

Hi,

For resource conservation we don't BLAST short contigs from our de novo assemblies. The consequence when using BAM files with reads mapped against a reference that does contain those contigs is Blob plots with grey clouds containing both contigs that weren't BLASTed and contigs that were BLASTed but produced no hits (and uninformative bars in the ReadCovPlot).

My proposal to deal with this is an additional parameter to specify the contigs we actually BLASTed, producing plots with separate BLASTed and non-BLASTed no-hit categories. I'm happy to have a go at coding that if necessary. But is there a better way?

Thanks,

Jon

Span in blobplot ?

Hello,

When plotting the blobplot, what does "Span (kb)" on the Y axis (top figure) mean?

Rank: kingdom?

Hello,

I find the phylum rank most useful, but even then I often need to look up many of the phyla to work out which kingdom they belong to.

I would suggest adding a kingdom rank (fungi, plants, animals, ...).

Rank order and family seem to generate the same output.

Thanks,
Adrian

Could not download NCBI taxdump.

Hi
I am trying to install blobtools and am running into the following issue:

Python dependencies installed.
[+] Creating BlobTools executable...done.
[+] Downloading samtools-1.5...done.
[+] Unpacking samtools-1.5...done.
[+] Configuring samtools-1.5...done.
[+] Compiling samtools-1.5...done.
[+] Cleaning up...
[+] Downloading NCBI taxdump from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz ...FAIL.

Any suggestions?
Thanks in advance

trouble with bamfilter

Hi I have two issues with running the bamfilter program.

Firstly, I have an -i include list called scaffolds_vs_nt_1e-10_eukaryota_test.txt:

NODE_10002_length_2789_cov_4.69971
NODE_10005_length_2789_cov_41.3277
NODE_1000_length_17179_cov_4.50444
NODE_10012_length_2785_cov_44.919

I also have a sorted+indexed bam file that I created by indexing the scaffolds.fasta with bwa and then mapping my individual libraries with bwa mem, then merging those bams to one final bam with samtools. All v1.3. Then followed with samtools sort and index.

When I try the code below, it works fine.

samtools view -b scaffolds_mapped_all_reads_sorted.bam NODE_10002_length_2789_cov_4.69971 > test.bam

I get the reads mapped to that scaffold, but when I try:

blobtools bamfilter -b scaffolds_mapped_all_reads_sorted.bam -i scaffolds_vs_nt_1e-10_eukaryota_test.txt --threads 16
[+] Reading ../MAPPING/scaffolds_mapped_all_reads_sorted.bam
[+] Filtering ../MAPPING/scaffolds_mapped_all_reads_sorted.bam ...
[+] Filtered InIn (pairs=0) ...
[+] Filtered InUn (pairs=0) ...
[+] Filtered ExIn (pairs=0) ...
Traceback (most recent call last):
  File "/home/cs02gl/single_cell_workflow/install_dependencies/build/blobtools/lib/bamfilter.py", line 64, in <module>
    main()
  File "/home/cs02gl/single_cell_workflow/install_dependencies/build/blobtools/lib/bamfilter.py", line 56, in main
    BtIO.parseBamForFilter(bam_f, include_unmapped, out_f, sequence_list, None, gzip, do_sort, keep_sorted, sort_threads)
  File "/home/cs02gl/single_cell_workflow/install_dependencies/build/blobtools/lib/BtIO.py", line 370, in parseBamForFilter
    info_string.append((read_pair_type + ' pairs', "{:,}".format(count), '{0:.1%}'.format(count / int(seen_reads / 2))))
ZeroDivisionError: division by zero

Well you can see the error message - none of the scaffold IDs match anything in the BAM - which can't be the case as the samtools command above works fine. Is there something obvious I have missed?

Secondly, almost incidentally - and this may be because blobtools uses samtools v1.5 (though there doesn't seem to be anything in the release notes to suggest this behaviour) but if I repeat the same command above but with "--sort" on my unsorted BAM file, instead of the 5.5GB file I expect it comes out as 13GB and then working with that file is impossible.

samtools view -b scaffolds_mapped_all_reads.bam.readsorted.bam NODE_10002_length_2789_cov_4.69971 > testing.bam
[main_samview] random alignment retrieval only works for indexed BAM or CRAM files.

samtools index scaffolds_mapped_all_reads.bam.readsorted.bam 
[E::hts_idx_push] NO_COOR reads not in a single block at the end 31 -1
samtools index: "../MAPPING/scaffolds_mapped_all_reads.bam.readsorted.bam" is corrupted or unsorted

Any help is much appreciated! :)
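
On the crash itself: the last frame divides by int(seen_reads / 2), which is zero when no reads were counted. A guarded version of that formatting step could look like the sketch below. This is not the repository's code; the function name and the "n/a" placeholder are hypothetical.

```python
def percent_of_pairs(count, seen_reads):
    """Format count as a percentage of read pairs, tolerating zero pairs."""
    pairs = seen_reads // 2
    if pairs == 0:
        return "n/a"  # nothing was seen, so a percentage is undefined
    return "{0:.1%}".format(count / float(pairs))


print(percent_of_pairs(0, 0))
print(percent_of_pairs(5, 20))
```

This only avoids the ZeroDivisionError. It does not explain why no reads matched the include list, which is the more interesting part of the report.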

Error 19

Hi,

I am trying to run blobtools on my data and have run into the following error:
[ERROR:19] : Sequence Calcutta in file /home/s1670484/Hill_contigs_blast.o2727632 is not part of the assembly.

How should I interpret this error?

Thank you in advance.
Best regards,
Andrea
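
The message says that the first column of a line in the hits file does not match any sequence ID in the assembly. The file name ending in .o2727632 resembles a scheduler job log, so it may contain lines that are not hit records, although that is only a guess. The sketch below lists such IDs; it is not part of blobtools, and the function names and example data are made up.

```python
def fasta_ids(fasta_lines):
    """Return the set of sequence IDs (first word after '>') in a FASTA."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def unknown_hit_ids(hits_lines, known_ids):
    """Return first-column values of hit lines that are not assembly IDs."""
    unknown = []
    for line in hits_lines:
        fields = line.split()
        if not fields or line.startswith("#"):
            continue
        if fields[0] not in known_ids and fields[0] not in unknown:
            unknown.append(fields[0])
    return unknown


# Made-up inputs, for illustration only.
fasta = [">contig_1 len=100", "ACGT", ">contig_2", "GGCC"]
hits = ["contig_1\t6239\t200", "Calcutta is a stray log line"]
print(unknown_hit_ids(hits, fasta_ids(fasta)))
```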

Any experience with MaSuRCA assembler ?

Hi,

I would like to try blobtools with my results from the MaSuRCA assembler. Do you have a solution for this?

The headers in the output are rather unusual: the coverage is not mentioned.

jcf7180002879179
jcf7180003100420
jcf7180002969784

Comparing multiple samples using the same color representation of blobs

Hello

Is it possible to use the same key or color representation of blobs across different samples? When I run the blobtools plot command it generates the GC-coverage plot but assigns different colors to contigs with the same taxonomic classification. For example, E. coli is shown in blue in one sample and in red in another.
There is a --color option, but could you explain the file format it expects?

Problems with more "global" install

Hey Dom

Sorry for being dumb, I'm sure this is my issue, but nonetheless I could do with some help.

So I have python2.7.3 installed via conda. I activate the environment and install according to your instructions. It all works!

$ ./blobtools
usage: blobtools [<command>] [<args>...] [--help] [--version]

commands:

    create        create a BlobDB
    view          generate tabular view, CONCOCT input or COV files from BlobDB
    plot          generate a BlobPlot from a BlobDB
    covplot       generate a CovPlot from a BlobDB and a COV file

    map2cov       generate a COV file from BAM file
    taxify        generate a BlobTools compatible HITS file (TSV)
    bamfilter     subset paired-end reads from a BAM file
    seqfilter     subset sequences in FASTA file based sequence IDs in list
    nodesdb       create nodesdb based on NCBI Taxdump's names.dmp and nodes.dmp

    -h, --help      show this
    -v, --version   show version number

See 'blobtools <command> --help' for more information on a specific command.

examples:

    # 1. Create a BlobDB
    ./blobtools create -i example/assembly.fna -b example/mapping_1.bam -t example/blast.out -o example/test

    # 2. Generate a tabular view
    ./blobtools view -i example/test.blobDB.json

    # 3. Generate a blobplot
    ./blobtools blobplot -i example/test.blobDB.json

However, I don't want to be in the blobtools directory to run it. But if I go outside:

cd ..
./blobtools/blobtools
Traceback (most recent call last):
  File "/home/ubuntu/blobtools/lib/blobtools.py", line 77, in <module>
    exit(call(['./blobtools', '-h']))
  File "/usr/local/anaconda3/envs/python2.7.3/lib/python2.7/subprocess.py", line 493, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/local/anaconda3/envs/python2.7.3/lib/python2.7/subprocess.py", line 679, in __init__
    errread, errwrite)
  File "/usr/local/anaconda3/envs/python2.7.3/lib/python2.7/subprocess.py", line 1249, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied

Sorry I am not knowledgeable enough about Python to figure this out, though have done 100s of python installs

Any idea why this is happening?

Mick
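
A possible explanation, based only on the traceback: the script calls './blobtools' as a relative path, which is resolved against the current working directory. From the parent directory, ./blobtools is the directory itself, and trying to execute a directory fails with "Permission denied". The sketch below shows how a script can locate the executable relative to its own file instead. It is not the repository's fix; the function name is hypothetical and the layout (script in lib/, executable one level up) is taken from the paths in the traceback.

```python
import os


def main_executable(script_file, name="blobtools"):
    """Return the absolute path of `name` in the parent of the script's directory."""
    lib_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(os.path.dirname(lib_dir), name)


# Path taken from the traceback in the post.
path = main_executable("/home/ubuntu/blobtools/lib/blobtools.py")
print(path)
```

Inside the real script, __file__ would be passed as script_file so that the result no longer depends on where the command is run from.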

Sam parsing does not work

Sam parsing throws an error

bam2cov question

Hi,

Thanks for the great tools.

I have used blobtools bam2cov to get the coverage values from a different set of reads than used to make an assembly.

The output, as you know, looks something like this:

# Total Reads = 20745678
# Mapped Reads = 19921290
# Unmapped Reads = 824388
# Parameters : MQ = 0, No_base_cov_flag = False
# contig_id     read_cov        base_cov
seq656447_len82_cov114  8       4.87804878049
seq7374_len1367_cov82   47      1.71909290417
seq411348_len309_cov55  19      3.07443365696
seq246648_len163_cov164 11      3.37423312883
seq679521_len1744_cov77 137     3.92717889908
.
.
.
.

It looks like base_cov is simply read_length*read_cov/contig_length. Let me know if I am wrong about that.

Is it the 3rd column (base coverage) that makes more sense to use with blobtools create since it normalizes the read count by contig length? If so, then I guess one needs to take columns 1 and 3 into a new file to be able to pass to blobtools create using --cov ... correct?

Thanks for your time and thoughts.

--John
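
The relationship proposed above can be checked against the five rows quoted in the post. The sketch below back-calculates the number of bases per read implied by each row, assuming the contig length is the number after "len" in the sequence name. That assumption and the function name are mine, not something stated by the tool.

```python
# (name, contig length from the name, read_cov, base_cov), copied from the post.
rows = [
    ("seq656447_len82_cov114", 82, 8, 4.87804878049),
    ("seq7374_len1367_cov82", 1367, 47, 1.71909290417),
    ("seq411348_len309_cov55", 309, 19, 3.07443365696),
    ("seq246648_len163_cov164", 163, 11, 3.37423312883),
    ("seq679521_len1744_cov77", 1744, 137, 3.92717889908),
]


def implied_bases_per_read(length, read_cov, base_cov):
    """Solve base_cov = bases_per_read * read_cov / length for bases_per_read."""
    return base_cov * length / read_cov


values = [implied_bases_per_read(length, reads, cov) for _, length, reads, cov in rows]
for (name, _, _, _), value in zip(rows, values):
    print(name, round(value, 2))
```

Four rows come out at 50 bases per read and the last at just under 50, which fits the proposed formula for 50 nt reads. The small shortfall in the last row hints that base_cov may count aligned bases rather than full read lengths, but five rows are not enough to be sure.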

Incompatibility with dual-python system

Our system is set up with Python 2.7 and Python 3.6, but defaults to Python 3.6.

Attempting to run blobtools with python2 leads to SyntaxErrors since the main script calls its submodules using generic python. Swapping all instances of 'python' to 'python2' in blobtools.py fixes this for me, but perhaps there is a better solution.
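
One alternative to hard-coding 'python2' is to launch submodules with the interpreter that is already running the main script, which sys.executable provides. The sketch below only builds the command list; it is not the repository's code, and the function name and the script path lib/create.py are hypothetical.

```python
import sys


def submodule_command(script_path, args):
    """Build an argv list that reuses the interpreter running this process."""
    return [sys.executable, script_path] + list(args)


# Hypothetical script path, for illustration only.
cmd = submodule_command("lib/create.py", ["--help"])
print(cmd)
```

The list can then be handed to subprocess.call, so whichever interpreter started the main script is also used for its submodules.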

Blobtools goes straight to help

Hi,

I am trying to use blobtools and for the life of me I can't get it to go past the help screen.

$ python create.py -i transcript-60_clean.fa -t blobplot.txt -x bestsum --nodes /Databases/Database/taxBlastdb/nodes.dmp --names /Databases/Database/taxBlastdb/names.dmp -b blobology.out.bam -y spades
usage: blobtools create -i FASTA [-y FASTATYPE] [-o OUTFILE] [--title TITLE]
[-b BAM...] [-s SAM...] [-a CAS...] [-c COV...]
[--nodes ] [--names ] [--db ]
[-t TAX...] [-x TAXRULE...]
[-h|--help]

I've even put bestsum in quotes, can you please help me with what I am doing wrong? I'm sure I must have a typo somewhere.

If I try blobtools create I get the following python error.

./blobtools create -i transcript-60_clean.fa -t blobplot.txt -x "bestsum" --nodes /Databases/Database/taxBlastdb/nodes.dmp --names /Databases/Database/taxBlastdb/names.dmp -b blobology.out.bam -y spades
Traceback (most recent call last):
  File "./create.py", line 36, in <module>
    import lib.BtCore as bt
  File "~/lib/BtCore.py", line 97
    'dict_of_blobs' : {name : blObj.dict for name, blObj in self.dict_of_blobs.items()},
    ^
SyntaxError: invalid syntax

I am using python 2.7
$ python -V
Python 2.7.8
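
The line flagged in the traceback is a dict comprehension, a syntax that exists only in Python 2.7 and later. A SyntaxError there is what an older interpreter would report, so the ./blobtools wrapper may be launching a different Python than the one python -V shows. That is an inference from the traceback, not a confirmed diagnosis. The sketch below shows a version check and an equivalent construction that older interpreters accept; the names are hypothetical.

```python
import sys


def supports_dict_comprehensions(version_info=sys.version_info):
    """Dict comprehensions are available from Python 2.7 onwards."""
    return tuple(version_info[:2]) >= (2, 7)


# Made-up data standing in for the mapping in the traceback.
blobs = {"contig_1": 10, "contig_2": 20}

# dict() over a generator of pairs: same result as a dict comprehension,
# but the syntax is also valid on interpreters older than 2.7.
portable = dict((name, value * 2) for name, value in blobs.items())

print(supports_dict_comprehensions((2, 6, 9)), supports_dict_comprehensions((2, 7, 8)))
print(portable)
```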

Suggestion- more efficient contaminant read removal

First of all, let me say that blobtools has been an excellent program to work with. However, I was initially a little disappointed with the results of the contaminant cleaning: my second and third re-assemblies still had a significant portion of contamination being assembled. I did find a very easy way around this. I didn't want to label too many contigs in my initial assembly as contaminants and risk losing useful information, but this meant I ended up with a lot of small "no-hit" contigs that were contaminants.

But I figured most of my contaminants would be of similar origin, so instead of extracting reads from my whole genome mapping, I re-mapped my reads to ONLY the contigs I identified as contaminants. From this mapping, I took only the UNMAPPED reads for my next assemblies and got much nicer results. This means I essentially used contigs that I was pretty sure were contaminant assemblies as a magnet. I'm now really happy with how my reassembled genomes look, and I figured it might help someone who is struggling to get the decontamination result they were after. I didn't even see the need to do a second round of cleaning up with this approach, whereas with the initial approach, I was still unhappy with my assemblies after the third iteration.

Hope this helps someone!
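
The selection step described above can be sketched on SAM text records. In the SAM FLAG field, bit 0x4 marks an unmapped read and bit 0x8 an unmapped mate. The post does not say whether whole pairs or single reads were kept, so requiring both mates to be unmapped is an assumption of this sketch, and the function names and example records are made up.

```python
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def pair_fully_unmapped(flag):
    """True when neither the read nor its mate aligned to a contaminant contig."""
    return bool(flag & READ_UNMAPPED) and bool(flag & MATE_UNMAPPED)


def unmapped_read_names(sam_lines):
    """Return names of read pairs with both mates unmapped, in input order."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        if pair_fully_unmapped(int(fields[1])) and fields[0] not in names:
            names.append(fields[0])
    return names


# Made-up records (name and FLAG only), for illustration.
sam = [
    "@HD\tVN:1.6",
    "read_a\t77", "read_a\t141",   # both mates unmapped
    "read_b\t99", "read_b\t147",   # both mates mapped
    "read_c\t73", "read_c\t133",   # one mate mapped
]
print(unmapped_read_names(sam))
```

Dropping pairs where even one mate hits a contaminant contig is the stricter of the two options, which matches the "magnet" idea in the post.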

Missing node in a taxonomy - undef

Hi,

Thank you for creating Blobtools. It's a very promising program!
I installed it to help with the decontamination of a genome assembly. My first job has just completed and I found the "undef" suffix in the table created by "view". Consequently, I've got no plots. After going through the Blobtools documentation, I understood the reason: I'm assembling the genome of a Sphaeroforma, which lacks taxonomic nodes between kingdom and class. However, my bug does have a class/order/genus/species assigned. I launched a second job and added the -r flag with "genus" to see if I could bypass the problem, but it seems not.
I find this behaviour odd and rather unfortunate: some taxonomic information is missing and we can't do anything about that, but not being able to use the taxonomic information downstream of the missing node is quite a loss. I see this as a serious limitation of the program.
So I wonder whether I missed a trick that could help, or whether the "undef" suffix/missing node could be mentioned on the main page to warn users more clearly.

Thank you,

Anne-Lise

genus & phylum taxonomic affiliation do not correspond

Dear developers,

blobtools view -i refseq.blobDB.json

The command outputs a tabular file; column 6 reports the taxonomic affiliation at phylum level, based on bitscore bestsum. When I change the rank with the "-r genus" option, some sequences get a different affiliation: not simply undef, but Bacteria at phylum level and Eukaryota at genus level.

The hits file was generated with DIAMOND against RefSeq, with e-value 1-10 and 10 best hits.

thanks for your time and effort.

comparecov only working for "-r superkingdom"

Will fix soon ...

DIAMOND output integration

Hi Blobtools developers,

I would like to create a blobplot using DIAMOND output. In the usage statement of the script for DIAMOND output conversion (daa_to_tagc.pl), it mentions using 'uniref.100.taxlist'. I assume this is contained here.

Can you describe how the uniref.100.taxlist was created? What version of UniRef 100 was used? I would like to incorporate taxonomy, but do not know if this is possible if I am using a different version of UniRef 100.

map2cov ignores -o / --output

blobtools map2cov -i assembly.fasta -b mapping.bam -o out

Resulting cov file is named mapping.bam.cov, instead of out.cov. Same happens if using --output

division by zero when creating

Hi,

I was running blobtools but failed at first stage:

/home/ijt/bin/blobtools/blobtools create -i ref.fa -b 180bp-pe.sorted.bam -t assembly.vs.uniref90.dmnd -o first

and

Traceback (most recent call last):
  File "/home/ijt/bin/blobtools/bloblib/create.py", line 115, in <module>
    main()
  File "/home/ijt/bin/blobtools/bloblib/create.py", line 108, in main
    blobDb.parseCoverage(covLibObjs=cov_libs, no_base_cov=None, prefix=prefix)
  File "/home/ijt/bin/blobtools/bloblib/BtCore.py", line 355, in parseCoverage
    cov = base_cov / self.dict_of_blobs[name].agct_count
ZeroDivisionError: division by zero

Any suggestions would be great!

Bug in parseFasta ??

Hi,

Using that command:

blobtools create -i sm.scafSeq -t assembly_se_uniref.daa.tagc -t assembly_se_nt.blastn -y soap --nodes nodes.dmp --names names.dmp

I got an error:

[STATUS] : Parsing FASTA - sm.scafSeq
Traceback (most recent call last):
  File "blobtools/create.py", line 56, in <module>
    blobDb.parseFasta(fasta_f, fasta_type)
  File "blobtools/lib/BtCore.py", line 265, in parseFasta
    cov = BtIO.parseCovFromHeader(fasta_type, blObj.name)
  File "blobtools/lib/BtIO.py", line 146, in parseCovFromHeader
    return float(temp[2]/(temp[1]+1-75))

Do you have any solution ??

blobtools map2cov fails using bam file from blasr aligner

Hi there,

I am using blobtools with a PacBio-only assembly and I'm having trouble generating the coverage file from a BAM alignment made with blasr (PB reads against the assembly). map2cov prints a WARNING for each contig in the assembly, complaining that it is not part of the assembly, and in the end no coverage information is written to the cov file. Any idea what the problem might be and how to solve it?

Software:
blobtools v0.9.19.3
blasr v5.3.8c16f52
samtools 1.3.1

Running on CentOS 6.8

First, I assembled a preliminary assembly with Canu, using PB subreads in fastq, then aligned the PB reads to this preliminary assembly using blasr, output in bam format. Commands:

blasr input.fofn $ASSEMBLY/canu20160810/Ttra_canu1.contigs.fasta --bam --nproc 24 --out fastq_vs_TtraCanu1.default.blasr.bam > fastq_vs_TtraCanu1.default.blasr.log 2>&1
samtools sort -o fastq_vs_TtraCanu1.default.blasr.sorted fastq_vs_TtraCanu1.default.blasr.bam
samtools index fastq_vs_TtraCanu1.default.blasr.sorted.bam

virtualenv ~/virtualenvs/blobtools
source ~/virtualenvs/blobtools/bin/activate
~/Software/blobtools/blobtools map2cov -i $ASSEMBLY/canu20160810/Ttra_canu1.contigs.fasta -b fastq_vs_TtraCanu1.default.blasr.sorted.bam -o fastq_vs_TtraCanu1.default.blasr > fastq_vs_TtraCanu1.default.blasr.map2cov.log 2>&1

This is the STDOUT I get:

head fastq_vs_TtraCanu1.default.blasr.map2cov.log 
[STATUS]        : Parsing FASTA - /home/champi/Documents/TerebrataliaGenome/assemblies/canu/canu20160810/Ttra_canu1.contigs.fasta
[STATUS]        : Parsing bam0 - /home/champi/Documents/TerebrataliaGenome/assemblies/canu/blobologyFiltering/fastq_vs_TtraCanu1.default.blasr.sorted.bam
[STATUS]        :       Checking with 'samtools flagstat'
[STATUS]        :       Mapping reads = 2,596,752, total reads = 16,556,960 (mapping rate = 15.7%)
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly

$ tail fastq_vs_TtraCanu1.default.blasr.map2cov.log
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[PROGRESS]      :       100%
[STATUS]        :       Writing fastq_vs_TtraCanu1.default.blasr.sorted.bam.cov
[WARN]          : Sum of coverage in cov lib bam0 is 0.0. Please ignore this warning if "--no_base_cov" was specified.

And this is the output file:

$ head fastq_vs_TtraCanu1.default.blasr.sorted.bam.cov
## blobtools v0.9.19.3
## Total Reads = 16556960
## Mapped Reads = 2596752
## Unmapped Reads = 13960208
## Source(s) : /home/champi/Documents/TerebrataliaGenome/assemblies/canu/blobologyFiltering/fastq_vs_TtraCanu1.default.blasr.sorted.bam
# contig_id     read_cov        base_cov
tig00000000     0       0.0
tig00000001     0       0.0
tig00000002     0       0.0
tig00000003     0       0.0

matplotlib future warning

Hi,

Thanks for the great set of tools.

I just started using blobtools, but I have seen this message twice when doing blobtools blobplot:

/users/jurban/software/localpy/lib/python2.7/site-packages/matplotlib-1.4.3-py2.7-linux-x86_64.egg/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Not sure if it is affecting anything, but thought I'd pass it along.

best,

John

filtering reads tutorial issue

Dear developers, sorry if this is more a tutorial/conceptual issue, than a program/code one, thanks a lot for your work.

I have two RNAseq libraries that contain 2 eukaryotes (target and food) and several bacteria (polyA selection didn't work), which I assembled together with Trinity. Now I want to use blobtools to filter the reads for my target eukaryote in two rounds: first remove the bacteria, then the food eukaryote.

First question, would you work on each library independently or together? I don't know if there's any conceptual difference.

Then, I'd like to know if you could provide some commands to get a list of contigs classified at superkingdom rank instead of phylum, which is what is printed in blobDB.table.txt. Also, why is the "kingdom" rank not available? That would be really useful.

Finally, the read filtering strategy that you provide is based on a list of contigs of interest. I managed to get a taxonomy-based list, so I wonder if there is a strategy to get bins of contigs considering also %GC and coverage. Maybe it's exactly what is missing in the "under construction" sections for filtering assemblies, then I would like to know if you have a release date in mind for that.

Thank you very much, keep on the good work!

Feature request:

Problem with installation

Where is the blast?

Hi guys!

Being an old-school user, and maybe just ignorant here, but where/when is the blast done now?

Cheers

Phil

./blobtools create not working properly

Hello,

I'm going through the tutorial on your site, but am running into issues. Below is the command that I'm trying and the errors that I'm getting. I'm using the new version 0.9.19.4.

Command from tutorial:

./blobtools create   -i test_files/assembly.fna   -b test_files/mapping_1.bam   -t test_files/blast.out   -o test_files/my_first_blobplot

Errors:

[STATUS]        : Parsing FASTA - test_files/assembly.fna
[STATUS]        : names.dmp/nodes.dmp not specified. Retrieving nodesDB from /home/mcclintock/ta2007/bin/blobtools/data/nodesDB.txt
[PROGRESS]      :       100%
[STATUS]        : Parsing tax0 - /home/mcclintock/ta2007/bin/blobtools/test_files/blast.out
[STATUS]        : Computing taxonomy using taxrule(s) bestsum
[PROGRESS]      :       100%
[STATUS]        : Parsing bam0 - /home/mcclintock/ta2007/bin/blobtools/test_files/mapping_1.bam
[STATUS]        :       Checking with 'samtools flagstat'
Traceback (most recent call last):
  File "./bloblib/create.py", line 115, in <module>
    main()
  File "./bloblib/create.py", line 108, in main
    blobDb.parseCoverage(covLibObjs=cov_libs, no_base_cov=None)
  File "/home/mcclintock/ta2007/bin/blobtools/bloblib/BtCore.py", line 347, in parseCoverage
    base_cov_dict, covLib.reads_total, covLib.reads_mapped, read_cov_dict = BtIO.parseBam(covLib.f, set(self.dict_of_blobs), no_base_cov)
  File "/home/mcclintock/ta2007/bin/blobtools/bloblib/BtIO.py", line 384, in parseBam
    reads_total, reads_mapped = checkBam(infile)
  File "/home/mcclintock/ta2007/bin/blobtools/bloblib/BtIO.py", line 198, in checkBam
    reads_secondary = int(reads_secondary_re.search(output).group(1))
AttributeError: 'NoneType' object has no attribute 'group'

Thanks for your help!
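The traceback above ends in a regular-expression search returning None, which is what happens when the expected line is missing from the 'samtools flagstat' text. A possible explanation, stated as an assumption, is an older samtools release that does not print a 'secondary' line. The sketch below shows one defensive way to parse such output; `parse_flagstat` is a hypothetical helper, not the code in BtIO.py.

```python
import re


def parse_flagstat(output):
    """Extract read counts from 'samtools flagstat' text.

    Optional lines ('secondary', 'supplementary') that are absent are
    counted as 0 instead of raising AttributeError on a failed match.
    """
    def count(label, required=False):
        match = re.search(r"^(\d+) \+ \d+ " + label, output, re.MULTILINE)
        if match is None:
            if required:
                raise ValueError("flagstat output lacks a '%s' line" % label)
            return 0
        return int(match.group(1))

    return {
        "total": count("in total", required=True),
        "mapped": count("mapped", required=True),
        "secondary": count("secondary"),
        "supplementary": count("supplementary"),
    }


# Made-up flagstat text without 'secondary'/'supplementary' lines.
old_style = """100 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
95 + 0 mapped (95.00%:-nan%)
"""
counts = parse_flagstat(old_style)
print(counts)
```

Running 'samtools flagstat' by hand on test_files/mapping_1.bam and checking whether a 'secondary' line is printed would confirm or rule out this explanation.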

Python unicode encode error

Hi!
I get these errors when executing blobtools plot on the example data.
FYI, the OS is CentOS 6 and the Python version is 2.7.0.
Could you look into these problems?

[@server2 blobtools]# ./blobtools plot \
-i example/blobDB.json \
-o example/
[+] Reading BlobDB example/blobDB.json
[+] Loading BlobDB into memory ...
[+] Deserialising BlobDB (using 'ujson' module) (this may take a while) ...
[+] Finished in 0.00179696083069s
[+] Extracting data for plots ...
Traceback (most recent call last):
  File "/Program/blobtools/lib/blobplot.py", line 183, in <module>
    main()
  File "/Program/blobtools/lib/blobplot.py", line 146, in main
    plotObj.relabel_and_colour(colour_dict, user_labels)
  File "/Program/blobtools/lib/BtPlot.py", line 405, in relabel_and_colour
    colour_dict = generateColourDict(colour_groups)
  File "/Program/blobtools/lib/BtPlot.py", line 75, in generateColourDict
    colour_d = {group: rgb2hex(cmap(b)) for b, group in izip(breaks, colour_groups)}
  File "/Program/blobtools/lib/BtPlot.py", line 75, in <dictcomp>
    colour_d = {group: rgb2hex(cmap(b)) for b, group in izip(breaks, colour_groups)}
  File "/usr/local/lib/python2.7/site-packages/matplotlib/colors.py", line 528, in __call__
    lut.take(xa, axis=0, mode='clip', out=rgba)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

How to run blast+

Hi Dom

Sorry, I know this isn't a problem with blobtools per se....

My blast+ output looks like this:

tig00000001 N/A 54.3
tig00000001 N/A 53.9
tig00000001 N/A 53.1
tig00000001 N/A 53.5
tig00000001 N/A 53.9
tig00000001 N/A 47.8
tig00000001 N/A 27.7
tig00000001 N/A 53.1
tig00000001 N/A 52.4
tig00000001 N/A 53.1
tig00000001 N/A 52.4
tig00000001 N/A 53.1
tig00000001 N/A 51.6
tig00000001 N/A 53.1
tig00000001 N/A 52.8
tig00000001 N/A 52.8
tig00000001 N/A 53.5
tig00000001 N/A 53.1
tig00000001 N/A 52.8
tig00000001 N/A 52.8
tig00000001 N/A 52.4
tig00000001 N/A 52.4

Obviously N/A is not very useful.

Command ran was:

blastx -db nr_blastplus -query test.fa -outfmt '6 qseqid staxids bitscore' -num_threads 8

Building the database was:

zcat nr.gz | makeblastdb -in - -parse_seqids -dbtype prot -title nr_blastplus -out nr_blastplus

Have you seen this type of behaviour before?

Cheers
Mick
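An N/A in the staxids column usually suggests that the database carries no taxonomy information for its sequences (makeblastdb accepts a -taxid_map option for supplying it), though that is an assumption about this particular setup. Whatever the cause, a quick check of how many hits lack a usable taxid can be run before the file goes into blobtools. The sketch below assumes the three-column layout from the blastx command above; `taxid_summary` is a hypothetical helper.

```python
def taxid_summary(lines):
    """Count hits with and without a usable taxid in column 2.

    Expects whitespace-separated lines as produced by
    -outfmt '6 qseqid staxids bitscore'.
    """
    usable = 0
    missing = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        # staxids may hold several taxids separated by ';'
        taxids = fields[1].split(";")
        if all(t.isdigit() for t in taxids):
            usable += 1
        else:
            missing += 1
    return usable, missing


hits = [
    "tig00000001 N/A 54.3",
    "tig00000001 N/A 53.9",
    "tig00000002 9606 120.0",  # made-up line with a taxid, for contrast
]
print(taxid_summary(hits))
```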

In blobtools view tabular output, what does .c indicate?

Someone asked this because it wasn't clear in the README (or in the output)

Columns like phylum.c or order.c are not described properly on the GitHub page. They show the confusion index: how many other taxa were hit at that tax level (e.g. phylum).
None indicates no hit
0 indicates no confusion (all hits were to one phylum, or whichever tax-level was chosen)
1 indicates one other phylum was hit
2 indicates 2 other phyla were hit, and so on

Can you please add this in the README or in the help text of blobtools view? Thanks!
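The description above can be sketched in a few lines. This follows only the wording given in this issue (number of other taxa hit at one rank, None for no hit); `confusion_index` is a hypothetical function and not necessarily how blobtools computes the column.

```python
def confusion_index(taxa_hit):
    """Number of *other* taxa hit at one rank; None when there are no hits."""
    distinct = set(taxa_hit)
    if not distinct:
        return None
    return len(distinct) - 1


print(confusion_index([]))                                    # None
print(confusion_index(["Nematoda", "Nematoda"]))              # 0
print(confusion_index(["Nematoda", "Arthropoda"]))            # 1
print(confusion_index(["Nematoda", "Arthropoda", "Chordata", "Nematoda"]))  # 2
```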

Python issues

Hey

I'm using Python 3.4. I get the error:

./blobtools/blobtools create --help
File "./blobtools/bloblib/create.py", line 105
print BtLog.warn_d['0']
^
SyntaxError: Missing parentheses in call to 'print'

Is this a Python 2 vs 3 issue?

Cheers
Mick
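The error message is the one Python 3 gives for a Python 2 print statement, so this does look like a Python 2 vs 3 issue. A small sketch that reproduces it without running blobtools is shown below; `fails_to_compile` is a hypothetical helper, and the results in the comments assume it is run under Python 3.

```python
def fails_to_compile(source):
    """Return True if `source` is a syntax error for the running interpreter."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError:
        return True
    return False


# Under Python 3 the statement form is rejected, the function form is not.
print(fails_to_compile("print BtLog.warn_d['0']"))    # True on Python 3
print(fails_to_compile("print(BtLog.warn_d['0'])"))   # False
```

Running the script with a Python 2.7 interpreter, for example from a dedicated virtual environment, avoids the problem.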

No Hits

Hi,

I am interested in running this tool to visualize and possibly pull out some symbionts from my sample. I was able to use blobtools to get the graph and table with the view and blobplot commands, but all my entries come out as "no hits". Would you be able to help me troubleshoot this issue? My assembly is of a 260 MB plant genome, closely related to A. thaliana, so there should be lots of hits that are properly annotated.

Here is the summary stats:

C.microcarpa-plot-4.C.microcarpa.BlobDB.json.span.phylum.p7.100.bestsum - spades
Group colour count visible (%) span visible(%) n50 GC GC (std) cov_mean cov_std read map read map (%)
all None 49,622 100.0% 203,388,998 100.0% 107,031 0.41 0.1 25.2 172.8 0 0.0%
no-hit #d3d3d3 49,622 100.0% 203,388,998 100.0% 107,031 0.41 0.1 25.2 172.8 0 0.0%

And a few lines from the "view" output (the last is a hit to A. thaliana):

NODE_1_length_1111123_cov_10.2151_ID_114688196 6429 gi|727511611|ref|XM_010433800.1| 97.47 3795 44 29 708277 712041 3773 1 0.0
NODE_1_length_1111123_cov_10.2151_ID_114688196 5991 gi|727522685|ref|XM_010439039.1| 99.57 3286 14 0 992226 995511 6183903 0.0
NODE_1_length_1111123_cov_10.2151_ID_114688196 5485 gi|727547328|ref|XM_010448413.1| 95.46 3483 80 44 708279 711744 3425 4 0.0
NODE_1_length_1111123_cov_10.2151_ID_114688196 4427 gi|7270623|emb|AL161590.2| 87.72 3958 280 121 708239 712136 166312 170123 0.0
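One thing worth ruling out is the column layout of the hits file. The lines above look like BLAST tabular output with a subject ID in the third column, whereas the blastx command quoted elsewhere in this list uses '6 qseqid staxids bitscore'. The sketch below checks a line against that three-column layout; treating it as the required format is an assumption here, and `check_hits_line` is a hypothetical helper.

```python
def check_hits_line(line):
    """Return a list of problems found in one line of a hits file.

    Assumes the first three whitespace-separated columns should be
    query ID, numeric taxid and numeric score.
    """
    fields = line.split()
    if len(fields) < 3:
        return ["fewer than 3 columns"]
    problems = []
    if not fields[1].isdigit():
        problems.append("column 2 is not a numeric taxid: %r" % fields[1])
    try:
        float(fields[2])
    except ValueError:
        problems.append("column 3 is not a numeric score: %r" % fields[2])
    return problems


line = ("NODE_1_length_1111123_cov_10.2151_ID_114688196 6429 "
        "gi|727511611|ref|XM_010433800.1| 97.47 3795 44 29 708277 712041 3773 1 0.0")
print(check_hits_line(line))
```

Note that the second column passes the numeric test here, but a value such as 6429 may well be a score rather than a taxid, so a clean result for column 2 does not prove the layout is right.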

map2cov parses more reads than expected by samtools flagstat

Hi Dominik,

My issue is similar to #39, where blobtools also reports more reads than expected by flagstat. My BAM file, with corrected PacBio FASTA reads mapped to the reference genome by bwa mem, raises a warning with map2cov. The difference in numbers does not seem to come from multimappings somehow missed by the SAM flags used by map2cov (samtools view -f 1 -F 1024 -F 256 -F 2048, according to #40).

I have quite a lot of supplementary reads. Is there maybe a problem in how BtIO.py gets the numbers of mapped and total reads? The warning says "Based on samtools flagstat: expected 5031818 reads, 5529014 reads were parsed", but when I run flagstat separately it correctly prints "5529014 + 0 mapped (99.71% : N/A)".

Thanks!

Filip

My blobtools version is blobtools v0.9.19.
I have only samtools 1.3.1 (+htslib 1.3.1) in my $PATH and this is its flagstat output:

5545330 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
497196 + 0 supplementary
0 + 0 duplicates
5529014 + 0 mapped (99.71% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

This is the map2cov warning:

[STATUS]	: 	Checking with 'samtools flagstat'
[STATUS]	: 	Mapping reads = 5,031,818, total reads = 5,048,134 (mapping rate = 99.7%)
[PROGRESS]	: 	100%
[PROGRESS]	: 	109% 
[WARN]		: Based on samtools flagstat: expected 5031818 reads, 5529014 reads were parsed
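The figures above can be reconciled with simple arithmetic: the surplus of parsed reads equals the supplementary count exactly. The sketch below only reworks the numbers quoted in this issue. The interpretation in the comments (supplementary records excluded from the expectation but included while parsing) is an assumption and has not been checked against BtIO.py.

```python
# Counts copied from the flagstat output above.
total = 5545330
supplementary = 497196
secondary = 0
mapped = 5529014

# Subtracting secondary and supplementary records reproduces the
# figures in the STATUS line.
primary_total = total - secondary - supplementary
primary_mapped = mapped - secondary - supplementary
print(primary_total)   # 5048134
print(primary_mapped)  # 5031818

# The number parsed equals the raw 'mapped' figure, so the surplus is
# exactly the supplementary count and progress overshoots 100%.
parsed = 5529014
print(parsed - primary_mapped)               # 497196
print(int(100.0 * parsed / primary_mapped))  # 109
```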

Coverage file incorrect

Hi,

I seem to have a problem with the coverage file generated by blobtools. This is the head of my .cov file (the alignment was done with bwa 0.7.8):

blobtools v0.9.19

Total Reads = 42239014

Mapped Reads = 50620400

Unmapped Reads = -8381386

The total read count is correct. I guess I might really have 8381386 unmapped reads, which would give 42239014 - 8381386 = 33,857,628 mapped reads.

And the stat file looks like this:

blobtools v0.9.19

bam0=mybam.bam

name colour count_visible count_visible_perc span_visible span_visible_perc n50 gc_mean gc_std bam0_mean bam0_std bam0_read_map bam0_read_map_p

all None 63,908 100.0% 59,761,021 100.0% 1,345 0.4 0.047 90.3 872.7 50,620,400 119.8%

Any idea about what causes this error?

Thanks,
Estelle
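The header values are at least internally consistent, as the short check below shows: the negative unmapped figure is simply total minus mapped, and the 119.8% in the stats file is mapped over total. This sketch only reworks the numbers quoted above. That the mapped figure counts alignment records rather than reads is an assumption, not something verified here.

```python
# Values copied from the .cov header above.
total_reads = 42239014
mapped_reads = 50620400

unmapped_reads = total_reads - mapped_reads
print(unmapped_reads)  # -8381386

mapping_rate = 100.0 * mapped_reads / total_reads
print(round(mapping_rate, 1))  # 119.8

# A mapped count above the total cannot describe individual reads,
# so the mapped figure is likely counting something else.
print(mapped_reads > total_reads)  # True
```

Comparing these numbers with the output of 'samtools flagstat' on the same BAM file, in particular the secondary and supplementary lines, would show where the surplus comes from.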

Using standard setuptools for package installation

Hi,
would you mind using setuptools to install your files site-wide? It seemed that all *.py files should be placed in /usr/bin, but that does not work because ./lib/ is then missing.

$ /usr/bin/covplot.py
Traceback (most recent call last):
  File "/usr/bin/covplot.py", line 47, in <module>
    import lib.BtCore as bt
ImportError: No module named lib.BtCore
$

Thank you
