decodegenetics / bamhash
License: GNU General Public License v3.0
The read count that bamhash reports for paired FASTQ files differs from the read count it reports for the BAM file from which those FASTQ files were converted (as already mentioned in #4 (comment) ...). The FASTQ computation reports exactly half as many reads as the BAM computation:
$ bamhash_checksum_bam sample.bam
c0039f91693d4bfd 1749217454
$ bamhash_checksum_fastq sample_R1.fq sample_R2.fq
c0039f91693d4bfd 874608727
My expectation would be that the read count numbers are the same (I would expect the number from the bam bamhash computation). Is this behavior intentional? Otherwise it would be great if this could be fixed!
$ bamhash_checksum_fastq --version
bamhash_checksum_fastq version 1.1
Thanks,
Oliver
As with the standard md5sum command, it would be really handy to have a 'check' option (-c) that checks a FASTQ/BAM file against its bamhash, so that its consistency can be verified programmatically.
Also, it's confusing that fastq read pairs are only counted once whereas they're counted twice in the bam. It would be clearer if the numbers agreed as well as the hashes.
Thanks!
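Until such an option exists, the md5sum-style check requested above could be approximated with a small wrapper. This is only a sketch under stated assumptions: the manifest format (`<hash> <count> <tool> <file> ...`) and the names `check_manifest` and `run_tool` are hypothetical and not part of bamhash. The only things taken from this thread are the two command names and the `<hash> <count>` shape of their output.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple


def run_tool(tool: str, files: List[str]) -> str:
    """Run a bamhash command and return its stdout (assumes it is on PATH)."""
    done = subprocess.run([tool, *files], check=True,
                          capture_output=True, text=True)
    return done.stdout


def check_manifest(
    lines: Iterable[str],
    run: Callable[[str, List[str]], str] = run_tool,
) -> List[Tuple[str, bool]]:
    """Recompute each entry of a hypothetical manifest and compare.

    Each manifest line is assumed to look like:
        <hash> <count> <tool> <file> [<file> ...]
    Returns (files, ok) pairs, in manifest order.
    """
    results = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        expected_hash, expected_count, tool, *files = line.split()
        # Output is assumed to be "<hash> <count>", as shown in this thread.
        tokens = run(tool, files).split()
        ok = (len(tokens) >= 2
              and tokens[0] == expected_hash
              and tokens[1] == expected_count)
        results.append((" ".join(files), ok))
    return results
```

Because each stored entry is compared against a rerun of the same tool on the same file, the differing FASTQ and BAM counts do not matter here.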
When running bamhash on a bam file, should the bam be fully processed (i.e. sorted, deduped, and recalibrated)?
I just tested bamhash on a fully processed BAM file and its source paired-end FASTQ files, but the resulting hashes differ, so I'm wondering whether that is because the input BAM was fully processed.
UPDATE:
I just re-ran bamhash on a non-fully processed bam file, and got the following results:
bam result:
2490f971d6f15fa2 764438888
source fastq result:
2490f971d6f15fa2 382219444
What do the two columns represent? Is it enough that one of the columns match between the files?
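Judging by the other reports in this thread, the first column looks like the checksum and the second like a read count, with a pair counted once in paired FASTQ mode and twice in the BAM. Under that reading, which is inferred from the numbers posted here and not from the bamhash documentation, a comparison could be sketched as follows. The names `parse_result` and `consistent` are hypothetical.

```python
from typing import NamedTuple


class BamHashResult(NamedTuple):
    checksum: str  # first column, assumed to be the hash
    reads: int     # second column, assumed to be the read count


def parse_result(line: str) -> BamHashResult:
    """Split one '<hash> <count>' output line into its two columns."""
    checksum, reads = line.split()
    return BamHashResult(checksum, int(reads))


def consistent(bam_line: str, fastq_line: str, paired: bool = True) -> bool:
    """True if the hashes match and the counts agree up to the pairing factor.

    Assumption: paired FASTQ mode counts each pair once, while the BAM
    counts both mates, so the BAM count is twice the FASTQ count.
    """
    bam = parse_result(bam_line)
    fastq = parse_result(fastq_line)
    factor = 2 if paired else 1
    return bam.checksum == fastq.checksum and bam.reads == factor * fastq.reads


# Values copied from the results above.
print(consistent("2490f971d6f15fa2 764438888",
                 "2490f971d6f15fa2 382219444"))  # prints True
```

With this reading, a matching first column is the integrity signal, and the second column is a sanity check once the factor of two is accounted for.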
A new version would be great so we can integrate this tool downstream in Galaxy.
Thanks!
Hello, are there any plans to support CRAM directly?
Thanks,
Andreas
Thanks for developing this! This will certainly allow us to confidently delete original data files by first verifying data integrity.
I do have a feature request: it would be useful to have support for interleaved FASTQ files (where read pairs are consecutive in the file; compatible with BWA MEM).
Also, perhaps once this is implemented (if you choose to do so), could you create a new release with all the commits since v1.0? Thanks.
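As a stopgap for the interleaved case, the file could be split into two mate streams and passed to the existing paired mode. A minimal sketch, assuming four-line FASTQ records (no wrapped sequences) and strict R1/R2 alternation; `fastq_records` and `deinterleave` are hypothetical helper names, not bamhash functions.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # (header, sequence, separator, qualities)


def fastq_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into four-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(stripped, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def deinterleave(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Split interleaved records into (first mates, second mates)."""
    first: List[Record] = []
    second: List[Record] = []
    for index, record in enumerate(fastq_records(lines)):
        # Even-numbered records are assumed to be R1, odd-numbered R2.
        (first if index % 2 == 0 else second).append(record)
    if len(first) != len(second):
        raise ValueError("odd number of records in interleaved input")
    return first, second
```

The two lists can then be written out as separate R1 and R2 files for bamhash_checksum_fastq. For large inputs a streaming writer would be preferable to holding both lists in memory.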
Hi,
I have a question about how you use the FASTQ description field to calculate the hash. It arises because I've been running some tests and I only get the same md5 when I use the -R option on both the BAM file and the original FASTQ files.
More details follow.
I've trimmed a FASTQ sample for testing purposes. The reads look like this:
~> zcat ../data/NA12878_trimmed_1.fastq.gz | head -n 4
@ERR194147.1 HSQ1004:134:C0D8DACXX:1:1104:3874:86238/1
GGTTCCTACTTCAGGGTCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACG
+
CC@FFFFFHHHHHJJJFHIIJJJJJJIHJIIJJJJJJJJIIGIJJIJJJIJJJIJIJJJJJJJJJJIJHHHHFFFDEEEEEEEEDDDCDDEEDDDDDDDDD
When, after the analysis, I run bamhash_checksum_fastq on the FASTQ files and bamhash_checksum_bam on the resulting BAM file, I get different md5's:
~> bamhash_checksum_fastq ../data/NA12878_trimmed_*
a05de49644a0fb5d 10000
~> bamhash_checksum_bam final/NA12878_trimmed/NA12878_trimmed-ready.bam
d4d5ece0f619d83d 20000
When I convert the BAM file back to FASTQ, I see that the FASTQ read description disappears, i.e.:
~> samtools fastq final/NA12878_trimmed/NA12878_trimmed-ready.bam | head -n 4
@ERR194147.6389/2
CATCGGATTTTTGTTTTTTTTGTTTTGGGTGGGGGGGGTTGGTGGGGTTGTGTGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGTGGTTGG
+
11++40+2)2,+)2:3AA8)))00)))((('''''&&&))&&(((&&&&&&(((++(&05;BB7@>B@BDBBDDB>BDD@@3>9<&5-&&&&&&)&&)+(&
Is that description after the readname used to calculate the hash? I'm pretty confident that this is the problem, since if I run BamHash with -R it does return the expected result:
~> bamhash_checksum_fastq -R ../data/NA12878_trimmed_1.fastq.gz ../data/NA12878_trimmed_2.fastq.gz
f4524c00c70e9b83 10000
~> bamhash_checksum_bam -R final/NA12878_trimmed/NA12878_trimmed-ready.bam
f4524c00c70e9b83 20000
Thanks for your help!
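To illustrate the difference between the two headers shown above, here is a minimal sketch of reducing a FASTQ header to the part that survives the round trip through the BAM: the first whitespace-delimited token, without a trailing /1 or /2. Whether bamhash hashes the full header line or only this token is exactly the open question, so this does not describe its implementation. `core_read_name` is a hypothetical helper.

```python
def core_read_name(header: str) -> str:
    """Return the read name without the description or the mate suffix."""
    if header.startswith("@"):
        header = header[1:]  # drop the FASTQ header marker
    tokens = header.split()
    name = tokens[0] if tokens else ""  # keep only the text before whitespace
    if name.endswith(("/1", "/2")):
        name = name[:-2]  # drop the mate suffix
    return name


# Headers copied from the examples above.
print(core_read_name("@ERR194147.1 HSQ1004:134:C0D8DACXX:1:1104:3874:86238/1"))
# prints ERR194147.1
print(core_read_name("@ERR194147.6389/2"))
# prints ERR194147.6389
```

If the original FASTQ headers were reduced like this before hashing, both sides would see the same name, which would be consistent with the hashes agreeing only under -R.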