Comments (15)
These reads don't even look the same ?
GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT
TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA
from gatk.
That is true! Should I update the title of the issue to reflect this?
from gatk.
Can you check the REF_CACHE? Or alternatively you may provide samtools the reference fasta during cram view.
from gatk.
Thanks @gokalpcelik!
Seems that -T does not affect much:
samtools view -T /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta ultMerge.mt.gatk_printreads.cram | grep "038958_1-Z0011-5346565226"
038958_1-Z0011-5346565226 0 MT 3470 60 54M * 0 0 GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD AS:i:54 X1:i:0 XD:Z:CATGAGG_GGTGATC a3:i:92 bi:Z:5346565226f1:Z:CATGAGG f2:Z:GGTGATC i1:Z:Q44 i2:Z:Q27 pr:i:22 pt:i:15 px:i:3813 py:i:1262 rq:f:0.03 si:i:3750 tm:Z:AQ tq:i:195 MD:Z:0T0C0T0T0C0A0C0C0A0A0A0G0A0G0C0C0C0C0T0A0A0A0A0C0C0C0G0C0C0A0C0A0T0C0T0A0C0C0A0T0C0A0C0C0C0T0C0T0A0C0A0T0C0A0 NM:i:54 RG:Z:Z0011
from gatk.
@ilyasoifer Is there any way I can access the original cram (or better yet, a small subset thereof consisting of just MT) that illustrates this issue) and the reference ? It might be hard to debug without that. If thats not possible, a few suggestions: can you try using PrintReads to write the original cram (I would try just MT) first to a cram, then to a sam, and also the original cram to a sam, and see how those compare? It would also be useful to see what that read looks like if you use samtools view on the ORIGINAL cram. Do you know what software/version was used to write the original cram ?
from gatk.
Hi @cmnbroad - thanks! I can upload to one of the buckets that are shared between Ultima and the Broad.
Could you provide your gcp account so I can give you permissions?
I am guessing that Megan Shand can share it through our joint slack channel if you prefer.
The original file was created using samtools v1.17.
I provided above the result of writing BAM and Cram with PrintReads. Is it helpful or you prefer SAM? And if I just do
samtools view ultMerge.mt.cram
I get
038958_1-Z0011-5346565226 0 MT 3470 60 54M * 0 0 TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD bi:Z:5346565226 rq:f:0.03 pr:i:22 pt:i:15 px:i:3813 py:i:1262 si:i:3750 tq:i:195 tm:Z:AQ i1:Z:Q44 f1:Z:CATGAGG i2:Z:Q27 f2:Z:GGTGATC a3:i:92 XD:Z:CATGAGG_GGTGATC X1:i:0 AS:i:54 MD:Z:54 NM:i:0 RG:Z:Z0011
from gatk.
@ilyasoifer [email protected]. And don't worry about doing the PrintReads conversions I requested - if I have access to the original file and the reference I can debug this directly.
from gatk.
@cmnbroad, OK, great. Shared the files with Megan, she will transfer them over!
from gatk.
@cmnbroad - the reference is from here: gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/
from gatk.
Thanks for reporting this @ilyasoifer - it looks like a serious bug. I see whats going on and am working on a fix.
from gatk.
Thank you, good to hear that it was useful
from gatk.
@cmnbroad - hope all is well. I was wondering if there is any ETA when the fix that you made will be released?
Thanks!
Ilya
from gatk.
@ilyasoifer It will be part of the GATK 4.6 release, which should come out this month. We've also implemented a cram scanning tool that users can use to scan their crams to see if any are affected.
from gatk.
Ah, great, thanks for updating me!
from gatk.
Summary information about this bug:
This issue affects GATK versions 4.3.0.0 through 4.5.0.0, and is fixed in GATK 4.6.0.0. The PR with the fix is: #8900
This issue also affects Picard versions 2.27.3 through 3.1.1, and is fixed in Picard 3.2.0.
This bug is triggered when writing a CRAM file using one of the affected GATK/Picard versions, and both of the following conditions are met:
- At least one read is mapped to the very first base of a reference contig
- The file contains more than one CRAM container (10,000 reads) with reads mapped to that same reference contig
When both of these conditions are met, the resulting CRAM file may have corrupt containers associated with that contig containing reads with an incorrect sequence.
Since many common references such as hg38 have N's at the very beginning of the autosomes and X/Y, many pipelines will not be affected by this bug. However, users of a telomere-to-telomere reference, users doing mitochondrial calling, and users with reads aligned to the alt sequences will want to scan their CRAM files for possible corruption.
The other mitigating circumstance is that when a CRAM is affected, the signal will be overwhelmingly obvious, with the mismatch rate typically jumping from sub-1% to 80-90% for the affected regions, making it likely to be caught by standard QC processes.
A CRAM scanning tool called CRAMIssue8768Detector
that can detect whether a particular CRAM file is affected by this bug was added in #8819, and was released as part of GATK 4.6.0.0
from gatk.
Related Issues (20)
- BwaSpark parameter optimization HOT 1
- About DP4 HOT 1
- MarkDuplicates results in Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()" HOT 2
- What about this GATK 4 pipeline script, written by Chat-GPT HOT 1
- Follow up on CNN deprecation done in the update to python 3.10. HOT 2
- Problem with PathSeqPipelineSpark : Not generating bam
- [question] Are large files only required for tests, or also required at build and run-time ? HOT 4
- gatk Funcotator error HOT 1
- CreateSomaticPanelOfNormals: multiallelic sites wrongly added to PON despite --min-sample-count set to total input samples
- GenotypeGVCFs memory issues on GATK 4.6.0.0 HOT 12
- Tests should print per-test status, otherwise it is difficult to see what tests fail or are skipped
- GermlineCNVCaller - python exited with 2 HOT 1
- Tests fail to find libgkl libraries in /usr/local/lib
- 301 tests fail, 37 are skipped
- HaplotypeCaller is reporting DP in HOMREF region differently when ploidy is set to 1 with different Interval inputs HOT 2
- PreprocessIntervals missing results HOT 1
- SortSamSpark Required array length is too large HOT 5
- Convergence Error running GATK GermlineCNVCaller cohort mode HOT 5
- alt allels error HOT 1
- Discrepancy Between IGV and VCF File for HaplotypeCaller HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from gatk.