What datasets are available for calibrating, evaluating and/or comparing different met


Validation/truth datasets for variant calling calibration, comparison and evaluation

malariagen commented on September 11, 2024

Validation/truth datasets for variant calling calibration, comparison and evaluation


Comments (26)

alimanfoo commented on September 11, 2024

The P. falciparum genetic crosses project sequenced three crosses. BAMs are available, as are VCFs from two different calling methods (GATK and Cortex); see here for data, and see here for the paper describing the data.


alimanfoo commented on September 11, 2024

For Anopheles gambiae we have released data on 11 crosses as part of Ag1000G phase 2. See here for overview of data which includes sample metadata for the crosses.

Back in 2018 the DeepVariant team used one of these crosses to improve their variant calling models.


alimanfoo commented on September 11, 2024

For convenience here is metadata for one of the Anopheles gambiae crosses, including FTP links to sequence data on ENA. This is the cross that the DeepVariant team used.


alimanfoo commented on September 11, 2024

Other data I am aware of but don't have links for to hand: the Pf3K project assembled some high-quality genomes, I believe using PacBio; e.g., see Otto et al. (2018).


alimanfoo commented on September 11, 2024

Also for both Anopheles and Plasmodium we have a number of technical replicates, where the same biological sample went through library prep and Illumina sequencing twice.
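One way technical replicates like these could be used is to measure how often two call sets from the same biological sample disagree. The following is a minimal sketch, assuming calls have already been reduced to a mapping from site to called allele; the function name, the data layout and the toy values are all hypothetical and not taken from any released dataset.

```python
from typing import Dict, Hashable, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position); layout is an assumption


def replicate_discordance(
    calls_a: Dict[Site, Hashable], calls_b: Dict[Site, Hashable]
) -> Tuple[int, int, float]:
    """Compare genotype calls from two technical replicates.

    Returns (sites compared, discordant sites, discordance rate).
    Only sites called in both replicates are compared.
    """
    shared = calls_a.keys() & calls_b.keys()
    discordant = sum(1 for site in shared if calls_a[site] != calls_b[site])
    rate = discordant / len(shared) if shared else float("nan")
    return len(shared), discordant, rate


# Toy data, made up for illustration only.
rep1 = {("Pf3D7_01_v3", 100): "0", ("Pf3D7_01_v3", 200): "1", ("Pf3D7_02_v3", 50): "1"}
rep2 = {("Pf3D7_01_v3", 100): "0", ("Pf3D7_01_v3", 200): "0", ("Pf3D7_03_v3", 75): "1"}

n_sites, n_discordant, rate = replicate_discordance(rep1, rep2)
print(n_sites, n_discordant, rate)
```

Sites called in only one replicate are ignored here; counting them separately would give a complementary view of call-rate differences.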


alimanfoo commented on September 11, 2024

Also for Plasmodium there is a set of experimentally created mixtures of multiple parasite clones. I don't know where those data are.


alimanfoo commented on September 11, 2024

cc @fleharty @samuelklee @podpearson @roamato


fleharty commented on September 11, 2024

Just trying to familiarize myself with the data, I took a quick look at:
ftp://ngs.sanger.ac.uk/production/malaria/pf-crosses/1.0/bam/ERR012788.realigned.bam
In the header, it refers to 3d7_v3. So I downloaded the reference from,
ftp://ftp.sanger.ac.uk/pub/project/pathogens/Plasmodium/falciparum/3D7/3D7.latest_version/version3/Pf3D7_v3.fasta.gz
and tried viewing it in IGV, but got lots of weird errors including a null pointer exception.

@alimanfoo Do you have a reference you would recommend using other than the one I tried?


samuelklee commented on September 11, 2024

Thanks, @alimanfoo, this is excellent! Down the road, we can think about categorizing/prioritizing these resources (and the corresponding validations they enable), whether it's worth having parallels in resources/validations across plasmodium/anopheles, etc.


alimanfoo commented on September 11, 2024

> Just trying to familiarize myself with the data, I took a quick look at:
> ftp://ftp.sanger.ac.uk/pub/project/pathogens/Plasmodium/falciparum/3D7/3D7.latest_version/version3/Pf3D7_v3.fasta.gz
> and tried viewing it in IGV, but got lots of weird errors including a null pointer exception.

Can't remember where I sourced reference from, @podpearson / @roamato can advise, but FWIW that reference looks fine to me:

In [13]: import pyfasta

In [14]: genome = pyfasta.Fasta('Pf3D7_v3.fasta')

In [16]: sorted(genome)
Out[16]:
['PF_apicoplast_genome_1',
 'Pf3D7_01_v3',
 'Pf3D7_02_v3',
 'Pf3D7_03_v3',
 'Pf3D7_04_v3',
 'Pf3D7_05_v3',
 'Pf3D7_06_v3',
 'Pf3D7_07_v3',
 'Pf3D7_08_v3',
 'Pf3D7_09_v3',
 'Pf3D7_10_v3',
 'Pf3D7_11_v3',
 'Pf3D7_12_v3',
 'Pf3D7_13_v3',
 'Pf3D7_14_v3',
 'Pf_M76611']

In [17]: genome['Pf3D7_01_v3'][:10]
Out[17]: 'tgaaccctaa'

In [18]: len(genome['Pf3D7_01_v3'])
Out[18]: 640851


fleharty commented on September 11, 2024

@alimanfoo Do you ever use IGV to view the plasmodium data? Maybe I'm doing something strange.


podpearson commented on September 11, 2024

@fleharty , I don't remember off the top of my head exactly which reference was used in the Pf crosses. However, from the header of ftp://ngs.sanger.ac.uk/production/malaria/pf-crosses/1.0/bam/ERR012788.realigned.bam it looks like the reference only included the autosomes (Pf3D7_01_v3 to Pf3D7_14_v3) and not the mitochondrial or apicoplast sequences.
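A mismatch like this between the contigs listed in a BAM header and those in a reference FASTA can be checked mechanically. Below is a rough sketch using only the Python standard library, assuming the header text has been obtained separately (for example from `samtools view -H`) and the FASTA has been decompressed. The function names are hypothetical and the sequences and lengths in the toy data are made up.

```python
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record name (first word of the header) to its length."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def sam_header_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each @SQ sequence name (SN) to its declared length (LN)."""
    lengths: Dict[str, int] = {}
    for line in lines:
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def compare_contigs(ref: Dict[str, int], bam: Dict[str, int]) -> Dict[str, List[str]]:
    """Report contigs missing on either side and contigs whose lengths differ."""
    return {
        "only_in_reference": sorted(ref.keys() - bam.keys()),
        "only_in_bam": sorted(bam.keys() - ref.keys()),
        "length_mismatch": sorted(n for n in ref.keys() & bam.keys() if ref[n] != bam[n]),
    }


# Toy inputs: real contig names, but made-up sequences and lengths.
toy_fasta = ">Pf3D7_01_v3\nACGTACGT\n>Pf3D7_02_v3\nACGT\n>Pf_M76611\nACG\n".splitlines()
toy_header = ["@HD\tVN:1.0", "@SQ\tSN:Pf3D7_01_v3\tLN:8", "@SQ\tSN:Pf3D7_02_v3\tLN:5"]

report = compare_contigs(fasta_lengths(toy_fasta), sam_header_lengths(toy_header))
print(report)
```

Names and lengths alone do not prove the sequences are identical, but a difference in either is a quick sign that a BAM and a reference do not belong together.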

I don't think you should use the reference at ftp://ftp.sanger.ac.uk/pub/project/pathogens/Plasmodium/falciparum/3D7/3D7.latest_version/version3/Pf3D7_v3.fasta.gz. This has the chromosomes in a weird order, and I think the apicoplast and mitochondrial sequences are out of date (though I think Pf3D7_01_v3 to Pf3D7_14_v3 should be the latest sequences). The definitive latest version of the 3D7 reference can be found at GeneDB (ftp://ftp.sanger.ac.uk/pub/genedb/releases/latest/Pfalciparum/Pfalciparum_contigs.fasta.gz). Having said that, for our last major release (Pf6) and our upcoming Pf7 release we used a slightly different reference that can be found at ftp://ngs.sanger.ac.uk/production/pf3k/release_5/Pfalciparum.genome.fasta.gz. The only difference between this and the latest GeneDB version is the mitochondrial sequence (Pf3D7_MIT_v3 in the latest GeneDB and Pf_M76611 in the pf3k reference). I think I'm right in saying that the sequences Pf3D7_MIT_v3 and Pf_M76611 are identical apart from 2 SNPs.
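A claim like "identical apart from 2 SNPs" can be verified with a simple position-by-position comparison. This is a minimal sketch that assumes the two sequences have the same length and differ only by substitutions (no indels); that is an assumption, not something confirmed here. The function name and the toy sequences are made up for illustration.

```python
from typing import List, Tuple


def snp_differences(seq_a: str, seq_b: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch.

    Comparison is case-insensitive. Sequences of different length are rejected,
    because a plain positional comparison is meaningless once indels are involved.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences differ in length; align them first")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    ]


# Toy sequences, not real mitochondrial data.
diffs = snp_differences("acgtacgt", "ACGAACGC")
print(diffs)  # [(4, 'T', 'A'), (8, 'T', 'C')]
```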


podpearson commented on September 11, 2024

I have used the IGV desktop version with Plasmodium data, though not with the crosses bams such as ftp://ngs.sanger.ac.uk/production/malaria/pf-crosses/1.0/bam/ERR012788.realigned.bam


podpearson commented on September 11, 2024

We are currently thinking about setting up IGV using either the webapp (https://github.com/igvteam/igv-webapp) or igv.js/igv-jupyter, rather than the desktop version. I'd be interested to hear if you have any experience of either of these.


podpearson commented on September 11, 2024

As Alistair says above, there are 16 high-quality assemblies of different P. falciparum isolates created using a combination of PacBio and 250bp MiSeq data. You can find details of these in the Otto paper Alistair refers to. We also have 100bp Illumina HiSeq data from the same set of samples using our standard sequencing pipeline which I think we could make available if that would be helpful.


podpearson commented on September 11, 2024

Details on the experimentally created mixtures Alistair refers to can be found at ftp://ngs.sanger.ac.uk/production/pf3k/release_5/ (see pf3k_release_5_mixtures_metadata.txt and also release_5_README_20170913.txt)


fleharty commented on September 11, 2024

Has much work been done on characterizing the base error rates of the BAMs? There is a tool in Picard called CollectSamErrorMetrics; I'm thinking it would be a good idea to use it to collect error metrics in various contexts.

It might also be a good idea to collect the same error metrics on a human BAM (sequenced on the same platform). That may help us identify the relative abundance of contexts and their errors in P. falciparum vs. H. sapiens, and may give us some idea of what the various error modes actually look like.
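To make "relative abundance of contexts" concrete, here is a small sketch that tabulates k-mer frequencies in a sequence, so that two genomes can be compared context by context. It is only an illustration: the function name is hypothetical, k=3 is an arbitrary choice, and the toy input is not real genome sequence.

```python
from collections import Counter
from typing import Dict


def context_frequencies(seq: str, k: int = 3) -> Dict[str, float]:
    """Return the relative frequency of each k-mer made up only of A, C, G, T.

    Windows containing other characters (e.g. N) are skipped.
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


# Toy input: windows are AAA, AAA, AAT.
freqs = context_frequencies("AAAAT")
print(freqs)
```

Running this on each reference and taking the ratio of frequencies per context would show which contexts are over- or under-represented in one genome relative to the other.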


roamato commented on September 11, 2024

@fleharty I guess it really depends on how you measure errors (you can/will have a population of genomes in each sample): essentially, even an allele supported by a single read out of hundreds is biologically plausible. I don't know much about CollectSamErrorMetrics, though, or what model underlies it.


fleharty commented on September 11, 2024

@roamato CollectSamErrorMetrics assumes everything that is non-reference is an error. It ignores loci specified by the user that are known sites of common variation in the population, or known sites of variation in the sample you are providing. It has a genotyping model built in as well to ignore sites that appear to be het, or hom-var, but this genotyping model is only valid for diploid organisms.

CollectSamErrorMetrics allows the user to specify a wide variety of contexts. It also has ways of quantifying error rates in reads that are overlapping with their mates.
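The basic idea described above (treat every non-reference base as an error, skip known variant sites, stratify by context) can be illustrated with a toy model. This is not Picard's implementation and does not use its API; it assumes ungapped alignments given as (0-based start, read sequence) pairs, uses the preceding reference base as the only context, and all names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple


def error_rate_by_context(
    reference: str,
    reads: Iterable[Tuple[int, str]],
    known_sites: FrozenSet[int] = frozenset(),
) -> Dict[str, float]:
    """Mismatch rate per context, where context = preceding reference base.

    Every base that differs from the reference counts as an error, except at
    positions listed in known_sites (0-based), which are skipped entirely.
    """
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            # Position 0 has no preceding base; also skip anything off the end.
            if pos < 1 or pos >= len(reference) or pos in known_sites:
                continue
            context = reference[pos - 1]
            totals[context] += 1
            if base != reference[pos]:
                errors[context] += 1
    return {context: errors[context] / totals[context] for context in totals}


# Toy data: one mismatch (read 2, position 3) and one masked known site (6).
toy_reference = "ACGTACGT"
toy_reads = [(0, "ACGT"), (2, "GAAC"), (4, "ACTT")]
rates = error_rate_by_context(toy_reference, toy_reads, known_sites=frozenset({6}))
print(rates)
```

In a haploid or mixed-clone sample the choice of `known_sites` matters a great deal, since any real variant left unmasked is counted as a sequencing error.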


fleharty commented on September 11, 2024

@samuelklee Do you have enough here to start putting together truth resources and getting estimates of how good the truth data is?

I'm wondering if we can use Mendelian violations as a proxy for false positives, and use the crosses as a proxy for sensitivity.

I'd like to get an idea of what we do well with basic calling using HC or M2, and what we do poorly. Once we understand which variants we could improve upon, we can start improving the actual methods.
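As a sketch of what counting Mendelian violations in a cross might look like: assuming haploid calls (one allele per site per sample), a progeny allele that matches neither parent is flagged as a violation. The haploid assumption, the function name and the data layout are all illustrative assumptions, and the toy values are made up.

```python
from typing import Dict, Hashable, Tuple


def mendelian_violations(
    parent_a: Dict[Hashable, str],
    parent_b: Dict[Hashable, str],
    progeny: Dict[Hashable, str],
) -> Tuple[int, int]:
    """Return (sites checked, violations) for one haploid progeny sample.

    A site is checked only if it is called in both parents and the progeny.
    A violation is a progeny allele that matches neither parental allele.
    """
    checked = violations = 0
    for site, allele in progeny.items():
        if site not in parent_a or site not in parent_b:
            continue
        checked += 1
        if allele not in (parent_a[site], parent_b[site]):
            violations += 1
    return checked, violations


# Toy calls keyed by an arbitrary site identifier.
toy_parent_a = {1: "A", 2: "C", 3: "G"}
toy_parent_b = {1: "A", 2: "T", 3: "G"}
toy_progeny = {1: "A", 2: "T", 3: "T", 4: "C"}

print(mendelian_violations(toy_parent_a, toy_parent_b, toy_progeny))  # (3, 1)
```

A violation can come from an error in a parent call as well as in the progeny call, so the violation rate is only a rough proxy for the false positive rate.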


samuelklee commented on September 11, 2024

@fleharty if we want a concrete action item to close this issue, perhaps let's open a PR that cleanly organizes the above information in the form of a doc? This can later be incorporated into specs/docs for our validations (at which point, we can debate on the specifics of using Mendelian violations, concordance, etc.). Want to take a first crack at it?

I think we still need to gather pointers in a few categories (e.g., matched Amplicon/WGS samples) and might need to do some work to identify samples to build test cohorts of various sizes.


fleharty commented on September 11, 2024

This sounds like a good idea. I've created a specifications document (it has hardly anything in it so far)
https://docs.google.com/document/d/1puC4n9S7aNOoxgRdzEDcc0b60Ooe21oLGwud4fW7_Gw/edit

Did you mean open a PR as in pull request? I'm not sure why we would do that.
There are a number


samuelklee commented on September 11, 2024

Yes, I think it would be best to commit docs for data/resources, as well as specs/docs for both pipelines and their corresponding validations. The repo wiki might be another option for the former, but it looks like there is already a directory for the latter. We can use Google Docs for staging and initial collaborative editing, though, if you like.


podpearson commented on September 11, 2024

@fleharty thanks for starting the google doc. Would you be OK giving me access to this? I've just made a request.


fleharty commented on September 11, 2024

@podpearson You should have access to it now.


alimanfoo commented on September 11, 2024

Closing as inactive but feel free to reopen if still useful.

