Check that genotypes from new WDL pipelines are >99% similar to genotypes from lega

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Genotype Concordance Validation of genotyping pipelines about pipelines HOT 10 CLOSED

malariagen commented on September 11, 2024

Genotype Concordance Validation of genotyping pipelines

from pipelines.

Comments (10)

gbggrant commented on September 11, 2024 1

I have downloaded the results for the ag1000g-phase3-validation data set, as run through the ShortReadAlignmentAndGenotyping pipeline to /lustre/scratch118/malaria/team112/personal/gg18/ag1000g-phase3-validation/results/gcp- these were run on the google cloud platform and I have downloaded them to lustre. I ran the metadata for the 10 workflows through a script we use to calculate cost added the output to this spreadsheet https://docs.google.com/spreadsheets/d/1jw6aqHP4lqyCYvodM8GEhAQOu1O5SkAeCpxtGzMcxx8/edit?usp=sharing. The spreadsheet also contains the run times for the tasks (FixMateInformation, IndelRealigner, UnifiedGenotyper, and VcfToZarr are the most time consuming). The cost summary is not so useful (could be lower) as I've been running these tasks w/o preemption which definitely ups the costs.

from pipelines.

alimanfoo commented on September 11, 2024 1

Hi @tnguyensanger, @seretol, I'm sure you have this covered already, but just to say that given you are comparing data from a single sample, you can probably just bring all data into memory and use numpy, rather than using dask, if that is more convenient. E.g., something like the following should do it:

import zarr
import allel
import numpy as np
sample = 'ABXXXX-C'  # sample identifier
callset1 = zarr.open(f'/path/to/legacy/{sample}.zarr.zip')
callset2 = zarr.open(f'/path/to/new/{sample}.zarr.zip')
chrom = '2L'
gt1 = allel.GenotypeVector(callset1[sample][chrom]['calldata/GT'][:, 0])
gt2 = allel.GenotypeVector(callset2[sample][chrom]['calldata/GT'][:, 0])
# optionally apply site filters here
gt1f = gt1.compress(filter_pass, axis=0)
gt2f = gt2.compress(filter_pass, axis=0)
n = gt1f.shape[0]
# locate genotypes
x1 = [gt1f.is_missing(), gt1f.is_hom_ref(), gt1f.is_het(), gt1f.is_hom_alt()]
x2 = [gt2f.is_missing(), gt2f.is_hom_ref(), gt2f.is_het(), gt2f.is_hom_alt()]
# build contingency table of comparisons
contingency_table = np.zeros((4, 4), dtype=float)
for i, t1 in enumerate(x1):
    for j, t2 in enumerate(x2):
        contingency_table[i, j] = np.count_nonzero(t1 & t2) / n

from pipelines.

alimanfoo commented on September 11, 2024 1

Hi @tnguyensanger

To clarify, are the pass filter sites the same as the accessiblity map?

Yes, that's right. We don't have separate accessibility maps and variant filters any more, we just define a filter over all genomic sites. We tend to call this the "site filter" now to reflect that.

As an aside, we actually have three separate site filters for different combinations of mosquito species. However, the one I pointed to above is the "gamb_colu_arab" filter which is designed to work across all three species, and so which should be the best one to use for this task.

from pipelines.

tnguyensanger commented on September 11, 2024

@seretol

from pipelines.

tnguyensanger commented on September 11, 2024

@jessicaway Hi Jessica. Do you know where we can find the genotyped zarrs produced by the Malariagen vector genotyping WDLs? Or alternatively, the cromwell job IDs whose output we should be checking?

Thanks!
Thuy

from pipelines.

jessicaway commented on September 11, 2024

Thanks @gbggrant!

from pipelines.

alimanfoo commented on September 11, 2024

Hi @tnguyensanger, @seretol, here's a short snippet of code that will load the site filter array for a given chromosome arm, e.g., 2L:

import intake
cat = intake.open_catalog('https://malariagen.github.io/intake/gcs.yml')
site_filters = cat.ag3.site_filters_dt_20200416_gamb_colu_arab.to_zarr()
filter_pass = site_filters['2L/variants/filter_pass'][:]

This filter_pass array should be the same length as the genotype arrays you are comparing. So, e.g., if gt1 is an array of genotypes from the legacy pipeline, and gt2 is an array of genotypes from the new pipeline, you can apply this filter with:

assert filter_pass.shape[0] == gt1.shape[0] == gt2.shape[0]
gt1_filtered = gt1.compress(filter_pass, axis=0)
gt2_filtered = gt2.compress(filter_pass, axis=0)
# now compare gt1_filtered with gt2_filtered

Note that the code at the top will load the site filters data directly from google cloud. This will take a few seconds, but the data are not large, so should be OK even if you are running from a computer outside google cloud. But if you would prefer to make a local copy let me know, I can give an example how to do that.

Hth.

from pipelines.

alimanfoo commented on September 11, 2024

Hi @tnguyensanger, @seretol, just to say I corrected an error in the comment above so please use latest version.

from pipelines.

tnguyensanger commented on September 11, 2024

@alimanfoo

import intake
cat = intake.open_catalog('https://malariagen.github.io/intake/gcs.yml')
site_filters = cat.ag3.site_filters_dt_20200416_gamb_colu_arab.to_zarr()

To clarify, are the pass filter sites the same as the accessiblity map? Or is it more stringent?

from pipelines.

tnguyensanger commented on September 11, 2024

@alimanfoo

Do the pass filter sites only include biallelic sites?

from pipelines.

Genotype Concordance Validation of genotyping pipelines about pipelines HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent