Coder Social home page Coder Social logo

Comments (10)

gbggrant avatar gbggrant commented on September 11, 2024 1

I have downloaded the results for the ag1000g-phase3-validation data set, as run through the ShortReadAlignmentAndGenotyping pipeline to /lustre/scratch118/malaria/team112/personal/gg18/ag1000g-phase3-validation/results/gcp- these were run on the google cloud platform and I have downloaded them to lustre. I ran the metadata for the 10 workflows through a script we use to calculate cost added the output to this spreadsheet https://docs.google.com/spreadsheets/d/1jw6aqHP4lqyCYvodM8GEhAQOu1O5SkAeCpxtGzMcxx8/edit?usp=sharing. The spreadsheet also contains the run times for the tasks (FixMateInformation, IndelRealigner, UnifiedGenotyper, and VcfToZarr are the most time consuming). The cost summary is not so useful (could be lower) as I've been running these tasks w/o preemption which definitely ups the costs.

from pipelines.

alimanfoo avatar alimanfoo commented on September 11, 2024 1

Hi @tnguyensanger, @seretol, I'm sure you have this covered already, but just to say that given you are comparing data from a single sample, you can probably just bring all data into memory and use numpy, rather than using dask, if that is more convenient. E.g., something like the following should do it:

import zarr
import allel
import numpy as np
sample = 'ABXXXX-C'  # sample identifier
callset1 = zarr.open(f'/path/to/legacy/{sample}.zarr.zip')
callset2 = zarr.open(f'/path/to/new/{sample}.zarr.zip')
chrom = '2L'
gt1 = allel.GenotypeVector(callset1[sample][chrom]['calldata/GT'][:, 0])
gt2 = allel.GenotypeVector(callset2[sample][chrom]['calldata/GT'][:, 0])
# optionally apply site filters here
gt1f = gt1.compress(filter_pass, axis=0)
gt2f = gt2.compress(filter_pass, axis=0)
n = gt1f.shape[0]
# locate genotypes
x1 = [gt1f.is_missing(), gt1f.is_hom_ref(), gt1f.is_het(), gt1f.is_hom_alt()]
x2 = [gt2f.is_missing(), gt2f.is_hom_ref(), gt2f.is_het(), gt2f.is_hom_alt()]
# build contingency table of comparisons
contingency_table = np.zeros((4, 4), dtype=float)
for i, t1 in enumerate(x1):
    for j, t2 in enumerate(x2):
        contingency_table[i, j] = np.count_nonzero(t1 & t2) / n

from pipelines.

alimanfoo avatar alimanfoo commented on September 11, 2024 1

Hi @tnguyensanger

To clarify, are the pass filter sites the same as the accessiblity map?

Yes, that's right. We don't have separate accessibility maps and variant filters any more, we just define a filter over all genomic sites. We tend to call this the "site filter" now to reflect that.

As an aside, we actually have three separate site filters for different combinations of mosquito species. However, the one I pointed to above is the "gamb_colu_arab" filter which is designed to work across all three species, and so which should be the best one to use for this task.

from pipelines.

tnguyensanger avatar tnguyensanger commented on September 11, 2024

@seretol

from pipelines.

tnguyensanger avatar tnguyensanger commented on September 11, 2024

@jessicaway Hi Jessica. Do you know where we can find the genotyped zarrs produced by the Malariagen vector genotyping WDLs? Or alternatively, the cromwell job IDs whose output we should be checking?

Thanks!
Thuy

from pipelines.

jessicaway avatar jessicaway commented on September 11, 2024

Thanks @gbggrant!

from pipelines.

alimanfoo avatar alimanfoo commented on September 11, 2024

Hi @tnguyensanger, @seretol, here's a short snippet of code that will load the site filter array for a given chromosome arm, e.g., 2L:

import intake
cat = intake.open_catalog('https://malariagen.github.io/intake/gcs.yml')
site_filters = cat.ag3.site_filters_dt_20200416_gamb_colu_arab.to_zarr()
filter_pass = site_filters['2L/variants/filter_pass'][:]

This filter_pass array should be the same length as the genotype arrays you are comparing. So, e.g., if gt1 is an array of genotypes from the legacy pipeline, and gt2 is an array of genotypes from the new pipeline, you can apply this filter with:

assert filter_pass.shape[0] == gt1.shape[0] == gt2.shape[0]
gt1_filtered = gt1.compress(filter_pass, axis=0)
gt2_filtered = gt2.compress(filter_pass, axis=0)
# now compare gt1_filtered with gt2_filtered

Note that the code at the top will load the site filters data directly from google cloud. This will take a few seconds, but the data are not large, so should be OK even if you are running from a computer outside google cloud. But if you would prefer to make a local copy let me know, I can give an example how to do that.

Hth.

from pipelines.

alimanfoo avatar alimanfoo commented on September 11, 2024

Hi @tnguyensanger, @seretol, just to say I corrected an error in the comment above so please use latest version.

from pipelines.

tnguyensanger avatar tnguyensanger commented on September 11, 2024

@alimanfoo

import intake
cat = intake.open_catalog('https://malariagen.github.io/intake/gcs.yml')
site_filters = cat.ag3.site_filters_dt_20200416_gamb_colu_arab.to_zarr()

To clarify, are the pass filter sites the same as the accessiblity map? Or is it more stringent?

from pipelines.

tnguyensanger avatar tnguyensanger commented on September 11, 2024

@alimanfoo

Do the pass filter sites only include biallelic sites?

from pipelines.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.