Umbrella issue to communicate re development of the <a href="https://github.com/malari

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Mosquito phasing pipeline development,about malariagen/pipelines

Comments (22)

alimanfoo commented on September 11, 2024

Hi @jessicaway, here's some info regarding the set of sites and alleles we want to phase at. For convenience I've uploaded all of the sites VCFs to a public GCS bucket on the MalariaGEN account - feel free to copy these over when you have a bucket created for pipeline development work on your side.

The files are all here:

https://console.cloud.google.com/storage/browser/malariagen-pipelines-dev/ag3-phasing-20200918

I have made one VCF file for each chromosome arm with sites and alleles to be phased. (There are also three species combinations we'll ultimately want to run, but you can ignore that detail for the moment.) For example, the sites for chromosome arm 3L (all species) are here:

https://storage.googleapis.com/malariagen-pipelines-dev/ag3-phasing-20200918/ag3_phased_sites_gamb_colu_arab_3L.vcf.gz

Here's a snippet of that file, if it's helpful:

$ zcat ag3_phased_sites_gamb_colu_arab_3L.vcf.gz | head -n15
##fileformat=VCFv4.1
##fileDate=20200902
##source=scikit-allel-1.2.1
##FILTER=<ID=PASS,Description="">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
3L      20244   .       A       C       .       PASS    .
3L      20254   .       T       G       .       PASS    .
3L      20264   .       T       G       .       PASS    .
3L      20277   .       T       C       .       PASS    .
3L      20279   .       T       C       .       PASS    .
3L      20294   .       T       C       .       PASS    .
3L      20303   .       G       A       .       PASS    .
3L      20306   .       A       T       .       PASS    .
3L      20309   .       C       T       .       PASS    .
3L      20325   .       T       A       .       PASS    .

I.e., there are a set of sites, and a single REF and ALT allele for each site.

Hope that makes sense, give me a shout via slack if any questions.

cc @kbergin, @puethe

from pipelines.

jessicaway commented on September 11, 2024

@alimanfoo Thanks so much, this is very helpful!!

from pipelines.

alimanfoo commented on September 11, 2024

Hi @jessicaway, thanks so much for the question re selecting variants on Slack, apologies for slow reply. I thought I'd surface and respond here if that's OK. Your question:

Subsetting a VCF for WhatsHap processing:

Starting with:

An input VCF. The ref is defined for each location and the alt includes all 3 remaining possible alleles
An “interval” list (in VCF format) that defines a single ref allele and a single alt allele for a subset of positions

Need to subset as follows:

Subset the original VCF to have only the variants from the interval list
Reduce alternate alleles to a single alternate that matches the alt allele in the interval list and recode alleles to be 0/1 encoded (as opposed to 0/2 etc.)

Additional requirements:

Must include a single alt allele (even if the genotype is homozygous reference or missing)
If more than 1 alternate is found, keep the alt that matches the interval list and recode the genotype as missing
If a sample includes an alternate allele that differs from interval list, record the alt to match the interval list and change the genotype to missing

E.g., input VCF looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AV0148-C
2R  1   .   A   C,T,G   .   .   .   GT  ./.
2R  2   .   C   A,T,G   0.16    .   .   GT  0/0
2R  3   .   C   A,T,G   0.16    .   .   GT  0/1
2R  4   .   G   A,C,T   0.16    .   .   GT  1/1
2R  5   .   A   C,T,G   0.16    .   .   GT  0/2
2R  6   .   T   A,C,G   0.16    .   .   GT  2/2
2R  7   .   G   A,C,T   0.16    .   .   GT  0/3
2R  8   .   T   A,C,G   0.16    .   .   GT  3/3
2R  9   .   C   A,T,G   0.16    .   .   GT  1/3 * We do not expect to get two alterenates like this
2R  10  .   A   C,T,G   0.16    .   .   GT  0/3 * Notice the alternate here differs from that in the interval list

Interval list VCF looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
2R  1   .   A   G   .   PASS    .
2R  2   .   C   A   .   PASS    .
2R  3   .   C   A   .   PASS    .
2R  4   .   G   A   .   PASS    .
2R  7   .   G   C   .   PASS    .
2R  8   .   T   G   .   PASS    .
2R  9   .   C   A   .   PASS    .
2R  10  .   A   T   .   PASS    .

After running select variants (gatk SelectVariants -R Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa -V tiny_sample.vcf -O tiny_sample.subset.vcf --remove-unused-alternates true --intervals tiny_intervals.vcf ), subset VCF looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AV0148-C
2R  1   .   A   .   .   .   .    GT  ./.
2R  2   .   C   .   0.16    .   .  GT    0/0
2R  3   .   C   A   0.16    .   .    GT  0/1
2R  4   .   G   A   0.16    .   . GT  1/1
2R  7   .   G   T   0.16    .   .    GT  0/1
2R  8   .   T   G   0.16    .   . GT  1/1
2R  9   .   C   A,G 0.16    .   .    GT  1/2
2R  10  .   A   G   0.16    .   .    GT  0/1

Subset VCF needs to look like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AV0148-C
2R  1   .   A   G   .   .   .    GT  ./.
2R  2   .   C   A   0.16    .   .  GT    0/0
2R  3   .   C   A   0.16    .   .    GT  0/1
2R  4   .   G   A   0.16    .   . GT  1/1
2R  7   .   G   T   0.16    .   .    GT  0/1
2R  8   .   T   G   0.16    .   . GT  1/1
2R  9   .   C   A   0.16    .   .    GT  ./.
2R  10  .   A   T   0.16    .   .    GT  ./.

(Note the differences in the way that positions 1, 2, 9, & 10 are processed)

from pipelines.

alimanfoo commented on September 11, 2024

Hi @jessicaway, just to say your example above looks right to me, thanks again. @hardingnj are you happy with how we're proposing to do this?

from pipelines.

alimanfoo commented on September 11, 2024

Noticed a small correction at position 7.

Subset VCF needs to look like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AV0148-C
2R  1   .   A   G   .   .   .    GT  ./.
2R  2   .   C   A   0.16    .   .  GT    0/0
2R  3   .   C   A   0.16    .   .    GT  0/1
2R  4   .   G   A   0.16    .   . GT  1/1
2R  7   .   G   C   0.16    .   .    GT  ./.
2R  8   .   T   G   0.16    .   . GT  1/1
2R  9   .   C   A   0.16    .   .    GT  ./.
2R  10  .   A   T   0.16    .   .    GT  ./.

(Note the differences in the way that positions 1, 2, 7, 9, & 10 are processed)

from pipelines.

hardingnj commented on September 11, 2024

Sorry, missed this while on leave.

Looks good. Couple of Qs:

Perhaps worth making explicit that the input VCF is single sample?

If more than 1 alternate is found, keep the alt that matches the interval list and recode the genotype as missing

As in variant 9?

If a sample includes an alternate allele that differs from interval list, record the alt to match the interval list and change the genotype to missing

Slight typo? Record > recode?

The only concern I have is around a situation like variant 7. Which means that the output file will change based on the genotype of the sample? I don't think this works when you recombine all the output VCFs, as the ALT won't be consistent.

Ah- now I see your comment above. Maybe

If more than 1 alternate is found, keep the alt that matches the interval list and recode the genotype as missing

could be

If an alternate other than that in the interval list is called report the alt that matches the interval list and recode the genotype as missing.

The tool I mentioned in the last call was https://github.com/vcflib/vcflib#vcfbreakmulti. Which I think may be useful depending on how it handles ALT1/ALT2 genotypes.

from pipelines.

jessicaway commented on September 11, 2024

Thanks for taking a look at this! And linking to the vcflib tool

Noticed a small correction at position 7

Good catch! You are correct, variant 7 should have been recoded as missing in the example (in the same way that variant 10 was).

Slight typo? Record > recode?

Yes, I will correct this as well as the preferred wording for the variant 9 example

Please see the updates included below

from pipelines.

jessicaway commented on September 11, 2024

Subsetting a VCF for WhatsHap processing:

Starting with:

A single sample input VCF. The ref is defined for each location and the alt includes all 3 remaining possible alleles
An “interval” list (in VCF format) that defines a single ref allele and a single alt allele for a subset of positions

Need to subset as follows:

Subset the original VCF to have only the variants from the interval list
Reduce alternate alleles to a single alternate that matches the alt allele in the interval list and recode alleles to be 0/1 encoded (as opposed to 0/2 etc.)

Additional requirements:

Must include a single alt allele (even if the genotype is homozygous reference or missing)
If an alternate other than that in the interval list is called report the alt that matches the interval list and recode the genotype as missing
If a sample includes an alternate allele that differs from interval list, recode the alt to match the interval list and change the genotype to missing

E.g., input VCF looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AV0148-C
2R  1   .   A   C,T,G   .   .   .   GT  ./.
2R  2   .   C   A,T,G   0.16    .   .   GT  0/0
2R  3   .   C   A,T,G   0.16    .   .   GT  0/1
2R  4   .   G   A,C,T   0.16    .   .   GT  1/1
2R  5   .   A   C,T,G   0.16    .   .   GT  0/2
2R  6   .   T   A,C,G   0.16    .   .   GT  2/2
2R  7   .   G   A,C,T   0.16    .   .   GT  0/3
2R  8   .   T   A,C,G   0.16    .   .   GT  3/3
2R  9   .   C   A,T,G   0.16    .   .   GT  1/3 * We do not expect to get two alternates like this
2R  10  .   A   C,T,G   0.16    .   .   GT  0/3 * Notice the alternate here differs from that in the interval list

Interval list VCF looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
2R  1   .   A   G   .   PASS    .
2R  2   .   C   A   .   PASS    .
2R  3   .   C   A   .   PASS    .
2R  4   .   G   A   .   PASS    .
2R  7   .   G   C   .   PASS    .
2R  8   .   T   G   .   PASS    .
2R  9   .   C   A   .   PASS    .
2R  10  .   A   T   .   PASS    .

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AV0148-C
2R  1   .   A   .   .   .   .    GT  ./.
2R  2   .   C   .   0.16    .   .  GT    0/0
2R  3   .   C   A   0.16    .   .    GT  0/1
2R  4   .   G   A   0.16    .   . GT  1/1
2R  7   .   G   T   0.16    .   .    GT  0/1
2R  8   .   T   G   0.16    .   . GT  1/1
2R  9   .   C   A,G 0.16    .   .    GT  1/2
2R  10  .   A   G   0.16    .   .    GT  0/1

Subset VCF needs to look like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AV0148-C
2R  1   .   A   G   .   .   .    GT  ./.
2R  2   .   C   A   0.16    .   .  GT    0/0
2R  3   .   C   A   0.16    .   .    GT  0/1
2R  4   .   G   A   0.16    .   . GT  1/1
2R  7   .   G   C   0.16    .   .    GT  ./.
2R  8   .   T   G   0.16    .   . GT  1/1
2R  9   .   C   A   0.16    .   .    GT  ./.
2R  10  .   A   T   0.16    .   .    GT  ./.

(Note the differences in the way that positions 1, 2, 7, 9, & 10 are processed)

from pipelines.

alimanfoo commented on September 11, 2024

Hi @hardingnj,

The tool I mentioned in the last call was https://github.com/vcflib/vcflib#vcfbreakmulti. Which I think may be useful depending on how it handles ALT1/ALT2 genotypes.

I gave this some thought and breaking multiallelics into multiple variants like this could in principle provide a route to phasing of multiallelics. There are a couple of potential issues though with that approach, so I'm thinking for now we would do better to stick to a strict biallelic approach, with genotypes recoded as missing if they contain an "unknown" (i.e., unexpected) allele.

Ultimately it would be better I think if the phasing tools just supported multiallelic variants (xref whatshap/whatshap#128) so I think we'd do better to push for that rather than go down the route of breaking multiallelics into multiple variants.

For the record, there are two main issues I think with breakmulti approach.

First is that we'd have to break every SNP into three variants, to handle all possible alternate alleles, and to avoid introducing a reference bias. That will treble the number of variants that need phasing. That may not be a serious problem, but will have some impact.

The second issue is that it becomes quite tricky to actually figure out exactly what allele is carried by a given haplotype at a given SNP. E.g., if you want to know in some downstream analysis what allele is at the SNP on 3L position 10,000, you have to examine three records in the data, and you have to figure out that the sequence (0, 0, 0) means a reference allele, (1, 0, 0) means the first alt, (0, 1, 0) means the second alt, and (0, 0, 1) means the third alt. We could of course write some code to reverse breakmulti and repack the data back into multiallelic SNPs, that would be an option. But there is some complexity to handle there.

So that's why I'm thinking we stick to biallelics for now. We can always revisit this later. If we did decide to try breakmulti at some point down the road then it would only require a change to the first step in the phasing pipeline, all subsequent steps should stay the same.

from pipelines.

alimanfoo commented on September 11, 2024

Just to add, because we've ascertained biallelic SNPs across a reasonably large cohort with representation of multiple species and geographies (Ag1000G phase 3), hopefully it should be relatively rare that we encounter unexpected alleles when we phase new samples.

from pipelines.

hardingnj commented on September 11, 2024

I gave this some thought and breaking multiallelics into multiple variants like this could in principle provide a route to phasing of multiallelics. There are a couple of potential issues though with that approach, so I'm thinking for now we would do better to stick to a strict biallelic approach, with genotypes recoded as missing if they contain an "unknown" (i.e., unexpected) allele.

We might be talking at cross purposes here. I could have communicated better- apologies!

I was suggesting that vcfbreakmulti might provide a path to move from single sample VCF + interval list, to the output VCF listed above.

The tool will split all SNPs into 3 rows, with missing genotypes where the ALT does not correspond to the genotype. Then a tool such as GATK SelectVariants can discard all rows that do not match the ALT described in the interval list. Which I think should get us a reasonable way towards the desired output?

from pipelines.

hardingnj commented on September 11, 2024

Though it seems this is solved by #50 (comment)

from pipelines.

alimanfoo commented on September 11, 2024

We might be talking at cross purposes here. I could have communicated better- apologies!

I was suggesting that vcfbreakmulti might provide a path to move from single sample VCF + interval list, to the output VCF listed above.

The tool will split all SNPs into 3 rows, with missing genotypes where the ALT does not correspond to the genotype. Then a tool such as GATK SelectVariants can discard all rows that do not match the ALT described in the interval list. Which I think should get us a reasonable way towards the desired output?

My bad, I had assumed that vcfbreakmulti would behave differently. I had assumed that a genotype like 0/3 would get broken into three genotypes 0/0, 0/0 and 0/1, and I was concerned this could lead to a reference bias. However I should've actually tested what vcfbreakmulti is doing, because as you say it inserts a missing allele if there is an allele present in the call that is not included in the ALT list. E.g., given this input:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	AV0148-C
2R	1	.	A	C,T,G	.	.	.	GT	./.
2R	2	.	C	A,T,G	0.16	.	.	GT	0/0
2R	3	.	C	A,T,G	0.16	.	.	GT	0/1
2R	4	.	G	A,C,T	0.16	.	.	GT	1/1
2R	5	.	A	C,T,G	0.16	.	.	GT	0/2
2R	6	.	T	A,C,G	0.16	.	.	GT	2/2
2R	7	.	G	A,C,T	0.16	.	.	GT	0/3
2R	8	.	T	A,C,G	0.16	.	.	GT	3/3
2R	9	.	C	A,T,G	0.16	.	.	GT	1/3
2R	10	.	A	C,T,G	0.16	.	.	GT	0/3

...vcfbreakmulti generates this:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	AV0148-C
2R	1	.	A	C	0	.	.	GT	./.
2R	1	.	A	T	0	.	.	GT	./.
2R	1	.	A	G	0	.	.	GT	./.
2R	2	.	C	A	0.16	.	.	GT	0/0
2R	2	.	C	T	0.16	.	.	GT	0/0
2R	2	.	C	G	0.16	.	.	GT	0/0
2R	3	.	C	A	0.16	.	.	GT	0/1
2R	3	.	C	T	0.16	.	.	GT	./0
2R	3	.	C	G	0.16	.	.	GT	./0
2R	4	.	G	A	0.16	.	.	GT	1/1
2R	4	.	G	C	0.16	.	.	GT	./.
2R	4	.	G	T	0.16	.	.	GT	./.
2R	5	.	A	C	0.16	.	.	GT	./0
2R	5	.	A	T	0.16	.	.	GT	0/1
2R	5	.	A	G	0.16	.	.	GT	./0
2R	6	.	T	A	0.16	.	.	GT	./.
2R	6	.	T	C	0.16	.	.	GT	1/1
2R	6	.	T	G	0.16	.	.	GT	./.
2R	7	.	G	A	0.16	.	.	GT	./0
2R	7	.	G	C	0.16	.	.	GT	./0
2R	7	.	G	T	0.16	.	.	GT	0/1
2R	8	.	T	A	0.16	.	.	GT	./.
2R	8	.	T	C	0.16	.	.	GT	./.
2R	8	.	T	G	0.16	.	.	GT	1/1
2R	9	.	C	A	0.16	.	.	GT	./1
2R	9	.	C	T	0.16	.	.	GT	./.
2R	9	.	C	G	0.16	.	.	GT	./1
2R	10	.	A	C	0.16	.	.	GT	./0
2R	10	.	A	T	0.16	.	.	GT	./0
2R	10	.	A	G	0.16	.	.	GT	0/1

You can see you end up with half-missing genotype calls in several places.

Bottom line I think you're right, vcfbreakmulti followed by SelectVariants would be a viable option. How whatshap would then deal with half-missing genotypes is then an interesting question.

For simplicity I suggest we go with my script for now, but keep this option in our back pocket.

from pipelines.

alimanfoo commented on September 11, 2024

Hi @jessicaway, for the merging VCFs step in the statistical phasing pipeline, three possible tools I'm aware of are vcflib's vcfcombine, bcftools merge and Picard MergeVcfs (I think, although docs are a bit confusing).

from pipelines.

jessicaway commented on September 11, 2024

Great! Thanks for providing those links! I will take a look at those tools

from pipelines.

alimanfoo commented on September 11, 2024

As a side note, scanning through the bcftools docs I noticed that a different command bcftools concat, which concatenates VCFs from the same sample(s) but different genome regions, has a --ligate option which will ligate phased VCFs by matching phase at overlapping haplotypes. This might be useful to know for the statistical phasing pipeline, because it might allow you to run shapeit4 in parallel for different genome regions (with some overlap at the boundaries) then stitch them back together with bcftools concat. Obviously whether we need to do that will depend on the resource requirements and runtime, but might be useful to have in the back pocket.

from pipelines.

jessicaway commented on September 11, 2024

That is definitely a good tool to keep in mind, thanks!

from pipelines.

alimanfoo commented on September 11, 2024

Hi @gbggrant, just following up our conversation on Wednesday about selecting a set of samples to use for scientific validation of the phasing pipeline.

As I mentioned it would be ideal to use a set of samples that we previously phased using the old phasing pipeline as part of Ag1000G phase 2 (sorry two different senses of the word "phase" here, hope that doesn't confuse). Although the haplotypes will not be identical as from the previous pipeline, I will be able to run some population genetic analyses and see if some of the biological signals that we use the haplotypes to detect are still present with the output of the new pipeline.

There is a tab-delimited file here which lists all the samples that were included in Ag1000G phase 2: ftp://ngs.sanger.ac.uk/production/ag1000g/phase2/AR1/samples/samples.meta.txt

The "ox_code" column has the sample identifier. I would suggest to use all of the samples from Burkina Faso. You should be able to identify these by looking at the "country" column. There are 150 samples from Burkina Faso in total.

VCFs for all of these samples that you can use as input should be symlinked from this path at Sanger, files named by sample ID:

/lustre/scratch118/malaria/team112/pipelines/setups/vo_agam/symlink/vo_agam_unifiedgenotyper/gatk_unified_genotyper_gatk3_v3

BAMs are symlinked from here:

/lustre/scratch118/malaria/team112/pipelines/setups/vo_agam/symlink/vo_agam_indelrealign/bam_fix_mates_v2

Hope that makes sense, give me a shout on slack if any problems.

from pipelines.

alimanfoo commented on September 11, 2024

Oh and just to add, running just one chromosome would sufficient for validation. If I had to pick one I'd go for 3R. 2R or 2L would also be fine. Don't do 3L, it's the most boring chromosome arm, not many interesting signals 😄. Also suggest don't do X as it's hemizygous in males so needs special handling.

from pipelines.

alimanfoo commented on September 11, 2024

Hi @gbggrant, BAM index (.bai) files are here:

/lustre/scratch118/malaria/team112/pipelines/setups/vo_agam/symlink/vo_agam_indelrealign/bam_index_v2

Zipped zarr files are here:

/lustre/scratch118/malaria/team112/pipelines/setups/vo_agam/symlink/vo_agam_vcf2zarr/convert_vcf_to_zarr_serial_contig

from pipelines.

alimanfoo commented on September 11, 2024

See #73 for results of scientific validation of the pipeline implementation. TL;DR results look excellent, we can go ahead with work to put this pipeline into production. Very exciting.

from pipelines.

alimanfoo commented on September 11, 2024

Closing as completed 🌸

from pipelines.

Mosquito phasing pipeline development about pipelines HOT 22 CLOSED

Comments (22)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent