mrcieu / gwas2vcf Goto Github PK

View Code? Open in Web Editor NEW

46.0 16.0 17.0 5.55 MB

Convert GWAS summary statistics to VCF

License: MIT License

Dockerfile 0.72% Python 99.28%

gwas bioinformatics vcf converter summary-statistics

gwas2vcf's Introduction

Convert GWAS summary statistics to VCF

Tool to map GWAS summary statistics to VCF with on-the-fly harmonisation to a supplied reference FASTA

Produces GWAS-VCF with version 1.0 of the specification

Documentation

Full documentation available from https://mrcieu.github.io/gwas2vcf

GWAS summary statistics

Complete GWAS summary statistics in GWAS-VCF are available on >14,000 datasets from the OpenGWAS project

Tutorials

What can I do with GWAS-VCF?

Command-line operations (filter, annotate, convert, liftover, merge, validate, pathway, gene)
Integration with other tools (clumping, finemapping, LDSC, colocalization, Mendelian randomization, COJO)
Parse (R, Python3)

Let us know if you have other use cases through the issues page!

Workflow

Citation

Lyon M, Andrews S, Elsworth B, Gaunt T, Hemani G, Marcora E. The variant call format provides efficient and robust storage of GWAS summary statistics. Genome Biol 22, 32 (2021). https://doi.org/10.1186/s13059-020-02248-0
Elsworth B, Lyon M, Alexander T, Liu Y, Matthews P, Hallett J, Bates P, Palmer T, Haberland V, Davey Smith G, Zheng J, Haycock P, Gaunt TR, Hemani G. The MRC IEU OpenGWAS data infrastructure. bioRxiv, p. 2020.08.10.244293, Aug. 2020. https://doi.org/10.1101/2020.08.10.244293

Please also cite the relevant tool(s) and data source if you use GWAS-VCF for downstream analyses.

gwas2vcf's People

Contributors

Stargazers

Watchers

Forkers

sdhutchins titorat ashwinas88 selfhacked vforget darked89 karatugo selfdecode vreuter wangdi2014 tyang20 stjordanis ht-diva nicksunderland pedrummohammadishemirani jeantristanb

gwas2vcf's Issues

alias file

Hi,
I think your alias.txt contains a mistake. You list:
25 MT
However, at least in PLINK, only 26 is MT, 25 is XY.
By the way, it would be great if the documentation mentioned that your hg19 and hg38 files use different chromosome codes.
Thanks for looking into this,
Till

build and push docker images to dockerhub

VCF cannot store floats < 1e-6

@explodecomputer @elswob We already encountered this issue for storing p-values but there are some cases where the effect size is less than 1e-6 in UKBB. They would end up as 0 in the VCF file. Should we log transform or leave as is?

deploy as web app for processing data

Update documentation for running on HPC and Docker

Do not use empty fields, instead do not emit keys

add static value for sample size and CC

implement travis ci

should use case and sample size to avoid ambiguity

create flow diagram of operation

benchmarks and speedups

Hello,

I have completed TSV to VCF transformation of one Finngen GWAS summary file as a test case.
The gwas2vcf was run using Singularity container using:

finngen_R5_AB1_AMOEBIASIS.tsv (16380388 lines)
dbSNP 155
GRCh38 fasta
Singularity 3.7.0
Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz

Both genomic fasta and dbSNP VCF had chromosome ids in the same 1-22,X,Y,MT format, were indexed etc.

the output VCF format:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  amoeb01
1       108391  rs1274919517    A       G       .       PASS    .       ES:SE:LP:ID     0.3653:1.9411:0.0702236:rs1274919517

This was executed in 582.99 mins
usr time 577.51 mins sys time 5.38 mins

My questions

is this in line with running times you observed or 10hrs /file is a fluke (== too slow)?
would it be possible to speed up processing?

Thank you

Darek Kedra

implement variant normalisation

implement VR package

update to new schema

update docs

creating package and app from single repo

Sample sizes formatted to only 6 significant digits

Sample sizes (the SS or NC records) greater than 1 million get truncated to six significant digits and formatted using E notation. For example, the number of cases "7654321" gets formatted as "7.65432e+06".

add QC report to docs

use temp file for processing data rather than mem

inf in estimate field

rs2299979 @ /mnt/storage/private/mrcieu/data/IGD/data/public/eqtl-a-ENSG00000100985/eqtl-a-ENSG00000100985.vcf.gz

reference genome base checking

Hello,

gwas2vcf seems to be dropping (BTW, correct behavior) lines where TSV input REF base does not match the genomic fasta base at that position. By accident (BED file with chromosomes starting with 1 combined with pyfaidx) I noticed that the greatly reduced in size VCF result is still produced by gwas2vcf.

Would it be possible to add some extra check or big fat warning that the number of the dropped positions exceeds some arbitrary limit?
I was lucky enough to have an intact VCF from the previous gwas2vcf run using not corrupted all contigs genomic fasta so the difference was obvious.

Kind regards,

Darek Kedra

different naming schemes in dbsnp.v153.b37.vcf.gz vs dbsnp.v153.hg38.vcf.gz

Hello,

using two input TSV files with the same chromosome naming scheme (1, 2, etc.)

example.1k.txt (run with: human_g1k_v37.fasta & dbsnp.v153.b37.vcf.gz
a subset of finngen_R5_D3_SARCOIDOSIS.tsv (header & SNPs for chromosome 22 only) run using: Homo_sapiens_assembly38.fasta & dbsnp.v153.hg38.vcf.gz

I got a proper VCF with example.1k.txt and just a VCF header for the finngen input.
All the genome and dbSNP files were downloaded using links from from your repository README.md.

I have discovered that i.e. dbSNP files have a different naming schemes:

bcftools view -h  dbsnp.v153.b37.vcf.gz | grep '^##cont'

# result

##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
<snip>

hg38

bcftools view -h  dbsnp.v153.hg38.vcf.gz | grep '^##cont' | 

# result

##contig=<ID=chr1,length=248956422,assembly=38>
##contig=<ID=chr2,length=242193529,assembly=38>
##contig=<ID=chr3,length=198295559,assembly=38>
##contig=<ID=chr4,length=190214555,assembly=38>
##contig=<ID=chr5,length=181538259,assembly=38>
##contig=<ID=chr6,length=170805979,assembly=38>
##contig=<ID=chr7,length=159345973,assembly=38>
##contig=<ID=chr8,length=145138636,assembly=38>
##contig=<ID=chr9,length=138394717,assembly=38>
##contig=<ID=chr10,length=133797422,assembly=38>
##contig=<ID=chr11,length=135086622,assembly=38>
##contig=<ID=chr12,length=133275309,assembly=38>
##contig=<ID=chr13,length=114364328,assembly=38>
##contig=<ID=chr14,length=107043718,assembly=38>
##contig=<ID=chr15,length=101991189,assembly=38>
##contig=<ID=chr16,length=90338345,assembly=38>
##contig=<ID=chr17,length=83257441,assembly=38>
##contig=<ID=chr18,length=80373285,assembly=38>
##contig=<ID=chr19,length=58617616,assembly=38>
##contig=<ID=chr20,length=64444167,assembly=38>
##contig=<ID=chr21,length=46709983,assembly=38>
##contig=<ID=chr22,length=50818468,assembly=38>
##contig=<ID=chrX,length=156040895,assembly=38>
##contig=<ID=chrY,length=57227415,assembly=38>
##contig=<ID=chrM,length=16569,assembly=38>
<snip>

Can you kindly provide instructions about the way you use to synchronize the chromosome/contigs naming schemes for dbSNP VCF relase downloaded directly from i.e:
https://ftp.ncbi.nih.gov/snp/latest_release/VCF/
??

It is not clear without investigating the code what is being done in the background to handle apparently different chromosome names in the input TSV, genomic fasta and dbSNP.

Many thanks for your help

Darek Kedra

check position is positive integer

check chromosome position is positive integer

dealing with non-hg19

Provide mechanism for liftover that corrects betas

error when no snp col is provided

2019-11-21 21:18:12,876 INFO GWAS2VCF 1.1.1
2019-11-21 21:18:12,876 INFO Arguments: {'out': '/data/gwas2vcfweb/data/64d840d6-0ca4-11ea-a24a-0242ad006803/64d840d6-0ca4-11ea-a24a-0242ad006803_harm.vcf.gz', 'data': '/data/gwas2vcfweb/cromwell-executions/gwas2vcf/c320d28f-76ca-4bff-886b-48e04209f86c/call-vcf/inputs/1204249195/64d840d6-0ca4-11ea-a24a-0242ad006803.txt', 'ref': '/data/gwas2vcfweb/cromwell-executions/gwas2vcf/c320d28f-76ca-4bff-886b-48e04209f86c/call-vcf/inputs/1899004205/human_g1k_v37.fasta', 'json': '/data/gwas2vcfweb/cromwell-executions/gwas2vcf/c320d28f-76ca-4bff-886b-48e04209f86c/call-vcf/inputs/1204249195/upload.json', 'id': '64d840d6-0ca4-11ea-a24a-0242ad006803', 'cohort_controls': None, 'cohort_cases': None, 'rm_chr_prefix': True, 'csi': False, 'log': 'INFO'}
2019-11-21 21:18:12,876 INFO Reading JSON parameters
2019-11-21 21:18:12,877 INFO Parameters: {'header': True, 'delimiter': '\t', 'pval_col': 8, 'beta_col': 5, 'pos_col': 1, 'build': 'GRCh37', 'ea_col': 3, 'chr_col': 0, 'oa_col': 4, 'se_col': 6}
2019-11-21 21:18:12,877 INFO Checking input arguments
2019-11-21 21:18:12,877 INFO Reading summary stats and mapping to FASTA: /data/gwas2vcfweb/cromwell-executions/gwas2vcf/c320d28f-76ca-4bff-886b-48e04209f86c/call-vcf/inputs/1204249195/64d840d6-0ca4-11ea-a24a-0242ad006803.txt
2019-11-21 21:18:12,878 INFO Reading plain text file
2019-11-21 21:18:12,878 INFO Skipping header: chr	pos	rsid	A1	A2	beta	se	N	P-value	Freq.A1.1000G.EUR
2019-11-21 21:18:12,916 INFO Total variants: 999
2019-11-21 21:18:12,916 INFO Variants could not be read: 0
2019-11-21 21:18:12,926 INFO Variants harmonised: 999
2019-11-21 21:18:12,926 INFO Variants discarded during harmonisation: 0
2019-11-21 21:18:12,926 INFO Alleles switched: 455
2019-11-21 21:18:12,926 INFO Skipped 0 of 999
2019-11-21 21:18:12,926 INFO Writing headers to BCF/VCF: /data/gwas2vcfweb/data/64d840d6-0ca4-11ea-a24a-0242ad006803/64d840d6-0ca4-11ea-a24a-0242ad006803_harm.vcf.gz
Traceback (most recent call last):
  File "/app/main.py", line 175, in <module>
    main()
  File "/app/main.py", line 171, in main
    Vcf.write_to_file(harmonised, args.out, fasta, j['build'], args.id, sample_metadata, file_metadata, args.csi)
  File "/app/vcf.py", line 149, in write_to_file
    record.id = Vcf.remove_illegal_chars(result.dbsnpid)
  File "/app/vcf.py", line 30, in remove_illegal_chars
    r = s.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

annotate with dbSNP on the fly

move original dbsnp field to INFO

see spec - mr-base-structure

add chromosome alias option

ieu gwas pipe calls chromosome X -> 23, need to include alias file to map this correctly

For simplicity should we split multi-allelic variants in bcf files?

i.e. don't allow multiple values for a variant. So

1 100 A C,G

would be

1 100 A C
1 100 A G

It means that when extracting data we don't have to run a normalisation step, which could slow things down and sometimes trip people up if they are not expecting it

Containerising

Would it be possible to add the following to the container

Download the reference data
Allow the container to run in the background and accept command similar to docker exec <container_name> ./harmonise.py -arg1 args -arg2 args

read from zip/gzip/compressed text GWAS

check chromosome is in FAI

Do not write out records with chromosomes that are not listed in the header.

bcftools norm output

add flag for tophits instead of using a new file

Marshmallow for reading GWAS data

add PROPCASE for bcf

md5 checksums for inputs

Hello,

This is just an enhancement proposal, no rush to implement it.

does it make sense to add md5 checksums in the VCF header for all files used in a given gwas2vcf run?

I am thinking about:

input TSV (or tsv.gz)
json parameters
genome fasta
dbsnp

This should help to investigate the results IMHO.

Best,

bgzip or gzip?

I think at the moment it detects the output type based on suffix of the filename provided, and automatically gzips? Is there anyway to make it bgzip instead? pysam seems to have problems reading a .vcf.gz gzipped file

e.g. /mnt/storage/private/mrcieu/research/mr-eve/gwas-files/IEU-a:2/

>>> bcf_in = VariantFile("IEU-a:2.vcf.gz")
[W::hts_idx_load2] The index file is older than the data file: IEU-a:2.vcf.gz.tbi
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pysam/libcbcf.pyx", line 3990, in pysam.libcbcf.VariantFile.__init__
  File "pysam/libcbcf.pyx", line 4238, in pysam.libcbcf.VariantFile.open
  File "pysam/libchtslib.pyx", line 517, in pysam.libchtslib.HTSFile.tell
NotImplementedError: seek not implemented in files compressed by method 1

pickle.dump(result, results): no space left on the device

Hello,

while running multiple (~3000 TSV from R5 Finngen) jobs on a cluster I encountered:

Traceback (most recent call last):
  File "/app/main.py", line 182, in <module>
    main()
  File "/app/main.py", line 130, in main
    gwas, idx, sample_metadata = Gwas.read_from_file(
  File "/app/gwas.py", line 342, in read_from_file
    pickle.dump(result, results)
OSError: [Errno 28] No space left on device

I am running the 1.3.1 version inside of the Singularity container. The /tmp partitions on the computing nodes apparently can be filled in be it by multiple gwas2vcf jobs running on the node or other user jobs.
Setting up $TMP to some other larger directory (i.e. /home/username/tmp ) before invoking singularity exec gwas2vcf_1.3.1.sif does work, but one has to make sure that the aforementioned $TMP directory is accessible from the inside of the Singularity container. No --no-home switch in the example above.

May I suggest to wrap the pickle.dump(result, results) in try: except and printing out tempfile.gettempdir() for the easier debugging?

Hope it helps

Darek Kedra

allow for storing multiple traits per VCF (use the sample field)

Also helpful for storage with GenomicsDB?

allow json fields in commandline input

No output file

I have setup docker on a Mac and can run main.py without errors using the following.

docker run -v /Users/username/Documents/dockerfiles:/Users/username/Documents/dockerfiles --name gwas2vcf -it gwas2vcf:latest python main.py --data /Users/username/Documents/dockerfiles/data/gwas.tsv.gz --json /app/params.json --id ID --ref /Users/username/Documents/dockerfiles/data/human_g1k_v37.fasta --dbsnp /Users/username/Documents/dockerfiles/data/dbsnp.v153.b37.vcf.gz --out /Users/username/Documents/dockerfiles/data/out.vcf --alias ./alias.txt

Runtime is around 4 hours but results in no output file. I have also ran in with DEBUG and there appears to be no error messages.

Use absolute number of cases instead of fraction to avoid loss of precision

host reference genome data

@explodecomputer the Broad have retired access to their FTP, I know we have copies of the genome FASTAs etc, do you know if we are hosting them externally?
Thanks
Matt

Difficulty with --dbsnp/alias

Dear gwas2vcf team,

You have done a very nice job with gwas2vcf and I managed to convert out GWAS sumstats but I am having difficulty with --dbsnp/alias; without them I could get my .vcf but not with them; do you have any suggestions?

python main.py --out ${prot}.vcf.gz --data ${prot}.txt.gz
--id ${prot} --ref human_g1k_v37.fasta --json gwas2vcf.json
--dbsnp snp154_GCF_000001405.25.gz
--alias hg19.chromAlias.tsv

A chunk of data is shown here,
chr pos snpid a1 a2 af b se logp n
1 10177 chr1:10177_A_AC A AC 0.6598 -0.0391 0.0223 1.1 6879

Thank you!

Jing Hua

case control header set wrong when cohort fraction used

check binary/continuous flag in vcf files

vcf spec 4.3 supports inf; use this for -log10 p instead of 999

compress fasta files used in example

Hello,

downloading 3G human genome fasta files from your server, i.e.

http://fileserve.mrcieu.ac.uk/ref/2.8/b37/human_g1k_v37.fasta

takes several minutes. Reducing the download size by it by compressing fasta with gzip or even zstd would reduce the time for downloads. Zstd -11 produces file 0.27 of the original size while its decompression on 8 threads i7 takes ~8s.

mrcieu / gwas2vcf Goto Github PK

gwas2vcf's Introduction

Convert GWAS summary statistics to VCF

Documentation

GWAS summary statistics

Tutorials

Workflow

Citation

gwas2vcf's People

Contributors

Stargazers

Watchers

Forkers

gwas2vcf's Issues

hg38

Recommend Projects

Recommend Topics

Recommend Org