Coder Social home page Coder Social logo

gwas2vcf's Introduction

Convert GWAS summary statistics to VCF

Build Status DOI

Tool to map GWAS summary statistics to VCF with on-the-fly harmonisation to a supplied reference FASTA

Produces GWAS-VCF with version 1.0 of the specification

Documentation

Full documentation available from https://mrcieu.github.io/gwas2vcf

GWAS summary statistics

Complete GWAS summary statistics in GWAS-VCF are available on >14,000 datasets from the OpenGWAS project

Tutorials

What can I do with GWAS-VCF?

Let us know if you have other use cases through the issues page!

Workflow

Citation

  • Lyon M, Andrews S, Elsworth B, Gaunt T, Hemani G, Marcora E. The variant call format provides efficient and robust storage of GWAS summary statistics. Genome Biol 22, 32 (2021). https://doi.org/10.1186/s13059-020-02248-0

  • Elsworth B, Lyon M, Alexander T, Liu Y, Matthews P, Hallett J, Bates P, Palmer T, Haberland V, Davey Smith G, Zheng J, Haycock P, Gaunt TR, Hemani G. The MRC IEU OpenGWAS data infrastructure. bioRxiv, p. 2020.08.10.244293, Aug. 2020. https://doi.org/10.1101/2020.08.10.244293

Please also cite the relevant tool(s) and data source if you use GWAS-VCF for downstream analyses.

gwas2vcf's People

Contributors

benjaminralexander avatar darked89 avatar dependabot[bot] avatar explodecomputer avatar titorat avatar vreuter avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

gwas2vcf's Issues

alias file

Hi,
I think your alias.txt contains a mistake. You list:
25 MT
However, at least in PLINK, only 26 is MT, 25 is XY.
By the way, it would be great if the documentation mentioned that your hg19 and hg38 files use different chromosome codes.
Thanks for looking into this,
Till

VCF cannot store floats < 1e-6

@explodecomputer @elswob We already encountered this issue for storing p-values but there are some cases where the effect size is less than 1e-6 in UKBB. They would end up as 0 in the VCF file. Should we log transform or leave as is?

benchmarks and speedups

Hello,

I have completed TSV to VCF transformation of one Finngen GWAS summary file as a test case.
The gwas2vcf was run using Singularity container using:

  • finngen_R5_AB1_AMOEBIASIS.tsv (16380388 lines)
  • dbSNP 155
  • GRCh38 fasta
  • Singularity 3.7.0
  • Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz

Both genomic fasta and dbSNP VCF had chromosome ids in the same 1-22,X,Y,MT format, were indexed etc.

the output VCF format:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  amoeb01
1       108391  rs1274919517    A       G       .       PASS    .       ES:SE:LP:ID     0.3653:1.9411:0.0702236:rs1274919517

This was executed in 582.99 mins
usr time 577.51 mins sys time 5.38 mins

My questions

  1. is this in line with running times you observed or 10hrs /file is a fluke (== too slow)?
  2. would it be possible to speed up processing?

Thank you

Darek Kedra

Sample sizes formatted to only 6 significant digits

Sample sizes (the SS or NC records) greater than 1 million get truncated to six significant digits and formatted using E notation. For example, the number of cases "7654321" gets formatted as "7.65432e+06".

inf in estimate field

rs2299979 @ /mnt/storage/private/mrcieu/data/IGD/data/public/eqtl-a-ENSG00000100985/eqtl-a-ENSG00000100985.vcf.gz

reference genome base checking

Hello,

gwas2vcf seems to be dropping (BTW, correct behavior) lines where TSV input REF base does not match the genomic fasta base at that position. By accident (BED file with chromosomes starting with 1 combined with pyfaidx) I noticed that the greatly reduced in size VCF result is still produced by gwas2vcf.

Would it be possible to add some extra check or big fat warning that the number of the dropped positions exceeds some arbitrary limit?
I was lucky enough to have an intact VCF from the previous gwas2vcf run using not corrupted all contigs genomic fasta so the difference was obvious.

Kind regards,

Darek Kedra

different naming schemes in dbsnp.v153.b37.vcf.gz vs dbsnp.v153.hg38.vcf.gz

Hello,

using two input TSV files with the same chromosome naming scheme (1, 2, etc.)

  1. example.1k.txt (run with: human_g1k_v37.fasta & dbsnp.v153.b37.vcf.gz
  2. a subset of finngen_R5_D3_SARCOIDOSIS.tsv (header & SNPs for chromosome 22 only) run using: Homo_sapiens_assembly38.fasta & dbsnp.v153.hg38.vcf.gz

I got a proper VCF with example.1k.txt and just a VCF header for the finngen input.
All the genome and dbSNP files were downloaded using links from from your repository README.md.

I have discovered that i.e. dbSNP files have a different naming schemes:

bcftools view -h  dbsnp.v153.b37.vcf.gz | grep '^##cont'

# result

##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
<snip>

hg38

bcftools view -h  dbsnp.v153.hg38.vcf.gz | grep '^##cont' | 

# result

##contig=<ID=chr1,length=248956422,assembly=38>
##contig=<ID=chr2,length=242193529,assembly=38>
##contig=<ID=chr3,length=198295559,assembly=38>
##contig=<ID=chr4,length=190214555,assembly=38>
##contig=<ID=chr5,length=181538259,assembly=38>
##contig=<ID=chr6,length=170805979,assembly=38>
##contig=<ID=chr7,length=159345973,assembly=38>
##contig=<ID=chr8,length=145138636,assembly=38>
##contig=<ID=chr9,length=138394717,assembly=38>
##contig=<ID=chr10,length=133797422,assembly=38>
##contig=<ID=chr11,length=135086622,assembly=38>
##contig=<ID=chr12,length=133275309,assembly=38>
##contig=<ID=chr13,length=114364328,assembly=38>
##contig=<ID=chr14,length=107043718,assembly=38>
##contig=<ID=chr15,length=101991189,assembly=38>
##contig=<ID=chr16,length=90338345,assembly=38>
##contig=<ID=chr17,length=83257441,assembly=38>
##contig=<ID=chr18,length=80373285,assembly=38>
##contig=<ID=chr19,length=58617616,assembly=38>
##contig=<ID=chr20,length=64444167,assembly=38>
##contig=<ID=chr21,length=46709983,assembly=38>
##contig=<ID=chr22,length=50818468,assembly=38>
##contig=<ID=chrX,length=156040895,assembly=38>
##contig=<ID=chrY,length=57227415,assembly=38>
##contig=<ID=chrM,length=16569,assembly=38>
<snip>

Can you kindly provide instructions about the way you use to synchronize the chromosome/contigs naming schemes for dbSNP VCF relase downloaded directly from i.e:
https://ftp.ncbi.nih.gov/snp/latest_release/VCF/
??

It is not clear without investigating the code what is being done in the background to handle apparently different chromosome names in the input TSV, genomic fasta and dbSNP.

Many thanks for your help

Darek Kedra

error when no snp col is provided

2019-11-21 21:18:12,876 INFO GWAS2VCF 1.1.1
2019-11-21 21:18:12,876 INFO Arguments: {'out': '/data/gwas2vcfweb/data/64d840d6-0ca4-11ea-a24a-0242ad006803/64d840d6-0ca4-11ea-a24a-0242ad006803_harm.vcf.gz', 'data': '/data/gwas2vcfweb/cromwell-executions/gwas2vcf/c320d28f-76ca-4bff-886b-48e04209f86c/call-vcf/inputs/1204249195/64d840d6-0ca4-11ea-a24a-0242ad006803.txt', 'ref': '/data/gwas2vcfweb/cromwell-executions/gwas2vcf/c320d28f-76ca-4bff-886b-48e04209f86c/call-vcf/inputs/1899004205/human_g1k_v37.fasta', 'json': '/data/gwas2vcfweb/cromwell-executions/gwas2vcf/c320d28f-76ca-4bff-886b-48e04209f86c/call-vcf/inputs/1204249195/upload.json', 'id': '64d840d6-0ca4-11ea-a24a-0242ad006803', 'cohort_controls': None, 'cohort_cases': None, 'rm_chr_prefix': True, 'csi': False, 'log': 'INFO'}
2019-11-21 21:18:12,876 INFO Reading JSON parameters
2019-11-21 21:18:12,877 INFO Parameters: {'header': True, 'delimiter': '\t', 'pval_col': 8, 'beta_col': 5, 'pos_col': 1, 'build': 'GRCh37', 'ea_col': 3, 'chr_col': 0, 'oa_col': 4, 'se_col': 6}
2019-11-21 21:18:12,877 INFO Checking input arguments
2019-11-21 21:18:12,877 INFO Reading summary stats and mapping to FASTA: /data/gwas2vcfweb/cromwell-executions/gwas2vcf/c320d28f-76ca-4bff-886b-48e04209f86c/call-vcf/inputs/1204249195/64d840d6-0ca4-11ea-a24a-0242ad006803.txt
2019-11-21 21:18:12,878 INFO Reading plain text file
2019-11-21 21:18:12,878 INFO Skipping header: chr	pos	rsid	A1	A2	beta	se	N	P-value	Freq.A1.1000G.EUR
2019-11-21 21:18:12,916 INFO Total variants: 999
2019-11-21 21:18:12,916 INFO Variants could not be read: 0
2019-11-21 21:18:12,926 INFO Variants harmonised: 999
2019-11-21 21:18:12,926 INFO Variants discarded during harmonisation: 0
2019-11-21 21:18:12,926 INFO Alleles switched: 455
2019-11-21 21:18:12,926 INFO Skipped 0 of 999
2019-11-21 21:18:12,926 INFO Writing headers to BCF/VCF: /data/gwas2vcfweb/data/64d840d6-0ca4-11ea-a24a-0242ad006803/64d840d6-0ca4-11ea-a24a-0242ad006803_harm.vcf.gz
Traceback (most recent call last):
  File "/app/main.py", line 175, in <module>
    main()
  File "/app/main.py", line 171, in main
    Vcf.write_to_file(harmonised, args.out, fasta, j['build'], args.id, sample_metadata, file_metadata, args.csi)
  File "/app/vcf.py", line 149, in write_to_file
    record.id = Vcf.remove_illegal_chars(result.dbsnpid)
  File "/app/vcf.py", line 30, in remove_illegal_chars
    r = s.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

Containerising

Would it be possible to add the following to the container

  • Download the reference data
  • Allow the container to run in the background and accept command similar to docker exec <container_name> ./harmonise.py -arg1 args -arg2 args

md5 checksums for inputs

Hello,

This is just an enhancement proposal, no rush to implement it.

does it make sense to add md5 checksums in the VCF header for all files used in a given gwas2vcf run?

I am thinking about:

  1. input TSV (or tsv.gz)
  2. json parameters
  3. genome fasta
  4. dbsnp

This should help to investigate the results IMHO.

Best,

DK

bgzip or gzip?

I think at the moment it detects the output type based on suffix of the filename provided, and automatically gzips? Is there anyway to make it bgzip instead? pysam seems to have problems reading a .vcf.gz gzipped file

e.g. /mnt/storage/private/mrcieu/research/mr-eve/gwas-files/IEU-a:2/

>>> bcf_in = VariantFile("IEU-a:2.vcf.gz")
[W::hts_idx_load2] The index file is older than the data file: IEU-a:2.vcf.gz.tbi
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pysam/libcbcf.pyx", line 3990, in pysam.libcbcf.VariantFile.__init__
  File "pysam/libcbcf.pyx", line 4238, in pysam.libcbcf.VariantFile.open
  File "pysam/libchtslib.pyx", line 517, in pysam.libchtslib.HTSFile.tell
NotImplementedError: seek not implemented in files compressed by method 1

pickle.dump(result, results): no space left on the device

Hello,

while running multiple (~3000 TSV from R5 Finngen) jobs on a cluster I encountered:

Traceback (most recent call last):
  File "/app/main.py", line 182, in <module>
    main()
  File "/app/main.py", line 130, in main
    gwas, idx, sample_metadata = Gwas.read_from_file(
  File "/app/gwas.py", line 342, in read_from_file
    pickle.dump(result, results)
OSError: [Errno 28] No space left on device

I am running the 1.3.1 version inside of the Singularity container. The /tmp partitions on the computing nodes apparently can be filled in be it by multiple gwas2vcf jobs running on the node or other user jobs.
Setting up $TMP to some other larger directory (i.e. /home/username/tmp ) before invoking singularity exec gwas2vcf_1.3.1.sif does work, but one has to make sure that the aforementioned $TMP directory is accessible from the inside of the Singularity container. No --no-home switch in the example above.

May I suggest to wrap the pickle.dump(result, results) in try: except and printing out tempfile.gettempdir() for the easier debugging?

Hope it helps

Darek Kedra

No output file

I have setup docker on a Mac and can run main.py without errors using the following.

docker run -v /Users/username/Documents/dockerfiles:/Users/username/Documents/dockerfiles --name gwas2vcf -it gwas2vcf:latest python main.py --data /Users/username/Documents/dockerfiles/data/gwas.tsv.gz --json /app/params.json --id ID --ref /Users/username/Documents/dockerfiles/data/human_g1k_v37.fasta --dbsnp /Users/username/Documents/dockerfiles/data/dbsnp.v153.b37.vcf.gz --out /Users/username/Documents/dockerfiles/data/out.vcf --alias ./alias.txt

Runtime is around 4 hours but results in no output file. I have also ran in with DEBUG and there appears to be no error messages.

Difficulty with --dbsnp/alias

Dear gwas2vcf team,

You have done a very nice job with gwas2vcf and I managed to convert out GWAS sumstats but I am having difficulty with --dbsnp/alias; without them I could get my .vcf but not with them; do you have any suggestions?

python main.py --out ${prot}.vcf.gz --data ${prot}.txt.gz
--id ${prot} --ref human_g1k_v37.fasta --json gwas2vcf.json
--dbsnp snp154_GCF_000001405.25.gz
--alias hg19.chromAlias.tsv

A chunk of data is shown here,
chr pos snpid a1 a2 af b se logp n
1 10177 chr1:10177_A_AC A AC 0.6598 -0.0391 0.0223 1.1 6879

Thank you!

Jing Hua

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.