nealelab / uk_biobank_gwas

Overview of the data QC, code, and GWAS summary output from the 2017 UK Biobank data release

Python 89.12% R 10.66% Shell 0.22%

uk_biobank_gwas's Issues

Including Chromosome X in GWAS

Hello,

I am wondering whether your group included chromosome X in the association analysis for UKB. I couldn't find any information about QC steps for chromosome X in your repository.

Cheers,
Ana

Sample-QC file at UK biobank

Dear all,

I have access to UK biobank data.
I am looking for a specific file (ukb_sqc_vZ.txt, previously named ukb_sqc_v2.txt).
I need this file to identify the in.white.British.ancestry.subset samples.
I could not find it on the UK Biobank site, nor any code to download it from UK Biobank.
They mentioned that it can be obtained from EGA, but even there it is unavailable unless UK Biobank grants access.
Any idea how to get it directly from UK Biobank?
Regards

Unable to reproduce your sample QC exactly

Hi all,

Thanks for sharing your code!

I am looking to reproduce your GWAS results for a small number of phenotypes as part of a project on the genetic structure among autoimmune disorders.

After applying your Sample QC filters to my data, however, I am left with about 4k more samples than the stated 361194:

#               filter pass_alone pass_cumulative
# 1:                all     488377          488377
# 2:        used.in.pca     407219          407219
# 3:     sex.aneuploidy     487725          406825
# 4:         pc.outlier     477101          396439
# 5: ethnicity.included     444409          364965
# 6:            consent     488366          364965

Would you be able to comment on this difference based on the relevant code? It would be good to know how exactly you have filtered samples by ethnicity and identified outliers based on the principal components, if this is different from the way I have done it here.

Many thanks,

Chris
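A table like the one above can be produced by applying each filter on its own and then in sequence. The following is a minimal sketch of that bookkeeping, assuming each sample is a record of boolean flags; the field names and toy samples are hypothetical and do not come from the repository's code.

```python
def filter_table(samples, filters):
    """Return (filter, pass_alone, pass_cumulative) rows for ordered filters."""
    rows = [("all", len(samples), len(samples))]
    surviving = list(samples)
    for name, keep in filters:
        # Samples passing this filter on its own, ignoring the others.
        pass_alone = sum(1 for s in samples if keep(s))
        # Samples passing this filter and every filter before it.
        surviving = [s for s in surviving if keep(s)]
        rows.append((name, pass_alone, len(surviving)))
    return rows


# Hypothetical toy data: four samples with made-up QC flags.
samples = [
    {"used_in_pca": True, "aneuploidy": False, "pc_outlier": False},
    {"used_in_pca": True, "aneuploidy": True, "pc_outlier": False},
    {"used_in_pca": False, "aneuploidy": False, "pc_outlier": False},
    {"used_in_pca": True, "aneuploidy": False, "pc_outlier": True},
]
filters = [
    ("used.in.pca", lambda s: s["used_in_pca"]),
    ("sex.aneuploidy", lambda s: not s["aneuploidy"]),
    ("pc.outlier", lambda s: not s["pc_outlier"]),
]
table = filter_table(samples, filters)
for row in table:
    print(row)
```

Comparing the per-filter `pass_alone` counts against a reference run is usually the quickest way to find which single filter is defined differently.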

How IRNT was conducted on sex-stratified analysis

Hi,

For the three GWAS samples of each phenotype (Female, Male, bothsex), I am wondering in which of the following ways the IRNT was conducted:

1) Run IRNT separately for the "Female", "Male" and "bothsex" GWAS samples
2) Run IRNT only once on all (combined) samples, and then extract Female and Male for GWAS

Thank you!
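The two options give different phenotype values for the same individuals, which the sketch below illustrates. It assumes a rank offset of 0.5, i.e. the transform Phi^-1((rank - 0.5) / n) with average ranks for ties; that offset, the function name and the toy data are assumptions for illustration, not a statement of what the pipeline does.

```python
from statistics import NormalDist


def irnt(values):
    """Inverse rank normal transform; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Assumed offset of 0.5 keeps the probabilities strictly inside (0, 1).
    return [NormalDist().inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical toy phenotype in which males tend to have larger values.
pheno = [150.0, 160.0, 170.0, 165.0, 175.0, 185.0]
sex = ["F", "F", "F", "M", "M", "M"]
female_idx = [i for i, s in enumerate(sex) if s == "F"]

# Option 1: transform within the female subset only.
within_female = irnt([pheno[i] for i in female_idx])

# Option 2: transform once on everyone, then extract the females.
combined = irnt(pheno)
subset_female = [combined[i] for i in female_idx]

print(within_female)
print(subset_female)
```

Under option 1 the female values are centred on zero by construction; under option 2 they keep their position relative to the combined distribution, so their mean can sit away from zero.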

How is sex determined

Hi,

I can't seem to find information on how sex was determined for the sex-stratified analyses. Is this based on self-report or genetic inference?

Thanks,
Emil

Inconsistency in MD5 Checksums for imputed v3 GWAS sumstats

Hi!

I recently downloaded GWAS summary statistics for almost all UKBB phenotypes (both sexes only) via the AWS link found in the provided Google document. To ensure data integrity, I conducted an MD5 checksum verification (on a Linux system). However, I encountered a recurring issue with 22 specific phenotypes failing the MD5 checksum validation across two separate tests.

For example, for phenotype ID 1220, accessed through the link https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/1220.gwas.imputed_v3.both_sexes.tsv.bgz, the MD5 checksum I computed was 5ee1df62d2a6608c942ec85fd712bf9a. This differs from the expected checksum provided, which is 80d2c21a425aee154d585cd20ffa1e8c.

I have listed all affected phenotypes below for your reference.
Could you please investigate this discrepancy?

Thank you for your attention to this matter!

135 Number of self-reported non-cancer illnesses
136 Number of operations, self-reported
137 Number of treatments/medications taken
398 Number of correct matches in round
403 Number of times snap-button pressed
709 Number in household
924 Usual walking pace
1160 Sleep duration
1190 Nap during day
1200 Sleeplessness / insomnia
1220 Daytime dozing / sleeping (narcolepsy)
1239 Current tobacco smoking
1289 Cooked vegetable intake
1518 Hot drink temperature
1548 Variation in diet
1687 Comparative body size at age 10
1697 Comparative height size at age 10
1873 Number of full brothers
1883 Number of full sisters
2237 Plays computer games
2296 Falls in the last year
2306 Weight change compared with 1 year ago
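For anyone wanting to repeat this check independently of the command-line tools, the following is a minimal sketch using Python's `hashlib`. The mapping from local file path to expected checksum is assumed to have been built by hand from the manifest; the function names are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it all in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_checksums(expected):
    """Return the paths whose computed MD5 differs from the expected value.

    `expected` maps a local file path to the MD5 hex string listed
    in the manifest (assumed to be supplied by the caller).
    """
    return [
        path
        for path, md5 in expected.items()
        if md5_of_file(path) != md5.strip().lower()
    ]
```

A mismatch that is reproducible across separate downloads, as reported here, suggests that the listed checksum and the hosted file differ rather than that a transfer was corrupted.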

References in Google Doc - manifest tab

In this doc, many cells in the "Manifest 201807" tab show a REF error.
Is there a corrected version of the tab?

Thanks in advance!

imputed-v3-gwas

Hello, could you tell me how I can get the code for imputed-v3-gwas? Or is the imputed-v3-gwas pipeline not going to be open source?

Problem for version 2

Hi,
When I analyzed the height phenotype, I compared my results with yours (version 2). The allele dosages are the same, but the ytx values are not, and some differ by a large amount. In the results, I find some variants where ytx is greater than zero but beta is less than zero. I would like to know how you calculate it. Is it the product with the residual of y, regressed on sex+PC{1:10}?
If you tell me how to calculate it, I can revise my program and get the right answer.
Many thanks for your help!
Sheng

How to handle missing dosage of imputed data.

I asked a question before (closed by me, as it was not specific enough) regarding the GWAS done on imputed dosages in UKB. I would now like to rephrase that question:

As far as I understand from the Hail code, Hail treats missing probabilities in GEN files as [0, 0, 0] for AA/Aa/aa.
According to the BGEN 1.2 format, missingness of calls is stored along with probabilities in the genotype data block. However, it seems that this field is not currently used in version 1.2 (as a result, the most significant bit of the missingness data is always 0). Moreover, since the third probability (assuming all variants are bi-allelic) is inferred as one minus the sum of the other probabilities, Hail (I tested with rbgen as well) will never find any missingness in BGEN v1.2 (UKB imputed data).
Nevertheless, if genotype hard calls ('GT') are used, missingness is treated differently (e.g. if there is not a unique maximum probability, the hard call is set to missing).

That being said, do you think that treating genotype probabilities in BGEN files as having no missing values leaves the downstream GWAS results unaffected?

Many thanks in advance,
Best
Oveis

input file

Dear all,
I don't know how to generate the file in 'gs://ukb31063/hail/ukb31063.neale_gwas_variants.ht'. Could you please help me solve this problem?
Thanks

where is ukb_sqc_v2.txt file located?

Hello,

Can you please tell me where and how you downloaded the ukb_sqc_v2.txt file?

Thanks
Ana

Generating the file "neale_lab_parsed.tsv"

Hi there,
I hope you're doing well. Apologies for this very simple question. What variables or Field IDs does the file "neale_lab_parsed.tsv" include? I want to generate the same file so that I can run the script "ukb31063_eur_selection.R". Your help will be greatly appreciated! Thanks a lot.

All the best!

Charles

Questions on organization and possible discrepancies

Hi, I have a few questions on the general organization in this repo:

Is the code that produced the most recent "imputed-v3" version of results all in the 0.1 folder?
If so, why was Hail 0.2 not used rather than 0.1 in the final results?
Is any of the code in the imputed-v2-gwas relevant for somebody that wanted to reproduce the "Round 2" results mentioned at http://www.nealelab.is/uk-biobank? I wanted to be absolutely certain that none of the Hail 0.2 code is important.
Do those Round 2 results correspond to the "imputed-v3" results mentioned in the README of this codebase?

Any help would be much appreciated! Thank you.

HWE threshold for VEP coded SNPs

I am wondering what "w/MAF" means in "HWE p-value > 1e-10 * Exception: VEP annotated coding w/MAF < 0.001"?

regression test

Hi,

For the regression tests, it seems that linear regression tests are used also for binary outcomes. Would linreg3 be able to figure out the data type of outcomes and use the logistic regression test for binary outcomes?

Thanks,
Amy

Computing resources necessary for GWAS

If we wanted to reproduce the results from this analysis, would it be possible to get an estimate of what kind of computing resources are necessary? How many nodes do you need in a Spark cluster to do this many regressions and what specifications do those nodes have? How long might we expect it to take for a given cluster configuration?

Thank you.

Incorrect results for pregnancy related outcomes?

I am posting this issue here, because I did not get a response to an email I sent to [email protected] six weeks ago.

I am wondering about the N for pregnancy related outcomes in the downloadable GWAS summary data, which appears to be faulty to me.

For example, the N for “Diagnoses - main ICD10: O24 Diabetes mellitus in pregnancy” is 337,199. However, there are only around 270,000 women in the UK biobank.
I suspect that 337,199 is the number of individuals (male and female) that were analysed. I would think that analysis of pregnancy related outcomes should be limited to women who reported a live birth or stillbirth.

Alternatively, I am misunderstanding the way the analysis was set up.

Could you clarify if the analysis of pregnancy related variables is erroneous, or if I misunderstood something?

Clarify MAF threshold (0.0001 or 0.001)

Hi,
Thank you so much for sharing the scripts and for the detailed description of your GWAS pipeline, this information is really helpful.
I've noticed that the README.md file (in the root of your repository) refers to a MAF threshold of 0.0001 (for both v2 and v3 marker QC), but the blog mentions 0.001, and this seems to be consistent with your code ( https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/imputed-v2-gwas/6_filter_gwas_variants.py#L21 ). Should the README.md file be corrected?

Also, on a totally separate note, are you planning to share, at some point, the v3 scripts (similar to those in the "imputed-v2-gwas" folder)? If so, that would be really helpful.

file question about UKBB

Hello, I just got the UKBB data and I want to know which file we should start the GWAS analysis from. Is it the imputation data file?
If you could give me some tips and comments, I would appreciate your help. Thanks.

File structure

Would it be possible to share the file structure of your project in the 0.2 folder?

ICD-10 based diagnoses?

Hi,

I'm having a hard time understanding the extraction of binary phenotypes from ICD-10 codes in UKBB. I have tried to summarize my questions in a few points (I would deeply appreciate it if you could help me with them!)

UKBB phenotypes were extracted using customized PHESANT, and phenotype details can be accessed through Updated v2 phenotype summary file - both_sexes. Looking at some binary phenotypes, I realized that they are extracted only using main ICD-10 codes (data-field 41202). Is there any particular reason that you did not include secondary ICD-10 codes (data-field 41204)? I'm asking this, because first, there is no information in customized PHESANT page (or at least, I failed to find it) about this, and secondly, some other studies also include secondary ICD-10 codes to extract binary phenotypes.
It seems that all arrays for main ICD-10 codes were merged together; for instance, if the input phenotype code is X111, customized PHESANT scans all 66 arrays (categorical multiple) for at least one appearance of X111 (as described in Supplementary section S2: Dealing with fields with multiple time points or multiple measurements at the same time point of the PHESANT paper). Am I right about this?
If the answer to the above question is YES (i.e. at least one appearance of the input ICD-10 code over all 66 arrays), how should multiple ICD-10 codes be handled? For instance, N18 Chronic renal failure covers N180, N181, N182, N183, N184, N185, N188, N189 codes.
As also mentioned in the PHESANT paper, fields containing >1 instance won't be scanned. But, in case of data-field 41202, 66 arrays correspond to different diagnosis time points (from 1992-2017). Considering the fact that these binary traits will be used for GWAS studies, using different covariates (e.g. age, BMI and gender data were recorded at different time points than those of binary phenotypes), do you think the results of such studies are reliable in this sense?

Thanks in advance for your patience and response!
Best
Oveis

PLINK2 software problem

Dear all:
I have run into a tough problem: I want to merge all the imputed data and the INFO data provided by UKBB, but the merge fails. Can somebody tell me what is wrong with my code, or suggest another way to merge them?
My code is below:
plink2 --bgen /disk/disk1/UKB/Gene/Imputation/ukb_imp_chr${chr}_v3.bgen 'ref-first' --sample /disk/disk1/UKB/Gene/Imputation/sampleID.sample --extract-col-cond ukb_mfi_chr${chr}_v3.txt 8 2 --extract-col-cond-min 0.5 --make-pgen --out out

How should QC be done?

Hello!
I would like to know: once I have the imputed data, should I run SNP QC on each chromosome and then merge all 22 files, or should I merge chr1-22 first and then run QC? Please let me know; I would very much appreciate it!

Where to find pheWAS result of V3?

Dear Sir/Madam,

Where can I find the pheWAS summary statistics of V3? It looks like PheWeb is based on V2, right?

http://pheweb.sph.umich.edu:5000/pheno/M06

Thanks

Shicheng

Broken links to the GWAS results

It seems that the GWAS results on Dropbox cannot be reached anymore. Starting from the Neale Lab website, I accessed the manifest file, but all the links I tried are broken, returning a 404 error.

Unrelatedness in sample QC?

Hi,

I have a general question regarding relatedness in sample QC. As mentioned here (https://github.com/Nealelab/UK_Biobank_GWAS#imputed-v3-sample-qc), the Used.in.pca.calculation filter should remove related individuals from the sample. However, when I remove individuals based on this filter and other filters (e.g. sex chromosome aneuploidy and White.british.ancestry) from the relatedness pairs (fetched via ukbgene rel), there are still some pairs of related individuals.
Can you please help me to understand this? Am I doing something wrong here?

Thanks!
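The check described above amounts to counting the relatedness pairs in which both members survive the sample filters. A minimal sketch, assuming the pairs have already been read into tuples of two sample IDs and that `kept_ids` holds the samples remaining after QC (the IDs shown are hypothetical):

```python
def remaining_related_pairs(pairs, kept_ids):
    """Return the related pairs in which both individuals are still in the sample."""
    kept = set(kept_ids)
    return [(a, b) for a, b in pairs if a in kept and b in kept]


# Hypothetical IDs for illustration only.
pairs = [("1001", "1002"), ("1003", "1004"), ("1005", "1006")]
kept_ids = ["1001", "1002", "1003", "1006"]

still_related = remaining_related_pairs(pairs, kept_ids)
print(still_related)
```

A pair only counts as unresolved when both IDs are kept; pairs with one member removed are already broken by the filter.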

Where can we get the sample size of cases?

Thanks

european_samples.tsv is ambiguous for mapping to application-specific ids

The file european_samples.tsv from
https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/european_samples.tsv.bgz
contains plate and well ids, which are supposed to be used to obtain application-specific sample ids from ukb_sqc_v2.txt.
However, the batch id is also required, as plate and well ids do not map to sample ids unambiguously. For instance, the following entries appear twice in european_samples.tsv:
SMP4_0014640A H04
SMP4_0014502A E05

Overall, the following entries from european_samples.tsv appear twice in ukb_sqc_v2.txt in different batches:
SMP4_0013746A H09
SMP4_0014502A A08
SMP4_0014502A E05
SMP4_0014503A F01
SMP4_0014641A B04
SMP4_0014641A C05
SMP4_0016202A B01
SMP4_0016202A C01
SMP4_0012383A C09
SMP4_0014640A H04

Sex and self-reported British ancestry are not sufficient to resolve the ambiguities for all samples.
Can we still have european_samples.tsv with batch id (e.g. Batch_b043, Batch_b053, ...) added?

Thanks in advance

Dmitriy
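The ambiguity can be detected by counting how often each key occurs. The sketch below assumes rows of (plate, well, batch); the plate and well values are taken from the report above, while the pairing with particular batch labels is hypothetical.

```python
from collections import Counter


def duplicated_keys(rows, n_fields):
    """Return the keys, built from the first n_fields columns, that occur more than once."""
    counts = Counter(tuple(row[:n_fields]) for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)


# Plate and well values from the report; the batch assignment is hypothetical.
rows = [
    ("SMP4_0014640A", "H04", "Batch_b043"),
    ("SMP4_0014640A", "H04", "Batch_b053"),
    ("SMP4_0014502A", "A08", "Batch_b043"),
]

ambiguous_without_batch = duplicated_keys(rows, 2)
ambiguous_with_batch = duplicated_keys(rows, 3)
print(ambiguous_without_batch)
print(ambiguous_with_batch)
```

If the key that includes the batch column has no duplicates, adding the batch id to european_samples.tsv would be enough to make the mapping unique.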

how to get the sample ID of the ukb_sqc_v2.txt file?

I have the ukb_sqc_v2.txt file, which contains 488377 rows excluding the header. After applying some criteria, I get a QCed sample of 337308. My question is: how can I get the sample IDs for the 337308 QCed samples? Thank you very much.
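Another report in this list states that the sqc rows and the .fam rows are in the same order. Under that assumption, which should be verified against your own files, IDs can be attached by row position. A minimal sketch, using the individual ID in the second column of each PLINK .fam line; the toy lines and QC flags are hypothetical.

```python
def qced_sample_ids(fam_lines, qc_pass):
    """Pair .fam rows with per-row QC flags by position and return the passing IDs.

    Assumes (not verified here) that row i of the sample QC file describes
    the same individual as row i of the .fam file.
    """
    if len(fam_lines) != len(qc_pass):
        raise ValueError("row counts differ; the files cannot be paired by position")
    # PLINK .fam columns: family ID, individual ID, father, mother, sex, phenotype.
    ids = [line.split()[1] for line in fam_lines]
    return [sample_id for sample_id, keep in zip(ids, qc_pass) if keep]


# Hypothetical toy input.
fam_lines = [
    "1001 1001 0 0 2 -9",
    "1002 1002 0 0 1 -9",
    "1003 1003 0 0 2 -9",
]
qc_pass = [True, False, True]

kept = qced_sample_ids(fam_lines, qc_pass)
print(kept)
```

The length check is a cheap safeguard: if the two files differ in row count, positional pairing is not valid.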

Dropbox links aren't working

The links (and wget commands) for the full GWAS results (UKBB GWAS Imputed v3 - File Manifest Release 20180731) are not working. Has the location changed?

Biobank QC and GWAS

Hello,

I was wondering if I could ask: during QC, did you remove all subjects in the category "Participant excluded from kinship inference process"?

Or did you keep only "No kinship found" and "NA" from the genetic_kinship_to_other_participants_f22021_0_0 Biobank data field?

I am trying to run a GWAS myself and I am getting around 17000 more subjects than you used in your GWAS.

Can you please advise,
Thanks
Ana

Downloading self reported ancestry

Hello there,

Apologies for this very simple question but I am trying to download the self reported ancestry so that I can extract out different ethnicities for my GWAS.

I have so far been unable to find this information successfully on the UKBB website. I am interested in the PC correlated self reported ethnicity and am after European, African and South East Asian populations.

Thank you very much
Sanjana

pheno 137

Hello,

I was just wondering if you know why all variants in the file located here: wget https://www.dropbox.com/s/qz4bu9lffse7q3l/137.gwas.imputed_v3.female.tsv.bgz?dl=0 -O 137.gwas.imputed_v3.female.tsv.bgz, are low confidence variants, while in the male and all_sex version there are millions of "low_confidence_variant==FALSE".

Thanks in advance,
Jenny

Running ukb31063_eur_selection.R

Hi,

I am having some difficulty running the ukb31063_eur_selection.R script provided on the repo to subset to European samples within my UKBB data.

In particular, I have two questions:
1) qc <- fread("ukb31063_sample_qc.tsv", sep='\t', header=T, stringsAsFactors=F, data.table=F)
Is this file the ukb_sqc_v2.txt file, which contains headings such as Genotyping.array Batch Plate.Name Well Cluster.CR....?
2) If so, I have tried to use the partial script that @astheeggeggs has provided in another issue (#29) to generate a parsed tsv for the phenotypes. However, when I try to run through this script, I get a warning, and then later an error when I run this part of the code:

phens <- fread("ukbb_download_27864/marcus_parsed.tsv", sep='\t', header=T, stringsAsFactors=F, data.table=F, select=unname(ph_cols))
Read 502536 rows and 1 (of 1263) columns from 1.111 GB file in 00:00:11
Warning messages:
1: In fread("ukbb_download_27864/marcus_parsed.tsv", sep = "\t", header = T, :
Column name 'x1647_0_0' not found in column name header (case sensitive), skipping.
2: In fread("ukbb_download_27864/marcus_parsed.tsv", sep = "\t", header = T, :
Column name 'x20115_0_0' not found in column name header (case sensitive), skipping.
3: In fread("ukbb_download_27864/marcus_parsed.tsv", sep = "\t", header = T, :
Column name 'x21000_0_0' not found in column name header (case sensitive), skipping.

names(phens) <- names(ph_cols)
Error in names(phens) <- names(ph_cols) :
'names' attribute [4] must be the same length as the vector [1]

Any help with this would be greatly appreciated!
Thanks,
Marcus
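The error follows from the warnings: three of the four requested columns were skipped, so one column was read while four names were assigned. Checking the header before selecting columns makes the mismatch explicit. A minimal sketch in Python, where the header line and the short names are hypothetical; only the three column names from the warnings come from the report.

```python
def split_requested_columns(header_line, wanted, sep="\t"):
    """Split requested columns into those present in the header and those missing.

    `wanted` maps a short name to the exact (case sensitive) column name.
    """
    header = set(header_line.rstrip("\n").split(sep))
    present = {name: col for name, col in wanted.items() if col in header}
    missing = {name: col for name, col in wanted.items() if col not in header}
    return present, missing


# Hypothetical header; a real parsed file would have many more columns.
header_line = "userId\tx31_0_0\tx34_0_0\n"
wanted = {
    "userId": "userId",
    "f1647": "x1647_0_0",
    "f20115": "x20115_0_0",
    "f21000": "x21000_0_0",
}

present, missing = split_requested_columns(header_line, wanted)
print(sorted(present))
print(sorted(missing.values()))
```

Renaming only the columns listed in `present` avoids the length mismatch, and `missing` shows which fields are absent from the parsed file.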

covariates

Hi, I was wondering why age*sex and age*sex^2 are included as covariates. Are age and sex not enough as covariates?
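For context, interaction terms between age and sex let the age effect differ between the sexes, which age and sex alone cannot capture. The sketch below builds such a covariate row, assuming sex is coded 0/1 and reading the quadratic interaction as age squared times sex; both the coding and the column order are assumptions, not taken from the pipeline.

```python
def covariate_row(age, sex, pcs):
    """Build one covariate row with age-by-sex interaction terms.

    Assumes sex is coded 0/1. Column order is illustrative:
    age, sex, age^2, age*sex, age^2*sex, then the principal components.
    """
    return [age, sex, age ** 2, age * sex, age ** 2 * sex, *pcs]


# With sex = 0 the interaction columns are zero, so only the shared age terms apply.
row_sex0 = covariate_row(50, 0, [0.1])
# With sex = 1 the interaction columns equal age and age^2, adding a sex-specific
# slope and curvature on top of the shared ones.
row_sex1 = covariate_row(50, 1, [0.1])

print(row_sex0)
print(row_sex1)
```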

Ordinal phenotypes

Hello,

I have a question about the processing of ordinal phenotypes. In the modified PHESANT repository (here), it doesn't look like the ordinal phenotypes are transformed with the irnt function, but in the blog post it says that all phenotypes are either continuous or binary, and continuous traits were rank transformed to have a normal distribution and I can't find any mention of processing of ordinal traits specifically. I believe the source PHESANT code performs an ordered logistic regression on ordinal phenotypes, but this was modified for your uses I assume. How were they handled in the Neale GWAS? I assume as continuous traits? So I was just wondering if they underwent irnt transformation first?

Also, were all continuous (and ordinal?) traits standardized before GWAS in Hail?

Thanks in advance,
Jenny

citation

Hi, I am downloading the VCF file associated with "Diagnoses - main ICD10: F31 Bipolar affective disorder ukb-a-525". May I ask for your guidance on how I should cite this? I understand that for using the GWAS-VCF files I should cite "The variant call format provides efficient and robust storage of GWAS summary statistics. Matthew Lyon, Shea J Andrews, et al.". However, I would also like to cite the original publication associated with the study "Bipolar affective disorder ukb-a-525", but I cannot find it.
Thanks for your help,
Best,
sofia

Order of the code in 0.2

I would really appreciate you telling me how to run the v3 script. The order in which the code is run is not explained under the 0.2 folder.

Imputed dosage/GT/GP? Which and why?

Hi,

I have some questions on GWAS data reported here on imputed BGEN files. As far as I understood, you used dosage data for regression analysis, am I right? If so, is there any reason for that? Have you compared the results to genotype hard calls ('GT'), or genotype probabilities ('GP')? I'm asking this, because in the several GWAS reports published on BGEN files in UKBB, I cannot find a clear explanation on the method used.
Moreover, I can see that Hail takes the maximum probability for genotype hard calls; however, the approach of tools such as PLINK may be a bit different (of course it depends on user input flags such as --import-dosage-certainty). So, do you think taking the maximum probability is reasonable, or would you rather apply a threshold (e.g. 0.9) or the approach described on the PLINK2 page (under Dosage import settings)?

Many thanks in advance,
Best
Oveis
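To make the three representations concrete, the sketch below derives a dosage and a hard call from one triple of genotype probabilities. It assumes bi-allelic variants with probabilities ordered (AA, Aa, aa) and a dosage that counts the second allele; the optional threshold corresponds to the 0.9 example in the question. This is an illustration of the general idea, not the implementation of Hail or PLINK.

```python
def dosage(gp):
    """Expected count of the second allele from (P(AA), P(Aa), P(aa))."""
    _p_aa, p_ab, p_bb = gp
    return p_ab + 2.0 * p_bb


def hard_call(gp, threshold=None):
    """Return 0, 1 or 2 for the most probable genotype, or None for missing.

    The call is missing when the maximum is not unique, or when a
    threshold is given and the maximum probability falls below it.
    """
    probs = list(gp)
    best = max(probs)
    if probs.count(best) != 1:
        return None
    if threshold is not None and best < threshold:
        return None
    return probs.index(best)


gp = (0.1, 0.3, 0.6)
print(dosage(gp))                    # expected allele count, keeps the uncertainty
print(hard_call(gp))                 # most probable genotype
print(hard_call(gp, threshold=0.9))  # missing under a strict threshold
print(hard_call((0.4, 0.4, 0.2)))    # missing because the maximum is tied
```

The dosage keeps the imputation uncertainty as a continuous value, whereas a hard call discards it and introduces missingness that depends on the chosen rule.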

Some analysis problem

Hello,
I am also analyzing the UK Biobank data. I followed your workflow but got different results.

The sqc data and .fam data have the same order, so I combined them with fam_sqc_merge.R. But the intersection between the combined sqc data and the phenotype data is 337198 samples. The sample (eid = -11, nrow=404992) is deleted by my program. Could you tell me the IDs of the samples you selected?
I used BOLT-LMM to analyze the association between standing height and genotypes. But I also find some differences, including the selected SNPs, dosages and betas. I would like to know how you get the pHWE: is it computed from the information in the bgen files or not?
Thanks!
