nealelab / uk_biobank_gwas
Overview of the data QC, code, and GWAS summary output from the 2017 UK Biobank data release
Hello,
I am wondering whether your group ran association analyses that include chromosome X for UKB. I couldn't find any information about chromosome X or its QC steps in your repository.
Cheers,
Ana
Dear all,
I have access to UK biobank data.
I am looking for a specific file (ukb_sqc_vZ.txt, previously named ukb_sqc_v2.txt).
I need this file to identify the in.white.British.ancestry.subset samples.
I could not find it on the UK Biobank site, nor any code to download it from UK Biobank.
They mentioned that it can be obtained from EGA, but even there it is not available unless UK Biobank grants access.
Any idea how to get it directly from UK Biobank?
Regards
Hi all,
Thanks for sharing your code!
I am looking to reproduce your GWAS results for a small number of phenotypes as part of a project on the genetic structure among autoimmune disorders.
However, after applying your sample QC filters to my data, I am left with about 4k more samples than the stated 361194:
# filter pass_alone pass_cumulative
# 1: all 488377 488377
# 2: used.in.pca 407219 407219
# 3: sex.aneuploidy 487725 406825
# 4: pc.outlier 477101 396439
# 5: ethnicity.included 444409 364965
# 6: consent 488366 364965
Would you be able to comment on this difference based on the relevant code? It would be good to know how exactly you have filtered samples by ethnicity and identified outliers based on the principal components, if this is different from the way I have done it here.
Many thanks,
Chris
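For comparing counts like those in the table above, here is a minimal sketch of how per-filter and cumulative pass counts can be tallied. The flag names and toy rows are hypothetical, and this is not the Neale lab's code; it only shows that pass_cumulative depends on the order and exact definition of each filter, which is where a 4k difference could come from.

```python
from typing import Dict, List, Tuple


def tally_filters(
    samples: List[Dict[str, bool]], filters: List[str]
) -> List[Tuple[str, int, int]]:
    """Return (filter, pass_alone, pass_cumulative) rows, applying filters in order."""
    rows = [("all", len(samples), len(samples))]
    surviving = samples
    for name in filters:
        # pass_alone: samples passing this filter regardless of the others
        pass_alone = sum(1 for s in samples if s[name])
        # pass_cumulative: samples passing this filter and every earlier one
        surviving = [s for s in surviving if s[name]]
        rows.append((name, pass_alone, len(surviving)))
    return rows


# Toy data with hypothetical boolean flags (True means the sample passes).
toy_samples = [
    {"used.in.pca": True, "sex.aneuploidy": True},
    {"used.in.pca": True, "sex.aneuploidy": False},
    {"used.in.pca": False, "sex.aneuploidy": True},
    {"used.in.pca": True, "sex.aneuploidy": True},
]
toy_rows = tally_filters(toy_samples, ["used.in.pca", "sex.aneuploidy"])
```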
Hi,
For the three GWAS samples of each phenotype (female, male, both sexes), I am wondering in which of the following ways the IRNT was conducted:
Thank you!
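The options listed in the original question are not preserved here. One distinction that matters for sex-stratified analyses is whether the inverse rank normal transform (IRNT) is applied once to the pooled sample or separately within each subsample. The sketch below illustrates that difference only; the rank offset of 0.5 is an assumption, and nothing here states which approach the Neale lab used.

```python
from statistics import NormalDist
from typing import List


def irnt(values: List[float]) -> List[float]:
    """Inverse rank normal transform using average ranks for ties.

    Each value is mapped to the standard normal quantile of (rank - 0.5) / n.
    The 0.5 offset is one common convention, assumed here.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Toy phenotype where one group has systematically lower values.
values = [1.0, 2.0, 3.0, 4.0]
is_female = [True, True, False, False]

pooled = irnt(values)  # transform once, then subset by sex
within_female = irnt([v for v, f in zip(values, is_female) if f])  # transform inside the subset
```

With pooled ranks, the first group keeps its shift relative to the other group (both of its transformed values are negative), whereas transforming within the group recentres it around zero.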
Hi,
I can't seem to find information on how sex was determined for the sex-stratified analyses. Is this based on self-report or genetic inference?
Thanks,
Emil
Hi!
I recently downloaded GWAS summary statistics for almost all UKBB phenotypes (both sexes only) via the AWS link found in the provided Google document. To ensure data integrity, I conducted an MD5 checksum verification (on Linux system). However, I encountered a recurring issue with 22 specific phenotypes failing the MD5 checksum validation across two separate tests.
For example, for phenotype ID 1220, accessed through the link https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/1220.gwas.imputed_v3.both_sexes.tsv.bgz, the MD5 checksum I computed was 5ee1df62d2a6608c942ec85fd712bf9a. This differs from the expected checksum provided, which is 80d2c21a425aee154d585cd20ffa1e8c.
I have listed all affected phenotypes below for your reference.
Could you please investigate this discrepancy?
Thank you for your attention to this matter!
135 Number of self-reported non-cancer illnesses
136 Number of operations, self-reported
137 Number of treatments/medications taken
398 Number of correct matches in round
403 Number of times snap-button pressed
709 Number in household
924 Usual walking pace
1160 Sleep duration
1190 Nap during day
1200 Sleeplessness / insomnia
1220 Daytime dozing / sleeping (narcolepsy)
1239 Current tobacco smoking
1289 Cooked vegetable intake
1518 Hot drink temperature
1548 Variation in diet
1687 Comparative body size at age 10
1697 Comparative height size at age 10
1873 Number of full brothers
1883 Number of full sisters
2237 Plays computer games
2296 Falls in the last year
2306 Weight change compared with 1 year ago
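For anyone wanting to repeat the check independently of the md5sum tool, here is a minimal sketch that computes the MD5 of a downloaded file in chunks and compares it with an expected value. The function names are illustrative; the expected checksum would come from the manifest.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: str, expected: str) -> bool:
    """Compare a file's MD5 with an expected hex string, ignoring case and whitespace."""
    return md5_of_file(path) == expected.strip().lower()
```

A call such as checksum_matches("1220.gwas.imputed_v3.both_sexes.tsv.bgz", "80d2c21a425aee154d585cd20ffa1e8c") would return False for the file described above.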
In this doc, the "Manifest 201807" tab has many cells with a REF issue.
Is there a corrected version of the tab?
Thanks in advance!
Hello, could you tell me how I can get the code for imputed-v3-gwas? Or is the imputed-v3-gwas pipeline not going to be open source?
Hi,
When I analyze the height phenotype and compare my results to yours (version 2), the allele dosage is the same, but the ytx is not; the difference can be large. In your results, I find some variants where ytx is greater than zero but beta is less than zero. I would like to know how you calculate it. Is it the product with the residual of y after regressing on sex+PC{1:10}?
If you tell me how it is calculated, I can revise my program and get the right answer.
Many thanks for your help!
Sheng
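A small worked example may help with the sign question. It assumes ytx is the plain dot product of the phenotype vector y and the dosage vector x, with no centering or residualization; that is my reading of the output column, not something confirmed in this thread. Under that assumption, a strictly positive phenotype such as height gives a positive ytx even when the regression slope is negative.

```python
from typing import List


def dot(y: List[float], x: List[float]) -> float:
    """Plain dot product sum(y_i * x_i)."""
    return sum(yi * xi for yi, xi in zip(y, x))


def ols_slope(y: List[float], x: List[float]) -> float:
    """Slope of y regressed on x with an intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


height = [180.0, 170.0, 160.0]  # toy phenotype, all positive
dosage = [0.0, 1.0, 2.0]        # toy allele dosages

ytx = dot(height, dosage)         # positive, because every term is non-negative
beta = ols_slope(height, dosage)  # negative, because height falls as dosage rises

# If y is mean-centered first, the dot product shares its sign with the slope.
mean_height = sum(height) / len(height)
ytx_centered = dot([h - mean_height for h in height], dosage)
```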
Hi
I'd asked a question before (closed by me, as it was not so specific!) regarding the GWAS done on imputed dosages in UKB. I however intend to paraphrase my old question:
That being said, do you think that assuming there are no missing values for genotype probabilities in BGEN files does not affect the downstream GWAS results?
Many thanks in advance,
Best
Oveis
Dear all,
I don't know how to generate the file 'gs://ukb31063/hail/ukb31063.neale_gwas_variants.ht'. Could you please help me solve this problem?
Thanks
Hello,
Can you please tell me where and how you downloaded the ukb_sqc_v2.txt file?
Thanks
Ana
Hi there,
I hope you're doing well. Apologies for this very simple question. What variables or Field IDs does the file "neale_lab_parsed.tsv" include? I want to generate the same file so that I can run the script "ukb31063_eur_selection.R". Your help will be greatly appreciated! Thanks a lot.
All the best!
Charles
Hi, I have a few questions on the general organization in this repo:
Any help would be much appreciated! Thank you.
I am wondering what w/MAF means in "HWE p-value > 1e-10 * Exception: VEP annotated coding w/MAF < 0.001".
Hi,
For the regression tests, it seems that linear regression tests are used also for binary outcomes. Would linreg3 be able to figure out the data type of outcomes and use the logistic regression test for binary outcomes?
Thanks,
Amy
If we wanted to reproduce the results from this analysis, would it be possible to get an estimate of what kind of computing resources are necessary? How many nodes do you need in a Spark cluster to do this many regressions and what specifications do those nodes have? How long might we expect it to take for a given cluster configuration?
Thank you.
I am posting this issue here, because I did not get a response to an email I sent to [email protected] six weeks ago.
I am wondering about the N for pregnancy related outcomes in the downloadable GWAS summary data, which appears to be faulty to me.
For example, the N for “Diagnoses - main ICD10: O24 Diabetes mellitus in pregnancy” is 337,199. However, there are only around 270,000 women in the UK biobank.
I suspect that 337,199 is the number of individuals (male and female) that were analysed. I would think that the analysis of pregnancy-related outcomes should be limited to women who reported a live birth or stillbirth.
Alternatively, I am misunderstanding the way the analysis was set up.
Could you clarify if the analysis of pregnancy related variables is erroneous, or if I misunderstood something?
Hi,
Thank you so much for sharing the scripts and for the detailed description of your GWAS pipeline, this information is really helpful.
I've noticed that the README.md file (in the root of your repository) refers to a 0.0001 MAF threshold (for both v2 and v3 marker QC), but the blog mentions 0.001, which seems to be consistent with your code ( https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/imputed-v2-gwas/6_filter_gwas_variants.py#L21 ). Should the README.md file be corrected?
Also, on a totally separate note, are you planning to share the v3 scripts at some point (similarly to those in the "imputed-v2-gwas" folder)? If so, that would be really helpful.
Hello, I just got the UKBB data and I want to know which file we should use to start the GWAS analysis. I mean the imputation data file.
If you could give me some tips and comments, I would appreciate your help. Thanks.
Would it be possible to get the file structure of your project in 0.2?
Hi,
I'm having a hard time understanding the extraction of binary phenotypes from ICD-10 codes in UKBB. I have tried to summarize my questions in a few points (I would deeply appreciate any help with them!).
UKBB phenotypes were extracted using customized PHESANT, and phenotype details can be accessed through Updated v2 phenotype summary file - both_sexes. Looking at some binary phenotypes, I realized that they are extracted only using main ICD-10 codes (data-field 41202). Is there any particular reason that you did not include secondary ICD-10 codes (data-field 41204)? I'm asking this, because first, there is no information in customized PHESANT page (or at least, I failed to find it) about this, and secondly, some other studies also include secondary ICD-10 codes to extract binary phenotypes.
It seems that all arrays for main ICD-10 codes were merged together; for instance, if input phenotype code is X111, customized PHESANT scans the all 66 arrays (categorical multiple) for at least one appearance of X111 (as described in Supplementary section S2: Dealing with fields with multiple time points or multiple measurements at the same time point of PHESANT paper). Am I right about this?
If the answer to the above question is YES (i.e. at least one appearance of the input ICD-10 code over all 66 arrays), how should multiple ICD-10 codes be handled? For instance, N18 Chronic renal failure covers N180, N181, N182, N183, N184, N185, N188, N189 codes.
As also mentioned in the PHESANT paper, fields containing >1 instance won't be scanned. But, in case of data-field 41202, 66 arrays correspond to different diagnosis time points (from 1992-2017). Considering the fact that these binary traits will be used for GWAS studies, using different covariates (e.g. age, BMI and gender data were recorded at different time points than those of binary phenotypes), do you think the results of such studies are reliable in this sense?
Thanks in advance for your patience and response!
Best
Oveis
Dear all:
I have a tough problem: I want to merge all the imputed data with the INFO data provided by UKBB, but the merge fails. Can somebody tell me what's wrong with my code, or suggest another way to merge them?
My code is below:
plink2 --bgen /disk/disk1/UKB/Gene/Imputation/ukb_imp_chr${chr}_v3.bgen 'ref-first' --sample /disk/disk1/UKB/Gene/Imputation/sampleID.sample --extract-col-cond ukb_mfi_chr${chr}_v3.txt 8 2 --extract-col-cond-min 0.5 --make-pgen --out out
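As a way to inspect what the filter in the command above should select, here is a minimal sketch that reads lines of a per-chromosome MFI file and returns the variant IDs whose value column meets a minimum. The column positions (ID in column 2, INFO score in column 8) are taken from the arguments in the command; the whitespace-delimited layout and the example rows are assumptions.

```python
from typing import Iterable, List


def select_ids(
    lines: Iterable[str],
    id_col: int = 2,
    value_col: int = 8,
    min_value: float = 0.5,
) -> List[str]:
    """Return IDs (1-based id_col) whose numeric value_col is >= min_value."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < max(id_col, value_col):
            continue  # skip blank or short lines
        try:
            value = float(fields[value_col - 1])
        except ValueError:
            continue  # skip a header line or a non-numeric entry
        if value >= min_value:
            kept.append(fields[id_col - 1])
    return kept


# Hypothetical rows in an assumed 8-column layout.
example_lines = [
    "1:100_A_G\trs1\t100\tA\tG\t0.10\tG\t0.90",
    "1:200_C_T\trs2\t200\tC\tT\t0.02\tT\t0.30",
]
kept_ids = select_ids(example_lines)
```

Comparing the length of this list with the variant count PLINK reports is one way to check whether the filter is matching IDs at all.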
Hello!
When I get the imputed data, should I run SNP QC on each chromosome and then merge all 22 files, or should I merge chr1-22 first and then do the QC? Please let me know; I would really appreciate it!
Dear Sir/Madam,
Where can I find the pheWAS summary statistics for V3? It looks like pheweb is based on V2, right?
http://pheweb.sph.umich.edu:5000/pheno/M06
Thanks
Shicheng
It seems that the GWAS results on Dropbox cannot be reached anymore. Starting from the Neale Lab website, I accessed the manifest file, but all the links I tried are broken, returning a 404 error.
Hi,
I have a general question regarding relatedness in sample QC. As mentioned here (https://github.com/Nealelab/UK_Biobank_GWAS#imputed-v3-sample-qc), the Used.in.pca.calculation filter should remove related individuals from the sample. However, when I remove individuals based on this filter and other filters (e.g. sex chromosome aneuploidy and White.british.ancestry) from the relatedness pairs (fetched via ukbgene rel), there are still some pairs of related individuals.
Can you please help me to understand this? Am I doing something wrong here?
Thanks!
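The check described above can be sketched as follows: keep only the relatedness pairs in which both individuals survive the sample filters. The IDs are made up, and the pair list stands in for the first two columns of the relatedness file.

```python
from typing import Iterable, List, Set, Tuple


def remaining_related_pairs(
    pairs: Iterable[Tuple[str, str]], kept_ids: Set[str]
) -> List[Tuple[str, str]]:
    """Return pairs where both members are still in the retained sample set."""
    return [(a, b) for a, b in pairs if a in kept_ids and b in kept_ids]


# Hypothetical IDs: "C" was removed by the sample filters.
related_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
retained = {"A", "B", "D"}
still_related = remaining_related_pairs(related_pairs, retained)
```

A pair only disappears when at least one of its members is filtered out, so any pair returned here has both members passing every filter applied.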
Thanks
The file european_samples.tsv from https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/european_samples.tsv.bgz contains plate and well ids, which are supposed to be used to obtain application-specific sample ids from ukb_sqc_v2.txt.
However, the batch id is also required, as plate and well ids are not matching to sample ids unambiguously. For instance, the following entries appear twice in european_samples.tsv:
SMP4_0014640A H04
SMP4_0014502A E05
Overall, the following entries from european_samples.tsv appear twice in ukb_sqc_v2.txt in different batches:
SMP4_0013746A H09
SMP4_0014502A A08
SMP4_0014502A E05
SMP4_0014503A F01
SMP4_0014641A B04
SMP4_0014641A C05
SMP4_0016202A B01
SMP4_0016202A C01
SMP4_0012383A C09
SMP4_0014640A H04
Sex and self-reported British ancestry are not sufficient to resolve the ambiguities for all samples.
Can we still have european_samples.tsv with batch id (e.g. Batch_b043, Batch_b053, ...) added?
Thanks in advance
Dmitriy
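The ambiguity can be checked with a short sketch that counts how often each (plate, well) key occurs and reports the keys seen more than once. The rows below reuse two plate and well values quoted above; the batch labels attached to them are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ambiguous_keys(keys: Iterable[Tuple[str, ...]]) -> List[Tuple[str, ...]]:
    """Return keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]


# (batch, plate, well) rows; the batch values are made up for illustration.
rows = [
    ("Batch_b043", "SMP4_0014640A", "H04"),
    ("Batch_b053", "SMP4_0014640A", "H04"),
    ("Batch_b043", "SMP4_0014502A", "E05"),
]

by_plate_well = ambiguous_keys((plate, well) for _, plate, well in rows)
by_batch_plate_well = ambiguous_keys(rows)
```

In this toy data the plate and well pair alone is ambiguous, while adding the batch makes every key unique.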
I have the ukb_sqc_v2.txt file, which has 488377 rows excluding the header. After applying some criteria, I get a QCed sample of 337308. My question is: how can I get the sample IDs of those 337308 QCed samples? Thank you very much.
The links (and wget commands) for the full GWAS results (UKBB GWAS Imputed v3 - File Manifest Release 20180731) are not working. Has the location changed?
Hello,
I was wondering whether, during QC, you removed all subjects in the category "Participant excluded from kinship inference process",
or whether you kept only "No kinship found" and "NA" from the genetic_kinship_to_other_participants_f22021_0_0 Biobank data field.
I am trying to run the GWAS myself and I am getting around 17000 more subjects than you used in your GWAS.
Can you please advise,
Thanks
Ana
Hello there,
Apologies for this very simple question but I am trying to download the self reported ancestry so that I can extract out different ethnicities for my GWAS.
I have so far been unable to find this information successfully on the UKBB website. I am interested in the PC correlated self reported ethnicity and am after European, African and South East Asian populations.
Thank you very much
Sanjana
Hello,
I was just wondering if you know why all variants in the file located here: wget https://www.dropbox.com/s/qz4bu9lffse7q3l/137.gwas.imputed_v3.female.tsv.bgz?dl=0 -O 137.gwas.imputed_v3.female.tsv.bgz, are low confidence variants, while in the male and all_sex version there are millions of "low_confidence_variant==FALSE".
Thanks in advance,
Jenny
Hi,
I am having some difficulty running the ukb31063_eur_selection.R script provided on the repo to subset to European samples within my UKBB data.
In particular, I have two questions:
1)qc <- fread("ukb31063_sample_qc.tsv", sep='\t', header=T, stringsAsFactors=F, data.table=F)
Is this file the ukb_sqc_v2.txt file, which contains headings such as Genotyping.array Batch Plate.Name Well Cluster.CR....?
2)If so, I have tried to use the partial script that @astheeggeggs has provided in another issue (#29) to generate a parsed tsv for the phenotypes. However when I try to run through this script, I get a warning, and then later an error when I run this part of the code:
phens <- fread("ukbb_download_27864/marcus_parsed.tsv", sep='\t', header=T, stringsAsFactors=F, data.table=F, select=unname(ph_cols))
Read 502536 rows and 1 (of 1263) columns from 1.111 GB file in 00:00:11
Warning messages:
1: In fread("ukbb_download_27864/marcus_parsed.tsv", sep = "\t", header = T, :
Column name 'x1647_0_0' not found in column name header (case sensitive), skipping.
2: In fread("ukbb_download_27864/marcus_parsed.tsv", sep = "\t", header = T, :
Column name 'x20115_0_0' not found in column name header (case sensitive), skipping.
3: In fread("ukbb_download_27864/marcus_parsed.tsv", sep = "\t", header = T, :
Column name 'x21000_0_0' not found in column name header (case sensitive), skipping.
names(phens) <- names(ph_cols)
Error in names(phens) <- names(ph_cols) :
'names' attribute [4] must be the same length as the vector [1]
Any help with this would be greatly appreciated!
Thanks,
Marcus
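The warnings and the final error above are consistent with requested columns being absent from the file header, so fewer columns are read than there are names to assign. Here is a minimal sketch of checking this before reading. It is written in Python rather than R; the three missing column names come from the warnings, while the labels and the "userId" column are hypothetical.

```python
from typing import Dict, List, Tuple


def split_requested_columns(
    header: List[str], wanted: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split a label -> column-name mapping into columns present in or missing from a header."""
    available = set(header)
    present = {label: col for label, col in wanted.items() if col in available}
    missing = {label: col for label, col in wanted.items() if col not in available}
    return present, missing


# Hypothetical header and labels for illustration.
header = ["userId", "x31_0_0", "x34_0_0"]
wanted = {
    "id": "userId",
    "label_a": "x1647_0_0",
    "label_b": "x20115_0_0",
    "label_c": "x21000_0_0",
}
present, missing = split_requested_columns(header, wanted)
```

Renaming only with the labels in present avoids the length mismatch, and missing shows which fields would need to be added to the phenotype file.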
Hi, I was wondering why age*sex and age*sex^2 are included as covariates. Are age and sex alone not enough as covariates?
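To make the question concrete, here is a minimal sketch of the covariate terms involved. It reads the terms asked about as interactions between sex and age and between sex and age squared, which is an assumption about the intended notation, and it leaves out the principal components. With only age and sex in the model, the fitted age trend is forced to be the same in both sexes; the interaction columns let it differ.

```python
from typing import List


def covariate_row(age: float, sex: int) -> List[float]:
    """Return [age, age^2, sex, age*sex, age^2*sex] for one sample (sex coded 0 or 1)."""
    return [age, age ** 2, float(sex), age * sex, (age ** 2) * sex]


row_sex_one = covariate_row(50.0, 1)
row_sex_zero = covariate_row(50.0, 0)
```

For samples coded 0, the interaction columns are zero, so their coefficients only adjust the age trend in the group coded 1.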
Hello,
I have a question about the processing of ordinal phenotypes. In the modified PHESANT repository (here), it doesn't look like the ordinal phenotypes are transformed with the irnt function, but in the blog post it says that all phenotypes are either continuous or binary, and continuous traits were rank transformed to have a normal distribution and I can't find any mention of processing of ordinal traits specifically. I believe the source PHESANT code performs an ordered logistic regression on ordinal phenotypes, but this was modified for your uses I assume. How were they handled in the Neale GWAS? I assume as continuous traits? So I was just wondering if they underwent irnt transformation first?
Also were all continuous (and ordinal?) traits standardized before GWAS in HAIL?
Thanks in advance,
Jenny
Hi, I am downloading the VCF file associated with "Diagnoses - main ICD10: F31 Bipolar affective disorder ukb-a-525". May I ask for your guidance on how I should cite this? I understand that for using the GWAS-VCF files I should cite "The variant call format provides efficient and robust storage of GWAS summary statistics. Matthew Lyon, Shea J Andrews, et al.". However, I would also like to cite the original publication associated with the study "Bipolar affective disorder ukb-a-525", but I cannot find it.
Thanks for your help,
Best,
sofia
I would really appreciate you telling me how to run the v3 script. The order in which the code is run is not explained under the 0.2 folder.
Hi,
I have some questions on GWAS data reported here on imputed BGEN files. As far as I understood, you used dosage data for regression analysis, am I right? If so, is there any reason for that? Have you compared the results to genotype hard calls ('GT'), or genotype probabilities ('GP')? I'm asking this, because in the several GWAS reports published on BGEN files in UKBB, I cannot find a clear explanation on the method used.
Moreover, I can see that Hail takes the maximum probability for genotype hard calls, whereas the approach of tools such as PLINK may be a bit different (of course it depends on user input flags such as --import-dosage-certainty). So, do you think taking the maximum probability is reasonable, or would it be better to apply a threshold (e.g. 0.9) as described on the PLINK2 page (under Dosage import settings)?
Many thanks in advance,
Best
Oveis
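The three representations being compared can be illustrated with a small sketch. Given genotype probabilities (GP) for the genotypes with 0, 1 and 2 copies of the counted allele, the dosage is the expected allele count, a hard call takes the most probable genotype, and a thresholded hard call is set to missing when that probability is too low. The function names and the treatment of the threshold are illustrative, not the exact behaviour of Hail or PLINK.

```python
from typing import Optional, Sequence


def dosage(gp: Sequence[float]) -> float:
    """Expected allele count: P(1 copy) + 2 * P(2 copies)."""
    return gp[1] + 2.0 * gp[2]


def hard_call(gp: Sequence[float], threshold: Optional[float] = None) -> Optional[int]:
    """Most probable genotype, or None if its probability is below the threshold."""
    best = max(range(3), key=lambda g: gp[g])
    if threshold is not None and gp[best] < threshold:
        return None
    return best


uncertain_gp = (0.1, 0.6, 0.3)  # toy probabilities for 0, 1 and 2 copies

d = dosage(uncertain_gp)                             # keeps the uncertainty
call_max = hard_call(uncertain_gp)                   # most probable genotype
call_strict = hard_call(uncertain_gp, threshold=0.9)  # missing under a 0.9 threshold
```

For a confidently imputed genotype the three values agree; they diverge only for uncertain genotypes like the one above.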
Hello,
I am also doing some analysis of the UK Biobank data. I followed your workflow but got different results.