mareesat / gwa_tutorial Goto Github PK
View Code? Open in Web Editor NEWA comprehensive tutorial about GWAS and PRS
A comprehensive tutorial about GWAS and PRS
Hi Andries,
any plan to have an update to GWA_tutorial?
Shicheng
I am reproducing the whole tutorial.
At <3_Main_script_association_GWAS.txt>, ##Permutation:
The reduce computational time we only perform this test on a subset of the SNPs from chromosome 22.
awk '{ if ($4 >= 21595000 && $4 <= 21605000) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt
should subset the SNP from Chromosome 22. But when I check subset_snp_chr_22.txt, I found the SNPs from other chromosome.
Here I would like to ask what is your original intention:
Do you want to say:"Subset of every chromosome's position 21595000 to 21605000", the descirption is wrong.
Or,
Description is correct, but command need to be revised
awk '{ if ($4 >= 21595000 && $4 <= 21605000 && $1 == 22) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt
Hi,
I found using the current command will result in all variant id specified as "." in the bim file. Is that wrong? Shall we set variant id manually using the parameter --set-missing-var-ids?
**** Problem solved ****
As stated in the tutorial, changing the CRAN mirror to your own VPN would solve this problem.
**** Problem solved ****
I got an error when downloading qqman package from http://cran.cnr.berkeley.edu/ (which is written in Manhattan_plot.R of the tutorial), and I believe it is because there is problem in the download source.
I'm using R version 4.1.1 (2021-08-10), which meets the prerequisite of qqman installation.
Could you guide me to the currently available download site of qqman package? Thank you!
--------- My input ----------
(plink) Rscript --no-save Manhattan_plot.R
--------- Full text of the error ----------
Warning: unable to access index for repository http://cran.cnr.berkeley.edu/src/contrib:
cannot open URL 'http://cran.cnr.berkeley.edu/src/contrib/PACKAGES'
Warning: unable to access index for repository http://cran.cnr.berkeley.edu/bin/macosx/contrib/4.1:
cannot open URL 'http://cran.cnr.berkeley.edu/bin/macosx/contrib/4.1/PACKAGES'
Warning message:
package ‘qqman’ is not available for this version of R
A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Error in library("qqman", lib.loc = "~") :
there is no package called ‘qqman’
Execution halted
I have a doubt on this step in Population stratification. I can understand why we are changing the reference alleles based on the reference data set. But is it okay to do that? Because, we are generating vcf files based on a reference genome (ex: GRCh37) and the 1000 genome data set is made up based on GRCh38. So I'm changing the reference alleles based on 1000 genome data set. What is the point in generating vcf files using our preferred reference genome? We could have created vcf files using the other reference genome initially itself. But why we are not doing that? Please resolve my doubt.
Thanks
When generating a plot to assess the type of relationship in the sample, I do get an R script erro.
$ Rscript --no-save Relatedness.R
Error in legend(1, 1, xjust = 1, yjust = 1, legend = levels(relatedness$RT), :
'legend' is of length 0
Execution halted.
What might be the problem?
In GWAS analysis, this process does not add a kinship matrix to run the GWAS. However, My data is family-level, I want to added the kinship to run the GWAS, Whether any parameter of plink can be added to kinship, thanks.
Hi,
Thank you for the paper and the step-by-step tutorial! Those are of great help!
May I ask a question regarding your paper? It says that the following:
Individuals who are outliers based on the MDS analysis should be removed from further analyses. After the exclusion of these individuals, a new MDS analysis must be conducted, and its main components need to be used as covariates in the association tests in order to correct for any remaining population stratification within the population.
Does it mean those MDS components should be added as covariates regardless whether they are significantly associated with the trait? if so, could you explain why?
I want to practice my genetic data following your tutorial, but I do not know how to create the mydata-based inversion.txt file. Could you please tell me in detail. Thank you very much.
I have set Rscript.exe in the bin/,
When i try to run : $ Rscript --no-save gender_check.R
It shows me:
The system cannot find the path specified.
What should I do, Thanks!!!
vi 0.2_low_call_rate_pihat.txt
i
13291 NA07045
:x
Hi,
I noticed using the command wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz from the 2_Main_script_MDS.txt downloads a corrupted file. Hence creating plink files from the vcf resulted in file read failure error. But when ftp is replaced with https like this, wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz, there is no file corruption issue and the download is faster as well.
Thanks
Hi,
I am new to GWAS, I have 100+ exome SNV annotated files, i want to create same bim, bed and fam files.
Can you/anyone pls help me with how ( bim, bed and fam ) are created for HapMap_3_r3 from hapmad Db, that will help me a lot because those are most important inputs for further analysis even before Step1.
While running:
# Select autosomal SNPs only (i.e., from chromosomes 1 to 22).
awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt
plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7
Here are the error logs:
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to HapMap_3_r3_7.log.
Options in effect:
--bfile HapMap_3_r3_6
--extract snp_1_22.txt
--make-bed
--out HapMap_3_r3_7
16125 MB RAM detected; reserving 8062 MB for main workspace.
1430443 variants loaded from .bim file.
165 people (81 males, 84 females) loaded from .fam.
112 phenotype values loaded from .fam.
Error: No variants remaining after --extract.
Here are the HapMap_3_r3_6.bim:
1 rs2185539 0 556738 T C
1 rs11240767 0 718814 T C
1 rs3131972 0 742584 A G
1 rs3131969 0 744045 A G
1 rs1048488 0 750775 C T
Here are the snp_1_22.txt
rs2185539
rs11240767
rs3131972
rs3131969
rs1048488
I have searched through Google and documents, but still haven't found a solution.
Your input or guidance on this would be highly appreciated
Your paper suggests the following:
For binary traits we suggest to exclude: HWE p value <1e −10 in cases and <1e−6 in controls. Less strict case threshold avoids discarding disease‐associated SNPs under selection. For quantitative traits, we recommend HWE p value <1e‐6.
May I ask two questions?
Hi Marees,
Thanks for the pipeline.
I have around 5k samples in pihat_min0.2_genome and 390 in zoom_pihat_genome.
If I sort them, I could identify the related individuals but don't know how to select samples for '0.2_low_call_rate_pihat.txt'.
This part alone is not clear to me, and everything else is perfect.
Thanks again.
Regards,
Mano
Hi,
I am running the codes and I got an error in the following code:
system("awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt")
system("plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7")
The error is:
Error: Failed to open snp_1_22.txt.
Hello - thanks a lot for your efforts, it worked well (tried it in the example datset). Can you please briefly explain what needs to be changed/adapted in the 1st tutorial to run a quantitative trait quality control?
Again, thanks a lot.
Samar
Hello,
Thank you for your great tutorial. I followed your tutorial step-by-step, but in Population_stratification step, for run Rscript MDS_merged.R, I got an error about can not open MDS_merge2.mds. Actually, I do not know why did not make this file in the previous step?
I just have:
MDS_merge2.bed
MDS_merge2.bim
MDS_merge2.fam
MDS_merge2.genome
MDS_merge2.log
MDS_merge2.nosex
Could you guide me?
Thank you.
May I ask "For each pair of 'related' individuals with a pihat > 0.2, we recommend to remove the individual with the lowest call rate"? What does call rate mean in this sentence, and why do I need to "plink --bfile HapMap_3_r3_11 --missing" after getting the file pihat_min0.2_in_founders.genome? Of course the crux of the matter is how to get the result by the above operation, individual 13291 NA07045 has the lowest call rate, i.e. how to get the data in 0.2_low_call_rate_pihat.txt?
Thank you for your tutorial in advance.
I have some confusion about the "inversion.txt" in the 1_QC_GWAS folder. I know it is for excluding high inversion regions. However, where is the "inversion.txt" come from? Since there are many other regions of high linkage disequilibrium, why you just choose these regions?
Can you give some more detailed explanation here?
Thank you~
May I ask how to explain my qqplots? It looks soooo strange...
There are 2 folders standing for different dataset and both containing association and logistic analysis with plink.
GWAS.zip
Thank you for your checking on my problems and prompt reply!!!
Respectly MareesAT :
When I running the command, plink --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz --make-bed --out ALL.2of4intersection.20100804.genotypes, I have got the work-log below:
@----------------------------------------------------------@
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
@----------------------------------------------------------@
Web-based version check ( --noweb to skip )
Recent cached web-check found...Problem connecting to web
Writing this text to log file [ ALL.2of4intersection.20100804.genotypes.log ]
Analysis started: Wed Sep 14 16:15:57 2022
** Unused command line option: --vcf
** Unused command line option: ALL.2of4intersection.20100804.genotypes.vcf.gz
ERROR: Problem parsing the command line arguments.
Could you please solve it ?
All's best!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.