
gwa_tutorial's Introduction

GWA tutorial

This GitHub repository provides several tutorials about techniques used to analyze genetic data. Underneath this README we have provided a step-by-step guide to help researchers without Unix experience complete these tutorials successfully. For researchers familiar with Unix, this README will probably be sufficient.

We have made scripts available for:

    1. All essential GWAS QC steps along with scripts for data visualization.
    2. Dealing with population stratification, using 1000 Genomes as a reference.
    3. Association analyses of GWAS data.
    4. Polygenic risk score (PRS) analyses.

The scripts downloadable from this GitHub page can be used purely as tutorials for educational purposes, but also as a template for analyzing your own data. All scripts/tutorials on this GitHub page use freely downloadable data; commands to download the necessary data can be found in the scripts.

Content:

  • 1_QC_GWAS.zip
  • 2_Population_stratification.zip
  • 3_Association_GWAS
  • 4_ PRS.doc

How to use the tutorials on this page: The tutorials are designed to run on a Unix/Linux computer/server. The first three tutorials contain both *.txt and *.R scripts. The main scripts for performing these tutorials are the *.txt scripts (respectively, for the first three tutorials: 1_Main_script_QC_GWAS.txt, 2_Main_script_MDS.txt, and 3_Main_script_association_GWAS.txt). These scripts will execute the *.R scripts, provided those are placed in the same directory. Note, without placing all files belonging to a specific tutorial in the same directory, the tutorials cannot be completed. Furthermore, the first three tutorials are not independent; they should be followed in the order given above, according to their number. For example, the files generated at the end of tutorial 1 are essential for performing tutorial 2, so those files should be moved/copied to the directory in which tutorial 2 is executed. Likewise, the files from tutorial 2 are essential for tutorial 3. The fourth tutorial (4_ PRS.doc) is an MS Word document and runs independently of the previous three tutorials.

All scripts are developed for UNIX/Linux computer resources, and all commands should be typed/pasted at the shell prompt.

Note: The *.zip files contain multiple files; in order to successfully complete the tutorials, it is essential to download all files from the *.zip files and upload them to your working directory. To pull all tutorials to your computer, simply use the following command: git clone https://github.com/MareesAT/GWA_tutorial.git . Alternatively, you can manually open the *.zip folders and PRS.doc file by clicking on the folder/file, followed by clicking on "View Raw".

Contact:

Please email Andries Marees ([email protected]) with questions.

Additional material

Once you have completed the current tutorial, we recommend visiting https://github.com/AngelaMinaVargas/eMAGMA-tutorial. That GitHub repository guides you through the steps to use eMAGMA.
eMAGMA is a post-GWAS analysis method that conducts eQTL-informed gene-based tests by assigning SNPs to tissue-specific eGenes.


Step-by-step-guide for this tutorial

Step-by-step guide for researchers new to Unix and/or genetic analyses.

Introduction

The tutorial consists of four separate parts. The first three are dependent on each other and can only be performed in consecutive order, starting from the first (1_QC_GWAS.zip), then the second (2_Population_stratification.zip), followed by the third (3_Association_GWAS). The fourth part (4_ PRS.doc) can be performed independently.

The Unix commands provided in this guide should be typed or copy-and-pasted after the prompt ($ or >) on your Unix machine. Note that the ">" in front of the commands should not be copy-and-pasted; only what comes after it.

We assume that you have read the accompanying article "A tutorial on conducting Genome-Wide Association Studies: Quality control and statistical analysis" (https://www.ncbi.nlm.nih.gov/pubmed/29484742), which should provide you with a basic theoretical understanding of the type of analyses covered in this tutorial.

This step-by-step guide serves researchers who have little or no experience with Unix, by helping them through the Unix commands in preparation for the tutorial.

Preparation

Step 1) The current set of tutorials on this GitHub page is based on a GNU/Linux-based computer, therefore:

  • Make sure you have access to a GNU/Linux-based computer resource.
  • Create a directory where you plan to conduct the analysis.

Execute the command below (copy-and-paste it without the prompt ">" and without the braces "{}").

mkdir {name_for_your_directory}


Step 2) Download the files from the GitHub page

  • Change into the directory you created in step 1.

Execute the command below

cd /home/{user}/{path/name_for_your_directory}
git clone https://github.com/MareesAT/GWA_tutorial.git

  • Unzip the folder of the first tutorial and move into the newly created directory.

Execute the commands below

unzip 1_QC_GWAS.zip
cd 1_QC_GWAS


Step 3) This tutorial requires the open-source programming language R and the open-source whole-genome association analysis toolset PLINK version 1.07 (all commands also work with PLINK2). If these programs are not already installed on your computer, they can be downloaded from, respectively: https://www.r-project.org/ http://zzz.bwh.harvard.edu/plink/ https://www.cog-genomics.org/plink2

  • We recommend using the newest versions. These websites will guide you through the installation process.

  • Congratulations everything is set up to start the tutorial!

Execution of tutorial 1

Step 4) Once you've created a directory in which you have downloaded and unzipped the folder 1_QC_GWAS.zip, you are ready to start the first part of the actual tutorial. All steps of this tutorial will be executed using the commands from the main script, 1_Main_script_QC_GWAS.txt; the only thing necessary to complete the tutorial is to copy-and-paste those commands at the prompt of your Unix machine. Note, make sure you are in the directory containing all files, which is the directory you are in after the last command of step 2. There is no need to open the other files manually.

There are two ways to use the main script:

Option 1

  • If you are a novice user, we recommend opening 1_Main_script_QC_GWAS.txt in WordPad or Notepad on your Windows computer.

Option 2

  • Alternatively, 1_Main_script_QC_GWAS.txt can be opened using a Unix text editor, for example vi.

Open the main script with vi :

vi 1_Main_script_QC_GWAS.txt

  • This enables you to read the script within the Unix environment and copy the command lines from it.

To exit vi and return to your directory use:

:q

  • From there, using either option 1 or 2, you can read the information given at every step of the script 1_Main_script_QC_GWAS.txt and copy-paste the commands after the prompt on your Unix machine.

Note, if R or PLINK are installed in a directory other than your working directory, please specify the path to the executables in the given script. Alternatively, you can copy the executables of the programs to your working directory, for example by using: cp {path/program name} {path/directory}. However, when using a cluster computer, commands such as "module load plink" and "module load R" will suffice, regardless of directory.
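A third option, if you prefer not to copy executables around, is to add the install directory to your PATH so `plink` resolves by name from any working directory. A minimal sketch, assuming a hypothetical install location `$HOME/tools/plink`:

```shell
# Hypothetical install location -- adjust to wherever you unpacked PLINK.
PLINK_DIR="$HOME/tools/plink"

# Prepend it to PATH for this shell session so `plink` resolves by name.
export PATH="$PLINK_DIR:$PATH"
```

Add the export line to ~/.bashrc (or your shell's equivalent) to make it persist across sessions.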

For more information on using R and PLINK in a Unix/Linux environment, we refer to: http://zzz.bwh.harvard.edu/plink/download.shtml#nixs

Execution of tutorials 2 & 3

  • Unzip the tutorial folder of choice as described in step 2.
  • Use the output files from the previous tutorial as input for the tutorial you want to start.

The command below can be used to copy the file to another directory

cp {path/directory/file} {path/directory}

  • Use 2_Main_script_MDS.txt for the second tutorial and 3_Main_script_association_GWAS.txt for the third tutorial.

Execution of tutorial 4

4_ PRS.doc works independently from the other tutorials. After downloading 4_ PRS.doc, you can run the script, without the need for unzipping, in a directory of choice.

gwa_tutorial's People

Contributors

mareesat


gwa_tutorial's Issues

Convert-VCF-to-PLINK step missing ID setting

Hi,

I found that using the current command results in all variant IDs being set to "." in the .bim file. Is that wrong? Should we set variant IDs manually using the parameter --set-missing-var-ids?
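PLINK 1.9 can indeed fill missing IDs at conversion time with `--set-missing-var-ids` (e.g. `--set-missing-var-ids @:#`, where @ is the chromosome and # the bp position). The same repair can be sketched directly on an existing .bim file with awk; the demo input below is made up:

```shell
# Demo .bim (columns: chr, id, cM, pos, A1, A2); the second variant has a
# missing (".") ID, as produced by converting an ID-less VCF.
printf '1\trs111\t0\t1000\tA\tG\n1\t.\t0\t2000\tC\tT\n' > demo.bim

# Replace "." IDs with a chr:pos identifier; real rsIDs pass through untouched.
awk 'BEGIN { OFS = "\t" } $2 == "." { $2 = $1 ":" $4 } { print }' demo.bim > demo_fixed.bim

cat demo_fixed.bim
```

On real data, apply the awk command to your own .bim file (and keep a backup), or rerun the conversion with the PLINK flag instead.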

How to Create HapMap_3_r3 ( bim, bed and fam )

Hi,

I am new to GWAS. I have 100+ annotated exome SNV files and want to create the same .bim, .bed and .fam files.

Can you/anyone please help me with how the (.bim, .bed, .fam) files are created for HapMap_3_r3 from the HapMap database? That would help me a lot, because those are the most important inputs for further analysis, even before Step 1.

How to confirm which subject of each pair has the lower call rate and should be removed

Generate a list of FID and IID of the individual(s) with a Pihat above 0.2, to check who had the lower call rate of the pair.

In our dataset the individual 13291 NA07045 had the lower call rate.

vi 0.2_low_call_rate_pihat.txt
i
13291 NA07045

Press Esc, then save and exit with:

:x

followed by Enter.

In case of multiple 'related' pairs, the list generated above can be extended using the same method as for our lone 'related' pair.
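As a non-interactive alternative to the vi steps above, the same list can be written with printf, one "FID IID" pair per line (extend with additional lines for additional pairs):

```shell
# Write the individual(s) to remove, one "FID IID" pair per line.
printf '13291 NA07045\n' > 0.2_low_call_rate_pihat.txt

cat 0.2_low_call_rate_pihat.txt
```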

Missing MDS_merge2.mds file

Hello,

Thank you for your great tutorial. I followed it step by step, but in the Population_stratification step, when running Rscript MDS_merged.R, I got an error that MDS_merge2.mds cannot be opened. I do not know why this file was not created in the previous step.
I just have:
MDS_merge2.bed
MDS_merge2.bim
MDS_merge2.fam
MDS_merge2.genome
MDS_merge2.log
MDS_merge2.nosex

Could you guide me?

Thank you.

Failed to convert VCF to PLINK (v1.07) format

Respectfully, MareesAT:

When I run the command plink --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz --make-bed --out ALL.2of4intersection.20100804.genotypes, I get the log below:

@----------------------------------------------------------@
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Recent cached web-check found...Problem connecting to web

Writing this text to log file [ ALL.2of4intersection.20100804.genotypes.log ]
Analysis started: Wed Sep 14 16:15:57 2022

** Unused command line option: --vcf
** Unused command line option: ALL.2of4intersection.20100804.genotypes.vcf.gz

ERROR: Problem parsing the command line arguments.

Could you please help solve it?

All the best!

problem solved: download site of qqman

**** Problem solved ****
As stated in the tutorial, changing the CRAN mirror to your own VPN would solve this problem.
**** Problem solved ****

I got an error when downloading the qqman package from http://cran.cnr.berkeley.edu/ (which is written in the tutorial's Manhattan_plot.R), and I believe it is because of a problem with the download source.

I'm using R version 4.1.1 (2021-08-10), which meets the prerequisite of qqman installation.

Could you guide me to the currently available download site of qqman package? Thank you!

--------- My input ----------
(plink) Rscript --no-save Manhattan_plot.R

--------- Full text of the error ----------
Warning: unable to access index for repository http://cran.cnr.berkeley.edu/src/contrib:
cannot open URL 'http://cran.cnr.berkeley.edu/src/contrib/PACKAGES'
Warning: unable to access index for repository http://cran.cnr.berkeley.edu/bin/macosx/contrib/4.1:
cannot open URL 'http://cran.cnr.berkeley.edu/bin/macosx/contrib/4.1/PACKAGES'
Warning message:
package ‘qqman’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Error in library("qqman", lib.loc = "~") :
there is no package called ‘qqman’
Execution halted

Failed to open snp_1_22.txt

Hi,
I am running the code and got an error at the following step:

Step 3

Generate bfile with autosomal SNPs only and delete SNPs with a low minor allele frequency (MAF).

Select autosomal SNPs only (i.e., from chromosomes 1 to 22)

system("awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt")
system("plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7")

The error is:
Error: Failed to open snp_1_22.txt.

Can you explain more about the "inversion.txt" in the 1_QC_GWAS folder

Thank you for your tutorial in advance.
I have some confusion about the "inversion.txt" in the 1_QC_GWAS folder. I know it is for excluding high-inversion regions. However, where does "inversion.txt" come from? Since there are many other regions of high linkage disequilibrium, why did you choose just these regions?
Can you give some more detailed explanation here?
Thank you~

Setting Reference Genome

I have a doubt about this step in Population stratification. I can understand why we are changing the reference alleles based on the reference dataset, but is it okay to do that? We are generating VCF files based on one reference genome (e.g., GRCh37), and the 1000 Genomes dataset is built on GRCh38, so I'm changing the reference alleles based on the 1000 Genomes dataset. What is the point of generating VCF files using our preferred reference genome? We could have created VCF files using the other reference genome from the start. Why are we not doing that? Please resolve my doubt.

Thanks

Quantitative phenotypes QC

Hello - thanks a lot for your efforts, it worked well (tried it on the example dataset). Can you please briefly explain what needs to be changed/adapted in the 1st tutorial to run a quantitative-trait quality control?

Again, thanks a lot.
Samar

Permutation step: error, or my fault?

I am reproducing the whole tutorial.
At <3_Main_script_association_GWAS.txt>, ##Permutation:
To reduce computational time we only perform this test on a subset of the SNPs from chromosome 22.
awk '{ if ($4 >= 21595000 && $4 <= 21605000) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt
should subset the SNPs from chromosome 22. But when I check subset_snp_chr_22.txt, I find SNPs from other chromosomes.

Here I would like to ask what your original intention was.
Do you mean "a subset of every chromosome's positions 21595000 to 21605000", in which case the description is wrong?
Or is the description correct, and the command needs to be revised to:
awk '{ if ($4 >= 21595000 && $4 <= 21605000 && $1 == 22) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt
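The revised command does behave as the description intends; a toy check (with made-up .bim rows) confirms that only the chromosome-22 variant inside the window survives:

```shell
# Toy .bim rows (chr, id, cM, pos, A1, A2): one chr-21 variant inside the
# window, one chr-22 variant inside it, one chr-22 variant outside it.
printf '21\trsA\t0\t21600000\tA\tG\n22\trsB\t0\t21600000\tC\tT\n22\trsC\t0\t30000000\tA\tC\n' > toy.bim

# The corrected filter: position window AND chromosome 22.
awk '{ if ($4 >= 21595000 && $4 <= 21605000 && $1 == 22) print $2 }' toy.bim > toy_subset.txt

cat toy_subset.txt
```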

How to add the kinship matrix

In this GWAS analysis, the process does not add a kinship matrix when running the GWAS. However, my data is family-level, and I want to add kinship to the GWAS. Is there any PLINK parameter for adding a kinship matrix? Thanks.

Error: 'legend' is of length 0

When generating a plot to assess the type of relationship in the sample, I get an R script error.
$ Rscript --no-save Relatedness.R
Error in legend(1, 1, xjust = 1, yjust = 1, legend = levels(relatedness$RT), :
'legend' is of length 0
Execution halted.

What might be the problem?

p-value threshold for HWE deviation

Your paper suggests the following:

For binary traits we suggest to exclude: HWE p value <1e −10 in cases and <1e−6 in controls. Less strict case threshold avoids discarding disease‐associated SNPs under selection. For quantitative traits, we recommend HWE p value <1e‐6.

May I ask two questions?

  1. Are those cutoffs for human WGS, WES, or whole-genome microarray data?
  2. For quantitative traits, should the same cutoff be used for both cases and controls?

Failed to select autosomal SNPs only (QC Step 3)

While running:

# Select autosomal SNPs only (i.e., from chromosomes 1 to 22).
awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt
plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7

Here are the error logs:

PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to HapMap_3_r3_7.log.
Options in effect:
  --bfile HapMap_3_r3_6
  --extract snp_1_22.txt
  --make-bed
  --out HapMap_3_r3_7

16125 MB RAM detected; reserving 8062 MB for main workspace.
1430443 variants loaded from .bim file.
165 people (81 males, 84 females) loaded from .fam.
112 phenotype values loaded from .fam.
Error: No variants remaining after --extract.

Here are the HapMap_3_r3_6.bim:

1	rs2185539	0	556738	T	C
1	rs11240767	0	718814	T	C
1	rs3131972	0	742584	A	G
1	rs3131969	0	744045	A	G
1	rs1048488	0	750775	C	T

Here are the snp_1_22.txt

rs2185539
rs11240767
rs3131972
rs3131969
rs1048488

I have searched through Google and the documentation, but still haven't found a solution.
Your input or guidance on this would be highly appreciated.
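One common cause of "No variants remaining after --extract" (an assumption here, not a confirmed diagnosis of this report) is Windows-style CR+LF line endings in the ID list: every ID then carries an invisible trailing carriage return and never matches the .bim IDs. A strip-and-retry sketch on a demo file:

```shell
# Demo ID list saved with Windows (CR+LF) line endings.
printf 'rs2185539\r\nrs11240767\r\n' > snp_list_demo.txt

# Strip the carriage returns; feed the cleaned file to --extract instead.
tr -d '\r' < snp_list_demo.txt > snp_list_clean.txt
```

`cat -A snp_list_demo.txt` (or `file`) will reveal whether a list has CRLF terminators before you rerun PLINK.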

Why is individual 13291 NA07045 in 0.2_low_call_rate_pihat.txt?

May I ask about "For each pair of 'related' individuals with a pihat > 0.2, we recommend to remove the individual with the lowest call rate"? What does call rate mean in this sentence, and why do I need to run plink --bfile HapMap_3_r3_11 --missing after getting the file pihat_min0.2_in_founders.genome? The crux of the matter is how that operation shows that individual 13291 NA07045 has the lowest call rate, i.e., how to obtain the data in 0.2_low_call_rate_pihat.txt.
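A sketch of the comparison, on made-up numbers: plink --missing writes an .imiss file whose F_MISS column is each individual's fraction of missing genotypes, so within a related pair the member with the higher F_MISS has the lower call rate and is the one to remove:

```shell
# Toy .imiss (FID, IID, MISS_PHENO, N_MISS, N_GENO, F_MISS); values are
# made up -- a real file comes from `plink --bfile <data> --missing`.
printf 'FID IID MISS_PHENO N_MISS N_GENO F_MISS\n1349 NA06995 N 20 1000 0.020\n13291 NA07045 N 35 1000 0.035\n' > toy.imiss

# Pick the individual with the highest missingness (lowest call rate).
awk 'NR > 1 { if ($6 > max) { max = $6; fid = $1; iid = $2 } } END { print fid, iid }' toy.imiss
```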

The system cannot find the path specified.

I have added Rscript.exe to the bin/ directory.

When I try to run: $ Rscript --no-save gender_check.R
It shows me:
The system cannot find the path specified.

What should I do? Thanks!

.vcf has fewer tokens than expected error

In part 2, when converting from VCF to PLINK format, I get a ".vcf has fewer tokens than expected" error (screenshot omitted).

Any advice to get around this? I end up with temporary files because of it, and the rest of the tutorial does not function the way it should.

Some questions about qqplots

May I ask how to interpret my Q-Q plots? They look so strange...
There are two folders standing for different datasets, both containing association and logistic analyses with PLINK.
GWAS.zip

Thank you for checking on my problems, and for your prompt reply!

ftp vs https in downloading 1000 genome vcf file in Step 2_Population_stratification

Hi,

I noticed that using the command wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz from 2_Main_script_MDS.txt downloads a corrupted file; creating PLINK files from that VCF then fails with a file-read error. But when ftp is replaced with https, as in wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz, there is no corruption issue and the download is faster as well.

Thanks

data for 0.2_low_call_rate_pihat.txt

Hi Marees,

Thanks for the pipeline.
I have around 5k samples in pihat_min0.2_genome and 390 in zoom_pihat_genome.
If I sort them, I can identify the related individuals, but I don't know how to select samples for '0.2_low_call_rate_pihat.txt'.
This part alone is not clear to me; everything else is perfect.

Thanks again.

Regards,
Mano

controlling population stratification

Hi,
Thank you for the paper and the step-by-step tutorial! Those are of great help!

May I ask a question regarding your paper? It says that the following:

Individuals who are outliers based on the MDS analysis should be removed from further analyses. After the exclusion of these individuals, a new MDS analysis must be conducted, and its main components need to be used as covariates in the association tests in order to correct for any remaining population stratification within the population.

Does it mean those MDS components should be added as covariates regardless of whether they are significantly associated with the trait? If so, could you explain why?
