mareesat / gwa_tutorial Goto Github PK

A comprehensive tutorial about GWAS and PRS

gwas qc prs r plink genetics population-structure unix population-stratification genomics

gwa_tutorial's Issues

.vcf has fewer tokens than expected error

In part 2, when we convert from vcf to Plink format, I get an error:

Any advice to get around this? I end up with temporary files because of this & the rest of the tutorial is not functioning the way it should.

Any plan to have an update to GWA_tutorial?

Hi Andries,

any plan to have an update to GWA_tutorial?

Shicheng

Permutation step, Error? or My fault?

I am reproducing the whole tutorial.
At <3_Main_script_association_GWAS.txt>, ##Permutation:
The reduce computational time we only perform this test on a subset of the SNPs from chromosome 22.
awk '{ if ($4 >= 21595000 && $4 <= 21605000) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt
should subset the SNP from Chromosome 22. But when I check subset_snp_chr_22.txt, I found the SNPs from other chromosome.

Here I would like to ask what is your original intention:
Do you want to say:"Subset of every chromosome's position 21595000 to 21605000", the descirption is wrong.
Or,
Description is correct, but command need to be revised
awk '{ if ($4 >= 21595000 && $4 <= 21605000 && $1 == 22) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt

convert vcf to plink step missing id setting

Hi,

I found using the current command will result in all variant id specified as "." in the bim file. Is that wrong? Shall we set variant id manually using the parameter --set-missing-var-ids?

problem solved: download site of qqman

**** Problem solved ****
As stated in the tutorial, changing the CRAN mirror to your own VPN would solve this problem.
**** Problem solved ****

I got an error when downloading qqman package from http://cran.cnr.berkeley.edu/ (which is written in Manhattan_plot.R of the tutorial), and I believe it is because there is problem in the download source.

I'm using R version 4.1.1 (2021-08-10), which meets the prerequisite of qqman installation.

Could you guide me to the currently available download site of qqman package? Thank you!

--------- My input ----------
(plink) Rscript --no-save Manhattan_plot.R

--------- Full text of the error ----------
Warning: unable to access index for repository http://cran.cnr.berkeley.edu/src/contrib:
cannot open URL 'http://cran.cnr.berkeley.edu/src/contrib/PACKAGES'
Warning: unable to access index for repository http://cran.cnr.berkeley.edu/bin/macosx/contrib/4.1:
cannot open URL 'http://cran.cnr.berkeley.edu/bin/macosx/contrib/4.1/PACKAGES'
Warning message:
package ‘qqman’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Error in library("qqman", lib.loc = "~") :
there is no package called ‘qqman’
Execution halted

Setting Reference Genome

I have a doubt on this step in Population stratification. I can understand why we are changing the reference alleles based on the reference data set. But is it okay to do that? Because, we are generating vcf files based on a reference genome (ex: GRCh37) and the 1000 genome data set is made up based on GRCh38. So I'm changing the reference alleles based on 1000 genome data set. What is the point in generating vcf files using our preferred reference genome? We could have created vcf files using the other reference genome initially itself. But why we are not doing that? Please resolve my doubt.

Thanks

Error: 'legend' is of length 0

When generating a plot to assess the type of relationship in the sample, I do get an R script erro.
$ Rscript --no-save Relatedness.R
Error in legend(1, 1, xjust = 1, yjust = 1, legend = levels(relatedness$RT), :
'legend' is of length 0
Execution halted.

What might be the problem?

how to add the kinship matirx

In GWAS analysis, this process does not add a kinship matrix to run the GWAS. However, My data is family-level, I want to added the kinship to run the GWAS, Whether any parameter of plink can be added to kinship， thanks.

controlling population stratification

Hi,
Thank you for the paper and the step-by-step tutorial! Those are of great help!

May I ask a question regarding your paper? It says that the following:

Individuals who are outliers based on the MDS analysis should be removed from further analyses. After the exclusion of these individuals, a new MDS analysis must be conducted, and its main components need to be used as covariates in the association tests in order to correct for any remaining population stratification within the population.

Does it mean those MDS components should be added as covariates regardless whether they are significantly associated with the trait? if so, could you explain why?

Could you please tell me how to "inversion.txt" in the 1_QC_GWAS folder

I want to practice my genetic data following your tutorial, but I do not know how to create the mydata-based inversion.txt file. Could you please tell me in detail. Thank you very much.

The system cannot find the path specified.

I have set Rscript.exe in the bin/,

When i try to run : $ Rscript --no-save gender_check.R
It shows me:
The system cannot find the path specified.

What should I do, Thanks!!!

How to confirm which sub has the lower call rate of each pair that I should remove

Generate a list of FID and IID of the individual(s) with a Pihat above 0.2, to check who had the lower call rate of the pair.

In our dataset the individual 13291 NA07045 had the lower call rate.

vi 0.2_low_call_rate_pihat.txt
i
13291 NA07045

Press esc on keyboard!

Press enter on keyboard

In case of multiple 'related' pairs, the list generated above can be extended using the same method as for our lone 'related' pair.

ftp vs https in downloading 1000 genome vcf file in Step 2_Population_stratification

Hi,

I noticed using the command wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz from the 2_Main_script_MDS.txt downloads a corrupted file. Hence creating plink files from the vcf resulted in file read failure error. But when ftp is replaced with https like this, wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz, there is no file corruption issue and the download is faster as well.

Thanks

How to Create HapMap_3_r3 ( bim, bed and fam )

Hi,

I am new to GWAS, I have 100+ exome SNV annotated files, i want to create same bim, bed and fam files.

Can you/anyone pls help me with how ( bim, bed and fam ) are created for HapMap_3_r3 from hapmad Db, that will help me a lot because those are most important inputs for further analysis even before Step1.

Failed to select autosomal SNPs only (QC Step 3)

While running:

# Select autosomal SNPs only (i.e., from chromosomes 1 to 22).
awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt
plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7

Here are the error logs:

PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to HapMap_3_r3_7.log.
Options in effect:
  --bfile HapMap_3_r3_6
  --extract snp_1_22.txt
  --make-bed
  --out HapMap_3_r3_7

16125 MB RAM detected; reserving 8062 MB for main workspace.
1430443 variants loaded from .bim file.
165 people (81 males, 84 females) loaded from .fam.
112 phenotype values loaded from .fam.
Error: No variants remaining after --extract.

Here are the HapMap_3_r3_6.bim:

1	rs2185539	0	556738	T	C
1	rs11240767	0	718814	T	C
1	rs3131972	0	742584	A	G
1	rs3131969	0	744045	A	G
1	rs1048488	0	750775	C	T

Here are the snp_1_22.txt

I have searched through Google and documents, but still haven't found a solution.
Your input or guidance on this would be highly appreciated

p-value threshold for HWE deviation

Your paper suggests the following:

For binary traits we suggest to exclude: HWE p value <1e −10 in cases and <1e−6 in controls. Less strict case threshold avoids discarding disease‐associated SNPs under selection. For quantitative traits, we recommend HWE p value <1e‐6.

May I ask two questions?

are those cutoffs for human WGS or WES or whole genome microarray?
For quantitative traits, use the same cutoff for both case and control?

data for 0.2_low_call_rate_pihat.txt

Hi Marees,

Thanks for the pipeline.
I have around 5k samples in pihat_min0.2_genome and 390 in zoom_pihat_genome.
If I sort them, I could identify the related individuals but don't know how to select samples for '0.2_low_call_rate_pihat.txt'.
This part alone is not clear to me, and everything else is perfect.

Thanks again.

Regards,
Mano

Failed to open snp_1_22.txt

Hi,
I am running the codes and I got an error in the following code:

Step 3

Generate bfile with autosomal SNPs only and delete SNPs with a low minor allele frequency (MAF).

Select autosomal SNPs only (i.e., from chromosomes 1 to 22)

system("awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt")
system("plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7")

The error is:
Error: Failed to open snp_1_22.txt.

Quantitative phenotypes QC

Hello - thanks a lot for your efforts, it worked well (tried it in the example datset). Can you please briefly explain what needs to be changed/adapted in the 1st tutorial to run a quantitative trait quality control?

Again, thanks a lot.
Samar

Miss MDS_merge2.mds file

Hello,

Thank you for your great tutorial. I followed your tutorial step-by-step, but in Population_stratification step, for run Rscript MDS_merged.R, I got an error about can not open MDS_merge2.mds. Actually, I do not know why did not make this file in the previous step?
I just have:
MDS_merge2.bed
MDS_merge2.bim
MDS_merge2.fam
MDS_merge2.genome
MDS_merge2.log
MDS_merge2.nosex

Could you guide me?

Thank you.

why individual 13291 NA07045 in 0.2_low_call_rate_pihat.txt

May I ask "For each pair of 'related' individuals with a pihat > 0.2, we recommend to remove the individual with the lowest call rate"? What does call rate mean in this sentence, and why do I need to "plink --bfile HapMap_3_r3_11 --missing" after getting the file pihat_min0.2_in_founders.genome？ Of course the crux of the matter is how to get the result by the above operation, individual 13291 NA07045 has the lowest call rate, i.e. how to get the data in 0.2_low_call_rate_pihat.txt?

Can you explain more about the "inversion.txt" in the 1_QC_GWAS folder

Thank you for your tutorial in advance.
I have some confusion about the "inversion.txt" in the 1_QC_GWAS folder. I know it is for excluding high inversion regions. However, where is the "inversion.txt" come from? Since there are many other regions of high linkage disequilibrium, why you just choose these regions?
Can you give some more detailed explanation here?
Thank you~

Some questions about qqplots

May I ask how to explain my qqplots? It looks soooo strange...
There are 2 folders standing for different dataset and both containing association and logistic analysis with plink.
GWAS.zip

Thank you for your checking on my problems and prompt reply!!!

I have failed convert vcf to Plink(v1.07) format

Respectly MareesAT :

When I running the command, plink --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz --make-bed --out ALL.2of4intersection.20100804.genotypes, I have got the work-log below:

@----------------------------------------------------------@
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Recent cached web-check found...Problem connecting to web

Writing this text to log file [ ALL.2of4intersection.20100804.genotypes.log ]
Analysis started: Wed Sep 14 16:15:57 2022

** Unused command line option: --vcf
** Unused command line option: ALL.2of4intersection.20100804.genotypes.vcf.gz

ERROR: Problem parsing the command line arguments.

Could you please solve it ?

All's best!