xihaoli / staarpipeline Goto Github PK

An R package for performing association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using STAARpipeline

License: GNU General Public License v3.0

R 93.25% C 0.04% C++ 6.71%

statistical-genetics whole-genome-sequencing whole-exome-sequencing functional-annotation rare-variant-analysis staar-pipeline

staarpipeline's Introduction

STAARpipeline

This is an R package for performing association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using STAARpipeline.

Description

STAARpipeline is an R package for phenotype-genotype association analyses of biobank-scale WGS/WES data, including single variant analysis and variant set analysis. The single variant analysis in STAARpipeline provides individual P values of variants given an MAF or MAC cut-off. The variant set analysis in STAARpipeline includes gene-centric analysis and non-gene-centric analysis of rare variants. The gene-centric coding analysis provides seven genetic categories: putative loss of function (pLoF), protein-truncating (ptv), missense, disruptive missense, pLoF and disruptive missense, ptv and disruptive missense, and synonymous. The gene-centric noncoding analysis provides eight genetic categories: promoter or enhancer overlaid with CAGE or DHS sites, UTR, upstream, downstream, and noncoding RNA genes. The non-gene-centric analysis includes sliding window analysis with fixed sizes and dynamic window analysis with data-adaptive sizes. STAARpipeline accounts for population structure and relatedness, and is computationally scalable for analyzing biobank-scale WGS/WES studies of continuous and dichotomous traits with balanced or imbalanced case-control ratios, as well as multiple correlated traits jointly. STAARpipeline also provides analytical follow-up of dissecting association signals independent of known variants via conditional analysis using STAARpipelineSummary.

STAARpipeline and STAARpipelineSummary are implemented as a collection of apps. Please see the apps staarpipeline, staarpipelinesummary_varset and staarpipelinesummary_indvar that run on the UK Biobank Research Analysis Platform for more details.

Workflow Overview

Prerequisites

R (recommended version >= 3.5.1)

For optimal computational performance, it is recommended to use an R version configured with the Intel Math Kernel Library (or other fast BLAS/LAPACK libraries). See the instructions on building R with Intel MKL.

Dependencies

STAARpipeline links to R packages Rcpp and RcppArmadillo, and also imports R packages Rcpp, STAAR, MultiSTAAR, SCANG, dplyr, SeqArray, SeqVarTools, GenomicFeatures, TxDb.Hsapiens.UCSC.hg38.knownGene, GMMAT, GENESIS, Matrix. These dependencies should be installed before installing STAARpipeline.

Installation

library(devtools)
devtools::install_github("xihaoli/STAARpipeline",ref="main")

Docker Image

A docker image for STAARpipeline, including R (version 3.6.1) built with Intel MKL and all STAAR-related packages (STAAR, MultiSTAAR, SCANG, STAARpipeline, STAARpipelineSummary) pre-installed, is located in the Docker Hub. The docker image can be pulled using

docker pull zilinli/staarpipeline:0.9.7

Usage

Please see the STAARpipeline user manual for detailed usage of STAARpipeline package. Please see the STAARpipeline tutorial for a detailed example of analyzing sequencing data using STAARpipeline.

Data Availability

The whole-genome functional annotation data assembled from a variety of sources and the precomputed annotation principal components are available at the Functional Annotation of Variant - Online Resource (FAVOR) site and FAVOR Essential Database.

Version

The current version is 0.9.7.1 (August 9, 2024).

Citation

If you use STAARpipeline and STAARpipelineSummary for your work, please cite:

Zilin Li*, Xihao Li*, Hufeng Zhou, Sheila M. Gaynor, Margaret Sunitha Selvaraj, Theodore Arapoglou, Corbin Quick, Yaowu Liu, Han Chen, Ryan Sun, Rounak Dey, Donna K. Arnett, Paul L. Auer, Lawrence F. Bielak, Joshua C. Bis, Thomas W. Blackwell, John Blangero, Eric Boerwinkle, Donald W. Bowden, Jennifer A. Brody, Brian E. Cade, Matthew P. Conomos, Adolfo Correa, L. Adrienne Cupples, Joanne E. Curran, Paul S. de Vries, Ravindranath Duggirala, Nora Franceschini, Barry I. Freedman, Harald H. H. Göring, Xiuqing Guo, Rita R. Kalyani, Charles Kooperberg, Brian G. Kral, Leslie A. Lange, Bridget M. Lin, Ani Manichaikul, Alisa K. Manning, Lisa W. Martin, Rasika A. Mathias, James B. Meigs, Braxton D. Mitchell, May E. Montasser, Alanna C. Morrison, Take Naseri, Jeffrey R. O’Connell, Nicholette D. Palmer, Patricia A. Peyser, Bruce M. Psaty, Laura M. Raffield, Susan Redline, Alexander P. Reiner, Muagututi’a Sefuiva Reupena, Kenneth M. Rice, Stephen S. Rich, Jennifer A. Smith, Kent D. Taylor, Margaret A. Taub, Ramachandran S. Vasan, Daniel E. Weeks, James G. Wilson, Lisa R. Yanek, Wei Zhao, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, TOPMed Lipids Working Group, Jerome I. Rotter, Cristen J. Willer, Pradeep Natarajan, Gina M. Peloso, & Xihong Lin. (2022). A framework for detecting noncoding rare variant associations of large-scale whole-genome sequencing studies. Nature Methods, 19(12), 1599-1611. PMID: 36303018. PMCID: PMC10008172. DOI: 10.1038/s41592-022-01640-x.

Xihao Li*, Zilin Li*, Hufeng Zhou, Sheila M. Gaynor, Yaowu Liu, Han Chen, Ryan Sun, Rounak Dey, Donna K. Arnett, Stella Aslibekyan, Christie M. Ballantyne, Lawrence F. Bielak, John Blangero, Eric Boerwinkle, Donald W. Bowden, Jai G. Broome, Matthew P. Conomos, Adolfo Correa, L. Adrienne Cupples, Joanne E. Curran, Barry I. Freedman, Xiuqing Guo, George Hindy, Marguerite R. Irvin, Sharon L. R. Kardia, Sekar Kathiresan, Alyna T. Khan, Charles L. Kooperberg, Cathy C. Laurie, X. Shirley Liu, Michael C. Mahaney, Ani W. Manichaikul, Lisa W. Martin, Rasika A. Mathias, Stephen T. McGarvey, Braxton D. Mitchell, May E. Montasser, Jill E. Moore, Alanna C. Morrison, Jeffrey R. O'Connell, Nicholette D. Palmer, Akhil Pampana, Juan M. Peralta, Patricia A. Peyser, Bruce M. Psaty, Susan Redline, Kenneth M. Rice, Stephen S. Rich, Jennifer A. Smith, Hemant K. Tiwari, Michael Y. Tsai, Ramachandran S. Vasan, Fei Fei Wang, Daniel E. Weeks, Zhiping Weng, James G. Wilson, Lisa R. Yanek, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, TOPMed Lipids Working Group, Benjamin M. Neale, Shamil R. Sunyaev, Gonçalo R. Abecasis, Jerome I. Rotter, Cristen J. Willer, Gina M. Peloso, Pradeep Natarajan, & Xihong Lin. (2020). Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nature Genetics, 52(9), 969-983. PMID: 32839606. PMCID: PMC7483769. DOI: 10.1038/s41588-020-0676-4.

License

This software is licensed under GPLv3.

GNU General Public License, GPLv3

staarpipeline's People

Contributors

Stargazers

Watchers

Forkers

zhouhufeng zilinli1988 shicheng-guo likuokuo damientan varunreddygopu wangyujian123 milad-bastami yu-zhang-oyo jahernayeem cvraut buutrg lvmehinovic socrates1100 yuxinyuanqt xiening362175785 shengkui-zhang shuangsong0110

staarpipeline's Issues

How do other species generate the variants list to be annotated?

Hi,

I want to use STAARpipeline to detect rare variation information in plants. I have my own vcf file and its annotation information. How can I generate the variants list to be annotated? I read the FAVORdatabase_chrsplit.csv in the example, and I don't know what it means, so I don't know how to start.

Looking forward to your reply!

Ayn

Feature request: add to Biocontainers

Hi all,

It would be nice if the Dockerfile containing all the STAAR-related packages were added to biocontainers. Any interest?

regarding STAAR Sliding Window

Hello, I was performing the rare variant analysis using the sliding window approach. When running the pre-built pipeline, I have several questions that need your help. Thank you!

For each loop in kk, I get the output saying

[1] 1 of selected samples: 1,466 of selected variants: 437 Error : eig_sym(): decomposition failed of selected samples: 1,966 of selected variants: 5,335,313,

where the 437 should be the variants within the window, and the 5,335,313 is the total number of variants. 1) based on this output, my first question comes from the error message. I followed an old issue to check. There is no missing in the genotype matrix, and I updated the Rcpp to the latest version. But I still get this error message. 2) secondly, by run the source code line by line, I am a little confused since I did not see a line to subset genotype matrix by the SNPs in the window region. Can you please tell me which line in which source code gives this subset? I appreciate it a lot!

In the pre-set option "type", I did not find a detailed explanation on using either "single" or "multiple". Based on the instruction, the default is "single", but in the pipeline, "multiple" is used. I am wondering can you give me an example of when to use "single" and when to use "multiple"?
Just curious, I am wondering if the sliding window approach can be applied to non-coding variants only?

Thank you!
Ruyu

Are there provide a toy data?

Hi, Dr. li

thanks you for the tool, It looks handy for users to go through the GWAS analysis. And I want to know there are provide a toy data for a new guy, like me.

Looking forward your reply!

Provide test data

Hi, I'm interested in your software. I would like to use this software on other species, but there are no directly available annotation files, only vcf files. Can you provide test data in the tutorial? It would be even better to provide a pipeline that starts with the raw vcf.
Thank you.

Null output in Dynamic window analysis

Hi,

Recently I am using your pipeline on Docker to do burden test with dynamic window analysis.

After generating aGDS file, I conducted step0-step1-step5. I tested with WES data of 6 people (2 trios) with a chr1 region. Here is my original vcf file, generated aGDS file and commands:
dynamic_window_test.zip

As I only tested one chromosome, I changed some lines of codes:

#### Number of jobs for each chromosome
jobs_num <- matrix(rep(0,3),nrow=1)
for(chr in 1:1)
{
	print(chr)
	gds.path <- agds_dir[1] # agds_dir[1]
	genofile <- seqOpen(gds.path)
	
	filter <- seqGetData(genofile, QC_label)
	SNVlist <- filter == "PASS" 

	position <- as.numeric(seqGetData(genofile, "position"))
	position_SNV <- position[SNVlist]
  
	jobs_num[chr,1] <- chr
	jobs_num[chr,2] <- min(position[SNVlist])
	jobs_num[chr,3] <- max(position[SNVlist])

	seqClose(genofile)
}

About groupid and arraryid in step 0, I directly used groupid = arraryid = scang_num.

Though no errors were reported,the output was null. I am not sure if these changes were correct.
Have no idea which step went wrong. Could you give me some advice?

Thanks!

STAARPipeline for imbalanced binary phenotypes

Hello and first many thanks for generating such a nice and useful pipeline!

I've seen that you recently improved STAAR according to imbalanced binary scenarios. Is this already integrated in the STAARPipeline approach?

Many thanks in advance!

Best
Andi

Burden effects for all genes of a category

Hello,

I am trying to extract burden effects for all genes present in the rvas results rather than significant genes to perform meta-analysis? Is it possible to do so if so what is the potential way to accomplish it?

Regards
Akhil

Integration of annotations

Hello I have a question based on the STAAR code and its underlying publications.

In the code it says:
#For each noncoding functional category, the conditional STAAR-B p-value is a p-value from an omnibus test #' that aggregated conditional Burden(1,25) and Burden(1,1), #' together with conditional p-values of each test weighted by each annotation using Cauchy method.
--> So this seems to me like calculate e.g. different burden tests by annotation weights, af weights etc. and integrating them afterwards

But based on your publication it seems to me that you integrated the beta-allele frequency weighting directly with the functional variant annotation and the variant score to generate e.g. QBurden.

Which one is correct?

Best
Andi

Results are not consistent with what we got before

Hello,

I have performed RVAS analysis using STAARpipeline using TOPMed dataset and got the results but when I performed the analysis on the updated dataset, I am not getting the same results as before and Genes that are reported as significant in the original analysis is not significant or not present in the results files. Can you help me out regarding this?

Thank you so much for help.

preinstall packages

I would suggest add following lines for the preinstall packages. These packages seem need to be installed manually in the order (like STAAR depend on SeqArray).

BiocManager::install("SeqArray")
BiocManager::install("SeqVarTools")
devtools::install_github("hanchenphd/GMMAT")
BiocManager::install("GENESIS")
devtools::install_github("xihaoli/STAAR",ref="main")
BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
BiocManager::install("GenomicFeatures")
devtools::install_github("zilinli1988/SCANG")

I am also get confused for the Intel Math Kernel Library.

SCANG error in class(genotype)=="dgCMatrix" Rversion 4.2.0

Trying to use Dynamic_Window_SCANG.r function I run into an error come from comparison class(genotype)=="dgCMatrix". While, going deeper into this I found that the problem is that genotype object (= Geno) has two classes "matrix" "array" and the comparison cannot be done. The issue with the two classes, appears after R version 4.0.0 (https://cran.r-project.org/doc/manuals/r-release/NEWS.html).

docker for STAARpipeline

Hello,

I am trying to see if there have been an implementation of this pipeline in the google cloud platform or if any docker has been available for this to perform the same in google cloud as I have seen the same in the Dnanexus platform for ukbiobank.

Regards
Akhil

Gene_Centic_Coding Unable to Analyze Gene

Hello,

While running theGene_Centic_Coding function, I noticed a strange issue while processing through a list of genes for a specific chromosome.

On any given gene, the function seems to work properly until the internal coding function attempts to run the STAAR function:

try(pvalues <- STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, 
    rare_maf_cutoff = rare_maf_cutoff, rv_num_cutoff = rv_num_cutoff), 
    silent = silent)

I am receiving the following error, and thus no results from the current gene:

Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  :     Dimensions don't match for genotype and annotation!

This error occurs virtually for all genes. Looking into this, it appears the issue is how the annotation data is subset for the final list of variants that are lof in plof:

Anno.Int.PHRED.sub.category <- Anno.Int.PHRED.sub[lof.in.plof, ]

When I run this, lof.in.plof is a vector of NAs, TRUEs, and FALSEs, with the number of TRUEs corresponding to the final filtered number of variants to use (in my case, 5). When the annotation data in Anno.Int.PHRED.sub is subset using this vector, however, the final dimensions of the table still contain the number of rows that correspond to the previous number of variants (which, in my case, was 129).

The Geno matrix has the dimensions [n samples x 5 variants]. When Anno.Int.PHRED.sub.category is passed to the STAAR function, however, its dimensions are still [n samples x 129 variants], causing the error.

If I wrap the which function around lof.in.plof, the dimensions of the resulting table are [n samples x 5] and STAAR is able to run properly and gives no error:

Anno.Int.PHRED.sub.category <- Anno.Int.PHRED.sub[which(lof.in.plof),]

I assume this fix makes sense and there shouldn't be a reason Anno.Int.PHRED.sub.category should still contain rows with NA data..? The final dimensions of this annotation table should indeed match that of the genotype matrix, no?

Minimum number of samples?

I have a dataset with ~700 samples, eventually to increase to ~1000 but I'm testing out STAARpipeline on what I have so far. I know that is very small for a human GWAS study, but I thought the gene centric and/or sliding window analysis, in combination with the weights from annotations used in STAARpipeline, might bolster the power enough to be worthwhile.

I noticed that my results files from the STAARpipeline_Gene_Centric_Coding.R script just contained a list of NULL values. I went back and stepped through the script manually, and got the following message printed for the first gene:

# of selected samples: 721
# of selected variants: 103
# of selected samples: 721
# of selected variants: 12
# of selected samples: 721
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  genotype is not a matrix!
# of selected samples: 721
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  genotype is not a matrix!
# of selected samples: 721
# of selected variants: 3
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
# of selected samples: 721
# of selected variants: 9
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
# of selected samples: 721
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  genotype is not a matrix!
# of selected samples: 721
# of selected variants: 3,724,472

If I do str(results):

List of 5
 $ plof               : NULL
 $ plof_ds            : NULL
 $ missense           : NULL
 $ disruptive_missense: NULL
 $ synonymous         : NULL

Is my dataset just too small, or is there some other issue to be fixed? I am running 0.9.6 from the Docker container because I was working on this back in October, but can change to 0.9.7 if you think that would help.

Running docker image

Hi everyone!
Is there any possibility of providing a small tutorial on how to use the docker image from STAAR-pipeline?
Without any directions I'm not sure how to execute the correct steps with it and I'm not sure how to proceed. I'm working on implementing the staar-pipeline for our cluster and the docker image would be the best way to do so.
Thanks

Account for population frequency

Hi Dr. Li

In determining rare variant in the gene-centric coding pipeline, does STAARpipeline use allelic frequency information from population database such as gnomAD, 1000 Genome, ESP6500, etc to determine whether the variant in exonic region is also rare in all populations? Also, how do you define 'rare' in all other pipeline? Like rare in the cohort or rare in the population databases?

Thank you very much

Does this pipeline support sex chromosomes?

When I am running the process to the step of STAARpipeline/0.2.1Varinfo_gds.R, if I provide the chromosome as a sex chromosome, it will automatically recognize it as chrNA. I would like to know if this procedure supports calculations for sex chromosomes.

What are the differences between the individual part of the STAARpipeline and the traditional GWAS using the plink method?

Hello, I would like to ask, what are the differences between the individual analysis in the STAARpipeline (analysis centered on a single locus variable) and the traditional GWAS analysis using the plink method?

xihaoli / staarpipeline Goto Github PK

staarpipeline's Introduction

STAARpipeline

Description

Workflow Overview

Prerequisites

Dependencies

Installation

Docker Image

Usage

Data Availability

Version

Citation

License

staarpipeline's People

Contributors

Stargazers

Watchers

Forkers

staarpipeline's Issues

Recommend Projects

Recommend Topics

Recommend Org