rajlabmssm / echold Goto Github PK
View Code? Open in Web Editor NEWechoverse module: LD downloading and processing
echoverse module: LD downloading and processing
Alkes price lab moved their data to Google Storage. This is....not great, because it may mean all the data is now behind a paywall.
https://alkesgroup.broadinstitute.org/UKBB_LD/
Will see if I can work around this, or at least provide users a way to supply their GS credentials and still get the data.
Enable custom LD panel by:
Hi,
It turnes out that when running the "full pipeline vignette" the bgzipped VCF file generated under ./RESULTS/GWAS/Nalls23andMe_2019/BST1/LD is empty.
I believe this is causing the "Segmentation fault (core dumped)" error, so that no finemapping tool works downstream since we are missing the locus specific LD structure.
r$> library(echolocatoR) Registered S3 method overwritten by 'GGally': method from +.gg ggplot2 Bioconductor version 3.12 (BiocManager 1.30.12), ?BiocManager::install for help Bioconductor version '3.12' is out-of-date; the current release version '3.14' is available with R version '4.1'; see https://bioconductor.org/install
data("Nalls_top_SNPs"); top_SNPs <- import_topSNPs( # topSS = "~/Desktop/Fine_Mapping/Data/GWAS/Nalls23andMe_2019/Nalls2019_TableS2.xlsx", topSS = Nalls_top_SNPs, chrom_col = "CHR", position_col = "BP", snp_col="SNP", pval_col="P, all studies", effect_col="Beta, all studies", gene_col="Nearest Gene", locus_col = "Nearest Gene", grouping_vars = c("Locus"), remove_variants = "rs34637584") #> [1] "+ Assigning gene_col and locus_col independently" head(top_SNPs)
Locus Gene CHR POS SNP P Effect 1: ASXL3 ASXL3 18 31304318 rs1941685 1.69e-08 0.0531 2: BAG3 BAG3 10 121415685 rs72840788 1.57e-11 0.0763 3: BIN3 BIN3 8 22525980 rs2280104 1.16e-08 0.0556 4: BRIP1 BRIP1 17 59917366 rs61169879 9.28e-10 0.0820 5: BST1 BST1 4 15737348 rs4698412 2.06e-28 0.1035 6: C5orf24 C5orf24 5 134199105 rs11950533 7.16e-09 -0.0916
r$> fullSS_path <- example_fullSS(fullSS_path=file.path("/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR","Nalls23andMe_2019.fullSS_subset.tsv"))
r$> fullSS_path [1] "/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/Nalls23andMe_2019.fullSS_subset.tsv"
top_SNPs = top_SNPs, # It's best to give absolute paths results_dir = file.path("/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR","results"), loci = c("BST1","MEX3C"),# top_SNPs$Locus, dataset_name = "Nalls23andMe_2019", dataset_type = "GWAS", force_new_subset = T, force_new_LD = T, force_new_finemap = T, remove_tmps = F, # SUMMARY STATS ARGUMENTS fullSS_path = fullSS_path, query_by ="tabix", chrom_col = "CHR", position_col = "POS", snp_col = "RSID", pval_col = "p", effect_col = "beta", stderr_col = "se", freq_col = "freq", MAF_col = "calculate", A1_col = "A1", A2_col = "A2", # FILTERING ARGUMENTS ## It's often desirable to use a larger window size ## (e.g. 2Mb which is bp_distance=500000*2), ## but we use a small window here to speed up the process. bp_distance = 10000,#500000*2, min_MAF = 0.001, trim_gene_limits = F, # FINE-MAPPING ARGUMENTS ## General finemap_methods = c("ABF","FINEMAP","SUSIE","POLYFUN_SUSIE"), n_causal = 5, PP_threshold = .95, # LD ARGUMENTS LD_reference = "1KGphase1",#"UKB", superpopulation = "EUR", download_method = "axel", # PLOT ARGUMENTS ## general plot.types=c("fancy"), ## Generate multiple plots of different window sizes; ### all SNPs, 4x zoomed-in, and a 50000bp window plot.zoom = c("all","4x","10x"), ## XGR # plot.XGR_libnames=c("ENCODE_TFBS_ClusteredV3_CellTypes"), ## Roadmap plot.Roadmap = F, plot.Roadmap_query = NULL, # Nott et al. (2019) plot.Nott_epigenome = T, plot.Nott_show_placseq = T, [1] "+ CONDA:: Activating conda env 'echoR'" [1] "Checking for tabix installation..." [1] "Checking for bcftools installation..." ) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( ( BST1 (1 / 2) ) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( ( [1] "+ CONDA:: Identified bgzip executable in echoR env." [1] "Determining chrom type from file header" [1] "LD:: Standardizing summary statistics subset." [1] "++ Preparing Gene col" [1] "++ Preparing A1,A1 cols" [1] "++ Preparing MAF,Freq cols" [1] "++ Inferring MAF from frequency column..." [1] "++ Preparing N_cases,N_controls cols" [1] "++ Preparing `proportion_cases` col" [1] "++ Calculating `proportion_cases`." [1] "++ Preparing N col" [1] "+ Preparing sample_size (N) column" [1] "++ Computing effective sample size." [1] "++ Preparing t-stat col" [1] "+ Calculating t-statistic from Effect and StdErr..." [1] "++ Assigning lead SNP" [1] "++ Ensuring Effect, StdErr, P are numeric" [1] "++ Ensuring 1 SNP per row" [1] "++ Removing extra whitespace" [1] "++ Saving subset ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/BST1/BST1_Nalls23andMe_2019_subset.tsv.gz" [1] "LD:: Using 1000Genomes as LD reference panel." [1] "LD Reference Panel = 1KGphase1" [1] "+ LD:: Querying 1KG remote server." [1] "+ CONDA:: Identified tabix executable in echoR env." [1] "LD:: Querying VCF subset" [1] "/home/acarrasco/.conda/envs/echoR/bin/tabix -fh -p vcf ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521//ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 4:15727389-15747197 > /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/BST1/LD/BST1.1KGphase1.vcf" Segmentation fault (core dumped) [1] "LD:BCFTOOLS:: Compressing vcf file..." [1] "LD:TABIX:: Re-indexing vcf.gz..." [1] "LD:BCFTOOLS:: Subsetting vcf to only include EUR individuals ( 378 / 1091 )." Failed to read from /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/BST1/LD/BST1.1KGphase1.vcf.gz: unknown file type [1] "LD:PLINK:: Converting vcf.gz to .bed/.bim/.fam" PLINK v1.90b6.9 64-bit (4 Mar 2019) www.cog-genomics.org/plink/1.9/ (C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/BST1/LD/plink.log. Options in effect: --out /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/BST1/LD/plink --vcf /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/BST1/LD/BST1.1KGphase1.vcf.gz 257891 MB RAM detected; reserving 128945 MB for main workspace. Error: File read failure. [1] "LD:snpStats:: Computing LD (stats = R)" Error in data.table::fread(bim_path, col.names = c("CHR", "SNP", "V3", : File '/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/BST1/LD/plink.bim' does not exist or is non-readable. getwd()=='/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR' Fine-mapping complete in: Time difference of 11.7 secs ) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( ( MEX3C (2 / 2) ) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( ( [1] "Determining chrom type from file header" [1] "LD:: Standardizing summary statistics subset." [1] "++ Preparing Gene col" [1] "++ Preparing A1,A1 cols" [1] "++ Preparing MAF,Freq cols" [1] "++ Inferring MAF from frequency column..." [1] "++ Preparing N_cases,N_controls cols" [1] "++ Preparing `proportion_cases` col" [1] "++ Calculating `proportion_cases`." [1] "++ Preparing N col" [1] "+ Preparing sample_size (N) column" [1] "++ Computing effective sample size." [1] "++ Preparing t-stat col" [1] "+ Calculating t-statistic from Effect and StdErr..." [1] "++ Assigning lead SNP" [1] "++ Ensuring Effect, StdErr, P are numeric" [1] "++ Ensuring 1 SNP per row" [1] "++ Removing extra whitespace" [1] "++ Saving subset ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/MEX3C/MEX3C_Nalls23andMe_2019_subset.tsv.gz" [1] "LD:: Using 1000Genomes as LD reference panel." [1] "LD Reference Panel = 1KGphase1" [1] "+ LD:: Querying 1KG remote server." [1] "+ CONDA:: Identified tabix executable in echoR env." [1] "LD:: Querying VCF subset" [1] "/home/acarrasco/.conda/envs/echoR/bin/tabix -fh -p vcf ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521//ALL.chr18.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 18:48673708-48692728 > /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/MEX3C/LD/MEX3C.1KGphase1.vcf" Segmentation fault (core dumped) [1] "LD:BCFTOOLS:: Compressing vcf file..." [1] "LD:TABIX:: Re-indexing vcf.gz..." [1] "LD:BCFTOOLS:: Subsetting vcf to only include EUR individuals ( 378 / 1091 )." Failed to read from /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/MEX3C/LD/MEX3C.1KGphase1.vcf.gz: unknown file type [1] "LD:PLINK:: Converting vcf.gz to .bed/.bim/.fam" PLINK v1.90b6.9 64-bit (4 Mar 2019) www.cog-genomics.org/plink/1.9/ (C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/MEX3C/LD/plink.log. Options in effect: --out /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/MEX3C/LD/plink --vcf /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/MEX3C/LD/MEX3C.1KGphase1.vcf.gz 257891 MB RAM detected; reserving 128945 MB for main workspace. Error: File read failure. [1] "LD:snpStats:: Computing LD (stats = R)" Error in data.table::fread(bim_path, col.names = c("CHR", "SNP", "V3", : File '/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/MEX3C/LD/plink.bim' does not exist or is non-readable. getwd()=='/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR' Fine-mapping complete in: Time difference of 8.7 secs Error in `[.data.table`(x, r, vars, with = FALSE) : column(s) not found: SNP
Then, we see how the loci specific LD file is empty, leading to the segmentation fault error
acarrasco@kronos:/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/BST1/LD$ gunzip BST1.1KGphase1.vcf.gz acarrasco@kronos:/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/results/GWAS/Nalls23andMe_2019/BST1/LD$ cat BST1.1KGphase1.vcf
Same output when I tried to use the phase 3 from 1000G Panel
Conclussion
I believe there is an issue accessing the 1000Genome Ref Panel. When I changed the reference panel to UKB, the finemap_loci() function worked
I am trying to compute LD matrix from summary statistics, and I want to use the 1000 genome phase 3 reference panel.
echoLD::get_LD(
query_dat = sumstats,
locus_dir = save_dir,
query_genome="hg38",
LD_reference = "1KGphase3",
superpopulation = "EUR",
nThread = 16
)
This works for some loci, but not always. For example, I have a loci on chr2:8906676-9922165, the error said invalid 'path' argument
, but I am sure the path is correct.
What is the issue behind it? How should I resolve this problem?
Add a message making it more clear when the function fails due to lack of overlapping RSIDS. Perhaps with a backup strategy that involves using the genomic coordinates.
Running this example seems to work totally fine on my Mac (albeit slowly and taking lots of memory, as usual):
query_dat <- echodata::BST1[seq(1, 50), ]
locus_dir <- file.path(tempdir(), echodata::locus_dir)
LD_list <- echoLD:::get_LD_UKB(
query_dat = query_dat,
locus_dir = locus_dir)
Using UK Biobank LD reference panel.
+ UKB LD file name: chr4_14000001_17000001
Downloading full .gz/.npz UKB files and saving to disk.
Downloading with axel [1thread(s)]:
https://data.broadinstitute.org/alkesgroup/UKBB_LD/chr4_14000001_17000001.gz ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpFq7D4S/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_14000001_17000001.gz
axel download complete.
Time difference of 1.4 secs
Downloading with axel [1thread(s)]:
https://data.broadinstitute.org/alkesgroup/UKBB_LD/chr4_14000001_17000001.npz ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpFq7D4S/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_14000001_17000001.npz
axel download complete.
Time difference of 22.9 secs
load_ld() python function input: /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpFq7D4S/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_14000001_17000001
Reading LD matrix into memory. This could take some time...
/var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpFq7D4S/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_14000001_17000001.gz
/var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpFq7D4S/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_14000001_17000001.npz
Processed URL: /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpFq7D4S/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_14000001_17000001
Some other message at the end
+ Full UKB LD matrix: 22,085 x 22,085
+ Full UKB LD SNP data.table: 22,085 x 5
50 x 50 LD_matrix (sparse)
Converting obj to sparseMatrix.
Saving sparse LD matrix ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpFq7D4S/results/GWAS/Nalls23andMe_2019/BST1/LD/BST1.UKB_LD.RDS
+ Removing .gz temp files.
+ Removing .npz temp files.
Nevertheless, when running unit tests during check, this error is produced. Not sure why testthat
wouldn't have access to all the same libraries that the main R console does.
INTEL MKL ERROR: dlopen(/opt/anaconda3/lib/libmkl_intel_thread.1.dylib, 0x0009): Library not loaded: @rpath/libiomp5.dylib
Referenced from: /opt/anaconda3/lib/libmkl_intel_thread.1.dylib
Reason: tried: '/Library/Frameworks/R.framework/Resources/lib/libiomp5.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libiomp5.dylib' (no such file).
Intel MKL FATAL ERROR: Cannot load libmkl_intel_thread.1.dylib.
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.4
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] echoLD_0.99.6 snpStats_1.46.0 Matrix_1.4-1 survival_3.4-0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 matrixStats_0.62.0 bit64_4.0.5
[4] filelock_1.0.2 progress_1.2.2 httr_1.4.4
[7] rprojroot_2.0.3 GenomeInfoDb_1.32.3 tools_4.2.1
[10] utf8_1.2.2 R6_2.5.1 DT_0.24
[13] DBI_1.1.3 BiocGenerics_0.42.0 tidyselect_1.1.2
[16] prettyunits_1.1.1 bit_4.0.4 curl_4.3.2
[19] compiler_4.2.1 cli_3.3.0 Biobase_2.56.0
[22] basilisk.utils_1.8.0 xml2_1.3.3 DelayedArray_0.22.0
[25] rtracklayer_1.57.0 readr_2.1.2 rappdirs_0.3.3
[28] stringr_1.4.1 digest_0.6.29 Rsamtools_2.12.0
[31] piggyback_0.1.4 R.utils_2.12.0 basilisk_1.8.1
[34] XVector_0.36.0 pkgconfig_2.0.3 htmltools_0.5.3
[37] echodata_0.99.11 MatrixGenerics_1.8.1 dbplyr_2.2.1
[40] fastmap_1.1.0 BSgenome_1.64.0 htmlwidgets_1.5.4
[43] rlang_1.0.4 rstudioapi_0.14 RSQLite_2.2.16
[46] BiocIO_1.6.0 generics_0.1.3 jsonlite_1.8.0
[49] echoconda_0.99.6 BiocParallel_1.30.3 R.oo_1.25.0
[52] zip_2.2.0 dplyr_1.0.9 VariantAnnotation_1.42.1
[55] RCurl_1.98-1.8 magrittr_2.0.3 GenomeInfoDbData_1.2.8
[58] Rcpp_1.0.9 S4Vectors_0.34.0 fansi_1.0.3
[61] reticulate_1.25 R.methodsS3_1.8.2 lifecycle_1.0.1
[64] stringi_1.7.8 yaml_2.3.5 SummarizedExperiment_1.26.1
[67] zlibbioc_1.42.0 BiocFileCache_2.4.0 grid_4.2.1
[70] blob_1.2.3 parallel_4.2.1 crayon_1.5.1
[73] dir.expiry_1.4.0 lattice_0.20-45 Biostrings_2.64.1
[76] splines_4.2.1 GenomicFeatures_1.48.3 hms_1.1.2
[79] KEGGREST_1.36.3 pillar_1.8.1 GenomicRanges_1.48.0
[82] rjson_0.2.21 codetools_0.2-18 biomaRt_2.52.0
[85] stats4_4.2.1 pkgload_1.3.0 XML_3.99-0.10
[88] glue_1.6.2 BiocManager_1.30.18 data.table_1.14.2
[91] tzdb_0.3.0 png_0.1-7 vctrs_0.4.1
[94] tidyr_1.2.0 purrr_0.3.4 assertthat_0.2.1
[97] cachem_1.0.6 openxlsx_4.2.5 echotabix_0.99.7
[100] restfulr_0.0.15 downloadR_0.99.3 tibble_3.1.8
[103] GenomicAlignments_1.32.1 AnnotationDbi_1.58.0 memoise_2.0.1
[106] IRanges_2.30.1 ellipsis_0.3.2 here_1.0.1
For some reason this shows up even when snpStats
is already installed...
Is your feature request related to a problem? Please describe.
We want to be able to get the exact superpopulation, or even better, the subpopulation from the input GWAS, by correlating the input GWAS MAFs against the the LDref MAFs.
Describe the solution you'd like
Describe alternatives you've considered
If the inputted superpopulation does not really mach, rise a warning for now
Additional context
Read what is out there and maybe try to perform an undustment od the LD reference (ie deconvolution methods)
NOTE
Go to translate_population.R
**Using UKBB 150 to calculate LD structure **
Maybe acessing UKB 150K is another alternative to calculate LD
https://www.nature.com/articles/s41586-022-04965-x
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.