ldshrink: a one-stop R package for shrinkage estimation of linkage disequilibrium

NOTE: This package is still under very active development.

To install:

install.packages("devtools")
devtools::install_github("stephenslab/ldshrink")
Below is a check list of guidelines on files in R/ from Wickham's package book: http://r-pkgs.had.co.nz/r.html

- Use _ to separate words within a file name.
- Never use library() or require() inside a package. Use the DESCRIPTION to specify your package's requirements.
- Never use source() to load code from a file. Rely on devtools::load_all(), which automatically sources all files in R/.
- If you modify global state inside a function, use on.exit() to restore it; options(), par() and setwd() are common examples.
- Use .onAttach() for a startup message, and use packageStartupMessage(), not message().
- If you modify options() in .onLoad(), consider using .onUnload() to clean up any side effects.
- Write non-ASCII characters in .R files in the "\u1234" escape format; the easiest way to get the correct escape is to use stringi::stri_escape_unicode().
Below are links to several genetic maps that I have used in my past work. After discussing with Peter (see Issue #3), it seems fine to include these external data in LDshrink as long as we properly cite them. To make LDshrink a one-stop package for LD-related calculations, I think we should include all these files in the package (if possible).
IMPUTE v1
Note that these maps have a very straightforward format:
$ head genetic_map_CEU_chr16.txt
position CEU_rate(cM/Mb) Genetic_Map(cM)
24045 0 0
24170 0.3020482 0
25057 0.3020482 0.0002679168
25065 0.2846052 0.0002703331
25561 0.2782453 0.0004114973
25658 0.2782453 0.0004384871
28165 0.2782453 0.001136048
29475 0.2794324 0.001500549
31010 0.2794324 0.001929478
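As a quick sketch of how such a map could be loaded in R (the inline text below just reproduces the first rows of the listing above; a real workflow would point read.table() at genetic_map_CEU_chr16.txt instead):

```r
# Parse a few rows of the IMPUTE-format genetic map shown above.
map_txt <- "position CEU_rate(cM/Mb) Genetic_Map(cM)
24045 0 0
24170 0.3020482 0
25057 0.3020482 0.0002679168"

gmap <- read.table(text = map_txt, header = TRUE, check.names = FALSE)
# Rename to simpler column names for downstream use.
colnames(gmap) <- c("pos", "rate_cM_Mb", "cum_cM")
```

The cumulative genetic position (third column) is what the Wen-Stephens estimator ultimately needs for each SNP.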
We need to make sure that map_data, m and Ne are all consistent with the same version of genetic map data (map_data) used here. For example, if I remember correctly, m=85 and Ne=11490.672741 correspond to 1000 Genomes Phase 1 CEU samples.
Nick came up with a smart way to handle this issue.
Here the cov() seems to be the default one in R, but Nick also has a faster cov() based on C (which does not handle NA well, though). We should allow people to use this faster one, if they wish and can provide input data without NA.
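As a sketch of what that option might look like (safe_cov and allow_fast are hypothetical names, and the C routine is stubbed out with base R's cov(); base R's cov() handles NA via its use argument):

```r
# Dispatch to a fast NA-unsafe covariance when the caller opts in and the
# data are NA-free; otherwise fall back to the NA-tolerant base R cov().
safe_cov <- function(X, allow_fast = FALSE) {
  if (allow_fast && !anyNA(X)) {
    # fast_cov_c(X)  # hypothetical C-based routine, NA-unsafe
    cov(X)           # placeholder so this sketch runs as-is
  } else {
    cov(X, use = "pairwise.complete.obs")  # handles NA entries
  }
}

set.seed(1)
X <- matrix(rnorm(20), nrow = 5)
X[2, 1] <- NA
S <- safe_cov(X)
```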
Below is a check list of guidelines on the DESCRIPTION file from Wickham's package book: http://r-pkgs.had.co.nz/description.html

- Every package must have a DESCRIPTION. DESCRIPTION uses a simple file format called DCF, the Debian control format.
- Imports: packages listed here must be present for your package to work. In fact, any time your package is installed, those packages will, if not already present, be installed on your computer.
- Refer to functions in other packages with package::function(). There is a small performance penalty associated with :: (on the order of 5µs, so it will only matter if you call the function millions of times).
- Suggests: your package can use these packages, but doesn’t require them. You might use suggested packages for example datasets, to run tests, build vignettes, or maybe there’s only one function that needs the package.
- Packages in Suggests are not automatically installed along with your package. This means that you need to check if the package is available before using it (use requireNamespace(x, quietly = TRUE)).
- The easiest way to add Imports and Suggests to your package is to use devtools::use_package().
- If you need a specific version of a package, specify it, e.g. MASS (>= 7.3.0).
- You should almost always use Imports, not Depends. The main exception is requiring a specific version of R itself, e.g. Depends: R (>= 3.0.1).
- LinkingTo: packages listed here rely on C or C++ code in another package.
- Dependencies on external (non-R) software are described in the SystemRequirements field.
- Title is a one line description of the package, and is often shown in package listings. It should be plain text (no markup), capitalised like a title, and NOT end in a period. Keep it short: listings will often truncate the title to 65 characters.
- Description is more detailed than the title. You can use multiple sentences but you are limited to one paragraph. If your description spans multiple lines (and it should!), each line must be no more than 80 characters wide. Indent subsequent lines with 4 spaces.
- You can also include a README.md file that goes into much more depth and shows a few examples.
- Use the Authors@R field to identify the package's author and maintainer.
- A version number consists of at least two integers separated by . or -. For example, 1.0 and 0.9.1-10 are valid versions.
- A released version number consists of three numbers, <major>.<minor>.<patch>.
- An in-development package has a fourth component, the development version, starting at 9000. For example, the first version of the package should be 0.0.0.9000. Increment 9000 to 9001 if you’ve added an important feature that another development package needs to depend on.
- Collate controls the order in which R files are sourced. This only matters if your code has side-effects; most commonly because you’re using S4.
- LazyData makes it easier to access data in your package. Because it’s so important, it’s included in the minimal description created by devtools.
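Putting several of these guidelines together, a minimal DESCRIPTION might look like the sketch below (the field values are illustrative assumptions, not the package's actual metadata):

```
Package: ldshrink
Title: Shrinkage Estimation of Linkage Disequilibrium
Version: 0.0.0.9000
Authors@R: person("First", "Last", email = "first.last@example.com",
    role = c("aut", "cre"))
Description: A one-stop R package for shrinkage estimation of linkage
    disequilibrium, including the Wen-Stephens (2010) estimator.
Depends: R (>= 3.0.1)
Imports: Rcpp
LinkingTo: Rcpp, RcppEigen
Suggests: testthat, knitr, rmarkdown
License: MIT + file LICENSE
LazyData: true
```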
@CreRecombinase spotted the following line that might not be necessary.
This is an LD shrinkage method introduced to improve PCA in studies of population structure. I will review it further to see whether/how to incorporate it in ldshrink.
Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2912642/
Although the main purpose of LDshrink is to provide a one-stop place for the various LD shrinkage estimators that have recently been used in the GWAS summary statistics literature, I think perhaps we should also provide the option to output the traditional LD statistics from the population genetics literature.
Below are two potential use cases for this option.
This issue may be redundant since Nick has already provided the option useLDshrink = TRUE.
Here we simply call the R built-in function eigen().
According to the R documentation, it seems that this call does not exploit the banded structure of the Wen-Stephens estimator:
Source
eigen uses the LAPACK routines DSYEVR, DGEEV, ZHEEV and ZGEEV.
LAPACK is from http://www.netlib.org/lapack and its guide is listed in the references.
Nick also has a version for a sparse LD matrix, which is not included here:
I also have a routine that computes and stores a sparse symmetric matrix for the whole chromosome
it's faster than the dense block-wise routine
It seems that the GCTB software has implemented the Wen-Stephens shrinkage LD estimator (see also https://github.com/stephenslab/rss-private/issues/64):
Shrunk LD matrix
The shrinkage estimator for the LD correlation matrix was originally proposed by Wen and Stephens (2010). The estimator shrinks the off-diagonal entries of the sample LD matrix toward zero. Zhu and Stephens (2017) used the shrinkage estimator in their Regression with Summary Statistics (RSS) methodology and showed empirically that it can provide improved inference. The shrinkage estimator overcomes some of the issues arising from approximating the full LD matrix in the summary-data based model using a subset of the full LD matrix and constructing the matrix from a reference. The GCTB implementation is a C++ port from that provided with the RSS software and has been adapted for use with the GCTB software.
GCTB software doc page: http://cnsgenomics.com/software/gctb/#SummaryBayesianAlphabet
@pcarbo Hi Peter -- I just had a great meeting with Nick today, and we have the following question about including published external data in the LDshrink package. We would appreciate your input.
To make the package user-friendly, we would like to include the following external data files:
Genetic maps from 1000 Genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/
Approximately independent LD blocks: https://bitbucket.org/nygcresearch/ldetect-data/src
These data files have fairly simple structure. The genetic maps are data frames with the following headers:
[id] [physical position] [genetic position (cumulative)]
The LD blocks are data frames with the following headers:
[chromosome name] [region start] [region stop]
These data files also have formal publications available. I wonder if we could include these external files in the LDshrink package, provided that we explicitly cite these publications?
Do we have to worry about licensing issues?
An alternative plan is to provide users with the data preprocessing scripts/functions, but I think this will make the package harder to use.
Below is a check list of guidelines on files in vignettes/ from Wickham's package book: http://r-pkgs.had.co.nz/vignettes.html.
This part seems straightforward since the main tools are rmarkdown and knitr, with which I have some experience. However, preparing and organizing the contents of vignettes may be as hard as writing articles.

- Create a new vignette with devtools::use_vignette("my-vignette").
- You can build all vignettes with devtools::build_vignettes(), but this is rarely useful. Instead use devtools::build() to create a package bundle with the vignettes included.
- Packages used only in vignettes (e.g. knitr) must be declared in the DESCRIPTION.
This thread is based on http://r-pkgs.had.co.nz/package.html#naming.
LDshrink -> ldshrink. This seems to be a global change.

In my version of the clang compiler (version 4.0.1), I get several dozen warnings, which are attached. Some of these are in the Eigen source, but others are in our own code, and perhaps some of these can be easily addressed. I've labeled this as a "bug" but perhaps it should be read as "potential bug".
Hello,
I just started using the GCTB software but am running into an issue when I try to merge sparse ldm files from a single chromosome into one ldm file. I get the following error:
/var/spool/slurmd.cn1605/job55094972/slurm_script: line 13: 82612 Segmentation fault
Here are the basics of the code I ran:
gctb --mldm chr1.mldmlist --make-sparse-ldm --out chr1 --thread 16
Thank you,
Alexis
Although not the main focus or output of LDshrink, it will be straightforward (and perhaps useful) to generate some simple visualization results similar to Fig. 1 in Wen & Stephens (2010).
Also point out in documentation/manuscript that some existing packages can be used for this task: e.g. https://cran.r-project.org/web/packages/corrplot/index.html
Here is a feature request for an export interface: given a chromosome plus start and end positions, extract a complete matrix (or a sparse matrix in R, which can trivially be made complete) from a given blocked LD database file.
Below is a check list of guidelines on files in man/ from Wickham's package book: http://r-pkgs.had.co.nz/man.html
Even as a total newbie to R package development, I know I should not manually edit files in the man/ folder most of the time ... Hence, this check list should be used together with #22.
Instead of writing these files by hand, we’re going to use roxygen2, which turns specially formatted comments into .Rd files.

- Roxygen comments start with #' and come before a function.
- Use @@ if you want to add a literal @ to the documentation.
- You can add arbitrary sections with the @section tag. This is a useful way of breaking a long details section into multiple chunks with useful headings. Section titles should be in sentence case, must be followed by a colon, and they can only be one line long.
- @seealso allows you to point to other useful resources, either on the web, \url{http://www.r-project.org}, in your package \code{\link{functioname}}, or another package \code{\link[packagename]{functioname}}.
- If you have a family of related functions, link them with @family. The value of @family should be plural.
- @aliases alias1 alias2 ... adds additional aliases to the topic. An alias is another name for the topic that can be used with ?.
- @param name description describes the function’s inputs or parameters. You can document multiple parameters in one place, e.g. @param x,y Numeric vectors.
- Examples are run by R CMD check. \dontrun{} allows you to include code in the example that is not run.
- You can put examples in separate files and use @example path/relative/to/package/root to insert them into the documentation. (Note that the @example tag here has no ‘s’.)
- @return description describes the output from the function.
- Use @@ for @, %% for %, and \\ for \.
- You can inherit parameter descriptions with @inheritParams source_function and/or @inheritParams package::function.
- You can document multiple functions in the same file with @rdname or @describeIn.
- Within roxygen tags, you use .Rd syntax to format text: https://cran.r-project.org/doc/manuals/R-exts.html#Marking-text

It seems that the current LD score calculation is different from the authors' approach.
Below is the 2nd paragraph of the first LDSC paper online method: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4495769/
I wonder how hard it would be to implement the following L2-penalty (i.e. "ridge") LD shrinkage estimator: Sigma_ridge = Sigma + lambda * I, where Sigma is the sample covariance matrix for SNPs.
For simplicity we can fix the value of lambda (e.g. lambda = 0.01) for now. (It might be conceptually straightforward to tune this parameter by cross-validation, but this definitely complicates the software implementation.)
This approach was used in ImpG-Summary: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4184260/
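A minimal sketch of this estimator (ridge_ld is a hypothetical name; lambda = 0.01 just mirrors the value suggested above):

```r
# Ridge-style LD shrinkage: Sigma_ridge = Sigma + lambda * I.
ridge_ld <- function(X, lambda = 0.01) {
  Sigma <- cov(X)                      # sample covariance of SNP genotypes
  Sigma + lambda * diag(nrow(Sigma))   # inflate the diagonal
}

set.seed(42)
X <- matrix(rbinom(200, 2, 0.3), nrow = 20)  # toy n-by-p genotype matrix
R_ridge <- ridge_ld(X)
```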
It seems that haplo_panel is a bit confusing here -- haplotype data are not required; people can input genotype data to compute the Wen-Stephens estimator, and indeed this feature is supported by the isGeno option.
On a related note, there seem to be three types of naming conventions:
map_data
isGeno
na.rm
Is it worthwhile to pick one and stick to it?
@CreRecombinase @xiangzhu Can you find a way to replace the C++17 requirement (in the Makevars file) with the C++11 requirement, for example? The C++17 standard is much less widely supported; compare this and this. In the interest of making the LDshrink package as accessible as possible, can you try to find a way to avoid using C++17 features?
Thanks to @jean997 for identifying this issue.
We often exclude the MHC region when estimating LD and performing further downstream analyses.
It would be nice to do this MHC exclusion for people.
The idea is to have the following simple data frame inside ldshrink, and a flag like remove_mhc=TRUE:
> HLA.hg19
chrom start.base end.base
HLA 6 29719561 32883508
> is(HLA.hg19)
[1] "data.frame" "list" "oldClass" "vector"
Similarly we may want to remove centromeres as well.
Both MHC and centromeres data frames can be found here: https://github.com/smgogarten/GWASTools/tree/master/data.
This issue is also related to Issue #14, since they share the same idea: if we cannot estimate something reliably, then it is probably better not to estimate it at all.
Another very detailed note: pay attention to the sources of these regions -- e.g. if these regions are pulled from UCSC, then they are 0-based.
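A sketch of what a remove_mhc-style filter might look like (drop_mhc is a hypothetical name; the hg19 coordinates are the ones from the HLA.hg19 data frame shown above):

```r
# MHC region on chromosome 6, hg19 coordinates (from HLA.hg19 above).
HLA.hg19 <- data.frame(chrom = 6, start.base = 29719561, end.base = 32883508)

# Return TRUE for SNPs to keep, FALSE for SNPs inside the MHC region.
drop_mhc <- function(chrom, pos, region = HLA.hg19) {
  in_mhc <- chrom == region$chrom &
    pos >= region$start.base & pos <= region$end.base
  !in_mhc
}

keep <- drop_mhc(chrom = c(6, 6, 1), pos = c(30000000, 10000, 30000000))
```

The same pattern would extend to a table of centromere regions.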
It seems the main function only returns an estimated LD matrix at this point?
https://github.com/stephenslab/LDshrink/blob/32b4ad3942f7cb429f23c529b86ab72cfbb1b257/R/LDshrink.R#L6
Ideally we want to have some basic SNP info available (e.g. position, alleles), which is essential for combining LD with GWAS summary statistics in analyses.
I think the emeraLD package gives us a good example: https://github.com/statgen/emeraLD
> source('emeraLD2R.r');
Loading required package: data.table
data.table 1.11.4 Latest news: http://r-datatable.com
emeraLD v0.1 (c) 2018 corbin quick (corbinq@gmail.com)
reading from m3vcf file...
processed genotype data for 5008 haplotypes...
calculating LD for 60 SNPs...
done!! thanks for using emeraLD
> names(ld_data)
[1] "Sigma" "info"
> head(ld_data$info)
chr pos id ref alt
1: 20 83061 rs549711487 C T
2: 20 83196 rs62190472 A T
3: 20 83252 rs6137896 G C
4: 20 83570 rs6048967 T G
5: 20 83611 rs114000219 C A
6: 20 83792 rs529518485 A G
> head(ld_data$Sigma[, 1:5], 5)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.00000 -0.00602 0.03989 -0.00824 -0.00331
[2,] -0.00602 1.00000 -0.14013 -0.03102 -0.01245
[3,] 0.03989 -0.14013 1.00000 -0.05714 -0.04400
[4,] -0.00824 -0.03102 -0.05714 1.00000 -0.01704
[5,] -0.00331 -0.01245 -0.04400 -0.01704 1.00000
I wonder how hard it would be to add the following distance-based LD shrinkage estimator: LD(i, j) = 0 if SNPs i and j are far apart, say more than 10Mb.
This approach was used in GCTA-COJO paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593158/
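A sketch of this distance-based ("banding") estimator (band_ld is a hypothetical name; the 10Mb cutoff mirrors the value mentioned above):

```r
# Zero out LD between SNP pairs more than max_dist base pairs apart.
band_ld <- function(LD, pos, max_dist = 1e7) {
  far <- abs(outer(pos, pos, "-")) > max_dist  # pairwise distances
  LD[far] <- 0
  LD
}

pos <- c(1e6, 2e6, 15e6)                 # toy SNP positions (bp)
LD <- matrix(0.5, 3, 3); diag(LD) <- 1   # toy correlation matrix
LD_banded <- band_ld(LD, pos)
```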
It may be useful to include the list of long-range LD regions in LDshrink: https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD).
The data structure is straightforward: a 3-column data frame:
Chr | Start | Stop |
---|---|---|
1 | 48000000 | 52000000 |
2 | 86000000 | 100500000 |
2 | 134500000 | 138000000 |
2 | 183000000 | 190000000 |
For example, let LDshrink throw a warning message when genotypes of SNPs in these long-range LD regions are used.
Hi Xiang,
I have read your AOAS paper, and I think the ldshrink package will be very helpful to me. But after downloading the package, it is hard for me to find the main function to calculate the shrinkage LD structure. Would you give me a toy example showing how to use the package?
Many thanks for your kind help.
Sheng
Currently ldshrink assumes the input genotype/haplotype data are stored in an n-by-p numerical matrix, which is convenient from a statistician's perspective. However, public genotype/haplotype data from 1000 Genomes are stored in VCF format.
In the past I first used vcftools to convert VCF data to the IMPUTE2 format (which is indeed a p-by-n matrix), and then transposed the IMPUTE2-formatted data in R. See https://github.com/stephenslab/rss/blob/master/misc/import_1000g_vcf.sh.
This two-step workflow is not so convenient (at least for statisticians): they have to learn a new program like vcftools before doing any LD-related operations in ldshrink.
It seems that now we can use data.table (https://cran.r-project.org/web/packages/data.table) to directly convert VCF data to an n-by-p matrix in R. Here is an example: https://gist.github.com/cfljam/bc762f1d7b412df594ebc4219bac2d2b.
Here is my own example.
Here is my own example.
> suppressPackageStartupMessages(library(data.table))
> my_dt <- data.table::fread(cmd="zcat B_CELL_NAIVE.vcf.gz")
|--------------------------------------------------|
|==================================================|
>
> dim(my_dt)
[1] 215708754 8
> head(my_dt)
#CHROM POS ID REF ALT QUAL FILTER
1: chr15 101094127 rs12904576 T C . PASS
2: chr15 101114640 rs36053285 T C . PASS
3: chr15 45685616 rs796721871 TA T . PASS
4: chr15 45670316 rs2114501 G A . PASS
5: chr15 45618730 rs59889118 G A . PASS
6: chr15 45620612 rs1980288 T C . PASS
INFO
1: Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=3.714e-27;Beta=-1.10;Statistic=-15.72;FDR=2.681e-20
2: Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=2.649e-24;Beta=-1.09;Statistic=-14.16;FDR=3.412e-18
3: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=2.614e-23;Beta=-1.10;Statistic=-13.63;FDR=3.412e-18
4: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.007e-23;Beta=-1.09;Statistic=-13.6;FDR=3.412e-18
5: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
6: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
The benefit of using data.table here is two-fold: i) users don't have to leave R and use vcftools to get an n-by-p genotype matrix from VCF data; ii) data.table is a well-maintained and constantly upgraded package that can handle large datasets efficiently (at least based on my past experience).
Hence, we can either add a wrapper that uses data.table to parse VCF for ldshrink users, or at a minimum, we can simply provide a vignette showing how to use data.table to parse VCF.
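A hedged sketch of the genotype-conversion step such a wrapper would need (the VCF genotype strings below are made-up toy data; a real wrapper would read them with data.table::fread): split each "a|b" GT field into allele dosages and transpose to the n-by-p layout ldshrink expects.

```r
# Toy VCF genotype columns: 3 SNPs (rows) x 2 samples (columns), i.e. the
# p-by-n layout found in a VCF body.
gt <- matrix(c("0|0", "0|1",
               "1|1", "0|0",
               "0|1", "1|0"),
             nrow = 3, byrow = TRUE)

# Convert "a|b" (or "a/b") strings to allele-count dosages, then transpose
# to the n-by-p matrix that ldshrink assumes.
vcf_gt_to_dosage <- function(gt) {
  dosage <- apply(gt, c(1, 2), function(g) {
    sum(as.integer(strsplit(g, "[|/]")[[1]]))
  })
  t(dosage)
}

G <- vcf_gt_to_dosage(gt)
```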
Finally, there exists a package vcfR (https://cran.r-project.org/web/packages/vcfR) that might be relevant (but I have not used it much).
I came across the following error on my Mac laptop:
> devtools::load_all(".")
[omitted]
RcppExports.cpp:242:5: error: no matching function for call to 'R_useDynamicSymbols'
R_useDynamicSymbols(dll, FALSE);
^~~~~~~~~~~~~~~~~~~
/Library/Frameworks/R.framework/Resources/include/R_ext/Rdynload.h:84:10: note: candidate function not viable: no known conversion from 'int' to 'Rboolean' for 2nd argument
Rboolean R_useDynamicSymbols(DllInfo *info, Rboolean value);
^
4 warnings and 1 error generated.
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘ldshrink’
* removing ‘/private/var/folders/dm/39__1gf52_s_9xbqrfd_46lc0000gp/T/RtmpjSVD5F/devtools_install_60b7e017afc/ldshrink’
Error: Command failed (1)
Here is my laptop info:
# xiangzhu @ stanford in ~ [22:27:51]
$ system_profiler SPSoftwareDataType
Software:
System Software Overview:
System Version: macOS 10.14 (18A391)
Kernel Version: Darwin 18.0.0
Boot Volume: Macintosh HD
Boot Mode: Normal
Computer Name: STA-C02V26FCHTD5
User Name: Xiang Zhu (xiangzhu)
Secure Virtual Memory: Enabled
System Integrity Protection: Enabled
Time since boot: 5 days 22:05
Here is my R session info:
> devtools::session_info()
Session info --------------------------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
system x86_64, darwin15.6.0
ui RStudio (1.1.456)
language (EN)
collate en_US.UTF-8
tz America/Los_Angeles
date 2018-10-04
Packages ------------------------------------------------------------------------------------------------------------------------------------------
package * version date source
base * 3.5.1 2018-07-05 local
BH 1.66.0-1 2018-02-13 CRAN (R 3.5.0)
commonmark 1.6 2018-09-30 CRAN (R 3.5.0)
compiler 3.5.1 2018-07-05 local
datasets * 3.5.1 2018-07-05 local
devtools 1.13.6 2018-06-27 CRAN (R 3.5.0)
digest 0.6.17 2018-09-12 CRAN (R 3.5.1)
graphics * 3.5.1 2018-07-05 local
grDevices * 3.5.1 2018-07-05 local
grid 3.5.1 2018-07-05 local
lattice 0.20-35 2017-03-25 CRAN (R 3.5.1)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
Matrix 1.2-14 2018-04-13 CRAN (R 3.5.1)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods * 3.5.1 2018-07-05 local
R6 2.3.0 2018-10-04 CRAN (R 3.5.1)
Rcpp 0.12.19 2018-10-01 CRAN (R 3.5.0)
RcppEigen 0.3.3.4.0 2018-02-07 CRAN (R 3.5.0)
RcppParallel 4.4.1 2018-07-19 CRAN (R 3.5.0)
RcppProgress 0.4.1 2018-05-11 CRAN (R 3.5.0)
rlang 0.2.2 2018-08-16 CRAN (R 3.5.0)
roxygen2 6.1.0 2018-07-27 CRAN (R 3.5.0)
stats * 3.5.1 2018-07-05 local
stringi 1.2.4 2018-07-20 CRAN (R 3.5.0)
stringr 1.3.1 2018-05-10 CRAN (R 3.5.0)
testthat 2.0.0 2017-12-13 CRAN (R 3.5.0)
tools 3.5.1 2018-07-05 local
utils * 3.5.1 2018-07-05 local
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
xml2 1.2.0 2018-01-24 CRAN (R 3.5.0)
yaml 2.2.0 2018-07-25 CRAN (R 3.5.0)
Nick found this feature useful, and we should document it.
I'm referring to
https://github.com/stephenslab/rss/blob/6e9c8f08e82996ee1364f65dec491d84b6bcac26/misc/get_corr.m#L76-L77
and
https://github.com/stephenslab/rss/blob/6e9c8f08e82996ee1364f65dec491d84b6bcac26/misc/get_corr.m#L139
theta seems to be very small for all sample sizes, even small ones like 100; I'm therefore wondering if this has any effect at all?
Thanks,
Florian
During my recent whole-genome simulations I found that some trailing eigenvalues of LD matrices are (numerically) complex numbers, e.g. 3.28 + 1e-14i.
This seems to be a numerical issue since the magnitude of the imaginary part is so close to zero.
However, Nick hasn't noticed this pattern in his whole-genome data analyses.
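One common remedy worth trying (a sketch, not necessarily the cause here): complex eigenvalues can appear when floating-point asymmetry creeps into a nominally symmetric LD matrix, and forcing exact symmetry before the decomposition keeps the spectrum real.

```r
# Symmetrize the matrix so eigen() takes the real-symmetric code path
# and is guaranteed to return real eigenvalues.
sym_eigen <- function(S) {
  S <- (S + t(S)) / 2            # remove floating-point asymmetry
  eigen(S, symmetric = TRUE)
}

S <- matrix(c(1, 0.5, 0.5 + 1e-15, 1), 2, 2)  # slightly asymmetric input
e <- sym_eigen(S)
```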
Below is a draft of the meeting agenda.

- LDshrink development?
- Can LDshrink reliably produce results for one chromosome (i.e. Xiang's use case)?
- Can LDshrink reliably produce results for the whole genome (i.e. Nick's use case)?
- LDshrink from an accepted Bioconductor package?
- LDshrink package development so that it can be accepted by Bioconductor.
- LDshrink package (closely related to package vignettes).
- (The mashr revision used Nick's precomputed LD data on Midway.)

Paper: https://doi.org/10.1016/j.ajhg.2015.11.021
See related discussions at #33.