ldshrink: a one-stop R package for shrinkage estimation of linkage disequilibrium

NOTE: This package is still under very active development.

To install:

install.packages("devtools")
devtools::install_github("stephenslab/ldshrink")
Below is a check list of guidelines on files in R/ from Wickham's package book: http://r-pkgs.had.co.nz/r.html

- Use _ to separate words within a file name.
- Never use library() or require() inside a package. Use the DESCRIPTION to specify your package's requirements.
- Never use source() to load code from a file. Rely on devtools::load_all(), which automatically sources all files in R/.
- If you modify global state inside a function, use on.exit() to restore it; options(), par() and setwd() are common examples.
- Use .onAttach() for a startup message, and use packageStartupMessage(), not message().
- If you modify options() in .onLoad(), consider using .onUnload() to clean up any side effects.
- Write non-ASCII characters in .R files in the "\u1234" escape format; the easiest way to get the correct escape is to use stringi::stri_escape_unicode().
Below are links to several genetic maps that I have used in my past work. After discussing with Peter (see Issue #3), it seems fine to include these external data in LDshrink as long as we properly cite them. To make LDshrink a one-stop package for LD-related calculations, I think we should include all these files in the package (if possible).
IMPUTE v1
Note that these maps have a very straightforward format:
$ head genetic_map_CEU_chr16.txt
position CEU_rate(cM/Mb) Genetic_Map(cM)
24045 0 0
24170 0.3020482 0
25057 0.3020482 0.0002679168
25065 0.2846052 0.0002703331
25561 0.2782453 0.0004114973
25658 0.2782453 0.0004384871
28165 0.2782453 0.001136048
29475 0.2794324 0.001500549
31010 0.2794324 0.001929478
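As a quick sketch of how such a map could be loaded in R (the inline text below just reproduces the first rows of the listing above; a real workflow would point read.table() at genetic_map_CEU_chr16.txt instead):

```r
# Parse a few rows of the IMPUTE-format genetic map shown above.
map_txt <- "position CEU_rate(cM/Mb) Genetic_Map(cM)
24045 0 0
24170 0.3020482 0
25057 0.3020482 0.0002679168"

gmap <- read.table(text = map_txt, header = TRUE, check.names = FALSE)
# Rename to simpler column names for downstream use.
colnames(gmap) <- c("pos", "rate_cM_Mb", "cum_cM")
```

The cumulative genetic position (third column) is what the Wen-Stephens estimator ultimately needs for each SNP.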
We need to make sure that map_data, m and Ne are all consistent with the same version of genetic map data (map_data) used here. For example, if I remember correctly, m=85 and Ne=11490.672741 correspond to 1000 Genomes Phase 1 CEU samples.
Nick came up with a smart way to handle this issue.
Here the cov() seems to be the default one in R, but Nick also has a faster cov() based on C (which does not handle NA well, though). We should allow people to use this faster one, if they wish and can provide input data without NA.
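As a sketch of what that option might look like (safe_cov and allow_fast are hypothetical names, and the C routine is stubbed out with base R's cov(); base R's cov() handles NA via its use argument):

```r
# Dispatch to a fast NA-unsafe covariance when the caller opts in and the
# data are NA-free; otherwise fall back to the NA-tolerant base R cov().
safe_cov <- function(X, allow_fast = FALSE) {
  if (allow_fast && !anyNA(X)) {
    # fast_cov_c(X)  # hypothetical C-based routine, NA-unsafe
    cov(X)           # placeholder so this sketch runs as-is
  } else {
    cov(X, use = "pairwise.complete.obs")  # handles NA entries
  }
}

set.seed(1)
X <- matrix(rnorm(20), nrow = 5)
X[2, 1] <- NA
S <- safe_cov(X)
```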
Below is a check list of guidelines on the DESCRIPTION file from Wickham's package book: http://r-pkgs.had.co.nz/description.html

- Every package must have a DESCRIPTION. DESCRIPTION uses a simple file format called DCF, the Debian control format.
- Imports: packages listed here must be present for your package to work. In fact, any time your package is installed, those packages will, if not already present, be installed on your computer.
- Refer to functions in other packages with package::function(). There is a small performance penalty associated with :: (on the order of 5µs, so it will only matter if you call the function millions of times).
- Suggests: your package can use these packages, but doesn’t require them. You might use suggested packages for example datasets, to run tests, build vignettes, or maybe there’s only one function that needs the package.
- Packages in Suggests are not automatically installed along with your package. This means that you need to check if the package is available before using it (use requireNamespace(x, quietly = TRUE)).
- The easiest way to add Imports and Suggests to your package is to use devtools::use_package().
- If you need a specific version of a package, specify it, e.g. MASS (>= 7.3.0).
- You should almost always use Imports, not Depends. The main exception is requiring a specific version of R itself, e.g. Depends: R (>= 3.0.1).
- LinkingTo: packages listed here rely on C or C++ code in another package.
- Dependencies on external (non-R) software are described in the SystemRequirements field.
- Title is a one line description of the package, and is often shown in package listings. It should be plain text (no markup), capitalised like a title, and NOT end in a period. Keep it short: listings will often truncate the title to 65 characters.
- Description is more detailed than the title. You can use multiple sentences but you are limited to one paragraph. If your description spans multiple lines (and it should!), each line must be no more than 80 characters wide. Indent subsequent lines with 4 spaces.
- You can also include a README.md file that goes into much more depth and shows a few examples.
- Use the Authors@R field to identify the package's author and maintainer.
- A version number consists of at least two integers separated by . or -. For example, 1.0 and 0.9.1-10 are valid versions.
- A released version number consists of three numbers, <major>.<minor>.<patch>.
- An in-development package has a fourth component, the development version, starting at 9000. For example, the first version of the package should be 0.0.0.9000. Increment 9000 to 9001 if you’ve added an important feature that another development package needs to depend on.
- Collate controls the order in which R files are sourced. This only matters if your code has side-effects; most commonly because you’re using S4.
- LazyData makes it easier to access data in your package. Because it’s so important, it’s included in the minimal description created by devtools.
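Putting several of these guidelines together, a minimal DESCRIPTION might look like the sketch below (the field values are illustrative assumptions, not the package's actual metadata):

```
Package: ldshrink
Title: Shrinkage Estimation of Linkage Disequilibrium
Version: 0.0.0.9000
Authors@R: person("First", "Last", email = "first.last@example.com",
    role = c("aut", "cre"))
Description: A one-stop R package for shrinkage estimation of linkage
    disequilibrium, including the Wen-Stephens (2010) estimator.
Depends: R (>= 3.0.1)
Imports: Rcpp
LinkingTo: Rcpp, RcppEigen
Suggests: testthat, knitr, rmarkdown
License: MIT + file LICENSE
LazyData: true
```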
@CreRecombinase spotted the following line that might not be necessary.
This is an LD shrinkage method introduced to improve PCA in studies of population structure. I will review it further to see whether/how to incorporate it in ldshrink.
Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2912642/
Although the main purpose of LDshrink is to provide a one-stop place for the various LD shrinkage estimators that have recently been used in the GWAS summary statistics literature, I think perhaps we should also provide the option to output the traditional LD statistics from the population genetics literature.
Below are two potential use cases for this option.
This issue may be redundant since Nick has already provided the option useLDshrink = TRUE.
Here we simply call the R built-in function eigen().
According to the R documentation, it seems that this call does not exploit the banded structure of the Wen-Stephens estimator:
Source
eigen uses the LAPACK routines DSYEVR, DGEEV, ZHEEV and ZGEEV.
LAPACK is from http://www.netlib.org/lapack and its guide is listed in the references.
Nick also has a version for a sparse LD matrix, which is not included here:
I also have a routine that computes and stores a sparse symmetric matrix for the whole chromosome
it's faster than the dense block-wise routine
It seems that the GCTB software has implemented the Wen-Stephens shrinkage LD estimator (see also https://github.com/stephenslab/rss-private/issues/64):
Shrunk LD matrix
The shrinkage estimator for the LD correlation matrix was originally proposed by Wen and Stephens (2010). The estimator shrinks the off-diagonal entries of the sample LD matrix toward zero. Zhu and Stephens (2017) used the shrinkage estimator in their Regression with Summary Statistics (RSS) methodology and showed empirically that it can provide improved inference. The shrinkage estimator overcomes some of the issues arising from approximating the full LD matrix in the summary-data based model using a subset of the full LD matrix and constructing the matrix from a reference. The GCTB implementation is a C++ port from that provided with the RSS software and has been adapted for use with the GCTB software.
GCTB software doc page: http://cnsgenomics.com/software/gctb/#SummaryBayesianAlphabet
@pcarbo Hi Peter -- I just had a great meeting with Nick today, and we have the following question about including published external data in the LDshrink package. We would appreciate your input.
To make the package user-friendly, we would like to include the following external data files:
Genetic maps from 1000 Genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/
Approximately independent LD blocks: https://bitbucket.org/nygcresearch/ldetect-data/src
These data files have fairly simple structure. The genetic maps are data frames with the following headers:
[id] [physical position] [genetic position (cumulative)]
The LD blocks are data frames with the following headers:
[chromosome name] [region start] [region stop]
These data files also have formal publications available. I wonder if we could include these external files in the LDshrink package, provided that we explicitly cite these publications?
Do we have to worry about licensing issues?
An alternative plan is to provide users with the data preprocessing scripts/functions, but I think this will make the package harder to use.
Below is a check list of guidelines on files in vignettes/ from Wickham's package book: http://r-pkgs.had.co.nz/vignettes.html.
This part seems straightforward since the main tools are rmarkdown and knitr, with which I have some experience. However, preparing and organizing the contents of vignettes may be as hard as writing articles.

- Create a new vignette with devtools::use_vignette("my-vignette").
- You can build all vignettes with devtools::build_vignettes(), but this is rarely useful. Instead use devtools::build() to create a package bundle with the vignettes included.
- Packages used only in vignettes (e.g. knitr) must be declared in the DESCRIPTION.
This thread is based on http://r-pkgs.had.co.nz/package.html#naming.
LDshrink -> ldshrink. This seems to be a global change.

In my version of the clang compiler (version 4.0.1), I get several dozen warnings, which are attached. Some of these are in the Eigen source, but others are in our own code, and perhaps some of these can be easily addressed. I've labeled this as a "bug" but perhaps it should be read as "potential bug".
Hello,
I just started using the GCTB software but am running into an issue when I try to merge sparse ldm files from a single chromosome into one ldm file. I get the following error:
/var/spool/slurmd.cn1605/job55094972/slurm_script: line 13: 82612 Segmentation fault
Here are the basics of the code I ran:
gctb --mldm chr1.mldmlist --make-sparse-ldm --out chr1 --thread 16
Thank you,
Alexis
Although not the main focus or output of LDshrink, it will be straightforward (and perhaps useful) to generate some simple visualization results similar to Fig. 1 in Wen & Stephens (2010).
Also point out in documentation/manuscript that some existing packages can be used for this task: e.g. https://cran.r-project.org/web/packages/corrplot/index.html
Here is a feature request for an export interface: given a chromosome plus start and end positions, extract a complete matrix (or a sparse matrix in R, which can trivially be made complete) from a given blocked LD database file.
Below is a check list of guidelines on files in man/ from Wickham's package book: http://r-pkgs.had.co.nz/man.html
Even as a total newbie to R package development, I know I should not manually edit files in the man/ folder most of the time ... Hence, this check list should be used together with #22.
Instead of writing these files by hand, we’re going to use roxygen2, which turns specially formatted comments into .Rd files.

- Roxygen comments start with #' and come before a function.
- Use @@ if you want to add a literal @ to the documentation.
- You can add arbitrary sections with the @section tag. This is a useful way of breaking a long details section into multiple chunks with useful headings. Section titles should be in sentence case, must be followed by a colon, and they can only be one line long.
- @seealso allows you to point to other useful resources, either on the web, \url{http://www.r-project.org}, in your package \code{\link{functioname}}, or another package \code{\link[packagename]{functioname}}.
- If you have a family of related functions, link them with @family. The value of @family should be plural.
- @aliases alias1 alias2 ... adds additional aliases to the topic. An alias is another name for the topic that can be used with ?.
- @param name description describes the function’s inputs or parameters. You can document multiple parameters in one place, e.g. @param x,y Numeric vectors.
- Examples are run by R CMD check. \dontrun{} allows you to include code in the example that is not run.
- You can put examples in separate files and use @example path/relative/to/package/root to insert them into the documentation. (Note that the @example tag here has no ‘s’.)
- @return description describes the output from the function.
- Use @@ for @, %% for %, and \\ for \.
- You can inherit parameter descriptions with @inheritParams source_function and/or @inheritParams package::function.
- You can document multiple functions in the same file with @rdname or @describeIn.
- Within roxygen tags, you use .Rd syntax to format text: https://cran.r-project.org/doc/manuals/R-exts.html#Marking-text

It seems that the current LD score calculation is different from the authors' approach.
Below is the 2nd paragraph of the first LDSC paper online method: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4495769/
I wonder how hard it would be to implement the following L2-penalty (i.e. "ridge") LD shrinkage estimator: Sigma_ridge = Sigma + lambda * I, where Sigma is the sample covariance matrix for SNPs.
For simplicity we can fix the value of lambda (e.g. lambda = 0.01) for now. (It might be conceptually straightforward to tune this parameter by cross-validation, but this definitely complicates the software implementation.)
This approach was used in ImpG-Summary: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4184260/
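A minimal sketch of this estimator (ridge_ld is a hypothetical name; lambda = 0.01 just mirrors the value suggested above):

```r
# Ridge-style LD shrinkage: Sigma_ridge = Sigma + lambda * I.
ridge_ld <- function(X, lambda = 0.01) {
  Sigma <- cov(X)                      # sample covariance of SNP genotypes
  Sigma + lambda * diag(nrow(Sigma))   # inflate the diagonal
}

set.seed(42)
X <- matrix(rbinom(200, 2, 0.3), nrow = 20)  # toy n-by-p genotype matrix
R_ridge <- ridge_ld(X)
```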
It seems that haplo_panel is a bit confusing here -- haplotype data are not required; people can input genotype data to compute the Wen-Stephens estimator, and indeed this feature is supported by the isGeno option.
On a related note, there seem to be three types of naming conventions:
map_data
isGeno
na.rm
Is it worthwhile to pick one and stick to it?
@CreRecombinase @xiangzhu Can you find a way to replace the C++17 requirement (in the Makevars file) with the C++11 requirement, for example? The C++17 standard is much less widely supported; compare this and this. In the interest of making the LDshrink package as accessible as possible, can you try to find a way to avoid using C++17 features?
Thanks to @jean997 for identifying this issue.
We often exclude the MHC region when estimating LD and performing further downstream analyses.
It would be nice to do this MHC exclusion for people.
The idea is to have the following simple data frame inside ldshrink, and a flag like remove_mhc=TRUE:
> HLA.hg19
chrom start.base end.base
HLA 6 29719561 32883508
> is(HLA.hg19)
[1] "data.frame" "list" "oldClass" "vector"
Similarly we may want to remove centromeres as well.
Both MHC and centromeres data frames can be found here: https://github.com/smgogarten/GWASTools/tree/master/data.
This issue is also related to Issue #14, since they share the same idea: if we cannot estimate something reliably, then it is probably better not to estimate it at all.
Another very detailed note: pay attention to the sources of these regions -- e.g. if these regions are pulled from UCSC, then they are 0-based.
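A sketch of what a remove_mhc-style filter might look like (drop_mhc is a hypothetical name; the hg19 coordinates are the ones from the HLA.hg19 data frame shown above):

```r
# MHC region on chromosome 6, hg19 coordinates (from HLA.hg19 above).
HLA.hg19 <- data.frame(chrom = 6, start.base = 29719561, end.base = 32883508)

# Return TRUE for SNPs to keep, FALSE for SNPs inside the MHC region.
drop_mhc <- function(chrom, pos, region = HLA.hg19) {
  in_mhc <- chrom == region$chrom &
    pos >= region$start.base & pos <= region$end.base
  !in_mhc
}

keep <- drop_mhc(chrom = c(6, 6, 1), pos = c(30000000, 10000, 30000000))
```

The same pattern would extend to a table of centromere regions.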
It seems the main function only returns an estimated LD matrix at this point?
https://github.com/stephenslab/LDshrink/blob/32b4ad3942f7cb429f23c529b86ab72cfbb1b257/R/LDshrink.R#L6
Ideally we want to have some basic SNP info available (e.g. position, alleles), which is essential for combining LD with GWAS summary statistics in analyses.
I think the emeraLD package gives us a good example: https://github.com/statgen/emeraLD
> source('emeraLD2R.r');
Loading required package: data.table
data.table 1.11.4 Latest news: http://r-datatable.com
emeraLD v0.1 (c) 2018 corbin quick (corbinq@gmail.com)
reading from m3vcf file...
processed genotype data for 5008 haplotypes...
calculating LD for 60 SNPs...
done!! thanks for using emeraLD
> names(ld_data)
[1] "Sigma" "info"
> head(ld_data$info)
chr pos id ref alt
1: 20 83061 rs549711487 C T
2: 20 83196 rs62190472 A T
3: 20 83252 rs6137896 G C
4: 20 83570 rs6048967 T G
5: 20 83611 rs114000219 C A
6: 20 83792 rs529518485 A G
> head(ld_data$Sigma[, 1:5], 5)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.00000 -0.00602 0.03989 -0.00824 -0.00331
[2,] -0.00602 1.00000 -0.14013 -0.03102 -0.01245
[3,] 0.03989 -0.14013 1.00000 -0.05714 -0.04400
[4,] -0.00824 -0.03102 -0.05714 1.00000 -0.01704
[5,] -0.00331 -0.01245 -0.04400 -0.01704 1.00000
I wonder how hard it would be to add the following distance-based LD shrinkage estimator: LD(i, j) = 0 if SNPs i and j are far apart, say more than 10Mb.
This approach was used in GCTA-COJO paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593158/
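A sketch of this distance-based ("banding") estimator (band_ld is a hypothetical name; the 10Mb cutoff mirrors the value mentioned above):

```r
# Zero out LD between SNP pairs more than max_dist base pairs apart.
band_ld <- function(LD, pos, max_dist = 1e7) {
  far <- abs(outer(pos, pos, "-")) > max_dist  # pairwise distances
  LD[far] <- 0
  LD
}

pos <- c(1e6, 2e6, 15e6)                 # toy SNP positions (bp)
LD <- matrix(0.5, 3, 3); diag(LD) <- 1   # toy correlation matrix
LD_banded <- band_ld(LD, pos)
```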
It may be useful to include the list of long-range LD regions in LDshrink: https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD).
The data structure is straightforward: a 3-column data frame:
Chr | Start | Stop |
---|---|---|
1 | 48000000 | 52000000 |
2 | 86000000 | 100500000 |
2 | 134500000 | 138000000 |
2 | 183000000 | 190000000 |
For example, let LDshrink throw a warning message when genotypes of SNPs in these long-range LD regions are used.
Hi Xiang,
I have read your AOAS paper, and I think the ldshrink package will be very helpful to me. But after downloading the package, it is hard for me to find the main function to calculate the shrinkage LD structure. Would you give me a toy example showing how to use the package?
Many thanks for your kind help.
Sheng
Currently ldshrink assumes the input genotype/haplotype data are stored in an n-by-p numerical matrix, which is convenient from a statistician's perspective. However, public genotype/haplotype data from 1000 Genomes are stored in VCF format.
In the past I first used vcftools to convert VCF data to the IMPUTE2 format (which is indeed a p-by-n matrix), and then transposed the IMPUTE2-formatted data in R. See https://github.com/stephenslab/rss/blob/master/misc/import_1000g_vcf.sh.
This two-step workflow is not so convenient (at least for statisticians): they have to learn a new program like vcftools before doing any LD-related operations in ldshrink.
It seems that now we can use data.table (https://cran.r-project.org/web/packages/data.table) to directly convert VCF data to an n-by-p matrix in R. Here is an example: https://gist.github.com/cfljam/bc762f1d7b412df594ebc4219bac2d2b.
Here is my own example.
Here is my own example.
> suppressPackageStartupMessages(library(data.table))
> my_dt <- data.table::fread(cmd="zcat B_CELL_NAIVE.vcf.gz")
|--------------------------------------------------|
|==================================================|
>
> dim(my_dt)
[1] 215708754 8
> head(my_dt)
#CHROM POS ID REF ALT QUAL FILTER
1: chr15 101094127 rs12904576 T C . PASS
2: chr15 101114640 rs36053285 T C . PASS
3: chr15 45685616 rs796721871 TA T . PASS
4: chr15 45670316 rs2114501 G A . PASS
5: chr15 45618730 rs59889118 G A . PASS
6: chr15 45620612 rs1980288 T C . PASS
INFO
1: Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=3.714e-27;Beta=-1.10;Statistic=-15.72;FDR=2.681e-20
2: Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=2.649e-24;Beta=-1.09;Statistic=-14.16;FDR=3.412e-18
3: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=2.614e-23;Beta=-1.10;Statistic=-13.63;FDR=3.412e-18
4: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.007e-23;Beta=-1.09;Statistic=-13.6;FDR=3.412e-18
5: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
6: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
The benefit of using data.table here is two-fold: i) users don't have to leave R and use vcftools to get an n-by-p genotype matrix from VCF data; ii) data.table is a well-maintained and constantly upgraded package that can handle large datasets efficiently (at least based on my past experience).
Hence, we can either add a wrapper that uses data.table to parse VCF for ldshrink users, or at a minimum, we can simply provide a vignette showing how to use data.table to parse VCF.
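A hedged sketch of the genotype-conversion step such a wrapper would need (the VCF genotype strings below are made-up toy data; a real wrapper would read them with data.table::fread): split each "a|b" GT field into allele dosages and transpose to the n-by-p layout ldshrink expects.

```r
# Toy VCF genotype columns: 3 SNPs (rows) x 2 samples (columns), i.e. the
# p-by-n layout found in a VCF body.
gt <- matrix(c("0|0", "0|1",
               "1|1", "0|0",
               "0|1", "1|0"),
             nrow = 3, byrow = TRUE)

# Convert "a|b" (or "a/b") strings to allele-count dosages, then transpose
# to the n-by-p matrix that ldshrink assumes.
vcf_gt_to_dosage <- function(gt) {
  dosage <- apply(gt, c(1, 2), function(g) {
    sum(as.integer(strsplit(g, "[|/]")[[1]]))
  })
  t(dosage)
}

G <- vcf_gt_to_dosage(gt)
```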
Finally, there exists a package vcfR (https://cran.r-project.org/web/packages/vcfR) that might be relevant (but I have not used it much).
I came across the following error on my Mac laptop:
> devtools::load_all(".")
[omitted]
RcppExports.cpp:242:5: error: no matching function for call to 'R_useDynamicSymbols'
R_useDynamicSymbols(dll, FALSE);
^~~~~~~~~~~~~~~~~~~
/Library/Frameworks/R.framework/Resources/include/R_ext/Rdynload.h:84:10: note: candidate function not viable: no known conversion from 'int' to 'Rboolean' for 2nd argument
Rboolean R_useDynamicSymbols(DllInfo *info, Rboolean value);
^
4 warnings and 1 error generated.
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘ldshrink’
* removing ‘/private/var/folders/dm/39__1gf52_s_9xbqrfd_46lc0000gp/T/RtmpjSVD5F/devtools_install_60b7e017afc/ldshrink’
Error: Command failed (1)
Here is my laptop info:
# xiangzhu @ stanford in ~ [22:27:51]
$ system_profiler SPSoftwareDataType
Software:
System Software Overview:
System Version: macOS 10.14 (18A391)
Kernel Version: Darwin 18.0.0
Boot Volume: Macintosh HD
Boot Mode: Normal
Computer Name: STA-C02V26FCHTD5
User Name: Xiang Zhu (xiangzhu)
Secure Virtual Memory: Enabled
System Integrity Protection: Enabled
Time since boot: 5 days 22:05
Here is my R session info:
> devtools::session_info()
Session info --------------------------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
system x86_64, darwin15.6.0
ui RStudio (1.1.456)
language (EN)
collate en_US.UTF-8
tz America/Los_Angeles
date 2018-10-04
Packages ------------------------------------------------------------------------------------------------------------------------------------------
package * version date source
base * 3.5.1 2018-07-05 local
BH 1.66.0-1 2018-02-13 CRAN (R 3.5.0)
commonmark 1.6 2018-09-30 CRAN (R 3.5.0)
compiler 3.5.1 2018-07-05 local
datasets * 3.5.1 2018-07-05 local
devtools 1.13.6 2018-06-27 CRAN (R 3.5.0)
digest 0.6.17 2018-09-12 CRAN (R 3.5.1)
graphics * 3.5.1 2018-07-05 local
grDevices * 3.5.1 2018-07-05 local
grid 3.5.1 2018-07-05 local
lattice 0.20-35 2017-03-25 CRAN (R 3.5.1)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
Matrix 1.2-14 2018-04-13 CRAN (R 3.5.1)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods * 3.5.1 2018-07-05 local
R6 2.3.0 2018-10-04 CRAN (R 3.5.1)
Rcpp 0.12.19 2018-10-01 CRAN (R 3.5.0)
RcppEigen 0.3.3.4.0 2018-02-07 CRAN (R 3.5.0)
RcppParallel 4.4.1 2018-07-19 CRAN (R 3.5.0)
RcppProgress 0.4.1 2018-05-11 CRAN (R 3.5.0)
rlang 0.2.2 2018-08-16 CRAN (R 3.5.0)
roxygen2 6.1.0 2018-07-27 CRAN (R 3.5.0)
stats * 3.5.1 2018-07-05 local
stringi 1.2.4 2018-07-20 CRAN (R 3.5.0)
stringr 1.3.1 2018-05-10 CRAN (R 3.5.0)
testthat 2.0.0 2017-12-13 CRAN (R 3.5.0)
tools 3.5.1 2018-07-05 local
utils * 3.5.1 2018-07-05 local
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
xml2 1.2.0 2018-01-24 CRAN (R 3.5.0)
yaml 2.2.0 2018-07-25 CRAN (R 3.5.0)
Nick found this feature useful, and we should document it.
I'm referring to
https://github.com/stephenslab/rss/blob/6e9c8f08e82996ee1364f65dec491d84b6bcac26/misc/get_corr.m#L76-L77
and
https://github.com/stephenslab/rss/blob/6e9c8f08e82996ee1364f65dec491d84b6bcac26/misc/get_corr.m#L139
theta seems to be very small for all sample sizes, even small ones like 100; I'm therefore wondering if this has any effect at all?
Thanks,
Florian
During my recent whole-genome simulations I found that some trailing eigenvalues of LD matrices are (numerically) complex numbers, e.g. 3.28 + 1e-14i.
This seems to be a numerical issue since the magnitude of the imaginary part is so close to zero.
However, Nick hasn't noticed this pattern in his whole-genome data analyses.
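One common remedy worth trying (a sketch, not necessarily the cause here): complex eigenvalues can appear when floating-point asymmetry creeps into a nominally symmetric LD matrix, and forcing exact symmetry before the decomposition keeps the spectrum real.

```r
# Symmetrize the matrix so eigen() takes the real-symmetric code path
# and is guaranteed to return real eigenvalues.
sym_eigen <- function(S) {
  S <- (S + t(S)) / 2            # remove floating-point asymmetry
  eigen(S, symmetric = TRUE)
}

S <- matrix(c(1, 0.5, 0.5 + 1e-15, 1), 2, 2)  # slightly asymmetric input
e <- sym_eigen(S)
```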
Below is a draft of the meeting agenda.

- LDshrink development?
- Can LDshrink reliably produce results for one chromosome (i.e. Xiang's use case)?
- Can LDshrink reliably produce results for the whole genome (i.e. Nick's use case)?
- LDshrink from an accepted Bioconductor package?
- LDshrink package development so that it can be accepted by Bioconductor.
- LDshrink package (closely related to package vignettes).
- (The mashr revision used Nick's precomputed LD data on Midway.)

Paper: https://doi.org/10.1016/j.ajhg.2015.11.021
See related discussions at #33.