
An R package for assumption-lean covariance matrix estimation in high dimensions

Home Page: https://philboileau.github.io/cvCovEst/

License: MIT

R 92.47% TeX 7.36% Makefile 0.18%
covariance-matrix-estimation cross-validation high-dimensional-statistics nonparametric-statistics

cvCovEst's Introduction

R/cvCovEst

CircleCI · codecov · Project Status: Active – the project has reached a stable, usable state and is being actively developed · DOI · MIT license

Cross-Validated Covariance Matrix Estimation

Authors: Philippe Boileau, Brian Collica, and Nima Hejazi


What’s cvCovEst?

cvCovEst implements an efficient cross-validated procedure for covariance matrix estimation that is particularly useful in high-dimensional settings. The general methodology allows cross-validation to be used to data-adaptively identify the optimal estimator of the covariance matrix from a prespecified set of candidate estimators. An overview of the framework is provided in the package vignette; for a more detailed description, see Boileau et al. (2021). A suite of plotting and diagnostic tools is also included.


Installation

For standard use, install cvCovEst from CRAN:

install.packages("cvCovEst")

The development version of the package may be installed from GitHub using remotes:

remotes::install_github("PhilBoileau/cvCovEst")

Example

To illustrate how cvCovEst may be used to select an optimal covariance matrix estimator via cross-validation, consider the following toy example:

library(MASS)
library(cvCovEst)
set.seed(1584)

# generate a 50x50 covariance matrix with unit variances and off-diagonal
# elements equal to 0.5
Sigma <- matrix(0.5, nrow = 50, ncol = 50) + diag(0.5, nrow = 50)

# sample 50 observations from multivariate normal with mean = 0, var = Sigma
dat <- mvrnorm(n = 50, mu = rep(0, 50), Sigma = Sigma)

# run CV-selector
cv_cov_est_out <- cvCovEst(
    dat = dat,
    estimators = c(linearShrinkLWEst, denseLinearShrinkEst,
                   thresholdingEst, poetEst, sampleCovEst),
    estimator_params = list(
      thresholdingEst = list(gamma = c(0.2, 2)),
      poetEst = list(lambda = c(0.1, 0.2), k = c(1L, 2L))
    ),
    cv_loss = cvMatrixFrobeniusLoss,
    cv_scheme = "v_fold",
    v_folds = 5
  )

# print the table of risk estimates
# NOTE: the estimated covariance matrix is accessible via the `$estimate` slot
cv_cov_est_out$risk_df
#> # A tibble: 9 × 3
#>   estimator            hyperparameters      cv_risk
#>   <chr>                <chr>                  <dbl>
#> 1 linearShrinkLWEst    hyperparameters = NA    357.
#> 2 poetEst              lambda = 0.2, k = 1     369.
#> 3 poetEst              lambda = 0.2, k = 2     372.
#> 4 poetEst              lambda = 0.1, k = 2     375.
#> 5 poetEst              lambda = 0.1, k = 1     376.
#> 6 denseLinearShrinkEst hyperparameters = NA    379.
#> 7 sampleCovEst         hyperparameters = NA    379.
#> 8 thresholdingEst      gamma = 0.2             384.
#> 9 thresholdingEst      gamma = 2               826.
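As noted in the comment above, the estimated covariance matrix itself is stored in the `$estimate` slot. A brief continuation of the session (a sketch; it assumes the `cvCovEst()` call above has been run):

```r
# extract the covariance matrix estimate chosen by the CV selector
Sigma_hat <- cv_cov_est_out$estimate

# it behaves like any other covariance matrix estimate, e.g. inspect its
# dimensions and leading eigenvalues
dim(Sigma_hat)  # 50 x 50
eigen(Sigma_hat, symmetric = TRUE, only.values = TRUE)$values[1:3]
```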

Issues

If you encounter any bugs or have any specific feature requests, please file an issue.


Contributions

Contributions are very welcome. Interested contributors should consult our contribution guidelines prior to submitting a pull request.


Citation

Please cite the following paper when using the cvCovEst R software package.

@article{cvCovEst2021,
  doi = {10.21105/joss.03273},
  url = {https://doi.org/10.21105/joss.03273},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {63},
  pages = {3273},
  author = {Philippe Boileau and Nima S. Hejazi and Brian Collica and Mark J. van der Laan and Sandrine Dudoit},
  title = {cvCovEst: Cross-validated covariance matrix estimator selection and evaluation in `R`},
  journal = {Journal of Open Source Software}
}

When describing or discussing the theory underlying the cvCovEst method, or simply using the method, please cite the article below.

@article{boileau2022,
  author = {Philippe Boileau and Nima S. Hejazi and Mark J. van der Laan and Sandrine Dudoit},
  doi = {10.1080/10618600.2022.2110883},
  journal = {Journal of Computational and Graphical Statistics},
  number = {ja},
  pages = {1-28},
  publisher = {Taylor & Francis},
  title = {Cross-Validated Loss-Based Covariance Matrix Estimator Selection in High Dimensions},
  url = {https://doi.org/10.1080/10618600.2022.2110883},
  volume = {0},
  year = {2022}
}

License

© 2020-2023 Philippe Boileau

The contents of this repository are distributed under the MIT license. See file LICENSE.md for details.


References

Boileau, Philippe, Nima S. Hejazi, Mark J. van der Laan, and Sandrine Dudoit. 2021. “Cross-Validated Loss-Based Covariance Matrix Estimator Selection in High Dimensions.” https://arxiv.org/abs/2102.09715.

cvCovEst's People

Contributors

bcollica, hadley, jamarcusliu, nhejazi, philboileau, privefl, sandrinedudoit

cvCovEst's Issues

Sparse Matrix Handling

Given that this package will be used with high-dimensional data, cvCovEst's functions should play nicely with the sparse matrix classes implemented in the Matrix package.
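For instance (a sketch, assuming the estimation functions currently expect a base R matrix), a sparse matrix from the Matrix package must be coerced by hand today:

```r
library(Matrix)

# a 50 x 10 sparse data matrix of class dgCMatrix
sp_dat <- rsparsematrix(50, 10, density = 0.1)

# coercion a user currently has to perform manually before calling the
# package's estimators
dense_dat <- as.matrix(sp_dat)
```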

Add tests on the values returned by the functions

Hello,

There are already many tests available to check your package (nice ;)), but I feel most of them check the arguments and that the functions run without leaving messages or errors. Could you add tests to check that the values returned by the functions are the expected ones? For instance, for a given seed, could you manually check that you find the same empirical_risk as your function reports for the different estimators?

Best,

Target audience

The target audience of this package is not clearly defined. Can you add it at the beginning of the README, please?

Thanks,

Update data centering approach

As pointed out in #68, some estimators do not center the data internally prior to estimating the covariance matrix. This isn't an issue when using cvCovEst(), since it centers the data internally. It does, however, prevent the package's estimator functions from being used as standalone estimators. The data should instead be centered by each estimator, even if doing so is slightly less computationally efficient.

Loss functions

Possible loss functions for a covariance matrix Sigma, based on observing an iid sample of a vector X:

  • Squared error losses:
    • L(Sigma)(X) = sum_{k,l} (X_k X_l - mu(k,l))^2 f(k,l), with weight function f(k,l), where mu(k,l) = E[X_k X_l]. In particular, f(k,l) = 1/p provides a theoretically sound loss when the true covariance matrix is assumed to be sparse, and can easily be decomposed into a usable empirical loss. When f(k,l) = 1/p^2, a similarly theoretically sound empirical loss can be derived, though it is best suited to situations where the true covariance is not assumed to be sparse, such as random effects models / latent variable models.
    • L(Sigma)(X) = (X_k X_l - mu(k,l) : k,l)^T rho^{-1} (X_k X_l - mu(k,l) : k,l), where rho is an estimator of the covariance matrix of the vector (X_k X_l : k,l).
  • Operator norm: an operator norm loss is ideal when the estimand is not the covariance matrix itself, but the leading eigenvalues and eigenvectors of the true covariance matrix.
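As a concrete sketch of the first squared-error loss (assumed notation, with the candidate estimate Sigma standing in for mu(k, l) and the constant weight f(k, l) = 1/p^2):

```r
set.seed(1)
p <- 4
x <- rnorm(p)     # a single mean-zero observation
Sigma <- diag(p)  # a candidate covariance matrix estimate

# L(Sigma)(x) = sum_{k,l} (x_k x_l - Sigma_{kl})^2 * f(k, l), f(k, l) = 1/p^2
loss <- sum((tcrossprod(x) - Sigma)^2) / p^2
```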

`cvCovEst` is very slow for large datasets

cvCovEst is quite slow for relatively large datasets. This is likely due to the observations-based Frobenius loss function that is currently employed. It requires that, for each fold, the cross product of every element in the validation set be computed and summed. This is crippling for large n and p. Based on Proposition 1 of the manuscript, however, we know that the estimator selection procedure based on this loss is equivalent to that of the matrix-based Frobenius loss. The matrix-based Frobenius loss does not require that the cross product of every element in the validation set be computed, only the validation set's sample covariance matrix.

We should implement this loss as a fold function to pass to origami::cross_validate. We might also consider setting it as the default loss function, recommending in the manuscript that it be used in practice for computational reasons.
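The equivalence can be sketched numerically (illustrative code, not the package's internals): expanding the observation-based loss shows it equals the matrix-based loss plus a term that is constant across candidate estimators, so both rank candidates identically.

```r
set.seed(1)
n <- 20; p <- 5
X <- matrix(rnorm(n * p), n, p)
S_bar <- crossprod(X) / n  # validation-set sample covariance (uncentered)

# observation-based Frobenius loss: average over validation observations
obs_loss <- function(S) mean(apply(X, 1, function(x) sum((tcrossprod(x) - S)^2)))
# matrix-based Frobenius loss: a single distance to the sample covariance
mat_loss <- function(S) sum((S_bar - S)^2)

S1 <- diag(p); S2 <- 0.5 * diag(p)
# loss differences between candidates agree, so the selection is identical
all.equal(obs_loss(S1) - obs_loss(S2), mat_loss(S1) - mat_loss(S2))  # TRUE
```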

TODO

  • Set up the repository README.Rmd file
    • Draft description (for the DESCRIPTION file also)
    • Installation instructions
    • Toy example
    • Issues section
    • Contributions
    • License
  • Identify the classes of matrix objects we want to work with
  • Create loss functions
  • Identify estimators we want to consider, and determine whether they are already implemented

Add other package reference in the paper

Can you add to the paper a few words on other packages that already exist for covariance matrix estimation?
For instance: CovTools provides different ways of estimating a covariance matrix, along with tests comparing the estimates to a given matrix, and evolqg also provides many estimation techniques for covariance matrices.

These two packages do not do exactly the same thing as yours, but I think your paper lacks a comparison to existing packages.

Superimpose eigenvalue comparison plots

The leading eigenvalues of the selected estimator's estimate and those of the sample covariance matrix should be superimposed for comparison purposes, not plotted in two side-by-side figures.

Report summary tables & Add non-Gaussian examples

  • The plot.cvCovEst function reports the eigenvalues plot, risk plot, heatmap plot, and a summary table with the best-performing estimator within each class. The first three plots can be exported separately. Is it possible to output the summary table separately as well (in LaTeX or other formats)? I believe it would be useful in presenting analysis results. You could use xtable or other packages.
  • It would be more convincing to add a non-Gaussian example to the reports, such as Gamma distributions.
  • A typo: the data argument of plot.cvCovEst is "dat_orig"; however, in the Toy Dataset Example in your paper, it is "data_in".

Overparallelization from `coop::covar`

In testing cvCovEst, there seems to be some sort of parallelization happening at the level of calls to C (through coop::covar), which appears to use all cores on my local machines even for a single call to cvCovEst (even when parallel = FALSE). This is not the expected behavior. Unfortunately, we currently use coop::covar throughout all estimation functions; a PR will resolve this shortly, but we should keep track of the issue and try to eventually restore the intended behavior.

I've only tested this on Linux machines (a local Ubuntu 20.04 and an HPC system). Interestingly, @PhilBoileau is unable to replicate this behavior on macOS, suggesting that there's some parallelization issue due to differing BLAS libraries. This could be related to wrathematics/coop#13, but I will open a new issue there for tracking. So far, I've tried the following two solutions (separately and in tandem):

library(openblasctl)
openblas_set_num_threads(1)
Sys.setenv(OMP_NUM_THREADS = "1")

Nonlinear shrinkage estimator breaks in high n and p

In larger sample sizes, some estimators appear to face some degree of instability, e.g., in a simple 400 x 400 dataset, the nonlinear shrinkage estimator fails with

r$> nlShrinkLWEst(data_in)
Error in u %*% diag(d_tilde) : non-conformable arguments

This is not such a big problem in itself, but it points out a design safety issue: when this estimator is included in the selector library, the whole cross-validated selector fails. Instead, we should add safety guards so that a single (or even multiple) failing estimator does not bring the whole selector down; that is, the CV selection procedure should return the optimal choice among the estimators that can be fit.
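One possible shape for such a guard (a sketch with hypothetical names, not the package's actual implementation): wrap each candidate estimator so that a failure yields a NULL estimate, which the risk comparison can then skip, instead of aborting the whole selector.

```r
# wrap a candidate estimator so its failure is contained
safe_estimate <- function(estimator, dat) {
  tryCatch(
    estimator(dat),
    error = function(e) {
      warning("estimator failed and was dropped: ", conditionMessage(e))
      NULL  # a NULL estimate would be excluded from the risk comparison
    }
  )
}

failing_est <- function(dat) stop("non-conformable arguments")
sample_cov  <- function(dat) cov(dat)

dat <- matrix(rnorm(100), 20, 5)
is.null(safe_estimate(failing_est, dat))  # TRUE: the failure is contained
dim(safe_estimate(sample_cov, dat))       # working estimators are unaffected
```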

Implementing new methods

Request for Shrinkage Estimators for Spiked Covariance Models

There has been a request to implement some of the shrinkage estimators described by Donoho, Gavish, and Johnstone. The paper presents analytical formulas for 18 different "optimal" estimators, depending on the loss function. The request was for the estimators based on the Stein and Operator losses, but it should be fairly straightforward to create a single wrapper estimator that can house one or more of these shrinkage functions, with the Frobenius norm as the default, so it fits within the current CV framework.
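The wrapper structure suggested above can be sketched generically (the actual shrinkage rules from Donoho, Gavish, and Johnstone are not reproduced here; `spikedShrinkEst` and `shrink_fun` are illustrative names): eigendecompose the sample covariance, shrink the eigenvalues with a user-supplied rule, and reconstruct.

```r
# generic eigenvalue-shrinkage wrapper: the choice of shrink_fun would
# select among the loss-specific shrinkage rules
spikedShrinkEst <- function(dat, shrink_fun) {
  S <- cov(dat)
  eig <- eigen(S, symmetric = TRUE)
  eig$vectors %*% diag(shrink_fun(eig$values)) %*% t(eig$vectors)
}

# e.g. with a trivial placeholder rule that leaves eigenvalues untouched,
# the sample covariance is recovered
dat <- matrix(rnorm(200), 40, 5)
est <- spikedShrinkEst(dat, identity)
```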
