Awesome References in Statistics and Machine Learning

This repository contains a curated list of awesome references for statistics and machine learning.

Tags

⛳ Methodology (METH) 📘 Learning Theory (LT) 🎯 Optimization (OPT)
🔎 Statistical Inference (INF) 💻 Software (SW) 🔓 Explainable AI (XAI)
🍒 Biostatistics (BIO) ⌨️ Empirical Studies (ES) 🌐 Deep Learning (DL)
📊 Dataset (DATA) ➡️ Causal Inference (CI) 🗒️ Natural Language Processing (NLP)

References

[METH] Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222(594-604), 309-368.

[METH][OPT] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

  • keywords: random forest, ensemble methods, bias-variance trade-off

[METH][LT] Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.

  • keywords: Bayes rule, Fisher consistency, convex optimization, empirical process theory, excess risk bounds

[OPT] Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 1-122.

  • keywords: convex optimization, proximal operators, smooth objective functions

[SW] How to upload your Python package to PyPI.

  • keywords: Python package/library, PyPI, twine
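
A minimal sketch of the usual release flow the linked guide covers, driven from Python so all snippets in this list share one language (these are normally typed as shell commands); it assumes `pip install build twine` and a valid pyproject.toml:

```python
# Build the sdist/wheel into dist/, then upload them to PyPI with twine.
import glob
import subprocess

subprocess.run(["python", "-m", "build"], check=True)                  # creates dist/*.tar.gz and dist/*.whl
subprocess.run(["twine", "upload", *glob.glob("dist/*")], check=True)  # prompts for PyPI credentials
```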

[SW] Basic Tutorial for Cython

  • keywords: Python package/library, Cython, C/C++
  • memo: Cython is Python with C data types, used to speed up Python loops; see the sketch below.
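
A minimal sketch in Cython's documented "pure Python" mode, so the file stays valid Python; compiling it with Cython turns the typed loop into C-speed code. The function `fib` is a hypothetical example:

```python
# cython: language_level=3
import cython

def fib(n: cython.int) -> cython.double:
    # Static C types on the loop variables remove Python-object overhead.
    i: cython.int
    a: cython.double = 0.0
    b: cython.double = 1.0
    for i in range(n):
        a, b = a + b, a
    return a
```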

[METH][LT][INF] Muandet, K., Fukumizu, K., Sriperumbudur, B., & Schölkopf, B. (2016). Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522.

  • keywords: Kernel method, RKHS, MMD
  • memo: An overview of kernel methods, properties of the RKHS, and the kernel-based MMD (defined below).
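
For reference, the two central objects of the review: the kernel mean embedding of a distribution $P$ into an RKHS $\mathcal{H}$ with kernel $k$, and the MMD between $P$ and $Q$:

$$
\mu_P = \mathbb{E}_{X \sim P}\,[k(\cdot, X)] \in \mathcal{H},
\qquad
\mathrm{MMD}(P, Q) = \lVert \mu_P - \mu_Q \rVert_{\mathcal{H}}.
$$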

[LT] Garnham, A. L., & Prendergast, L. A. (2013). A note on least squares sensitivity in single-index model estimation and the benefits of response transformations. Electronic Journal of Statistics, 7, 1983-2004.

  • keywords: sliced inverse regression (SIR), sufficient dimension reduction (SDR), OLS, single-index model
  • memo: In a single-index model, when Cov(X, Y) is nonzero, OLS can recover the minimal sufficient dimension-reduction direction, yet it fails when Cov(X, Y) = 0 (see the numerical sketch below).
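
A small numerical illustration of this point (my own, not from the paper): in the single-index model y = f(b'x) + eps with Gaussian X, OLS recovers the direction b when Cov(X, y) != 0 (e.g., f monotone), but not when Cov(X, y) = 0 (f even):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5
X = rng.standard_normal((n, p))
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

for name, f in [("monotone f", lambda t: t + 0.5 * t**3),
                ("even f", lambda t: t**2)]:
    y = f(X @ b) + 0.1 * rng.standard_normal(n)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    cos = abs(beta_ols @ b) / (np.linalg.norm(beta_ols) * np.linalg.norm(b))
    print(f"{name}: |cos(OLS, b)| = {cos:.3f}")  # near 1 for monotone f; misaligned for even f
```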

[ES][DL] Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

  • keywords: deep learning, generalization, random labels
  • memo: Deep neural networks easily fit random labels. In Figure 1, the training errors for true labels, random labels, shuffled pixels, and random pixels all converge to zero, yet the test error degrades as the label corruption increases.

[BIO][METH][OPT][SW] Mak, T. S. H., Porsch, R. M., Choi, S. W., Zhou, X., & Sham, P. C. (2017). Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology, 41(6), 469-480. R package: LassoSum

  • keywords: summary statistics, sparse regression, invalid IVs, lasso, elastic net
  • memo: Solves the lasso and elastic net from summary statistics alone: coordinate descent for the lasso (elastic net) only requires summary data (see the sketch below).
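
A minimal sketch (not the LassoSum implementation) of the memo's observation: coordinate descent for the lasso touches the data only through the summary statistics R = X'X/n and r = X'y/n, never the individual-level data:

```python
import numpy as np

def lasso_cd_summary(R, r, lam, n_iter=200):
    """Minimize (1/2) b'Rb - r'b + lam * ||b||_1 by coordinate descent."""
    b = np.zeros(len(r))
    for _ in range(n_iter):
        for j in range(len(r)):
            rho = r[j] - R[j] @ b + R[j, j] * b[j]                    # partial residual for coord j
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / R[j, j]  # soft-threshold update
    return b
```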

[METH][DATA][SW] Bhatia, K., Dahiya, K., Jain, H., Kar, P., Mittal, A., Prabhu, Y., & Varma, M. (2016). The extreme classification repository: Multi-label datasets & code.

  • keywords: extreme classification, multi-label classification
  • memo: The objective in extreme multi-label classification is to learn feature architectures and classifiers that can automatically tag a data point with the most relevant subset of labels from an extremely large label set. This repository provides resources that can be used for evaluating the performance of extreme multi-label algorithms including datasets, code, and metrics.

[METH][DATA] Covington, P., Adams, J., & Sargin, E. (2016). Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems (pp. 191-198).

  • keywords: recommender systems, extreme classification, ranking, candidate set
  • memo: A two-stage recommender system: a deep candidate-generation model followed by a separate deep ranking model.

[OPT][INF] Stegle, O., Lippert, C., Mooij, J. M., Lawrence, N. D., & Borgwardt, K. (2011). Efficient inference in matrix-variate Gaussian models with iid observation noise. In Proceedings of the Advances in Neural Information Processing Systems 24 (NIPS 2011).

  • keywords: inverse, inference, matrix-variate Gaussian models
  • memo: Equation (5) shows how to efficiently compute the inverse of a Kronecker product plus a diagonal (iid-noise) matrix; the standard identity is sketched below.
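
A sketch of the standard identity behind such computations (not necessarily the paper's exact equation (5)): with eigendecompositions $K_c = U_c S_c U_c^{\top}$ and $K_r = U_r S_r U_r^{\top}$,

$$
(K_c \otimes K_r + \sigma^2 I)^{-1}
= (U_c \otimes U_r)\,(S_c \otimes S_r + \sigma^2 I)^{-1}\,(U_c \otimes U_r)^{\top},
$$

so only two small eigendecompositions and a diagonal inverse are required.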

[LT][OPT] Andersen Ang, Slides: Nuclear norm is the tightest convex envelope of the rank function within the unit ball.

  • keywords: nuclear norm, rank, convex envelope
  • memo: Proves that the nuclear norm is the tightest convex envelope of the rank function; the same argument applies to other nonconvex and discontinuous regularizers.
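
For reference, the statement in the slides: on the spectral-norm unit ball $\{X : \lVert X \rVert_2 \le 1\}$, the nuclear norm is the biconjugate of the rank, i.e., the largest convex function lower-bounding it there:

$$
\lVert X \rVert_* = \big(\mathrm{rank}\big)^{**}(X) \le \mathrm{rank}(X),
\qquad \lVert X \rVert_2 \le 1.
$$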

[OPT][SW] Ge, J., Li, X., Jiang, H., Liu, H., Zhang, T., Wang, M., & Zhao, T. (2019). Picasso: A sparse learning library for high dimensional data analysis in R and Python. Journal of Machine Learning Research, 20(44), 1-5. [GitHub + Docs]

  • keywords: sparse regression, scad, MCP
  • memo: A Python/R library for sparse regression, including Lasso, SCAD, and MCP.

[OPT] Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4), 1509.

  • keywords: SCAD, local linear approximation (LLA)
  • memo: Solves the SCAD-penalized problem by repeatedly solving a weighted lasso; see (2.7) and the LLA surrogate below.
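
A sketch of the LLA step: given the current iterate $\beta^{(k)}$, each update solves a weighted lasso whose weights are the SCAD derivative (with $a > 2$, typically $a = 3.7$):

$$
\beta^{(k+1)} = \arg\min_{\beta}\; L(\beta) + \sum_{j} p'_{\lambda}\big(|\beta_j^{(k)}|\big)\,|\beta_j|,
\qquad
p'_{\lambda}(t) = \lambda \left\{ \mathbf{1}(t \le \lambda) + \frac{(a\lambda - t)_+}{(a - 1)\lambda}\,\mathbf{1}(t > \lambda) \right\}.
$$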

[BIO][METH][CI] Windmeijer, F., Farbmacher, H., Davies, N., & Davey Smith, G. (2019). On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association, 114(527), 1339-1350.

  • keywords: adaptive lasso, 2SLS, causal inference, invalid IV
  • memo: Use adaptive lasso to select invalid IVs in 2SLS.

[CI][METH] Egami, N., Fong, C. J., Grimmer, J., Roberts, M. E., & Stewart, B. M. (2018). How to make causal inferences using texts. arXiv preprint arXiv:1802.02163.

  • keywords: text, causal inference
  • memo: Causal inference based on textual data, text could be treatment or outcome.

[METH][LT] Natarajan, N., Dhillon, I. S., Ravikumar, P. K., & Tewari, A. (2013). Learning with noisy labels. Advances in Neural Information Processing Systems, 26, 1196-1204.

  • keywords: noisy labels, unbalanced-loss
  • memo: Models noisy labels with the class-conditional random noise (CCN) model. Under CCN, the authors show that the minimizer of the noisy-label classification risk is a shifted Bayes rule, which coincides with the Bayes rule of a cost-sensitive (unbalanced) loss; see the corrected loss below.
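
For reference, the paper's "method of unbiased estimators" (its Lemma 1): with flip rates $\rho_{+1}$ and $\rho_{-1}$, the corrected loss

$$
\tilde{\ell}(t, y) = \frac{(1 - \rho_{-y})\,\ell(t, y) - \rho_{y}\,\ell(t, -y)}{1 - \rho_{+1} - \rho_{-1}}
$$

is unbiased for the clean loss under the noisy-label distribution, so minimizing it on noisy data targets the clean risk.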

[METH][DL] Jeremy Jordan, 2018. An overview of semantic image segmentation.

  • keywords: image segmentation, Dice loss
  • memo: An introduction to image segmentation, covering background, existing methods, and loss functions (the soft Dice loss is recalled below).
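
For reference, the soft Dice loss commonly used in segmentation, with predicted probabilities $p_i$ and binary ground-truth mask $g_i$:

$$
L_{\mathrm{Dice}} = 1 - \frac{2\sum_i p_i\, g_i}{\sum_i p_i + \sum_i g_i}.
$$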

[ES][DATA][BIO] Shit, S., Paetzold, J. C., Sekuboyina, A., Ezhov, I., Unger, A., Zhylka, A., ... & Menze, B. H. (2021). clDice-a Novel Topology-Preserving Loss Function for Tubular Structure Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16560-16569).

  • keywords: image segmentation, Dice loss, topology-preservation
  • memo: A novel Dice-based loss function for medical image segmentation, motivated by preserving the topology and skeleta of vessels in medical images. Tractable computations of the loss are proposed in a somewhat ad-hoc manner.

[ES][DL] Peng, H., Mou, L., Li, G., Chen, Y., Lu, Y., & Jin, Z. (2015). A comparative study on regularization strategies for embedding-based neural networks. arXiv preprint arXiv:1508.03721.

  • keywords: regularization, embedding
  • memo: A comparative empirical study (Experiments A and B) of regularization strategies in embedding-based neural networks: (i) l2 regularization of the non-embedding layers (works in both); (ii) l2 regularization of the embedding layer (works in A); (iii) re-embedding words, i.e., l2 regularization of the difference between the embedding layer and a pre-trained embedding (works in neither); (iv) dropout on the non-embedding layers (works in both).

[CI][INF] Feder, A., Keith, K. A., Manzoor, E., Pryzant, R., Sridhar, D., Wood-Doughty, Z., ... & Yang, D. (2021). Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond. arXiv preprint arXiv:2109.00725.

  • keywords: causal inference, NLP, survey
  • memo: (i) background of CI; (ii) text as treatment, outcome, or confounder; (iii) using causal inference to improve ML prediction.

[METH][LT] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R., & Smola, A. (2017). Deep sets. arXiv preprint arXiv:1703.06114.

  • keywords: permutation invariance; learning with set
  • memo: (i) A model is permutation invariant iff it can be expressed as a sum decomposition (see below).
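
The decomposition in question: a set function $f$ is permutation invariant iff it can be written, for suitable maps $\phi$ and $\rho$, as

$$
f(X) = \rho\!\left( \sum_{x \in X} \phi(x) \right).
$$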

[METH][LT] Cheng, J., Levina, E., Wang, P., & Zhu, J. (2014). A sparse Ising model with covariates. Biometrics, 70(4), 943-953.

  • keywords: Ising model; label dependence
  • memo: (i) Extends the dependence structure in the Ising model to be a function of covariates.

[LT][DL] Bartlett, P., Foster, D. J., & Telgarsky, M. (2017). Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498.

  • keywords: covering number, Rademacher complexity, estimation error bound
  • memo: Estimation error bounds for neural networks based on covering numbers and Rademacher complexity.

[LT][DL] Bauer, B., & Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4), 2261-2285.

  • keywords: regret bound, estimation error, approximation error
  • memo: Both estimation-error and approximation-error bounds (Theorems 2-3) are provided in the paper.

[LT][DL] Guo, Z. C., Shi, L., & Lin, S. B. (2019). Realizing data features by deep nets. IEEE Transactions on Neural Networks and Learning Systems, 31(10), 4036-4048.

  • keywords: covering number, estimation error bound
  • memo: VC-type covering-number bounds for neural networks.

[LT][METH] Mazumder, R., Hastie, T., & Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11, 2287-2322.

  • keywords: soft-impute, low-rank, nuclear norm
  • memo: (i) Solves the low-rank problem by iterative soft-thresholded SVD (see the sketch below). (ii) The relation between low-rank regression and the latent factor model / matrix factorization in Section 8 (Theorem 3) is quite interesting.
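
A minimal sketch of the Soft-Impute iteration in the memo (assuming a full SVD per iteration is affordable): fill the missing entries with the current estimate, then apply a soft-thresholded SVD:

```python
import numpy as np

def soft_impute(X, mask, lam, n_iter=100):
    """X: matrix with arbitrary values at missing entries; mask: True = observed."""
    Z = np.zeros_like(X)
    for _ in range(n_iter):
        filled = np.where(mask, X, Z)                # impute missing entries with current estimate
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt      # soft-threshold the singular values
    return Z
```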

[LT][METH][INF] Alex Stephenson, Standard Errors and the Delta Method.

  • keywords: delta method, asymptotic distribution, standard error
  • memo: Derives the asymptotic distribution of a function of a random variable once the asymptotic behavior of that variable is known (see the statement below).
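
The statement, for reference: if $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$, then

$$
\sqrt{n}\,\big(g(\hat\theta_n) - g(\theta)\big) \;\xrightarrow{d}\; N\!\big(0,\; g'(\theta)^2 \sigma^2\big).
$$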

[INF] Aaron Mishkin, Instrumental Variables, DeepIV, and Forbidden Regressions

  • keywords: instrumental variables, causal inference

[INF] Vovk, V., & Wang, R. (2020). Combining p-values via averaging. Biometrika, 107(4), 791-808.

  • keywords: combining p-values
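
A headline result of the paper, for reference: for arbitrarily dependent p-values $p_1, \dots, p_K$, twice their arithmetic mean is again a valid p-value, and the factor 2 cannot in general be improved:

$$
p_{\mathrm{avg}} = \min\!\left\{ 1,\; \frac{2}{K} \sum_{k=1}^{K} p_k \right\}.
$$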

[OPT] Iusem, A. N. (2003). On the convergence properties of the projected gradient method for convex optimization. Computational & Applied Mathematics, 22, 37-52.

  • keywords: projected gradient method, convex optimization
  • memo: Proposition 4 shows that projected GD converges to a stationary point (provided a cluster point exists) when the objective is continuously differentiable and the feasible domain is convex (a minimal sketch follows).
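
A minimal sketch of projected gradient descent on a convex set (here the Euclidean unit ball, a hypothetical choice for illustration):

```python
import numpy as np

def project_ball(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def projected_gd(grad, x0, step=0.1, n_iter=500):
    x = x0
    for _ in range(n_iter):
        x = project_ball(x - step * grad(x))   # gradient step, then projection
    return x

# Minimize ||x - c||^2 over the unit ball, with c outside the ball:
c = np.array([2.0, 0.0])
print(projected_gd(lambda x: 2.0 * (x - c), np.zeros(2)))  # ~ [1., 0.]
```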

[METH][OPT] Some tensor algebra

  • keywords: Mode-product, tensor-matrix multiplication, tensor-vector multiplication
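
For reference, the mode-$n$ (tensor-matrix) product of $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with $U \in \mathbb{R}^{J \times I_n}$:

$$
(\mathcal{X} \times_n U)_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N}
= \sum_{i_n = 1}^{I_n} x_{i_1 \cdots i_n \cdots i_N}\, u_{j i_n}.
$$

Tensor-vector multiplication is the special case $J = 1$.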

[OPT] Andersen, M., Dahl, J., Liu, Z., Vandenberghe, L., Sra, S., Nowozin, S., & Wright, S. J. (2011). Interior-point methods for large-scale cone programming. Optimization for machine learning, 5583.

  • keywords: cone programming, QP, interior-point methods

[OPT] Tibshirani, R. J., Coordinate Descent.

  • keywords: Coordinate Descent
  • memo: The CD algorithm and its convergence rate under different assumptions.

[METH] Jansche, M. (2007, June). A maximum expected utility framework for binary sequence labeling. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 736-743).

  • keywords: F-score, label-dependence

[LT] Pillai, I., Fumera, G., & Roli, F. (2017). Designing multi-label classifiers that maximize F measures: State of the art. Pattern Recognition, 61, 394-404.

  • keywords: F-score, decision-theoretic approach
  • memo: A recent survey of F-score maximization.

[OPT] Stephen Boyd and Jon Dattorro, Alternating Projections.

  • keywords: alternating projection
  • memo: AP is an algorithm for computing a point in the intersection of convex sets (or the minimum distance between two sets); see the sketch below.
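
A minimal sketch of alternating projections onto two convex sets (a disk and a halfspace; both choices are hypothetical illustrations):

```python
import numpy as np

def proj_disk(x, center, radius):
    d = x - center
    nrm = np.linalg.norm(d)
    return x if nrm <= radius else center + radius * d / nrm

def proj_halfspace(x, a, b):                  # project onto {z : a'z <= b}
    v = a @ x - b
    return x if v <= 0.0 else x - v * a / (a @ a)

x = np.array([5.0, 5.0])
for _ in range(100):                          # alternate the two projections
    x = proj_halfspace(proj_disk(x, np.zeros(2), 1.0), np.array([1.0, 1.0]), 1.0)
print(x)   # a point in the intersection of the disk and the halfspace
```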

[INF][METH] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(1), 723-773.

  • keywords: MMD, kernel methods, two-sample test, integral probability metric, hypothesis testing
  • memo: A nonparametric two-sample test of distribution discrepancy based on the MMD (an unbiased estimator is sketched below).
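
A minimal sketch (assuming an RBF kernel, a common default) of the paper's unbiased MMD² statistic computed from two samples X and Y:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-gamma * d2)

def mmd2_unbiased(X, Y, gamma=1.0):
    Kxx, Kyy, Kxy = rbf(X, X, gamma), rbf(Y, Y, gamma), rbf(X, Y, gamma)
    m, n = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop the i = j terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1)) - 2.0 * Kxy.mean()
```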

[METH][DL] Sundeep, T. Knowledge Distillation: Principles, Algorithms, Applications. neptuneblog.

  • keywords: knowledge distillation
  • memo: An introduction to knowledge distillation: online/offline distillation and existing methods.

[OPT] Powell, M. J. D. (1973). On search directions for minimization algorithms. Mathematical Programming, 4(1), 193-201.

  • keywords: BCD, block coordinate descent
  • memo: When the objective is nonconvex, BCD may cycle and stagnate.

[METH][OPT] Zheng, X., Aragam, B., Ravikumar, P. K., & Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31.

  • keywords: DAG, matrix power
  • memo: Converts the combinatorial DAG (acyclicity) constraint into a single smooth matrix-power equality constraint (see below).
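
The constraint in question, for reference: a weighted adjacency matrix $W \in \mathbb{R}^{d \times d}$ encodes an acyclic graph iff

$$
h(W) = \operatorname{tr}\!\big(e^{W \circ W}\big) - d = 0,
$$

where $\circ$ is the Hadamard product; the trace of the matrix-power series counts weighted cycles, so acyclicity becomes one smooth equality usable in continuous optimization.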

[OPT] Shalev-Shwartz, S., & Zhang, T. (2012). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research.

  • keywords: stochastic dual coordinate ascent, liblinear, coordinate descent
  • memo: Simultaneously updates primal and dual variables; generalizes the algorithm used in LIBLINEAR.

[OPT] Glasmachers, T., & Dogan, U. (2013, October). Accelerated coordinate descent with adaptive coordinate frequencies. In Asian Conference on Machine Learning (pp. 72-86). PMLR.

  • keywords: coordinate descent
  • memo: Updates coordinates with adaptive coordinate frequencies rather than uniform selection.

[OPT] Zimmert, J., de Witt, C. S., Kerg, G., & Kloft, M. (2015, December). Safe screening for support vector machines. In NIPS 2015 Workshop on Optimization in Machine Learning (OPT).

  • keywords: screening, shrinking, coordinate descent
  • memo: Screens the shrinking variables for box-constrained QPs in coordinate descent: when the gradient is nonzero at the optimum, the variable lies on the boundary.

[OPT][LT] Raginsky, M., Rakhlin, A., & Telgarsky, M. (2017, June). Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory (pp. 1674-1703). PMLR.

  • keywords: SGD, global solution, langevin process, SDE

[CI][METH][OPT][LT] Kang, H., Zhang, A., Cai, T. T., & Small, D. S. (2016). Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. Journal of the American Statistical Association, 111(513), 132-144.

  • keywords: 2SLS, invalid IV, sparse regression
  • memo: Introduces sparse regression in the second stage of 2SLS to separate out the direct effects of invalid IVs (the model is sketched below).
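
A sketch of the model (notation mine): with outcome $Y$, exposure $D$, and candidate instruments $Z$,

$$
Y = D\beta + Z\alpha + \varepsilon, \qquad D = Z\gamma + \eta,
$$

where valid instruments satisfy $\alpha_j = 0$; imposing sparsity on $\alpha$ in the second stage identifies $\beta$ when fewer than half of the instruments are invalid (the paper's majority rule).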

[METH][LT] Dalalyan, A. S., & Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th Annual Conference on Learning Theory (COLT 2007) (pp. 97-111). Springer Berlin Heidelberg.

  • keywords: model aggregation, bagging
  • memo: Theorem 2 shows that the exponentially weighted aggregate performs nearly as well as the best candidate chosen by model selection (see the aggregate below).
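
A sketch of the exponentially weighted aggregate over candidates $f_1, \dots, f_M$ with empirical risks $\hat{R}_n(f_j)$, prior weights $\pi_j$, and temperature $\beta > 0$ (up to the paper's exact temperature scaling):

$$
\hat{f}_{\mathrm{EW}} = \sum_{j=1}^{M} w_j f_j,
\qquad
w_j = \frac{\pi_j \exp\!\big(-\hat{R}_n(f_j)/\beta\big)}{\sum_{m=1}^{M} \pi_m \exp\!\big(-\hat{R}_n(f_m)/\beta\big)}.
$$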

[LT] Chen, J. (2017). Consistency of the MLE under mixture models.

[LT] Chen, J., & Tan, X. (2009). Inference for multivariate normal mixtures. Journal of Multivariate Analysis, 100(7), 1367-1383.

  • keywords: nonparametric MLE, identifiability, penalized MLE
  • memo: Conditions for consistency of the nonparametric MLE under mixture models. (i) Identifiability of the mixture model is a necessary condition. (ii) Most existing general approaches do NOT apply to normal mixture models. (A) Theorem 3.1: Under the finite normal mixture model with equal variances and a known number of groups, the MLE is strongly consistent. (B) Section 3.2: Under the finite normal mixture model with unequal variances and a known number of groups, the MLE may NOT be consistent; this issue can be resolved via a penalized (on the variances) MLE. (C) Section 3.3: Properly estimating the mixing distribution under a finite mixture model requires a very large sample size when the subpopulations are not well separated.

[LT] Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE transactions on information theory, 51(1), 128-142.

  • keywords: universal consistency
  • memo: Proposition 3.3: (Classification) Fisher consistency and classification calibration are equivalent.
