Awesome Reference in Statistics and Machine Learning

This repository contains a curated list of awesome references for statistics and machine learning.


⛳ Methodology (METH)	📘 Learning Theory (LT)	🎯 Optimization (OPT)
🔎 Statistical Inference (INF)	💻 Software (SW)	🔓 Explainable AI (XAI)
🍒 Biostatistics (BIO)	⌨️ Empirical Studies (ES)	🌐 Deep Learning (DL)
📊 Dataset (DATA)	➡️ Causal Inference (CI)	🗒️ Natural Language Learning (NLP)

Reference

[METH] Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, 222(594-604), 309-368.

[METH][OPT] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

keywords: random forest, assemble methods, bias-variance trade-off

[METH][LT] Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.

keywords: Bayes rule, Fisher consistency, convex optimization; empirical process theory; excess risk bounds

[OPT] Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1), 1-122.

keywords: Convex Optimization, Proximity, Smooth objective function

[SW] How to upload your python package to PyPi.

keywords: Python package/library, Pypi, twine

[SW] Basic Tutorial for Cython

keywords: Python package/library, Cython, C/C++
memo: Cython is Python with C data types, to speed up the Python loops.

[METH][LT][INF] Muandet, K., Fukumizu, K., Sriperumbudur, B., & Schölkopf, B. (2016). Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522.

keywords: Kernel method, RKHS, MMD
memo: Overview of kernel methods, properties of RKHS and kernel-based MMD.

[LT] Garnham, A. L., & Prendergast, L. A. (2013). A note on least squares sensitivity in single-index model estimation and the benefits of response transformations. Electronic Journal of Statistics, 7, 1983-2004.

keywords: sliced inverse regression (SIR), sufficient dimension reduction (SDR), OLS, single-index model
memo: in a single-index model, when Cov(X,Y) is nonzero, OLS is able to recover a minimal sufficient reduction space, yet it fails when Cov(X,Y) = 0.

[ES][DL] Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

keywords: deep learning, generalization, random labels
memo: deep neural networks easily fit random labels. Figure 1: training errors for true labels, random labels, shuffled pixels, random pixels, are all converge to zeros. Yet the testing error would affect by the label corruption.

[BIO][METH][OPT][SW] Mak, T. S. H., Porsch, R. M., Choi, S. W., Zhou, X., & Sham, P. C. (2017). Polygenic scores via penalized regression on memo statistics. Genetic epidemiology, 41(6), 469-480. R package: LassoSum

keywords: memo statistics, sparse regression, invalid IVs, lasso, elastic net
memo: solve LASSO and elastic net based on memo statistics: coordinate descent for Lasso (elastic net) only require memo data.

[METH][DATA][SW] Bhatia, K. and Dahiya, K. and Jain, H. and Kar, P. and Mittal, A. and Prabhu, Y. and Varma, M. (2016). The extreme classification repository: multi-label datasets & code.

keywords: extreme classification, multi-label classification
memo: The objective in extreme multi-label classification is to learn feature architectures and classifiers that can automatically tag a data point with the most relevant subset of labels from an extremely large label set. This repository provides resources that can be used for evaluating the performance of extreme multi-label algorithms including datasets, code, and metrics.

[METH][DATA] Covington, P., Adams, J., & Sargin, E. (2016). Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems (pp. 191-198).

keywords: recommender systems, extreme classification, ranking, candidate set
memo: A two-stage recommender system: first detail a deep candidate generation model and then describe a separate deep ranking model.

[OPT][INF] Stegle, O., Lippert, C., Mooij, J. M., Larence, N. D., & Borgwardt, K. (2011). Efficient inference in matrix-variate Gaussian models with iid observation noise. In Proceedings of the Advances in Neural Information Processing Systems 24 (NIPS 2011).

keywords: inverse, inference, matrix-variate Gaussian models
memo: In equation (5), it could effectively compute the inverse of a diagonal matrix plus a Kronecker product.

[LT][OPT] Andersen Ang, Slides: Nuclear norm is the tightest convex envelop of rank function within the unit ball.

keywords: nuclear norm, rank, convex envelop
memo: Find/prove nuclear norm is the tightest convex envelop of rank. The same argument can be used for other nonconvex and discontinuous regularization.

[OPT][SW] Ge, J., Li, X., Jiang, H., Liu, H., Zhang, T., Wang, M., & Zhao, T. (2019). Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python. J. Mach. Learn. Res., 20(44), 1-5. [ Github + Docs ]

keywords: sparse regression, scad, MCP
memo: A Python/R library for sparse regression, including Lasso, SCAD, and MCP.

[OPT] Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of statistics, 36(4), 1509.

keywords: SCAD, local linear approximation (LLA)
memo: Solve the SCAD by repeatedly solving Lasso in (2.7).

[BIO][METH][CI] Windmeijer, F., Farbmacher, H., Davies, N., & Davey Smith, G. (2019). On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association, 114(527), 1339-1350.

keywords: adaptive lasso, 2SLS, causal inference, invalid IV
memo: Use adaptive lasso to select invalid IVs in 2SLS.

[CI][METH] Egami, N., Fong, C. J., Grimmer, J., Roberts, M. E., & Stewart, B. M. (2018). How to make causal inferences using texts. arXiv preprint arXiv:1802.02163.

keywords: text, causal inference
memo: Causal inference based on textual data, text could be treatment or outcome.

[METH][LT] Natarajan, N., Dhillon, I. S., Ravikumar, P. K., & Tewari, A. (2013). Learning with noisy labels. Advances in neural information processing systems, 26, 1196-1204.

keywords: noisy labels, unbalanced-loss
memo: Model the noisy labels by class-conditional random noise model (CCN). Based on CCN, the authors find that the minimizer of classification with noisy labels is drifted Bayes rule: which coincides with the Bayes rule of unbalanced loss.

[METH][DL] Jeremy Jordan, 2018. An overview of semantic image segmentation.

keywords: image segmentation, Dice loss
memo: A introduction for image segmentation, including background, existing methods and loss functions.

[ES][DATA][BIO] Shit, S., Paetzold, J. C., Sekuboyina, A., Ezhov, I., Unger, A., Zhylka, A., ... & Menze, B. H. (2021). clDice-a Novel Topology-Preserving Loss Function for Tubular Structure Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16560-16569).

keywords: image segmentation, Dice loss, topology-preservation
memo: A novel Dice-based loss function for medical image segmentation. The motivation is topology-preservation and skeleta of vessels in medical image. Moreover, the trackable computing losses are proposed with an ad-hoc manner.

[ES][DL] Peng, H., Mou, L., Li, G., Chen, Y., Lu, Y., & Jin, Z. (2015). A comparative study on regularization strategies for embedding-based neural networks. arXiv preprint arXiv:1508.03721.

keywords: regularization, embedding
memo: A comparative empirical study (Experiment A and B) for different regularization in embedding-based neural networks, including (i) l2-reg for other layers (BOTH WORK); (ii) l2-reg for an embedding layer (A WORKS); (iii) re-embedding words: l2-reg in difference on an embedding layer and a pre-trained layer (NO WORKS); (iv) Dropout for other layers (BOTH WORK).

[CI][INF] Feder, A., Keith, K. A., Manzoor, E., Pryzant, R., Sridhar, D., Wood-Doughty, Z., ... & Yang, D. (2021). Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond. arXiv preprint arXiv:2109.00725.

keywords: causal inference, NLP, survey
memo: (i) Background of CI; (ii) Text as treatment, outcome, or confounder; (iii) CI -> ML prediction;

[METH][LT] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R., & Smola, A. (2017). Deep sets. arXiv preprint arXiv:1703.06114.

keywords: permutation invariance; learning with set
memo: (i) permutation invariance iff the learning model can be express as a sum function;

[METH][LT] Cheng, J., Levina, E., Wang, P., & Zhu, J. (2014). A sparse Ising model with covariates. Biometrics, 70(4), 943-953.

keywords: Ising model; label dependence
memo: (i) extend the dependence in Ising model to be a function of features;

[LT][DL] Bartlett, P., Foster, D. J., & Telgarsky, M. (2017). Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498.

keywords: covering number, Rademacher complexity, estimation error bound
memo: The estimation error bounds for neural networks based on covering number and Rademacher complexity

[LT][DL] Bauer, B., & Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4), 2261-2285.

keywords: regret bound, estimation error, approximation error
memo: Both estimation error and approximation error (Theorems 2-3) are provided in the paper.

[LT][DL] Guo, Z. C., Shi, L., & Lin, S. B. (2019). Realizing data features by deep nets. IEEE Transactions on Neural Networks and Learning Systems, 31(10), 4036-4048.

keywords: covering number, estimation error bound
memo: VC-type covering number for neural networks

[LT][METH] Mazumder, R., Hastie, T., & Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11, 2287-2322.

keywords: soft-impute, low-rank, nuclear norm
memo: (i) Solving low-rank regression by soft-thresholded SVD. (ii) Relation between low-rank regression and latent factor model or matrix factorization in Section 8 (Theorem 3) is quite interesting.

[LT][METH][INF] Alex Stephenson, Standard Errors and the Delta Method.

keywords: delta method, asymptotic distribution, standard error
memo: Asymptotic distribution of functions over a random variable; when the asymptotic behavior of this random variable is obtained.

[INF] Aaron Mishkin, Instrumental Variables, DeepIV, and Forbidden Regressions

keywords: instrumental Variables, causal inference

[INF] Vovk, V., & Wang, R. (2020). Combining p-values via averaging. Biometrika, 107(4), 791-808.

keywords: Combining p-values;

[OPT] Iusem, A. N. (2003). On the convergence properties of the projected gradient method for convex optimization. Computational & Applied Mathematics, 22, 37-52.

keywords: projected gradient method, convex optimization
memo: Proposition 4 shows that projected GD convergence to stationary point (if the cluster point exists) when objective function is continuously differentiable, and the feasible domain is convex.

[METH][OPT] Some tensor algebra

keywords: Mode-product, tensor-matrix multiplication, tensor-vector multiplication

[OPT] Andersen, M., Dahl, J., Liu, Z., Vandenberghe, L., Sra, S., Nowozin, S., & Wright, S. J. (2011). Interior-point methods for large-scale cone programming. Optimization for machine learning, 5583.

keywords: cone programming, QP, interior-point methods

[OPT] Tibshirani, R. J., Coordinate Descent.

keywords: Coordinate Descent
memo: The CD algorithm and its convergence rate under different assumptions.

[METH] Jansche, M. (2007, June). A maximum expected utility framework for binary sequence labeling. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 736-743).

keywords: F-score, label-dependence

[LT] Pillai, I., Fumera, G., & Roli, F. (2017). Designing multi-label classifiers that maximize F measures: State of the art. Pattern Recognition, 61, 394-404.

keywords: F-score, decision-theoretic approach
memo: A recent survey for F-score maximization

[OPT] Stephen Boyd and Jon Dattorro, Alternating Projections.

keywords: alternating projection,
memo: AP is an algorithm computing a point in the intersection of some convex sets (or smallest distance of two sets).

[INF][METH] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(1), 723-773.

keywords: MMD, kernel methods, two-sample test, integral probability metric, hypothesis testing
memo: non-parametric distribution discrepancy test based on MMD.

[METH][DL] Sundeep, T. Knowledge Distillation: Principles, Algorithms, Applications. neptuneblog.

keywords: knowledge distillation
memo: introduction for knowledge distillation: online/offline knowledge distillation; existing methods

[OPT] Powell, Michael JD. "On search directions for minimization algorithms." Mathematical programming 4, no. 1 (1973): 193-201.

keywords: BCD, block coordinate descent
memo: When F is nonconvex, BCD may cycle and stagnate

[METH][OPT] Zheng, X., Aragam, B., Ravikumar, P. K., & Xing, E. P. (2018). DAGs with NO TEARS: Continuous Optimization for Structure Learning Advances in Neural Information Processing Systems, 31.

keywords: DAG, matrix power
memo: convert DAG constrains as one matrix power equality constraint.

[OPT] Shalev-Shwartz, S., & Zhang, T. (2012). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research.

keywords: stochastic dual coordinate ascent, liblinear, coordinate descent
memo: simultaneously update primal/dual variables; generalize the algorithm from liblinear

[OPT] Glasmachers, T., & Dogan, U. (2013, October). Accelerated coordinate descent with adaptive coordinate frequencies. In Asian Conference on Machine Learning (pp. 72-86). PMLR.

keywords: coordinate descent
memo: Using Adaptive Coordinate Frequencies to update coordinates

[OPT] Zimmert, J., de Witt, C. S., Kerg, G., & Kloft, M. (2015, December). Safe screening for support vector machines. In NIPS 2015 Workshop on Optimization in Machine Learning (OPT).

keywords: screening, shrinking, coordinate descent
memo: screening the shrinking variables for boxed QP in coordinate desecent: when the gradient is non-zeros then the variable is in the boundary.

[OPT][LT] Raginsky, M., Rakhlin, A., & Telgarsky, M. (2017, June). Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory (pp. 1674-1703). PMLR.

keywords: SGD, global solution, langevin process, SDE

[CI][METH][OPT][LT] Kang, H., Zhang, A., Cai, T. T., & Small, D. S. (2016). Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. Journal of the American statistical Association, 111(513), 132-144.

keywords: 2SLS, invalid IV, sparse regression
memo: introduce sparse regression in the second stage of 2SLS in separate the effect from invalid IVs

[METH][LT] Dalalyan, A. S., & Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. COLT 2007, San Diego, CA, USA; June 13-15, 2007. Proceedings 20 (pp. 97-111). Springer Berlin Heidelberg.

keywords: model aggregate, bagging
memo: Theorem 2 indicates model aggregate is close to the best performance based on model selection

[LT] Chen, J. (2017). Consistency of the MLE under mixture models.

[LT] Chen, J., & Tan, X. (2009). Inference for multivariate normal mixtures. Journal of Multivariate Analysis, 100(7), 1367-1383.

keywords: Nonparametric MLE, identfiability, penalized MLE
memo: Conditions for consistency of Nonparametric MLE under Mixture Model. (i) Identifiability of Mixture Model is a necessary condition. (ii) Most existing general approaches do NOT apply to normal mixture models. (A) Theorem 3.1: Under the finite normal mixture model with equal variance and #Group is known, the MLE is strongly consistent. (B) Section 3.2: Under the finite normal mixture model with unequal variance and #Group is known, the MLE may NOT consistent. Yet, this issue can be solved via penalzied (on variance) MLE. (C) Section 3.3: The proper estimation of the mixing distribution under a finite mixture model requires a very large sample size when the subpopulations are not well separated.

[LT] Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE transactions on information theory, 51(1), 128-142.

keywords: universal consistency
memo: Proposition 3.3: (Classification) Fisher consistency and classification calibration are equivalent.

lidongrong / awesome-statml Goto Github PK

awesome-statml's Introduction

Awesome Reference in Statistics and Machine Learning

Tags

Reference

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent