Awesome References in Statistics and Machine Learning

This repository contains a curated list of awesome references for statistics and machine learning.

Tags

⛳ Methodology (METH) 📘 Learning Theory (LT) 🎯 Optimization (OPT)
🔎 Statistical Inference (INF) 💻 Software (SW) 🔓 Explainable AI (XAI)
🍒 Biostatistics (BIO) ⌨️ Empirical Studies (ES) 🌐 Deep Learning (DL)
📊 Dataset (DATA) ➡️ Causal Inference (CI) 🗒️ Natural Language Processing (NLP)

References

[METH] Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222(594-604), 309-368.

[METH][OPT] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

  • keywords: random forest, ensemble methods, bias-variance trade-off

[METH][LT] Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.

  • keywords: Bayes rule, Fisher consistency, convex optimization, empirical process theory, excess risk bounds

[OPT] Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 1-122.

  • keywords: convex optimization, proximal operators, smooth objective functions

[SW] How to upload your Python package to PyPI.

  • keywords: Python package/library, PyPI, twine
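
A minimal sketch of the usual release flow the linked guide covers, driven from Python so all snippets in this list share one language (these are normally typed as shell commands); it assumes `pip install build twine` and a valid pyproject.toml:

```python
# Build the sdist/wheel into dist/, then upload them to PyPI with twine.
import glob
import subprocess

subprocess.run(["python", "-m", "build"], check=True)                  # creates dist/*.tar.gz and dist/*.whl
subprocess.run(["twine", "upload", *glob.glob("dist/*")], check=True)  # prompts for PyPI credentials
```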

[SW] Basic Tutorial for Cython

  • keywords: Python package/library, Cython, C/C++
  • memo: Cython is Python with C data types, used to speed up Python loops; see the sketch below.
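
A minimal sketch in Cython's documented "pure Python" mode, so the file stays valid Python; compiling it with Cython turns the typed loop into C-speed code. The function `fib` is a hypothetical example:

```python
# cython: language_level=3
import cython

def fib(n: cython.int) -> cython.double:
    # Static C types on the loop variables remove Python-object overhead.
    i: cython.int
    a: cython.double = 0.0
    b: cython.double = 1.0
    for i in range(n):
        a, b = a + b, a
    return a
```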

[METH][LT][INF] Muandet, K., Fukumizu, K., Sriperumbudur, B., & Schölkopf, B. (2016). Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522.

  • keywords: Kernel method, RKHS, MMD
  • memo: An overview of kernel methods, properties of the RKHS, and the kernel-based MMD (defined below).
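
For reference, the two central objects of the review: the kernel mean embedding of a distribution $P$ into an RKHS $\mathcal{H}$ with kernel $k$, and the MMD between $P$ and $Q$:

$$
\mu_P = \mathbb{E}_{X \sim P}\,[k(\cdot, X)] \in \mathcal{H},
\qquad
\mathrm{MMD}(P, Q) = \lVert \mu_P - \mu_Q \rVert_{\mathcal{H}}.
$$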

[LT] Garnham, A. L., & Prendergast, L. A. (2013). A note on least squares sensitivity in single-index model estimation and the benefits of response transformations. Electronic Journal of Statistics, 7, 1983-2004.

  • keywords: sliced inverse regression (SIR), sufficient dimension reduction (SDR), OLS, single-index model
  • memo: In a single-index model, when Cov(X, Y) is nonzero, OLS can recover the minimal sufficient dimension-reduction direction, yet it fails when Cov(X, Y) = 0 (see the numerical sketch below).
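
A small numerical illustration of this point (my own, not from the paper): in the single-index model y = f(b'x) + eps with Gaussian X, OLS recovers the direction b when Cov(X, y) != 0 (e.g., f monotone), but not when Cov(X, y) = 0 (f even):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5
X = rng.standard_normal((n, p))
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

for name, f in [("monotone f", lambda t: t + 0.5 * t**3),
                ("even f", lambda t: t**2)]:
    y = f(X @ b) + 0.1 * rng.standard_normal(n)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    cos = abs(beta_ols @ b) / (np.linalg.norm(beta_ols) * np.linalg.norm(b))
    print(f"{name}: |cos(OLS, b)| = {cos:.3f}")  # near 1 for monotone f; misaligned for even f
```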

[ES][DL] Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

  • keywords: deep learning, generalization, random labels
  • memo: Deep neural networks easily fit random labels. In Figure 1, the training errors for true labels, random labels, shuffled pixels, and random pixels all converge to zero, yet the test error degrades as the label corruption increases.

[BIO][METH][OPT][SW] Mak, T. S. H., Porsch, R. M., Choi, S. W., Zhou, X., & Sham, P. C. (2017). Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology, 41(6), 469-480. R package: LassoSum

  • keywords: summary statistics, sparse regression, invalid IVs, lasso, elastic net
  • memo: Solves the lasso and elastic net from summary statistics alone: coordinate descent for the lasso (elastic net) only requires summary data (see the sketch below).
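
A minimal sketch (not the LassoSum implementation) of the memo's observation: coordinate descent for the lasso touches the data only through the summary statistics R = X'X/n and r = X'y/n, never the individual-level data:

```python
import numpy as np

def lasso_cd_summary(R, r, lam, n_iter=200):
    """Minimize (1/2) b'Rb - r'b + lam * ||b||_1 by coordinate descent."""
    b = np.zeros(len(r))
    for _ in range(n_iter):
        for j in range(len(r)):
            rho = r[j] - R[j] @ b + R[j, j] * b[j]                    # partial residual for coord j
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / R[j, j]  # soft-threshold update
    return b
```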

[METH][DATA][SW] Bhatia, K., Dahiya, K., Jain, H., Kar, P., Mittal, A., Prabhu, Y., & Varma, M. (2016). The extreme classification repository: Multi-label datasets & code.

  • keywords: extreme classification, multi-label classification
  • memo: The objective in extreme multi-label classification is to learn feature architectures and classifiers that can automatically tag a data point with the most relevant subset of labels from an extremely large label set. This repository provides resources that can be used for evaluating the performance of extreme multi-label algorithms including datasets, code, and metrics.

[METH][DATA] Covington, P., Adams, J., & Sargin, E. (2016). Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems (pp. 191-198).

  • keywords: recommender systems, extreme classification, ranking, candidate set
  • memo: A two-stage recommender system: a deep candidate-generation model followed by a separate deep ranking model.

[OPT][INF] Stegle, O., Lippert, C., Mooij, J. M., Lawrence, N. D., & Borgwardt, K. (2011). Efficient inference in matrix-variate Gaussian models with iid observation noise. In Proceedings of the Advances in Neural Information Processing Systems 24 (NIPS 2011).

  • keywords: inverse, inference, matrix-variate Gaussian models
  • memo: Equation (5) shows how to efficiently compute the inverse of a Kronecker product plus a diagonal (iid-noise) matrix; the standard identity is sketched below.
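
A sketch of the standard identity behind such computations (not necessarily the paper's exact equation (5)): with eigendecompositions $K_c = U_c S_c U_c^{\top}$ and $K_r = U_r S_r U_r^{\top}$,

$$
(K_c \otimes K_r + \sigma^2 I)^{-1}
= (U_c \otimes U_r)\,(S_c \otimes S_r + \sigma^2 I)^{-1}\,(U_c \otimes U_r)^{\top},
$$

so only two small eigendecompositions and a diagonal inverse are required.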

[LT][OPT] Andersen Ang, Slides: Nuclear norm is the tightest convex envelope of the rank function within the unit ball.

  • keywords: nuclear norm, rank, convex envelope
  • memo: Proves that the nuclear norm is the tightest convex envelope of the rank function; the same argument applies to other nonconvex and discontinuous regularizers.
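
For reference, the statement in the slides: on the spectral-norm unit ball $\{X : \lVert X \rVert_2 \le 1\}$, the nuclear norm is the biconjugate of the rank, i.e., the largest convex function lower-bounding it there:

$$
\lVert X \rVert_* = \big(\mathrm{rank}\big)^{**}(X) \le \mathrm{rank}(X),
\qquad \lVert X \rVert_2 \le 1.
$$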

[OPT][SW] Ge, J., Li, X., Jiang, H., Liu, H., Zhang, T., Wang, M., & Zhao, T. (2019). Picasso: A sparse learning library for high dimensional data analysis in R and Python. Journal of Machine Learning Research, 20(44), 1-5. [GitHub + Docs]

  • keywords: sparse regression, scad, MCP
  • memo: A Python/R library for sparse regression, including Lasso, SCAD, and MCP.

[OPT] Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4), 1509.

  • keywords: SCAD, local linear approximation (LLA)
  • memo: Solves the SCAD-penalized problem by repeatedly solving a weighted lasso; see (2.7) and the LLA surrogate below.
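
A sketch of the LLA step: given the current iterate $\beta^{(k)}$, each update solves a weighted lasso whose weights are the SCAD derivative (with $a > 2$, typically $a = 3.7$):

$$
\beta^{(k+1)} = \arg\min_{\beta}\; L(\beta) + \sum_{j} p'_{\lambda}\big(|\beta_j^{(k)}|\big)\,|\beta_j|,
\qquad
p'_{\lambda}(t) = \lambda \left\{ \mathbf{1}(t \le \lambda) + \frac{(a\lambda - t)_+}{(a - 1)\lambda}\,\mathbf{1}(t > \lambda) \right\}.
$$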

[BIO][METH][CI] Windmeijer, F., Farbmacher, H., Davies, N., & Davey Smith, G. (2019). On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association, 114(527), 1339-1350.

  • keywords: adaptive lasso, 2SLS, causal inference, invalid IV
  • memo: Use adaptive lasso to select invalid IVs in 2SLS.

[CI][METH] Egami, N., Fong, C. J., Grimmer, J., Roberts, M. E., & Stewart, B. M. (2018). How to make causal inferences using texts. arXiv preprint arXiv:1802.02163.

  • keywords: text, causal inference
  • memo: Causal inference based on textual data, text could be treatment or outcome.

[METH][LT] Natarajan, N., Dhillon, I. S., Ravikumar, P. K., & Tewari, A. (2013). Learning with noisy labels. Advances in Neural Information Processing Systems, 26, 1196-1204.

  • keywords: noisy labels, unbalanced-loss
  • memo: Models noisy labels with the class-conditional random noise (CCN) model. Under CCN, the authors show that the minimizer of the noisy-label classification risk is a shifted Bayes rule, which coincides with the Bayes rule of a cost-sensitive (unbalanced) loss; see the corrected loss below.
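
For reference, the paper's "method of unbiased estimators" (its Lemma 1): with flip rates $\rho_{+1}$ and $\rho_{-1}$, the corrected loss

$$
\tilde{\ell}(t, y) = \frac{(1 - \rho_{-y})\,\ell(t, y) - \rho_{y}\,\ell(t, -y)}{1 - \rho_{+1} - \rho_{-1}}
$$

is unbiased for the clean loss under the noisy-label distribution, so minimizing it on noisy data targets the clean risk.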

[METH][DL] Jeremy Jordan, 2018. An overview of semantic image segmentation.

  • keywords: image segmentation, Dice loss
  • memo: An introduction to image segmentation, covering background, existing methods, and loss functions (the soft Dice loss is recalled below).
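
For reference, the soft Dice loss commonly used in segmentation, with predicted probabilities $p_i$ and binary ground-truth mask $g_i$:

$$
L_{\mathrm{Dice}} = 1 - \frac{2\sum_i p_i\, g_i}{\sum_i p_i + \sum_i g_i}.
$$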

[ES][DATA][BIO] Shit, S., Paetzold, J. C., Sekuboyina, A., Ezhov, I., Unger, A., Zhylka, A., ... & Menze, B. H. (2021). clDice-a Novel Topology-Preserving Loss Function for Tubular Structure Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16560-16569).

  • keywords: image segmentation, Dice loss, topology-preservation
  • memo: A novel Dice-based loss function for medical image segmentation, motivated by preserving the topology and skeleta of vessels in medical images. Tractable computations of the loss are proposed in a somewhat ad-hoc manner.

[ES][DL] Peng, H., Mou, L., Li, G., Chen, Y., Lu, Y., & Jin, Z. (2015). A comparative study on regularization strategies for embedding-based neural networks. arXiv preprint arXiv:1508.03721.

  • keywords: regularization, embedding
  • memo: A comparative empirical study (Experiments A and B) of regularization strategies in embedding-based neural networks: (i) l2 regularization of the non-embedding layers (works in both); (ii) l2 regularization of the embedding layer (works in A); (iii) re-embedding words, i.e., l2 regularization of the difference between the embedding layer and a pre-trained embedding (works in neither); (iv) dropout on the non-embedding layers (works in both).

[CI][INF] Feder, A., Keith, K. A., Manzoor, E., Pryzant, R., Sridhar, D., Wood-Doughty, Z., ... & Yang, D. (2021). Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond. arXiv preprint arXiv:2109.00725.

  • keywords: causal inference, NLP, survey
  • memo: (i) background of CI; (ii) text as treatment, outcome, or confounder; (iii) using causal inference to improve ML prediction.

[METH][LT] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R., & Smola, A. (2017). Deep sets. arXiv preprint arXiv:1703.06114.

  • keywords: permutation invariance; learning with set
  • memo: (i) A model is permutation invariant iff it can be expressed as a sum decomposition (see below).
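
The decomposition in question: a set function $f$ is permutation invariant iff it can be written, for suitable maps $\phi$ and $\rho$, as

$$
f(X) = \rho\!\left( \sum_{x \in X} \phi(x) \right).
$$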

[METH][LT] Cheng, J., Levina, E., Wang, P., & Zhu, J. (2014). A sparse Ising model with covariates. Biometrics, 70(4), 943-953.

  • keywords: Ising model; label dependence
  • memo: (i) Extends the dependence structure in the Ising model to be a function of covariates.

[LT][DL] Bartlett, P., Foster, D. J., & Telgarsky, M. (2017). Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498.

  • keywords: covering number, Rademacher complexity, estimation error bound
  • memo: Estimation error bounds for neural networks based on covering numbers and Rademacher complexity.

[LT][DL] Bauer, B., & Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4), 2261-2285.

  • keywords: regret bound, estimation error, approximation error
  • memo: Both estimation-error and approximation-error bounds (Theorems 2-3) are provided in the paper.

[LT][DL] Guo, Z. C., Shi, L., & Lin, S. B. (2019). Realizing data features by deep nets. IEEE Transactions on Neural Networks and Learning Systems, 31(10), 4036-4048.

  • keywords: covering number, estimation error bound
  • memo: VC-type covering-number bounds for neural networks.

[LT][METH] Mazumder, R., Hastie, T., & Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11, 2287-2322.

  • keywords: soft-impute, low-rank, nuclear norm
  • memo: (i) Solves the low-rank problem by iterative soft-thresholded SVD (see the sketch below). (ii) The relation between low-rank regression and the latent factor model / matrix factorization in Section 8 (Theorem 3) is quite interesting.
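
A minimal sketch of the Soft-Impute iteration in the memo (assuming a full SVD per iteration is affordable): fill the missing entries with the current estimate, then apply a soft-thresholded SVD:

```python
import numpy as np

def soft_impute(X, mask, lam, n_iter=100):
    """X: matrix with arbitrary values at missing entries; mask: True = observed."""
    Z = np.zeros_like(X)
    for _ in range(n_iter):
        filled = np.where(mask, X, Z)                # impute missing entries with current estimate
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt      # soft-threshold the singular values
    return Z
```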

[LT][METH][INF] Alex Stephenson, Standard Errors and the Delta Method.

  • keywords: delta method, asymptotic distribution, standard error
  • memo: Derives the asymptotic distribution of a function of a random variable once the asymptotic behavior of that variable is known (see the statement below).
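
The statement, for reference: if $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$, then

$$
\sqrt{n}\,\big(g(\hat\theta_n) - g(\theta)\big) \;\xrightarrow{d}\; N\!\big(0,\; g'(\theta)^2 \sigma^2\big).
$$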

[INF] Aaron Mishkin, Instrumental Variables, DeepIV, and Forbidden Regressions

  • keywords: instrumental variables, causal inference

[INF] Vovk, V., & Wang, R. (2020). Combining p-values via averaging. Biometrika, 107(4), 791-808.

  • keywords: combining p-values
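
A headline result of the paper, for reference: for arbitrarily dependent p-values $p_1, \dots, p_K$, twice their arithmetic mean is again a valid p-value, and the factor 2 cannot in general be improved:

$$
p_{\mathrm{avg}} = \min\!\left\{ 1,\; \frac{2}{K} \sum_{k=1}^{K} p_k \right\}.
$$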

[OPT] Iusem, A. N. (2003). On the convergence properties of the projected gradient method for convex optimization. Computational & Applied Mathematics, 22, 37-52.

  • keywords: projected gradient method, convex optimization
  • memo: Proposition 4 shows that projected GD converges to a stationary point (provided a cluster point exists) when the objective is continuously differentiable and the feasible domain is convex (a minimal sketch follows).
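
A minimal sketch of projected gradient descent on a convex set (here the Euclidean unit ball, a hypothetical choice for illustration):

```python
import numpy as np

def project_ball(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def projected_gd(grad, x0, step=0.1, n_iter=500):
    x = x0
    for _ in range(n_iter):
        x = project_ball(x - step * grad(x))   # gradient step, then projection
    return x

# Minimize ||x - c||^2 over the unit ball, with c outside the ball:
c = np.array([2.0, 0.0])
print(projected_gd(lambda x: 2.0 * (x - c), np.zeros(2)))  # ~ [1., 0.]
```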

[METH][OPT] Some tensor algebra

  • keywords: Mode-product, tensor-matrix multiplication, tensor-vector multiplication
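
For reference, the mode-$n$ (tensor-matrix) product of $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with $U \in \mathbb{R}^{J \times I_n}$:

$$
(\mathcal{X} \times_n U)_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N}
= \sum_{i_n = 1}^{I_n} x_{i_1 \cdots i_n \cdots i_N}\, u_{j i_n}.
$$

Tensor-vector multiplication is the special case $J = 1$.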

[OPT] Andersen, M., Dahl, J., Liu, Z., Vandenberghe, L., Sra, S., Nowozin, S., & Wright, S. J. (2011). Interior-point methods for large-scale cone programming. Optimization for machine learning, 5583.

  • keywords: cone programming, QP, interior-point methods

[OPT] Tibshirani, R. J., Coordinate Descent.

  • keywords: Coordinate Descent
  • memo: The CD algorithm and its convergence rate under different assumptions.

[METH] Jansche, M. (2007, June). A maximum expected utility framework for binary sequence labeling. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 736-743).

  • keywords: F-score, label-dependence

[LT] Pillai, I., Fumera, G., & Roli, F. (2017). Designing multi-label classifiers that maximize F measures: State of the art. Pattern Recognition, 61, 394-404.

  • keywords: F-score, decision-theoretic approach
  • memo: A recent survey of F-score maximization.

[OPT] Stephen Boyd and Jon Dattorro, Alternating Projections.

  • keywords: alternating projection
  • memo: AP is an algorithm for computing a point in the intersection of convex sets (or the minimum distance between two sets); see the sketch below.
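
A minimal sketch of alternating projections onto two convex sets (a disk and a halfspace; both choices are hypothetical illustrations):

```python
import numpy as np

def proj_disk(x, center, radius):
    d = x - center
    nrm = np.linalg.norm(d)
    return x if nrm <= radius else center + radius * d / nrm

def proj_halfspace(x, a, b):                  # project onto {z : a'z <= b}
    v = a @ x - b
    return x if v <= 0.0 else x - v * a / (a @ a)

x = np.array([5.0, 5.0])
for _ in range(100):                          # alternate the two projections
    x = proj_halfspace(proj_disk(x, np.zeros(2), 1.0), np.array([1.0, 1.0]), 1.0)
print(x)   # a point in the intersection of the disk and the halfspace
```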

[INF][METH] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(1), 723-773.

  • keywords: MMD, kernel methods, two-sample test, integral probability metric, hypothesis testing
  • memo: A nonparametric two-sample test of distribution discrepancy based on the MMD (an unbiased estimator is sketched below).
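
A minimal sketch (assuming an RBF kernel, a common default) of the paper's unbiased MMD² statistic computed from two samples X and Y:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-gamma * d2)

def mmd2_unbiased(X, Y, gamma=1.0):
    Kxx, Kyy, Kxy = rbf(X, X, gamma), rbf(Y, Y, gamma), rbf(X, Y, gamma)
    m, n = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop the i = j terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1)) - 2.0 * Kxy.mean()
```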

[METH][DL] Sundeep, T. Knowledge Distillation: Principles, Algorithms, Applications. neptuneblog.

  • keywords: knowledge distillation
  • memo: An introduction to knowledge distillation: online/offline distillation and existing methods.

[OPT] Powell, M. J. D. (1973). On search directions for minimization algorithms. Mathematical Programming, 4(1), 193-201.

  • keywords: BCD, block coordinate descent
  • memo: When the objective is nonconvex, BCD may cycle and stagnate.

[METH][OPT] Zheng, X., Aragam, B., Ravikumar, P. K., & Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31.

  • keywords: DAG, matrix power
  • memo: Converts the combinatorial DAG (acyclicity) constraint into a single smooth matrix-power equality constraint (see below).
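
The constraint in question, for reference: a weighted adjacency matrix $W \in \mathbb{R}^{d \times d}$ encodes an acyclic graph iff

$$
h(W) = \operatorname{tr}\!\big(e^{W \circ W}\big) - d = 0,
$$

where $\circ$ is the Hadamard product; the trace of the matrix-power series counts weighted cycles, so acyclicity becomes one smooth equality usable in continuous optimization.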

[OPT] Shalev-Shwartz, S., & Zhang, T. (2012). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research.

  • keywords: stochastic dual coordinate ascent, liblinear, coordinate descent
  • memo: Simultaneously updates primal and dual variables; generalizes the algorithm used in LIBLINEAR.

[OPT] Glasmachers, T., & Dogan, U. (2013, October). Accelerated coordinate descent with adaptive coordinate frequencies. In Asian Conference on Machine Learning (pp. 72-86). PMLR.

  • keywords: coordinate descent
  • memo: Updates coordinates with adaptive coordinate frequencies rather than uniform selection.

[OPT] Zimmert, J., de Witt, C. S., Kerg, G., & Kloft, M. (2015, December). Safe screening for support vector machines. In NIPS 2015 Workshop on Optimization in Machine Learning (OPT).

  • keywords: screening, shrinking, coordinate descent
  • memo: Screens the shrinking variables for box-constrained QPs in coordinate descent: when the gradient is nonzero at the optimum, the variable lies on the boundary.

[OPT][LT] Raginsky, M., Rakhlin, A., & Telgarsky, M. (2017, June). Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory (pp. 1674-1703). PMLR.

  • keywords: SGD, global solution, langevin process, SDE

[CI][METH][OPT][LT] Kang, H., Zhang, A., Cai, T. T., & Small, D. S. (2016). Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. Journal of the American Statistical Association, 111(513), 132-144.

  • keywords: 2SLS, invalid IV, sparse regression
  • memo: Introduces sparse regression in the second stage of 2SLS to separate out the direct effects of invalid IVs (the model is sketched below).
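
A sketch of the model (notation mine): with outcome $Y$, exposure $D$, and candidate instruments $Z$,

$$
Y = D\beta + Z\alpha + \varepsilon, \qquad D = Z\gamma + \eta,
$$

where valid instruments satisfy $\alpha_j = 0$; imposing sparsity on $\alpha$ in the second stage identifies $\beta$ when fewer than half of the instruments are invalid (the paper's majority rule).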

[METH][LT] Dalalyan, A. S., & Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th Annual Conference on Learning Theory (COLT 2007) (pp. 97-111). Springer Berlin Heidelberg.

  • keywords: model aggregation, bagging
  • memo: Theorem 2 shows that the exponentially weighted aggregate performs nearly as well as the best candidate chosen by model selection (see the aggregate below).
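
A sketch of the exponentially weighted aggregate over candidates $f_1, \dots, f_M$ with empirical risks $\hat{R}_n(f_j)$, prior weights $\pi_j$, and temperature $\beta > 0$ (up to the paper's exact temperature scaling):

$$
\hat{f}_{\mathrm{EW}} = \sum_{j=1}^{M} w_j f_j,
\qquad
w_j = \frac{\pi_j \exp\!\big(-\hat{R}_n(f_j)/\beta\big)}{\sum_{m=1}^{M} \pi_m \exp\!\big(-\hat{R}_n(f_m)/\beta\big)}.
$$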

[LT] Chen, J. (2017). Consistency of the MLE under mixture models.

[LT] Chen, J., & Tan, X. (2009). Inference for multivariate normal mixtures. Journal of Multivariate Analysis, 100(7), 1367-1383.

  • keywords: nonparametric MLE, identifiability, penalized MLE
  • memo: Conditions for consistency of the nonparametric MLE under mixture models. (i) Identifiability of the mixture model is a necessary condition. (ii) Most existing general approaches do NOT apply to normal mixture models. (A) Theorem 3.1: Under the finite normal mixture model with equal variances and a known number of groups, the MLE is strongly consistent. (B) Section 3.2: Under the finite normal mixture model with unequal variances and a known number of groups, the MLE may NOT be consistent; this issue can be resolved via a penalized (on the variances) MLE. (C) Section 3.3: Properly estimating the mixing distribution under a finite mixture model requires a very large sample size when the subpopulations are not well separated.

[LT] Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE transactions on information theory, 51(1), 128-142.

  • keywords: universal consistency
  • memo: Proposition 3.3: (Classification) Fisher consistency and classification calibration are equivalent.
