unpingco / python-for-probability-statistics-and-machine-learning-2e


Second edition of Springer Book Python for Probability, Statistics, and Machine Learning

License: MIT License

Languages: Jupyter Notebook 99.96%, Python 0.03%, Dockerfile 0.01%
Topics: machine-learning, springer-book-python, probability-statistics, python-modules

python-for-probability-statistics-and-machine-learning-2e's Introduction

Python-for-Probability-Statistics-and-Machine-Learning-2E

Draft cover

Second edition of Springer text Python for Probability, Statistics, and Machine Learning

This book, fully updated for Python version 3.6+, covers the key ideas that link probability, statistics, and machine learning, illustrated using Python modules in these areas. All the figures and numerical results are reproducible using the Python code provided. The author develops key intuitions in machine learning by working through meaningful examples using multiple analytical methods and Python code, thereby connecting theoretical concepts to concrete implementations. Detailed proofs for certain important results are also provided. Modern Python modules like Pandas, Sympy, Scikit-learn, Tensorflow, and Keras are applied to simulate and visualize important machine learning concepts like the bias/variance trade-off, cross-validation, and regularization. Many abstract mathematical ideas, such as convergence in probability theory, are developed and illustrated with numerical examples.

This updated edition now includes the Fisher Exact Test and the Mann-Whitney-Wilcoxon Test. A new section on survival analysis has been included, as well as substantial development of Generalized Linear Models. The new deep learning section for image processing includes an in-depth discussion of gradient descent methods that underpin all deep learning algorithms. As with the prior edition, there are new and updated Programming Tips that illustrate effective Python modules and methods for scientific programming and machine learning. There are 445 runnable code blocks with corresponding outputs that have been tested for accuracy. Over 158 graphical visualizations (almost all generated using Python) illustrate the concepts that are developed both in code and in mathematics. We also discuss and use key Python modules such as Numpy, Scikit-learn, Sympy, Scipy, Lifelines, CvxPy, Theano, Matplotlib, Pandas, Tensorflow, Statsmodels, and Keras.

This book is suitable for anyone with an undergraduate-level exposure to probability, statistics, or machine learning and with rudimentary knowledge of Python programming.

Conda setup instructions

If you are using conda, you can get started by cloning this repository and using the environment.yaml file as in the following:

conda env create -n pyPSML -f environment.yaml

and then activate the environment using the following,

conda activate pyPSML

Then, you can run jupyter notebook and navigate the Jupyter notebooks for the individual chapters. All of the notebooks are fully functional in the resulting environment. Note that the Jupyter notebooks contain embedded figures that are meant to validate the outputs of the Matplotlib code therein.
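For reference, the notebook server is launched from inside the activated environment with

jupyter notebook

after which the printed URL can be opened in a browser.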

Docker setup instructions

If you are using docker, there is a Dockerfile included. After cloning this repository, you can build the image with the following,

docker build -t pypsml2e .

and then run it locally using,

docker container run -it -p 8888:8888 pypsml2e

Then, navigate to the output URL and you can explore the Jupyter notebooks for each chapter. Alternatively, if you don't want to build your own image, you can run

docker run -p 8888:8888 unpingco/pypsml2e

to get the Docker image from https://hub.docker.com/r/unpingco/pypsml2e. Note that this may not be as up to date as building the image yourself from this repository, but it should still work fine.

Your comments (including errata) are welcome in the Issues link above.

Good luck! I hope you find these materials helpful.

python-for-probability-statistics-and-machine-learning-2e's People

Contributors

jhunpingco, unpingco


python-for-probability-statistics-and-machine-learning-2e's Issues

Conditional_Expectation_Projection.ipynb

Señor Unpingco, it's really hard for me to understand the formula below the sentence "The conditional expectation is the minimum mean squared error (MMSE) solution to the following problem...". If it were of the form $\int_{\mathbb{R}} (x-h(Y))^2 f_X(x)\,dx$ or $\int_{\mathbb{R}} (x-h(Y))^2 f_{X|Y}(x|y)\,dx$, it would be clearer. It would be very kind of you if you could elaborate more on the formula.
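For reference, a standard way to state this MMSE problem (an assumption about the intended formula, not necessarily the book's exact notation) is

$$ h^{*} = \arg\min_{h}\, \mathbb{E}\!\left[(X-h(Y))^{2}\right] = \arg\min_{h} \int\!\!\int (x-h(y))^{2}\, f_{X,Y}(x,y)\, dx\, dy, $$

whose solution is the conditional expectation $h^{*}(y) = \mathbb{E}[X \mid Y=y] = \int x\, f_{X\mid Y}(x\mid y)\, dx$.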

Error trying to create environment using conda

When I try to create the pyPSML environment in conda using environment.yaml, I get ResolvePackageNotFound:

conda env create -n pyPSML -f environment.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - h5py==2.9.0=py37h7918eee_0
  - gxx_impl_linux-64==7.3.0=hdf63c60_1
  - c-ares==1.15.0=h7b6447c_1001
  - fastcache==1.1.0=py37h516909a_0
  - expat==2.2.5=he1b5a44_1003
  - tornado==6.0.3=py37h516909a_0
  - python==3.7.3=h33d41f4_1
  - readline==8.0=hf8c457e_0
  - grpcio==1.16.1=py37hf8bcb03_1
  - ecos==2.0.7=py37h3010b51_1000
  - tensorflow-base==1.14.0=mkl_py37h7ce6ba3_0
  - scikit-learn==0.21.2=py37hd81dba3_0
  - libsodium==1.0.16=h1bed415_0
  - scipy==1.2.1=py37h7c811a0_0
  - pygpu==0.7.6=py37h3010b51_1000
  - sqlite==3.29.0=hcee41ef_0
  - mpfr==4.0.2=ha14ba45_0
  - mistune==0.8.4=py37h7b6447c_0
  - jpeg==9c=h14c3975_1001
  - binutils_linux-64==2.31.1=h6176602_8
  - libxml2==2.9.9=h13577e0_2
  - mkl==2019.4=243
  - pthread-stubs==0.4=h14c3975_1001
  - freetype==2.10.0=he983fc9_1
  - dbus==1.13.6=he372182_0
  - cvxpy==1.0.24=py37he1b5a44_0
  - zlib==1.2.11=h516909a_1005
  - pyzmq==18.1.0=py37he6710b0_0
  - libiconv==1.15=h516909a_1005
  - libuuid==2.32.1=h14c3975_1000
  - multiprocess==0.70.8=py37h516909a_0
  - sip==4.19.8=py37hf484d3e_1000
  - pyrsistent==0.14.11=py37h7b6447c_0
  - gcc_impl_linux-64==7.3.0=habb00fd_1
  - statsmodels==0.10.0=py37hdd07704_0
  - gxx_linux-64==7.3.0=h553295d_8
  - libgfortran-ng==7.3.0=hdf63c60_0
  - osqp==0.5.0=py37hb3f55d8_0
  - libgcc-ng==9.1.0=hdf63c60_0
  - pandas==0.24.2=py37he6710b0_0
  - pcre==8.41=hf484d3e_1003
  - icu==58.2=hf484d3e_1000
  - gst-plugins-base==1.14.5=h0935bb2_0
  - kiwisolver==1.1.0=py37hc9558a2_0
  - theano==1.0.4=py37hf484d3e_1000
  - fontconfig==2.13.1=he4413a7_1000
  - scs==2.1.1.2=py37h4ff444d_0
  - gstreamer==1.14.5=h36ae1b5_0
  - gettext==0.19.8.1=hc5be6a0_1002
  - gmp==6.1.2=hf484d3e_1000
  - xorg-libxau==1.0.9=h14c3975_0
  - libffi==3.2.1=he1b5a44_1006
  - openssl==1.1.1c=h7b6447c_1
  - ncurses==6.1=hf484d3e_1002
  - bzip2==1.0.8=h516909a_0
  - intel-openmp==2019.4=243
  - libpng==1.6.37=hed695b0_0
  - numpy==1.17.0=py37h95a1406_0
  - qt==5.9.7=h52cfd70_2
  - xorg-libxdmcp==1.1.3=h516909a_0
  - glib==2.58.3=h6f030ca_1002
  - matplotlib==3.1.0=py37h5429711_0
  - tensorboard==1.14.0=py37hf484d3e_0
  - mkl-service==2.2.0=py37h516909a_0
  - gcc_linux-64==7.3.0=h553295d_8
  - tensorflow==1.14.0=mkl_py37h45c423b_0
  - xz==5.2.4=h14c3975_1001
  - gmpy2==2.1.0b1=py37h04dde30_0
  - pyqt==5.9.2=py37hcca6a23_2
  - zeromq==4.3.1=he6710b0_3
  - cvxpy-base==1.0.24=py37he1b5a44_0
  - libxcb==1.13=h14c3975_1002
  - libgpuarray==0.7.6=h14c3975_1003
  - yaml==0.1.7=had09818_2
  - pyyaml==5.1.2=py37h7b6447c_0
  - markupsafe==1.1.1=py37h14c3975_0
  - mpc==1.1.0=hb20f59a_1006
  - libprotobuf==3.8.0=hd408876_0
  - binutils_impl_linux-64==2.31.1=h6176602_1
  - wrapt==1.11.2=py37h7b6447c_0
  - hdf5==1.10.4=hb1b8bf9_0
  - protobuf==3.8.0=py37he6710b0_0
  - tk==8.6.9=hed695b0_1002
  - libstdcxx-ng==9.1.0=hdf63c60_0
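One possible workaround, assuming the failures come from the platform-specific build strings pinned in environment.yaml (the =py37... suffixes above), is to strip those pins and let conda resolve builds available for your platform. A minimal sketch (the output filename is arbitrary):

import re

# Drop pinned build strings such as "=py37h7918eee_0", keeping "pkg==version"
# (or "pkg=version"); this lets conda pick a build that exists for your platform.
with open('environment.yaml') as f:
    text = f.read()
text = re.sub(r'^(\s*-\s*[\w.-]+=+[^=\s]+)=\S+$', r'\1', text, flags=re.MULTILINE)
with open('environment-nobuilds.yaml', 'w') as f:
    f.write(text)

Then retry with conda env create -n pyPSML -f environment-nobuilds.yaml. Some pinned versions may no longer exist on the default channels, in which case relaxing the version pins as well may be necessary.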

integral, p.48, line 5

I don't know how to interpret the integrand dP_X(dx). Please give an explanation, or a fix if this is a typo.
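For reference (an assumption about the intended meaning, not a statement of the book's notation), the two standard ways of writing an integral against the distribution $P_X$ of $X$ are

$$ \mathbb{E}[g(X)] = \int_{\mathbb{R}} g(x)\, dP_X(x) = \int_{\mathbb{R}} g(x)\, P_X(dx), $$

where $P_X = P \circ X^{-1}$ is the pushforward (distribution) measure of $X$; the expression $dP_X(dx)$ reads like a conflation of these two equivalent notations.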

Typo Page 59

Hello - I think in the minimization problem you formulate in the middle of the page, you ought to be integrating against a density; that is, you ought to include an "f(x)" factor after the (x-h(y))^2 term.

sympy stats.sample API has changed

The return type of sample has been changed to return an iterator
object since version 1.7. For more information see
sympy/sympy#19061

import numpy as np
from sympy import stats
# Eq constrains Z
samples_z7 = lambda: stats.sample(x, S.Eq(z, 7))
# using 6 as an estimate
mn = np.mean([(6 - samples_z7())**2 for i in range(100)])
# 7/2 is the MSE estimate
mn0 = np.mean([(7/2. - samples_z7())**2 for i in range(100)])
print('MSE=%3.2f using 6 vs MSE=%3.2f using 7/2 ' % (mn, mn0))

Error message

----> 2 mn = np.mean([(6 - samples_z7())**2 for i in range(100)])
TypeError: unsupported operand type(s) for -: 'int' and 'generator'
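A hedged adaptation for newer SymPy versions follows. Note that x, y, and z are not defined in the snippet above; they are reconstructed here as the two-dice example that the estimates 6 and 7/2 suggest, so adjust them to match the notebook's actual definitions:

import numpy as np
import sympy as S
from sympy import stats
from sympy.stats import Die

x = Die('X', 6)   # hypothetical reconstruction: one fair die
y = Die('Y', 6)
z = x + y         # condition on the sum of the two dice

def samples_z7():
    s = stats.sample(x, S.Eq(z, 7))
    # Since SymPy 1.7, sample can return an iterator; unwrap it if so.
    return next(s) if hasattr(s, '__next__') else s

mn = np.mean([(6 - samples_z7())**2 for i in range(100)])       # using 6 as an estimate
mn0 = np.mean([(7/2 - samples_z7())**2 for i in range(100)])    # 7/2 is the MMSE estimate
print('MSE=%3.2f using 6 vs MSE=%3.2f using 7/2' % (mn, mn0))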

ch. 2.1.1 "understanding probability density"

Hi José, I am sympathetic to the idea of basing chapter two on measure theory and the Lebesgue integral. But the example on p. 40 and Fig. 2.4 are in contradiction with the chapter title. Fig. 2.1 doesn't show a density: the areas don't add up to 1. Furthermore, the two measures have the same length, 1. This does not motivate a learner to invest time in learning Lebesgue integration. Instead of the graphic in Fig. 2.1, I find the graphic in the German Wikipedia (https://de.wikipedia.org/wiki/Lebesgue-Integral) more instructive: the density is bimodal and the measures have different sizes, so the German graphic is even better than the one in the English Wikipedia.
Also missing is a hint or an example of why the Lebesgue integral is necessary for understanding the chapters to come.
A further bonus would be Python code doing Lebesgue integration, with an example where Riemann integration is not possible.
All the best, Claus
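A minimal sketch of the kind of example requested in the issue above, using the Dirichlet function (the indicator of the rationals on [0, 1]): it is not Riemann integrable because Riemann sums depend on where each subinterval is sampled, but its Lebesgue integral is 0 because the rationals have measure zero. Rationality is represented explicitly with the Fraction type, since floating-point numbers cannot distinguish rationals from irrationals.

from fractions import Fraction
import math

def dirichlet(t):
    # Indicator of the rationals: 1 if t is (represented as) rational, else 0.
    return 1.0 if isinstance(t, Fraction) else 0.0

n = 1000
# Riemann sums over [0, 1] with n equal subintervals, using different tag points:
rational_tags = [Fraction(k, n) for k in range(n)]                               # rational tags
irrational_tags = [Fraction(k, n) + math.sqrt(2) / (10 * n) for k in range(n)]   # irrational tags
riemann_rational = sum(dirichlet(t) for t in rational_tags) / n       # -> 1.0
riemann_irrational = sum(dirichlet(t) for t in irrational_tags) / n   # -> 0.0

# The Lebesgue integral partitions the range instead of the domain: the value 1
# is taken on a set of Lebesgue measure 0 (the rationals) and the value 0 on a
# set of measure 1, so the integral is 1*0 + 0*1 = 0, regardless of the tags.
lebesgue_integral = 1 * 0 + 0 * 1
print(riemann_rational, riemann_irrational, lebesgue_integral)

Because the two Riemann sums disagree for every n, the Riemann integral does not exist, while the Lebesgue value is unambiguous.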
