aloctavodia / statistical-rethinking-with-python-and-pymc3

Python/PyMC3 port of the examples in "Statistical Rethinking: A Bayesian Course with Examples in R and Stan" by Richard McElreath

pymc bayesian-data-analysis statistics python data-science

statistical-rethinking-with-python-and-pymc3's Introduction

Statistical Rethinking with Python and PyMC3

This repository has been deprecated in favour of this one; please check that repository for updates, and to open issues or send pull requests.

Statistical Rethinking is an incredibly good introductory book on Bayesian statistics; it follows a Jaynesian and practical approach, with very good examples and clear explanations.

In this repository we port the code in the book (originally written in R and Stan) to PyMC3. We try to keep the examples as close as possible to those in the book, while expressing them in the most Pythonic and PyMC3onic way we can.

Display notebooks

View Jupyter notebooks in nbviewer

Contributing

If you want to contribute, please send your pull request to the repository mentioned above. All contributions are welcome!

Installing the dependencies

To install the dependencies to run these notebooks, you can use Anaconda. Once you have installed Anaconda, run:

conda env create -f environment.yml

to install all the dependencies into an isolated environment. You can switch to this environment by running:

source activate stat-rethink-pymc3
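
Once the environment is activated, a quick sanity check (just a suggestion; the exact versions depend on what environment.yml pins) is to start Python and run:

import pymc3 as pm
import numpy as np

print(pm.__version__)  # should print the PyMC3 version installed in the environment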

Creative Commons License
Statistical Rethinking with Python and PyMC3 by All Contributors is licensed under a Creative Commons Attribution 4.0 International License.

statistical-rethinking-with-python-and-pymc3's People

Contributors

agustinaarroyuelo, aloctavodia, arokem, austinrochford, betatim, canyon289, femtotrader, gladomat, junpenglao, napsternxg, nigeljyng, pmbaumgartner, zonca


statistical-rethinking-with-python-and-pymc3's Issues

Work-in-progress tracking

Is there a way you'd like to track progress on this? I'd love to contribute, but don't want to duplicate a chapter someone is already working on.

Chp_04 doesn't load

I get the message saying:

Sorry, something went wrong. Reload?

I believe this happens for other chapters as well.

Rename environment name in environment.yml

I think the environment name pymc in environment.yml is a bit confusing and may cause conflicts with the user's existing environments. A better name might be stat-rethink-pymc3.

This will require changes in the README.md and environment.yml.
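
For illustration, the change in environment.yml would only be its first line (assuming the rest of the file stays untouched):

name: stat-rethink-pymc3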

Enhancement of the notebooks

Since almost all the chapters are ported (I will also work on Chapter 14), maybe it is good to

  • add legends to the figures in the early chapters.
  • reproduce the rest of the figures shown in the book. Some of the figures relate to analyses that are presented without code.

Conda Environment File

It might be helpful for contributors to have a conda environment file so they can quickly create a new environment and get things rolling. I have one I'll paste below, but I haven't checked it against all of the notebooks to make sure everything necessary is included and anything extraneous is not.

Here's mine:

name: pymc
channels:
- conda-forge/label/rc
- conda-forge
- defaults
dependencies:
- cycler=0.10.0=py36_0
- freetype=2.7=1
- joblib=0.11=py36_0
- libgpuarray=0.6.2=np112py36_0
- libpng=1.6.28=0
- mako=1.0.6=py36_0
- matplotlib=2.0.0=np112py36_3
- nose=1.3.7=py36_2
- pandas=0.19.2=np112py36_1
- patsy=0.4.1=py36_0
- pyparsing=2.2.0=py36_0
- pytz=2017.2=py36_0
- theano=0.9.0=py36_0
- tqdm=4.11.2=py36_0
- pymc3=3.1rc3=py36_0
- appnope=0.1.0=py36_0
- bleach=1.5.0=py36_0
- decorator=4.0.11=py36_0
- entrypoints=0.2.2=py36_1
- h5py=2.7.0=np112py36_0
- hdf5=1.8.17=1
- html5lib=0.999=py36_0
- ipykernel=4.6.1=py36_0
- ipython=5.3.0=py36_0
- ipython_genutils=0.2.0=py36_0
- jinja2=2.9.6=py36_0
- jsonschema=2.5.1=py36_0
- jupyter_client=5.0.1=py36_0
- jupyter_core=4.3.0=py36_0
- markupsafe=0.23=py36_2
- mistune=0.7.4=py36_0
- mkl=2017.0.1=0
- mkl-service=1.1.2=py36_3
- nbconvert=5.1.1=py36_0
- nbformat=4.3.0=py36_0
- notebook=5.0.0=py36_0
- numpy=1.12.1=py36_0
- openssl=1.0.2k=1
- pandocfilters=1.4.1=py36_0
- path.py=10.1=py36_0
- pexpect=4.2.1=py36_0
- pickleshare=0.7.4=py36_0
- pip=9.0.1=py36_1
- prompt_toolkit=1.0.14=py36_0
- ptyprocess=0.5.1=py36_0
- pygments=2.2.0=py36_0
- python=3.6.1=0
- python-dateutil=2.6.0=py36_0
- pyzmq=16.0.2=py36_0
- readline=6.2=2
- scipy=0.19.0=np112py36_0
- seaborn=0.7.1=py36_0
- setuptools=27.2.0=py36_0
- simplegeneric=0.8.1=py36_1
- six=1.10.0=py36_0
- sqlite=3.13.0=0
- terminado=0.6=py36_0
- testpath=0.3=py36_0
- tk=8.5.18=0
- tornado=4.4.2=py36_0
- traitlets=4.3.2=py36_0
- wcwidth=0.1.7=py36_0
- wheel=0.29.0=py36_0
- xz=5.2.2=1
- zlib=1.2.8=3
- pip:
  - ipython-genutils==0.2.0
  - jupyter-client==5.0.1
  - jupyter-core==4.3.0
  - prompt-toolkit==1.0.14
  - pygpu==0.6.2
prefix: /Users/Peter/anaconda/envs/pymc

Smaller WAIC in PyMC3

Just wondering if anybody has dug into the code to see why the WAIC returned by the book and the rethinking package is sometimes larger than the one we compute using pm.compare. It appears in some models but not in others. I don't think it is too much of a problem, as the ordering returned by both is the same.
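
For anyone who wants to dig into this, here is a minimal sketch (assuming a PyMC3 3.x API; the model and trace names are placeholders, not objects from the notebooks) for printing each model's WAIC directly, alongside whatever pm.compare reports:

import pymc3 as pm

# Hypothetical fitted model/trace pairs; replace with the actual objects.
models_traces = {'model_a': (model_a, trace_a),
                 'model_b': (model_b, trace_b)}
for name, (model, trace) in models_traces.items():
    # pm.waic reports the WAIC estimate and related quantities for one model
    print(name, pm.waic(trace, model))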

Proposal for Code 5.55 (Ch 5)

It seems to me that Code 5.55, as it presently stands, does not reflect the "index variable" approach introduced in the book. Hence, I propose the following code:

with pm.Model() as m5_16_alt:
    # one intercept per clade, indexed by clade_id (the "index variable")
    a = pm.Normal('a', mu=0.6, sd=10, shape=len(d['clade_id'].unique()))
    mu = pm.Deterministic('mu', a[d['clade_id'].values])
    sigma = pm.Uniform('sigma', lower=0, upper=10)
    kcal_per_g = pm.Normal('kcal_per_g', mu=mu, sd=sigma, observed=d['kcal.per.g'])
    trace_5_16_alt = pm.sample(1000, tune=1000)

The proposed code includes a shape parameter for the variable a and uses the index variable the way the author of the book intended. Is my reasoning correct?
The summary produces the following output (excerpt):

a:

  Mean             SD               MC Error         89% HPD interval
  -------------------------------------------------------------------
  
  0.544            0.044            0.001            [0.482, 0.620]
  0.713            0.044            0.001            [0.641, 0.781]
  0.788            0.054            0.002            [0.709, 0.882]
  0.506            0.059            0.002            [0.407, 0.593]

sigma:

  Mean             SD               MC Error         89% HPD interval
  -------------------------------------------------------------------
  
  0.131            0.019            0.001            [0.101, 0.159]

Compare this output with the book (page 159).
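
For reference, an excerpt like the one above can be produced with something along these lines (a sketch, assuming a PyMC3 3.x pm.summary in which alpha=0.11 yields 89% intervals):

pm.summary(trace_5_16_alt, varnames=['a', 'sigma'], alpha=0.11)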

Pip failed to install PyMC3

Hello,
Thanks very much for putting this package together.

I tried to install with:
git clone https://github.com/aloctavodia/Statistical-Rethinking-with-Python-and-PyMC3.git
and
conda env create -f environment.yml

But I got a warning:
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.

Everything progressed seemingly OK until:

Installing pip dependencies: \ Ran pip subprocess with arguments:
['/Users/me/opt/anaconda3/envs/stat-rethink-pymc3/bin/python', '-m', 'pip', 'install', '-U', '-r', '/Users/me/Dropbox/Coding/McElreath/Statistical-Rethinking-with-Python-and-PyMC3/condaenv.35uxzzws.requirements.txt']
Pip subprocess output:
Collecting git+git://github.com/pymc-devs/pymc3.git@master (from -r /Users/me/Dropbox/Coding/McElreath/Statistical-Rethinking-with-Python-and-PyMC3/condaenv.35uxzzws.requirements.txt (line 1))
  Cloning git://github.com/pymc-devs/pymc3.git (to revision master) to /private/var/folders/w9/f07_8c6j6lx90vr9hksp9r3h0000gq/T/pip-req-build-t1jvdh8u

Pip subprocess error:
  Running command git clone --filter=blob:none --quiet git://github.com/pymc-devs/pymc3.git /private/var/folders/w9/f07_8c6j6lx90vr9hksp9r3h0000gq/T/pip-req-build-t1jvdh8u
  fatal: unable to connect to github.com:
  github.com[0: 140.82.121.3]: errno=Operation timed out

  error: subprocess-exited-with-error

  × git clone --filter=blob:none --quiet git://github.com/pymc-devs/pymc3.git /private/var/folders/w9/f07_8c6j6lx90vr9hksp9r3h0000gq/T/pip-req-build-t1jvdh8u did not run successfully.
  │ exit code: 128
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet git://github.com/pymc-devs/pymc3.git /private/var/folders/w9/f07_8c6j6lx90vr9hksp9r3h0000gq/T/pip-req-build-t1jvdh8u did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

failed

CondaEnvException: Pip failed

Would appreciate any help possible!

New code for Statistical Rethinking 2nd edition

I've been working through the not-yet-released 2nd edition of McElreath's book and have been using this repository to help guide me along the way. It's been extremely helpful, thanks! As of about ch. 6 there is enough difference between the versions that it has become quite difficult to use. Any interest in getting a head start on the 2nd edition code translations, so that when he releases it the code will be available in Python as well? The 2nd edition can be found on his website, with the password given in one of the first 2019 YouTube videos.

Chapter 1?

I might be missing something... but is there a chapter 1?

Proposal for code 6.12

This is the proposal for code 6.12 in Python

import numpy as np
import scipy.stats as stats
import pymc3 as pm
import theano

def sim_train_test(N=20, k=3, rho=[0.15, -0.4], b_sigma=100):
    
    n_dim = 1 + len(rho)
    if n_dim < k:
        n_dim = k
    Rho = np.diag(np.ones(n_dim))
    Rho[0, 1:3:1] = rho
    i_lower = np.tril_indices(n_dim, -1)
    Rho[i_lower] = Rho.T[i_lower]
    
    x_train = stats.multivariate_normal.rvs(cov=Rho, size=N)
    x_test = stats.multivariate_normal.rvs(cov=Rho, size=N)
    
    mm_train = np.ones((N,1))
    
    mm_train = np.concatenate([mm_train, x_train[:, 1:k]], axis=1)  # design matrix: intercept + k-1 predictors
    
    # Build the model with PyMC3
    
    with pm.Model() as m_sim:
        vec_V = pm.MvNormal('vec_V', mu=0, tau=b_sigma * np.eye(k),
                            shape=(1, k), testval=np.random.randn(1, k) * .01)
        mu = pm.Deterministic('mu', theano.tensor.dot(mm_train, vec_V.T).flatten())
        y = pm.Normal('y', mu=mu, sd=1, observed=x_train[:, 0])
        trace_m_sim = pm.sample(1000, tune=1000)
        
        
    # posterior means of the k regression coefficients
    vec = np.asarray(pm.summary(trace_m_sim)['mean'][:k])
    
    dev_train = -2 * np.sum(stats.norm.logpdf(x_train[:, 0], loc=np.matmul(mm_train, vec), scale=1))
    
    mm_test = np.ones((N, 1))
    
    mm_test = np.concatenate([mm_test, x_test[:, 1:k]], axis=1)
    
    dev_test = -2 * np.sum(stats.norm.logpdf(x_test[:, 0], loc=np.matmul(mm_test, vec), scale=1))
    
#    print('Deviation for training', '\t', 'Deviation for test')
#    print(np.mean(dev_train), '\t', np.mean(dev_test))
    
    return np.mean(dev_train), np.mean(dev_test)
    
    
n = 20
tries = 50
param = 5
r = np.zeros(shape=(param - 1, 4))

for j in range(2, param + 1):   # j = number of parameters in the fitted model
    print(j)
    train = []                  # reset for each parameter count
    test = []
    for i in range(tries):
        tr, te = sim_train_test(N=n, k=j)
        train.append(tr)
        test.append(te)
    r[j - 2, :] = np.mean(train), np.std(train, ddof=1), np.mean(test), np.std(test, ddof=1)

Unfortunately, it is slower than the R version (it's very, very slow). I know this code doesn't have comments, and to understand it you have to look at this one. I will try to optimize it (although I don't know how).

If anyone has any ideas, I'd like to hear them.

postcheck not implemented

I know there are some functions from the rethinking package that we need to do without. I was wondering, however, if there is an implementation of postcheck which plots posterior predictions that can be validated against actual data. This is very helpful when reviewing models to see if they fit the data well or not.

You can find the function on Richard's github page. It's used, for example, on page 306 of the book, code 10.27.
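
In the meantime, here is a rough, hypothetical sketch of a postcheck-style plot built from posterior predictive samples. It assumes a fitted PyMC3 model m with an observed variable named 'y', a trace named trace, and the observed outcomes in a 1-D array y_obs; all of these are placeholder names, not objects from the notebooks.

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

with m:
    ppc = pm.sample_ppc(trace)  # newer PyMC3 versions: pm.sample_posterior_predictive

y_pred = ppc['y']               # shape (n_samples, n_obs)
mean_pred = y_pred.mean(axis=0)
hpd_pred = pm.hpd(y_pred)       # per-observation HPD interval, shape (n_obs, 2)

idx = np.arange(len(mean_pred))
plt.plot(idx, mean_pred, 'C0o', label='posterior prediction')
plt.vlines(idx, hpd_pred[:, 0], hpd_pred[:, 1], color='C0', alpha=0.5)
plt.plot(idx, y_obs, 'C1x', label='observed')
plt.xlabel('case')
plt.ylabel('outcome')
plt.legend();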

Update Links in README

More so for my own personal tracking.

Need to update the links in the PyMC3 main README so they point to the new repository for the IPython notebooks.

Why should both betas have same distribution in quadratic regression?

In 4.66, we set beta with shape=2 so it generates two numbers for our quadratic model, meaning both are generated from the same distribution, whereas the book generates them from two different distributions. Why would we do this? Doesn't this decrease the flexibility of the model?

with pm.Model() as m_4_5:
    alpha = pm.Normal('alpha', mu=178, sd=100)
    
    beta = pm.Normal('beta', mu=0, sd=10, shape=2) # shape of 2 generates two betas
    sigma = pm.Uniform('sigma', lower=0, upper=50)
    mu = pm.Deterministic('mu', alpha + beta[0] * d.weight_std + beta[1] * d.weight_std2)
    height = pm.Normal('height', mu=mu, sd=sigma, observed=d.height)
    trace_4_5 = pm.sample(1000, tune=1000)
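
For comparison, here is a sketch of the book-style parameterisation with two separately declared coefficients (beta1 and beta2 are just illustrative names; the priors are the same N(0, 10) used in the shape=2 version above):

with pm.Model() as m_4_5_alt:
    alpha = pm.Normal('alpha', mu=178, sd=100)
    beta1 = pm.Normal('beta1', mu=0, sd=10)  # coefficient for the linear term
    beta2 = pm.Normal('beta2', mu=0, sd=10)  # coefficient for the quadratic term
    sigma = pm.Uniform('sigma', lower=0, upper=50)
    mu = pm.Deterministic('mu', alpha + beta1 * d.weight_std + beta2 * d.weight_std2)
    height = pm.Normal('height', mu=mu, sd=sigma, observed=d.height)
    trace_4_5_alt = pm.sample(1000, tune=1000)

Either way each coefficient is an independent parameter; with shape=2 the two betas merely share one prior specification, so the compact form should not reduce the model's flexibility.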

Error in code 10.10

If I run this cell:

rt = chimp_ensemble['pulled_left']
pred_mean = np.zeros((1000, 4))
cond = d.condition.unique()
prosoc_l = d.prosoc_left.unique()

for i in range(len(rt)):
    
    tmp = []
    for cp in cond:
        for pl in prosoc_l:
            tmp.append(np.mean(rt[i][(d.prosoc_left==pl) & (d.chose_prosoc==cp)]))
    pred_mean[i, :] = tmp
        
ticks = range(4)
mp = pred_mean.mean(0)
hpd = pm.hpd(pred_mean)

plt.figure(figsize=(13,6))
plt.fill_between(ticks, hpd[:,1], hpd[:,0], alpha=0.25, color='k')
plt.plot(mp, color='k')
plt.xticks(ticks, ("0/0","1/0","0/1","1/1"))
chimps = d.groupby(['actor', 'prosoc_left', 'condition']).agg('mean')['pulled_left'].values.reshape(7, -1)
for i in range(7):
    plt.plot(chimps[i], 'C0')

plt.ylim(0, 1.1);

I get this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-18-509a6637b6be> in <module>()
     11     for cp in cond:
     12         for pl in prosoc_l:
---> 13             tmp.append(np.mean(rt[i][(d.prosoc_left==pl) & (d.chose_prosoc==cp)]))
     14     pred_mean[i, :] = tmp
     15 

IndexError: too many indices for array

which is very strange. Fortunately, after an hour, I found a fix. Change the loop to this one:

for i in range(len(rt)):
    
    tmp = []
    if rt[i].shape == ():
        continue
    for cp in cond:
        for pl in prosoc_l:
            tmp.append(np.mean(rt[i][(d.prosoc_left==pl) & (d.chose_prosoc==cp)]))
    pred_mean[i, :] = tmp

Chapter 4 code 4.38 and 4.39 missing a comma.

The commented line for model m4_3 is missing a comma between 'mu' and alpha.
#mu = pm.Deterministic('mu' alpha + beta * d2.weight) # try uncomenting this line and comenting the above line

5.33 sum of posteriors

Why is the posterior distribution of the sum of bl and br not on the same scale (centered at 2.05) as shown in the book (top of page 144)?
Thanks.

Proposal for code 6.24

The pm.compare function now uses a different syntax (I am using version 3.5). Instead of the old syntax, which takes two lists for traces and models respectively, the new syntax takes a dictionary with model:trace as key-value pairs.

The old code in 6.24:

compare_df = pm.compare([trace_m6_11, trace_m6_12, trace_m6_13, trace_m6_14], 
                        [m6_11, m6_12, m6_13, m6_14], 
                        method='pseudo-BMA')

should be revised to

compare_df = pm.compare({m6_11: trace_m6_11, 
                         m6_12: trace_m6_12, 
                         m6_13: trace_m6_13, 
                         m6_14: trace_m6_14}, 
                         method='pseudo-BMA')

Proposal for code 6.14

This is the proposal for code 6.14 in Python

num_param = np.arange(2, param + 1)

plt.figure()
plt.scatter(num_param, r[:, 0], color='C0')
plt.xticks(num_param)

for j in range(param - 1):
    plt.vlines(num_param[j], r[j,0] - r[j, 1], r[j,0] + r[j,1], color='C0', 
               zorder=-1, alpha=0.50)

plt.scatter(num_param + 0.1, r[:, 2], facecolors='none', edgecolors='k')

for j in range(param - 1):
    plt.vlines(num_param[j] + 0.1, r[j,2] - r[j, 3], r[j,2] + r[j,3], color='k', 
               zorder=-2, alpha=0.50)    

dist = 0.20
plt.text(num_param[1] - dist, r[1, 0] - dist, 'in', color='C0')
plt.text(num_param[1] + dist, r[1, 2] - dist, 'out', color='k')
plt.text(num_param[1] + dist, r[1, 2] + r[1,3] - dist, '+1 SD', color='k', fontsize=9)
plt.text(num_param[1] + dist, r[1, 2] - r[1,3] - dist, '-1 SD', color='k', fontsize=9)
plt.xlabel('Number of parameters')
plt.ylabel('Deviance')
plt.title('N = {}'.format(n))
plt.show()

Remember that you have to provide: n, the number of cases; tries, the number of simulations; and param, the number of parameters.

n = 20
tries = 50
param = 5
r = np.zeros(shape=(param - 1, 4))

for j in range(2, param + 1):   # j = number of parameters in the fitted model
    print(j)
    train = []                  # reset for each parameter count
    test = []
    for i in range(tries):
        tr, te = sim_train_test(N=n, k=j)
        train.append(tr)
        test.append(te)
    r[j - 2, :] = np.mean(train), np.std(train, ddof=1), np.mean(test), np.std(test, ddof=1)

Code in 4.54,4.58

Can't understand this piece of code in 4.54 and 4.58:

weigth_seq = np.arange(25, 71)
# Given that we have a lot of samples we can use less of them for plotting (or we can use all!)
chain_N_thinned = chain_N[::10]
mu_pred = np.zeros((len(weigth_seq), len(chain_N_thinned)*chain_N.nchains))
for i, w in enumerate(weigth_seq):
    mu_pred[i] = chain_N_thinned['alpha'] + chain_N_thinned['beta'] * w

What are chain_N[::10] and chain_N.nchains doing? What is mu_pred supposed to be?
