aloctavodia / statistical-rethinking-with-python-and-pymc3

Python/PyMC3 port of the examples in "Statistical Rethinking: A Bayesian Course with Examples in R and Stan" by Richard McElreath

pymc bayesian-data-analysis statistics python data-science

statistical-rethinking-with-python-and-pymc3's Introduction

Statistical Rethinking with Python and PyMC3

This repository has been deprecated in favour of this one; please check that repository for updates, and to open issues or send pull requests.

Statistical Rethinking is an incredibly good introductory book on Bayesian statistics; it follows a Jaynesian and practical approach, with very good examples and clear explanations.

In this repository we port the code in the book (originally written in R and Stan) to PyMC3. We try to keep the examples as close as possible to those in the book, while expressing them in the most Pythonic and PyMC3onic way we can.

Display notebooks

View Jupyter notebooks in nbviewer

Contributing

If you want to contribute, please send your pull request to the repository mentioned above. All contributions are welcome!

Installing the dependencies

To install the dependencies to run these notebooks, you can use Anaconda. Once you have installed Anaconda, run:

conda env create -f environment.yml

to install all the dependencies into an isolated environment. You can switch to this environment by running:

source activate stat-rethink-pymc3
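
Once the environment is activated, a quick sanity check (just a suggestion; the exact versions depend on what environment.yml pins) is to start Python and run:

import pymc3 as pm
import numpy as np

print(pm.__version__)  # should print the PyMC3 version installed in the environment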

Creative Commons License
Statistical Rethinking with Python and PyMC3 by All Contributors is licensed under a Creative Commons Attribution 4.0 International License.

statistical-rethinking-with-python-and-pymc3's People

Contributors

agustinaarroyuelo, aloctavodia, arokem, austinrochford, betatim, canyon289, femtotrader, gladomat, junpenglao, napsternxg, nigeljyng, pmbaumgartner, zonca


statistical-rethinking-with-python-and-pymc3's Issues

Work-in-progress tracking

Is there a way you'd like to track progress on this? I'd love to contribute, but don't want to duplicate a chapter someone is already working on.

Chp_04 doesn't load

I get the message saying:

Sorry, something went wrong. Reload?

I believe this happens for other chapters as well.

Rename environment name in environment.yml

I think the environment name pymc in environment.yml is a bit confusing and may cause conflicts with the user's existing environments. A better name might be stat-rethink-pymc3.

This will require changes in the README.md and environment.yml.
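
For illustration, the change in environment.yml would only be its first line (assuming the rest of the file stays untouched):

name: stat-rethink-pymc3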

Enhancement of the notebooks

Since almost all the chapters are ported (I will also work on Chapter 14), maybe it is good to

  • add legends to the figures in the early chapters.
  • reproduce the rest of the figures shown in the book. Some of the figures relate to analyses that are presented without code.

Conda Environment File

It might be helpful for contributors to have a conda environment file so they can quickly create a new environment and get things rolling. I have one I'll paste below, but I haven't checked it against all of the notebooks to make sure everything necessary is included and anything extraneous is not.

Here's mine:

name: pymc
channels:
- conda-forge/label/rc
- conda-forge
- defaults
dependencies:
- cycler=0.10.0=py36_0
- freetype=2.7=1
- joblib=0.11=py36_0
- libgpuarray=0.6.2=np112py36_0
- libpng=1.6.28=0
- mako=1.0.6=py36_0
- matplotlib=2.0.0=np112py36_3
- nose=1.3.7=py36_2
- pandas=0.19.2=np112py36_1
- patsy=0.4.1=py36_0
- pyparsing=2.2.0=py36_0
- pytz=2017.2=py36_0
- theano=0.9.0=py36_0
- tqdm=4.11.2=py36_0
- pymc3=3.1rc3=py36_0
- appnope=0.1.0=py36_0
- bleach=1.5.0=py36_0
- decorator=4.0.11=py36_0
- entrypoints=0.2.2=py36_1
- h5py=2.7.0=np112py36_0
- hdf5=1.8.17=1
- html5lib=0.999=py36_0
- ipykernel=4.6.1=py36_0
- ipython=5.3.0=py36_0
- ipython_genutils=0.2.0=py36_0
- jinja2=2.9.6=py36_0
- jsonschema=2.5.1=py36_0
- jupyter_client=5.0.1=py36_0
- jupyter_core=4.3.0=py36_0
- markupsafe=0.23=py36_2
- mistune=0.7.4=py36_0
- mkl=2017.0.1=0
- mkl-service=1.1.2=py36_3
- nbconvert=5.1.1=py36_0
- nbformat=4.3.0=py36_0
- notebook=5.0.0=py36_0
- numpy=1.12.1=py36_0
- openssl=1.0.2k=1
- pandocfilters=1.4.1=py36_0
- path.py=10.1=py36_0
- pexpect=4.2.1=py36_0
- pickleshare=0.7.4=py36_0
- pip=9.0.1=py36_1
- prompt_toolkit=1.0.14=py36_0
- ptyprocess=0.5.1=py36_0
- pygments=2.2.0=py36_0
- python=3.6.1=0
- python-dateutil=2.6.0=py36_0
- pyzmq=16.0.2=py36_0
- readline=6.2=2
- scipy=0.19.0=np112py36_0
- seaborn=0.7.1=py36_0
- setuptools=27.2.0=py36_0
- simplegeneric=0.8.1=py36_1
- six=1.10.0=py36_0
- sqlite=3.13.0=0
- terminado=0.6=py36_0
- testpath=0.3=py36_0
- tk=8.5.18=0
- tornado=4.4.2=py36_0
- traitlets=4.3.2=py36_0
- wcwidth=0.1.7=py36_0
- wheel=0.29.0=py36_0
- xz=5.2.2=1
- zlib=1.2.8=3
- pip:
  - ipython-genutils==0.2.0
  - jupyter-client==5.0.1
  - jupyter-core==4.3.0
  - prompt-toolkit==1.0.14
  - pygpu==0.6.2
prefix: /Users/Peter/anaconda/envs/pymc

Smaller WAIC in PyMC3

Just wondering if anybody has dug into the code to see why the WAIC returned by the book and the rethinking package is sometimes larger than the one we compute using pm.compare. It appears in some models but not in others. I don't think it is too much of a problem, as the ordering returned by both is the same.
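
For anyone who wants to dig into this, here is a minimal sketch (assuming a PyMC3 3.x API; the model and trace names are placeholders, not objects from the notebooks) for printing each model's WAIC directly, alongside whatever pm.compare reports:

import pymc3 as pm

# Hypothetical fitted model/trace pairs; replace with the actual objects.
models_traces = {'model_a': (model_a, trace_a),
                 'model_b': (model_b, trace_b)}
for name, (model, trace) in models_traces.items():
    # pm.waic reports the WAIC estimate and related quantities for one model
    print(name, pm.waic(trace, model))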

Proposal for Code 5.55 (Ch 5)

It seems to me that Code 5.55, as it presently stands, does not reflect the "index variable" approach introduced in the book. Hence, I propose the following code:

with pm.Model() as m5_16_alt:
    # one intercept per clade, indexed by clade_id (the "index variable")
    a = pm.Normal('a', mu=0.6, sd=10, shape=len(d['clade_id'].unique()))
    mu = pm.Deterministic('mu', a[d['clade_id'].values])
    sigma = pm.Uniform('sigma', lower=0, upper=10)
    kcal_per_g = pm.Normal('kcal_per_g', mu=mu, sd=sigma, observed=d['kcal.per.g'])
    trace_5_16_alt = pm.sample(1000, tune=1000)

The proposed code includes a shape parameter for the variable a and uses the index variable the way the author of the book intended. Is my reasoning correct?
The summary produces the following output (excerpt):

a:

  Mean             SD               MC Error         89% HPD interval
  -------------------------------------------------------------------
  
  0.544            0.044            0.001            [0.482, 0.620]
  0.713            0.044            0.001            [0.641, 0.781]
  0.788            0.054            0.002            [0.709, 0.882]
  0.506            0.059            0.002            [0.407, 0.593]

sigma:

  Mean             SD               MC Error         89% HPD interval
  -------------------------------------------------------------------
  
  0.131            0.019            0.001            [0.101, 0.159]

Compare this output with the book (page 159).
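
For reference, an excerpt like the one above can be produced with something along these lines (a sketch, assuming a PyMC3 3.x pm.summary in which alpha=0.11 yields 89% intervals):

pm.summary(trace_5_16_alt, varnames=['a', 'sigma'], alpha=0.11)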

Pip failed to install PyMC3

Hello,
Thanks very much for putting this package together.

I tried to install with:
git clone https://github.com/aloctavodia/Statistical-Rethinking-with-Python-and-PyMC3.git
and
conda env create -f environment.yml

But I got a warning:
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.

Everything progressed seemingly OK until:

Installing pip dependencies: \ Ran pip subprocess with arguments:
['/Users/me/opt/anaconda3/envs/stat-rethink-pymc3/bin/python', '-m', 'pip', 'install', '-U', '-r', '/Users/me/Dropbox/Coding/McElreath/Statistical-Rethinking-with-Python-and-PyMC3/condaenv.35uxzzws.requirements.txt']
Pip subprocess output:
Collecting git+git://github.com/pymc-devs/pymc3.git@master (from -r /Users/me/Dropbox/Coding/McElreath/Statistical-Rethinking-with-Python-and-PyMC3/condaenv.35uxzzws.requirements.txt (line 1))
  Cloning git://github.com/pymc-devs/pymc3.git (to revision master) to /private/var/folders/w9/f07_8c6j6lx90vr9hksp9r3h0000gq/T/pip-req-build-t1jvdh8u

Pip subprocess error:
  Running command git clone --filter=blob:none --quiet git://github.com/pymc-devs/pymc3.git /private/var/folders/w9/f07_8c6j6lx90vr9hksp9r3h0000gq/T/pip-req-build-t1jvdh8u
  fatal: unable to connect to github.com:
  github.com[0: 140.82.121.3]: errno=Operation timed out

  error: subprocess-exited-with-error

  × git clone --filter=blob:none --quiet git://github.com/pymc-devs/pymc3.git /private/var/folders/w9/f07_8c6j6lx90vr9hksp9r3h0000gq/T/pip-req-build-t1jvdh8u did not run successfully.
  │ exit code: 128
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet git://github.com/pymc-devs/pymc3.git /private/var/folders/w9/f07_8c6j6lx90vr9hksp9r3h0000gq/T/pip-req-build-t1jvdh8u did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

failed

CondaEnvException: Pip failed

Would appreciate any help possible!

New code for Statistical Rethinking 2nd edition

I've been working through the not-yet-released 2nd edition of McElreath's book and have been using this repository to help guide me along the way. It's been extremely helpful, thanks! As of about ch. 6 there is enough difference between the versions that it has become quite difficult to use. Any interest in getting a head start on the 2nd edition code translations, so that when he releases it the code will be available in Python as well? The 2nd edition can be found on his website, with the password given in one of the first 2019 YouTube videos.

Chapter 1?

I might be missing something... but is there a chapter 1?

Proposal for code 6.12

This is the proposal for code 6.12 in Python

import numpy as np
import scipy.stats as stats
import pymc3 as pm
import theano

def sim_train_test(N=20, k=3, rho=[0.15, -0.4], b_sigma=100):
    
    n_dim = 1 + len(rho)
    if n_dim < k:
        n_dim = k
    Rho = np.diag(np.ones(n_dim))
    Rho[0, 1:3:1] = rho
    i_lower = np.tril_indices(n_dim, -1)
    Rho[i_lower] = Rho.T[i_lower]
    
    x_train = stats.multivariate_normal.rvs(cov=Rho, size=N)
    x_test = stats.multivariate_normal.rvs(cov=Rho, size=N)
    
    mm_train = np.ones((N,1))
    
    mm_train = np.concatenate([mm_train, x_train[:, 1:k]], axis=1)  # design matrix: intercept + k-1 predictors
    
    # Build the model with PyMC3
    
    with pm.Model() as m_sim:
        vec_V = pm.MvNormal('vec_V', mu=0, tau=b_sigma * np.eye(k),
                            shape=(1, k), testval=np.random.randn(1, k) * .01)
        mu = pm.Deterministic('mu', theano.tensor.dot(mm_train, vec_V.T).flatten())
        y = pm.Normal('y', mu=mu, sd=1, observed=x_train[:, 0])
        trace_m_sim = pm.sample(1000, tune=1000)
        
        
    # posterior means of the k regression coefficients
    vec = np.asarray(pm.summary(trace_m_sim)['mean'][:k])
    
    dev_train = -2 * np.sum(stats.norm.logpdf(x_train[:, 0], loc=np.matmul(mm_train, vec), scale=1))
    
    mm_test = np.ones((N, 1))
    
    mm_test = np.concatenate([mm_test, x_test[:, 1:k]], axis=1)
    
    dev_test = -2 * np.sum(stats.norm.logpdf(x_test[:, 0], loc=np.matmul(mm_test, vec), scale=1))
    
#    print('Deviation for training', '\t', 'Deviation for test')
#    print(np.mean(dev_train), '\t', np.mean(dev_test))
    
    return np.mean(dev_train), np.mean(dev_test)
    
    
n = 20
tries = 50
param = 5
r = np.zeros(shape=(param - 1, 4))

for j in range(2, param + 1):   # j = number of parameters in the fitted model
    print(j)
    train = []                  # reset for each parameter count
    test = []
    for i in range(tries):
        tr, te = sim_train_test(N=n, k=j)
        train.append(tr)
        test.append(te)
    r[j - 2, :] = np.mean(train), np.std(train, ddof=1), np.mean(test), np.std(test, ddof=1)

Unfortunately, it is slower than the R version (it's very, very slow). I know this code doesn't have comments, and to understand it you have to look at this one. I will try to optimize it (although I don't know how).

If anyone has any ideas, I'd like to hear them.

postcheck not implemented

I know there are some functions from the rethinking package that we need to do without. I was wondering, however, if there is an implementation of postcheck which plots posterior predictions that can be validated against actual data. This is very helpful when reviewing models to see if they fit the data well or not.

You can find the function on Richard's github page. It's used, for example, on page 306 of the book, code 10.27.
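
In the meantime, here is a rough, hypothetical sketch of a postcheck-style plot built from posterior predictive samples. It assumes a fitted PyMC3 model m with an observed variable named 'y', a trace named trace, and the observed outcomes in a 1-D array y_obs; all of these are placeholder names, not objects from the notebooks.

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

with m:
    ppc = pm.sample_ppc(trace)  # newer PyMC3 versions: pm.sample_posterior_predictive

y_pred = ppc['y']               # shape (n_samples, n_obs)
mean_pred = y_pred.mean(axis=0)
hpd_pred = pm.hpd(y_pred)       # per-observation HPD interval, shape (n_obs, 2)

idx = np.arange(len(mean_pred))
plt.plot(idx, mean_pred, 'C0o', label='posterior prediction')
plt.vlines(idx, hpd_pred[:, 0], hpd_pred[:, 1], color='C0', alpha=0.5)
plt.plot(idx, y_obs, 'C1x', label='observed')
plt.xlabel('case')
plt.ylabel('outcome')
plt.legend();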

Update Links in README

More so for my own personal tracking.

Need to update the links in the PyMC3 main README so they point to the new repository for the IPython notebooks.

Why should both betas have same distribution in quadratic regression?

In 4.66, we set beta with shape=2 so it generates two numbers for our quadratic model, meaning both are generated from the same distribution, whereas the book generates them from two different distributions. Why would we do this? Doesn't this decrease the flexibility of the model?

with pm.Model() as m_4_5:
    alpha = pm.Normal('alpha', mu=178, sd=100)
    
    beta = pm.Normal('beta', mu=0, sd=10, shape=2) # shape of 2 generates two betas
    sigma = pm.Uniform('sigma', lower=0, upper=50)
    mu = pm.Deterministic('mu', alpha + beta[0] * d.weight_std + beta[1] * d.weight_std2)
    height = pm.Normal('height', mu=mu, sd=sigma, observed=d.height)
    trace_4_5 = pm.sample(1000, tune=1000)
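
For comparison, here is a sketch of the book-style parameterisation with two separately declared coefficients (beta1 and beta2 are just illustrative names; the priors are the same N(0, 10) used in the shape=2 version above):

with pm.Model() as m_4_5_alt:
    alpha = pm.Normal('alpha', mu=178, sd=100)
    beta1 = pm.Normal('beta1', mu=0, sd=10)  # coefficient for the linear term
    beta2 = pm.Normal('beta2', mu=0, sd=10)  # coefficient for the quadratic term
    sigma = pm.Uniform('sigma', lower=0, upper=50)
    mu = pm.Deterministic('mu', alpha + beta1 * d.weight_std + beta2 * d.weight_std2)
    height = pm.Normal('height', mu=mu, sd=sigma, observed=d.height)
    trace_4_5_alt = pm.sample(1000, tune=1000)

Either way each coefficient is an independent parameter; with shape=2 the two betas merely share one prior specification, so the compact form should not reduce the model's flexibility.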

Error in code 10.10

If I run this cell:

rt = chimp_ensemble['pulled_left']
pred_mean = np.zeros((1000, 4))
cond = d.condition.unique()
prosoc_l = d.prosoc_left.unique()

for i in range(len(rt)):
    
    tmp = []
    for cp in cond:
        for pl in prosoc_l:
            tmp.append(np.mean(rt[i][(d.prosoc_left==pl) & (d.chose_prosoc==cp)]))
    pred_mean[i, :] = tmp
        
ticks = range(4)
mp = pred_mean.mean(0)
hpd = pm.hpd(pred_mean)

plt.figure(figsize=(13,6))
plt.fill_between(ticks, hpd[:,1], hpd[:,0], alpha=0.25, color='k')
plt.plot(mp, color='k')
plt.xticks(ticks, ("0/0","1/0","0/1","1/1"))
chimps = d.groupby(['actor', 'prosoc_left', 'condition']).agg('mean')['pulled_left'].values.reshape(7, -1)
for i in range(7):
    plt.plot(chimps[i], 'C0')

plt.ylim(0, 1.1);

I get this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-18-509a6637b6be> in <module>()
     11     for cp in cond:
     12         for pl in prosoc_l:
---> 13             tmp.append(np.mean(rt[i][(d.prosoc_left==pl) & (d.chose_prosoc==cp)]))
     14     pred_mean[i, :] = tmp
     15 

IndexError: too many indices for array

which is very strange. Fortunately, after an hour, I found a fix. Change the loop to this one:

for i in range(len(rt)):
    
    tmp = []
    if rt[i].shape == ():
        continue
    for cp in cond:
        for pl in prosoc_l:
            tmp.append(np.mean(rt[i][(d.prosoc_left==pl) & (d.chose_prosoc==cp)]))
    pred_mean[i, :] = tmp

Chapter 4 code 4.38 and 4.39 missing a comma.

The commented line for model m4_3 is missing a comma between 'mu' and alpha.
#mu = pm.Deterministic('mu' alpha + beta * d2.weight) # try uncomenting this line and comenting the above line

5.33 sum of posteriors

Why is the posterior distribution of the sum of bl and br not on the same scale (centered at 2.05) as shown in the book (top of page 144)?
Thanks.

Proposal for code 6.24

The pm.compare function now uses a different syntax (I am using version 3.5). Instead of the old syntax, which takes two lists for traces and models respectively, the new syntax takes a dictionary with model:trace as key-value pairs.

The old code in 6.24:

compare_df = pm.compare([trace_m6_11, trace_m6_12, trace_m6_13, trace_m6_14], 
                        [m6_11, m6_12, m6_13, m6_14], 
                        method='pseudo-BMA')

should be revised to

compare_df = pm.compare({m6_11: trace_m6_11, 
                         m6_12: trace_m6_12, 
                         m6_13: trace_m6_13, 
                         m6_14: trace_m6_14}, 
                         method='pseudo-BMA')

Proposal for code 6.14

This is the proposal for code 6.14 in Python

num_param = np.arange(2, param + 1)

plt.figure()
plt.scatter(num_param, r[:, 0], color='C0')
plt.xticks(num_param)

for j in range(param - 1):
    plt.vlines(num_param[j], r[j,0] - r[j, 1], r[j,0] + r[j,1], color='C0', 
               zorder=-1, alpha=0.50)

plt.scatter(num_param + 0.1, r[:, 2], facecolors='none', edgecolors='k')

for j in range(param - 1):
    plt.vlines(num_param[j] + 0.1, r[j,2] - r[j, 3], r[j,2] + r[j,3], color='k', 
               zorder=-2, alpha=0.50)    

dist = 0.20
plt.text(num_param[1] - dist, r[1, 0] - dist, 'in', color='C0')
plt.text(num_param[1] + dist, r[1, 2] - dist, 'out', color='k')
plt.text(num_param[1] + dist, r[1, 2] + r[1,3] - dist, '+1 SD', color='k', fontsize=9)
plt.text(num_param[1] + dist, r[1, 2] - r[1,3] - dist, '-1 SD', color='k', fontsize=9)
plt.xlabel('Number of parameters')
plt.ylabel('Deviance')
plt.title('N = {}'.format(n))
plt.show()

Remember that you have to provide: n, the number of cases; tries, the number of simulations; and param, the number of parameters.

n = 20
tries = 50
param = 5
r = np.zeros(shape=(param - 1, 4))

for j in range(2, param + 1):   # j = number of parameters in the fitted model
    print(j)
    train = []                  # reset for each parameter count
    test = []
    for i in range(tries):
        tr, te = sim_train_test(N=n, k=j)
        train.append(tr)
        test.append(te)
    r[j - 2, :] = np.mean(train), np.std(train, ddof=1), np.mean(test), np.std(test, ddof=1)

Code in 4.54,4.58

Can't understand this piece of code in 4.54 and 4.58:

weigth_seq = np.arange(25, 71)
# Given that we have a lot of samples we can use less of them for plotting (or we can use all!)
chain_N_thinned = chain_N[::10]
mu_pred = np.zeros((len(weigth_seq), len(chain_N_thinned)*chain_N.nchains))
for i, w in enumerate(weigth_seq):
    mu_pred[i] = chain_N_thinned['alpha'] + chain_N_thinned['beta'] * w

What are chain_N[::10] and chain_N.nchains doing? What is mu_pred supposed to be?
