
arviz-devs / arviz

Stars: 1.6K · Forks: 389 · Size: 113.14 MB

Exploratory analysis of Bayesian models with Python

Home Page: https://python.arviz.org

License: Apache License 2.0

Languages: Python 66.93%, Shell 0.30%, Dockerfile 0.07%, R 0.01%, TeX 1.00%, PowerShell 0.11%, CSS 0.24%, HTML 0.05%, Makefile 0.03%, Jupyter Notebook 31.26%
Topics: bayesian, closember, python


arviz's Issues

Cannot install without matplotlib

On a fresh virtual environment,

pip install -e .

fails with

Obtaining file:///home/colin/projects/arviz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/colin/projects/arviz/setup.py", line 7, in <module>
        from matplotlib import get_configdir
    ModuleNotFoundError: No module named 'matplotlib'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /home/colin/projects/arviz/

I wonder if we can do some matplotlib magic after installing the requirements, or maybe just on import?
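One possible shape for the fix (a sketch, assuming setup.py only needs get_configdir to locate matplotlib's style directory; anything beyond the traceback above is an assumption):

import os

# Sketch: tolerate a missing matplotlib at install time and resolve the
# style directory only when matplotlib is actually importable.
try:
    from matplotlib import get_configdir
    STYLE_DIR = os.path.join(get_configdir(), 'stylelib')
except ImportError:
    STYLE_DIR = None  # matplotlib not installed yet; skip the style files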

`utils.trace_to_dataframe` fails on `pymc3.MultiTrace` input.

Hello everyone,

I just noticed that trace_to_dataframe may fail on entirely valid pymc3.MultiTrace objects. This is due to the following check:

if type(trace).__name__ == "MultiTrace"

I get the intention: one wants to avoid a hard dependency on pymc3.
However, if I do import pymc3 as pm and then construct a trace via pm.backends.base.MultiTrace, the check fails because type(trace).__name__ returns "pm.backends.base.MultiTrace".

I suggest the following approach:

try:
    from pymc3.backends.base import MultiTrace
except ImportError:
    is_multitrace = False  # this still avoids a dependency on pymc3
else:
    is_multitrace = isinstance(trace, MultiTrace)

This avoids a dependency and handles arbitrary ways of importing MultiTrace.
I have code ready for this; if you agree with me on the way forward, I'm willing to provide a pull request.

Best,
MFreidank

autocorr vs autocorrplot

This issue is more of a question; it may well be resolvable without actual code changes. arviz maintains autocorrelation code in stats.diagnostics._autocorr and in plots.autocorrplot (both inherited from the pymc3 codebase), and the two compute the autocorrelation differently: plots.autocorrplot uses np.correlate(y, y, mode=2), whereas stats.diagnostics.autocorr uses fftconvolve(y, y[::-1]). Is there a reason why plots.autocorrplot does not simply use stats.diagnostics.autocorr under the hood?
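For reference, a minimal FFT-based sketch (plain numpy/scipy, not the exact arviz code) of the computation both call sites could share:

import numpy as np
from scipy.signal import fftconvolve

def autocorr(y):
    """Autocorrelation of a 1-D array via FFT-based convolution."""
    y = y - np.mean(y)
    full = fftconvolve(y, y[::-1])   # full autocorrelation, lags -(n-1)..(n-1)
    acorr = full[len(full) // 2:]    # keep the non-negative lags
    return acorr / acorr[0]          # normalize so that lag 0 equals 1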

String formatting instead of rounding

Currently values are rounded and then printed. A cleaner approach is to keep full precision and control the displayed precision with string formatting. This would not affect the location of plotted lines.

Example from plot_post

post_summary['mean'] = round(np.mean(sample), roundto)
...
plt.plot(0, label='mean = {:g}'.format(post_summary['mean']), alpha=0)

to

post_summary['mean'] = np.mean(sample)
...
plt.plot(0, label='mean = {:.{roundto}g}'.format(post_summary['mean'], roundto=roundto), alpha=0)

Base for numba version of n_effective

Just so that we don't lose this code...

import numba
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from scipy import fftpack


@numba.jit(nopython=True)
def mult_conj(a):
    """Multiply a with conj(a) and store in a.
    
    [Re, Re, Im, Re, Im...]
    like output from scipy.fftpack.rfft
    """
    n = len(a)
    if n == 0:
        return
    a[0] = a[0] * a[0]
    for i in range(1, (n + 1) // 2):
        j, k = 2 * i - 1, 2 * i
        a[j] = a[j] * a[j] + a[k] * a[k]
        a[k] = 0
    if n > 1 and n % 2 == 0:
        a[n - 1] = a[n - 1] * a[n - 1]

        
@numba.jit
def autocov2(x):
    assert len(x.shape) == 2
    y = x - np.mean(x, axis=-1, keepdims=True)
    n = y.shape[-1]
    shape = 2 * n - 1
    
    fft_len = fftpack.next_fast_len(shape)
    fx = fftpack.rfft(y, fft_len, overwrite_x=False)
    for i in range(x.shape[0]):
        mult_conj(fx[i, :])
    out = fftpack.irfft(fx, fft_len, overwrite_x=False)
    out = out[..., :n]
    out[:] /= np.arange(n, 0, -1)
    return out


@numba.jit(nopython=True)
def get_neff_inner(trace_value, acov):
    """Compute the effective sample size for a 2D array
    """
    n_chains, n_samples = trace_value.shape
    assert acov.shape == (n_chains, n_samples)

    chain_mean = np.sum(trace_value, 1) / n_samples

    chain_var = acov[:, 0] * n_samples / (n_samples - 1.)
    acov_t = acov[:, 1] * n_samples / (n_samples - 1.)
    mean_var = np.mean(chain_var)
    var_plus = mean_var * (n_samples - 1.) / n_samples
    var_plus += np.var(chain_mean) * n_chains / (n_chains - 1.)

    rho_hat_t = np.zeros(n_samples)
    rho_hat_even = 1.
    rho_hat_t[0] = rho_hat_even
    rho_hat_odd = 1. - (mean_var - np.mean(acov_t)) / var_plus
    rho_hat_t[1] = rho_hat_odd
    max_t = 1
    t = 1

    acov_means = np.sum(acov, 0) / n_chains
    while t < (n_samples - 2) and (rho_hat_even + rho_hat_odd) >= 0.:
        rho_hat_even = 1. - (mean_var - acov_means[t + 1]) / var_plus
        rho_hat_odd = 1. - (mean_var - acov_means[t + 2]) / var_plus
        if (rho_hat_even + rho_hat_odd) >= 0:
            rho_hat_t[t + 1] = rho_hat_even
            rho_hat_t[t + 2] = rho_hat_odd
        max_t = t + 2
        t += 2

    t = 3
    while t <= max_t - 2:
        if (rho_hat_t[t + 1] + rho_hat_t[t + 2]) > (rho_hat_t[t - 1] + rho_hat_t[t]):
            rho_hat_t[t + 1] = (rho_hat_t[t - 1] + rho_hat_t[t]) / 2.
            rho_hat_t[t + 2] = rho_hat_t[t + 1]
        t += 2
    ess = n_chains * n_samples
    ess = ess / (-1. + 2. * np.sum(rho_hat_t))
    return ess


@numba.jit()
def get_neff2(x):
    """Compute the effective sample size for a 2D array
    """
    neff = np.zeros(x.shape[0])
    nvals, nchain, n_samples = x.shape
    
    for i in range(len(x)):
        trace_value = x[i]
        nchain, n_samples = trace_value.shape
        acov = autocov2(trace_value)
        neff[i] = get_neff_inner(trace_value, acov)
    return neff


def effective_n(mtrace, varnames=None, include_transformed=False):
    def generate_neff(trace_values):
        """Trace values is an array (*varshape, n_chains, n_samples)"""
        x = np.stack([val.T for val in trace_values], axis=-2)
        *varshape, n_chains, n_samples = x.shape
        x = np.ascontiguousarray(x.reshape((-1, n_chains, n_samples)))
        return get_neff2(x).reshape(varshape)

    if not isinstance(mtrace, pm.backends.base.MultiTrace):
        # Return neff for non-multitrace array
        return generate_neff(mtrace)

    if mtrace.nchains < 2:
        raise ValueError(
            'Calculation of effective sample size requires multiple chains '
            'of the same length.')

    if varnames is None:
        varnames = pm.util.get_default_varnames(mtrace.varnames, include_transformed=include_transformed)

    n_eff = {}

    for var in varnames:
        n_eff[var] = generate_neff(mtrace.get_values(var, combine=False))

    return n_eff

Timing comparison:

with pm.Model() as model:
    a = pm.Normal('a', 0, 1)
    pm.Normal('b', sd=tt.exp(a), shape=1000)
    
    trace = pm.sample(10000)

%timeit effective_n(trace)
#1.96 s ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pm.effective_n(trace)
#25.7 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Possible bug in TravisCI pipeline

I think there's a bug in the Travis CI jobs regarding which Python version is used for testing.

On line 513 (https://travis-ci.org/arviz-devs/arviz/jobs/411966731#L513) Python 3.5 in a virtualenv is correctly activated, but when Travis CI runs pytest it uses Python 3.6 in all jobs
(https://travis-ci.org/arviz-devs/arviz/jobs/411966731#L1430).

The same condition can be seen in the intended Python 3.4 pipeline:
https://travis-ci.org/arviz-devs/arviz/jobs/411966732#L1430

Traceplot parity

Is it intentional that arviz's traceplot differs from pymc3's? In arviz, each element of an RV with shape > 1 gets its own subplot, while pymc3 draws them all on the same pair of axes.

Explicitly, with eight schools (I only capture the first few plots from arviz):

import numpy as np
import pymc3 as pm

J = 8
y = np.array([28,  8, -3,  7, -1,  1, 18, 12])
sigma = np.array([15, 10, 16, 11,  9, 11, 10, 18])

with pm.Model() as hierarchical:
    eta = pm.Normal('eta', 0, 1, shape=J)
    mu = pm.Normal('mu', 0, sd=1e6)
    tau = pm.HalfCauchy('tau', 5)
    theta = pm.Deterministic('theta', mu + tau*eta)
    obs = pm.Normal('obs', theta, sd=sigma, observed=y)
    trace_h = pm.sample(1000)

[screenshot: the first few per-element traceplots produced by arviz]

Plot KDEs horizontally?

Hey, not sure whether there's interest in this, or whether this package is actively maintained? I really like the idea of centralizing MCMC plotting tools.

When doing a traceplot, it can help to visually link the kernel density/histogram to the trace if the density plot is flipped to face the trace. If seaborn is used for the kernel density, flipping it is only a keyword argument. If this option would be useful in this package, it should be possible to do the same with ax.fill_betweenx, e.g.:
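For concreteness, a minimal sketch with plain matplotlib/scipy (nothing here is tied to this package's internals):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

samples = np.random.normal(size=1000)
grid = np.linspace(samples.min(), samples.max(), 200)
density = gaussian_kde(samples)(grid)

fig, (ax_kde, ax_trace) = plt.subplots(1, 2, sharey=True)
ax_kde.fill_betweenx(grid, 0, density, alpha=0.5)  # density on x, values on y
ax_kde.invert_xaxis()                              # so the KDE "closes over" the trace
ax_trace.plot(samples, alpha=0.5)                  # trace: value vs. iteration
plt.show()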

There are a lot of other views I've built that might be more useful generally for MCMC that could take arbitrary sample chains.

Refactor interface (datastructure)

Hi,

I suggest that we refactor the interface around a common data structure. I created this issue so we can start working on it.

In pystan we get Ordered dict with

odict = fit.extract()

and with pymc3

from collections import OrderedDict
odict = OrderedDict()
for name in trace.varnames:
    odict[name] = trace.get_values(name)

These have a common structure considering only the data. Sample size is on the first axis.

Most of the plots do not need chain number etc (excluding forestplot etc). I have tested these with needed modifications and it works with results from both libraries.

The other "problems" are logp (not a problem) and possibility to extract priors and other information.

Both pystan and pymc3 have a possibility to call logp with a function call so these should be handled.

Also there are functions from pymc3 that need to be handled.

P.S. Some naming conventions to settle:

  • trace vs traces / Multitrace
  • burn-in vs warm-up

Representing schema in xarray

In order to reach feature parity with pymc3 plotting, we need a way to access sampler statistics (specifically, divergences) and PPC samples. @aseyboldt outlined a good way to think about the rest of the schema here.

At the same time, netCDF supports groups, but it doesn't look like xarray supports them natively (yet? see the discussion at pydata/xarray#1092).

I am proposing something like

import xarray as xr
import netCDF4 as nc

class Trace(object):
    def __init__(self, filename):
        self.filename = filename
        self.data = nc.Dataset(filename)
        self.groups = self.data.groups
        
    def __getattr__(self, name):
        if name in self.groups:
            return xr.open_dataset(self.filename, group=name)
        raise AttributeError("informative message")
    
    def __dir__(self):
        """Allows for tab completion on netCDF group names"""
        return super(Trace, self).__dir__() + list(self.groups.keys())

This is a pretty light wrapper around netCDF and xarray. Usage is something like

t = Trace('mytrace.nc')
t.posterior  # this is an xarray.Dataset
t.posterior.mu.mean()  # calculate the mean of a variable

I think this will have to change a little bit so that nested groups work fine. In particular, something like

t = Trace('mytrace.nc')
t.sampler_stats.divergences  # should return an xarray.Dataset
t.sampler_stats  #  I think this would tend to return an empty xarray.Dataset

Use xarray throughout

There have been proposals to use xarray as a common language for pymc3, pystan, and pymc4. This library might be a good place to start: implement utility functions for translating pymc3's MultiTrace and pystan's OrderedDict into xarray objects, and then have all plotting functions work with xarray.
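As a starting point, here is a rough sketch of such a utility (the function name and the extra dimension labels are made up, not a settled API), assuming each variable arrives as an array with chains on the first axis and draws on the second:

import numpy as np
import xarray as xr

def samples_to_dataset(samples):
    """Convert {name: array of shape (n_chains, n_draws, *shape)} to a Dataset."""
    data_vars = {}
    for name, values in samples.items():
        values = np.asarray(values)
        extra_dims = ['{}_dim_{}'.format(name, i) for i in range(values.ndim - 2)]
        data_vars[name] = (['chain', 'draw'] + extra_dims, values)
    return xr.Dataset(data_vars)

posterior = samples_to_dataset({'mu': np.random.randn(4, 500)})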

Support Python 3.4+

This mostly means removing f-strings throughout. I may play around with the Travis build, too, to try to balance thoroughness and test time.

Converting PyStan records to dataframe

@ahartikainen I'm working on adding quick-start examples in #95, and would love to include pystan examples.

Do you have any examples of turning a pystan fit into a dataframe? I think xarray is the better answer long term (#4 (comment)), but for a first release, hacking together a small function around dataframes seems doable.
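Something like this might do for a first release (a sketch; stanfit_to_dataframe is a hypothetical helper, and fit.extract() returns the usual OrderedDict of draws):

import pandas as pd

def stanfit_to_dataframe(fit):
    """Flatten extracted draws into one DataFrame column per scalar."""
    columns = {}
    for name, values in fit.extract().items():
        if values.ndim == 1:
            columns[name] = values
        else:
            flat = values.reshape(values.shape[0], -1)  # draws stay on axis 0
            for i in range(flat.shape[1]):
                columns['{}_{}'.format(name, i)] = flat[:, i]
    return pd.DataFrame(columns)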

Multidimensional traceplot with Pystan and xarray

I realize that this is under development at the moment, but what is the intended method for constructing a simple traceplot for a PyStan fit object with 2-dimensional array variables? The conversion code seems to discard variables that are 2-dimensional: I get an error saying that pars is empty if I just call traceplot directly on the PyStan fit object:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-101-28cd15193e9c> in <module>()
----> 1 arviz.traceplot(samples)

~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/plots/traceplot.py in traceplot(data, var_names, coords, figsize, textsize, lines, combined, kde_kwargs, trace_kwargs)
     37     axes : matplotlib axes
     38     """
---> 39     data = convert_to_xarray(data)
     40 
     41     if coords is None:

~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in convert_to_xarray(obj, coords, dims, chains)
     73         return obj
     74     else:
---> 75         return get_converter(obj, coords=coords, dims=dims, chains=chains).to_xarray()
     76 
     77 

~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in get_converter(obj, coords, dims, chains)
     37         return DictToXarray(obj, coords, dims, chains=chains)
     38     elif obj.__class__.__name__ == 'StanFit4Model':  # ugly, but doesn't make PyStan a requirement
---> 39         return PyStanToXarray(obj, coords, dims)
     40     elif obj.__class__.__name__ == 'MultiTrace':  # ugly, but doesn't make PyMC3 a requirement
     41         return PyMC3ToXarray(obj, coords, dims)

~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in __init__(self, fit, coords, dims)
    389             The coordinates are those passed in and ('chain', 'draw')
    390         """
--> 391         super().__init__(fit, coords=coords, dims=dims)
    392 
    393     def to_xarray(self):

~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in __init__(self, obj, coords, dims)
     80         self.obj = obj
     81         # pylint: disable=assignment-from-none
---> 82         self.varnames, self.coords, self.dims = self.default_varnames_coords_dims(obj, coords, dims)
     83         self.verified, self.warning = self.verify_coords_dims()
     84 

~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in default_varnames_coords_dims(obj, coords, dims)
    451         for varname in varnames:
    452             if varname not in dims:
--> 453                 vals = obj.extract(varname, permuted=False)[varname]
    454                 if len(vals.shape) == 1:
    455                     vals = np.expand_dims(vals, axis=1)

stanfit4template_f63d55caee4d725d95ec10b01b4500a0_1018734232036915966.pyx in stanfit4template_f63d55caee4d725d95ec10b01b4500a0_1018734232036915966.StanFit4Model.extract()

~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/pystan/misc.py in _check_pars(allpars, pars)
    805 def _check_pars(allpars, pars):
    806     if len(pars) == 0:
--> 807         raise ValueError("No parameter specified (`pars` is empty).")
    808     for par in pars:
    809         if par not in allpars:

ValueError: No parameter specified (`pars` is empty).

Has the final shape of the expected xarray been settled? When calling traceplot, can you still specify variables using the names of vector/matrix parameters, or do you need to specify the variables with the flatnames (i.e. unfolded array variables, scalars with names like X_dims_1_2)?

Docs get deployed three times

Looks like each Travis task deploys the docs to GitHub. It all still goes smoothly, so it's no big deal, but it would be nice to fix. The fix should be an if statement, possibly in bash, somewhere.

Logo

Probably a little bit too early, but it would be nice to have a logo. What do you think? @twiecki @ahartikainen

[arviz_logo_00: proposed logo]

This logo got three 💯 reactions in PyMC3's Slack channel, and I did not vote! 😄

Is there a better way to show the ROPE in relation to the HPD in the posterior plot?

Hey all,
Open-ended issue here. I feel like the ROPE visualization right now would be better if it didn't hide the HPD. See the plot of mu below.

The proposals I have are:

  1. Move the HPD line above the ROPE line, since the HPD line is smaller
  2. Plot the ROPE as a vertical line, as Kruschke does in this blog post: http://doingbayesiandataanalysis.blogspot.com/2013/08/how-much-of-bayesian-posterior.html
  3. Give the ROPE an opacity/alpha/transparency value lower than 1

[posterior plot of mu, showing the ROPE hiding the HPD line]

I can provide mockups of any of these if people want.

Alternatively, if this behavior is desired, do nothing; sort of like accepting the null hypothesis, if you know what I mean :) (Sorry, I had to.)

Auto-scaling figures

It would be nice to have a good general solution to autoscale the figure size (and fontsize, markersize, etc) as a function of the number of subplots. Our current implementation is not general enough and needs special care for each type of plot.
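For discussion, one shape such a rule could take (the constants and the scaling exponent are purely illustrative, not a tuned proposal):

def scale_fig_attrs(rows, cols):
    """Scale figsize and text/marker sizes with the subplot grid."""
    figsize = (4 * cols, 2.5 * rows)
    textsize = 12 / (rows * cols) ** 0.25  # shrink text slowly as panels multiply
    markersize = textsize / 2
    return figsize, textsize, markersize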

Make pickle files work for different systems

They should probably be marked with a system name and a 32/64 flag for float width. As an alternative, maybe just a try/except that rebuilds the pickles if there's a problem? That would be easier and not very time-consuming (the pickles are mostly for Travis' benefit).

Even though everything will eventually work with netCDF files, this still matters: we need to be sure that pystan/pymc3 objects convert correctly.
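The try/except variant could be as small as this sketch (rebuild_pickles stands in for whatever script regenerates the test fixtures; both names here are hypothetical):

import pickle

try:
    with open('saved_trace.pkl', 'rb') as buff:
        trace = pickle.load(buff)
except (pickle.UnpicklingError, ValueError, EOFError):
    trace = rebuild_pickles('saved_trace.pkl')  # hypothetical fixture rebuilder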

Pip installation broken

Simple problem: the pip install in the readme does not work at the moment. It seems to be a problem with the two relative import statements in __init__.py:

from .plots import *
from .stats import *

Add examples to gallery

If you add a script to examples/ that includes a plot, the docs will put it onto http://arviz-devs.github.io/arviz/. A few things to note:

  • There are two saved traces (centered and non-centered) you can use, from the famous eight schools example
  • The script used to build docs searches for az.*plot(...) to title the plot
  • There is a markdown block at the top of each example that will show up in the documentation
  • The markdown block can also have a last line of _thumb: .5, .5. The second number does nothing; the first shifts the thumbnail left/right or up/down, depending on which dimension is thinner (the other dimension takes up the full height or width).
  • We can have multiple examples for a single plot

I believe this should replace #71 -- we will still need notebooks for any interactive altair plots, but this is a pretty pleasant format. Note also that these serve as a sort of integration test.

Convert forestplot to xarray

Wanted to open an issue, since this is tricky, and I do not want to get too far away from the current implementation while doing this.

I have copied my current version below. Notice that the styling is slightly different: notably, there is one subplot for each variable being plotted. Happy to hear any thoughts; otherwise I am going to keep working on getting all the options from the current version working here!

[work-in-progress forestplot: one subplot per plotted variable]

Nicer constructors for InferenceData objects

It would be nice to let as many of the plots work with numpy arrays as possible, and to have generic constructors that the pymc3 and pystan converters can use.

I'm working on this now.

Use pandas.DataFrame as the interface between sampling library and mcmcplotlib

Since DataFrames are simply indexed and labelled numpy arrays (from a memory and performance point of view, at least), why don't we use DataFrames as the interface between the tracing library and the plotting code?

e.g. the code might look like this:

import pymc3 as pm
import mcmcplotlib as mcp

with pm.Model() as model:
  .. 
  trace = pm.sample(2000..)

trace_df = trace.as_df(varnames=['Intercept', 'x', 'sigma'])

mcp.plot_trace(trace_df, varnames=['x', 'sigma']) #e.g. we are not interested in plotting intercept

That way mcmcplotlib doesn't have to know about library-specific trace objects, and we don't have to pass a list of numpy arrays plus a list of varnames for plotting.

I believe it is a safe assumption that most users of pymc, pystan and emcee have pandas installed.

[Suggestion] FFT-based kde for large data

Currently mcmcplotlib uses gaussian_kde for KDE generation, and this can be quite slow with a large number of points.

There is an FFT-based fastkde function in the faststats package that gives comparable results when there is a large number of points.

faststats is MIT-licensed, so in theory the fastkde function could be copied into the mcmcplotlib source code so that nothing extra needs to be imported (the function only needs scipy).

faststats https://github.com/mfouesneau/faststats/tree/master/
fastkde https://github.com/mfouesneau/faststats/blob/master/faststats/fastkde.py
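For reference, the core of the FFT-grid approach (a rough sketch of the underlying idea, not the faststats implementation): bin the samples onto a regular grid, then smooth the histogram with a Gaussian kernel.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def fast_kde(samples, grid_size=512):
    """Histogram-based KDE: bin the samples, then smooth with a Gaussian."""
    lo, hi = samples.min(), samples.max()
    counts, edges = np.histogram(samples, bins=grid_size, range=(lo, hi), density=True)
    bin_width = (hi - lo) / grid_size
    bw = 1.06 * samples.std() * len(samples) ** (-0.2)  # Silverman's rule of thumb
    density = gaussian_filter1d(counts, sigma=bw / bin_width)
    grid = edges[:-1] + bin_width / 2
    return grid, density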

Implement `distplot` equivalent

There are a few places where we would like to use arviz.kdeplot if a variable is continuous, and a histogram if not (traceplots, for example).

This might involve implementing a histplot first and then having distplot just inspect the dtype.

(happy if someone wants to change the names of any of these...)
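A first cut of the dispatch might look like this (a sketch: histplot is the not-yet-written function, and kdeplot stands in for the existing arviz one):

import numpy as np

def distplot(values, **kwargs):
    """KDE for continuous data, histogram for discrete data."""
    values = np.asarray(values)
    if np.issubdtype(values.dtype, np.integer):
        return histplot(values, **kwargs)  # hypothetical histogram plot
    return kdeplot(values, **kwargs)       # existing KDE plot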
