dfm / emcee

The Python ensemble sampling toolkit for affine-invariant MCMC

Home Page: https://emcee.readthedocs.io

License: MIT License

Languages: Python 85.14%, TeX 14.01%, Shell 0.85%
Topics: python, mcmc, mcmc-sampler, probabilistic-data-analysis

emcee's Introduction

emcee

The Python ensemble sampling toolkit for affine-invariant MCMC


emcee is a stable, well-tested Python implementation of the affine-invariant ensemble sampler for Markov chain Monte Carlo (MCMC) proposed by Goodman & Weare (2010). The code is open source and has already been used in several published projects in the astrophysics literature.

Documentation

Read the docs at emcee.readthedocs.io.

Attribution

Please cite Foreman-Mackey, Hogg, Lang & Goodman (2013) if you find this code useful in your research. The BibTeX entry for the paper is:

@article{emcee,
   author = {{Foreman-Mackey}, D. and {Hogg}, D.~W. and {Lang}, D. and {Goodman}, J.},
    title = {emcee: The MCMC Hammer},
  journal = {PASP},
     year = 2013,
   volume = 125,
    pages = {306-312},
   eprint = {1202.3665},
      doi = {10.1086/670067}
}

License

Copyright 2010-2021 Dan Foreman-Mackey and contributors.

emcee is free software made available under the MIT License. For details see the LICENSE file.

emcee's People

Contributors

aarchiba, adrn, andyfaff, bencebeky, carlosgmartin, christopher-bradshaw, davidwhogg, dependabot[bot], dfm, dstndstn, ericdill, eteq, farr, ipashchenko, jeremysanders, joezuntz, juliohm, lauralwatkins, manodeep, maqnouch, migueldvb, mtazzari, nkern, oriolabril, pkgw, pre-commit-ci[bot], simonrw, teobouvard, terhardt, willvousden


emcee's Issues

Periodic model parameters?

Hello Dan and co.,

great work on emcee!

Is there a recommended way, or even a hack, to treat periodic model parameters?

I'm thinking about phases, sky positions, etc. It's not a problem to make the likelihood behave correctly, but I'm afraid that if we let such parameters wander, the walkers will spread around multiple local maxima, and their linear combinations will be less effective at moving across parameter space. Is that the case in your understanding?

Thanks a bunch...
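One common hack (a sketch only, not an official emcee recommendation) is to wrap the periodic parameter inside the log-probability so that every periodic image of a point maps to the same density value; the toy density below is made up for illustration:

    import numpy as np

    def lnprob(theta):
        # Toy example: periodic phase in theta[0], Gaussian in the rest.
        phase = theta[0] % (2 * np.pi)            # wrap the phase onto [0, 2*pi)
        lp_phase = np.cos(phase - np.pi)          # periodic, peaked at phase = pi
        lp_rest = -0.5 * np.sum(theta[1:] ** 2)   # unit Gaussian in the others
        return lp_phase + lp_rest

Note this does not by itself stop the walkers from spreading over multiple periodic images, so the concern about stretch moves across images still stands.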

comment on Algorithm 3 "map"-ability

Put a comment like "this loop can be performed in a multiprocessor 'map' framework" on the for loop in the parallel stretch move. That comment is to be contrasted with the computational-cost comments on Algorithms 1 and 2. This makes the main point of the

Priors / limits

It's quite possible I haven't read enough to know this, but is there a way to enforce priors or limits on fitted parameters (e.g., positive temperature or distance)? At the moment I'm just having my likelihood function return ludicrously low probabilities if any walkers stray into unphysical territory, but is there a better way? It only takes one thrown NaN from one stray walker to foul up subsequent analysis, so I'd just as soon set certain regions of the parameter space as off-limits from the beginning.
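For reference, a pattern that works with emcee is to return -np.inf (rather than an arbitrary large negative number or NaN) whenever a walker leaves the allowed region; such a proposal is then always rejected. A minimal sketch with made-up parameter names:

    import numpy as np

    def lnprob(theta):
        temperature, distance = theta  # hypothetical parameters
        # Hard limits: forbid unphysical regions outright.
        if temperature <= 0.0 or distance <= 0.0:
            return -np.inf  # log(0): this proposal is never accepted
        # Toy Gaussian likelihood, just to make the sketch runnable.
        return -0.5 * (((temperature - 5000.0) / 100.0) ** 2
                       + ((distance - 10.0) / 1.0) ** 2)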

Typo in ensemble.py Metropolis-Hastings code path

The following patch fixes a typo (self._getlnprob --> self._get_lnprob) in the branch of code taken for a Metropolis-Hastings proposal function in ensemble.py.

From c6782d3 Mon Sep 17 00:00:00 2001
From: Will Meierjurgen Farr [email protected]
Date: Tue, 18 Sep 2012 22:45:09 -0500
Subject: [PATCH] Fixed typo bug in Metropolis-Hastings part of ensemble.py


emcee/ensemble.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/emcee/ensemble.py b/emcee/ensemble.py
index bc7777c..361f029 100644
--- a/emcee/ensemble.py
+++ b/emcee/ensemble.py
@@ -199,7 +199,7 @@ class EnsembleSampler(Sampler):
         if mh_proposal is not None:
             # Draw proposed positions & evaluate lnprob there
             q = mh_proposal(p)
-            newlnp, blob = self._getlnprob(q)
+            newlnp, blob = self._get_lnprob(q)

             # Accept if newlnp is better; and ...
             acc = (newlnp > lnprob)
--
1.7.9.6 (Apple Git-31.1)

Documentation of chain is inconsistent

The docs for the chain() function seem inconsistent:

  • In the warning at the top of the "class emcee.EnsembleSampler" documentation it says that the shape is "(nwalkers, nlinks, dim)";
  • For the chain() function itself, the documentation states that the shape is "(k, dim, iterations)".

The former seems correct from the code (so the warning is necessary; I'm finally trying to migrate to emcee from markovpy...).
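A quick way to settle this empirically (a sketch against the current API; the .chain attribute is assumed to still be available, though newer releases prefer get_chain()):

    import numpy as np
    import emcee

    def lnprob(x):
        return -0.5 * np.sum(x ** 2)

    nwalkers, ndim, nsteps = 10, 2, 50
    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob)
    sampler.run_mcmc(np.random.randn(nwalkers, ndim), nsteps)
    print(sampler.chain.shape)  # (10, 50, 2), i.e. (nwalkers, nsteps, dim)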

emcee and MPI

Is it possible to parallelize emcee with MPI? From what I understand, both threading and multiprocessing (which emcee supports) only allow using a single node in a cluster, which is pretty limiting.
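For what it's worth, the sampler only needs a pool-like object with a map() method, so MPI can be used through a third-party pool. A minimal sketch using the schwimmbad package (which emcee's documentation points to for MPI); run it with something like mpiexec -n 8 python script.py:

    import sys
    import numpy as np
    import emcee
    from schwimmbad import MPIPool  # third-party: pip install schwimmbad

    def log_prob(theta):
        return -0.5 * np.sum(theta ** 2)

    with MPIPool() as pool:
        if not pool.is_master():
            pool.wait()  # worker ranks loop here, serving map() requests
            sys.exit(0)
        nwalkers, ndim = 32, 5
        p0 = np.random.randn(nwalkers, ndim)
        sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool)
        sampler.run_mcmc(p0, 1000)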

alternative map function

Hi - I'm using multiprocessing, but my lnprob function talks to external programs via subprocess to get probabilities. This seems to make multiprocessing break quite often in random ways. Looking at the source code, it seems it uses pool.map to do the parameter -> probability mapping, if pool is set. It looks like I could avoid multiprocessing if I make my own pool object which provides a map. I could communicate with multiple external programs in a single thread.

Can I assume this API will continue? If so, could it be documented? Perhaps it would be nice to have a "map" option to EnsembleSampler which would allow a user-defined map function.
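For reference, the modern EnsembleSampler does take a pool argument, and the only thing the sampler calls on it is map(func, iterable), so a hand-rolled pool like the one proposed here works. A sketch (the serial loop stands in for the subprocess dispatch described above):

    import numpy as np
    import emcee

    class SubprocessPool:
        """Anything with a map(func, iterable) method can serve as a pool."""
        def map(self, func, iterable):
            # A real implementation could dispatch each evaluation to an
            # external program via subprocess from a single thread.
            return [func(x) for x in iterable]

    def lnprob(x):
        return -0.5 * np.sum(x ** 2)

    nwalkers, ndim = 32, 3
    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=SubprocessPool())
    sampler.run_mcmc(np.random.randn(nwalkers, ndim), 100)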

multiprocessing pickling error

I run into the following error when setting the threads to > 1:

sampler = emcee.EnsembleSampler(nwalkers, 14, lnpost, args=['12071006', priors, orderedkeys], threads=2)
pos, prob, state = sampler.run_mcmc(p0, 100)

Exception in thread Thread-3:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/multiprocessing/pool.py", line 313, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed

I am defining lnpostfn as a static function within my module. Curiously, I am able to pickle the lnpostfn that I use in the sampler as well as the data object that I call within the lnpostfn. So what is the multiprocessing module trying to pickle?

Here's a simplified version of my lnpostfn:

import numpy as np
import ooplc

def lnpost(theta, name, prior, keys):
    # lnprior is defined as a static function within my module
    logprior = lnprior(theta, prior, keys)  # compute prior

    # binary is a class inside the ooplc module
    binary = ooplc.binary('bin2', name, override=theta, params='new2', timegrid='coarse')
    model = binary.ldF_of_t

    data = binary.kepler_lcdata1
    data_err = np.sqrt(binary.kepler_lcdata1**2 + theta[0]**2)

    loglikelihood = np.sum(-0.5 * ((model - data) / data_err)**2)
    return logprior + loglikelihood

Any insights? I do not fully understand what multiprocessing.pool is trying to pickle.

explicitly define complementary ensemble

For the next update of the paper on arXiv or for the next thing we write, make sure that the complementary ensemble is clearly defined (see, eg, email train from 2013-03-21 and 2013-03-22 with Bailer-Jones). It is used in Algorithm 2, but not defined in an equation as it should be (and as it is in Goodman and Weare).

make sure there is actual code for the examples

A la Goodman's comments, we should make sure for each example we have in the example or the Appendix something like

def rosenbrock_prob(x, y, z):
    return this and that

chain = emcee.whatever(rosenbrock_prob, foo, bar)
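A runnable version of such an example might look like the sketch below, using the 2-D Rosenbrock density from Goodman & Weare (2010) (the /20 scaling is assumed here) and the current EnsembleSampler API:

    import numpy as np
    import emcee

    def log_rosenbrock(x):
        # 2-D Rosenbrock log-density: long, curved, hard-to-sample valley.
        return -(100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2) / 20.0

    nwalkers, ndim = 32, 2
    p0 = np.random.randn(nwalkers, ndim)
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_rosenbrock)
    sampler.run_mcmc(p0, 5000)
    samples = sampler.get_chain(flat=True)  # all walkers, flattened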

Bad performance for Gaussian likelihoods?

I'm comparing MH vs GW for sampling a 1D unit Gaussian likelihood. For MH I use a proposal covariance of 2.4**2. For GW I do 100 walkers and initialize them by drawing samples from the unit Gaussian. Here's a plot of the auto-correlation of the MH chain vs. the auto-correlation of one of the GW walkers.

[Figure: auto-correlation of the MH chain vs. one of the GW walkers]

Increasing the number of walkers seems to have no effect. I find similar results for N-d unit Gaussians. In general I'm unable to make GW perform better than MH for problems where the likelihood is fairly Gaussian and I have a decent proposal matrix. Am I doing anything wrong? Is that expected? Is there anything I can do?

Thanks a lot for any help!

Here's the code I use to run the chains,

import numpy as np
import emcee

nwalk = 100
ndim = 1
nstep = 10000

lnprob = lambda x: -np.linalg.norm(x)**2 / 2

gw_samp = emcee.EnsembleSampler(nwalk, ndim, lnprob)
gw_samp.run_mcmc(np.random.multivariate_normal([0]*ndim, np.identity(ndim), nwalk), nstep)

mh_samp = emcee.MHSampler(cov=[[2.4**2]], dim=1, lnprobfn=lnprob)
mh_samp.run_mcmc(np.random.normal(0, 1, 1), N=nstep)

and to generate the above plot:

import acor
from matplotlib.pyplot import plot, legend, xlabel, ylabel

plot(acor.function(mh_samp.chain[1000:, 0])[:50], label='Metropolis-Hastings')
plot(acor.function(gw_samp.chain[0, 1000:, 0])[:50], label='Goodman-Weare')
legend()
xlabel('steps')
ylabel('auto-correlation')

Add "fitting a line" example.

It would be good to add examples of a few of the exercises from Fitting a Line. I'm happy to do this because I've already implemented most of the problems, but it will take me a while to get around to it, so if anyone else is interested, they should go for it. Pull requests will be happily discussed and accepted!

Multithread speed scaling

Hi,

Running the basic quickstart example and checking the computation time for increasing thread counts, I get the counterintuitive result that computation time increases with the number of threads.

Here are some quick benchmark results on an 8-CPU machine.
nthread: time to completion
1: 4.2 sec
2: 9.1 sec
4: 11.9 sec
8: 12.8 sec

Furthermore, looking at the CPU usage, I always seem to get nthread+1 processes, the first one taking 100% of a CPU and the other nthread sharing about 1/nthread of ONE of the other 7 CPUs.

Any idea of what's going on?
Pierre
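For context, this is the expected signature of per-evaluation overhead: when each lnprob call is nearly free (as in the quickstart), pickling positions and results to and from the worker processes costs more than the evaluations themselves, so adding processes only adds time. A sketch of a case where parallelization should pay off, assuming an artificially expensive likelihood and the pool interface of recent releases:

    import time
    import numpy as np
    import emcee
    from multiprocessing import Pool

    def log_prob(theta):
        time.sleep(0.01)  # stand-in for a genuinely expensive likelihood
        return -0.5 * np.sum(theta ** 2)

    if __name__ == "__main__":
        nwalkers, ndim = 32, 5
        p0 = np.random.randn(nwalkers, ndim)
        with Pool(4) as pool:
            sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool)
            sampler.run_mcmc(p0, 100)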

return something when lnprob returns NaN

Starting with

pos, prob, state = sampler.run_mcmc(p0, NSTEPS)

If lnprob breaks and returns NaN, sampler.run_mcmc stops but does not return anything.

It would be helpful if it returned the offending walker, or the positions of the walker, so the lnprob function can be debugged.
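Until the sampler reports the offending walker itself, one workaround is to wrap the log-probability so it prints the position and returns -np.inf instead of propagating NaN. A sketch; raw_lnprob here is a toy stand-in for the real function:

    import numpy as np

    def raw_lnprob(theta):
        # Toy stand-in for a likelihood that can go non-finite.
        return -0.5 * np.sum(np.log(theta) ** 2)  # NaN for negative theta

    def safe_lnprob(theta):
        lp = raw_lnprob(theta)
        if not np.isfinite(lp):
            print("non-finite lnprob at theta =", theta)  # inspect the stray walker
            return -np.inf  # reject the move instead of breaking the run
        return lp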

difference between - and -- and ---

Correct usages include:

non-trivial process

Metropolis--Hastings step

The code---which is slow---is located at...

The - symbol is used to hyphenate words, i.e., two words that combine to make an adjective. The -- symbol is used to connect words that represent a relationship (like "push--pull"). The --- symbol is used to set off parenthetical phrases.

multiprocessing using more resources than expected.

Dan Weisz (UW) says:

I have a question for you about the proper usage of the 'threads' keyword.
From your write up, my interpretation was that if I say set 'threads=5', it
would use 5 processors.  But that doesn't seem to be what is happening.
Morgan and I share a 32-core machine, and I ran some code with 'threads=5'
and 30 walkers, and it seemed to eat up more than 5 processors worth of
resources.  So, perhaps I'm doing something wrong?  Or maybe I also have to
provide a pool object?

His interpretation is right and in my experience this should work well. I'm not sure what the answer is to his problem. Any thoughts?

Installation test failed (Parallel Sampler)

I just installed emcee from source (Mac OS X 10.8.3, Python 2.7 installed with MacPorts). It seemed fine, but when I tested the installation as suggested, one of the tests failed.

$ python -c 'import emcee; emcee.test()'
Starting tests...
Test: Parallel Sampler failed with error:
AttributeError: Tests instance has no attribute 'test_parallel'
Test: Ensemble Sampler passed.
Test: Metropolis-Hastings passed.
Test: Parallel Tempering passed.
3 tests passed
1 tests failed

variable likelihood speed

Suppose in some regions of parameter space one can evaluate the likelihood very quickly, while in others it takes orders of magnitude longer. This could happen, for example, if some approximations apply only in certain regions, or if a numerical solver inside the likelihood converges much more slowly in some places. This means that, for such a step, several of the walkers will finish quickly and then sit idle waiting for the stragglers. Is that correct? Does emcee have a way to remedy this problem?

avoid latin abbrevs

change i.e. occurrences to "that is" and e.g. to "for example" and so on. Reads better, I think. Feel free to close without fixing.

Transpose bug in EnsembleSampler.acor

From Alex Conley (Colorado):

In ensemble.py the chain that is passed to acor is: self.chain[:,:,i].T

This is of dimension nsamples by nwalkers.

But when I actually look in the acor code, it suggests that the input should be M by N, with M the number of series and N the number of elements per series.

Shouldn't the .T be removed? That gives me much more reasonable-seeming acor values.

This is a bug.

Allow lnprob to return extra info

Oftentimes my likelihood function calculates several quantities besides the likelihood itself which I'd like to store as the chain runs. Currently there's no easy way to store this info. I'd suggest the option of returning a tuple from the likelihood function:

    def lnprob(x):
        return lnl, extra

The sample method could then yield tuples which include the extra info calculated at each step,

    for pos, lnprob, rstate, extra in sampler.sample():
        ...

where extra would be a length number-of-walkers list of the extra information returned by the likelihood evaluations.
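For reference, this is essentially the "blobs" mechanism that later emcee releases adopted: a tuple return value stores the extras alongside the chain. A sketch against that API:

    import numpy as np
    import emcee

    def lnprob(x):
        lnl = -0.5 * np.sum(x ** 2)
        extra = float(np.mean(x))  # any per-step quantity worth keeping
        return lnl, extra          # everything after lnl is stored as a blob

    nwalkers, ndim = 32, 2
    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob)
    sampler.run_mcmc(np.random.randn(nwalkers, ndim), 100)
    blobs = sampler.get_blobs()    # array with shape (nsteps, nwalkers)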
