
chippr's Introduction

chippr

Cosmological Hierarchical Inference with Probabilistic Photometric Redshifts

Motivation

This repository is the home of chippr, a Python package for estimating quantities of cosmological interest from surveys of photometric redshift posterior probability distributions.
It is a refactoring of previous work on using probabilistic photometric redshifts to infer the redshift distribution.

Examples

You can browse the demo notebook in the repository.

Documentation

Documentation can be found on ReadTheDocs.
The draft of the paper documenting the details of the method can be found here.

Disclaimer

As can be seen from the git history and Python version, this code is stale and should be understood as a prototype, originally scoped for applicability to SDSS DR10-era data of low dimensionality. It will need a major upgrade in flexibility and computational scaling before it can run on data sets like those of modern and future galaxy surveys.

People

License, Contributing etc

The code in this repo is available for re-use under the MIT license, which means that you can do whatever you like with it, just don't blame me. If you end up using any of the code or ideas you find here in your academic research, please cite me as Malz et al, in preparation\footnote{\texttt{https://github.com/aimalz/chippr}}. If you are interested in this project, please do drop me a line via the hyperlinked contact name above, or by opening an issue. To get started contributing to the chippr project, just fork the repo -- pull requests are always welcome!

chippr's People

Contributors

aimalz


chippr's Issues

Simplify/speed up analysis

Travis tests are taking a long time with the current demo notebook. I'm going to simplify the tests in the notebook and use separate, local scripts to run more meaningful tests. Similar to #16, I'm going to return the changes to the same branch.

Add math to the demo

The demo goes over use of the code without any connection to the math. In preparation for publication, I'm going to make the demo representative of the paper's content.

Proper multi-threading

My tests currently run in parallel via external scripts, not using an inherent chippr mode. The reason is that the function that evaluates the posterior over n(z) is not pickleable, so emcee multi-threading doesn't work. This issue is for either choosing a sampler that somehow doesn't care about this shortcoming (I suspect there may not be such a thing) or making the function pickleable. It's a low priority for the scope of paper 0 (the model and proof of concept) and paper 1 (demonstration on toy test cases) but essential for paper 2 (more realistic and/or real data at scale).
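For reference, a minimal sketch of the usual workaround, assuming the posterior can be restated as a module-level (and therefore picklable) function; `log_post_nz` and its toy arguments are illustrative stand-ins, not chippr's actual API:

```python
# Sketch: emcee multiprocessing requires a picklable log-posterior.
# Defining it at module level (rather than as a closure or bound
# method) is the standard fix.
import numpy as np
import emcee
from multiprocessing import Pool

def log_post_nz(hyperparams, interim_posteriors, interim_prior):
    """Toy stand-in for the posterior over n(z) hyperparameters."""
    if np.any(hyperparams < 0.):
        return -np.inf
    nz = hyperparams / np.sum(hyperparams)
    # marginalize each galaxy's interim posterior over the proposed n(z)
    weights = interim_posteriors @ (nz / interim_prior)
    return np.sum(np.log(weights))

ndim, nwalkers = 10, 32
rng = np.random.default_rng(42)
interim_prior = np.full(ndim, 1. / ndim)
interim_posteriors = rng.dirichlet(np.ones(ndim), size=100)  # toy catalog
start = rng.uniform(0.5, 1.5, size=(nwalkers, ndim))

if __name__ == "__main__":  # guard required for multiprocessing
    with Pool() as pool:
        sampler = emcee.EnsembleSampler(
            nwalkers, ndim, log_post_nz,
            args=(interim_posteriors, interim_prior), pool=pool)
        sampler.run_mcmc(start, 500)
```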

Improve file formats

I'm currently storing data in plaintext, which is obviously unacceptable for nontrivial survey sizes! I'll update that soon.
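For instance, a minimal sketch of one plausible replacement, HDF5 via h5py; the dataset names and layout here are an illustrative assumption, not an existing chippr schema:

```python
# Sketch: replacing plaintext catalogs with compressed HDF5.
import numpy as np
import h5py

z_grid = np.linspace(0., 1., 36)
interim_posteriors = np.random.rand(10**5, 35)  # one PDF per galaxy

with h5py.File("catalog.hdf5", "w") as f:
    f.create_dataset("z_grid", data=z_grid)
    f.create_dataset("interim_posteriors", data=interim_posteriors,
                     compression="gzip")
    f.attrs["n_galaxies"] = interim_posteriors.shape[0]

with h5py.File("catalog.hdf5", "r") as f:
    pdfs = f["interim_posteriors"][:1000]  # read only a slice, lazily
```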

Add plotting options

As with the vb keyword, I will add an option to automatically make informative plots throughout the catalog generation process and inference procedures.
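A minimal sketch of how such a keyword could sit alongside vb; the function and keyword names are illustrative, not chippr's existing interface:

```python
# Sketch: gating diagnostics behind keywords, mirroring the vb flag.
import numpy as np
import matplotlib.pyplot as plt

def estimate_nz(catalog, vb=True, plot=False, plot_dir="."):
    """Toy stand-in for an inference step with optional diagnostics."""
    result = catalog.mean(axis=0)  # placeholder computation
    if vb:
        print("estimate_nz: processed {} galaxies".format(len(catalog)))
    if plot:
        plt.plot(result)
        plt.xlabel("redshift bin")
        plt.ylabel("n(z) estimate")
        plt.savefig(plot_dir + "/nz_estimate.png")
        plt.close()
    return result

estimate_nz(np.random.rand(100, 35), plot=True)
```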

Respond to referee report

The referee requests the following:

  • Upgrade to Python 3.
  • Calculate and include concrete benchmarking values of computational expense.
  • Assess accuracy of MCMC error bars on distribution of n(z) samples.
  • Add back cosmology forecast.
  • Cite an appearance of the analytic n(z) of Eqn. 13.
  • Explain chippr's overestimate at redshift extrema.
  • Remove repeated sentence in conclusion.

I would like to take this opportunity to do the following:

  • Cite more recent n(z) estimation papers (including arXiv:2011.01836).
  • Update references (for publication status and formatting).
  • Rerun with lower Gelman-Rubin threshold (see the sketch after this list).
  • Calculate and plot Delta z.
  • Clean up demo notebook with CosmoLike propagation.
  • Put CosmoLike data files on GitHub so others can run forecasting notebook.
  • Eliminate redundant plotting code.
  • Refactor and write unit tests.
  • Isolate catalog generation code and migrate to zeppole.
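For the Gelman-Rubin item, a minimal sketch of the statistic itself, assuming chains stored as an array of shape (n_chains, n_steps, n_params); lowering the threshold just means requiring R-hat closer to 1:

```python
# Sketch: the Gelman-Rubin convergence statistic R-hat.
import numpy as np

def gelman_rubin(chains):
    """R-hat per parameter for chains of shape (m, n, n_params)."""
    m, n, _ = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(axis=0, ddof=1)      # between-chain variance
    W = chains.var(axis=1, ddof=1).mean(axis=0)  # within-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

chains = np.random.randn(4, 1000, 10)  # toy: 4 chains, 10 parameters
print(gelman_rubin(chains))  # accept when all values fall below threshold
```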

Improve user interface with error checks

Currently the input file/parameter dictionary format is pretty sloppy. I want to introduce some sensible checks that halt execution if inputs don't make sense.

I'm also going to use this issue to implement more use of the vb keyword to print updates to screen.
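A minimal sketch of the kind of check I have in mind; the required keys are illustrative placeholders, not chippr's actual parameter schema:

```python
# Sketch: halting execution early on malformed input parameters.
def check_params(params):
    required = {"n_bins": int, "z_min": float, "z_max": float}
    for key, kind in required.items():
        if key not in params:
            raise KeyError("missing required parameter: " + key)
        if not isinstance(params[key], kind):
            raise TypeError("{} must be {}".format(key, kind.__name__))
    if params["z_min"] >= params["z_max"]:
        raise ValueError("z_min must be less than z_max")
    if params["n_bins"] < 1:
        raise ValueError("n_bins must be positive")
    return params

check_params({"n_bins": 35, "z_min": 0.0, "z_max": 1.0})
```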

Clean up test cases and plots

I made a big mess and am making a branch to simply clean up for my thesis. This issue can be closed when the scripts yield the plots I need for my thesis.

Literature review

The literature review from the old version is a good place to start on putting together an introduction for the chippr paper.

Scale up number of galaxies

chippr needs to be able to obtain posterior samples of n(z) for catalogs of ~1 million galaxies and will need an infrastructure overhaul to do so. In particular, multiprocessing will have to happen at the level of inference, not just in wrapper scripts for running different test conditions.

Fix the sampler

Something has gone wrong, and the mean of the samples is not approaching the MMLE.

Split/rescope paper

The paper is now overgrown and needs to be pruned, but the three points it makes are all valuable. This issue can be closed when @davidwhogg is satisfied with the three smaller papers into which I will split the current version. For reference, the papers will be scoped as follows:

  • Paper 1: A thorough presentation of the CHIPPR model and chippr prototype code, which are currently Sec. 2 and App. A, demonstrated on the three canonical forms of photo-z systematic error (bias, scatter, catastrophic outliers), which are currently Secs. 3.1, 3.2, 3.3.
  • Paper 2: A deeper exploration of the implicit prior and model misspecification, which is currently Secs. 3.4 and 4.2.
  • Paper 3: The propagation of CHIPPR results to cosmological parameter constraints, which is currently Sec. 4.1.

I wonder if there should also be another paper about the forward model of photo-z PDFs, which I've been meaning to separate out into a standalone package anyway...

New fiducial case

The new fiducial case must include the three effects about which LSST (and other surveys) are most concerned:

  • RMS scatter sigma_z / (1+z) ~ 0.05
  • 3-sigma outlier fraction ~ 0.1
  • bias (z_p - z_s) / (1+z_s) ~ 0.003

It should also have a smooth n(z) consistent with the CosmoLike analysis.
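A minimal sketch of drawing point estimates with those three effects at the quoted values; the true n(z) shape and the outlier population are toy assumptions, and relocating a flat ~10% of galaxies is a cruder notion than a 3-sigma outlier fraction:

```python
# Sketch: mock point estimates with bias, scatter, and outliers.
import numpy as np

rng = np.random.default_rng(0)
n_gal = 10**4
z_true = rng.gamma(shape=2., scale=0.2, size=n_gal)  # toy smooth n(z)

bias, sigma, f_out = 0.003, 0.05, 0.1
z_est = z_true + (1. + z_true) * (bias + sigma * rng.standard_normal(n_gal))

# relocate a random ~10% of galaxies to a catastrophic outlier population
is_out = rng.random(n_gal) < f_out
z_est[is_out] = rng.uniform(0., 2., size=is_out.sum())
```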

Paper catch-up

In general, I shouldn't have separate branches for the paper because it should always be updated along with the code and any test results. But, I still need to do some catching up to make that possible since #55 was only just resolved.

Quantile-quantile plot

Since I'm now doing inference of n(z), which is a normalized probability distribution function, instead of N(z), I could implement a quantile-quantile plot to compare different estimators to the truth.

This is related to #23 and #33 but I'm making it a new issue anyway because it's about content that wasn't in the old version.
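A minimal sketch, assuming both the truth and the estimator are normalized onto a shared redshift grid; the Gaussian shapes and the helper `cdf_quantiles` are toy stand-ins:

```python
# Sketch: quantile-quantile comparison of an estimated n(z) to truth.
import numpy as np
import matplotlib.pyplot as plt

def cdf_quantiles(z_grid, pdf, qs):
    """Redshifts at which the normalized CDF reaches each quantile q."""
    cdf = np.cumsum(pdf) / np.sum(pdf)
    return np.interp(qs, cdf, z_grid)

z_grid = np.linspace(0., 2., 200)
nz_true = np.exp(-0.5 * ((z_grid - 0.5) / 0.2) ** 2)   # toy truth
nz_est = np.exp(-0.5 * ((z_grid - 0.55) / 0.25) ** 2)  # toy estimator

qs = np.linspace(0.01, 0.99, 50)
plt.plot(cdf_quantiles(z_grid, nz_true, qs),
         cdf_quantiles(z_grid, nz_est, qs))
plt.plot([0, 2], [0, 2], "k--")  # perfect agreement lies on the diagonal
plt.xlabel("true n(z) quantiles")
plt.ylabel("estimated n(z) quantiles")
plt.savefig("qq.png")
```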

Optimize the optimizer

I'm going to try to break the optimizer with limiting cases that broke it in the old version. If those problems remain, I'll experiment with different optimization algorithms and step sizes (and possibly other parameters of those algorithms) to understand why those failures happen and hopefully find ways around them.
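A minimal sketch of that experiment loop using scipy; the objective is a toy stand-in for the real negative log-posterior, and the method list is just a starting set:

```python
# Sketch: comparing optimization algorithms on the same objective.
import numpy as np
from scipy.optimize import minimize

def neg_log_post(x):
    """Toy stand-in for the negative log-posterior being optimized."""
    return np.sum((x - 1.) ** 2) + 0.1 * np.sum(np.abs(x))

x0 = np.zeros(10)
for method in ["Nelder-Mead", "Powell", "L-BFGS-B", "CG"]:
    res = minimize(neg_log_post, x0, method=method,
                   options={"maxiter": 10000})
    print(method, res.success, res.fun)
```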

Finalize test conditions

Should the true n(z) in the test cases be kept as is, or should we use something else for the paper? The same goes for the prior probability distribution and the details of the test cases (i.e., at what redshift should outlier populations be located, etc.).

Epic: paper 0

I'm going to collect issues here now that this is progressing again.

  • Gather relevant references. (#65)
  • Document demo for publication. (#54)
  • Finish validating/interpreting propagation to cosmology (#49)
  • Scale up to more galaxies (#60); migrated to Future Analysis milestone.
  • Make a forecast for Euclid (#69); migrated to Future Analysis milestone.
  • Polish the sloppy plots.
  • Write a brief version of the math into an appendix.
  • Write a brief version of the mock data emulation procedure into an appendix.
  • Run sampler as well as optimizer.

Euclid forecast

Can you propagate the different n(z) estimates in your investigation to the predictions of shear 2PCF for a Euclid-like survey?

Refactor catalog production

The final test case in #25, that of catastrophic outliers as seen in empirical p(z) methods, suggests a different approach, one very similar to how the previous version worked at the very end.

All the test cases currently implemented in #25 begin with true redshifts, devise p(z_est | z_true, params), and use that to get p(z_true | z_est, params). What I should really be doing is devising a probability distribution in the space of p(z_true, z_est | params, n_true(z)), sampling pairs (z_true, z_est), and evaluating p(z_true | z_est, params) as the horizontal cuts through the z_true, z_est plot.

Here are some sketchy notes to remind me of how I want this to work, accounting for how the true n(z) and interim prior must enter into catalog construction in this way:

[Photo of handwritten notes: 20170627_131138]
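In code, a minimal sketch of that construction; the Gaussian form of the joint and the grid resolution are toy assumptions:

```python
# Sketch: define p(z_true, z_est | params) on a grid, draw pairs from
# it, and read off p(z_true | z_est) as cuts at fixed z_est.
import numpy as np

rng = np.random.default_rng(1)
z = np.linspace(0., 2., 100)
zt, ze = np.meshgrid(z, z, indexing="ij")  # axis 0: z_true, axis 1: z_est

nz_true = np.exp(-0.5 * ((z - 0.5) / 0.3) ** 2)     # toy true n(z)
joint = nz_true[:, None] * np.exp(
    -0.5 * ((ze - zt) / (0.05 * (1. + zt))) ** 2)   # p(z_true, z_est)
joint /= joint.sum()

# draw (z_true, z_est) pairs from the discretized joint
flat_idx = rng.choice(joint.size, size=10**4, p=joint.ravel())
it, ie = np.unravel_index(flat_idx, joint.shape)
pairs = np.column_stack([z[it], z[ie]])

# the cut at fixed z_est = z[j] (a horizontal cut through the
# z_true-z_est plot) gives the normalized posterior p(z_true | z_est)
j = 25
posterior = joint[:, j] / joint[:, j].sum()
```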

Pass nontrivial tests (extend support for challenging cases)

The old version of the code supported different physically-motivated test cases that must be re-implemented. EDIT: I will only implement the tests outlined in #55, but there are still some choices that must be made!

  • unfeatured true n(z) (SDSS interim prior as truth)
  • fiducial case (featured truth, constant standard deviations for each galaxy)
  • high intrinsic scatter (include (1+z) dependence in standard deviations?)
  • template-like catastrophic outliers (constant-ish likelihood components)
  • training-like catastrophic outliers (multimodal likelihood components)
  • template-like interim prior (multimodal interim prior)
  • training-like interim prior (low-z favoring interim prior)

Implement MCMC diagnostics

Currently, there are no diagnostics for the sampler. The sampler should check a burn-in condition to discard non-converged samples, and other convergence measures should be recorded.
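A minimal sketch using emcee's built-in autocorrelation-time estimate to set a burn-in length and thinning factor; the target density is a toy Gaussian, not chippr's posterior:

```python
# Sketch: autocorrelation-based burn-in and thinning with emcee.
import numpy as np
import emcee

def log_prob(x):
    return -0.5 * np.sum(x ** 2)  # toy standard-normal target

ndim, nwalkers = 5, 32
rng = np.random.default_rng(3)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(rng.standard_normal((nwalkers, ndim)), 5000)

tau = sampler.get_autocorr_time(tol=0)  # per-parameter autocorr time
burnin = int(2 * np.max(tau))           # discard a couple of tau as burn-in
thin = max(1, int(0.5 * np.min(tau)))
flat = sampler.get_chain(discard=burnin, thin=thin, flat=True)
print("tau =", tau, "-> kept", flat.shape[0], "samples")
```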

Generalize n(z)/p(z) parameterization

As qp shows, redshift posteriors and redshift density functions can be parameterized in many possible ways. chippr unfortunately was scoped out before I'd really thought about a catalog that might not use the piecewise constant parameterization of SDSS. KiDS has photo-z PDFs parameterized as Gaussians and wants n(z) with way more parameters than chippr can actually handle. This issue can be closed when:

  • chippr can accept Gaussian photo-z PDFs without forcing them onto a grid
  • chippr can yield n(z) in a different parameterization than the input photo-z PDFs
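A minimal sketch of the grid-free representation, using plain scipy Gaussians as a stand-in; this is not qp's interface, just the behavior chippr would need to support:

```python
# Sketch: carrying Gaussian photo-z PDFs as (mean, sigma) pairs and
# evaluating them only where needed, instead of gridding up front.
import numpy as np
from scipy.stats import norm

means = np.array([0.4, 0.7, 1.1])   # one Gaussian per galaxy
sigmas = np.array([0.05, 0.08, 0.1])

def evaluate_pdfs(z):
    """Evaluate every galaxy's PDF at redshifts z: shape (n_gal, n_z)."""
    return norm.pdf(z[None, :], loc=means[:, None], scale=sigmas[:, None])

# the output parameterization can differ from the input one, e.g.
# integrating the Gaussians over wide n(z) bins
edges = np.linspace(0., 2., 11)
bin_probs = norm.cdf(edges[None, 1:], means[:, None], sigmas[:, None]) \
          - norm.cdf(edges[None, :-1], means[:, None], sigmas[:, None])
```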

Epic: refactoring

This issue is for collecting programming issues that are driving me nuts but aren't actually holding back completion of the paper.

  • Establish consistent keyword/variable names (#15)
  • Improve user interface with error checks (#26)
  • Improve file formats (#51)
  • Diagnostic printouts to file (#52)
  • Eliminate the abundant magic numbers and give more control to the user via an improved config file interface.
  • Use qp (or pomegranate) for probability distributions to replace discrete.py, gauss.py, and gmix.py (#68)
