scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.

License: BSD 3-Clause "New" or "Revised" License

Python 39.57% Jupyter Notebook 60.43%

histbook's Introduction

`scikit-hep`: metapackage for Scikit-HEP

Project info

The Scikit-HEP project is a community-driven and community-oriented project with the aim of providing Particle Physics at large with an ecosystem for data analysis in Python embracing all major topics involved in a physicist's work. The project started in Autumn 2016 and its packages are actively developed and maintained.

It is not just about providing core and common tools for the community. It is also about improving the interoperability between HEP tools and the Big Data scientific ecosystem in Python, and about improving the discoverability of utility packages and projects.

As for the project's overall structure, it should be seen as a toolset rather than a toolkit.

Getting in touch

There are various ways to get in touch with project admins and/or users and developers.

scikit-hep package

scikit-hep is a metapackage for the Scikit-HEP project.

Installation

You can install this metapackage from PyPI with pip:

python -m pip install scikit-hep

or you can use Conda through conda-forge:

conda install -c conda-forge scikit-hep

All the normal best-practices for Python apply; you should be in a virtual environment, etc.

Package version and dependencies

Please check the setup.cfg and requirements.txt files for the list of Python versions supported and the list of Scikit-HEP project packages and dependencies included, respectively.

For any installed version of scikit-hep, the following displays the versions of all Scikit-HEP dependencies that are actually installed, for example:

>>> import skhep
>>> skhep.show_versions()

System:
    python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
executable: /srv/conda/envs/notebook/bin/python
   machine: Linux-5.15.0-72-generic-x86_64-with-glibc2.27

Python dependencies:
       pip: 23.1.2
     numpy: 1.24.3
     scipy: 1.10.1
    pandas: 2.0.2
matplotlib: 3.7.1

Scikit-HEP package version and dependencies:
        awkward: 2.2.2
boost_histogram: 1.3.2
  decaylanguage: 0.15.3
       hepstats: 0.6.1
       hepunits: 2.3.2
           hist: 2.6.3
     histoprint: 2.4.0
        iminuit: 2.21.3
         mplhep: 0.3.28
       particle: 0.22.0
          pylhe: 0.6.0
       resample: 1.6.0
          skhep: 2023.06.09
         uproot: 5.0.8
         vector: 1.0.0

Note on the versioning system:

This package uses Calendar Versioning (CalVer).

histbook's People

sebrown imandr lukasheinrich henryiii ryanmwhitephd clelange weidenka sh4zkh4n hercules261188

histbook's Issues

filling is slower with numpy 1.15

It looks like histogram filling is about 10 times slower with numpy 1.15 than with 1.13

Sensible plotting defaults for trivial cases

If I have a Hist with only one axis and I want to plot it, it would be convenient for me not to have to re-specify the axis expression. For example,

histogram = Hist(bin("x", 10, -5, 5))     # simple one-axis histogram
histogram.fill(...)
histogram.step().to(canvas)      # default: step("x")

Sensible defaults exist in more complicated cases:

histogram = Hist(bin("x", 10, -5, 5), bin("y", 10, 0, 1))     # simple two-axis histogram
histogram.fill(...)
histogram.heatmap().to(canvas)      # default: heatmap("x", "y")

NumExpr integration

Ensure that the histogram-specifying language is identical to NumExpr, so that NumExpr expressions can be used to book histograms.
Identify NumExpr sub-expressions that are used only once (in instr.py) and merge them (only if NumExpr is available! introduce an internal flag).
Use NumExpr to evaluate them, if available.
Use formulate to accept TTree::Draw syntax.

Efficiency diagram

Hi, I would like to produce a Graph/Histogram similar to ROOT's TEfficiency. So I have two data samples var_total and var_pass. I thought one could probably use the cut axis but I didn't succeed. Furthermore, a correct error estimate (e.g. binomial errors) would be necessary. Any idea how to do this using histbook?
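As a rough illustration of what the question asks for, the per-bin efficiency and a simple binomial error can be computed with plain NumPy outside of histbook. This is only a sketch: the function name, the sample values, and the binning are made up for illustration, and the error uses the normal approximation sqrt(eff * (1 - eff) / N), which is cruder than the intervals ROOT's TEfficiency provides.

```python
import numpy as np

def efficiency(var_pass, var_total, bins, value_range):
    """Per-bin efficiency with normal-approximation binomial errors (hypothetical helper)."""
    total, edges = np.histogram(var_total, bins=bins, range=value_range)
    passed, _ = np.histogram(var_pass, bins=bins, range=value_range)

    # Bins with no entries have an undefined efficiency: leave them as NaN.
    eff = np.full(total.shape, np.nan)
    err = np.full(total.shape, np.nan)
    filled = total > 0
    eff[filled] = passed[filled] / total[filled]
    err[filled] = np.sqrt(eff[filled] * (1.0 - eff[filled]) / total[filled])
    return edges, eff, err

# Made-up samples: var_pass must be a subset of var_total.
var_total = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.95])
var_pass = np.array([0.1, 0.6, 0.7, 0.8])
edges, eff, err = efficiency(var_pass, var_total, bins=2, value_range=(0.0, 1.0))
```

The normal approximation breaks down when the efficiency is close to 0 or 1 or when a bin has few entries, which is exactly where a dedicated interval method matters.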

Hist and Book should have metadata

...to carry pyhf fit information even if it's irrelevant for filling. It's to make a 1:1 correspondence between booking trees and fit models. This will make life easier for the physicist.

Implementation of export to TH2

Hi, I've been playing around with uproot and histbook, and I find it nice for data exploration.
However, I then would like to profit from RooFit in my analysis workflow, and I usually export the histogram to ROOT for this purpose (and then import it as RooDataHist)
When trying to export a 2-dimensional Hist, defined e.g. as

hp_dwcReferenceType_ntracks = Hist(bin("dwcReferenceType", 16, -.5, 15.5), profile("ntracks"))

h2_dwcReferenceType_ntracks = Hist(bin("dwcReferenceType", 16, -.5, 15.5), bin("ntracks", 2, -.5, 1.5))

I get a NotImplementedError: TH2 (from https://github.com/scikit-hep/histbook/blob/master/histbook/export.py#L290-L292).

Will exporting to TH2 be available soon and/or do you have a timescale for that?

Update to Vega-Lite 3

Vega-Lite 3 has been officially released and JupyterLab will only support Vega-Lite 3 in the next version.

Plotting histos and saving to pdf/png in script

Hi,

I've been trying Histbook for some experimental analysis and it seems very good! However I'd like to run over data samples and create plots of e.g. signal vs background(s) for many plottable variables, and save them to png/pdf.

As it is, I can do this via VegaLite using a web browser, by pressing save, but cannot seem to find a way to save directly to a file in the script without opening a browser, so that one can run e.g. in a headless gnu screen session or some sort of cluster/batch mode.

Is there a simple way plots can be exported/saved in Histbook without going through a web browser?

Thanks,
Alex

Need a way to project a booking tree to a particular sample

... so that you can project to just the histograms that should be filled by a particular Monte Carlo sample and call .fill(mc) on only that projection. Lukas suggested the name "view", which I like for its resonance with Numpy.

histbook/Spark questions

Hey Jim (writing here since I think you prefer it over slack. lemme know otherwise)- I have a couple questions

** Is it possible to take the following code and fill via spark with a single fill() call?

wjet_met = Hist(bin("MET_pt", 100, 0, 200))
wjet_met.fill(wjet_df.where('HLT_IsoMu24 == True'))

dyjet_met = Hist(bin("MET_pt", 100, 0, 200))
dyjet_met.fill(dyjet_df.where('HLT_IsoMu24 == True'))
overlay(wjet_met.step(), dyjet_met.step()).to(canvas)

Filling 'at once' would let histbook perform the collects in parallel, which increases parallelism (if i'm reading the code right).

** I would like to normalize some histograms to 1, is there a native way to do this, or should I export to pandas, perform the manipulation, then convert from pandas back into histbook?

** Finally, is there some type of "subobject" operator in the expression syntax? I.e. to do something like the following pseudocode:

h = Hist(bin("sqrt( (x.MET_x - y.MET_x) ** 2 + (z.MET_y - y.MET_y) ** 2)"))
h.fill(x = object_1, y = object_2)

User defined binning

Hi Jim,
Is custom binning on the list for the next releases?
Matt

Category color changes in stack()

I fill the histogram with a category axis in several iterations and I want to see it after each iteration, e.g.:

h = Hist(bin("x", 10, 0, 1), groupby("type"))

x = np.random.random((100,))
h.fill(type="A", x=x*x)
h.stack("type").area("x").to(canvas)

Here is the plot:

Here comes second iteration:

x = np.random.random((100,))
h.fill(type="B", x=1-x*x)
h.stack("type").area("x").to(canvas)

Notice that the color for type=A histogram has changed from blue to orange and type=B is blue now. It would be good to make subplots keep their representation attributes.

Btw, it works correctly with overlay()

Variable bin width histograms

It would be great if one could create variable bin width histograms through specifying a sequence that defines the bin edges similar to numpy.histogram.
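For reference, this is the numpy.histogram behaviour the request refers to: passing a sequence as bins defines the bin edges directly. The edges and data below are made up; the sketch shows NumPy only and says nothing about how histbook would expose such an axis.

```python
import numpy as np

# Bin edges of unequal width, given explicitly as a sequence.
edges = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
data = np.array([0.5, 1.5, 1.7, 3.0, 4.9, 7.0, 10.0])

# All bins are half-open [low, high) except the last one, which includes its upper edge.
counts, _ = np.histogram(data, bins=edges)

# With unequal widths, dividing by the bin width gives comparable bar heights.
density = counts / np.diff(edges)
```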

histogram += another

Shouldn't Hist.__iadd__() return self?
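The reason this matters is a general Python rule: `a += b` rebinds `a` to whatever `__iadd__` returns, so a method that returns nothing leaves `a` set to None. A minimal sketch with a made-up ToyHist class (not histbook's implementation):

```python
class ToyHist:
    """Toy stand-in for a histogram: just a list of bin counts."""

    def __init__(self, counts):
        self.counts = list(counts)

    def __iadd__(self, other):
        if len(self.counts) != len(other.counts):
            raise ValueError("histograms have different binning")
        for i, count in enumerate(other.counts):
            self.counts[i] += count
        # Returning self is essential: the result of __iadd__ is what
        # the name on the left-hand side of += gets bound to.
        return self

a = ToyHist([1, 2, 3])
alias = a
a += ToyHist([10, 20, 30])
```

After the addition, `a` and `alias` still refer to the same updated object.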

examples and Binder demo ?

Thoughts on an examples/ directory with notebooks and the needed hooks for binder?

Example repo:
https://github.com/jupyterlab/jupyterlab

Export histogram metadata

It would be useful if there was a way to dump histogram (and book ?) configuration to, say, JSON and to have an ability to recreate the empty histogram from it.
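One possible shape for such a dump, sketched with the standard json module. Everything here is hypothetical: the dictionary layout, the key names, and the helper functions dump_axes and load_empty are invented for illustration and are not part of histbook.

```python
import json

def dump_axes(axes):
    """Serialize a list of axis descriptions (plain dicts) to JSON text."""
    return json.dumps({"axes": axes}, indent=2, sort_keys=True)

def load_empty(text):
    """Rebuild the axis descriptions and a zeroed, flattened count buffer."""
    config = json.loads(text)
    shape = [axis["numbins"] for axis in config["axes"]]
    size = 1
    for numbins in shape:
        size *= numbins
    return config["axes"], shape, [0] * size

# Hypothetical configuration of a two-axis histogram.
axes = [
    {"kind": "bin", "expr": "sqrt(x**2 + y**2)", "numbins": 10, "low": 0.0, "high": 1.0},
    {"kind": "bin", "expr": "arctan2(y, x)", "numbins": 3, "low": -3.14, "high": 3.14},
]
text = dump_axes(axes)
restored_axes, shape, counts = load_empty(text)
```

Because axis expressions are plain strings, they survive the round trip unchanged, which is what makes a configuration-only dump feasible.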

[feature request]: 2D plots

Following up on the comment made on https://stackoverflow.com/questions/50820894/filling-numpy-arrays-slower-than-for-loop-h-fill-vs-h-fill-numpy:
It would be good to replicate 2D plots that are common in HEP, like the ones at http://histogrammar.org/docs/tutorials/python-matplotlib/

Vega-lite has a scatter plot as well as a heatmap (and table binned heatmap):

https://vega.github.io/vega-lite/examples/point_2d.html
https://vega.github.io/vega-lite/examples/rect_heatmap.html
https://vega.github.io/vega-lite/examples/rect_binned_heatmap.html
The latter is probably closest to what we need.

Confusing exception

I know maybe I am doing something wrong here:

h = Hist(bin("a", 10, 0, 1), groupby("tt"))
h.fill(a=np.random.random((100,)), tt=["a"]*100)
h.fill(a=np.random.random((100,)), tt=["b"]*100)
h.area("a").to(canvas)

Proper way would probably be

h = Hist(bin("a", 10, 0, 1), groupby("tt"))
h.fill(a=np.random.random((100,)), tt=["a"]*100)
h.fill(a=np.random.random((100,)), tt=["b"]*100)
h.stack("tt").area("a").to(canvas)

but the error message I get running first fragment is very puzzling:

...

/Users/ivm/anaconda/lib/python2.7/site-packages/histbook-1.0.9-py2.7.egg/histbook/proj.pyc in handlearray(content)
    504 
    505         def handlearray(content):
--> 506             content = content.reshape((-1, self._shape[-1]))
    507 
    508             out = numpy.zeros((content.shape[0], len(columns)), dtype=content.dtype)

AttributeError: 'list' object has no attribute 'reshape'

NameError: global name 'threading' is not defined

Hi,

Threading is not defined in book.py

https://github.com/scikit-hep/histbook/blob/master/histbook/book.py#L558

Adding import threading fixes the issue.

Thanks

Improved support for binned data in Vega-Lite

Vega-Lite 3 comes with improved support for prebinned data. You can find an example at https://vega.github.io/vega-lite/docs/bin.html#binned. I think this could be relevant here.

I'm one of the authors of Vega-Lite and if there is anything we can help with, please let me know.

Logscale not working properly

This fragment plots an empty histogram in log scale, but works fine in linear scale:

from histbook import Hist, bin, beside
from vega import VegaLite
import numpy as np

h = Hist(bin("n", 10, 0, 10))
h.fill(n=np.array([3,3,3]))
beside(h.step(), h.step(yscale={"type":"log"})).to(VegaLite)

If I fill it with [3,3,3,2,4], then the logscale plot is not empty, but it is incorrect.

Also, I noticed that logscale does not work well with errors enabled until all bins have non-zero counts.

error in binder tutorial

Cell 19, select method gave
ValueError: no axis can select 'x <= 0' (axis bin('x', 100, -5.0, 5.0) has the wrong inequality; low edges are closed)

List of input variables

For what I am doing, it would be useful if a Hist was able to return the list of basic variables used in all its axis expressions.

For example

Hist(bin("sqrt(x**2 + y**2)", 5, 0, 1), bin("arctan2(y, x)", 3, -math.pi, math.pi))

would return the list ["x", "y"]
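Since axis expressions are strings with Python-like syntax, one way to approximate this is with the standard ast module. The sketch below is an assumption about how it could be done, not how histbook parses expressions: free_variables is a made-up helper that collects every name not used as a function being called.

```python
import ast

def free_variables(*expressions):
    """Return the sorted names used as variables (not as called functions)."""
    names = set()
    for expression in expressions:
        tree = ast.parse(expression, mode="eval")
        # Remember the nodes that sit in the "function" position of a call,
        # e.g. the `sqrt` in `sqrt(...)`, so they can be skipped below.
        called = {id(node.func) for node in ast.walk(tree)
                  if isinstance(node, ast.Call)}
        for node in ast.walk(tree):
            if isinstance(node, ast.Name) and id(node) not in called:
                names.add(node.id)
    return sorted(names)

variables = free_variables("sqrt(x**2 + y**2)", "arctan2(y, x)")
```

Named constants such as `pi` would be reported as variables by this simple version, so a real implementation would need a list of known constants to exclude.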

Get rid of expression labels

Like Hist._weightlabel. They're useless now that expressions are input only as strings.

does it work with dask arrays?

I just discovered this very cool-looking package thanks to your tweet. This package could potentially have great value in the climate / atmosphere / ocean space, where our data tends to live in xarray / dask data structures.

Quick question: does it work with dask arrays? My naive reading of the code is that you coerce inputs explicitly to numpy. If not, is that on your roadmap?

Sensitivity to variable name ?

I have different results for these 2 cases, which differ only by the variable name, "z" vs. "e":

h = Hist(bin("z", 5, 0, 1000.0))
h.fill(z=np.array([123.0]))

h.step("z", error=True).to(canvas)

and

h = Hist(bin("e", 5, 0, 1000.0))
h.fill(e=np.array([123.0]))

h.step("e", error=True).to(canvas)

It works correctly with "z", but with "e", it uniformly fills all bins with 1

Support JupyterLab & other frontends

Hello – very cool project.

I notice that for the notebook plotting you depend on vega. Unfortunately, this does not work in JupyterLab because of its different frontend and backend architecture.

Have you thought displaying via altair instead? If you do, you could take advantage of the work we've done on rendering backends, and you'd get automatic support for notebook, jupyterlab, nteract, colab, hydrogen, etc.

You wouldn't have to use Altair's API to build the charts – if you have a valid vega-lite spec in the form of a Python dictionary, alt.Chart.from_dict(spec) will return an Altair object that's connected to all the display architecture.

In the meantime, I want to think about adopting your vegascope project in altair: vega/altair#919

Let me know if you have any questions, and thanks for this great package!

marker() should place marker at bin center

I'm trying to generate a standard HEP plot: stacks of simulated estimates overlaid with data points.

import numpy
from histbook import Hist, bin, overlay
from vega import VegaLite as canvas

array = numpy.random.normal(0, 1, 100)
histogram1 = Hist(bin("data", 10, -5, 5))
histogram1.fill(data=array)

array = numpy.random.normal(1, 1, 100)
histogram2 = Hist(bin("data", 10, -5, 5))
histogram2.fill(data=array)

array = numpy.random.normal(0, 1, 100)
data_hist = Hist(bin("data", 10, -5, 5))
data_hist.fill(data=array)

group = Hist.group(by='sample', h1 = histogram1, h2 = histogram2)
group.stack("sample").area("data").to(canvas)
overlay(group.stack("sample").area("data"), data_hist.marker()).to(canvas)

The result looks like it's doing the right thing, but the markers for the data histogram are on the bin edges. I think it would be more intuitive if they were at the bin centers. (Also, if there were more styling options, I'd plot them as black filled circular markers.)

Specialized Books

Data and samples (with a one-step method for making a stacked plot overlaid by points with errorbars).
Set of systematic variations, each associated with a vector of [theta1, theta2, theta3]. Number of dimensions is the number of parameters (3 here) and the value of each parameter may be a number of sigmas or a string (e.g. "MadGraph" vs "Pythia").

@lukasheinrich: is the common numerical unit "number of sigmas" or "confidence level"? There's a monotonic function between them (assuming Gaussian). Number of sigmas is easier to work with because it ranges over the whole real line, but confidence level is more semantically appropriate for probability distributions that aren't Gaussian. Like, if the probability distribution has polynomial tails with a certain degree, it doesn't have a standard deviation to make the translation.
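For the Gaussian case, the monotonic function mentioned above can be written down directly. The sketch below assumes a two-sided convention, where the confidence level is the probability of falling within n sigmas of the mean; a one-sided convention would use a different formula. The function names are made up for illustration.

```python
import math
from statistics import NormalDist

def confidence_level(num_sigmas):
    """Two-sided Gaussian coverage: P(|x - mean| < num_sigmas * sigma)."""
    return math.erf(num_sigmas / math.sqrt(2.0))

def sigmas_from_confidence(confidence):
    """Inverse of confidence_level, valid for 0 <= confidence < 1."""
    return NormalDist().inv_cdf(0.5 * (1.0 + confidence))

one_sigma = confidence_level(1.0)              # about 0.6827
round_trip = sigmas_from_confidence(confidence_level(2.0))
```

The inverse only exists because the Gaussian fixes the shape of the distribution, which is the point raised above about heavy-tailed distributions.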

feature request: clear the histogram

Could you please add a method to clear histogram counts ?
How about a histogram book ?
Thanks.

logarithmic scale on Y axis of Stack histogram

Hi,
I am attempting to produce a stacked HEP histogram consisting of several background histograms. I would like a logarithmic scale on the y axis of the stacked histogram, but there is no such option in the stack and area functions. It seems the only option I have is to use the step function (enabling the logarithmic y scale) on each background histogram (declared in the group) before stacking them. However, the Plotable1d object returned by hist.step() does not have a stack attribute, and if I go via .stack().step(logy), stack() complains that only bar and area can be stacked. Is there a way to set a logarithmic scale via .stack().area(logy) or any other equivalent method?

Thanks!

Remove meta dependency

I've downplayed the ability to define axis expressions from Python functions, rather than strings, because it hasn't worked on some systems due to a bug in meta. meta is not a very active project; I should probably remove the dependency altogether.

At the same time, perhaps I should ensure that my language is identical to or an extension of numexpr, so that there's an obvious candidate for accelerating the calculation of complex expressions. And then— I'd be able to use formulate to take those expressions in TTree::Draw syntax! (I don't know why I hadn't thought of that earlier.)

Fill from Pandas DataFrame

Of course.

"grid" histogram placement

In addition to beside() and below(), it would be good to have a grid() method to place a list of histograms in a grid, e.g.:

grid(5, [hist1.step(), hist2.step(), ...]) - place histograms in Nx5 grid (N rows by 5 histograms)
hist.grid(5, "param") - produce views with constant "param" and place them into Nx5 grid

Hist constructor should have a "filter" method

To apply a cut to the data, because weight="where(muonpt > 20, 1, 0)" is less guessable than filter="muonpt > 20". They'll have the same effect, though.
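The claimed equivalence can be checked with plain NumPy, independently of histbook: filling with a 0/1 weight gives the same counts as dropping the rejected entries before filling. The muonpt values and binning below are made up.

```python
import numpy as np

muonpt = np.array([5.0, 25.0, 30.0, 15.0, 45.0])
edges = np.linspace(0.0, 50.0, 6)   # five bins of width 10

# Cut expressed as a weight: rejected entries are filled with weight 0.
weighted, _ = np.histogram(muonpt, bins=edges,
                           weights=np.where(muonpt > 20, 1, 0))

# Cut expressed as a filter: rejected entries are never filled.
filtered, _ = np.histogram(muonpt[muonpt > 20], bins=edges)

same = np.array_equal(weighted, filtered)
```

The two only agree on counts and sums of weights; a quantity such as the number of entries filled would differ between the two approaches.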

Plotting normalised histograms

Is it possible to plot normalised overlayed histograms using something like:

histogram.overlay("x").marker("y", error=True, normed=True).to(canvas)
histogram.overlay("x").normalize().marker("y", error=True).to(canvas)
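Until something like this exists, the normalization itself is a small amount of arithmetic on the bin contents. A sketch with NumPy, assuming unweighted fills (so Poisson errors sqrt(N)) and normalizing to unit sum; the counts are made up, and the error scaling treats the total as a fixed constant.

```python
import numpy as np

counts = np.array([4.0, 16.0, 20.0])
errors = np.sqrt(counts)            # Poisson errors for unweighted fills

total = counts.sum()
normed = counts / total             # bin contents now sum to 1
normed_errors = errors / total      # errors scale by the same factor
```

To normalize to unit area instead of unit sum, divide additionally by the bin width.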

histbook import from itself bug

This code gives me an error:

import math
from histbook import Hist, bin, beside
from vega import VegaLite as canvas

hist = Hist(bin("sqrt(x**2 + y**2)", 5, 0, 1),
             bin("atan2(y, x)", 3, -math.pi, math.pi))
hist.fill(x=numpy.random.normal(0, 1, 1000000),
           y=numpy.random.normal(0, 1, 1000000))
beside(hist.step("sqrt(y**2 + x**2)"), hist.step("atan2(y,x)")).to(canvas)

Error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-574307c16e07> in <module>()
      4 
      5 hist = Hist(bin("sqrt(x**2 + y**2)", 5, 0, 1),
----> 6              bin("atan2(y, x)", 3, -math.pi, math.pi))
      7 hist.fill(x=numpy.random.normal(0, 1, 1000000),
      8            y=numpy.random.normal(0, 1, 1000000))

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/hist.pyc in __init__(self, *axis, **opts)
    435                 newaxis.append(old)
    436             else:
--> 437                 expr, label = histbook.expr.Expr.parse(old._expr, defs=defs, returnlabel=True)
    438                 new = old.relabel(label)
    439                 new._original = old._expr

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/expr.pyc in parse(expression, defs, returnlabel)
    327 
    328         if returnlabel:
--> 329             return recurse(pyast, relations=True), label
    330         else:
    331             return recurse(pyast, relations=True)

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/expr.pyc in recurse(node, relations)
    319 
    320                 if fcn is None:
--> 321                     raise ExpressionError("unhandled function in expression: {0}".format(histbook.util.astunparse.tostring(node).strip()))
    322 
    323                 return Call(fcn, *(recurse(x, relations=(i == 0 and fcn == "where")) for i, x in enumerate(node.args)))

NameError: global name 'histbook' is not defined

Broadcasting ?

Imagine I have a histogram like this:

Hist(bin("x", 100, 0.0, 1.0), groupby("condition"))

and I have an array of x collected under condition="test". So I want to put these data into the histogram. To do that, I need to do something like this:

hist.fill(x=xarray, condition=["test"]*len(xarray))

I think it would be useful to adopt a simple "broadcasting" rule to allow this:

hist.fill(x=xarray, condition="test")
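A minimal sketch of such a rule, written as a preprocessing step that could run before a fill. The helper broadcast_columns is hypothetical and not part of histbook; it expands scalar arguments to the common length of the array arguments.

```python
import numpy as np

def broadcast_columns(**columns):
    """Expand scalar arguments to the length of the array arguments."""
    lengths = {len(value) for value in columns.values()
               if not np.isscalar(value)}
    if len(lengths) > 1:
        raise ValueError("array arguments have different lengths")
    length = lengths.pop() if lengths else 1
    return {name: np.full(length, value) if np.isscalar(value)
            else np.asarray(value)
            for name, value in columns.items()}

xarray = np.array([0.1, 0.5, 0.9])
columns = broadcast_columns(x=xarray, condition="test")
```

np.isscalar treats Python strings as scalars, which is what lets condition="test" be expanded here.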

Multiple object output in JupyterLab

Currently, the README advises simply dropping the .to(canvas) call and letting JupyterLab display the histogram object. It is often convenient to output multiple histograms from the same cell. For that case one can use

from IPython.display import display
display(histogram.step("data"))

I think many people would want to know about that. Even further, it would make sense to unify interface for this. I think something like this can be implemented as a part of histbook.vega:

import numpy
from histbook import *
from IPython.display import display
class VegaJson:
    def __init__(self, x):
        self._mimejson = {"application/vnd.vegalite.v2+json": x}
    def _repr_mimebundle_(self, **kwargs):
        return self._mimejson
canvas = lambda x: display(VegaJson(x))

array = numpy.random.normal(0, 1, 1000000)
histogram = Hist(bin("data", 10, -5, 5))
histogram.fill(data=array)
histogram.step("data").to(canvas)

What do you think about this?

Bayesian Blocks

Any thoughts about this type of data-dependent binning?

http://docs.astropy.org/en/stable/api/astropy.stats.bayesian_blocks.html

https://jakevdp.github.io/blog/2012/09/12/dynamic-programming-in-python/

width is ignored

In this fragment, width=... is ignored when used with below():

import numpy
from histbook import Hist, bin, beside, grid, below
from vega import VegaLite as canvas
array = numpy.random.normal(0, 1, 1000000)
histogram = Hist(bin("data", 10, -5, 5))
histogram.fill(data=array)
below(histogram.step("data", width=500)).to(canvas)

without below() it works fine

Another error

I have this fragment, which is not working:

h_muon_kinematics = Hist(
    bin("mu_phi", 10, -180, 180),
    bin("mu_theta", 10, -90, 90),
    bin("mu_p3", 10, 0, 1000),
    bin("mu_e", 10, 0, 1000),
    bin("mu_pt", 10, 0, 1000)
)

h_muon_kinematics.fill(
    mu_phi = np.random.random((100,)),
    mu_theta = np.random.random((100,)),
    mu_p3 = np.random.random((100,))*1000,
    mu_e = np.random.random((100,))*1000,
    mu_pt = np.random.random((100,))*1000
)

grid(2,
        h_muon_kinematics.step("mu_phi"),
        h_muon_kinematics.step("mu_theta"),
        h_muon_kinematics.step("mu_p3"),
        h_muon_kinematics.step("mu_pt"),
        h_muon_kinematics.step("mu_e")
).to(canvas)

Error is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-e8fa35fd32f6> in <module>()
     21         h_muon_kinematics.step("mu_pt"),
     22         h_muon_kinematics.step("mu_e")
---> 23 ).to(canvas)

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/vega.pyc in to(self, fcn)
    453     def to(self, fcn):
    454         """Call ``fcn`` on the Vega-Lite JSON for this plot."""
--> 455         return fcn(self.vegalite())
    456 
    457     def ipyvega(self):

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/vega.pyc in vegalite(self)
   1087                "vconcat": [{"hconcat": []}]}
   1088 
-> 1089         self._fill(0, allaxis, alldomains, out["vconcat"])
   1090         return out

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/vega.pyc in _fill(self, i, allaxis, alldomains, tofill)
   1069             else:
   1070                 varname = self._varname(i)
-> 1071                 marks, encodings, transforms = plotable._vegalite(allaxis[i], alldomains[i], varname)
   1072                 thislayer = [{"filter": {"field": "id", "equal": varname}}]
   1073 

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/vega.pyc in _vegalite(self, axis, domains, varname)
    628                 raise AssertionError(values)
    629 
--> 630         encoding = {"x": {"field": varname + str(axis.index(self._last.axis)), "type": xtype, "scale": {"zero": False}, "axis": {"title": xtitle}},
    631                     "y": {"field": varname + str(len(axis)), "type": "quantitative", "axis": {"title": ytitle}}}
    632         for channel in self._chain[:-1]:

ValueError: bin('mu_theta', 10, -90.0, 90.0) is not in list

Books should have a tree structure

Remove the code that flattens Book structure to a single level, allowing the user to build trees. Calling a fill from the top level should still propagate all the way down.

This is a prerequisite for making special Books representing semantic relationships among histograms. You lose meaning if the structure gets flattened.
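The propagation described here amounts to a recursive fill. The sketch below uses made-up ToyHist and ToyBook classes to show the idea; it is not histbook's Book implementation.

```python
class ToyHist:
    """Toy leaf node: only counts how many values it has been filled with."""

    def __init__(self):
        self.entries = 0

    def fill(self, values):
        self.entries += len(values)

class ToyBook:
    """Toy tree node: children may be ToyHist leaves or nested ToyBooks."""

    def __init__(self, **children):
        self.children = dict(children)

    def fill(self, values):
        # Each child handles its own fill, so nested books recurse naturally
        # and the tree structure is never flattened.
        for child in self.children.values():
            child.fill(values)

book = ToyBook(data=ToyHist(),
               mc=ToyBook(signal=ToyHist(), background=ToyHist()))
book.fill([1.0, 2.0, 3.0])
```

Keeping the nesting intact means a sub-book such as `book.children["mc"]` can still be addressed and filled on its own.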

statelessness has changed with logscale

Showing dynamically updated histograms, I have been using this feature of histbook, which I found very useful, quite cool and "pythonic":

from histbook import Hist, bin, beside
from vega import VegaLite
import numpy as np

h=Hist(bin("x", 50, -5, 5))

display = beside(      # this looks (and used to act) like plotting instructions without actual plotting
    h.step(),
    h.step(yscale={"type":"log"})
)

# fill and show the histogram in iterations

x = np.random.normal(size=110)
h.fill(x=x)
display.to(VegaLite)        # now do actual plotting using "instructions" defined earlier

# 2nd iteration
x = np.random.normal(size=10000)
h.fill(x=x)
display.to(VegaLite)        # display again using same instructions

The left histogram looks good, while the right (logscale) histogram does not update its y-axis scale to accommodate the new counts.