scikit-hep / boost-histogram

Python bindings for the C++14 Boost::Histogram library

Home Page: https://boost-histogram.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

CMake 0.76% C++ 11.64% Jupyter Notebook 61.60% Python 26.00%

boost-histogram's Introduction

boost-histogram for Python

Python bindings for Boost::Histogram (source), a C++14 library. This is one of the fastest libraries for histogramming, while still providing the power of a full histogram object. See what's new.

Other members of the boost-histogram family include:

Hist: The first-party analyst-friendly histogram library that extends boost-histogram with named axes, many new shortcuts including UHI+, plotting shortcuts, and more.
UHI: Specification for Histogram library interop, especially for plotting.
mplhep: Plotting extension for matplotlib with support for UHI histograms.
histoprint: Histogram display library for the command line with support for UHI.
dask-histogram: Dask support for boost-histogram.

Usage

Text intro

import boost_histogram as bh

# Compose axis however you like; this is a 2D histogram
hist = bh.Histogram(
    bh.axis.Regular(2, 0, 1),
    bh.axis.Regular(4, 0.0, 1.0),
)

# Filling can be done with arrays, one per dimension
hist.fill(
    [0.3, 0.5, 0.2],
    [0.1, 0.4, 0.9],
)

# NumPy array view into histogram counts, no overflow bins
values = hist.values()

# Make a new histogram with just the second axis, summing over the first, and
# rebinning the second into larger bins:
h2 = hist[::sum, :: bh.rebin(2)]

We support the uhi PlottableHistogram protocol, so boost-histogram/Hist histograms can be plotted via any compatible library, such as mplhep.

Cheatsheet

Simplified list of features

Many axis types (all support metadata=...)
- bh.axis.Regular(n, start, stop, ...): Make a regular axis. Options listed below.
  - overflow=False: Turn off overflow bin
  - underflow=False: Turn off underflow bin
  - growth=True: Turn on growing axis, bins added when out-of-range items added
  - circular=True: Turn on wrapping, so that out-of-range values wrap around into the axis
  - transform=bh.axis.transform.Log: Log spacing
  - transform=bh.axis.transform.Sqrt: Square root spacing
  - transform=bh.axis.transform.Pow(v): Power spacing
  - See also the flexible Function transform
- bh.axis.Integer(start, stop, *, underflow=True, overflow=True, growth=False, circular=False): Special high-speed version of regular for evenly spaced bins of width 1
- bh.axis.Variable([start, edge1, edge2, ..., stop], *, underflow=True, overflow=True, circular=False): Uneven bin spacing
- bh.axis.IntCategory([...], *, growth=False): Integer categories
- bh.axis.StrCategory([...], *, growth=False): String categories
- bh.axis.Boolean(): A True/False axis
Axis features:
- .index(value): The index at a point (or points) on the axis
- .value(index): The value for a fractional bin (or bins) in the axis
- .bin(i): The bin edges (continuous axis) or a bin value (discrete axis)
- .centers: The N bin centers (if continuous)
- .edges: The N+1 bin edges (if continuous)
- .extent: The number of bins (including under/overflow)
- .metadata: Anything a user wants to store
- .traits: The options set on the axis
- .size: The number of bins (not including under/overflow)
- .widths: The N bin widths
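
To make the axis vocabulary above concrete, here is a minimal pure-Python sketch of what .index, .edges, and .centers mean for a regular axis. It does not call boost-histogram; the function names are hypothetical, and the convention that -1 marks underflow and n marks overflow is an assumption of this sketch.

```python
def regular_index(x, n, start, stop):
    """Bin index of x on a regular axis with n bins over [start, stop).

    Assumed convention for this sketch: -1 is underflow, n is overflow.
    """
    z = (x - start) / (stop - start)  # position along the axis, in [0, 1) if in range
    if z < 0.0:
        return -1
    if z >= 1.0:
        return n
    return int(z * n)


def regular_edges(n, start, stop):
    """The n + 1 bin edges of the axis."""
    return [start + (stop - start) * i / n for i in range(n + 1)]


def regular_centers(n, start, stop):
    """The n bin centers, each halfway between two neighbouring edges."""
    edges = regular_edges(n, start, stop)
    return [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]


edges = regular_edges(4, 0.0, 1.0)
centers = regular_centers(4, 0.0, 1.0)
idx_inside = regular_index(0.3, 4, 0.0, 1.0)
idx_under = regular_index(-0.1, 4, 0.0, 1.0)
idx_over = regular_index(1.0, 4, 0.0, 1.0)
```

The upper edge is exclusive in this sketch, so a value equal to stop is counted as overflow.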
Many storage types
- bh.storage.Double(): Doubles for weighted values (default)
- bh.storage.Int64(): 64-bit unsigned integers
- bh.storage.Unlimited(): Starts small, but can go up to unlimited precision ints or doubles.
- bh.storage.AtomicInt64(): Threadsafe filling, experimental. Does not support growing axis in threads.
- bh.storage.Weight(): Stores a weight and sum of weights squared.
- bh.storage.Mean(): Accepts a sample and computes the mean of the samples (profile).
- bh.storage.WeightedMean(): Accepts a sample and a weight. It computes the weighted mean of the samples.
Accumulators
- bh.accumulator.Sum: High accuracy sum (Neumaier) - used by the sum method when summing a numerical histogram
- bh.accumulator.WeightedSum: Tracks a weighted sum and variance
- bh.accumulator.Mean: Running count, mean, and variance (Welford's incremental algorithm)
- bh.accumulator.WeightedMean: Tracks a weighted sum, mean, and variance (West's incremental algorithm)
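
The two algorithms named in the accumulator list can be sketched in plain Python. This is an illustration of Neumaier summation and Welford's incremental update, not the library's implementation; the function names are hypothetical, and the use of the sample variance (dividing by count - 1) is an assumption of this sketch.

```python
def neumaier_sum(values):
    """Compensated (Neumaier) summation of floats."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for v in values:
        t = total + v
        if abs(total) >= abs(v):
            comp += (total - t) + v  # low-order bits of v were lost
        else:
            comp += (v - t) + total  # low-order bits of total were lost
        total = t
    return total + comp


def welford(samples):
    """Running count, mean, and sample variance in a single pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the current mean
    for x in samples:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


data = [1.0, 1e100, 1.0, -1e100]

# A plain left-to-right loop loses both 1.0 terms
naive = 0.0
for v in data:
    naive += v

compensated = neumaier_sum(data)
count, mean, variance = welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```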
Histogram operations
- h.ndim: The number of dimensions
- h.size or len(h): The number of bins
- +: Add two histograms (storages must match types currently)
- *=: Multiply by a scalar (not all storages) (hist * scalar and scalar * hist supported too)
- /=: Divide by a scalar (not all storages) (hist / scalar supported too)
- .kind: Either bh.Kind.COUNT or bh.Kind.MEAN, depending on storage
- .storage_type: Fetch the histogram storage type
- .sum(flow=False): The total count of all bins
- .project(ax1, ax2, ...): Project down to listed axis (numbers). Can also reorder axes.
- .to_numpy(flow=False, view=False): Convert to a NumPy style tuple (with or without under/overflow bins)
- .view(flow=False): Get a view on the bin contents (with or without under/overflow bins)
- .values(flow=False): Get a view on the values (counts or means, depending on storage)
- .variances(flow=False): Get the variances if available
- .counts(flow=False): Get the effective counts for all storage types
- .reset(): Set counters to 0 (growing axis remain the same size)
- .empty(flow=False): Check to see if the histogram is empty (can check flow bins too if asked)
- .copy(deep=False): Make a copy of a histogram
- .axes: Get the axes as a tuple-like (all properties of axes are available too)
  - .axes[0]: Get the 0th axis
  - .axes.edges: The lower values as a broadcasting-ready array
  - .axes.centers: The centers of the bins broadcasting-ready array
  - .axes.widths: The bin widths as a broadcasting-ready array
  - .axes.metadata: A tuple of the axes metadata
  - .axes.size: A tuple of the axes sizes (size without flow)
  - .axes.extent: A tuple of the axes extents (size with flow)
  - .axes.bin(*args): Returns the bin edges as a tuple of pairs (continuous axis) or values (discrete axis)
  - .axes.index(*args): Returns the bin index at a value for each axis
  - .axes.value(*args): Returns the bin value at an index for each axis
Indexing - Supports UHI Indexing
- Bin content access / setting
  - v = h[b]: Access bin content by index number
  - v = h[{0:b}]: All actions can be represented by axis:item dictionary instead of by position (mostly useful for slicing)
- Slicing to get histogram or set array of values
  - h2 = h[a:b]: Access a slice of a histogram, cut portions go to flow bins if present
  - h2 = h[:, ...]: Using : and ... supported just like NumPy
  - h2 = h[::sum]: Third item in slice is the "action"
  - h[...] = array: Set the bin contents, either include or omit flow bins
- Special accessors
  - bh.loc(v): Supply value in axis coordinates instead of bin number
  - bh.underflow: The underflow bin (use empty beginning on slice for slicing instead)
  - bh.overflow: The overflow bin (use empty end on slice for slicing instead)
- Special actions (third item in slice)
  - sum: Remove axes via projection; if limits are given, use those
  - bh.rebin(n): Rebin an axis
NumPy compatibility
- bh.numpy provides faster drop in replacements for NumPy histogram functions
- Histograms follow the buffer interface, and provide .view()
- Histograms can be converted to NumPy style output tuple with .to_numpy()
Details
- All objects support copy/deepcopy/pickle
- Fully statically typed, tested with MyPy.

Installation

You can install this library from PyPI with pip:

python3 -m pip install boost-histogram

All the normal best-practices for Python apply; Pip should not be very old (Pip 9 is very old), you should be in a virtual environment, etc. Python 3.7+ is required; for older versions of Python (3.5 and 2.7), 0.13 will be installed instead, which is API equivalent to 1.0, but will not be gaining new features. 1.3.x was the last series to support Python 3.6.

Binaries available:

The easiest way to get boost-histogram is to use a binary wheel, which happens when you run the above command on a supported platform. Wheels are produced using cibuildwheel; all common platforms have wheels provided in boost-histogram:

System	Arch	Python versions	PyPy versions
ManyLinux2014	64-bit	3.7, 3.8, 3.9, 3.10, 3.11, 3.12	3.7, 3.8, 3.9, 3.10
ManyLinux2014	ARM64	3.7, 3.8, 3.9, 3.10, 3.11, 3.12	3.7, 3.8, 3.9, 3.10
MuslLinux_1_1	64-bit	3.7, 3.8, 3.9, 3.10, 3.11, 3.12
macOS 10.9+	64-bit	3.7	3.7, 3.8, 3.9, 3.10
macOS Universal2	Arm64	3.8, 3.9, 3.10, 3.11, 3.12
Windows	32 & 64-bit	3.7, 3.8, 3.9, 3.10, 3.11, 3.12
Windows	64-bit		3.7, 3.8, 3.9, 3.10

manylinux2014: Requires pip 19.3.
ARM on Linux is supported. PowerPC or IBM-Z available on request.
macOS Universal2 wheels for Apple Silicon and Intel provided for Python 3.8+ (requires Pip 21.0.1 or newer).

If you are on a Linux system that is not part of the "many" in manylinux or musl in musllinux, such as ClearLinux, building from source is usually fine, since the compilers on those systems are often quite new. It will just take longer to install when it is using the sdist instead of a wheel. All dependencies are header-only and included.

Conda-Forge

The boost-histogram package is available on conda-forge, as well. All supported variants are available.

conda install -c conda-forge boost-histogram

Source builds

For a source build, for example from an "SDist" package, the only requirement is a C++14 compatible compiler. The compiler requirements are dictated by Boost.Histogram's C++ requirements: gcc >= 5.5, clang >= 3.8, or msvc >= 14.1. You should have a version of pip less than 2-3 years old (10+).

Boost is not required or needed (this only depends on included header-only dependencies). You can install directly from GitHub if you would like.

python -m pip install git+https://github.com/scikit-hep/boost-histogram.git@develop

Developing

See CONTRIBUTING.md for details on how to set up a development environment.

Contributors

We would like to acknowledge the contributors that made this project possible (emoji key):

Henry Schreiner 🚧 💻 📖	Hans Dembinski 🚧 💻	N!no ⚠️ 📖	Jim Pivarski 🤔	Nicholas Smith 🐛	physicscitizen 🐛	Chanchal Kumar Maji 📖
Doug Davis 🐛	Pierre Grimaud 📖	Beojan Stanislaus 🐛	Popinaodude 🐛	Congqiao Li 🐛	alexander-held 🐛	Chris Burr 📖
Konstantin Gizdov 📦 🐛	Kyle Cranmer 📖	Aman Goel 📖 💻	Jay Gohil 📖

This project follows the all-contributors specification.

Talks and other documentation/tutorial sources

The official documentation is here, and includes a quickstart.

Acknowledgements

This library was primarily developed by Henry Schreiner and Hans Dembinski.

Support for this work was provided by the National Science Foundation cooperative agreement OAC-1836650 (IRIS-HEP) and OAC-1450377 (DIANA/HEP). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

boost-histogram's Issues

Remaining tasks

These are the key elements to releasing the beta (an alpha might come sooner):

Beta release plan

I will merge the remaining PRs Monday morning. The Numpy module will be provisional, with a warning on import. I'll possibly do some minor polish (such as work on the examples) and then aim for a beta release the day before my histogram talk at PyHEP.

This will not set the API in stone, but it will only be changed carefully and with good reason, hopefully avoiding invalidating training material if possible. The contents of core (and even the name) are not part of the API. Accumulators may have a warning (at least in the training material, just to reserve possible changes).

Sharing serialization implementation with boost::histogram

I changed the implementation/serialization of accumulators::mean in develop. This may happen again at any time, which shows me that maintaining two serialization implementations is brittle. The serialization code is tricky to write, complex, and deeply tied to the implementations of the classes in C++, so we need to find a way for boost-histogram to use the underlying implementation of boost::histogram.

The problem with the boost::histogram version right now is that it is hard-coded to use boost::serialization. In the long-term, this could be generalized, but in the short term, I think another solution is possible. Instead of adding boost::serialization to the externs, we could add some fake headers to this project which mimic paths and namespaces of boost::serialization on the surface so that the boost::histogram implementation of the serialization code works, while actually calling our stand-ins.

I would like to work on this for 0.5

Apply clang-format to all files

My editor is set up to apply clang-format automatically on each save. Some files are not clang-format, e.g. register_histogram.hpp. If I edit such a file, it creates a huge diff because of the formatting.

To prevent this, please clang-format all files once.

Adaptive support of operators

The register_histogram_by_type should statically check what operations the histogram type supports and then enable only those. Infrastructure to achieve this is already implemented in boost::histogram and should be ported to boost-histogram, see
https://github.com/boostorg/histogram/blob/7301d6a4c2d2500245648388ab7a9709985a1f4f/include/boost/histogram/detail/meta.hpp#L137
and the following lines 215, 218, 221, 224, 227.

These would be used like so:

if (has_operator_rmul<histogram_t, double>::value)
   hist.def(py::self *= double());

Don't include master header "boost/histogram.hpp"

The master header "boost/histogram.hpp" includes almost all other headers. This causes all files in this project to be recompiled if any header in boost::histogram changes, even if they are not affected by this change. Some people in boost think we shouldn't even provide master headers because of this issue, but they are very useful for getting started quickly, and that seems to outweigh the potential problems. This project should not include the master header, though.

Work on indexed accesses

There should be support for the indexed iterator. Something like this:

for item in hist.indexed():
    item.index(0) # current index along first axis
    item.index(1) # current index along second axis
    item.bin(0) # current bin interval along first axis
    item.bin(1) # current bin interval along second axis
    item.content = 3 # Set and read bin contents

This could provide the same shortcuts axes already provide, like .bins() and .centers(). So, for example, this would be the way you evaluate a function f(x,y,...) and fill a histogram with the values:

for item in hist.indexed():
    item.content = f(*item.centers())

Also add a way to access the contents in the bins through numpy arrays:

X, Y = hist.centers() # similar to mgrid output

This could remove the need to add indexed; the alternative would be, for functions that accept arrays and for histogram storages that can be converted to buffers (see #35 for an idea to allow the more common syntax for setting, and might support different storages):

hist.view() = f(*hist.centers())

And, for functions that do not, but still for storages that can be converted to buffers:

it = np.nditer([hist.view(), hist.centers()], [], [['readwrite'], ['readonly']])
with it:
    for (content, center) in it:
        content[...] = f(*center)

This would do two unnecessary copies due to the way 'readwrite' on nditer was written. If you want bin edges, etc., that has to be added to the lists.

Add GitHub Actions reformatter

I want to remove the clang-format check. It has caused me only trouble and it is annoying to see a change fail just because the format check thinks some headers should be ordered slightly differently.

Automation should save time, and so far this is only causing an annoyance.

clang-format issues on my computer

clang-format produces a very different formatting for register_histogram.hpp on my computer. I have clang-format version 7.0.0-3~ubuntu0.18.04.1 (tags/RELEASE_700/final)

h.view() for all storage types

Check builds with pyproject.toml present

Adding pyproject.toml (for the pre-commit config) may have activated PEP 517 mode. Make sure this doesn't break the versions of PIP that had buggy half-support for PEP 517 mode, especially when building. Since we don't use the file to ask for anything, I think it will be okay, but it needs to be checked/verified.

UHI repo

We should copy UHI description to a separate repository (local version can mention current status in boost-histogram).

Rework or remove examples for current version

These were from the old bindings. Also add them to testing so they don't break in the future.

Numpy compat features

Add Numpy like functions

Add to_numpy (Produces Numpy-like output)
Add from_numpy (Takes Numpy-like arguments and makes a histogram)
Add histogram1d, etc... shortcuts (= from_numpy(...).to_numpy())

A special Numpy-backed storage could also be an interesting idea, but likely not needed.

This also covers having good compatibility with numpy:

Make sure .view() works with all storages
Buffer should either be supported completely, not at all, or conditionally on some storages.

Support for static histograms

Static tuples of axes could be supported except for project.

Find Python Package or combined python import

It would be nice to have something like find_python_package that mirrors find_package. It would set X_FOUND and one can then conditionally add the test targets when all Python packages are available.

find_python_package(pytest)
find_python_package(pytest-benchmark)

if (pytest_FOUND AND pytest_benchmark_FOUND)
  enable_testing()
  add_subdirectory(tests) # or similar
else()
  message(STATUS "Tests disabled, please do `pip install pytest pytest-benchmark` to enable") # or similar
endif()

Originally posted by @HDembinski in #64 (comment)

Combine the packages in python_import and provide a combined error message.

Partial projection

Add slice/shrink + project support.

Python 2.7+ Windows extra requirement

The Python 2.7 package on Windows requires the 2015+ MSVC redistributable. This is a requirement of PyBind11 (or any C++11+ package built with Python 2.7 on Windows). The question is, what is the best course of action:

Put a note on the readme about this requirement
Check and print a warning in the setup.py
Check and print a warning or specialize the import in __init__.py (something I would like to avoid touching too much for other reasons)
Somehow redistribute the redistributable in the wheel

I've currently just done item 1. Up for discussion if any more is needed. I'm hoping that the 2.7 + Windows combination won't be common...

Note: Anyone using Windows should always have the redistributables, such as recommended here. But they might not (only 2008 is required to run Python).

This is the unhelpful error message a user sees if they don't have the 2015 redistributable:

>>> import boost.histogram as bh
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: DLL load failed: The specified module could not be found.

Zenodo DOI

We should have this starting with 0.5.

Support for multiple weight variations

A feature I recently implemented in pygram11 allows an array of input data to be histogrammed with multiple weight variations in a single function call. In ATLAS (and I'm sure most HEP experiments) we carry around a lot of MC generator weights and derive uncertainties from many scale factor variations-- in the downstream parts of an analysis we end up comparing a histogrammed variable using one set of weights to the same variable histogrammed with a different weight variation. For a histogram with non-fixed width binning there's actually a nice speedup from avoiding repeating the binary search to grab the necessary bin index.

I'm not sure how this might fit into the object-focused design of boost-histogram, but I just wanted to mention it as an idea for a feature-- perhaps if there ends up being some kind of factory to create histograms from a function call which has the data to be histogrammed as an argument.
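
The idea can be sketched in plain Python: look up each bin index once with a binary search, then reuse it for every weight set. This is only an illustration of the technique, not pygram11's or boost-histogram's API; the function name is hypothetical, and out-of-range values are simply dropped here.

```python
from bisect import bisect_right


def fill_weight_variations(edges, data, weight_sets):
    """Histogram `data` once per weight set, searching for each bin only once."""
    n_bins = len(edges) - 1
    sums = [[0.0] * n_bins for _ in weight_sets]
    for i, x in enumerate(data):
        if not (edges[0] <= x < edges[-1]):
            continue  # out-of-range values are dropped in this sketch
        idx = bisect_right(edges, x) - 1  # single binary search per entry
        for sums_k, weights in zip(sums, weight_sets):
            sums_k[idx] += weights[i]  # reuse idx for every variation
    return sums


edges = [0.0, 1.0, 3.0, 10.0]          # variable-width bins
data = [0.5, 2.0, 2.5, 7.0, 12.0]
nominal = [1.0, 1.0, 1.0, 1.0, 1.0]
varied = [1.5, 0.5, 1.25, 1.0, 1.0]    # made-up weight variation

nominal_hist, varied_hist = fill_weight_variations(edges, data, [nominal, varied])
```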

Projection issue with UHI

UHI projection is reversed (my fault).

fix projection axes
allow axes to be removed by projection

Name of main import

Currently, I have the name of the main folder as boost.histogram. Should we keep it that way, or change it? This is actually the main compiled library, so if we use boost_histogram, then we will need to call it boost_histogram.core or something like that, then import from it. Might need to look into keeping the type names reasonable in that case. @jpivarski expressed some concerns for boost.histogram, probably because we don't own boost on PyPI?

axis interface cleanup

remove axis.bins(...), make axis iterable instead
expose bin objects, or yield tuples/values? I think yielding tuples/values is better.
nice __repr__ for each axis
remove axis.update(...) for growing axis [it is not to be called by users and there is no need to unit test this in Python as it is already unit tested in C++]

Nice reprs

We need to provide nice reprs for histograms (and a few other objects need nicer reprs than the current default PyBind11 ones, too). The repr can use Numpy to format a preview of the histogram.

Have nice, Python-runnable reprs for axes
Find some way to have a nice bins view or repr (handled a bit differently, but valid)
Have a nice repr for histogram, maybe also Python runnable

Switch to fill method

Remaining feature:

Support string category fills

The C++ Boost::Histogram has a fill method now. With this, filling a std::vector<axis::variant<...>> based histogram is as fast as std::tuple<...> based. Therefore, the specializations for std::tuple histograms can be dropped. This reduces the amount of static code that has to be generated by a lot, which should make the binary smaller and reduce the compilation time. The code in histogram_fill.hpp has to be completely replaced. Multi-threading support has to be done inside Boost::Histogram now and cannot be done at the level of the Python bindings.

The following types remain:

histogram<Axes, default_storage>
histogram<Axes, dense_storage<double>>
histogram<Axes, weight_storage>
histogram<Axes, profile_storage>
histogram<Axes, weighted_profile_storage>

Where Axes = std::vector<axis::variant<all-supported-axis-types>>

I won't be able to work on this for the next two weeks due to vacation. @henryiii if you want to work on it before that, go ahead.

Rename src/histogram/accumulators.cpp to src/register_accumulators.cpp etc

It is generally a good idea to call a file after the class or function if the file only includes a single class or function. For me it would be much more logical to rename accumulators and the other implementation files to register_something so it matches the functions in these files.

Let's benchmark before and after #65

Let's benchmark before and after 499f429, to see if there's a difference in any-histogram.

Also TODO: Remove the MPL limit setting.

Originally posted by @henryiii in #65 (comment)

Renaming axes::any

I don't like the name axes::any, because the actual type is std::vector<axis::variant<...>>. I would prefer the name axis_vector. I think we can call this any way we like, because it is not exposed to Python. The problem is that axes::any does not clarify that this is a vector.

Move factory_meta stuff to Python

I shouldn't have to recompile the code whenever this changes.

    // This factory makes a class that can be used to create histograms and also be used in is_instance
    py::object factory_meta_py = py::module::import("boost.histogram_utils").attr("FactoryMeta");

    m.attr("histogram")
        = factory_meta_py(m.attr("make_histogram"),
                          py::make_tuple(hist.attr("any_double"), hist.attr("any_unlimited"), hist.attr("any_weight")));

`copy` and `deepcopy`

Adding __copy__ and __deepcopy__ everywhere seems weird. Is pybind11 not calling the copy ctor of the C++ instance when a __copy__ or __deepcopy__ is requested?

I think a better solution than doing this in register_histogram

        .def("__deepcopy__",
             [](const histogram_t &self, py::object memo) {
                 histogram_t *a  = new histogram_t(self);
                 py::module copy = py::module::import("copy");
                 for(unsigned i = 0; i < a->rank(); i++) {
                     bh::unsafe_access::axis(*a, i).metadata() = copy.attr("deepcopy")(a->axis(i).metadata(), memo);
                 }
                 return a;
             })

is to add a copy ctor to metadata_t which does the deep copy. Then the deep copy is automatically done in the copy ctors of the axis types and by the copy ctor of histogram. Doing it in metadata_t is better encapsulation.

project, shrink, rebin interface as in C++

Project, etc are currently member functions of histogram. This is breaking the interface consistency to the C++ version. Reduce should be a free function with rebin, shrink etc as keyword parameters.

Methods vs. Properties

I have mostly been leaving everything as similar as possible to Boost::Histogram. There are some possible Pythonizations we could investigate, especially for things that do not exist in the same form in Boost::Histogram.

Names: Python expects classes to have CamelCase. Except for the standard library and numpy. I think we should stay lowercase to match Boost::Histogram, though the classes really can't map 1:1 to templated classes.
Methods in Boost::Histogram: we can keep these the same or make them properties. Sometimes it is helpful to have a flow=False extra parameter, as well.
Methods not in Boost::Histogram: some of these are there to make Python easier to write, such as .centers(). They might be slightly more pythonic as .centers, but some of these might have the option to have flow=False, so I'm inclined to leave them as methods.
Settable methods: These are clear, they should be properties. Like ind.content.
Names: Should indexed's setter be .content or .contents to set the value contained in the bin? There probably are other cases.

We should settle down on these before the release in a few days.

Increase to C++17 standard?

It would simplify the code in several places if we could use C++17 in the wrapper. If we distribute binaries anyway, we could require this recent standard for the compilation. Is there a reason to stick to C++14?

bh.project rename

This is based on some feedback I received in a recent discussion:

In Boost.Histogram, mathematics, and other places, you associate project with the axes you project onto, rather than the axes you remove. The axes you remove are "integrated out". So h project 1 would be expected to project an Nd histogram to the 1 axis, rather than integrate the 1 axis out.

With that knowledge, should we provide a different name instead of project for UHI (which needs to move to a separate repository)? UHI's design means this only affects the axis you use it on - h[:,:,::project] has two remaining axes, not one.

We could provide both, as another option.

TL;DR: h[::bh.project] is replaced by h[::bh.remove]. bh.project will exist for a while, but will emit a DeprecationWarning.

@jpivarski, @HDembinski, @benkrikler, thoughts?

Fix docs

Looks like docs went out with the module rename.

Unlimited storage pickling

The following test fails (disabled for now):

s = bh.storage.unlimited()
pickle.loads(pickle.dumps(s, -1))

(And, the unlimited storage in a histogram fails, too, of course)

The error is:

RuntimeError: make_tuple(): unable to convert argument of type 'boost::histogram::unlimited_storage<std::__1::allocator<char> >::buffer_type' to Python object

So the parts in storage do not have Python types, causing it to fail pickling.

Signatures in Python 3

In Python 3, we can replace the signature of **kwargs with the actual keywords-only argument signature. Until we drop Python 2, this will at least mean Python 3 users are not dragged through the ugliness of Python 2.

Won't be ready for PyHEP, though.

all pytest calls fail after rebasing to latest develop

python3 -m pytest ../tests/test_histogram_internal.py
ERROR: usage: pytest.py [options] [file_or_dir] [file_or_dir] [...]
pytest.py: error: unrecognized arguments: --benchmark-disable
  inifile: /Users/hdembins/Code/boost-histogram/tests/pytest.ini
  rootdir: /Users/hdembins/Code/boost-histogram/tests

The current directory is a build folder inside the repo. make test also generates failures for each pytest.

My pytest version is 4.4.1. It was working fine on an older checkout of the develop branch.

Port to boost::histogram multi-threaded storage

How should we proceed with this? The develop branch now has support for multi-threaded fills for any axis and all dense storage types. It doesn't work yet for the unlimited_storage; that will come in the not so near future, because I expect it to be very difficult to get right. I want to first get some experience with the current threaded capabilities.

Do you want to keep your current self-made multi-threaded storage for comparison or should I replace it with the boost::histogram one?

Rename storages

I think that a type called storage::double_/storage.double looks quite weird in C++ and in Python. Since in C++ the storages are not put in a sub-namespace, they shouldn't be in boost-histogram, either. I would like to rename the C++ type to double_storage and likewise the corresponding Python type. The storages shouldn't be in their own submodule, because that's not how it is in C++ either. To not put them in a namespace was an intentional choice in C++. Whether it really was the right choice I don't know, but I want boost-histogram to be consistent with boost::histogram.

Vectorize accumulators

This is a placeholder for a discussion on the best way to provide a vectorized accumulator interface in Python. This is not directly mappable to C++ (since in C++ a loop is not 100x slower than an array, and C++14 does not really have anything like Numpy).

Possible options so far:

Something in the constructor

Possibly nicest Python interface: bh.accumulator.sum(arr) creates a pre-filled accumulator from an array.

Specific vectorized function

Probably bh.accumulator.sum().fill(arr), nice symmetry with Histograms.

Vectorized call operator

Makes for very ugly Python: bh.accumulator.sum()(arr). It does not play well with duck typing, and modifying an object with a call operator is very ugly (note: mean already has a 1-item call operator). I don't like the idea of adding call operators to sum, where C++ doesn't even have them.

Vectorized operators

Like the one above, but uses += arr for sum. It is not possible to add weights for weighted_sum, however.

Consistency in flow

Currently, many methods support flow=bool, but the default might not be consistent. Here are the options:

Make flow=True the default everywhere. Then a user can use regular_noflow if they really want to avoid the under/overflow bins, or use flow=False manually everywhere.
- Possible followup: rename regular_noflow to regular, and regular to regular_flow or similar.
Make flow=False the default everywhere. This makes the overflow basically a detail that doesn't bother users unless they ask for it, but doesn't force them to turn it off completely.
Be inconsistent about usage and confuse everyone.

I think we are choosing option 2. This includes changing np.asarray to ignore flow bins.

Smart slicing

TODO: Add setting support

This will be closed when setting support is added, along with underflow/overflow.

The current version of this proposal lives here: https://boost-histogram.readthedocs.io/en/latest/usage/indexing.html.

Don't allocate small objects

Allocation of small objects is a performance killer. An allocation from the heap costs between O(100) and O(10000) cpu cycles. An allocation on the stack costs 1 cpu cycle. Therefore, small objects with short life-time should be allocated on the stack and not the heap.

Therefore, the following code is wasteful:

/// Get and remove a value from a keyword argument dict, or return a empty pointer
template <typename T>
std::unique_ptr<T> optional_arg(py::kwargs &kwargs, const char *name) {
    if(kwargs.contains(name)) {
        return std::make_unique<T>(py::cast<T>(kwargs.attr("pop")(name)));
    } else {
        return std::unique_ptr<T>();
    }
}

C++17 has std::optional for this and boost has boost::optional. This code should use those facilities or simply return a pybind11 handle without casting. The handle can be null, so it can act as an optional thing without allocating anything new.

`.at` Docstring update

I'm adding _at_set so that the Python code can add __setitem__. We cannot support h.at(i) = 2 (invalid Python syntax), whereas [] can - so that should probably be the replacement.

Should we rename at to _at_get? Or should we try to provide a replacement? I'm thinking of the following:

h.at(-1) = 3 # syntax error

We can do this using [] indexing:

h[bh.underflow] = 3

But should there be a way to use -1? Should there be an .at_set(3, -1), or h[bh.at(-1)] = 3? If we leave .at public, then .at(-1) does this, just setting is an issue.

Fix the accumulator interface

The __init__ and __call__ interface should be removed.

See arguments for storage interface.

Remove storage interface

The storage classes have some interface exposed in Python. I would like to remove this. Neither users nor our tests should touch the storage directly. The storage implementation is tested in boost::histogram, so there is no need to test it in Python. Furthermore, storage functionality can always be tested indirectly by testing a histogram that uses this storage.

I don't want any interface exposed on the storages except their constructor. Exposing more is giving users wrong ideas. In general, we have to be stingy with public interface, because everything that is exposed as public we cannot change easily later. Therefore, it is better to expose as little as possible, so we maintain as much flexibility as possible in changing things.

clang-format pipeline checks out externals...

... which is not needed and slows down the test.

Throw error when creating histogram with too many axes

The following test succeeds, but it should not:

def test_out_of_limit_axis():
    for i in range(36):
        ax = (bh.axis.regular(1, -1, 1, underflow=False, overflow=False) for a in range(i))
        bh.histogram(*ax)

Bin lifetime bug

Found this issue in my presentation notes:

While axis.bin(X) usually works, axis.bins() may have corrupted bins. For example:

import boost.histogram as bh
bh.axis.regular_log(10,1,5).bins()

[<bin [1.000000, 1.000000]>,
 <bin [1.000000, inf]>,
 <bin [inf, inf]>,
 <bin [inf, inf]>,
 <bin [inf, inf]>,
 <bin [inf, inf]>,
 <bin [inf, inf]>,
 <bin [inf, inf]>,
 <bin [inf, inf]>,
 <bin [inf, inf]>]

(varies randomly)
