The boost-histogram beta release

boost-histogram logo

The foundational histogramming package for Python, boost-histogram, hit beta status with version 0.6! This is a major update to the new Boost.Histogram bindings. Since I have not written about boost-histogram yet here, I will introduce the library in its current state. Version 0.6.2 was based on the recently released Boost C++ Libraries 1.72 Histogram package. Feel free to visit the docs, or keep reading this post.

This Python library is part of a larger picture in the Scikit-HEP ecosystem of tools for Particle Physics and is funded by DIANA/HEP and IRIS-HEP. It is the core library for making and manipulating histograms. Other packages are under development to provide a complete set of tools to work with and visualize histograms. The Aghast package is designed to convert between popular histogram formats, and the Hist package will be designed to make common analysis tasks simple, like plotting via tools such as the mplhep package. Hist and Aghast will be initially driven by HEP (High Energy Physics and Particle Physics) needs, but outside issues and contributions are welcome and encouraged.

The key design feature of the boost-histogram package (and the C++ library on which it is based) is the idea of a configurable and highly performant Histogram object. A histogram is made of two things: a storage and a collection of axes.

Histogram design

There are a preset collection of storages in boost-histogram, with different datatypes. Storages can hold simple data types (Int64, Double, and AtomicInt64), a special mutating Unlimited datatype that starts as an 8 bit int and grows, and even converts to double if weights are used (or, currently, if a view is requested). There are also several accumulators. You can use WeightedSum to track an error1, or Mean / WeightedMean to implement a “profile” histogram.

Axes come in a variety of flavors. Regular provides even bin spacings, Variable provides arbitrary bin spacings, Integer is a high performance simple integer spacing, and IntCategory / StrCategory provide arbitrary non-continuous categories. Each of these has options, such as growth=, which allows the axis to grow when out-of-range fills are made, overflow= and underflow=, which hold out-of-range fills on continuous axes, and circular=, which causes an axis to wrap around. The Regular axis also supports transforms, which map values in and out of the regular binning space. A few common transforms are provided, such as Pow(v), sqrt, and log, and users can supply a pair of CTypes function pointers to build new transforms with full performance. Feel free to use numba.cfunc to write your transforms directly in Python!

Usage example

Let’s take a look at what using the library looks like. Let’s start with a simple, contrived 1D dataset:

import numpy as np
import boost_histogram as bh
import matplotlib.pyplot as plt

norm_vals = np.concatenate(
    [
        np.random.normal(loc=5, scale=1, size=1_000_000),
        np.random.normal(loc=2, scale=0.2, size=200_000),
        np.random.normal(loc=8, scale=0.2, size=200_000),
    ]
)

This has a large Gaussian peak at 5, with two narrow peaks at 2 and 8. Using the boost-histogram API, we would make a histogram object with a regular binning. We want 100 bins from 0 to 10:

hist = bh.Histogram(
    bh.axis.Regular(100, 0, 10),
)

Now, we can call .fill to fill the histogram in-place:

hist.fill(norm_vals)
Histogram(Regular(100, 0, 10), storage=Double()) # Sum: 1399999.0 (1400000.0 with flow)

The histogram has a nice representation on the command line that includes the sum (both with and without the extra flow bins). If we want to plot densities, boost-histogram does not currently have a .density method/property, but you can easily use the built-in properties to compute it:

# Compute the "volume" of each bin (useful for 2D+)
volumes = np.prod(hist.axes.widths, axis=0)

# Compute the density of each bin
density = hist.view() / hist.sum() / volumes

The above code works on any number of dimensions, not just 1D histograms!

Now, plotting is also simple, using the same collection of properties:

plt.bar(*hist.axes.centers, density, *hist.axes.widths)

1D histogram example plot

Since axes returns a tuple of items, we use * to unpack the only item (axes[0]), but you can just directly call axes[0] if you prefer.

We could use any other axis, or a transformed axis, and the downstream code would not change:

hist = bh.Histogram(
    bh.axis.Regular(100, 1, 10, transform=bh.axis.transform.log),
)

hist.fill(norm_vals)

volumes = np.prod(hist.axes.widths, axis=0)
density = hist.view() / hist.sum() / volumes

plt.bar(*hist.axes.centers, density, *hist.axes.widths)

1D histogram example plot with log axis

NumPy comparison

If you are familiar with NumPy, the equivalent code for the simple regular binning in NumPy would be:

%%timeit
bins, edges = np.histogram(norm_vals, bins=100, range=(0, 10))
17.4 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Of course, you then are either left on your own to compute centers, density, widths, and more, or in some cases you can change the computation call itself to add density=, or use the matching function inside Matplotlib, and the API is different if you want 2D or ND histograms. But if you already use NumPy histograms and you really don’t want to rewrite your code, boost-histogram has adaptors for the three histogram functions in NumPy:

%%timeit
bins, edges = bh.numpy.histogram(norm_vals, bins=100, range=(0, 10))
7.3 ms ± 55.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is only a hair slower2 than using the raw boost-histogram API, and is still a nice performance boost over NumPy. You can even use the NumPy syntax if you want a boost-histogram object later:

hist = bh.numpy.histogram(norm_vals, bins=100, range=(0, 10), histogram=bh.Histogram)

You can later get a NumPy style output tuple from a histogram object:

bins, edges = hist.to_numpy()

So you can transition your code slowly to boost-histogram.

ND histograms

All the above concepts (except for plotting details) expand gracefully to multiple dimensions. The .axes object can be indexed as a tuple, or it can directly call any method or property of the axes and return the results as a tuple. If the result is an array, the returned iterable of arrays will be ready for broadcasting, as well; this is why the above density calculation works regardless of the number of axes in your histogram.

h2 = bh.Histogram(bh.axis.Regular(400, -2, 2), bh.axis.Regular(200, -1, 1))

data = np.random.multivariate_normal((0, 0), ((1, 0), (0, 0.5)), 10_000_000).T.copy()

h2.fill(*data)

# pcolormesh requires xy indexing, we produce ij for consistency with higher dimensions
x, y = h2.axes.edges
plt.pcolormesh(x.T, y.T, h2.view().T)

# Or you could do:
edges = h2.axes.edges
plt.pcolormesh(*np.broadcast_arrays(h2.axes.edges), h2)

2D histogram example plot

Histograms support the NumPy array protocol, so can often be used directly in places like plotting code instead of calling .view().

We can check the performance against NumPy again; NumPy does not do well with regular spaced bins in more than 1D:

%%timeit
np.histogram2d(*data, bins=(400, 200), range=((-2,2), (-1, 1)))
1.31 s ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
bh.numpy.histogram2d(*data, bins=(400, 200), range=((-2,2), (-1, 1)))
101 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

For more than one dimension, boost-histogram is more than an order of magnitude faster than NumPy for regular spaced binning. Although optimizations may be added to boost-histogram for common axes combinations later, in 0.6.2, all axes combinations share a common code base, so you can expect at least this level of performance regardless of the axes types or number of axes!

Unified Histogram Indexing (UHI)

One of the key developments in boost-histogram is the indexing system, known as UHI. It was designed so that the indexing “tags” could be shared across libraries, and (in some cases) be implemented by users. The design of UHI follows NumPy indexing when NumPy is valid, that is, h[foo] == h.view()[foo] for all foo that is valid for both histograms and arrays. Any other valid foo for one should be invalid for the other. For example, histograms do not support integer steps. NumPy arrays do not support the callables listed below.

For a histogram, the slice should be thought of like this:

histogram[start:stop:action]

The start and stop can be either a bin number (following Python rules), or a callable; the callable will get the axis being acted on and should return an extended bin number (-1 and len(ax) are flow bins). A provided callable is bh.loc, which converts from axis data coordinates into bin number.

The final argument, action, is special. A general API is being worked on, but for now, bh.sum will “project out” or “integrate over” an axes, and bh.rebin(n) will rebin by an integral factor. Both work correctly with limits; bh.sum will remove flow bins if given a range. h[0:len:bh.sum] will sum without the flow bins.

Here are a few examples that highlight the functionality of UHI:

Example 1:

You want to slice axis 0 from 0 to 20, axis 1 from .5 to 1.5 in data coordinates, axis 2 needs to have double size bins (rebin by 2), and axis 3 should be summed over. You have a 4D histogram.

Solution:

ans = h[:20, bh.loc(-0.5) : bh.loc(1.5), :: bh.rebin(2), :: bh.sum]

Example 2:

You want to set all bins above 4.0 in data coordinates to 0 on a 1D histogram.

Solution:

h[bh.loc(4.0) :] = 0

You can set with an array, as well. The array can either be the same length as the range you give, or the same length as the range + under/overflows if the range is open ended (no limit given). For example:

h = bh.Histogram(bh.axis.Regular(10, 0, 1))
h[:] = np.ones(10)  # underflow/overflow still 0
h[:] = np.ones(12)  # underflow/overflow now set too

Note that for clarity, while basic NumPy broadcasting is supported, axis-adding broadcasting is not supported; you must set a 2D histogram with a 2D array or a scalar, not a 1D array.

Example 3:

You want to sum from -infinity to 2.4 in data coordinates in axis 1, leaving all other axes alone. You have an ND histogram, with N >= 2.

Solution:

ans = h[:, : bh.loc(2.4) : bh.sum, ...]

Notice that last example could be hard to write if the axis number, 1 in this case, was large or programmatically defined. In these cases, you can pass a dictionary of {axis:slice} into the indexing operation. A shortcut to quickly generate slices is provided, as well:

ans = h[{1: slice(None, bh.loc(2.4), bh.sum)}]

# Identical:
s = bh.tag.Slicer()
ans = h[{1: s[: bh.loc(2.4) : bh.sum]}]

Example 4:

You want the underflow bin of a 1D histogram.

Solution:

val = h1[bh.underflow]

See more about indexing here.

Analyses using axes

Taken together, the flexibility in axes and the tools to easily sum over axes can be applied to transform the way you approach analysis with histograms. For example, let’s say you are presented with the following data in a 3xN table:

Data Details
value
is_valid True or False
run_number A collection of integers

In a traditional analysis, you might bin over value where is_valid is True, and then make a collection of histograms, one for each run number. With boost-histogram, you can make a single histogram, and use an axis for each:

value_ax = bh.axis.Regular(100, -5, 5)
bool_ax = bh.axis.Integer(0, 2, underflow=False, overflow=False)
run_number_ax = bh.axis.IntCategory([], growth=True)

hist = bh.Histogram(value_ax, bool_ax, run_number_ax)

hist.fill([-2, 2, 4, 3], [True, False, True, True], [3, 5, 5, 7])
Histogram(
  Regular(100, -5, 5),
  Integer(0, 2, underflow=False, overflow=False),
  IntCategory([3, 5, 7], growth=True),
  storage=Double()) # Sum: 4.0

Now, you can use these axes to create a single histogram that you can fill. If you want to get a histogram of all run numbers and just the True is_valid selection, you can use a sum:

hist[:, bh.loc(True), :: bh.sum]
Histogram(Regular(100, -5, 5), storage=Double()) # Sum: 3.0

If you want to pick a single category value, just use bh.loc(item). A range or set of values cannot be selected, but you can use a generator expression:

from functools import reduce
from operator import add

reduce(add, (hist[:, bh.loc(True), bh.loc(item)] for item in {3, 5}))
Histogram(Regular(100, -5, 5), storage=Double()) # Sum: 2.0

You should note that in the above examples, the index and the bin are the same for boolean axes (or any integer axis that starts at 0). So you can use True directly instead of bh.loc(True) if you prefer.

You can expand this example to any number of dimensions, boolean flags, and categories.

Accumulator storages

There are three accumulator-based storages. WeightedSum keeps track of both a weighted sum (like Double) and also a variance. Profile histograms track a Mean (or a WeightedMean) instead of a sum; this is available as well. These types return accumulators when asked for a single value, or a NumPy record array-like view when asked for a view. This is done without copying the data, and you can also access a few computed properties. Here’s an example:

hm = bh.Histogram(
    bh.axis.Regular(20, -1.5, 1.5, underflow=False, overflow=False),
    storage=bh.storage.Mean(),
)

# Interesting data
data = np.random.multivariate_normal(
    (0, 1), ((0.1, 0.3), (0.3, 1.0)), size=10_000
).T.copy()

# Fill with sample
hm.fill(data[0], sample=data[1])

# Plot
fig, ax = plt.subplots()
ax.hist2d(*data, bins=100, cmap="gray_r")
ax.errorbar(
    hm.axes[0].centers, hm.view().value, yerr=np.sqrt(hm.view().variance), fmt="ro"
)

Profile histogram example plot

Here we have overlaid a Matplotlib 2D histogram with a boost-histogram 1D profile histogram.

The little details

The library respects Python conventions and interacts well with IPython wherever possible. Python 2 is still supported for the 1.0 release so that Python 2 users will be able to access histogramming, but care was taken to ensure Python 3 users do not suffer for this. For example, if you inspect a function signature, you will see the proper Python 3 signature if you are in Python 3:

Python 3 + IPython

bh.axis.Regular?
Init signature:
bh.axis.Regular(
    bins,
    start,
    stop,
    *,
    metadata=None,
    underflow=True,
    overflow=True,
    growth=False,
    circular=False,
    transform=None,
)

Python 2 + IPython

bh.axis.Regular?
Init signature:
bh.axis.Regular(
    bins,
    start,
    stop,
    **kwargs,
)

This means tab completion in IPython for Python 3 will properly suggest the keyword completions as you write, etc.

The histogram object behaves nicely when adding two histograms, multiplying or dividing by a scalar (depending on the storage), comparing histograms, printing histograms (try it with 1D!), setting with an array/view, accessing as a view, copying or deep-copying, and with any other expected Python manipulations or behaviours that are applicable to histograms. Currently, mismatched axes types or storages cannot be added, though this restriction may be carefully relaxed slightly in the future.

The library supports pickle3, and special care was taken to make sure histograms provide excellent pickle performance even with accumulator storages.

import boost_histogram as bh
import pickle
from pathlib import Path

h = bh.Histogram(bh.axis.Regular(2, -1, 1))

h_saved = Path("h_saved.pkl")

# Now save the histogram
with h4_saved.open("wb") as f:
    pickle.dump(h, f, protocol=-1)

# And load
with h_saved.open("rb") as f:
    h = pickle.load(f)

Threaded filling

You can use boost-histogram to do threaded filling, as well; the GIL is released during the filling process. Version 0.7.0 has a threads= argument that can be used in regular or numpy filling. If you want to implement it yourself, such as in 0.6.2, you can. For example, using concurrent.futures, you can do:

from concurrent.futures import ThreadPoolExecutor


def threaded_fill(hist, threads, *data):
    def fun(*args):
        return hist.copy().reset().fill(*args)

    chunks_list = [np.array_split(d, threads) for d in data]

    with ThreadPoolExecutor(threads) as pool:
        results = pool.map(fun, *chunks_list)

    for h in results:
        hist += h

Then, a fill would look like this:

import os

threads = os.cpu_count()
threaded_fill(hist, threads, vals)

This can give you a 2-4x speedup over single-threaded filling, depending on the situation and the number of cores. The number of items in the fill must be large enough to offset the cost of making histogram copies and summing them afterwards.

Obtaining boost-histogram

While the only requirement to build boost-histogram is a C++14 compatible compiler, this still takes some time and is not always available. A wheel building system was set up for boost-histogram, and is now being used in other Scikit-HEP packages. Each release produces a large collection of wheels that pip will install in a couple of seconds on most systems.

System Arch Python versions
ManyLinux1 (custom GCC 9.2) 64 & 32-bit 2.7, 3.5, 3.6, 3.7, 3.8
ManyLinux2010 64-bit 2.7, 3.5, 3.6, 3.7, 3.8
macOS 10.9+ 64-bit 2.7, 3.6, 3.7, 3.8
Windows 64 & 32-bit 2.7, 3.6, 3.7, 3.8

You simply install with:

pip install boost-histogram

and pip will give you the appropriate wheel in most cases, and will fall back on a source build if you are not covered in the above list. Linux distributions such as Alpine and ClearLinux are usually the only outliers, but those tend to have modern compilers.

We also support Conda through Conda-Forge:

conda install -c conda-forge boost-histogram

All Conda-Forge platforms are supported, including ARM and PowerPC. The one exception is Python 2.7 on Windows, which is not supported due to the C++14 requirements.

Acknowledgements

This library was developed by Henry Schreiner and Hans Dembinski, and is based on the Boost.Histogram library, developed by Hans Dembinski and first released as part of the Boost libraries in version 1.70.

Support for this work was provided by the National Science Foundation cooperative agreement OAC-1836650 (IRIS-HEP) and OAC-1450377 (DIANA/HEP). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.


  1. In ROOT, this is a histogram with .Sumw2() called before filling. ↩︎

  2. About a 5% penalty for the above example. If you use the bins='auto' feature of NumPy histograms, boost-histogram will be slower, because it calls NumPy to study the data to compute the bin edges. You can still use it to produce a boost-histogram object, however, so it is still quite useful. ↩︎

  3. Any protocol except protocol 0. In Python 2, this is the default, though several higher levels are available, and protocol 0 has dismal performance. ↩︎

comments powered by Disqus