UHI 1.0: histogram serialization

UHI 1.0 is out, with a major new feature: a histogram serialization spec! This spec supports multiple formats (HDF5, ZIP, and JSON initially), and can be supported by multiple libraries (boost-histogram and hist initially). There’s also a new test suite helper for libraries targeting the UHI indexing spec.

Serialization

The big new feature is a serialization spec. Supported by boost-histogram 1.6.1+ and hist 2.9.0+ (both now released), this new system lets you save histograms to, and read them from, multiple file formats. We support HDF5, ZIP, and JSON files initially, and we expect to add ROOT and Zarr in the future. This is built around an intermediate representation (IR); histogram libraries produce this IR, and readers/writers for the file types above are implemented in the uhi library. If you’d rather, you can also implement your own readers/writers (for another language, for example).

The helpers in uhi are designed to integrate with the low-level tooling provided by the standard library (or the h5py package) and give you maximum flexibility in storing your histogram.

Both boost-histogram and hist allow you to directly pass histograms to uhi’s functions instead of the IR; if a type has _to_uhi_() defined, it will be converted. And if you pass IR into a histogram constructor, it will be converted to a histogram for you.
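To make the two conversion directions concrete, here is a minimal pure-Python sketch. It is not how boost-histogram or hist implement this: ToyHistogram is a hypothetical class, and the IR keys shown are simplified assumptions for illustration (see the UHI serialization spec for the real schema). Only the _to_uhi_() method name comes from the text above.

```python
class ToyHistogram:
    """Hypothetical 1D histogram used only to illustrate the round trip."""

    def __init__(self, source):
        # Accept either an object that offers _to_uhi_() or an IR dict.
        ir = source._to_uhi_() if hasattr(source, "_to_uhi_") else source
        axis = ir["axes"][0]
        self.lower = axis["lower"]
        self.upper = axis["upper"]
        self.counts = list(ir["storage"]["values"])

    def _to_uhi_(self):
        # Simplified, assumed IR layout: one regular axis and dense values.
        return {
            "axes": [
                {
                    "type": "regular",
                    "lower": self.lower,
                    "upper": self.upper,
                    "bins": len(self.counts),
                }
            ],
            "storage": {"type": "double", "values": list(self.counts)},
        }


ir = {
    "axes": [{"type": "regular", "lower": 0.0, "upper": 1.0, "bins": 3}],
    "storage": {"type": "double", "values": [1.0, 2.0, 3.0]},
}

h = ToyHistogram(ir)       # IR -> histogram
h_copy = ToyHistogram(h)   # histogram -> IR -> histogram, via _to_uhi_()
```

The same duck-typed check is what lets a writer accept either a histogram or plain IR.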

JSON

The simplest format, JSON, is only recommended for small histograms. The format is basically the same as the IR, with just an added specification for how to serialize arrays. Here’s how you can use it:

from hist import Hist
import json
import uhi.io.json

# make some h = Hist...
ob = json.dumps(h, default=uhi.io.json.default)
hist_ir = json.loads(ob, object_hook=uhi.io.json.object_hook)
h_loaded = Hist(hist_ir)

You can see that this takes advantage of json’s mechanism for custom converters; if you want to add other objects, you can. You are expected to store your histogram(s) under names inside a .json file.
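If you haven’t used json’s converter hooks before, the sketch below shows the mechanism on its own, using only the standard library. The "__array__" tag and the two toy functions are assumptions made up for this example; they are not the encoding that uhi.io.json uses.

```python
import json
from array import array


def toy_default(obj):
    """Called by json.dumps for objects it cannot serialize itself."""
    if isinstance(obj, array):
        # Tag the payload so the object hook can recognize it when loading.
        return {"__array__": obj.typecode, "data": obj.tolist()}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def toy_object_hook(dct):
    """Called by json.loads for every decoded JSON object."""
    if "__array__" in dct:
        return array(dct["__array__"], dct["data"])
    return dct


# Several named entries in one document, keyed by name.
payload = {
    "first": array("d", [1.0, 2.0]),
    "second": array("d", [0.0, 4.0, 5.0]),
}
text = json.dumps(payload, default=toy_default)
loaded = json.loads(text, object_hook=toy_object_hook)
```

Chaining your own converter with a library-provided one works the same way: handle the types you know, then delegate the rest.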

ZIP

A nearly identical format, the ZIP format stores the data as NumPy arrays in separate files inside the zip file; the metadata lives in a .json file, also inside the zip file, with references to the stored data files.

from hist import Hist
import zipfile
import uhi.io.zip

# Make some h = Hist...
with zipfile.ZipFile("myfile.zip", "w") as zip_file:
    uhi.io.zip.write(zip_file, "histogram", h)

with zipfile.ZipFile("myfile.zip", "r") as zip_file:
    hist_ir = uhi.io.zip.read(zip_file, "histogram")

h_loaded = Hist(hist_ir)

You can store histograms with any name; above "histogram" was chosen. You run these methods on an open zip file, allowing you to add other things to the file if needed. You can also choose things like the compression level.
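As a small sketch of what working on an open zip file buys you, the example below uses only the standard library: it sets the compression method and level, and stores an extra member next to where a histogram would go. The in-memory buffer and the "notes.json" member are made up for illustration, and the uhi call is shown only as a comment.

```python
import io
import json
import zipfile

buffer = io.BytesIO()  # in-memory stand-in for a file on disk

# compression and compresslevel are standard zipfile.ZipFile arguments.
with zipfile.ZipFile(
    buffer, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
) as zip_file:
    # A call such as uhi.io.zip.write(zip_file, "histogram", h) would go here.
    # Since this is an ordinary open ZipFile, other members can be added too.
    zip_file.writestr("notes.json", json.dumps({"note": "stored alongside"}))

with zipfile.ZipFile(buffer, "r") as zip_file:
    names = zip_file.namelist()
    notes = json.loads(zip_file.read("notes.json"))
```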

HDF5

The most advanced format, and a good one for large histograms and/or structured data, is the HDF5 format. You need the h5py library to use it. This format maps the metadata into HDF5 primitives. Using it looks like this:

from hist import Hist
import h5py
import uhi.io.hdf5

# Make some h = Hist...
with h5py.File("myfile.hdf5", "w") as h5_file:
    uhi.io.hdf5.write(h5_file.create_group("histogram"), h)

with h5py.File("myfile.hdf5", "r") as h5_file:
    hist_ir = uhi.io.hdf5.read(h5_file["histogram"])

h_loaded = Hist(hist_ir)

You can place this inside any (including nested) group, allowing full flexibility of the hierarchical data format. You can also control the minimum size for compression with min_compress_elements (default: 1,000), or pass through custom compression and compression_opts.

Special thanks to Aryaman Jeendgar, who developed the initial implementation of HDF5 support!

Sparse histograms

You can also support sparse histograms with UHI. boost-histogram (as of 1.6.1) does not currently support sparse histograms, but you can still convert the UHI IR from dense to sparse before storing it, and convert it back after loading:

import uhi.io

hist_ir_sparse = uhi.io.to_sparse(hist_ir)
hist_ir_dense = uhi.io.from_sparse(hist_ir_sparse)

These methods do nothing if the histogram is already in the target format, so calling from_sparse on every histogram you load into boost-histogram/hist is safe if you want to load arbitrary histograms. Keep in mind that sparse storage is not more efficient unless your histogram is fairly sparse (2-3x more empty cells than filled cells).
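To illustrate both points (the no-op behavior and the break-even reasoning), here is a toy pure-Python sketch. It is not uhi’s implementation: the toy_* functions, the dict layout, and the cost model in stored_numbers are all assumptions chosen to keep the example small.

```python
def toy_to_sparse(hist):
    """Dense list -> sparse dict; already-sparse input is returned unchanged."""
    if isinstance(hist, dict):
        return hist
    index = [i for i, value in enumerate(hist) if value != 0]
    return {
        "size": len(hist),
        "index": index,
        "values": [hist[i] for i in index],
    }


def toy_from_sparse(hist):
    """Sparse dict -> dense list; already-dense input is returned unchanged."""
    if isinstance(hist, list):
        return hist
    dense = [0.0] * hist["size"]
    for i, value in zip(hist["index"], hist["values"]):
        dense[i] = value
    return dense


def stored_numbers(filled, total, ndim):
    """Assumed cost model: dense stores one number per cell, while sparse
    stores ndim index entries plus one value for each filled cell."""
    return total, filled * (ndim + 1)


dense = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0]
sparse = toy_to_sparse(dense)
round_trip = toy_from_sparse(sparse)
```

Under this assumed cost model, a 2D histogram only comes out ahead once empty cells outnumber filled cells by more than two to one, and a 3D one by more than three to one, which is in the same ballpark as the rule of thumb above.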

Type helpers

The types can be expressed as TypedDicts, so uhi.typing.serialization is provided with various useful types, mostly for the IR. There’s also ToUHIHistogram, a Protocol for an object that can be converted via a _to_uhi_() method. The rest of the types come in two forms: those with “Any” in the name support all variations, while the others are Unions of specific Axis or Storage types.
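If TypedDict and Protocol are new to you, the sketch below shows the general idea with the standard typing module. ToyRegularAxis, SupportsToUHI, and Counter1D are hypothetical stand-ins written for this example; the real definitions live in uhi.typing.serialization and may differ in fields and names.

```python
from typing import Any, Dict, Protocol, TypedDict, runtime_checkable


class ToyRegularAxis(TypedDict):
    """Hypothetical axis record; a type checker verifies keys and value types."""

    type: str
    lower: float
    upper: float
    bins: int


@runtime_checkable
class SupportsToUHI(Protocol):
    """Hypothetical stand-in for a protocol such as ToUHIHistogram."""

    def _to_uhi_(self) -> Dict[str, Any]: ...


class Counter1D:
    """Matches the protocol structurally; no inheritance is needed."""

    def _to_uhi_(self) -> Dict[str, Any]:
        axis: ToyRegularAxis = {
            "type": "regular",
            "lower": 0.0,
            "upper": 1.0,
            "bins": 4,
        }
        return {"axes": [axis]}


matches = isinstance(Counter1D(), SupportsToUHI)
plain_object_matches = isinstance(object(), SupportsToUHI)
```

At runtime a TypedDict is just a dict, so these types add no overhead; their value is in what a static type checker can catch.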

Testing your library

UHI contains a testing framework that you can plug your own histogram library into to check whether it implements UHI indexing correctly. For example, if your library provides a histogram class called my.Histogram, you could test it like this:

import my  # hypothetical: your own histogram library
import uhi.testing.indexing


class TestAccess1D(uhi.testing.indexing.Indexing1D[my.Histogram]):
    @classmethod
    def make_histogram(cls) -> my.Histogram:
        return my.Histogram(cls.get_uhi())

There are also 2D and 3D tests. Placing this in your test suite will cause pytest/unittest to pick up and run dozens of tests to make sure you implement UHI indexing correctly!

The helper has cls.get_uhi() to get a UHI serialization IR that you can use; if you don’t support that, you can set up your histogram manually. Details are in the test source files.
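The general pattern behind a helper like this (a reusable set of checks that each library plugs into by overriding make_histogram) can be sketched with the standard unittest module. Everything below is hypothetical: ToyIndexing1D, its two checks, and ListHistogram are invented for illustration and are not UHI’s actual test classes or test cases.

```python
import unittest
from typing import Generic, TypeVar

T = TypeVar("T")


class ToyIndexing1D(Generic[T]):
    """Hypothetical reusable checks; a library supplies make_histogram()."""

    @classmethod
    def make_histogram(cls) -> T:
        raise NotImplementedError

    def test_first_bin(self) -> None:
        self.assertEqual(self.make_histogram()[0], 1.0)

    def test_last_bin(self) -> None:
        self.assertEqual(self.make_histogram()[2], 3.0)


class ListHistogram:
    """Toy histogram whose bins are just a list of values."""

    def __init__(self, values):
        self.values = list(values)

    def __getitem__(self, index):
        return self.values[index]


class TestListHistogram(ToyIndexing1D[ListHistogram], unittest.TestCase):
    @classmethod
    def make_histogram(cls) -> ListHistogram:
        return ListHistogram([1.0, 2.0, 3.0])


# Run the inherited checks against the toy library.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestListHistogram)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the checks live in the base class, every new check added there runs against every library that subclasses it, with no changes on the library side.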

Currently we provide the core indexing helpers; more might be added in the future.

New features in boost-histogram 1.6(.1)

To support this, boost-histogram received a 1.6 release, which can produce the uhi IR. This release also adds support for Python 3.14, 3.14t, iOS, Windows ARM, and GraalPy. There’s a new integrated diagnostic test suite that can be used to quickly check if boost-histogram was built correctly without dependencies or SDist files. You can now set __dict__ in the histogram constructor, just as you can for axes. Lots of nice fixes went into this release too: setting a range with a scalar, setting ranges with Histograms (which fixes *= too, since that sets with itself), better rebinning with edges, and lots of clang-tidy cleanups that should also reduce the number of refcount operations we do internally. We also now recommend np.s_ instead of our custom slicer, since it does the same thing.

New features in hist 2.9

Hist 2.9, which supports boost-histogram 1.5 and 1.6, supports serialization as well. There’s now a legend=True parameter for stacked plots, fill_flattened works with string arguments, and label/name propagation was fixed when casting histograms.

All three projects now upload nightly wheels to the Scientific-Python nightly wheels index, and all three have dropped support for Python 3.8.

Future plans

There are lots of things we could add, like CLI utilities to validate files, convert files, add files, and more. We might work on a higher-level interface in Hist, as well. We’d love to hear from you what you need, and we’d be happy to help you implement something if you want to get involved.

Special thanks to the decision-making team of UHI for all the work on this standard!

Try it out today and see what you think!