UHI 1.0 is out, with a major new feature: a new histogram serialization spec! This spec supports multiple formats (HDF5, zip, and JSON initially), and can be supported by multiple libraries (Boost-histogram/hist initially). There’s also a new test suite helper for libraries targeting the UHI indexing spec.
Serialization
The big new feature is a serialization spec. Supported by boost-histogram 1.6.1+ and hist 2.9.0+ (which are also now out), this new system lets you save and read histograms from multiple locations. We support hdf5, zip, and json files initially, and we expect to add ROOT and Zarr in the future. This is built around an intermediate representation (IR); histogram libraries can produce this IR, and readers/writers are implemented in the uhi library for the above-mentioned file types. If you’d rather, you can also implement your own readers/writers (for another language, for example).
The helpers in uhi are designed to integrate with the low level tooling provided by the standard library (or the h5py package) and give you maximum flexibility in storing your histogram.
Both boost-histogram and hist allow you to directly pass histograms to uhi’s functions instead of the IR; if a type has _to_uhi_() defined, it will be converted. And if you pass IR into a histogram constructor, it will be converted to a histogram for you.
JSON
The simplest format, JSON, is only recommended for small histograms. The format is basically the same as the IR, with just an added specification for how to serialize arrays. Here’s how you can use it:
from hist import Hist
import json
import uhi.io.json
# make some h = Hist...
ob = json.dumps(h, default=uhi.io.json.default)
hist_ir = json.loads(ob, object_hook=uhi.io.json.object_hook)
h_loaded = Hist(hist_ir)
You can see that this takes advantage of json’s mechanism for custom converters; if you want to add other objects, you can. It’s expected that you store your histogram(s) with names inside a .json file.
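To illustrate the converter mechanism itself, here is a minimal sketch using only the standard library. RunInfo and the two hook functions are hypothetical names made up for this example; they are not part of uhi. The comments mark where you would hand off to uhi.io.json.default and uhi.io.json.object_hook if you wanted histograms and your own objects in the same file.

```python
import json
from dataclasses import dataclass


@dataclass
class RunInfo:
    """Hypothetical extra object stored next to the histograms."""

    run: int
    detector: str


def default(obj):
    # json.dumps calls this only for objects it cannot serialize itself.
    if isinstance(obj, RunInfo):
        return {"__run_info__": True, "run": obj.run, "detector": obj.detector}
    # With uhi installed, you could fall through to uhi.io.json.default(obj) here.
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def object_hook(dct):
    # json.loads calls this for every decoded JSON object, innermost first.
    if dct.get("__run_info__"):
        return RunInfo(run=dct["run"], detector=dct["detector"])
    # With uhi installed, you could return uhi.io.json.object_hook(dct) here.
    return dct


payload = {"info": RunInfo(run=42, detector="barrel")}
text = json.dumps(payload, default=default)
restored = json.loads(text, object_hook=object_hook)
```

The tag key "__run_info__" is just one way to recognize your own objects on the way back in; any marker that cannot collide with histogram fields would do.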
ZIP
A nearly identical format, the ZIP format allows you to store the data as NumPy arrays in separate files; the metadata lives in a .json file inside the zip file, with references to the stored data files.
from hist import Hist
import zipfile
import uhi.io.zip
# Make some h = Hist...
with zipfile.ZipFile("myfile.zip", "w") as zip_file:
    uhi.io.zip.write(zip_file, "histogram", h)
with zipfile.ZipFile("myfile.zip", "r") as zip_file:
    hist_ir = uhi.io.zip.read(zip_file, "histogram")
h_loaded = Hist(hist_ir)
You can store histograms with any name; above, "histogram" was chosen. You run these methods on an open zip file, allowing you to add other things to the file if needed. You can also choose things like the compression level.
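As a rough sketch of what "an open zip file" buys you, the standard library's zipfile module lets you pick the compression method and level and write extra members alongside the histogram. The member names notes.json and README.txt are made up for this example, an in-memory buffer stands in for a real file, and the uhi call is left as a comment so the snippet runs with the standard library alone.

```python
import io
import json
import zipfile

buffer = io.BytesIO()  # in-memory stand-in for "myfile.zip"

# Compression method and level are chosen when opening the archive.
with zipfile.ZipFile(
    buffer, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
) as zip_file:
    # uhi.io.zip.write(zip_file, "histogram", h) would go here.
    # Extra members can live next to the histogram in the same archive.
    zip_file.writestr("notes.json", json.dumps({"author": "me", "run": 42}))
    zip_file.writestr("README.txt", "Histograms from run 42\n")

with zipfile.ZipFile(buffer, "r") as zip_file:
    names = sorted(zip_file.namelist())
    notes = json.loads(zip_file.read("notes.json"))
```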
HDF5
The most advanced format, and a good one for large histograms and/or structured data, is the HDF5 format. You need the h5py library to use it. This format maps the metadata into HDF5 primitives. Using it looks like this:
from hist import Hist
import h5py
import uhi.io.hdf5
# Make some h = Hist...
with h5py.File("myfile.hdf5", "w") as h5_file:
    uhi.io.hdf5.write(h5_file.create_group("histogram"), h)
with h5py.File("myfile.hdf5", "r") as h5_file:
    hist_ir = uhi.io.hdf5.read(h5_file["histogram"])
h_loaded = Hist(hist_ir)
You can place this inside any (including nested) group, allowing full flexibility of the hierarchical data format. You can also control the minimum size for compression with min_compress_elements (default: 1,000), or pass through custom compression and compression_opts.
Special thanks to Aryaman Jeendgar who developed the initial implementation of HDF5 support!
Sparse histograms
UHI also supports sparse histograms. boost-histogram (as of 1.6.1) does not support sparse histograms itself, but you can still convert the UHI IR from dense to sparse before storing it, and convert it back after loading:
hist_ir_sparse = uhi.io.to_sparse(hist_ir)
hist_ir_dense = uhi.io.from_sparse(hist_ir_sparse)
These methods do nothing if the histogram is already in the target format, so calling from_sparse on every histogram you load into boost-histogram/hist is safe if you want to load arbitrary histograms. Keep in mind that sparse storage is not more efficient unless your histogram is fairly sparse (2-3x more empty cells than filled cells).
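To see where a break-even point like that comes from, here is a toy coordinate-list layout in plain Python. This is an assumption for illustration only: the actual sparse layout in the UHI spec may differ, and dense_to_coo, coo_to_dense, and stored_numbers are hypothetical helpers, not uhi functions.

```python
def dense_to_coo(dense):
    """Return (coords, values) for the non-zero cells of a 2D nested list."""
    coords, values = [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                coords.append((i, j))
                values.append(value)
    return coords, values


def coo_to_dense(coords, values, shape):
    """Rebuild the dense 2D nested list from coordinates and values."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for (i, j), value in zip(coords, values):
        dense[i][j] = value
    return dense


def stored_numbers(coords, values):
    # Each filled cell costs one value plus one index per dimension.
    return len(values) + sum(len(c) for c in coords)


dense = [
    [0, 0, 5, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 0],
]
coords, values = dense_to_coo(dense)
dense_cost = sum(len(row) for row in dense)  # one number per cell
sparse_cost = stored_numbers(coords, values)  # three numbers per filled cell
roundtrip = coo_to_dense(coords, values, (3, 4))
```

In this toy 2D layout every filled cell costs three numbers, so the sparse form only wins once there are more than twice as many empty cells as filled ones, which is in line with the rule of thumb above.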
Type helpers
The types can be expressed as TypedDicts, so uhi.typing.serialization is provided with various useful types, mostly for the IR, but there’s also a ToUHIHistogram, which is a Protocol for an object that can be converted via a _to_uhi_() method. The rest of the types come in two forms: ones with “Any” in the name support all variations, while the others are Unions of specific Axis or Storage types.
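For readers less familiar with Protocols, the following sketch shows the general pattern of structural typing on a _to_uhi_() method. SupportsToUHI is a hypothetical stand-in written for this example, not the real ToUHIHistogram; MyHistogram and as_ir are likewise invented, and the returned dict is a placeholder rather than a valid IR.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsToUHI(Protocol):
    """Hypothetical stand-in for a protocol like ToUHIHistogram."""

    def _to_uhi_(self) -> dict[str, Any]: ...


class MyHistogram:
    """Toy histogram class; it never inherits from the protocol."""

    def __init__(self, counts: list[int]) -> None:
        self.counts = counts

    def _to_uhi_(self) -> dict[str, Any]:
        # Placeholder payload, not a valid UHI IR.
        return {"placeholder_counts": list(self.counts)}


def as_ir(obj: Any) -> dict[str, Any]:
    # Convert objects that implement the method; pass plain dicts through.
    if isinstance(obj, SupportsToUHI):
        return obj._to_uhi_()
    return obj


converted = as_ir(MyHistogram([1, 2]))
passed_through = as_ir({"already": "ir"})
```

Because the check is structural, any class that defines _to_uhi_() qualifies without importing or subclassing anything.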
Testing your library
UHI contains a test framework that you can plug your own histogram library into to check whether it passes. For example, if you are making a histogram library with a class called my.Histogram, then you could test it like this:
import uhi.testing.indexing
import my  # your histogram library
class TestAccess1D(uhi.testing.indexing.Indexing1D[my.Histogram]):
    @classmethod
    def make_histogram(cls) -> my.Histogram:
        return my.Histogram(cls.get_uhi())
There are also 2D and 3D tests. Placing this in your test suite will cause pytest/unittest to pick up and run dozens of tests to make sure you implement UHI indexing correctly!
The helper has cls.get_uhi() to get a UHI serialization IR that you can use; if you don’t support that, you can manually set up your histogram. Details are in the test source files.
Currently we provide the core indexing helpers; more might be added in the future.
New features in boost-histogram 1.6(.1)
To support this, boost-histogram received a 1.6 release, which can produce the uhi IR. This also adds support for Python 3.14, 3.14t, iOS, Windows ARM, and GraalPy. There’s a new integrated diagnostic test suite that can be used to quickly check if boost-histogram was built correctly without dependencies or SDist files. You can set __dict__ in the histogram constructor now, just like axes. Lots of nice fixes went into this release too, like setting a range with a scalar, setting ranges with Histograms (fixes *= too, since that sets with itself), better rebinning with edges, and lots of clang-tidy cleanups that should also reduce the number of refcounts we do internally. We also now recommend np.s_ instead of our custom slicer, since it does the same thing.
New features in hist 2.9
Hist 2.9, which supports boost-histogram 1.5 and 1.6, supports serialization as well. There’s now a legend=True parameter for stacked plots, fill_flattened works with string arguments, and label/name propagation was fixed when casting histograms.
All three projects now publish nightly wheels to the Scientific-Python nightly wheels index, and all three have dropped support for Python 3.8.
Future plans
There are lots of things we could add, like CLI utilities to validate files, convert files, add files, and more. We might work on a higher-level interface in Hist, as well. We’d love to hear from you what you need, and we’d be happy to help you implement something if you want to get involved.
Special thanks to the decision making team of UHI for all the work on this standard!
Try it out today and see what you think!