Let’s compare several ways of making histograms in Python. I’m going to assume you would like to end up with a nice object-oriented histogram interface, so all the 2D methods will fill a Physt histogram. We will be filling a 2D histogram from a 2 × 1,000,000 element array, or a 1D histogram from 10,000,000 elements. All binnings are regular.
1D 10,000,000 item histogram
Example | KNL | MBP | X24 |
---|---|---|---|
NumPy: histogram | 704 ms | 147 ms | 114 ms |
NumPy: bincount | 432 ms | 110 ms | 117 ms |
fast-histogram | 337 ms | 45.9 ms | 45.7 ms |
Numba | 312 ms | 58.8 ms | 60.7 ms |
2D 1,000,000 item histogram
Example | KNL | MBP | X24 |
---|---|---|---|
Physt | 1.21 s | 293 ms | 246 ms |
NumPy: histogram2d | 456 ms | 114 ms | 88.3 ms |
NumPy: add.at | 247 ms | 62.7 ms | 49.7 ms |
NumPy: bincount | 81.7 ms | 23.3 ms | 20.3 ms |
fast-histogram | 53.7 ms | 10.4 ms | 7.31 ms |
fast-hist threaded 0.5 | (6) 62.5 ms | 9.78 ms | (6) 15.4 ms |
fast-hist threaded (m) | 62.3 ms | 4.89 ms | 3.71 ms |
Numba | 41.8 ms | 10.2 ms | 9.73 ms |
Numba threaded | (6) 49.2 ms | 4.23 ms | (6) 4.12 ms |
Cython | 112 ms | 12.2 ms | 11.2 ms |
Cython threaded | (6) 128 ms | 5.68 ms | (8) 4.89 ms |
pybind11 sequential | 93.9 ms | 9.20 ms | 17.8 ms |
pybind11 OpenMP atomic | 4.06 ms | 6.87 ms | 1.91 ms |
pybind11 C++11 atomic | (32) 10.7 ms | 7.08 ms | (48) 2.65 ms |
pybind11 C++11 merge | (32) 23.0 ms | 6.03 ms | (48) 4.79 ms |
pybind11 OpenMP merge | 8.74 ms | 5.04 ms | 1.79 ms |
Computers used:
- MBP: MacBook Pro (Retina, 13-inch, Early 2015) with 2.7 GHz Intel Core i5 (Dual core with Hyperthreading)
- X24: CentOS 7 with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores with Hyperthreading)
- KNL: CentOS 7 with Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz (64 cores with 4x Hyperthreading)
All computers were running Anaconda with Python 3.6 or 3.7. Where a number appears in parentheses, the thread count was reduced from cores × hyperthreads to the value that produced the best timing. On the KNL, Python-level threading did not seem to help at all, so 6 threads were chosen; also on the KNL, running an OpenMP pybind11 version seems to stop the other threaded versions from using threads.
Software:
- Physt: object-oriented histograms for Python.
- fast-histogram: high-speed simple histograms. In the tables, "0.5" is the released version and "(m)" is the master branch; install master with:
pip install git+https://github.com/astrofrog/fast-histogram.git
- Numba: a JIT compiler for Python, now available via pip. The 0.40 version supports array variables in reductions, but I was not able to come up with a way to use that to do parallel histograms.
- Cython: the C+Python hybrid language.
- pybind11: C++11 bindings for Python. I am also using
pip install git+https://github.com/aldanor/ipybind.git
for nice Jupyter integration.
Source code
Like other notebook-powered posts, you can access the notebook and run this yourself at gitlab.com/iscinumpy/iscinumpy.gitlab.io.
import numpy as np
import numba
import math
from physt import h2
import fast_histogram
from physt.histogram_nd import Histogram2D
from concurrent.futures import ThreadPoolExecutor
from functools import partial
Let’s start by setting up the histogram we will be making today. We’ll produce several different quantities, since each library expects something a bit different. Some of the methods require evenly spaced bins, which we have.
bins = (100, 100)
ranges = ((-1, 1), (-1, 1))
bins = np.asarray(bins).astype(np.int64)
ranges = np.asarray(ranges).astype(np.float64)
edges = (
    np.linspace(*ranges[0, :], bins[0] + 1),
    np.linspace(*ranges[1, :], bins[1] + 1),
)
Our 2D dataset will have 1,000,000 2D items. If you are on a many-core machine, you might see better performance for the parallel versions with 10M items instead.
np.random.seed(42)
vals = np.random.normal(size=[2, 1000000]).astype(np.float32)
vals1d = np.random.normal(size=[10000000]).astype(np.float32)
1D: NumPy histogram
%%timeit
h, _ = np.histogram(vals1d, bins=bins[0], range=(ranges[0,0],ranges[0,1]))
1D: NumPy bincount
%%timeit
c = ((vals1d[(vals1d>=ranges[0,0]) & (vals1d<ranges[0,1])] - ranges[0,0]) / (ranges[0,1] - ranges[0,0]) * bins[0]).astype(np.int_)
h = np.bincount(c, minlength=bins[0])
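As a quick cross-check (my addition, not part of the timings), we can rebuild the bincount histogram outside of %%timeit and compare it with np.histogram. Barring values sitting exactly on the upper edge (which np.histogram includes in the last bin but our filter excludes), this should print 0:
c = ((vals1d[(vals1d>=ranges[0,0]) & (vals1d<ranges[0,1])] - ranges[0,0]) / (ranges[0,1] - ranges[0,0]) * bins[0]).astype(np.int_)
h_bincount = np.bincount(c, minlength=bins[0])
h_numpy, _ = np.histogram(vals1d, bins=bins[0], range=(ranges[0,0], ranges[0,1]))
print("Should be 0:", (h_bincount - h_numpy).sum())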
1D: fast-histogram
%%timeit
h = fast_histogram.histogram1d(vals1d, bins=bins[0], range=(ranges[0,0],ranges[0,1]))
1D: Numba histogram
Even in Numba 0.40, this is not faster in parallel mode.
@numba.njit
def hist1d(v, b, r):
    return np.histogram(v, b, r)[0]
%%timeit
h = hist1d(vals1d, bins[0], (ranges[0,0],ranges[0,1]))
2D
The following examples are in 2D.
Physt
Let’s start by just asking Physt to do the histogramming itself.
%%timeit
h2(*vals, bins=edges)
NumPy
%%timeit
H, *ledges = np.histogram2d(*vals, bins=bins, range=ranges)
Histogram2D(ledges, H)
H, *ledges = np.histogram2d(*vals, bins=bins, range=ranges)
Add.at and Bincount
These manually build the histogram with add.at or bincount plus some filtering, taking advantage of the even bin spacing.
%%timeit
a = np.zeros(bins[0]*bins[1], dtype=np.int_)
cuts = (vals[0]>=ranges[0,0]) & (vals[0]<ranges[0,1]) & (vals[1]>=ranges[1,0]) & (vals[1]<ranges[1,1])
c0 = ((vals[0][cuts] - ranges[0,0]) / (ranges[0,1] - ranges[0,0]) * bins[0]).astype(np.int_)
c1 = ((vals[1][cuts] - ranges[1,0]) / (ranges[1,1] - ranges[1,0]) * bins[1]).astype(np.int_)
np.add.at(a, c0 + bins[0]*c1, 1)
Histogram2D(edges, a.reshape(*bins))
%%timeit
cuts = (vals[0]>=ranges[0,0]) & (vals[0]<ranges[0,1]) & (vals[1]>=ranges[1,0]) & (vals[1]<ranges[1,1])
c = ((vals[0,cuts] - ranges[0,0]) / (ranges[0,1] - ranges[0,0]) * bins[0]).astype(np.int_)
c += bins[0]*((vals[1,cuts] - ranges[1,0]) / (ranges[1,1] - ranges[1,0]) * bins[1]).astype(np.int_)
H = np.bincount(c, minlength=bins[0]*bins[1]).reshape(*bins)
Histogram2D(edges, H)
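As above, a quick cross-check (my addition): rebuild the bincount histogram outside of %%timeit and compare against NumPy. Note that this construction fills bins as [y, x], the transpose of np.histogram2d's [x, y] layout, which is harmless for our symmetric data but matters in general:
cuts = (vals[0]>=ranges[0,0]) & (vals[0]<ranges[0,1]) & (vals[1]>=ranges[1,0]) & (vals[1]<ranges[1,1])
c = ((vals[0,cuts] - ranges[0,0]) / (ranges[0,1] - ranges[0,0]) * bins[0]).astype(np.int_)
c += bins[0]*((vals[1,cuts] - ranges[1,0]) / (ranges[1,1] - ranges[1,0]) * bins[1]).astype(np.int_)
H = np.bincount(c, minlength=bins[0]*bins[1]).reshape(*bins)
Hnp, *_ = np.histogram2d(*vals, bins=bins, range=ranges)
# Transpose to align the two layouts before comparing
print("Should be 0:", (H.T - Hnp.astype(np.int_)).sum())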
Fast-histogram
%%timeit
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1,1),(-1,1)))
Histogram2D(edges, H)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)
Fast-histogram threaded
(Not much faster in 0.5, fixed in master)
%%timeit
splits = 6
with ThreadPoolExecutor(max_workers=splits) as pool:
    chunk = vals.shape[1] // splits
    # The last chunk runs to the end, so the remainder is not dropped
    chunk0 = [vals[0, i*chunk:(i+1)*chunk if i < splits - 1 else None] for i in range(splits)]
    chunk1 = [vals[1, i*chunk:(i+1)*chunk if i < splits - 1 else None] for i in range(splits)]
    f = partial(fast_histogram.histogram2d, bins=100, range=((-1,1),(-1,1)))
    results = pool.map(f, chunk0, chunk1)
results = sum(results)
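A quick check (my addition) that the summed partial histograms match the single-shot fast-histogram result:
splits = 6
chunk = vals.shape[1] // splits
chunk0 = [vals[0, i*chunk:(i+1)*chunk if i < splits - 1 else None] for i in range(splits)]
chunk1 = [vals[1, i*chunk:(i+1)*chunk if i < splits - 1 else None] for i in range(splits)]
f = partial(fast_histogram.histogram2d, bins=100, range=((-1, 1), (-1, 1)))
H_threaded = sum(map(f, chunk0, chunk1))
H_single = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1)))
print("Should be 0:", (H_threaded - H_single).sum())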
Numba
We need a custom function, since so far only the 1D np.histogram is supported natively.
# This is 8-9 times faster than np.histogram
# (Numba compiles this on the first call; an explicit signature would
# precompile it, but lazy compilation also lets the threaded version
# below pass in non-contiguous slices)
# In theory, this could even be threaded (nogil!)
@numba.njit(nogil=True, parallel=False)
def hist2d_numba_seq(tracks, bins, ranges):
    H = np.zeros((bins[0], bins[1]), dtype=np.uint64)
    delta = 1 / ((ranges[:, 1] - ranges[:, 0]) / bins)
    for t in range(tracks.shape[1]):
        i = (tracks[0, t] - ranges[0, 0]) * delta[0]
        j = (tracks[1, t] - ranges[1, 0]) * delta[1]
        if 0 <= i < bins[0] and 0 <= j < bins[1]:
            H[int(i), int(j)] += 1
    return H
%%timeit
H = hist2d_numba_seq(vals, bins=bins, ranges=ranges)
Histogram2D(edges, H)
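As with the pybind11 versions later, we can cross-check the Numba result against fast-histogram (my addition; both use half-open bins, unlike np.histogram2d, so this should print 0):
H = hist2d_numba_seq(vals, bins=bins, ranges=ranges)
Hfast = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1)))
print("Should be 0:", (H.astype(np.int64) - Hfast.astype(np.int64)).sum())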
Numba threaded
This is why we release the GIL in the definition above.
%%timeit
splits = 6
with ThreadPoolExecutor(max_workers=splits) as pool:
    chunk = vals.shape[1] // splits
    # The last chunk runs to the end, so the remainder is not dropped
    chunks = [vals[:, i*chunk:(i+1)*chunk if i < splits - 1 else None] for i in range(splits)]
    f = partial(hist2d_numba_seq, bins=bins, ranges=ranges)
    results = pool.map(f, chunks)
results = sum(results)
Cython
Everyone has to compare against Cython. I’m maxing out the compilation options to try to match Numba, but so far I have failed. Cython is also much slower to compile.
%load_ext Cython
%%cython -a -c-O3 -c-march=native
# cython: wraparound = False
# cython: cdivision = True
# cython: nonecheck = False
# cython: boundscheck = False
import numpy as np
cimport libc.math as math
cimport numpy as np
def cy_histogram2d(np.float32_t[:,:] tracks, long[:] bins, double[:,:] ranges):
    # The memoryview dtype must match the array dtype (int64 here)
    Hpy = np.zeros((bins[0], bins[1]), dtype=np.int64)
    cdef np.int64_t [:,:] H = Hpy
    cdef double delta0 = 1/((ranges[0,1] - ranges[0,0]) / bins[0])
    cdef double delta1 = 1/((ranges[1,1] - ranges[1,0]) / bins[1])
    cdef int t, i, j
    with nogil:
        for t in range(tracks.shape[1]):
            i = <int>math.floor((tracks[0,t] - ranges[0,0]) * delta0)
            j = <int>math.floor((tracks[1,t] - ranges[1,0]) * delta1)
            if 0 <= i < bins[0] and 0 <= j < bins[1]:
                H[i,j] += 1
    return Hpy
%%timeit
H = cy_histogram2d(vals, bins=bins, ranges=ranges)
Histogram2D(edges, H)
Cython threaded
We can use the same GIL-releasing trick in Cython, too.
%%timeit
splits = 8
with ThreadPoolExecutor(max_workers=splits) as pool:
    chunk = vals.shape[1] // splits
    # The last chunk runs to the end, so the remainder is not dropped
    chunks = [vals[:, i*chunk:(i+1)*chunk if i < splits - 1 else None] for i in range(splits)]
    f = partial(cy_histogram2d, bins=bins, ranges=ranges)
    results = pool.map(f, chunks)
H = sum(results)
Histogram2D(edges, H)
pybind11
We can set up threaded and unthreaded versions.
Note that we explicitly pass a number of threads; the default hardware concurrency can be a bit high.
%load_ext ipybind
To enable OpenMP support, add
-c="-fopenmp" -Wl="-fopenmp"
(Linux) or
-c="-Xpreprocessor" -c="-fopenmp" -Wl="-lomp"
(macOS with Homebrew-installed libomp) to the %%pybind11 magic.
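You can also cap OpenMP globally with the standard OMP_NUM_THREADS environment variable (my addition, not part of the original benchmark); it has to be seen before the OpenMP runtime starts its thread pool, so set it at the top of the notebook or in the shell:
import os

# Cap OpenMP's default team size; must be set before the first
# parallel region runs (i.e., before calling any of the OpenMP modules)
os.environ["OMP_NUM_THREADS"] = "6"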
%%pybind11 -c="-fopenmp" -Wl="-fopenmp"
#include <pybind11/numpy.h>
#include <vector>
py::array_t<long> py11_histogram2d_atomic(
py::array_t<float> xtrx,
py::array_t<float> ytrx,
long bins0, long bins1,
float ranges00, float ranges01,
float ranges10, float ranges11) {
auto xtrx_acc = xtrx.unchecked<1>();
auto ytrx_acc = ytrx.unchecked<1>();
std::vector<long> H(bins0 * bins1);
double delta0 = 1/((ranges01 - ranges00) / bins0);
double delta1 = 1/((ranges11 - ranges10) / bins1);
py::buffer_info buf_trks = xtrx.request();
size_t ntracks = buf_trks.size;
# pragma omp parallel for
for (size_t t=0; t<ntracks; t++) {
float i = (xtrx_acc(t) - ranges00) * delta0;
float j = (ytrx_acc(t) - ranges10) * delta1;
if (0 <= i && i < bins0 && 0 <= j && j < bins1)
# pragma omp atomic
H[size_t(i) * bins1 + size_t(j)]++;
}
py::array_t<long> Hpy({bins0, bins1});
auto Hpy_acc = Hpy.mutable_unchecked<2>();
for (long i=0; i<bins0; i++)
for (long j=0; j<bins0; j++)
Hpy_acc(i,j)=H[i * bins1 + j];
return Hpy;
}
PYBIND11_MODULE(example_atomic, m) {
m.def("py11_histogram2d_atomic", py11_histogram2d_atomic);
}
%%pybind11

#include <pybind11/numpy.h>
#include <atomic>
#include <thread>
#include <vector>

py::array_t<long> py11_histogram2d_threaded_atomic(
        py::array_t<float> xtrx,
        py::array_t<float> ytrx,
        long bins0, long bins1,
        float ranges00, float ranges01,
        float ranges10, float ranges11,
        size_t num_threads) {

    auto xtrx_acc = xtrx.unchecked<1>();
    auto ytrx_acc = ytrx.unchecked<1>();

    std::vector<std::atomic<long>> H(bins0 * bins1);
    double delta0 = 1 / ((ranges01 - ranges00) / bins0);
    double delta1 = 1 / ((ranges11 - ranges10) / bins1);

    py::buffer_info buf_trks = xtrx.request();
    size_t ntracks = buf_trks.size;

    auto chunked = [&](size_t start, size_t end) {
        for (size_t t = start; t < end; t++) {
            float i = (xtrx_acc(t) - ranges00) * delta0;
            float j = (ytrx_acc(t) - ranges10) * delta1;
            if (0 <= i && i < bins0 && 0 <= j && j < bins1)
                H[size_t(i) * bins1 + size_t(j)]++;
        }
    };

    if (num_threads == 0)
        num_threads = std::thread::hardware_concurrency();

    std::vector<std::thread> threadpool;
    size_t each = ntracks / num_threads;
    for (size_t i = 0; i < num_threads; i++) {
        // The last thread picks up the remainder of the events
        size_t end = (i + 1 == num_threads) ? ntracks : (i + 1) * each;
        threadpool.emplace_back(chunked, i * each, end);
    }
    for (auto& thread : threadpool)
        thread.join();

    py::array_t<long> Hpy({bins0, bins1});
    auto Hpy_acc = Hpy.mutable_unchecked<2>();
    for (long i = 0; i < bins0; i++)
        for (long j = 0; j < bins1; j++)
            Hpy_acc(i, j) = H[i * bins1 + j];

    return Hpy;
}

PYBIND11_MODULE(example_threaded_atomic, m) {
    m.def("py11_histogram2d_threaded_atomic",
          py11_histogram2d_threaded_atomic);
}
%%pybind11

#include <pybind11/numpy.h>
#include <thread>
#include <algorithm>
#include <vector>

py::array_t<long> py11_histogram2d_threaded_sep(
        py::array_t<float> xtrx,
        py::array_t<float> ytrx,
        long bins0, long bins1,
        float ranges00, float ranges01,
        float ranges10, float ranges11,
        size_t num_threads) {

    auto xtrx_acc = xtrx.unchecked<1>();
    auto ytrx_acc = ytrx.unchecked<1>();

    double delta0 = 1 / ((ranges01 - ranges00) / bins0);
    double delta1 = 1 / ((ranges11 - ranges10) / bins1);

    py::buffer_info buf_trks = xtrx.request();
    size_t ntracks = buf_trks.size;

    if (num_threads == 0)
        num_threads = std::thread::hardware_concurrency();

    // One private histogram per thread, merged at the end
    std::vector<long> H(num_threads * bins0 * bins1);

    auto chunked = [&](size_t thread, size_t start, size_t end) {
        for (size_t t = start; t < end; t++) {
            float i = (xtrx_acc(t) - ranges00) * delta0;
            float j = (ytrx_acc(t) - ranges10) * delta1;
            if (0 <= i && i < bins0 && 0 <= j && j < bins1)
                H[thread * bins0 * bins1 + size_t(i) * bins1 + size_t(j)]++;
        }
    };

    std::vector<std::thread> threadpool;
    size_t each = ntracks / num_threads;
    for (size_t i = 0; i < num_threads; i++) {
        // The last thread picks up the remainder of the events
        size_t end = (i + 1 == num_threads) ? ntracks : (i + 1) * each;
        threadpool.emplace_back(chunked, i, i * each, end);
    }
    for (auto& thread : threadpool)
        thread.join();

    py::array_t<long> Hpy({bins0, bins1});
    auto Hpy_acc = Hpy.mutable_unchecked<2>();
    for (long i = 0; i < bins0; i++)
        for (long j = 0; j < bins1; j++)
            Hpy_acc(i, j) = 0;
    for (size_t thread = 0; thread < num_threads; thread++)
        for (long i = 0; i < bins0; i++)
            for (long j = 0; j < bins1; j++)
                Hpy_acc(i, j) += H[thread * bins0 * bins1 + i * bins1 + j];

    return Hpy;
}

PYBIND11_MODULE(example_threaded_sep, m) {
    m.def("py11_histogram2d_threaded_sep", py11_histogram2d_threaded_sep);
}
%%pybind11 -c="-fopenmp" -Wl="-fopenmp"
#include <pybind11/numpy.h>
#include <vector>
#include <omp.h>
py::array_t<long> py11_histogram2d_sep(
py::array_t<float> xtrx,
py::array_t<float> ytrx,
long bins0, long bins1,
float ranges00, float ranges01,
float ranges10, float ranges11) {
double delta0 = 1/((ranges01 - ranges00) / bins0);
double delta1 = 1/((ranges11 - ranges10) / bins1);
py::buffer_info buf_trks = xtrx.request();
size_t ntracks = buf_trks.size;
auto xtrx_acc = xtrx.unchecked<1>();
auto ytrx_acc = ytrx.unchecked<1>();
py::array_t<long> Hpy({bins0, bins1});
auto Hpy_acc = Hpy.mutable_unchecked<2>();
for (long i=0; i<bins0; i++)
for (long j=0; j<bins1; j++)
Hpy_acc(i,j)=0;
# pragma omp parallel
{
size_t num_threads = omp_get_num_threads();
std::vector<long> H = std::vector<long>(bins0 * bins1);
# pragma omp for
for (size_t t=0; t<ntracks; t++) {
float i = (xtrx_acc(t) - ranges00) * delta0;
float j = (ytrx_acc(t) - ranges10) * delta1;
if (0 <= i && i < bins0 && 0 <= j && j < bins1) {
H[size_t(i)*bins1 + size_t(j)]++;
}
}
for (long i=0; i<bins0; i++)
for (long j=0; j<bins1; j++)
# pragma omp atomic
Hpy_acc(i,j)+=H[i*bins1 + j];
}
return Hpy;
}
PYBIND11_MODULE(example_sep, m) {
m.def("py11_histogram2d_sep", py11_histogram2d_sep);
}
pybind11 OpenMP atomic
%%timeit
h = py11_histogram2d_atomic(vals[0], vals[1], *bins, *ranges[0], *ranges[1])
Histogram2D(edges, h)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)
h = py11_histogram2d_atomic(vals[0], vals[1], *bins, *ranges[0], *ranges[1])
print("Should be 0:", (h - H).sum())
pybind11 C++11 atomic
%%timeit
h = py11_histogram2d_threaded_atomic(vals[0], vals[1], *bins, *ranges[0], *ranges[1], 0)
Histogram2D(edges, h)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)
h = py11_histogram2d_threaded_atomic(
vals[0], vals[1], *bins, *ranges[0], *ranges[1], 32
)
print("Should be 0:", (h - H).sum())
pybind11 C++11 merge
%%timeit
h = py11_histogram2d_threaded_sep(vals[0], vals[1], *bins, *ranges[0], *ranges[1], 0)
Histogram2D(edges, h)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)
h = py11_histogram2d_threaded_sep(vals[0], vals[1], *bins, *ranges[0], *ranges[1], 32)
print("Should be 0:", (h - H).sum())
pybind11 OpenMP merge
%%timeit
h = py11_histogram2d_sep(vals[0], vals[1], *bins, *ranges[0], *ranges[1])
Histogram2D(edges, h)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)
h = py11_histogram2d_sep(vals[0], vals[1], *bins, *ranges[0], *ranges[1])
print("Should be 0:", (h - H).sum())
TODO: Boost::Histogram
Boost::Histogram should be coming out soon (early next year), and there should be nice pybind11 bindings for it as well. It would be very interesting to compare!
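Purely as a sketch of what filling such a histogram might look like from Python (hypothetical: the bindings are unreleased, so the boost_histogram package name and the bh.axis.Regular / bh.Histogram spellings are my guesses at the proposed API):
import boost_histogram as bh  # hypothetical package name

# Two regular axes matching the binning used above
hist = bh.Histogram(
    bh.axis.Regular(100, -1, 1),
    bh.axis.Regular(100, -1, 1),
)
hist.fill(*vals)  # fill x and y in one vectorized call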