Histogram Speeds in Python

Let’s compare several ways of making Histograms. I’m going to assume you would like to end up with a nice OO histogram interface, so all the 2D methods will fill a Physt histogram. We will be using a 2 x 1,000,000 element array and filling a 2D histogram, or 10,000,000 elemends in a 1D histogram. Binnings are regular.

1D 10,000,000 item histogram

Example KNL MBP X24
NumPy: histogram 704 ms 147 ms 114 ms
NumPy: bincount 432 ms 110 ms 117 ms
fast-histogram 337 ms 45.9 ms 45.7 ms
Numba 312 ms 58.8 ms 60.7 ms

2D 1,000,000 item histogram

Example KNL MBP X24
Physt 1.21 s 293 ms 246 ms
NumPy: histogram2d 456 ms 114 ms 88.3 ms
NumPy: add.at 247 ms 62.7 ms 49.7 ms
NumPy: bincount 81.7 ms 23.3 ms 20.3 ms
fast-histogram 53.7 ms 10.4 ms 7.31 ms
fast-hist threaded 0.5 (6) 62.5 ms 9.78 ms (6) 15.4 ms
fast-hist threaded (m) 62.3 ms 4.89 ms 3.71 ms
Numba 41.8 ms 10.2 ms 9.73 ms
Numba threaded (6) 49.2 ms 4.23 ms (6) 4.12 ms
Cython 112 ms 12.2 ms 11.2 ms
Cython threaded (6) 128 ms 5.68 ms (8) 4.89 ms
pybind11 sequential 93.9 ms 9.20 ms 17.8 ms
pybind11 OpenMP atomic 4.06 ms 6.87 ms 1.91 ms
pybind11 C++11 atomic (32) 10.7 ms 7.08 ms (48) 2.65 ms
pybind11 C++11 merge (32) 23.0 ms 6.03 ms (48) 4.79 ms
pybind11 OpenMP merge 8.74 ms 5.04 ms 1.79 ms

Computers used:

  • MBP: MacBook Pro (Retina, 13-inch, Early 2015) with 2.7 GHz Intel Core i5 (Dual core with Hyperthreading)
  • X24: CentOS 7 with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores with Hyperthreading)
  • KNL: CentOS 7 with Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz (64 cores with 4x Hyperthreading)

All computers were running Anaconda and Python 3.6 or 3.7. Thread numbers were reduced from cores times hyperthreads to produce the best timing if noted in parenthesis. For the KNL, threading does not seem to work at all, so 6 threads were chosen. On the KNL and pybind11, running an OpenMP version seems to stop the threaded versions from using threads.

Software:

  • Physt: Object oriented histograms for Python
  • fast-histogram: High speed simple histograms. 0.5 and (m) is master branch, install master with: pip install git+https://github.com/astrofrog/fast-histogram.git
  • Numba: JIT for Python, now available via PIP. The 0.40 version now supports array variables in reduction, but I was not able to come up with a way to use it to do parallel histograms.
  • Cython: C+Python language.
  • pybind11: C++11 bindings for Python. Also using pip install git+https://github.com/aldanor/ipybind.git for nice Jupyter integration.

Source code

Like other notebook powered posts, you can access the notebook to run this yourself at gitlab.com/iscinumpy/iscinumpy.gitlab.io.

import numpy as np
import numba
import math
from physt import h2
import fast_histogram
from physt.histogram_nd import Histogram2D
from concurrent.futures import ThreadPoolExecutor
from functools import partial

Let’s start with setting up the histogram we will be making today. We’ll produce several different quantities, since the libraries each expect something a bit different. Some of the methods require evenly spaced bins, which we have.

bins = (100, 100)
ranges = ((-1, 1), (-1, 1))
bins = np.asarray(bins).astype(np.int64)
ranges = np.asarray(ranges).astype(np.float64)

edges = (
    np.linspace(*ranges[0, :], bins[0] + 1),
    np.linspace(*ranges[1, :], bins[1] + 1),
)

Our 2D dataset will have 1,000,000 2D items. If you are on a many-core machine, you might see better performance for the parallel versions from 10M items instead.

np.random.seed(42)
vals = np.random.normal(size=[2, 1000000]).astype(np.float32)
vals1d = np.random.normal(size=[10000000]).astype(np.float32)

1D: NumPy histogram

%%timeit
h, _ = np.histogram(vals1d, bins=bins[0], range=(ranges[0,0],ranges[0,1]))

1D: NumPy bincount

%%timeit
a = np.zeros(bins[0], dtype=np.int_)
c = ((vals1d[(vals1d>=ranges[0,0]) & (vals1d<ranges[0,1])] - ranges[0,0]) / (ranges[0,1] - ranges[0,0]) * bins[0]).astype(np.int_)
gh = np.bincount(c)

1D: fast-histogram

%%timeit
h = fast_histogram.histogram1d(vals1d, bins=bins[0], range=(ranges[0,0],ranges[0,1]))

1D: Numba histogram

Even in Numba 0.40, this is not faster in parallel mode.

@numba.njit
def hist1d(v, b, r):
    return np.histogram(v, b, r)[0]
%%timeit
h = hist1d(vals1d, bins[0], (ranges[0,0],ranges[0,1]))

2D

The following examples are in 2D.

Physt

Let’s start by just asking Physt to do the histogramming itself.

%%timeit
h2(*vals, bins=edges)

NumPy

%%timeit
H, *ledges = np.histogram2d(*vals, bins=bins, range=ranges)
Histogram2D(ledges, H)
H, *ledges = np.histogram2d(*vals, bins=bins, range=ranges)

Add.at and Bincount

This manually builds the histogram with bincount and some filtering. This takes advantage of the even bin spacing.

%%timeit
a = np.zeros(bins[0]*bins[1], dtype=np.int_)
cuts = (vals[0]>=ranges[0,0]) & (vals[0]<ranges[0,1]) & (vals[1]>=ranges[1,0]) & (vals[1]<ranges[1,1])
c0 = ((vals[0][cuts] - ranges[0,0]) / (ranges[0,1] - ranges[0,0]) * bins[0]).astype(np.int_)
c1 = ((vals[1][cuts] - ranges[1,0]) / (ranges[1,1] - ranges[1,0]) * bins[1]).astype(np.int_)
np.add.at(a, c0 + bins[0]*c1, 1)
Histogram2D(edges, a.reshape(*bins))
%%timeit
cuts = (vals[0]>=ranges[0,0]) & (vals[0]<ranges[0,1]) & (vals[1]>=ranges[1,0]) & (vals[1]<ranges[1,1])
c = ((vals[0,cuts] - ranges[0,0]) / (ranges[0,1] - ranges[0,0]) * bins[0]).astype(np.int_)
c += bins[0]*((vals[1,cuts] - ranges[1,0]) / (ranges[1,1] - ranges[1,0]) * bins[1]).astype(np.int_)
H = np.bincount(c, minlength=bins[0]*bins[1]).reshape(*bins)
Histogram2D(edges, H)

Fast-histogram

%%timeit

H = fast_histogram.histogram2d(*vals, bins=100, range=((-1,1),(-1,1)))
Histogram2D(edges, H)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)

Fast-histogram threaded

(Not much faster in 0.5, fixed in master)

%%timeit
splits = 6
with ThreadPoolExecutor(max_workers=splits) as pool:
    chunk = vals.shape[1] // splits
    chunk0 = [vals[0,i*chunk:(i+1)*chunk] for i in range(splits)]
    chunk1 = [vals[1,i*chunk:(i+1)*chunk] for i in range(splits)]
    f = partial(fast_histogram.histogram2d, bins=100, range=((-1,1),(-1,1)))
    results = pool.map(f, chunk0, chunk1)
    results = sum(results)

Numba

Custom function, since so far only 1D np.histogram is supported natively.

# This is 8-9 times faster than np.histogram
# We give the signature here so it gets precompmiled
# In theory, this could even be threaded (nogil!)
@numba.njit(nogil=True, parallel=False)
def hist2d_numba_seq(tracks, bins, ranges):
    H = np.zeros((bins[0], bins[1]), dtype=np.uint64)
    delta = 1 / ((ranges[:, 1] - ranges[:, 0]) / bins)

    for t in range(tracks.shape[1]):
        i = (tracks[0, t] - ranges[0, 0]) * delta[0]
        j = (tracks[1, t] - ranges[1, 0]) * delta[1]
        if 0 <= i < bins[0] and 0 <= j < bins[1]:
            H[int(i), int(j)] += 1

    return H
%%timeit
H = hist2d_numba_seq(vals, bins=bins, ranges=ranges)
Histogram2D(edges, H)

Numba threaded

This is why we release the GIL in the definition above.

%%timeit
splits = 6
with ThreadPoolExecutor(max_workers=splits) as pool:
    chunk = vals.shape[1] // splits
    chunks = [vals[:,i*chunk:(i+1)*chunk] for i in range(splits)]
    f = partial(hist2d_numba_seq, bins=bins, ranges=ranges)
    results = pool.map(f, chunks)
    results = sum(results)

Cython

Everyone has to compare to Cython. I’m maxing out the compilation options to try to match Numba, but so far have failed. Also much slower to compile.

%load_ext Cython
%%cython -a -c-O3 -c-march=native

# cython: wraparound = False
# cython: cdivision = True
# cython: nonecheck = False
# cython: boundscheck = False

import numpy as np
cimport libc.math as math
cimport numpy as np

def cy_histogram2d(np.float32_t[:,:] tracks, long[:] bins, double[:,:] ranges):
    Hpy = np.zeros((bins[0], bins[1]), dtype=np.int64)
    cdef np.uint64_t [:,:] H = Hpy
    cdef double delta0 = 1/((ranges[0,1] - ranges[0,0]) / bins[0])
    cdef double delta1 = 1/((ranges[1,1] - ranges[1,0]) / bins[1])
    cdef int i, j

    with nogil:
        for t in range(tracks.shape[1]):
            i = int(math.floor((tracks[0,t] - ranges[0,0]) * delta0))
            j = int(math.floor((tracks[1,t] - ranges[1,0]) * delta1))
            if 0 <= i < bins[0] and 0 <= j < bins[1]:
                H[i,j] += 1

    return Hpy
%%timeit

H = cy_histogram2d(vals, bins=bins, ranges=ranges)
Histogram2D(edges, H)

Cython threaded

We can use the same GIL releasing trick in Cython, too.

%%timeit
splits = 8
with ThreadPoolExecutor(max_workers=splits) as pool:
    chunk = vals.shape[1] // splits
    chunks = (vals[:,i*chunk:(i+1)*chunk] for i in range(splits))
    f = partial(cy_histogram2d, bins=bins, ranges=ranges)
    results = pool.map(f, chunks)
    H = sum(results)

Histogram2D(edges, H)

Pybind11

We can set up a threaded and unthreaded version.

Note that we explicitly set a number of threads - the default hardware concurrency can be a bit high.

%load_ext ipybind

Add:

-c="-fopenmp" -Wl="-fopenmp"

(linux) or

-c="-Xpreprocessor" -c="-fopenmp" -Wl="-lomp"

(macOS with HomeBrew) to add OpenMP support.

%%pybind11 -c="-fopenmp" -Wl="-fopenmp"

#include <pybind11/numpy.h>

#include <vector>

py::array_t<long> py11_histogram2d_atomic(
    py::array_t<float> xtrx,
    py::array_t<float> ytrx,
    long bins0, long bins1,
    float ranges00, float ranges01,
    float ranges10, float ranges11) {

    auto xtrx_acc = xtrx.unchecked<1>();
    auto ytrx_acc = ytrx.unchecked<1>();

    std::vector<long> H(bins0 * bins1);

    double delta0 = 1/((ranges01 - ranges00) / bins0);
    double delta1 = 1/((ranges11 - ranges10) / bins1);

    py::buffer_info buf_trks = xtrx.request();
    size_t ntracks = buf_trks.size;

#   pragma omp parallel for
    for (size_t t=0; t<ntracks; t++) {
        float i = (xtrx_acc(t) - ranges00) * delta0;
        float j = (ytrx_acc(t) - ranges10) * delta1;
        if (0 <= i && i < bins0 && 0 <= j && j < bins1)
#           pragma omp atomic
            H[size_t(i) * bins1 + size_t(j)]++;
    }

    py::array_t<long> Hpy({bins0, bins1});
    auto Hpy_acc = Hpy.mutable_unchecked<2>();

    for (long i=0; i<bins0; i++)
        for (long j=0; j<bins0; j++)
            Hpy_acc(i,j)=H[i * bins1 + j];

    return Hpy;
}

PYBIND11_MODULE(example_atomic, m) {
    m.def("py11_histogram2d_atomic", py11_histogram2d_atomic);
}
%%pybind11

#include <pybind11/numpy.h>

#include <atomic>
#include <thread>
#include <vector>

py::array_t<long> py11_histogram2d_threaded_atomic(
    py::array_t<float> xtrx,
    py::array_t<float> ytrx,
    long bins0, long bins1,
    float ranges00, float ranges01,
    float ranges10, float ranges11,
    size_t num_threads) {

    auto xtrx_acc = xtrx.unchecked<1>();
    auto ytrx_acc = ytrx.unchecked<1>();

    std::vector<std::atomic<long>> H(bins0 * bins1);

    double delta0 = 1/((ranges01 - ranges00) / bins0);
    double delta1 = 1/((ranges11 - ranges10) / bins1);

    py::buffer_info buf_trks = xtrx.request();
    size_t ntracks = buf_trks.size;

    auto chunked = [&](size_t start, size_t end){
       for (size_t t=start; t<end; t++) {
        float i = (xtrx_acc(t) - ranges00) * delta0;
        float j = (ytrx_acc(t) - ranges10) * delta1;
        if (0 <= i && i < bins0 && 0 <= j && j < bins1)
            H[size_t(i) * bins1 + size_t(j)]++;
        }
    };

    if (num_threads==0)
        num_threads = std:🧵:hardware_concurrency();

    std::vector<std::thread> threadpool;
    size_t each = ntracks / num_threads;
    for (int i=0; i<(int) num_threads; i++) {
        threadpool.emplace_back(chunked, i*each, ((i-1)==num_threads) ? num_threads : (i+1)*each);
    }

    for (auto& thread : threadpool)
        thread.join();


    py::array_t<long> Hpy({bins0, bins1});
    auto Hpy_acc = Hpy.mutable_unchecked<2>();

    for (long i=0; i<bins0; i++)
        for (long j=0; j<bins1; j++)
            Hpy_acc(i,j)=H[i * bins1 + j];

    return Hpy;
}

PYBIND11_MODULE(example_threaded_atomic, m) {
    m.def("py11_histogram2d_threaded_atomic",
           py11_histogram2d_threaded_atomic);
}
%%pybind11

#include <pybind11/numpy.h>

#include <thread>
#include <algorithm>
#include <vector>

py::array_t<long> py11_histogram2d_threaded_sep(
    py::array_t<float> xtrx,
    py::array_t<float> ytrx,
    long bins0, long bins1,
    float ranges00, float ranges01,
    float ranges10, float ranges11,
    size_t num_threads) {

    auto xtrx_acc = xtrx.unchecked<1>();
    auto ytrx_acc = ytrx.unchecked<1>();

    double delta0 = 1/((ranges01 - ranges00) / bins0);
    double delta1 = 1/((ranges11 - ranges10) / bins1);

    py::buffer_info buf_trks = xtrx.request();
    size_t ntracks = buf_trks.size;

    if (num_threads==0)
        num_threads = std:🧵:hardware_concurrency();

    std::vector<long> H(num_threads * bins0 * bins1);

    auto chunked = [&](size_t thread, size_t start, size_t end){
       for (size_t t=start; t<end; t++) {
        float i = (xtrx_acc(t) - ranges00) * delta0;
        float j = (ytrx_acc(t) - ranges10) * delta1;
        if (0 <= i && i < bins0 && 0 <= j && j < bins1)
            H[thread*bins0*bins1 + size_t(i)*bins1 + size_t(j)]++;
        }
    };

    std::vector<std::thread> threadpool;
    size_t each = ntracks / num_threads;
    for (int i=0; i<(int) num_threads; i++) {
        threadpool.emplace_back(chunked, i, i*each, ((i-1)==num_threads) ? num_threads : (i+1)*each);
    }

    for (auto& thread : threadpool)
        thread.join();


    py::array_t<long> Hpy({bins0, bins1});
    auto Hpy_acc = Hpy.mutable_unchecked<2>();

    for (long i=0; i<bins0; i++)
            for (long j=0; j<bins1; j++)
                Hpy_acc(i,j)=0;

    for (size_t thread=0; thread < num_threads; thread++)
        for (long i=0; i<bins0; i++)
            for (long j=0; j<bins1; j++)
                Hpy_acc(i,j)+=H[thread*bins0*bins1 + i*bins1 + j];

    return Hpy;
}

PYBIND11_MODULE(example_threaded_sep, m) {
    m.def("py11_histogram2d_threaded_sep", py11_histogram2d_threaded_sep);
}
%%pybind11 -c="-fopenmp" -Wl="-fopenmp"

#include <pybind11/numpy.h>

#include <vector>

#include <omp.h>

py::array_t<long> py11_histogram2d_sep(
    py::array_t<float> xtrx,
    py::array_t<float> ytrx,
    long bins0, long bins1,
    float ranges00, float ranges01,
    float ranges10, float ranges11) {

    double delta0 = 1/((ranges01 - ranges00) / bins0);
    double delta1 = 1/((ranges11 - ranges10) / bins1);

    py::buffer_info buf_trks = xtrx.request();
    size_t ntracks = buf_trks.size;

    auto xtrx_acc = xtrx.unchecked<1>();
    auto ytrx_acc = ytrx.unchecked<1>();

    py::array_t<long> Hpy({bins0, bins1});
    auto Hpy_acc = Hpy.mutable_unchecked<2>();

    for (long i=0; i<bins0; i++)
        for (long j=0; j<bins1; j++)
            Hpy_acc(i,j)=0;

#   pragma omp parallel
    {
        size_t num_threads = omp_get_num_threads();
        std::vector<long> H = std::vector<long>(bins0 * bins1);

#       pragma omp for
        for (size_t t=0; t<ntracks; t++) {

            float i = (xtrx_acc(t) - ranges00) * delta0;
            float j = (ytrx_acc(t) - ranges10) * delta1;
            if (0 <= i && i < bins0 && 0 <= j && j < bins1) {
                H[size_t(i)*bins1 + size_t(j)]++;
            }
        }


        for (long i=0; i<bins0; i++)
            for (long j=0; j<bins1; j++)
#               pragma omp atomic
                Hpy_acc(i,j)+=H[i*bins1 + j];
    }

    return Hpy;
}

PYBIND11_MODULE(example_sep, m) {
    m.def("py11_histogram2d_sep", py11_histogram2d_sep);
}
%%timeit
h = py11_histogram2d_atomic(vals[0], vals[1], *bins, *ranges[0], *ranges[1])
Histogram2D(edges, h)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)
h = py11_histogram2d_atomic(vals[0], vals[1], *bins, *ranges[0], *ranges[1])
print("Should be 0:", (h - H).sum())
%%timeit
h = py11_histogram2d_threaded_atomic(vals[0], vals[1], *bins, *ranges[0], *ranges[1], 0)
Histogram2D(edges, h)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)
h = py11_histogram2d_threaded_atomic(
    vals[0], vals[1], *bins, *ranges[0], *ranges[1], 32
)
print("Should be 0:", (h - H).sum())
%%timeit
h = py11_histogram2d_threaded_sep(vals[0], vals[1], *bins, *ranges[0], *ranges[1], 0)
Histogram2D(edges, h)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)
h = py11_histogram2d_threaded_sep(vals[0], vals[1], *bins, *ranges[0], *ranges[1], 32)
print("Should be 0:", (h - H).sum())
%%timeit
h = py11_histogram2d_sep(vals[0], vals[1], *bins, *ranges[0], *ranges[1])
Histogram2D(edges, h)
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1, 1), (-1, 1))).astype(int)
h = py11_histogram2d_sep(vals[0], vals[1], *bins, *ranges[0], *ranges[1])
print("Should be 0:", (h - H).sum())

TODO: Boost::Histogram

Boost::Histogram should be coming out soon (early next year), and there should be nice pybind11 bindings for it. This would be very interesting to compare!