How we made Python's packaging library 3x faster -

Along with a pip (and now packaging) maintainer, Damian Shaw, I have been working on making packaging, the library behind almost all packaging related tools, faster at reading versions and specifiers, something tools like pip have to do thousands of times during resolution. Using Python 3.15’s new statistical profiler and metadata from every package ever uploaded to PyPI, I measured and improved core Packaging constructs while keeping the code readable and simple. Reading in Versions can be up to 2x faster and SpecifierSets can be up to 3x faster in packaging 26.0, now released! Other operations have been optimized, as well, up to 5x in some cases. See the announcement and release notes too; this post will focus on the performance work only.

Introduction

packaging is the core library used by most tools for Python to deal with many of the standardized packaging constructs, like versions, specifiers, markers, and the like. It is the 11th most downloaded library, but if you also take into account that it is vendored into pip, meaning you get a (hidden) copy with every pip install, it’s actually the 2nd most downloaded library. Given that pip is vendored into Python, everyone who has Python has packaging, unless their distro strips it out into a separate package; so it is possible it is the most common third party Python library in the world.

In packaging, a Version is something that follows PEP 440’s version standard. And a SpecifierSet is conditions on that version; think >=2,<3 or ~=1.0, those are SpecifierSets. They are used on dependencies, on requires-python, etc. They are also part of Markers, that is, something like tomli; python_version < '3.11' (a Requirement) contains a Marker.

I’d like to start by showing you the progress we’ve made as a series of plots; if you’d like to see how we made some of these, I’ll follow with in-depth examples.

Performance plots with asv

After most of the performance PRs were made, I finally invested a little time into making a proper set of micro-benchmarks with asv; I’ll be showing plots from that. Code for this is currently in a branch in my fork; it might eventually be either contributed or moved to a separate repo. The benchmarks are an optimized (trimmed down) version of the original code.

Plots were made using code in the source directory of my blog repository; values are scaled by the 25.0 performance numbers, with a green line showing the current performance after the changes we’ve been working on. The lines are based on Python 3.14 from uv (which is a bit faster than the one from homebrew). These were run on an entry-level M1 Mac Mini. The plot xscale is expanded after 25.0 to show the current work.

Version constructor

This is the Version constructor. You can see the series of PRs described below lowering the time to 0.5. Now, one of those steps was making the comparison tuple generated on first usage, instead of in the constructor, so the sorting benchmark has taken on that cost:

Version sort

I did play around with the idea of computing __lt__ and friends directly, instead of making a tuple, caching it, then comparing that. But it seems Python optimizes tuple comparison, and these get compared a lot when sorting, so even though the custom method could exit early and save a little calculation, it still was something like 5x slower.

Version str

Here you can see optimizations for __str__; we’ve mostly avoided calling the Version -> str -> Version like we used to, but this still helps third party packages that do this.

Version hash

This is the complete time taken to construct and produce a hash.

SpecifierSet construction

Here we can see SpecifierSet’s construction time. In the past, there were two major regressions; The first bump around 2020 was a bugfix; the added logic is needed for correctness. The second was the introduction of the nested NamedTuple and some other slowdowns we have now fixed.

SpecifierSet contains

One of the most important operations on SpecifierSet is asking if a version is contained in it. Here you can see that we’ve managed to get this over 2x faster.

SpecifierSet filter

Another core operation is .filter, which we’ve made about 5x faster. Most of this was from caching the Version, avoiding repeated Version constructors.

Marker constructor

Another constructor is Marker. The big jump in version 22 was moving to a handwritten parser instead of pyparsing (which also isolated us from breakages due to pyparsing changing their API, and removed our only dependency, too!), but we’ve further improved this since 25.0 by dropping the regular expression construction inside the constructor.

Marker evaluate

Evaluating Markers (to see if the Requirement passes a particular environment) has also gotten faster. Most of that final drop is from avoiding trying to parse everything as a Version, and instead just apply Version to things that might be versions.

Requirement constructor

For reconstructing Requirement, this is similar to Marker (since it contains them).

Utilities: canonicalize_name

Here’s a microbenchmark of the canonicalize_name function, which we made 2x faster by removing a regular expression substitution, using str.translate instead. This turned out to be slower on 3.12 and 3.13; using chained lower() and single-char .replace() is faster on all versions, since there’s an optimized single char replace implementation in CPython.

Resolver loop

For our final benchmark, this is a quick attempt at making a toy resolver. The one bump up is from a fix for proper PEP 440 handling of prereleases.

How it started

Now that you’ve seen what we’ve done, let’s look at how we got there.

This optimization work started when Damian Shaw made a PR to reduce the number of Versions being created during specifier comparison operations, with a note about how pip needed to create thousands of these. That got me interested; I was looking into why Version’s were slow to create in the first place. During the work, Kevin Turcios also got involved, looking for potential slow operations using an AI tool he works on. Also huge thanks to Brett Cannon for reviewing many of these PRs.

Measuring Version and atomic/possessive regex (3.11.5+ only)

The core of the Version object is a regular expression; the rules specified in PEP 440 can be expressed as a regular expression. While most versions look like 1.2.3, there are a lot of optional parts; 2!1.2.3.dev1.post1+extra is also a valid version (don’t try to upload it to PyPI, but it is valid as a Version!). A regular expression is a natural way to express something like that, and probably will be faster than lots of string manipulations; but regular expressions are known to be slow. Since I teach students in my APC 524 class at Princeton to always profile before they start to optimize, ~~I started by profiling, of course~~. Okay… I first worked on the regex, because I knew it had to be slow. I used Python 3.11’s new atomic grouping and possessive qualifiers to reduce backtracking; once you’ve matched a part you don’t need to go back and try other matches on the same part of the version. This did make it faster - by something like 5%.

To measure this, I started by just asking ChatGPT for some versions valid in Python, it gave me 10 or so, then I multiplied that by a large number and that gave me something I could run. A little later, I downloaded the metadata for PyPI (about 10GB sqlite file), and read in every version published, filtering out invalid versions (PyPI used to not validate versions; it predates PEP 440 anyway!), and started using that (final benchmarking code is at the end). This also gave me a way to ensure that the same versions were being read; if the number of versions changed, then the regex was doing something differently.

Here’s the quick script:

import timeit
from packaging.version import Version

TEST_VERSIONS = [
    "1.0.0",
    "2.7",
    "1.2.3rc1",
    "0.9.0.dev4",
    "10.5.1.post2",
    "1!2.3.4",
    "1.0+abc.1",
    "2025.11.24",
    "3.4.5-preview.8",
    "v1.0.0",
] * 10_000


def bench():
    for v in TEST_VERSIONS:
        Version(v)


if __name__ == "__main__":
    t = timeit.timeit("bench()", globals=globals(), number=5)
    print(f"Time: {t:.4f} seconds")

Profiling Version

This result didn’t make sense; the regex was faster, so it should have had a bigger impact on Versions as a whole. I decided to do what I should have done first: profile. This was a perfect opportunity to try CPython 3.15.0’s new statistical profiler that I’d been hearing about on the Core.Py podcast. Since uv python install can install the 3.15 alpha’s, it was easy to get it, I didn’t even have to build anything. Since packaging doesn’t have any compiled dependencies, everything worked smoothly with the alpha version.

To use it, something like this works on macOS:

sudo -E uv run --python 3.15 python -m profiling.sampling tasks/benchmark_version.py

It might trigger an install with sudo active, which means you’ll have to clear uv’s cache and python installs also with sudo, but it got me going.

The textual output was nice, and the html output was great; for a zero-setup profile (well, once Python 3.15 is out), this is fantastic.

Here’s what it looked like:

First flamegraph of Version

Textual output (click to expand)

$ sudo -E uv run --python 3.15 python -m profiling.sampling tasks/benchmark_version.py
Time: 1.3528 seconds
Per version: 2.705616084 µs
Captured 13646 samples in 1.36 seconds
Sample rate: 10000.01 samples/sec
Error rate: 20.57%
Profile Stats:
       nsamples   sample%  tottime (ms)    cumul%   cumtime (s)  filename:lineno(function)
        1/10703       0.0         0.100      99.4         1.070  _sync_coordinator.py:193(_execute_script)
        0/10703       0.0         0.000      99.4         1.070  _sync_coordinator.py:234(main)
        0/10703       0.0         0.000      99.4         1.070  _sync_coordinator.py:251(<module>)
        0/10703       0.0         0.000      99.4         1.070  <frozen runpy>:88(_run_code)
        0/10703       0.0         0.000      99.4         1.070  <frozen runpy>:198(_run_module_as_main)
        0/10661       0.0         0.000      99.0         1.066  <timeit-src>:6(inner)
        0/10661       0.0         0.000      99.0         1.066  timeit.py:183(Timer.timeit)
        0/10661       0.0         0.000      99.0         1.066  timeit.py:240(timeit)
        0/10661       0.0         0.000      99.0         1.066  benchmark_version.py:25(<module>)
      670/10660       6.2        67.000      99.0         1.066  benchmark_version.py:21(bench)
        82/9990       0.8         8.200      92.7         0.999  __init__:0(__init__)
      2613/2623      24.3       261.300      24.4         0.262  version.py:201(Version.__init__)
       951/2106       8.8        95.100      19.6         0.211  version.py:218(Version.__init__)
      1660/1813      15.4       166.000      16.8         0.181  version.py:208(Version.__init__)
      1068/1151       9.9       106.800      10.7         0.115  version.py:206(Version.__init__)

Legend:
  nsamples: Direct/Cumulative samples (direct executing / on call stack)
  sample%: Percentage of total samples this function was directly executing
  tottime: Estimated total time spent directly in this function
  cumul%: Percentage of total samples when this function was on the call stack
  cumtime: Estimated cumulative time (including time in called functions)
  filename:lineno(function): Function location and name

Summary of Interesting Functions:

Functions with Highest Direct/Cumulative Ratio (Hot Spots):
  0.818 direct/cumulative ratio, 58.4% direct samples: version.py:(Version.__init__)
  0.063 direct/cumulative ratio, 6.2% direct samples: benchmark_version.py:(bench)
  0.008 direct/cumulative ratio, 0.8% direct samples: __init__:(__init__)

Functions with Highest Call Frequency (Indirect Calls):
  10703 indirect calls, 99.4% total stack presence: _sync_coordinator.py:(main)
  10703 indirect calls, 99.4% total stack presence: _sync_coordinator.py:(<module>)
  10703 indirect calls, 99.4% total stack presence: <frozen runpy>:(_run_code)

Functions with Highest Call Magnification (Cumulative/Direct):
  10703.0x call magnification, 10702 indirect calls from 1 direct: _sync_coordinator.py:(_execute_script)
  121.8x call magnification, 9908 indirect calls from 82 direct: __init__:(__init__)
  15.9x call magnification, 9990 indirect calls from 670 direct: benchmark_version.py:(bench)

(The HTML version has line numbers and more info.) That’s not what I expected at all. While you can see the regex (first blue section on the left), it’s not dominating; there’s a bunch of other stuff nearly as large as the regex.

Speedups

Stripping 0’s: 10% speedup

The first speedup I saw was this line:

_release = tuple(
    reversed(list(itertools.dropwhile(lambda x: x == 0, reversed(release))))
)

That’s terrible, it’s generating tons of small lists and dropping them. I started with a version that is very fast, making this line 20x faster and dropping it off the profile. This was my first version:

def _strip_trailing_zeros(release: tuple[int, ...]) -> tuple[int, ...]:
    for i in range(len(release) - 1, -1, -1):
        if release[i] != 0:
            return release[: i + 1]
    return ()

This sped reading versions up by about 10% in my benchmark, and by about 40% in pip’s resolver.

There’s overhead to the call, though, so later I got another 10% or so by starting from a suggestion from ℤahlman on the PyPA Discord:

while release and release[-1] == 0:
    release = release[:-1]

And coming up with an in-between version with good performance by being inline:

len_release = len(release)
i = len_release
while i and release[i - 1] == 0:
    i -= 1
_release = release if i == len_release else release[:i]

This avoids creating multiple tuples; it even avoids a slice in the common case of no stripped zeros.

Faster Regex (10-17% faster, 3.11.5+ only)

I did go ahead and make the regex PR. I dropped atomic groups; just using possessive qualifiers got the speed up I wanted and it was easier to strip them out to support older versions of Python with the same single regex string. The 10-17% speedup might not seem like a lot, but I still planned to remove a lot of the other things that were keeping the regex from dominating.

To do this, * becomes *+, and ? becomes ?+. You just need to be careful to only apply it where backtracking is not needed, like between each group. Inside a group, there are cases where you might need to backtrack. To support older Python versions, PATTERN.replace("*+", "*").replace("?+", "?") can be used to strip this back out (atomic groups are harder to strip out). Note that possessive qualifiers were broken on CPython 3.11.0-3.11.4, so the same stripping needs to be done there.

I also cleaned up the regex code a bit, using fullmatch instead of search with anchors, which also seemed a little (1%) faster, could be within measurement uncertainty though.

Note that if you are trying to speed anything up except packaging itself, you can add the regex PyPI library and that supports these features on older Python versions too. The packaging library can’t have dependencies, especially compiled ones.

SpecifierSet: Removing singledispatch (7% faster)

I noticed another slow part in the flamegraph was canonicalize_version, which used a functools.singledispatch instead of an if statement; while I love singledispatch for a very specific style of programming, this isn’t a good use of it, and it’s slow. The function is now simpler, and faster.

This is basically what it was doing:

# Bad pattern
@functools.singledispatch
def f(x: Version | str) -> str:
    str(_TrimmedRelease(str(...)))


@f.register
def f(x: str) -> object:
    return f(Version(x))

Notice how the dispatched functions call the generic function, and the types overlap. Those are signs that this shouldn’t even be used. A better version would be:

def f(x: Version | str) -> str:
    if isinstance(x, str):
        version = Version(x)
    return str(_TrimmedRelease(str(x)))

I don’t want to give singledispatch a bad reputation; see uproot-browser for a good use, where I use it to register different data types that have a known plotting mechanism. It’s just the wrong tool here, and also not great when performance is critical.

However, that wasn’t the only problem with this function; it was running the Version creation (also inside _TrimmedRelease, too!) too many (more than one) times. Remember making Versions runs a regex!

SpecifierSet: remove duplicate Version creation (37% faster)

Inside canonicalize_version, there was another issue; the same version was created twice, once with a subclass (_TrimmedRelease) that had a different behavior when it turned into a string (removing zeros). I instead reworked the classes so you could create the subclass directly, without going through a string. _TrimmedRelease(version) now avoids the string intermediate if version is a Version.

Now the function looks something like this:

def f(x: Version | str) -> str:
    if isinstance(x, str):
        version = Version(x)
    return str(_TrimmedRelease(x))

Removing NamedTuple (20% faster)

Version had an interesting design; it contained a _Version NamedTuple with all of its fields. Now that we added caching, the outer Version had one more field, but otherwise, it was redundant. Creating and using NamedTuple is expensive, accessing via the names has a cost. This might have been done to ensure the object was not writeable, but that can be done without the NamedTuple access using properties. Removing this gave a (20%) speedup, as well as accessing values and even turning the version into a string also gets faster.

I was not able to find anyone using the hidden ._version attribute using GitHub’s code search; if that does break someone, we can always generate the NamedTuple on demand, but we’ll only do that if we have to. Edit: Turns out Hatch uses this. We’ve added the on-demand version with a DeprecationWarning, to be a FutureWarning in a later release.

Map instead of generator (8% faster)

Another slow line are the ones that look like this:

release = tuple(int(i) for i in match.group("release").split("."))

That generator is rather expensive. You can save a little time by using a list comprehension instead (tuple([...]) instead of tuple(...)) for small tuples, but I found that this:

release = map(int, match.group("release").split("."))

was similar to the list comprehension in speed, and it’s both nicer than adding the extra brackets, and was used elsewhere in the code, so moving to using them saved about 8%. Note that tuple([ ... for ... in ... ]) is likely only faster when the thing you are iterating over is small.

Using replacement to get new versions

A couple PRs Damian started and we both worked on was adding __replace__ support, then using it to to replace some Version -> str -> Version sequences inside SpecifierSet. It would have been nice if the API of Version returned Version instead of str for some methods like .public, but that’s a breaking change. If you are using something like Version(version.public) in a performance critical path, you can use __replace__ (copy.replace on Python 3.14) instead, which will be much faster than reparsing the Version. This mostly speeds up comparison, which I’m not usually benchmarking, but is critical for users like pip.

Using slots (2% faster)

This isn’t much of an improvement for Version or SpecifierSet (maybe more on older Python versions), but using __slots__ is a good idea, can reduce memory, and makes the class stricter as well, since it disallows setting a unknown property. Key sharing dictionaries in newer versions reduce the savings, but it’s still nicer.

Speedups inspired by Codeflash

Kevin Turcios used his tool, codeflash.ai, to look for possible speedups. I reviewed the ones it found, and implemented a version of three of them: I moved set construction out of a function, I used .partition instead of split (probably not faster, but nicer), and I used a dict to handle alternate spellings instead of a series of if’s. The tool reported the speedup in the test function, but that’s representative of real work; check the PRs if you’d like to see the values. I came up with different solutions, so the values are relevant enough to show here.

Here’s an example:

# Before
parts = [p.strip() for p in pair.split(",", 1)]
parts.extend([""] * (max(0, 2 - len(parts))))  # Ensure 2 items
label, url = parts

# After
label, _, url = (s.strip() for s in pair.partition(","))

Another one pulled set construction outside a function (making a set is expensive unless you use it inline; if it’s static, just make it once).

Other speedups

Damian also implemented a series of speedups related to reducing unnecessary object creation, such as making some computation lazy, caching related versions, avoiding redundant Version creation, and using the cache in more places. These aren’t less important than mine, it’s just that I’m writing the blog post and I have more to say about mine. :) Also, since his work focused on making pip’s resolver faster, some of the speedups are related to comparisons and containment checks, which won’t show up on my simple profiling.

For his resolver benchmark, pip was originally creating Versions over 4.8 million times, and combined with changes he is also making to pip, it’s now under 400 thousand.

pip on packaging 25.0
pip on packaging main

There also was a speedup found by Shantanu Jain, which speeds up Requirement parsing by 3x by moving regex construction out of the constructor.

After implementing the asv based benchmarks, I also worked on speedups for Marker and Requirement. I inlined the __str__ code for Version, using f-strings instead of joining lists, which gave a 10% speedup.

One of the more impactful changes was replacing the regular expression substitution for a string translate, doubling the performance of canonicalize_name:

_canonicalize_table = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ_.",
    "abcdefghijklmnopqrstuvwxyz--",
)
...
value = name.translate(_canonicalize_table)
while "--" in value:
    value = value.replace("--", "-")

This works because package names are required to be ASCII; using this just for the two replacements than calling .lower() is only about 5% slower if you have unicode, but we don’t. Condensing repeated separators is also quite rare. (Credit to Hugo van Kemenade for the clean version above, mine was a bit uglier). This is slower on 3.12 and 3.13, however, so this is actually the fastest version on all Python versions:

value = name.lower().replace(".", "-").replace("_", "-")
while "--" in value:
    value = value.replace("--", "-")

CPython has an optimized single-char replace, and this doesn’t have to the table lookup and one byte at a time string building, so it’s a little faster on 3.14, and up to 4x faster on older versions.

Final flamegraph of Version

The flamegraph now looks much better; the regex (in blue above) dominates, and parts like splitting strings into . separated integers probably can’t get faster outside of compilation. There might be a bit more to gain, but we’ve done pretty well.

Final performance numbers

Comparing packaging 25.0 and the main branch on Python 3.14, reading every version on PyPI went from 19.6 seconds to 9.9 seconds, a nearly 2x speedup. Reading every requires-python and checking if the current version of Python passes went from 105 seconds to 33.9 seconds, a 3x speedup. I actually do this all the time when I’m running analysis on build backends to monitor adoption; those run about two times faster on packaging main.

We have made an RC release, and hope to make a full release in about a week; some other work on improving our handling of standards around markers could cause a delay, but it should happen soon. A lot of other things are in the release as well: support for pattern matching, support for pylock files, support for import name metadata, support for writing metadata to a file, and lots of expanded linting and type verification in our codebase.

The last change required a small fix to the standard; packaging has never followed the marker specification correctly, but the standard was a bit broken, requiring every value to attempt conversion to a Version, even things that were not version-like at all. This change gave us a speed up that uv is already doing. Waiting on that to get approval is all that’s left for a final release of the new packaging (as well as waiting a bit for any bugs in the RC to be reported by you)! Please test the RC and make sure it works for you.

I don’t know about you, but I’m very excited for the fastest release of packaging yet! Please try 26.0 out today! (pip 26.0 hopefully will ship with packaging 26.0 at the end of January).

Edit: after this post, people have been applying some of these speedups to other libraries as well, including CPython itself!

Thanks to Kevin Turcios, Brett Cannon, and Damian Shaw for reviewing this post before publication. Thanks to Mike Fielder for suggesting proper benchmarking with asv, which added work but the result is much cooler and I even found new speedups. Thanks to ChatGPT for catching some initial typos before publication. And thanks to Brett Cannon (again), Thanos (from the PyPA discord server), James Gilbert, and Giordon Stark for catching typos in the published article. I found I am really bad at writing example code that doesn’t run or get linted! (If you helped and I missed your name above, please let me know!)

Benchmark scripts (click to expand)

# benchmark_versions.py
import sqlite3
import timeit
from packaging.version import Version, InvalidVersion

# Get data with:
# curl -L https://github.com/pypi-data/pypi-json-data/releases/download/latest/pypi-data.sqlite.gz | gzip -d > pypi-data.sqlite


def valid_version(v: str) -> bool:
    try:
        Version(v)
    except InvalidVersion:
        return False
    return True


with sqlite3.connect("pypi-data.sqlite") as conn:
    TEST_ALL_VERSIONS = [
        row[0]
        for row in conn.execute("SELECT version FROM projects")
        if valid_version(row[0])
    ]


def bench():
    for v in TEST_ALL_VERSIONS:
        Version(v)


if __name__ == "__main__":
    print(f"Loaded {len(TEST_ALL_VERSIONS):,} versions")
    t = timeit.timeit("bench()", globals=globals(), number=1)
    print(f"Time: {t:.4f} seconds")

# benchmark_specifiers.py
import sqlite3
import timeit
from packaging.specifiers import SpecifierSet, InvalidSpecifier
from packaging.version import Version

# Get data with:
# curl -L https://github.com/pypi-data/pypi-json-data/releases/download/latest/pypi-data.sqlite.gz | gzip -d > pypi-data.sqlite


def valid_spec(v: str) -> bool:
    try:
        SpecifierSet(v)
    except InvalidSpecifier:
        return False
    return True


with sqlite3.connect("pypi-data.sqlite") as conn:
    TEST_ALL_SPECS = [
        row[0]
        for row in conn.execute("SELECT requires_python FROM projects")
        if row[0] and valid_spec(row[0])
    ]


def bench():
    ver = Version("3.14.2")
    for v in TEST_ALL_SPECS:
        SpecifierSet(v).contains(ver)


if __name__ == "__main__":
    print(f"Loaded {len(TEST_ALL_SPECS):,} specs")
    t = timeit.timeit("bench()", globals=globals(), number=1)
    print(f"Time: {t:.4f} seconds")

programming python