Summary
Give factorize (#212) a dynamically sized scratch hash table instead of pre-sizing it to n
(the worst-case, all-unique count). The enabler is a small generalization of grow_table so it can
rehash array-keyed tables, not just the list/object-backed ones it supports today.
Motivation
factorize builds a private scratch FAMObject and sizes it once with grow_table(&scratch, n).
For the common grouping/pivot workload the number of distinct keys k is far smaller than n, so the
table is drastically over-allocated:
n = 1_000_000 → a table of ~2^21 slots ≈ 32 MB, of which only k slots are ever touched.
- Those
k touched slots are scattered across 32 MB, so lookups reach into L3 instead of staying
L2/L1-resident.
Measured effect (int64, n = 1M, comparing the n-sized table against a small, cache-resident table):
| case |
n-sized (32 MB) |
small table |
gain |
| k=100 |
2.62 ms |
2.20 ms |
~16% |
| k=5000 |
3.21 ms |
2.91 ms |
~9% |
This also matters for closing the last gap to pandas. After adding a fast bit-based float hash,
factorize on float64 is at near-parity with pandas.factorize (0.92–0.97x at the larger/higher-k
cases) but still slightly behind; a right-sized, cache-resident table is the most likely lever to flip
float (and low-k int) past pandas. For context, factorize already beats pandas decisively on
<U strings (4–9x), datetime64 (1.4–1.7x), and is at parity on int64.
Why it isn't a free reuse of existing growth
AutoMap/FrozenAutoMap already grow dynamically, but only for the list/object representation.
grow_table's rehash re-inserts each key via insert_obj(self, PyList_GET_ITEM(self->keys, i), i, h),
which needs a Python-object key list. For any array-keyed table (keys_array_type != 0 — every typed
int/float/string/datetime fast path) it bails:
// src/auto_map.c, grow_table()
if (size_old) {
if (self->keys_array_type) {
PyErr_SetString(PyExc_NotImplementedError, "Cannot grow table for array keys");
goto restore;
}
...
}
AutoMap.append/extend reject array-backed keys the same way. So the existing growth is inseparable
from the boxed/object path — exactly the slow path factorize avoids. Reusing it as-is would throw away
the typed-hashing speed.
Proposed approach
An open-addressing rehash doesn't need the key value — only to re-place each (keys_pos, hash)
entry by re-probing. Generalize grow_table's array-keyed branch to do this. Two viable variants:
- Rehash by stored hash — replay the
hash & mask → SCAN → reprobe sequence to the first empty
slot. No key access at all; dtype-agnostic.
- Rehash via the typed probes — since the entries reference
self->keys (an array) at valid
keys_pos indices, call the existing lookup_hash_*(new_table, key_at(keys_pos), h, kat) to find
the new slot; it re-probes and resolves same-hash collisions via ==. Reuses code we already rely on.
Then wire factorize:
- Start the scratch table at a modest size (e.g.
min(n, 4096)-sized) rather than n.
- Add a growth trigger at the top of the element loop: when the running distinct count
k reaches the
load threshold (table_size * LOAD), grow (rehash) before the next lookup, so no re-probe of the
current element is needed.
This is safe for the rest of the codebase: real FrozenAutoMaps are pre-sized and never regrow, so the
new array-keyed rehash branch is dead for them and changes no existing behavior.
Tradeoffs
- Helps the common low-cardinality case (
k ≪ n): ~10–15% and better cache behavior.
- The rare all-unique case (
k ≈ n) does O(log n) rehashes (total O(n) extra copies), a modest
regression. Mitigated by the min(n, 4096) starting size so low-k needs zero or one growth, and
high-k grows geometrically (amortized O(n), constant factor small).
Acceptance criteria
grow_table (or a dedicated helper) can rehash an array-keyed table correctly for all supported KATs
(int/uint/float/unicode/string/datetime), verified by the existing factorize round-trip and
randomized parity tests.
factorize scratch table starts small and grows; no correctness change (all test/test_factorize.py
and the full suite pass).
- Benchmark shows improvement on low-
k factorize (int/float) with no material regression on the
all-unique case; report float vs pandas.factorize before/after.
References
Summary
Give
factorize(#212) a dynamically sized scratch hash table instead of pre-sizing it ton(the worst-case, all-unique count). The enabler is a small generalization of
grow_tableso it canrehash array-keyed tables, not just the list/object-backed ones it supports today.
Motivation
factorizebuilds a private scratchFAMObjectand sizes it once withgrow_table(&scratch, n).For the common grouping/pivot workload the number of distinct keys
kis far smaller thann, so thetable is drastically over-allocated:
n = 1_000_000→ a table of ~2^21 slots ≈ 32 MB, of which onlykslots are ever touched.ktouched slots are scattered across 32 MB, so lookups reach into L3 instead of stayingL2/L1-resident.
Measured effect (int64,
n = 1M, comparing then-sized table against a small, cache-resident table):n-sized (32 MB)This also matters for closing the last gap to pandas. After adding a fast bit-based float hash,
factorizeonfloat64is at near-parity withpandas.factorize(0.92–0.97x at the larger/higher-kcases) but still slightly behind; a right-sized, cache-resident table is the most likely lever to flip
float (and low-
kint) past pandas. For context,factorizealready beats pandas decisively on<Ustrings (4–9x), datetime64 (1.4–1.7x), and is at parity on int64.Why it isn't a free reuse of existing growth
AutoMap/FrozenAutoMapalready grow dynamically, but only for the list/object representation.grow_table's rehash re-inserts each key viainsert_obj(self, PyList_GET_ITEM(self->keys, i), i, h),which needs a Python-object key list. For any array-keyed table (
keys_array_type != 0— every typedint/float/string/datetime fast path) it bails:
AutoMap.append/extendreject array-backed keys the same way. So the existing growth is inseparablefrom the boxed/object path — exactly the slow path
factorizeavoids. Reusing it as-is would throw awaythe typed-hashing speed.
Proposed approach
An open-addressing rehash doesn't need the key value — only to re-place each
(keys_pos, hash)entry by re-probing. Generalize
grow_table's array-keyed branch to do this. Two viable variants:hash & mask→ SCAN → reprobe sequence to the first emptyslot. No key access at all; dtype-agnostic.
self->keys(an array) at validkeys_posindices, call the existinglookup_hash_*(new_table, key_at(keys_pos), h, kat)to findthe new slot; it re-probes and resolves same-hash collisions via
==. Reuses code we already rely on.Then wire
factorize:min(n, 4096)-sized) rather thann.kreaches theload threshold (
table_size * LOAD), grow (rehash) before the next lookup, so no re-probe of thecurrent element is needed.
This is safe for the rest of the codebase: real
FrozenAutoMaps are pre-sized and never regrow, so thenew array-keyed rehash branch is dead for them and changes no existing behavior.
Tradeoffs
k ≪ n): ~10–15% and better cache behavior.k ≈ n) does O(log n) rehashes (total O(n) extra copies), a modestregression. Mitigated by the
min(n, 4096)starting size so low-kneeds zero or one growth, andhigh-
kgrows geometrically (amortized O(n), constant factor small).Acceptance criteria
grow_table(or a dedicated helper) can rehash an array-keyed table correctly for all supported KATs(int/uint/float/unicode/string/datetime), verified by the existing
factorizeround-trip andrandomized parity tests.
factorizescratch table starts small and grows; no correctness change (alltest/test_factorize.pyand the full suite pass).
kfactorize(int/float) with no material regression on theall-unique case; report float vs
pandas.factorizebefore/after.References
factorizeprimitive #212 (thefactorizeprimitive)src/auto_map.c:grow_table(rehash guard),lookup_hash_*(typed probes),factorize