Skip to content

factorize: dynamic scratch-table sizing (generalize grow_table to rehash array keys) #214

Description

@flexatone

Summary

Give factorize (#212) a dynamically sized scratch hash table instead of pre-sizing it to n
(the worst-case, all-unique count). The enabler is a small generalization of grow_table so it can
rehash array-keyed tables, not just the list/object-backed ones it supports today.

Motivation

factorize builds a private scratch FAMObject and sizes it once with grow_table(&scratch, n).
For the common grouping/pivot workload the number of distinct keys k is far smaller than n, so the
table is drastically over-allocated:

  • n = 1_000_000 → a table of ~2^21 slots ≈ 32 MB, of which only k slots are ever touched.
  • Those k touched slots are scattered across 32 MB, so lookups reach into L3 instead of staying
    L2/L1-resident.

Measured effect (int64, n = 1M, comparing the n-sized table against a small, cache-resident table):

case n-sized (32 MB) small table gain
k=100 2.62 ms 2.20 ms ~16%
k=5000 3.21 ms 2.91 ms ~9%

This also matters for closing the last gap to pandas. After adding a fast bit-based float hash,
factorize on float64 is at near-parity with pandas.factorize (0.92–0.97x at the larger/higher-k
cases) but still slightly behind; a right-sized, cache-resident table is the most likely lever to flip
float (and low-k int) past pandas. For context, factorize already beats pandas decisively on
<U strings (4–9x), datetime64 (1.4–1.7x), and is at parity on int64.

Why it isn't a free reuse of existing growth

AutoMap/FrozenAutoMap already grow dynamically, but only for the list/object representation.
grow_table's rehash re-inserts each key via insert_obj(self, PyList_GET_ITEM(self->keys, i), i, h),
which needs a Python-object key list. For any array-keyed table (keys_array_type != 0 — every typed
int/float/string/datetime fast path) it bails:

// src/auto_map.c, grow_table()
if (size_old) {
    if (self->keys_array_type) {
        PyErr_SetString(PyExc_NotImplementedError, "Cannot grow table for array keys");
        goto restore;
    }
    ...
}

AutoMap.append/extend reject array-backed keys the same way. So the existing growth is inseparable
from the boxed/object path — exactly the slow path factorize avoids. Reusing it as-is would throw away
the typed-hashing speed.

Proposed approach

An open-addressing rehash doesn't need the key value — only to re-place each (keys_pos, hash)
entry by re-probing. Generalize grow_table's array-keyed branch to do this. Two viable variants:

  1. Rehash by stored hash — replay the hash & mask → SCAN → reprobe sequence to the first empty
    slot. No key access at all; dtype-agnostic.
  2. Rehash via the typed probes — since the entries reference self->keys (an array) at valid
    keys_pos indices, call the existing lookup_hash_*(new_table, key_at(keys_pos), h, kat) to find
    the new slot; it re-probes and resolves same-hash collisions via ==. Reuses code we already rely on.

Then wire factorize:

  • Start the scratch table at a modest size (e.g. min(n, 4096)-sized) rather than n.
  • Add a growth trigger at the top of the element loop: when the running distinct count k reaches the
    load threshold (table_size * LOAD), grow (rehash) before the next lookup, so no re-probe of the
    current element is needed.

This is safe for the rest of the codebase: real FrozenAutoMaps are pre-sized and never regrow, so the
new array-keyed rehash branch is dead for them and changes no existing behavior.

Tradeoffs

  • Helps the common low-cardinality case (k ≪ n): ~10–15% and better cache behavior.
  • The rare all-unique case (k ≈ n) does O(log n) rehashes (total O(n) extra copies), a modest
    regression. Mitigated by the min(n, 4096) starting size so low-k needs zero or one growth, and
    high-k grows geometrically (amortized O(n), constant factor small).

Acceptance criteria

  • grow_table (or a dedicated helper) can rehash an array-keyed table correctly for all supported KATs
    (int/uint/float/unicode/string/datetime), verified by the existing factorize round-trip and
    randomized parity tests.
  • factorize scratch table starts small and grows; no correctness change (all test/test_factorize.py
    and the full suite pass).
  • Benchmark shows improvement on low-k factorize (int/float) with no material regression on the
    all-unique case; report float vs pandas.factorize before/after.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions