feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760
feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760goel-skd wants to merge 6 commits into
Conversation
bec8884 to
69cc006
Compare
Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.
69cc006 to
f42e2da
Compare
wgtmac
left a comment
There was a problem hiding this comment.
I haven't checked the implementation and test yet. Just post some thoughts around APIs.
Thanks much @wgtmac. I responded to your comments. |
ToLower: note it uses Unicode simple (1:1) case mapping and document where it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper: spell out the ASCII-only behavior and why no Unicode variant is provided. Also document EqualsIgnoreCase inheriting ToLower's mapping. Addresses API review comments on apache#760. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ToLower: note it uses Unicode simple (1:1) case mapping and document where it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper: spell out the ASCII-only behavior and why no Unicode variant is provided. Also document EqualsIgnoreCase inheriting ToLower's mapping. Addresses API review comments on apache#760.
c692240 to
9ffebc3
Compare
The byte-slice in StartsWithIgnoreCase (str.substr(0, prefix.size()) before lowercasing) is wrong when ToLower changes byte length: "İ" (U+0130) is two bytes but lower-cases to "i", so "İx" should match prefix "i" but does not. This test pins that behavior; it fails against the current implementation and is fixed by the following commit. Relates to apache#760.
Compare the ToLower forms of both inputs instead of byte-slicing str to prefix.size() before lowercasing. ToLower can change a string's byte length (e.g. "İ" U+0130 lower-cases to "i"), so the old slice could split a code point or wrongly reject a valid match. Makes the regression test from the previous commit pass. Relates to apache#760.
| /// Intended for case-insensitive name matching, similar to Iceberg Java's | ||
| /// toLowerCase(Locale.ROOT). The mapping is locale-independent, matching the intent | ||
| /// of Locale.ROOT. It uses simple (1:1) case mapping rather than Java's full case | ||
| /// mapping, so results differ for a few code points; e.g. U+0130 (capital I with dot |
There was a problem hiding this comment.
If they don't match, how can iceberg-cpp read identifier path containing such characters written by java?
|
Thanks for all the review here. Since this is a shared util it's picked up a bunch of good ideas, but some of them expand the scope, so I want to draw a line on what lands here vs. what we track separately. What I'll do in this PR:
On the simple vs. full case mapping question (@manuzhang): reading Java-written tables isn't affected, since field names are stored verbatim in metadata and not lower-cased. The simple/full difference only shows up in case-insensitive name resolution, and only for a few code points like U+0130. For ASCII and the vast majority of letters they're identical. Matching Java bit-for-bit on those edge cases is a bigger change and I'd rather not block this PR on it. Two things I'd punt to follow-up issues:
Does that split sound reasonable? If so I'll push the fast-path + doc changes and open the two follow-ups. |
|
Thanks @goel-skd for your effort! I think 1:1 Java mapping is difficult to achieve so it's fine to be out of this PR's scope. However, it's correctness issue for invalid UTF-8 so it would be better to address them altogether in this PR. WDYT? |
@wgtmac - Thanks, I agree that is fair. Will push the updated code. |
db1f5a8 to
3e063ae
Compare
Replaces the ASCII-only
StringUtils::ToLowerwith a Unicode-aware implementation backed by utf8proc, so case-insensitive name handling matches Iceberg Java'stoLowerCase(Locale.ROOT).ToLowerlower-cases UTF-8 using utf8proc simple (1:1) case mapping (e.g.CAFÉ→café,GROẞE→große). Pure-ASCII input takes a byte-wise fast path and never touches utf8proc. The function is total: a byte that does not begin a valid UTF-8 sequence is passed through unchanged and decoding resumes at the next byte, so the valid code points around it are still lower-cased.EqualsIgnoreCaseandStartsWithIgnoreCasefold throughToLower, so they are case-insensitive for non-ASCII letters too. Both take an allocation-free ASCII fast path and fall back toToLoweronly when a non-ASCII byte is present.StartsWithIgnoreCaseno longer assumes case mapping preserves byte length — the old byte-slice guard mishandled length-changing maps (e.g.İU+0130 →i). Now"İx"starts with"i", and"i"starts with"İ".ToUpperis intentionally left ASCII-only — it only normalizes ASCII enum/codec strings, and simple case mapping would be wrong for some letters (e.g.ßwould stay unchanged instead of becomingSS).subprojects/utf8proc.wrap) builds, with matching source hashes and correct static-vs-shared linkage on Windows.Testing
string_util_test.cc:ToLowerUnicode(incl. invalid / truncated / stray-continuation bytes),ToUpperAsciiOnly, and Unicode + invalid-UTF-8 cases forEqualsIgnoreCase/StartsWithIgnoreCase.IgnoreCaseAgreesWithToLowerOracleexhaustively checks both ASCII fast paths against theToLoweroracle over all short strings drawn from an alphabet with length-changing maps (İ→i,K→k) and invalid UTF-8.Follow-up
Part of #613.
Follow-up to #748.
Pre-cursor to #808.