Bug report
Bug description
In Lib/zoneinfo/_zoneinfo.py, _parse_dst_start_end() validates the Mm.w.d
transition rule strictly with an re.ASCII fullmatch, but the Jn (Julian)
and n (0-based) day-of-year branches fall through to a bare int(date) with
no format guard:

    else:
        if type == "J":
            n_is_julian = True
            date = date[1:]
        else:
            n_is_julian = False
        doy = int(date)  # <-- no ASCII / format check
        offset = _DayOffset(doy, n_is_julian)

int() accepts things the C accelerator's day-of-year parser rejects. The C
side (Modules/_zoneinfo.c, parse_transition_rule) reads the field with
parse_digits(&ptr, 1, 3, &day), which consumes 1 to 3 ASCII digits via
Py_ISDIGIT and nothing else. So the two implementations disagree on the same
POSIX TZ string.
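To make the C-side behaviour concrete, here is a rough Python model of the digit reader as described above. It is a sketch written from that description, not a transcription of the C source; the name parse_digits_model and its return shape are hypothetical.

```python
ASCII_DIGITS = "0123456789"


def parse_digits_model(s, pos, min_digits, max_digits):
    """Hypothetical model: consume min_digits..max_digits ASCII digits.

    Returns (value, new_pos). Raises ValueError if fewer than
    min_digits ASCII digits are found at pos.
    """
    end = pos
    # Stop at the first non-ASCII-digit or once max_digits were consumed.
    while end < len(s) and end - pos < max_digits and s[end] in ASCII_DIGITS:
        end += 1
    if end - pos < min_digits:
        raise ValueError(f"expected {min_digits}-{max_digits} ASCII digits at {pos}")
    return int(s[pos:end]), end


value, rest_at = parse_digits_model("0001", 0, 1, 3)
print(value, repr("0001"[rest_at:]))  # the fourth digit is left unconsumed
```

In this model a sign, a space, or a non-ASCII digit fails immediately, while 1_0 and 0001 leave trailing characters behind (_0 and 1). Per the report, the C parser then rejects the string rather than accepting a shorter number.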
The most serious case is a silent miscompile, not a crash: int('1_0')
is 10 (PEP 515 underscore grouping), so a TZ string like
AAA4BBB,J1_0,J300/2 builds a valid but different zone (DST starts on day
10) in pure Python, while the C accelerator raises ValueError. A program
that relies on the pure-Python fallback silently computes wrong local times
instead of reporting the malformed rule.
Other pure-accept / C-reject inputs for the day-of-year field: a leading +
(J+1), a leading space (J 1), 4-or-more-digit widths (J0001), and
non-ASCII digits (Arabic-Indic J١).
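The permissiveness of int() on these forms is easy to confirm in isolation. The snippet below only exercises the built-in on the day-of-year part of each token (after any J prefix is stripped); the sample list is taken from the inputs named above.

```python
# Day-of-year strings that int() accepts even though they are not
# plain runs of 1 to 3 ASCII digits.
samples = ["1_0", "+1", " 1", "0001", "\u0661"]  # "\u0661" is Arabic-Indic digit one

results = {text: int(text) for text in samples}
for text, value in results.items():
    print(f"int({text!r}) -> {value}")
```

Every one of these parses without error, and the underscore form yields 10 rather than 1.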
Differential (main, before fix)
The TZ template is AAA4BBB,<token>,J300/2, where only <token> varies. Each
string was loaded through both implementations via a crafted TZif v2+ footer:
| token | C accelerator | pure-Python (before) |
|---|---|---|
| J1_0 | reject | accept: day 10 (silent miscompile) |
| 1_0 | reject | accept: day 10 (silent miscompile) |
| J+1 | reject | accept |
| +1 | reject | accept |
| J 1 | reject | accept |
| " 1" (leading space) | reject | accept |
| J0001 | reject | accept |
| 0001 | reject | accept |
| J١ (Arabic 1) | reject | accept |
| ١ | reject | accept |
| J01, J001 | accept | accept (agree; 1-3 digit leading zeros are valid) |
| J1, J365, 0, 365 | accept | accept (valid controls) |
| J366, J400, J1234 | reject | reject (agree; range/width) |
That gives 10 divergent inputs. The C accelerator consumes at most 3 digits,
so J0001 (4 digits) is rejected by C, and any fix must reject it as well.
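A differential like the one above could be reproduced along these lines. This is a minimal sketch, not the harness used for the table: make_tzif and classify are hypothetical helper names, the file contains a single placeholder UTC time type with no transitions, and encoding the footer as UTF-8 is an assumption.

```python
import io
import struct
from zoneinfo import ZoneInfo


def make_tzif(tz_string):
    """Build a minimal TZif v2 file whose footer carries tz_string."""
    def header(typecnt, charcnt):
        # Magic, version "2", 15 reserved bytes, then six big-endian counts:
        # isutcnt, isstdcnt, leapcnt, timecnt, typecnt, charcnt.
        counts = struct.pack(">6l", 0, 0, 0, 0, typecnt, charcnt)
        return b"TZif2" + b"\x00" * 15 + counts

    # One local time type (offset 0, not DST, abbreviation index 0)
    # followed by the NUL-terminated abbreviation table.
    block = header(1, 4) + struct.pack(">lbb", 0, 0, 0) + b"UTC\x00"
    # v1 block, v2 block, then the newline-delimited footer.
    # Assumption: the footer is encoded as UTF-8.
    return block + block + b"\n" + tz_string.encode("utf-8") + b"\n"


def classify(token, zone_cls=ZoneInfo):
    """Return "accept" or "reject" for one <token> in the TZ template."""
    tz = f"AAA4BBB,{token},J300/2"
    try:
        zone_cls.from_file(io.BytesIO(make_tzif(tz)), key="crafted")
    except ValueError:
        return "reject"
    return "accept"


for token in ["J1", "J365", "J400"]:
    print(token, classify(token))
```

By default this exercises whichever implementation zoneinfo.ZoneInfo resolves to. To compare the two, pass the pure-Python class as zone_cls; it lives in the private zoneinfo._zoneinfo module named in this report.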
CPython versions
main (3.16). The pure-Python parser has carried this since the POSIX TZ
support was added.
Fix
Add an re.ASCII digit guard matching C's parse_digits(&ptr, 1, 3, &day)
(1 to 3 ASCII digits) before int(), in the J/n branch only:

    if re.fullmatch(r"\d{1,3}", date, re.ASCII) is None:
        raise ValueError(f"Invalid dst start/end date: {dststr}")
    doy = int(date)

This makes the pure-Python parser match C exactly for this field: it rejects
the 10 divergent inputs, still accepts the leading-zero J01/J001 forms that C
accepts, and leaves the existing _DayOffset range check (1 to 365 for Jn, 0
to 365 for n) to reject out-of-range values, so no numeric-range behaviour
changes. All 499 bundled IANA zones parse byte-identically through both
implementations after the fix.
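The effect of the guard can be checked in isolation against the tokens from the differential. The function below is a hypothetical standalone stand-in for the J/n branch, with a simplified copy of the range check in place of _DayOffset. It is a sketch for illustration, not the patched CPython code.

```python
import re


def parse_day_of_year(date):
    """Hypothetical standalone version of the guarded J/n branch."""
    julian = date.startswith("J")
    if julian:
        date = date[1:]
    # The guard: exactly 1 to 3 ASCII digits, nothing else.
    if re.fullmatch(r"\d{1,3}", date, re.ASCII) is None:
        raise ValueError(f"Invalid dst start/end date: {date!r}")
    doy = int(date)
    # Simplified stand-in for the existing _DayOffset range check.
    low = 1 if julian else 0
    if not low <= doy <= 365:
        raise ValueError(f"day of year out of range: {doy}")
    return doy, julian


def accepts(token):
    try:
        parse_day_of_year(token)
    except ValueError:
        return False
    return True


DIVERGENT = ["J1_0", "1_0", "J+1", "+1", "J 1", " 1",
             "J0001", "0001", "J\u0661", "\u0661"]
VALID = ["J01", "J001", "J1", "J365", "0", "365"]
OUT_OF_RANGE = ["J366", "J400", "J1234"]

print([t for t in DIVERGENT + OUT_OF_RANGE if accepts(t)])  # tokens wrongly accepted
print([t for t in VALID if not accepts(t)])                 # tokens wrongly rejected
```

Both lists come out empty under this sketch, which matches the accept/reject column reported for the C accelerator.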
Linked PRs
#152848
#152908 (GH-152848)
#152909 (GH-152848)
#152910 (GH-152848)