The _json decoder had two failure modes when a Python str value would
contain a lone surrogate (legal per the Python 3 str model):
1. Boundary UnicodeEncodeError: JsonScanner::Callable::call rejected
   any input str with surrogates via try_into_utf8 before scanning
   began.
2. Silent U+FFFD corruption: call_scan_once and parse_object's key
   path called .to_string() on scanstring's Wtf8Buf output, which
   routes through Wtf8::Display (lossy). Array values and dict keys
   decoded from JSON \uXXXX escapes silently became U+FFFD.
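For reference, a minimal sketch of the behaviour the decoder is expected to
match, using only the standard json module. The input document is made up
for illustration; it places a lone surrogate in an array value and in a dict
key, the two positions named above.

```python
import json

# A JSON document whose \uXXXX escapes name unpaired surrogates,
# once as an array value and once as a dict key.
doc = '["\\ud800", {"\\udc00": 1}]'
value = json.loads(doc)

# Each surrogate should survive as a one-code-point str rather than
# raising UnicodeEncodeError or turning into U+FFFD.
assert value[0] == "\ud800"
assert list(value[1]) == ["\udc00"]
assert "\ufffd" not in value[0]
```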
Switch JsonScanner's five PyUtf8StrRef signatures to PyStrRef, drop
the entry-point try_into_utf8 call, and feed Wtf8Buf directly to
new_str instead of going through .to_string(). Key memoization now
uses HashMap<Wtf8Buf, PyStrRef> so surrogate-bearing keys survive
interning. parse_number takes &[u8] since JSON numbers are ASCII.
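The key memoization can be pictured with a plain Python dict. This is only
an illustrative sketch: intern_key is a hypothetical helper name, and the
dict stands in for the HashMap<Wtf8Buf, PyStrRef> used on the Rust side.

```python
def intern_key(memo, key):
    # dict.setdefault returns the first-seen object for any equal key,
    # so repeated keys share one str object.
    return memo.setdefault(key, key)


memo = {}
# Build the two keys at runtime so they start out as distinct objects.
first = intern_key(memo, "".join(["id", "\ud800"]))
second = intern_key(memo, "".join(["id", "\ud800"]))

# The surrogate-bearing key is hashable and is reused on the second lookup.
assert first is second
assert len(memo) == 1
```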
Extends the WTF-8 refactor pattern established in #7673 to the
decoder. machinery::scanstring already returns Wtf8Buf and is
unchanged.
Unmasks test_single_surrogate_decode. 214 tests in test.test_json
pass with no regressions. Decoder output verified byte-identical to
CPython 3.13.4 over 10,000 random fuzz cases (JSON docs containing
random surrogate escapes at root/list/dict positions, compared via
json.dumps(..., ensure_ascii=True, sort_keys=True)).
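A rough sketch of how such a fuzz comparison could be set up follows. It is
not the harness used for this change: random_string, random_doc, canonical
and fuzz_lines are hypothetical names, and the document shapes and sizes are
arbitrary choices. Only the canonicalisation call is taken from the
description above.

```python
import json
import random


def random_string(rng):
    # Mix plain ASCII with \uXXXX escapes drawn from the surrogate range.
    parts = []
    for _ in range(rng.randint(1, 4)):
        if rng.random() < 0.5:
            parts.append(rng.choice("abc"))
        else:
            parts.append("\\u%04x" % rng.randint(0xD800, 0xDFFF))
    return '"' + "".join(parts) + '"'


def random_doc(rng, depth=0):
    # Place strings at root, list, and dict positions (keys and values).
    kind = rng.choice(["str", "list", "dict"]) if depth < 3 else "str"
    if kind == "str":
        return random_string(rng)
    count = rng.randint(0, 3)
    if kind == "list":
        items = [random_doc(rng, depth + 1) for _ in range(count)]
        return "[" + ", ".join(items) + "]"
    pairs = [
        random_string(rng) + ": " + random_doc(rng, depth + 1)
        for _ in range(count)
    ]
    return "{" + ", ".join(pairs) + "}"


def canonical(doc):
    # ASCII-only, key-sorted text, so outputs can be compared byte for byte.
    return json.dumps(json.loads(doc), ensure_ascii=True, sort_keys=True)


def fuzz_lines(seed, count):
    # A fixed seed makes the same documents appear on every run.
    rng = random.Random(seed)
    return [canonical(random_doc(rng)) for _ in range(count)]
```

Running the same seed under both interpreters and diffing the resulting
lines is one way to do the comparison.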
encode_basestring/encode_basestring_ascii took PyUtf8StrRef, so
json.dumps(str_with_lone_surrogate) raised UnicodeEncodeError at the
Python/Rust boundary before write_json_string ran. CPython's encoder
emits \uXXXX under ensure_ascii=True and passes the surrogate through
unescaped otherwise.
Switch to PyStrRef + s.as_wtf8(), matching scanstring in the same file.
Rewrite write_json_string to accept &Wtf8 and iterate
code_point_indices, emitting \uXXXX for surrogates in ascii mode and
passing their bytes through otherwise. Stop escaping 0x7F in the
ensure_ascii=False path (matches py_encode_basestring). Return Wtf8Buf
via the checked from_bytes so that a broken invariant panics instead of
causing undefined behavior.
Fuzzing also exposed two pre-existing ESCAPE_CHARS typos: 0x0B was
"\u000" and 0x1B was "\u001" (both missing trailing 'b'). Fixed here.
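The expected encoder output for these cases, as a short sketch against the
standard json module. The inputs are picked for illustration: a lone
surrogate, 0x7F, and the two control characters whose table entries were
fixed.

```python
import json

lone = "\ud800"

# ensure_ascii=True (the default) escapes the surrogate as \uXXXX.
ascii_out = json.dumps(lone)
# ensure_ascii=False passes the surrogate through unescaped.
raw_out = json.dumps(lone, ensure_ascii=False)

# 0x7F is escaped only in ascii mode.
del_ascii = json.dumps("\x7f")
del_raw = json.dumps("\x7f", ensure_ascii=False)

# 0x0B and 0x1B need the full four hex digits, ending in 'b'.
ctrl_out = json.dumps("\x0b\x1b")

assert ascii_out == '"\\ud800"'
assert raw_out == '"\ud800"'
assert del_ascii == '"\\u007f"'
assert del_raw == '"\x7f"'
assert ctrl_out == '"\\u000b\\u001b"'
```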
Verified byte-identical with CPython 3.13.4 over 16 manual + 10,000
random fuzz cases. Full test.test_json: 214 tests, 0 failures, 0
unexpected successes. Unmasks test_ascii_non_printable_encode and
test_single_surrogate_encode. Decoder path is a follow-up.
* Fix stack overflow on deeply-nested JSON in json.loads()
json.loads() on a deeply-nested array or object payload (e.g.
'[' * 50000 + ']' * 50000) overflowed the native Rust stack and
crashed the interpreter process with SIGSEGV. CPython raises
RecursionError on the same input via _Py_EnterRecursiveCall in
Modules/_json.c.
The recursion lives in the mutual call chain:
    JsonScanner::parse_object / parse_array
      -> JsonScanner::call_scan_once
        -> JsonScanner::parse_object / parse_array
Every descent funnels through call_scan_once, so wrapping its body
with vm.with_recursion covers both '{' and '[' paths (and their
mixed nesting) with a single guard.
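The single-guard idea can be sketched in Python with a toy grammar (nested
arrays of the digit 0, no whitespace). Everything here is hypothetical:
DepthGuard stands in for vm.with_recursion, and scan_once / parse_array only
mirror the shape of the Rust call chain, not its code. An object parser
would re-enter scan_once the same way and so be covered by the same guard.

```python
class DepthGuard:
    """Hypothetical stand-in for vm.with_recursion: counts active descents."""

    def __init__(self, limit):
        self.limit = limit
        self.depth = 0

    def run(self, func, *args):
        if self.depth >= self.limit:
            raise RecursionError(
                "maximum recursion depth exceeded while "
                "decoding a JSON object from a string"
            )
        self.depth += 1
        try:
            return func(*args)
        finally:
            # Always unwind, so the guard is reusable after an error.
            self.depth -= 1


def scan_once(text, idx, guard):
    # Every descent passes through here, so one guard covers all paths.
    return guard.run(_scan_once, text, idx, guard)


def _scan_once(text, idx, guard):
    ch = text[idx]
    if ch == "[":
        return parse_array(text, idx + 1, guard)
    if ch == "0":
        return 0, idx + 1
    raise ValueError("unexpected character at index %d" % idx)


def parse_array(text, idx, guard):
    items = []
    if text[idx] == "]":
        return items, idx + 1
    while True:
        value, idx = scan_once(text, idx, guard)  # re-enters the guard
        items.append(value)
        if text[idx] == "]":
            return items, idx + 1
        if text[idx] != ",":
            raise ValueError("expected ',' or ']' at index %d" % idx)
        idx += 1


def parse(text, limit=50):
    value, _ = scan_once(text, 0, DepthGuard(limit))
    return value
```

With the default limit of 50, parse("[[0,0],[0]]") returns the nested lists,
while parse("[" * 1000 + "]" * 1000) raises RecursionError instead of
descending further.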
Before:
    ./rustpython -c "import json; json.loads('[' * 50000 + ']' * 50000)"
    -> SIGSEGV (exit 139)
After:
    -> RecursionError: maximum recursion depth exceeded while
       decoding a JSON object from a string
Verified:
- extra_tests/snippets/stdlib_json.py: all assertions pass
  (includes 3 new regression cases: array, object, alternating
  nesting at depth 100000)
- cargo run -- -m test test_json: 214 passed, 0 regressed
  (9 skipped, 13 expected failures, all pre-existing)
- depth 500000 no longer crashes (RecursionError)
- shallow parsing unchanged
* Enable test_highly_nested_objects_decoding
Per @ShaharNaveh's review on #7632: this test was previously marked
`@unittest.skip("TODO: RUSTPYTHON; crashes")` because json.loads
would SIGSEGV on the 500_000-deep input. The recursion guard added
in this PR makes it raise RecursionError like CPython, so the skip
decorator can be removed.
    $ cargo run -- -m unittest \
        test.test_json.test_recursion.TestCRecursion.test_highly_nested_objects_decoding \
        test.test_json.test_recursion.TestPyRecursion.test_highly_nested_objects_decoding
    ...
    Ran 2 tests in 0.825s
    OK
    $ cargo run -- -m test test_json
    Ran 214 tests (7 skipped, 13 expected failures); all pass.
* Parse JSON in Rust
* Reuse key when decoding JSON
* Unmark resolved test
* Parse null/true/false directly in call_scan_once
Parse JSON constants (null, true, false) directly in Rust within
call_scan_once() instead of falling back to Python scan_once.
This reduces Python-Rust boundary crossings for array/object values.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Parse numbers directly in call_scan_once
Parse JSON numbers starting with digits (0-9) directly in Rust within
call_scan_once() by reusing the existing parse_number() method.
This reduces Python-Rust boundary crossings for array/object values.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Parse NaN/Infinity/-Infinity in call_scan_once
Parse special JSON constants (NaN, Infinity, -Infinity) and negative
numbers directly in Rust within call_scan_once(). This handles:
- 'N' -> NaN via parse_constant callback
- 'I' -> Infinity via parse_constant callback
- '-' -> -Infinity or negative numbers via parse_constant/parse_number
This reduces Python-Rust boundary crossings for array/object values.
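The first-character dispatch described in the last three commits can be
sketched in Python as follows. scan_scalar and NUMBER are hypothetical
names, the regex is an assumed approximation of the JSON number grammar, and
parse_constant defaults to float here purely for illustration. The real
logic lives in call_scan_once and parse_number on the Rust side.

```python
import re

# Assumed approximation of the JSON number grammar.
NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(\.[0-9]+)?([eE][-+]?[0-9]+)?")


def scan_scalar(text, idx, parse_constant=float):
    """Return (value, end_index) for the scalar starting at text[idx]."""
    ch = text[idx]
    if ch == "n" and text.startswith("null", idx):
        return None, idx + 4
    if ch == "t" and text.startswith("true", idx):
        return True, idx + 4
    if ch == "f" and text.startswith("false", idx):
        return False, idx + 5
    if ch == "N" and text.startswith("NaN", idx):
        return parse_constant("NaN"), idx + 3
    if ch == "I" and text.startswith("Infinity", idx):
        return parse_constant("Infinity"), idx + 8
    if ch == "-" and text.startswith("-Infinity", idx):
        return parse_constant("-Infinity"), idx + 9
    if ch == "-" or "0" <= ch <= "9":
        match = NUMBER.match(text, idx)
        if match:
            frac, exp = match.group(1), match.group(2)
            literal = match.group()
            # Integers stay int; a fraction or exponent makes a float.
            value = float(literal) if (frac or exp) else int(literal)
            return value, match.end()
    raise ValueError("no scalar at index %d" % idx)
```

Passing a different parse_constant, for example one that returns the
matched name unchanged, shows that the three special constants go through
the callback while null, true, and false do not.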
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Correct wrong index access
* Leave more flame span
* Refactor json scanstring with byte index
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>