mirror of
https://github.com/RustPython/RustPython.git
synced 2026-06-02 19:39:49 +09:00
The _json decoder had two failure modes when a Python str value would contain a lone surrogate (legal per the Python 3 str model): 1. Boundary UnicodeEncodeError: JsonScanner::Callable::call rejected any input str with surrogates via try_into_utf8 before scanning began. 2. Silent U+FFFD corruption: call_scan_once and parse_object's key path called .to_string() on scanstring's Wtf8Buf output, which routes through Wtf8::Display (lossy). Array values and dict keys decoded from JSON \uXXXX escapes silently became U+FFFD. Switch JsonScanner's five PyUtf8StrRef signatures to PyStrRef, drop the entry-point try_into_utf8 call, and feed Wtf8Buf directly to new_str instead of going through .to_string(). Key memoization now uses HashMap<Wtf8Buf, PyStrRef> so surrogate-bearing keys survive interning. parse_number takes &[u8] since JSON numbers are ASCII. Extends the WTF-8 refactor pattern established in #7673 to the decoder. machinery::scanstring already returns Wtf8Buf and is unchanged. Unmasks test_single_surrogate_decode. 214 tests in test.test_json pass with no regressions. Decoder output verified byte-identical to CPython 3.13.4 over 10,000 random fuzz cases (JSON docs containing random surrogate escapes at root/list/dict positions, compared via json.dumps(..., ensure_ascii=True, sort_keys=True)).