The _json decoder had two failure modes when a Python str value would
contain a lone surrogate (legal per the Python 3 str model):
1. Boundary UnicodeEncodeError: JsonScanner::Callable::call rejected
any input str with surrogates via try_into_utf8 before scanning
began.
2. Silent U+FFFD corruption: call_scan_once and parse_object's key
path called .to_string() on scanstring's Wtf8Buf output, which
routes through Wtf8::Display (lossy). Array values and dict keys
decoded from JSON \uXXXX escapes silently became U+FFFD.
Switch JsonScanner's five PyUtf8StrRef signatures to PyStrRef, drop
the entry-point try_into_utf8 call, and feed Wtf8Buf directly to
new_str instead of going through .to_string(). Key memoization now
uses HashMap<Wtf8Buf, PyStrRef> so surrogate-bearing keys survive
interning. parse_number takes &[u8] since JSON numbers are ASCII.
Extends the WTF-8 refactor pattern established in #7673 to the
decoder. machinery::scanstring already returns Wtf8Buf and is
unchanged.
Unmasks test_single_surrogate_decode. 214 tests in test.test_json
pass with no regressions. Decoder output verified byte-identical to
CPython 3.13.4 over 10,000 random fuzz cases (JSON docs containing
random surrogate escapes at root/list/dict positions, compared via
json.dumps(..., ensure_ascii=True, sort_keys=True)).
encode_basestring/encode_basestring_ascii took PyUtf8StrRef, so
json.dumps(str_with_lone_surrogate) raised UnicodeEncodeError at the
Python/Rust boundary before write_json_string ran. CPython's encoder
emits \uXXXX under ensure_ascii=True and passes raw WTF-8 otherwise.
Switch to PyStrRef + s.as_wtf8(), matching scanstring in the same file.
Rewrite write_json_string to accept &Wtf8 and iterate
code_point_indices, emitting \uXXXX for surrogates in ascii mode and
passing their bytes through otherwise. Stop escaping 0x7F in the
ensure_ascii=False path (matches py_encode_basestring). Return Wtf8Buf
via the checked from_bytes so invariant breaks panic instead of UB.
Fuzzing also exposed two pre-existing ESCAPE_CHARS typos: 0x0B was
"\u000" and 0x1B was "\u001" (both missing trailing 'b'). Fixed here.
Verified byte-identical with CPython 3.13.4 over 16 manual + 10,000
random fuzz cases. Full test.test_json: 214 tests, 0 failures, 0
unexpected successes. Unmasks test_ascii_non_printable_encode and
test_single_surrogate_encode. Decoder path is a follow-up.