fix: decode numeric references to surrogate code points as U+FFFD #102
Open
spokodev wants to merge 1 commit into
`decode('�')`, and any numeric character reference in the surrogate
range U+D800..U+DFFF, returned a lone surrogate instead of the replacement
character. The WHATWG numeric character reference end state requires a
surrogate code point to be set to 0xFFFD; emitting a lone surrogate produces
strings that are not well-formed UTF-16 and break round-trips through UTF-8
and JSON.stringify.
The numeric branch passed code points <= 0xFFFF straight to fromCharCode,
which yields the lone surrogate. Add the surrogate range to the existing
out-of-bounds check. Non-surrogate references (C1 Windows-1252 mappings,
astral code points, named references) are unaffected.
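
As a rough illustration of the check described above, here is a minimal sketch. It is not the library's actual code: `decodeNumericReference` is a hypothetical name, the input is assumed to be a non-negative integer already parsed from the reference, and the null and C1 Windows-1252 handling are left out.

```javascript
// Hypothetical helper, not the patched library's real implementation.
// Assumes `codePoint` is a non-negative integer parsed from &#...; or &#x...;.
function decodeNumericReference(codePoint) {
  // Values beyond U+10FFFF and surrogates (U+D800..U+DFFF) both
  // decode to the replacement character U+FFFD.
  if (
    codePoint > 0x10ffff ||
    (codePoint >= 0xd800 && codePoint <= 0xdfff)
  ) {
    return '\uFFFD';
  }
  // String.fromCodePoint covers BMP and astral code points alike.
  return String.fromCodePoint(codePoint);
}

console.log(decodeNumericReference(0xd800) === '\uFFFD'); // true
console.log(decodeNumericReference(0x41)); // A
```

Putting the surrogate test next to the existing upper-bound test keeps every invalid code point on a single replacement path, so valid references never reach it.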
`decode('�')`, and any numeric character reference in the surrogate range U+D800..U+DFFF, returns a lone surrogate instead of the replacement character.

The WHATWG numeric character reference end state says: "If the number is a surrogate, then ... set the character reference code to 0xFFFD." A lone surrogate is not well-formed UTF-16, so the current output breaks round-trips through UTF-8 and produces invalid data in `JSON.stringify` and `Buffer`.

The numeric branch passed code points `<= 0xFFFF` straight to `fromCharCode`, which yields the lone surrogate. This adds the surrogate range to the existing out-of-bounds check so those references decode to U+FFFD, matching the spec (and the `entities` and `he` packages).

Verified across the full surrogate range in both `&#x..;` and `&#..;` forms; non-surrogate references (C1 Windows-1252 mappings, astral code points, named references) are unaffected. Added a regression test.
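
To make the round-trip problem concrete, the sketch below compares a lone surrogate with U+FFFD when passed through UTF-8. The helper names `isWellFormedUtf16` and `utf8RoundTrip` are made up for this example, and it assumes a runtime with global `TextEncoder` and `TextDecoder` (recent Node.js or a browser).

```javascript
// Hypothetical helper: true when every surrogate code unit in `str`
// belongs to a valid high/low pair.
function isWellFormedUtf16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1);
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false;
      i++; // skip the low half of the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

// Hypothetical helper: encode to UTF-8 bytes and decode back.
function utf8RoundTrip(str) {
  return new TextDecoder().decode(new TextEncoder().encode(str));
}

const lone = String.fromCharCode(0xd800); // what the old numeric branch produced
const replaced = '\uFFFD'; // what the spec asks for

console.log(isWellFormedUtf16(lone)); // false
console.log(utf8RoundTrip(lone) === lone); // false: the encoder substitutes U+FFFD
console.log(utf8RoundTrip(replaced) === replaced); // true
```

The lone surrogate does not survive encoding, while U+FFFD is an ordinary scalar value and round-trips unchanged.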