TextDecoder: accept all WHATWG utf-8 encoding labels #198
Open
bkaradzic-microsoft wants to merge 1 commit into
The TextDecoder constructor only accepted the exact labels "utf-8"/"UTF-8"
and threw for every other spelling. Per the WHATWG Encoding Standard, an
encoding label is matched after stripping leading/trailing ASCII whitespace
and ASCII-lowercasing, and several labels ("utf8", "unicode-1-1-utf-8",
"unicode11utf8", "unicode20utf8", "x-unicode20utf8") all map to UTF-8.
Consumers such as the Babylon.js glTF/Draco loader construct
`new TextDecoder("utf8")`; the throw aborted decoding mid-load and (in
Babylon Native) left the loader in a state that drove a native out-of-bounds
write, observed as non-deterministic heap corruption on the Draco
validation tests.
Normalize the label per spec and accept all UTF-8 labels. Adds regression
tests for "utf8", case/whitespace variants, the other aliases, and a
still-rejected unsupported encoding.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the TextDecoder polyfill to accept all UTF-8 encoding labels recognized by the WHATWG Encoding Standard (e.g., utf8, unicode11utf8), fixing real-world incompatibilities (notably Babylon.js glTF/Draco loader usage) and adds unit tests to prevent regressions.
Changes:
- Normalize the constructor’s encoding label (ASCII trim + ASCII lowercase) and accept all WHATWG UTF-8 labels.
- Preserve rejection behavior for unsupported encodings (e.g., utf-16).
- Add unit tests covering accepted aliases and normalization behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| Tests/UnitTests/Scripts/tests.ts | Adds unit tests for UTF-8 label aliases, case/whitespace normalization, and unsupported encodings. |
| Polyfills/TextDecoder/Source/TextDecoder.cpp | Implements WHATWG-style label normalization and accepts the full set of UTF-8 labels. |
        label != "unicode20utf8" &&
        label != "x-unicode20utf8")
    {
        throw Napi::Error::New(Env(), "TextDecoder: unsupported encoding '" + encoding + "', only 'utf-8' is supported");
What
TextDecoder's constructor only accepted the exact labels `"utf-8"` and `"UTF-8"`, throwing for every other spelling. This PR makes it accept all WHATWG-spec UTF-8 labels.
Why
Per the WHATWG Encoding Standard, an encoding label is matched after stripping leading/trailing ASCII whitespace and ASCII-lowercasing, and several labels all decode as UTF-8: `utf-8`, `utf8`, `unicode-1-1-utf-8`, `unicode11utf8`, `unicode20utf8`, `x-unicode20utf8`.
Real consumers rely on this. The Babylon.js glTF/Draco loader constructs `new TextDecoder("utf8")` (no hyphen). Because the old check threw on that label, decoding aborted mid-load. In Babylon Native the aborted load left the loader in a state that drove a native out-of-bounds write, which surfaced as non-deterministic heap corruption (`STATUS_HEAP_CORRUPTION`, `0xC0000374`) on the Draco mesh-compression validation tests.
Fix
Normalize the label per the spec (trim ASCII whitespace + ASCII-lowercase) and accept the full set of UTF-8 labels; still throw for genuinely unsupported encodings.
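For readers who want to see the normalization concretely, here is a minimal standalone sketch of the approach described above. It is not the code from `TextDecoder.cpp`: the helper names `IsAsciiWhitespace`, `NormalizeLabel`, and `IsUtf8Label` are hypothetical, and the sketch assumes C++17. One detail worth noting is that "ASCII whitespace" in the WHATWG sense is only TAB, LF, FF, CR, and SPACE, so it is narrower than what `std::isspace` accepts (vertical tab is excluded).

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <string_view>

// ASCII whitespace as defined by the WHATWG Infra Standard:
// TAB, LF, FF, CR, and SPACE. Vertical tab is deliberately not included.
constexpr bool IsAsciiWhitespace(char c)
{
    return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

// Hypothetical helper: strip leading/trailing ASCII whitespace, then
// ASCII-lowercase. Only 'A'..'Z' are changed, so the result does not
// depend on the current locale.
std::string NormalizeLabel(std::string_view label)
{
    while (!label.empty() && IsAsciiWhitespace(label.front()))
    {
        label.remove_prefix(1);
    }
    while (!label.empty() && IsAsciiWhitespace(label.back()))
    {
        label.remove_suffix(1);
    }

    std::string result{label};
    for (char& c : result)
    {
        if (c >= 'A' && c <= 'Z')
        {
            c = static_cast<char>(c - 'A' + 'a');
        }
    }
    return result;
}

// Hypothetical helper: true if the label, once normalized, is one of the
// UTF-8 labels listed in this PR description.
bool IsUtf8Label(std::string_view label)
{
    static constexpr std::array<std::string_view, 6> kUtf8Labels{
        "utf-8", "utf8", "unicode-1-1-utf-8",
        "unicode11utf8", "unicode20utf8", "x-unicode20utf8"};

    const std::string normalized = NormalizeLabel(label);
    return std::find(kUtf8Labels.begin(), kUtf8Labels.end(),
                     std::string_view{normalized}) != kUtf8Labels.end();
}
```

With these helpers, `IsUtf8Label("  UTF8\n")` is true and `IsUtf8Label("utf-16")` is false; a caller would throw on the false case, as the constructor does for unsupported encodings.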
Verification
- Draco validation tests (`GLTF Serializer KHR draco mesh compression`, `GLTF Buggy with Draco Mesh Compression`) now pass (3/3 runs each) with this fix vendored in; a third (`GLTF Box with bad Draco normalized flag`) no longer crashes.
- Unit tests cover the `utf8` label, case/whitespace variants, the other UTF-8 aliases, and a still-rejected encoding (`utf-16`).
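The label cases listed above can also be written as a small table-driven check. The PR's actual tests live in `Tests/UnitTests/Scripts/tests.ts`; what follows is only a C++ analogue, sketched under assumptions: `CheckLabelOrThrow`, `LabelCase`, and `CountLabelCheckFailures` are hypothetical names, and `std::invalid_argument` stands in for the `Napi::Error` the polyfill throws.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Stand-in for the constructor's label check: returns normally for a UTF-8
// label and throws otherwise. std::invalid_argument is an assumption here;
// the polyfill itself throws a Napi::Error.
void CheckLabelOrThrow(std::string_view label)
{
    const auto isSpace = [](char c) {
        return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    };
    std::string_view trimmed = label;
    while (!trimmed.empty() && isSpace(trimmed.front())) trimmed.remove_prefix(1);
    while (!trimmed.empty() && isSpace(trimmed.back())) trimmed.remove_suffix(1);

    std::string lowered{trimmed};
    std::transform(lowered.begin(), lowered.end(), lowered.begin(), [](char c) {
        return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
    });

    static constexpr std::array<std::string_view, 6> kUtf8Labels{
        "utf-8", "utf8", "unicode-1-1-utf-8",
        "unicode11utf8", "unicode20utf8", "x-unicode20utf8"};
    if (std::find(kUtf8Labels.begin(), kUtf8Labels.end(),
                  std::string_view{lowered}) == kUtf8Labels.end())
    {
        throw std::invalid_argument("unsupported encoding '" + std::string{label} + "'");
    }
}

struct LabelCase
{
    std::string_view label;
    bool shouldAccept;
};

// Runs every case and returns how many behaved differently than expected.
std::size_t CountLabelCheckFailures()
{
    static constexpr std::array<LabelCase, 8> kCases{{
        {"utf8", true},              // the spelling Babylon.js uses
        {"UTF-8", true},             // case variant
        {"  Utf8\t", true},          // surrounding ASCII whitespace
        {"unicode-1-1-utf-8", true}, // remaining aliases
        {"unicode11utf8", true},
        {"unicode20utf8", true},
        {"x-unicode20utf8", true},
        {"utf-16", false},           // still rejected
    }};

    std::size_t failures = 0;
    for (const LabelCase& testCase : kCases)
    {
        bool accepted = true;
        try
        {
            CheckLabelOrThrow(testCase.label);
        }
        catch (const std::invalid_argument&)
        {
            accepted = false;
        }
        if (accepted != testCase.shouldAccept)
        {
            ++failures;
        }
    }
    return failures;
}
```

Keeping the cases in one table makes it cheap to add another alias or another rejected encoding without writing a new test body.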