-
-
Notifications
You must be signed in to change notification settings - Fork 34k
util: add fast path for Latin1 decoding #55275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 2 commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
b12ba58
util: add fast path for Latin1 decoding
3c239bd
handle input correctly in decode functions
17f317e
repair
e0a40c6
Update lib/internal/encoding.js
mertcanaltin b7420b5
add ignoreBOM and fatal support to decodeLatin1
8bc747e
lint
fced029
update tests to support ignoreBOM and fatal for decodeLatin1
af7cca5
lint
659f1b6
correct BOM check in decodeLatin1 for uint8_t comparison
b92e571
add BOM test case to cover 0xFF in decodeLatin1
94109bf
Update encoding_binding.cc
mertcanaltin 6dec37f
delete result.resize
f62f4c4
Update encoding_binding.cc
mertcanaltin f96e2ef
Update encoding_binding.cc
mertcanaltin 77bc5ee
update text-decoder to use windows-1252 encoding
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| #include "encoding_binding.h" | ||
| #include "env-inl.h" | ||
| #include "gtest/gtest.h" | ||
| #include "node_test_fixture.h" | ||
| #include "v8.h" | ||
|
|
||
| namespace node { | ||
| namespace encoding_binding { | ||
|
|
||
| bool RunDecodeLatin1(Environment* env, | ||
| Local<Value> args[], | ||
| Local<Value>* result) { | ||
| Isolate* isolate = env->isolate(); | ||
| TryCatch try_catch(isolate); | ||
|
|
||
| BindingData::DecodeLatin1(FunctionCallbackInfo<Value>(args)); | ||
|
|
||
| if (try_catch.HasCaught()) { | ||
| return false; | ||
| } | ||
|
|
||
| *result = try_catch.Exception(); | ||
| return true; | ||
| } | ||
|
|
||
| class EncodingBindingTest : public NodeTestFixture {}; | ||
|
|
||
| TEST_F(EncodingBindingTest, DecodeLatin1_ValidInput) { | ||
| Environment* env = CreateEnvironment(); | ||
| Isolate* isolate = env->isolate(); | ||
| HandleScope handle_scope(isolate); | ||
|
|
||
| const uint8_t latin1_data[] = {0xC1, 0xE9, 0xF3}; | ||
| Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, sizeof(latin1_data)); | ||
| memcpy(ab->GetBackingStore()->Data(), latin1_data, sizeof(latin1_data)); | ||
|
|
||
| Local<Uint8Array> array = Uint8Array::New(ab, 0, sizeof(latin1_data)); | ||
| Local<Value> args[] = {array}; | ||
|
|
||
| Local<Value> result; | ||
| EXPECT_TRUE(RunDecodeLatin1(env, args, &result)); | ||
|
|
||
| String::Utf8Value utf8_result(isolate, result); | ||
| EXPECT_STREQ(*utf8_result, "Áéó"); | ||
| } | ||
|
|
||
| TEST_F(EncodingBindingTest, DecodeLatin1_EmptyInput) { | ||
| Environment* env = CreateEnvironment(); | ||
| Isolate* isolate = env->isolate(); | ||
| HandleScope handle_scope(isolate); | ||
|
|
||
| Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, 0); | ||
| Local<Uint8Array> array = Uint8Array::New(ab, 0, 0); | ||
| Local<Value> args[] = {array}; | ||
|
|
||
| Local<Value> result; | ||
| EXPECT_TRUE(RunDecodeLatin1(env, args, &result)); | ||
|
|
||
| String::Utf8Value utf8_result(isolate, result); | ||
| EXPECT_STREQ(*utf8_result, ""); | ||
| } | ||
|
|
||
| TEST_F(EncodingBindingTest, DecodeLatin1_InvalidInput) { | ||
| Environment* env = CreateEnvironment(); | ||
| Isolate* isolate = env->isolate(); | ||
| HandleScope handle_scope(isolate); | ||
|
|
||
| Local<Value> args[] = {String::NewFromUtf8Literal(isolate, "Invalid input")}; | ||
|
|
||
| Local<Value> result; | ||
| EXPECT_FALSE(RunDecodeLatin1(env, args, &result)); | ||
| } | ||
|
|
||
| } // namespace encoding_binding | ||
| } // namespace node |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line specifically uses the wrong assumption that
Latin1andwindows-1252are the same encodingThey aren't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm catching now in MDN api docs, say this latin1 same windows-1252: https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was following the WHATWG Encoding Standard, which
explicitly states that in web-compatible software, latin1, iso-8859-1, and ascii
are all just labels for windows-1252
from the spec: "these are synonyms: latin1 and ascii are just labels for
windows-1252, and any software following this standard will, for example, decode 0x80
as U+20AC (€)
however, my implementation was still flawed - it didn't properly handle the 0x80-0x9F
range mapping. The fast path needs a lookup table for those bytes to correctly map them
to their windows-1252 code points (€, smart quotes, etc)
so while the WHATWG standard does treat them as equivalent, my implementation failed to
properly decode them. I support the revert, and if there's interest in a performance
optimization here, it would need to properly implement the windows-1252 mapping table
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WHATWG mapping Latin1 to windows-1252 doesn't mean that Latin1 and windows-1252 are equivalent.
windows-1252 decoder is compatible with decoding Latin1 bytes, this is why the standard can use that mapping, the same way it can alias
asciitowindows-1252There is no
Latin1(orascii) inTextDecoder,new TextDecoder('Latin1')creates awindows-1252decoder which is capable of decodingLatin1, as it's a subset ofwindows-1252.new TextDecoder('ascii')also returns awindows-1252decoder by spec, which is reasonable asasciiis also a subset ofwindows-1252and that decoder can well decodeasciibytes.But we can't implement
windows-1252withLatin1(or withascii-only).We can't even make
new TextDecoder('Latin1')/new TextDecoder('ascii')return a Latin1-only (or ascii-only) decoder, as that would break compat.It should fully support windows-1252
See also our own doc:
node/doc/api/buffer.md
Lines 229 to 234 in 7643c2a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for detail!