Allow NJS compiled with pcre2 to handle surrogate code points in regular expressions #495
…lar expressions Since JavaScript uses UTF-16 to encode strings, it is assumed that [surrogate code points](https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/#24-surrogate-pairs) will be valid regex input. Some libraries, including common polyfills and transpiler plugins, use such regular expressions. By default, pcre2 does not allow code points in these ranges, failing with `disallowed Unicode code point`. In order to allow pcre2 to work with these code points, the `PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES` option must be set using [pcre2_set_compile_extra_options](https://www.pcre.org/current/doc/html/pcre2_set_compile_extra_options.html). This change sets this configuration in `pcre2_compile()` if NJS is using pcre2.
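To illustrate the kind of pattern at issue, here is a minimal sketch of a surrogate-range regex of the sort a polyfill might contain. The helper name `countSymbols` is hypothetical and is not taken from any of the libraries mentioned; the behaviour described assumes a UTF-16 engine such as V8.

```javascript
// Matches one UTF-16 surrogate pair: a high surrogate followed by a low one.
// The \u escapes fall in the surrogate range, which is the kind of input
// the description above says pcre2 rejects by default.
const reSurrogatePair = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Hypothetical helper: count symbols by collapsing each surrogate pair
// into a single placeholder character before measuring the length.
function countSymbols(str) {
  return str.replace(reSurrogatePair, '_').length;
}

console.log(countSymbols('abc'));
console.log(countSymbols('a\u{1F4A9}b'));
```

In a UTF-16 engine both calls report three symbols, because the emoji's two code units are collapsed into one placeholder.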
|
Hi Javier, Thanks for your patch. njs stores strings using UTF-8 encoding. This means that surrogate pairs are converted to UTF-8 encoded symbols when strings are parsed (and are not stored as UTF-16 surrogate pairs). This is one of the few incongruences with the ECMAScript standard that njs has. It manifests itself in the rare situation when Unicode characters above the Basic Multilingual Plane are used.
node
>> String.fromCodePoint(0xD83D, 0xDCA9).length
2
njs
>> String.fromCodePoint(0xD83D, 0xDCA9).length
1
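The length difference above follows from how a surrogate pair maps to one code point. The following is a rough sketch of that arithmetic; `combineSurrogates` is a hypothetical helper written for this illustration, and the counts it prints assume a spec-compliant UTF-16 engine such as Node.

```javascript
// Combine a UTF-16 surrogate pair into the single code point it encodes.
function combineSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const codePoint = combineSurrogates(0xD83D, 0xDCA9);
const s = String.fromCodePoint(codePoint);

const utf16Units = s.length;                           // UTF-16 code units
const codePoints = Array.from(s).length;               // Unicode code points
const utf8Bytes = new TextEncoder().encode(s).length;  // UTF-8 bytes

console.log(codePoint.toString(16), utf16Units, codePoints, utf8Bytes);
```

An engine that counts UTF-16 code units reports a length of 2 for this string, while one that stores the symbol as a single UTF-8 encoded character, as described for njs, reports 1.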
|
Thanks for responding, @xeioex. I accidentally targeted the wrong branch with my patch, which is why I closed it immediately. However, maybe I can use it as an opportunity to ask a question. It's good to know that njs deals in UTF-8 internally. This patch came from my experience trying to get third-party modules to reliably transpile using webpack. There are some common libraries that explicitly reference astral code points in the code. Here's an example and another. For this I've found two workarounds:
Through my experimentation I've found that using one of these strategies greatly increases the number of npm modules I can use without issues. However, I wonder which approach is preferred? The pcre2 documentation doesn't go into detail about the effects of setting |
Ok, let me see by myself. The second option seems more brittle to me. |
|
Feel free to test the following patch:
# HG changeset patch
# User Dmitry Volyntsev <[email protected]>
# Date 1651709682 25200
# Wed May 04 17:14:42 2022 -0700
# Node ID d845a1b766aacfb4303a3f7c63705273f389139b
# Parent 80ed74a0e2058a5f611c1227283e72a168693e1b
Improved surrogate pairs support for PCRE2 backend.
Prodded by Javier Evans.
diff --git a/external/njs_regex.c b/external/njs_regex.c
--- a/external/njs_regex.c
+++ b/external/njs_regex.c
@@ -60,8 +60,23 @@ njs_regex_compile_ctx_t *
njs_regex_compile_ctx_create(njs_regex_generic_ctx_t *ctx)
{
#ifdef NJS_HAVE_PCRE2
+ pcre2_compile_context *cc;
- return pcre2_compile_context_create(ctx);
+ cc = pcre2_compile_context_create(ctx);
+ if (njs_fast_path(cc != NULL)) {
+ /* Workaround for surrogate pairs in regular expressions
+ *
+ * This option is needed because njs unlike the standard ECMAScript
+ * stores and process strings in UTF-8 encoding.
+ * PCRE2 does not support surrogate pairs by default when it
+ * is compiled for UTF-8 only strings. But many polyfills
+ * and transpiler uses such surrogate pairs expressions.
+ */
+ pcre2_set_compile_extra_options(cc,
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES);
+ }
+
+ return cc;
#else
|
Thank you for looking into this! I agree that the first approach seems better since it will allow anyone with a current version of NJS to have wider compatibility. Test:
Attempt to import the above-linked lodash helper:
And the lodash script: Looks like it's working! The second test Is it possible that we're getting a false negative because njs works in UTF-8 while the regex decides whether a string contains Unicode by looking for a surrogate pair? If so, that is interesting and could expose npm package users to subtle bugs. I'll keep looking to see if anything else could explain it.
import hu from './has_unicode.mjs';
console.log(`hu('a'): ${hu('a')}`);
console.log(`hu('α'): ${hu('α')}`); |
|
Ah, that's interesting. It seems the function is either wrong or means something different by "has Unicode". An emoji is apparently interpreted differently in njs. However, in both njs and V8, non-ASCII characters return |
agree.
As I said, the difference will be for Unicode characters above the Basic Multilingual Plane, for example emoji.
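A simplified sketch may help show why such a check treats BMP characters and emoji differently. The regex and the name `hasSurrogate` below are hypothetical stand-ins, not the actual lodash helper, and the results described assume a UTF-16 engine such as V8.

```javascript
// Simplified, hypothetical stand-in for a "has Unicode" check: it only looks
// for code units in the surrogate range, so in a UTF-16 engine it detects
// astral symbols but not BMP characters such as Greek letters.
const reHasSurrogate = /[\ud800-\udfff]/;

function hasSurrogate(str) {
  return reHasSurrogate.test(str);
}

for (const sample of ['a', '\u03b1', '\u{1F4A9}']) {
  console.log(JSON.stringify(sample), hasSurrogate(sample));
}
```

Under that assumption only the emoji is flagged, since it is the only sample stored as a surrogate pair. An engine that never materialises surrogate code units for the emoji could plausibly give a different answer for it, which is the kind of divergence discussed above.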