Allow NJS compiled with pcre2 to handle surrogate code points in regular expressions #495

4141done · 2022-05-03T16:56:28Z

What

Since Javascript uses UTF-16 to encode strings, it is assumed that surrogate code points
will be valid regex input. Some libraries including common polyfills and
transpiler plugins use such regular expressions.

By default, pcre2 does not allow code points in these ranges, failing
with disallowed Unicode code point. In order to allow pcre2
to work with these code points, the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
option must be set using pcre2_set_compile_extra_options.

This change sets this configuration in pcre2_compile() if NJS is using pcre2.

…lar expressions Since Javascript uses UTF-16 to encode strings, it is assumed that [surrogate code points](https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/#24-surrogate-pairs) will be valid regex input. Some libraries including common polyfills and transpiler plugins use such regular expressions. By default, pcre2 does not allow code points in these ranges, failing with `disallowed Unicode code point`. In order to allow pcre2 to work with these code points, the `PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES` option must be set using [pcre2_set_compile_extra_options](https://www.pcre.org/current/doc/html/pcre2_set_compile_extra_options.html). This change sets this configuration in `pcre2_compile()` if NJS is using pcre2.

xeioex · 2022-05-03T23:35:43Z

Hi Javier,

Thanks for your patch.

njs stores strings using UTF-8 encoding. This means that surrogate pair are converted to UTF-8 encoded symbols at the moment of parsing strings (and not stored as surrogate UTF-16 pairs).

This is one of few incongruences with the ECMAScript standard that njs has. This manifests itself in a rare situation when Unicode characters above basic multilingual plane are used.

node 
>> String.fromCodePoint(0xD83D, 0xDCA9).length
2

njs
>> String.fromCodePoint(0xD83D, 0xDCA9).length
1

4141done · 2022-05-04T00:17:50Z

Thanks for responding @xeioex , I accidentally targeted the wrong branch with my patch which is why I closed it immediately. However maybe I can use it as an opportunity to ask a question.

It's good to know that njs deals in UTF-8 internally. This patch came from experiences I had trying to get third party modules to reliably transpile using webpack.

There are some common libraries that explicitly reference astral code points in the code. Here's an example and another.

For this I've found two workarounds:

The code change in this PR
Compiling pcre2 with the flag --enable-pcre2-16 before compiling njs

Through my experimentation I've found that using one of these strategies greatly increases the number of npm modules I can use without issues.

However, I wonder which approach is preferred? The pcre2 documentation doesn't go into detail about the effects of setting PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES or setting --enable-pcre2-16

xeioex · 2022-05-04T23:57:02Z

However, I wonder which approach is preferred?

@4141done

Ok, let me see by myself.

The second option seems more brittle to me.

xeioex · 2022-05-05T00:16:33Z

@4141done

Feel free to test the following patch

# HG changeset patch
# User Dmitry Volyntsev <[email protected]>
# Date 1651709682 25200
#      Wed May 04 17:14:42 2022 -0700
# Node ID d845a1b766aacfb4303a3f7c63705273f389139b
# Parent  80ed74a0e2058a5f611c1227283e72a168693e1b
Improved surrogate pairs support for PCRE2 backend.

Prodded by Javier Evans.

diff --git a/external/njs_regex.c b/external/njs_regex.c
--- a/external/njs_regex.c
+++ b/external/njs_regex.c
@@ -60,8 +60,23 @@ njs_regex_compile_ctx_t *
 njs_regex_compile_ctx_create(njs_regex_generic_ctx_t *ctx)
 {
 #ifdef NJS_HAVE_PCRE2
+    pcre2_compile_context  *cc;
 
-    return pcre2_compile_context_create(ctx);
+    cc = pcre2_compile_context_create(ctx);
+    if (njs_fast_path(cc != NULL)) {
+        /* Workaround for surrogate pairs in regular expressions
+         *
+         * This option is needed because njs unlike the standard ECMAScript
+         * stores and process strings in UTF-8 encoding.
+         * PCRE2 does not support surrogate pairs by default when it
+         * is compiled for UTF-8 only strings. But many polyfills
+         * and transpiler uses such surrogate pairs expressions.
+         */
+        pcre2_set_compile_extra_options(cc,
+                                        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES);
+    }
+
+    return cc;
 
 #else

4141done · 2022-05-06T00:14:47Z

Thank you for looking in to this! I agree that the first approach seems better since it will allow anyone with a current version of NJS to have wider compatibility.

Test:

Recompile pcre2 to be sure it's not compiled with UTF-16 support which could cause a false positive

05/05/2022 04:59:08 PM > pcre2test -16
** This version of PCRE2 was built without 16-bit support

Compile NJS top of master without patch

NJS configuration summary:

 + using CC: "cc"
 + using CFLAGS: " -pipe -fPIC -fvisibility=hidden -O -W -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -fstrict-aliasing -Wstrict-overflow=5 -Wmissing-prototypes -Werror -g -O "

 + using PCRE library: -L/usr/local/lib -lpcre2-8
 + using readline library: -lreadline

 njs build dir: build
 njs CLI: build/njs

Confirm errors:

>> /\u200d\ud800-/
Thrown:
SyntaxError: pcre_compile2("\u200d\ud800-") failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at "-" in shell:1

Attempt to import the above-linked lodash helper:

>> import hasUnicode from 'has-unicode.mjs';
Thrown:
SyntaxError: pcre_compile2("[\u200d\ud800-\udfff\u0300-\u036f\ufe20-\ufe2f\u20d0-\u20ff\u1ab0-\u1aff\u1dc0-\u1dff\ufe0e\ufe0f]") failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at "-\udfff\u0300-\u036f\ufe20-\ufe2f\u20d0-\u20ff\u1ab0-\u1aff\u1dc0-\u1dff\ufe0e\ufe0f]"
    at RegExp (native)
    at module (/Users/j.evans/code/njs/has-unicode.mjs:15)

Recompile with patch

NJS configuration summary:

 + using CC: "cc"
 + using CFLAGS: " -pipe -fPIC -fvisibility=hidden -O -W -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -fstrict-aliasing -Wstrict-overflow=5 -Wmissing-prototypes -Werror -g -O "

 + using PCRE library: -L/usr/local/lib -lpcre2-8
 + using readline library: -lreadline

 njs build dir: build
 njs CLI: build/njs

Rerun tests:

>> /\u200d\ud800-/
/\u200d\ud800-/

And the lodash script:

>> import hasUnicode from 'has-unicode.mjs';
undefined
>> hasUnicode('foo');
false
>> hasUnicode('💩');
false

Looks like it's working! The second test hasUnicode('💩'); seems to be returning a wrong result, but that seems separate from the issue of not being able to accept astral codepoints in regex.

Is it possible that due to the fact that njs works in UTF-8 and the regex's criteria for determining if a string contains unicode depends on the presence of a surrogate pair we're getting a false negative?

If so that is interesting and something that could open up npm package users to subtle bugs. I'll continue to look at it and see if there is anything else that could explain it.

xeioex · 2022-05-06T01:45:04Z

@4141done

import hu from './has_unicode.mjs';

console.log(`hu('a'): ${hu('a')}`);
console.log(`hu('α'): ${hu('α')}`);

$ node t.mjs 
hu('a'): false
hu('α'): false

$ ../quickjs/qjs t.mjs 
hu('a'): false
hu('α'): false

$ ./build/njs t.mjs 
hu('a'): false
hu('α'): false

4141done · 2022-05-06T03:10:34Z

Ah that's interesting. Seems like the function is likely bad or has a different meaning "has Unicode".
However, there is one case where node's interpretation is different from njs'. I added one more test to your example:

import hu from './has-unicode.mjs';

console.log(`hu('a'): ${hu('a')}`);
console.log(`hu('α'): ${hu('α')}`);
console.log(`hu('💩'): ${hu('💩')}`);

$ node t.mjs
hu('a'): false
hu('α'): false
hu('💩'): true

$ njs
interactive njs 0.7.4

v.<Tab> -> the properties and prototype methods of v.

>> import hu from 'has-unicode.mjs';
undefined
>> hu('💩');
false

So it seems an emoji is interpreted differently in njs. However in both njs and v8 non ascii characters return false.

xeioex · 2022-05-06T03:23:48Z

Seems like the function is likely bad or has a different meaning "has Unicode".

agree.

However, there is one case where node's interpretation is different from njs'.

As I said, the difference will be for the unicode characters above basic multilingual plane, for example emoji.

xeioex · 2022-05-06T03:29:46Z

@4141done

Thanks, committed in 99239ae.

4141done closed this May 3, 2022

xeioex reopened this May 5, 2022

xeioex closed this May 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Allow NJS compiled with pcre2 to handle surrogate code points in regular expressions #495

Allow NJS compiled with pcre2 to handle surrogate code points in regular expressions #495

Uh oh!

4141done commented May 3, 2022

Uh oh!

xeioex commented May 3, 2022

Uh oh!

4141done commented May 4, 2022 •

edited

Loading

Uh oh!

xeioex commented May 4, 2022

Uh oh!

xeioex commented May 5, 2022

Uh oh!

4141done commented May 6, 2022 •

edited

Loading

Uh oh!

xeioex commented May 6, 2022

Uh oh!

4141done commented May 6, 2022

Uh oh!

xeioex commented May 6, 2022

Uh oh!

xeioex commented May 6, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Allow NJS compiled with pcre2 to handle surrogate code points in regular expressions #495

Allow NJS compiled with pcre2 to handle surrogate code points in regular expressions #495

Uh oh!

Conversation

4141done commented May 3, 2022

What

Uh oh!

xeioex commented May 3, 2022

Uh oh!

4141done commented May 4, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xeioex commented May 4, 2022

Uh oh!

xeioex commented May 5, 2022

Uh oh!

4141done commented May 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test:

Uh oh!

xeioex commented May 6, 2022

Uh oh!

4141done commented May 6, 2022

Uh oh!

xeioex commented May 6, 2022

Uh oh!

xeioex commented May 6, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

4141done commented May 4, 2022 •

edited

Loading

4141done commented May 6, 2022 •

edited

Loading