Skip to content

Conversation

@4141done
Copy link

@4141done 4141done commented May 3, 2022

What

Since Javascript uses UTF-16 to encode strings, it is assumed that surrogate code points
will be valid regex input. Some libraries including common polyfills and
transpiler plugins use such regular expressions.

By default, pcre2 does not allow code points in these ranges, failing
with disallowed Unicode code point. In order to allow pcre2
to work with these code points, the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
option must be set using pcre2_set_compile_extra_options.

This change sets this configuration in pcre2_compile() if NJS is using pcre2.

…lar expressions

Since Javascript uses UTF-16 to encode strings, it is assumed that
[surrogate code points](https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/#24-surrogate-pairs)
will be valid regex input.  Some libraries including common polyfills and
transpiler plugins use such regular expressions.

By default, pcre2 does not allow code points in these ranges, failing
with `disallowed Unicode code point`.  In order to allow pcre2
to work with these code points, the `PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES`
option must be set using [pcre2_set_compile_extra_options](https://www.pcre.org/current/doc/html/pcre2_set_compile_extra_options.html).

This change sets this configuration in `pcre2_compile()` if NJS is using pcre2.
@4141done 4141done closed this May 3, 2022
@xeioex
Copy link
Contributor

xeioex commented May 3, 2022

Hi Javier,

Thanks for your patch.

njs stores strings using UTF-8 encoding. This means that surrogate pair are converted to UTF-8 encoded symbols at the moment of parsing strings (and not stored as surrogate UTF-16 pairs).

This is one of few incongruences with the ECMAScript standard that njs has. This manifests itself in a rare situation when Unicode characters above basic multilingual plane are used.

node 
>> String.fromCodePoint(0xD83D, 0xDCA9).length
2

njs
>> String.fromCodePoint(0xD83D, 0xDCA9).length
1

@4141done
Copy link
Author

4141done commented May 4, 2022

Thanks for responding @xeioex , I accidentally targeted the wrong branch with my patch which is why I closed it immediately. However maybe I can use it as an opportunity to ask a question.

It's good to know that njs deals in UTF-8 internally. This patch came from experiences I had trying to get third party modules to reliably transpile using webpack.

There are some common libraries that explicitly reference astral code points in the code. Here's an example and another.

For this I've found two workarounds:

  1. The code change in this PR
  2. Compiling pcre2 with the flag --enable-pcre2-16 before compiling njs

Through my experimentation I've found that using one of these strategies greatly increases the number of npm modules I can use without issues.

However, I wonder which approach is preferred? The pcre2 documentation doesn't go into detail about the effects of setting PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES or setting --enable-pcre2-16

@xeioex
Copy link
Contributor

xeioex commented May 4, 2022

However, I wonder which approach is preferred?

@4141done

Ok, let me see by myself.

The second option seems more brittle to me.

@xeioex
Copy link
Contributor

xeioex commented May 5, 2022

@4141done

Feel free to test the following patch

# HG changeset patch
# User Dmitry Volyntsev <[email protected]>
# Date 1651709682 25200
#      Wed May 04 17:14:42 2022 -0700
# Node ID d845a1b766aacfb4303a3f7c63705273f389139b
# Parent  80ed74a0e2058a5f611c1227283e72a168693e1b
Improved surrogate pairs support for PCRE2 backend.

Prodded by Javier Evans.

diff --git a/external/njs_regex.c b/external/njs_regex.c
--- a/external/njs_regex.c
+++ b/external/njs_regex.c
@@ -60,8 +60,23 @@ njs_regex_compile_ctx_t *
 njs_regex_compile_ctx_create(njs_regex_generic_ctx_t *ctx)
 {
 #ifdef NJS_HAVE_PCRE2
+    pcre2_compile_context  *cc;
 
-    return pcre2_compile_context_create(ctx);
+    cc = pcre2_compile_context_create(ctx);
+    if (njs_fast_path(cc != NULL)) {
+        /* Workaround for surrogate pairs in regular expressions
+         *
+         * This option is needed because njs unlike the standard ECMAScript
+         * stores and process strings in UTF-8 encoding.
+         * PCRE2 does not support surrogate pairs by default when it
+         * is compiled for UTF-8 only strings. But many polyfills
+         * and transpiler uses such surrogate pairs expressions.
+         */
+        pcre2_set_compile_extra_options(cc,
+                                        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES);
+    }
+
+    return cc;
 
 #else

@xeioex xeioex reopened this May 5, 2022
@4141done
Copy link
Author

4141done commented May 6, 2022

Thank you for looking in to this! I agree that the first approach seems better since it will allow anyone with a current version of NJS to have wider compatibility.

Test:

  1. Recompile pcre2 to be sure it's not compiled with UTF-16 support which could cause a false positive
05/05/2022 04:59:08 PM > pcre2test -16
** This version of PCRE2 was built without 16-bit support
  1. Compile NJS top of master without patch
NJS configuration summary:

 + using CC: "cc"
 + using CFLAGS: " -pipe -fPIC -fvisibility=hidden -O -W -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -fstrict-aliasing -Wstrict-overflow=5 -Wmissing-prototypes -Werror -g -O "

 + using PCRE library: -L/usr/local/lib -lpcre2-8
 + using readline library: -lreadline

 njs build dir: build
 njs CLI: build/njs

  1. Confirm errors:
>> /\u200d\ud800-/
Thrown:
SyntaxError: pcre_compile2("\u200d\ud800-") failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at "-" in shell:1

Attempt to import the above-linked lodash helper:

>> import hasUnicode from 'has-unicode.mjs';
Thrown:
SyntaxError: pcre_compile2("[\u200d\ud800-\udfff\u0300-\u036f\ufe20-\ufe2f\u20d0-\u20ff\u1ab0-\u1aff\u1dc0-\u1dff\ufe0e\ufe0f]") failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at "-\udfff\u0300-\u036f\ufe20-\ufe2f\u20d0-\u20ff\u1ab0-\u1aff\u1dc0-\u1dff\ufe0e\ufe0f]"
    at RegExp (native)
    at module (/Users/j.evans/code/njs/has-unicode.mjs:15)
  1. Recompile with patch
NJS configuration summary:

 + using CC: "cc"
 + using CFLAGS: " -pipe -fPIC -fvisibility=hidden -O -W -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -fstrict-aliasing -Wstrict-overflow=5 -Wmissing-prototypes -Werror -g -O "

 + using PCRE library: -L/usr/local/lib -lpcre2-8
 + using readline library: -lreadline

 njs build dir: build
 njs CLI: build/njs
  1. Rerun tests:
>> /\u200d\ud800-/
/\u200d\ud800-/

And the lodash script:

>> import hasUnicode from 'has-unicode.mjs';
undefined
>> hasUnicode('foo');
false
>> hasUnicode('💩');
false

Looks like it's working! The second test hasUnicode('💩'); seems to be returning a wrong result, but that seems separate from the issue of not being able to accept astral codepoints in regex.

Is it possible that due to the fact that njs works in UTF-8 and the regex's criteria for determining if a string contains unicode depends on the presence of a surrogate pair we're getting a false negative?

If so that is interesting and something that could open up npm package users to subtle bugs. I'll continue to look at it and see if there is anything else that could explain it.

@xeioex
Copy link
Contributor

xeioex commented May 6, 2022

@4141done

import hu from './has_unicode.mjs';

console.log(`hu('a'): ${hu('a')}`);
console.log(`hu('α'): ${hu('α')}`);
$ node t.mjs 
hu('a'): false
hu('α'): false

$ ../quickjs/qjs t.mjs 
hu('a'): false
hu('α'): false

$ ./build/njs t.mjs 
hu('a'): false
hu('α'): false

@4141done
Copy link
Author

4141done commented May 6, 2022

Ah that's interesting. Seems like the function is likely bad or has a different meaning "has Unicode".
However, there is one case where node's interpretation is different from njs'. I added one more test to your example:

import hu from './has-unicode.mjs';

console.log(`hu('a'): ${hu('a')}`);
console.log(`hu('α'): ${hu('α')}`);
console.log(`hu('💩'): ${hu('💩')}`);
$ node t.mjs
hu('a'): false
hu('α'): false
hu('💩'): true
$ njs
interactive njs 0.7.4

v.<Tab> -> the properties and prototype methods of v.

>> import hu from 'has-unicode.mjs';
undefined
>> hu('💩');
false

So it seems an emoji is interpreted differently in njs. However in both njs and v8 non ascii characters return false.

@xeioex
Copy link
Contributor

xeioex commented May 6, 2022

Seems like the function is likely bad or has a different meaning "has Unicode".

agree.

However, there is one case where node's interpretation is different from njs'.

As I said, the difference will be for the unicode characters above basic multilingual plane, for example emoji.

@xeioex
Copy link
Contributor

xeioex commented May 6, 2022

@4141done

Thanks, committed in 99239ae.

@xeioex xeioex closed this May 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants