cli: include \w, \s and \d in Unicode data table generation
This was an oversight when porting the old generator shell script to
regex-cli. It hasn't been an issue because I don't think we've
generated data for a new release of Unicode with this new
infrastructure yet.

This was flagged by unit tests that failed because \d was no longer a
subset of \w.
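
For illustration, a check of that kind could look roughly like the sketch below. It assumes the tables are sorted slices of inclusive, non-overlapping `(char, char)` ranges. The `is_subset` helper and the tiny ASCII-only tables are hypothetical and are not taken from regex-syntax or its test suite.

```rust
// Hypothetical, ASCII-only stand-ins for generated tables. Real tables are
// much larger; these exist only to exercise the check.
const DIGITS: &[(char, char)] = &[('0', '9')];
const WORD: &[(char, char)] = &[('0', '9'), ('A', 'Z'), ('_', '_'), ('a', 'z')];
const WORD_WITHOUT_DIGITS: &[(char, char)] = &[('A', 'Z'), ('_', '_'), ('a', 'z')];

/// Returns true if every code point covered by `sub` is also covered by `sup`.
///
/// Assumes both tables are sorted by range start and that the ranges within
/// each table are inclusive and non-overlapping.
fn is_subset(sub: &[(char, char)], sup: &[(char, char)]) -> bool {
    let mut i = 0; // index of the current candidate range in `sup`
    for &(lo, hi) in sub {
        let end = hi as u32;
        let mut next = lo as u32; // first code point not yet shown to be covered
        loop {
            // Skip `sup` ranges that end before `next`. Since `sub` is sorted,
            // they cannot cover anything that remains.
            while i < sup.len() && (sup[i].1 as u32) < next {
                i += 1;
            }
            // Either `sup` is exhausted or there is a gap at `next`.
            if i == sup.len() || (sup[i].0 as u32) > next {
                return false;
            }
            // `sup[i]` covers `next` through `sup[i].1`.
            if sup[i].1 as u32 >= end {
                break;
            }
            next = sup[i].1 as u32 + 1;
            // A `char` range never includes surrogates, so step over them.
            if next == 0xD800 {
                next = 0xE000;
            }
        }
    }
    true
}
```

With these stand-in tables, `is_subset(DIGITS, WORD)` holds, while `is_subset(DIGITS, WORD_WITHOUT_DIGITS)` does not.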
BurntSushi committed Sep 29, 2024
commit dea2d344e3ed93abdb3f2b3ad17dc406e9d80cc9
17 changes: 17 additions & 0 deletions regex-cli/cmd/generate/unicode.rs
@@ -84,6 +84,23 @@ USAGE:
gen(d.join("sentence_break.rs"), &["sentence-break", &ucd, "--chars"])?;
gen(d.join("word_break.rs"), &["word-break", &ucd, "--chars"])?;

// These generate the \w, \d and \s Unicode-aware character classes for
// regex-syntax. \d and \s are technically part of the general category
// and boolean properties generated above. However, these are generated
// separately to make it possible to enable or disable them via Cargo
// features independently of whether all boolean properties or general
// categories are enabled or disabled. The crate ensures that only one copy
// is compiled.
gen(d.join("perl_word.rs"), &["perl-word", &ucd, "--chars"])?;
gen(
    d.join("perl_decimal.rs"),
    &["general-category", &ucd, "--chars", "--include", "decimalnumber"],
)?;
gen(
    d.join("perl_space.rs"),
    &["property-bool", &ucd, "--chars", "--include", "whitespace"],
)?;

// Data tables for regex-automata.
let d = out
    .join("regex-automata")