Upgrade to Unicode 14 (to support Vithkuqi) #877

@pemistahl

What version of regex are you using?

1.5.6

Describe the bug at a high level.

The letters of the Vithkuqi script, a script for writing the Albanian language, were added in Unicode 14.0. The corresponding Unicode block spans U+10570 to U+105BF. The regex \w+ does not match the letters in this block. In addition, case-insensitive regexes using (?i) do not match across case for Vithkuqi letters: a pattern written with the uppercase letter matches that letter but not its lowercase counterpart.
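To make the block range concrete, here is a minimal sketch of a plain code point check that does not depend on any Unicode tables. The helper name `in_vithkuqi_block` is hypothetical and is not part of regex or the standard library. It tests block membership only, so it also accepts the code points in the block that are unassigned.

```rust
/// Hypothetical helper: true if `c` lies in the Vithkuqi block
/// (U+10570..=U+105BF). This is a block-range check only; it does not
/// tell assigned letters apart from unassigned code points in the block.
fn in_vithkuqi_block(c: char) -> bool {
    ('\u{10570}'..='\u{105BF}').contains(&c)
}

fn main() {
    let upper = '\u{10570}'; // Vithkuqi Capital Letter A
    let lower = '\u{10597}'; // Vithkuqi Small Letter A

    println!("{}", in_vithkuqi_block(upper));
    println!("{}", in_vithkuqi_block(lower));
    println!("{}", in_vithkuqi_block('a'));
}
```

A check like this only helps confirm that an input contains Vithkuqi text. It is not a replacement for \w or (?i) support in the regex tables.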

What are the steps to reproduce the behavior?

use regex::Regex;

fn main() {
    let upper = "\u{10570}";          // Vithkuqi Capital Letter A
    let lower = upper.to_lowercase(); // Vithkuqi Small Letter A (U+10597)

    // r1: case-insensitive match on the capital letter.
    let r1 = Regex::new("(?i)^\u{10570}$").unwrap();
    // r2: one or more word characters.
    let r2 = Regex::new("^\\w+$").unwrap();

    println!("{}", r1.is_match(upper));
    println!("{}", r1.is_match(&lower));
    println!("{}", r2.is_match(upper));
    println!("{}", r2.is_match(&lower));
}

What is the actual behavior?

The actual output is:

true
false
false
false

What is the expected behavior?

The expected output is:

true
true
true
true
