
Conversation


@ExplodingCabbage ExplodingCabbage commented Feb 19, 2024

This eliminates all the logic about rejoining boundary splits. It was dumb; we were doing an initial split into tokens based on a Unicode-naive notion of "word characters", and then bandaging some words back together with a Unicode-aware regex, when we could instead just use the Unicode-aware regex to start with.
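As a rough sketch of the idea (this is a hypothetical pattern for illustration, not the exact regex the PR adds), tokenizing directly with a Unicode-aware regex might look like:

```javascript
// Hypothetical sketch of one-pass Unicode-aware tokenization; NOT the
// regex jsdiff actually uses. Words (letters, combining marks, digits)
// are matched whole up front, so nothing needs rejoining afterwards.
function tokenize(value) {
  // \r\n is tried first so a Windows newline stays a single token;
  // [^\S\r\n]+ groups runs of spaces/tabs without swallowing newlines.
  return value.match(/\r\n|\n|[\p{L}\p{M}\p{N}]+|[^\S\r\n]+|[^\s]/gu) || [];
}

tokenize('animá-los\r\n\r\n(');
// → ['animá', '-', 'los', '\r\n', '\r\n', '(']
```

Because `á` matches `\p{L}`, the accented word comes out as one token instead of being split at the accent.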

I actually wrote this intending it to be just a refactor that would make it easier to make further fixes to diffWords, but I think I inadvertently fixed two bugs along the way! Namely, those bugs are:

  • Bug on diff words with accent #311
  • a bug where, when tokenizing text containing Windows-style newlines, the carriage return and line feed characters would each get their own token instead of being grouped into a single "\r\n" token

The test failure below is what you get if you run the new test on master (with a removeEmpty call added, as there used to be); it demonstrates both behaviour changes:

      AssertionError: expected [ Array(35) ] to deeply equal [ Array(33) ]
      + expected - actual

         "\n"
         "\n"
         "\n"
         "  "
      -  "anim"
      -  "á-"
      +  "animá"
      +  "-"
         "los"
      -  "\r"
      -  "\n"
      -  "\r"
      -  "\n"
      +  "\r\n"
      +  "\r\n"
         "("
         "wibbly"
         " "
         "wobbly"

Resolves #311.

Previously the behaviour was kinda correct by coincidence; a \n without a \r was treated as a punctuation mark by the regex.