Description
The langextract library provides a valuable solution for extracting structured information from text using LLMs. However, the current tokenizer relies on ASCII-centric alphanumeric and symbol patterns, which prevents it from correctly tokenizing and aligning extractions from non-Latin-script languages such as Japanese. This limitation makes the library difficult to use for a significant portion of potential users.
For example, when Japanese text is processed, the char_interval property, which is crucial for grounding extractions in the source text, is not generated. This is a direct result of the tokenizer being unable to parse non-Latin characters.
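For reference, a minimal reproduction sketch of the behavior described above (the prompt, example data, and model_id below are placeholders and may need adjusting; char_interval and alignment_status are the documented grounding fields):

```python
import langextract as lx

# Placeholder few-shot example; the exact classes and wording are illustrative.
examples = [
    lx.data.ExampleData(
        text="患者は高血圧と診断された。",  # "The patient was diagnosed with hypertension."
        extractions=[
            lx.data.Extraction(extraction_class="diagnosis", extraction_text="高血圧"),
        ],
    )
]

result = lx.extract(
    text_or_documents="患者は2型糖尿病と診断された。",  # "... diagnosed with type 2 diabetes."
    prompt_description="Extract diagnoses mentioned in the text.",
    examples=examples,
    model_id="gemini-2.5-flash",  # assumed model choice; any supported model shows the issue
)

for extraction in result.extractions:
    # Expected: a populated char_interval grounding the extraction in the source text.
    # Observed: char_interval is missing because the tokenizer cannot parse the Japanese characters.
    print(extraction.extraction_text, extraction.char_interval, extraction.alignment_status)
```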
Adding support for multi-byte character sets would significantly enhance the library's utility and make it a highly valuable tool for a global user base.
Related to an issue?
This is a new feature request.
Possible solutions and alternatives
The primary component requiring an update is the tokenizer.py module. The regular expressions used to define tokens currently focus on English-centric patterns ([A-Za-z]+, [0-9]+, [^A-Za-z0-9\s]+).
A potential solution would be to revise these regex patterns to use Unicode-aware character classes, so that multi-byte characters and phrases in languages like Japanese, Chinese, or Korean are correctly identified and tokenized.
By making the tokenizer more language-agnostic, the library's core functionality, including the generation of char_interval and alignment_status for extractions, can be correctly applied to a wider range of languages.
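As a rough sketch of what a more Unicode-aware token pattern could look like, here is one option built on the third-party regex package; the per-character handling of CJK text and the specific grouping are assumptions for illustration, not the library's actual tokenizer:

```python
import regex  # third-party package (pip install regex); supports \p{...} script properties

# Illustrative token pattern: letters from any script are recognized, and CJK
# ideographs and kana are emitted one character per token so that char_interval
# alignment stays fine-grained. Latin letter runs and digit runs keep the
# run-based behavior of the current patterns.
TOKEN_PATTERN = regex.compile(
    r"[\p{Han}\p{Hiragana}\p{Katakana}]"   # one token per CJK ideograph or kana
    r"|[\p{L}\p{M}]+"                      # letter runs in any other script
    r"|\p{N}+"                             # digit runs
    r"|[^\p{L}\p{N}\s]+"                   # runs of other symbols/punctuation
)

text = "患者は2型糖尿病と診断された。"  # "The patient was diagnosed with type 2 diabetes."
print(TOKEN_PATTERN.findall(text))
# ['患', '者', 'は', '2', '型', '糖', '尿', '病', 'と', '診', '断', 'さ', 'れ', 'た', '。']
```

Splitting CJK text one character per token is only one possible choice; it keeps char_interval granular without adding a word-segmentation dependency, but a segmenter (or run-based CJK tokens) could also work depending on how alignment is expected to behave.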
Priority and timeline considerations
This feature is considered a high priority for non-English-speaking users. It represents a fundamental step toward making langextract a truly versatile and globally applicable tool for information extraction.