Add Japanese language support to the tokenizer (char_interval and alignment fixes) #13

@ornew

Description

The langextract library provides a valuable solution for extracting structured information from text using LLMs. However, the tokenizer's reliance on ASCII alphanumeric characters and symbols prevents it from correctly processing and aligning extractions from non-English languages such as Japanese. This limitation makes the library difficult to use for a significant portion of potential users.

For example, when Japanese text is processed, the char_interval property, which is crucial for grounding extractions in the source text, is not generated. This is a direct result of the tokenizer being unable to parse non-Latin characters.

Adding support for multi-byte character sets would significantly enhance the library's utility and make it a highly valuable tool for a global user base.

Related to an issue?

This is a new feature request.

Possible solutions and alternatives

The primary component requiring an update is the tokenizer.py module. The regular expressions that define tokens are currently limited to English-centric patterns ([A-Za-z]+, [0-9]+, [^A-Za-z0-9\s]+).

A potential solution would be to revise these regex patterns to be more inclusive, allowing them to correctly identify and tokenize multi-byte characters and phrases in languages like Japanese, Chinese, or Korean.

By making the tokenizer more language-agnostic, the library's core functionality, including the generation of char_interval and alignment_status for extractions, can be correctly applied to a wider range of languages.
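As a minimal sketch of the proposed revision, the English-centric alternatives above could be kept and extended with explicit Unicode ranges for Japanese script blocks. The names below (_TOKEN_PATTERN, tokenize) are illustrative only, not langextract's actual API:

```python
import re

# Illustrative token pattern, assuming the tokenizer is regex-based as
# described above. The original alternatives are extended with Unicode
# ranges for Hiragana (U+3040-309F), Katakana (U+30A0-30FF), and CJK
# Unified Ideographs (U+4E00-9FFF).
_TOKEN_PATTERN = re.compile(
    r"[A-Za-z]+"                # Latin words
    r"|[0-9]+"                  # digit runs
    r"|[\u3040-\u309F]+"        # Hiragana
    r"|[\u30A0-\u30FF]+"        # Katakana
    r"|[\u4E00-\u9FFF]+"        # CJK ideographs (kanji)
    # Any other non-space symbols; the Japanese ranges are excluded here so
    # a punctuation run cannot swallow adjacent Japanese characters.
    r"|[^A-Za-z0-9\s\u3040-\u30FF\u4E00-\u9FFF]+"
)

def tokenize(text):
    """Return (token, start, end) triples, from which char_interval follows."""
    return [(m.group(), m.start(), m.end())
            for m in _TOKEN_PATTERN.finditer(text)]

print(tokenize("日本語のtext処理、123"))
```

A cleaner alternative to hard-coded ranges would be the third-party regex package, which supports Unicode script properties such as \p{Hiragana} and \p{Han}; the stdlib re module does not.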

Priority and timeline considerations

This feature is considered a high priority for non-English-speaking users. It represents a fundamental step toward making langextract a truly versatile and globally applicable tool for information extraction.

Labels: enhancement (New feature or request)