Description
The langextract library provides a valuable solution for extracting structured information from text using LLMs. However, the current tokenizer relies on ASCII-centric alphanumeric and symbol patterns, which prevents it from correctly tokenizing and aligning extractions from non-Latin-script languages such as Japanese. This limitation makes the library difficult to use for a significant portion of potential users.
For example, when Japanese text is processed, the char_interval property, which is crucial for grounding extractions in the source text, is not generated. This is a direct result of the tokenizer being unable to parse non-Latin characters.
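For reference, a minimal reproduction sketch of the behavior described above (the prompt, example data, and model_id below are placeholders and may need adjusting; char_interval and alignment_status are the documented grounding fields):

```python
import langextract as lx

# Placeholder few-shot example; the exact classes and wording are illustrative.
examples = [
    lx.data.ExampleData(
        text="患者は高血圧と診断された。",  # "The patient was diagnosed with hypertension."
        extractions=[
            lx.data.Extraction(extraction_class="diagnosis", extraction_text="高血圧"),
        ],
    )
]

result = lx.extract(
    text_or_documents="患者は2型糖尿病と診断された。",  # "... diagnosed with type 2 diabetes."
    prompt_description="Extract diagnoses mentioned in the text.",
    examples=examples,
    model_id="gemini-2.5-flash",  # assumed model choice; any supported model shows the issue
)

for extraction in result.extractions:
    # Expected: a populated char_interval grounding the extraction in the source text.
    # Observed: char_interval is missing because the tokenizer cannot parse the Japanese characters.
    print(extraction.extraction_text, extraction.char_interval, extraction.alignment_status)
```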
Adding support for multi-byte character sets would significantly enhance the library's utility and make it a highly valuable tool for a global user base.
Related to an issue?
This is a new feature request.
Possible solutions and alternatives
The primary component requiring an update is the tokenizer.py module. The regular expressions used to define tokens currently focus on English-centric patterns ([A-Za-z]+, [0-9]+, [^A-Za-z0-9\s]+).
A potential solution would be to revise these regex patterns to use Unicode-aware character classes, so that multi-byte characters and phrases in languages like Japanese, Chinese, or Korean are correctly identified and tokenized.
By making the tokenizer more language-agnostic, the library's core functionality, including the generation of char_interval and alignment_status for extractions, can be correctly applied to a wider range of languages.
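As a rough sketch of what a more Unicode-aware token pattern could look like, here is one option built on the third-party regex package; the per-character handling of CJK text and the specific grouping are assumptions for illustration, not the library's actual tokenizer:

```python
import regex  # third-party package (pip install regex); supports \p{...} script properties

# Illustrative token pattern: letters from any script are recognized, and CJK
# ideographs and kana are emitted one character per token so that char_interval
# alignment stays fine-grained. Latin letter runs and digit runs keep the
# run-based behavior of the current patterns.
TOKEN_PATTERN = regex.compile(
    r"[\p{Han}\p{Hiragana}\p{Katakana}]"   # one token per CJK ideograph or kana
    r"|[\p{L}\p{M}]+"                      # letter runs in any other script
    r"|\p{N}+"                             # digit runs
    r"|[^\p{L}\p{N}\s]+"                   # runs of other symbols/punctuation
)

text = "患者は2型糖尿病と診断された。"  # "The patient was diagnosed with type 2 diabetes."
print(TOKEN_PATTERN.findall(text))
# ['患', '者', 'は', '2', '型', '糖', '尿', '病', 'と', '診', '断', 'さ', 'れ', 'た', '。']
```

Splitting CJK text one character per token is only one possible choice; it keeps char_interval granular without adding a word-segmentation dependency, but a segmenter (or run-based CJK tokens) could also work depending on how alignment is expected to behave.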
Priority and timeline considerations
This feature is considered a high priority for non-English-speaking users. It represents a fundamental step toward making langextract a truly versatile and globally applicable tool for information extraction.