Labels
10 points, good first issue, help wanted
Description
Multilingual Content Translation Task
NOTE: This issue is part of Contribute-to-Win. Please comment first to get assigned. Read the details here
Problem Statement
Cognee processes multilingual content but lacks unified language handling, making cross-language search and analysis difficult. Non-English content remains isolated and cannot be effectively compared or connected with English content in the knowledge graph.
Proposed Solution
Create a translation task that automatically translates non-English content to English while preserving original language metadata and providing translation quality metrics.
Use an existing translation service, such as Google Translate or one offered by a major cloud provider.
Requirements
Core Functionality
- Auto-detect the language of each document/chunk
- Translate non-English content to English using an LLM or a translation API
- Preserve original text alongside translated versions
- Store language metadata and translation confidence scores
- Support configurable translation providers (OpenAI, Google Translate, Azure)
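One way the "configurable translation providers" requirement could be approached is a small registry keyed by the provider name. The sketch below is only an illustration: `TranslationResult`, `TranslationProvider`, `UppercaseProvider` and `get_provider` are hypothetical names that do not exist in Cognee, and the offline stand-in provider takes the place of a real OpenAI, Google Translate or Azure backend.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass
class TranslationResult:
    """Hypothetical return type shared by all providers."""
    translated_text: str
    confidence_score: float  # provider-reported quality estimate in [0, 1]
    provider: str


class TranslationProvider(Protocol):
    """Hypothetical interface each backend would implement."""

    async def translate(
        self, text: str, source_language: str, target_language: str
    ) -> TranslationResult: ...


class UppercaseProvider:
    """Offline stand-in for a real backend; it only upper-cases the text."""

    async def translate(self, text, source_language, target_language):
        return TranslationResult(text.upper(), 1.0, "uppercase")


# Registry keyed by the string that would be passed as `translation_provider`.
_PROVIDERS: Dict[str, TranslationProvider] = {"uppercase": UppercaseProvider()}


def get_provider(name: str) -> TranslationProvider:
    """Look up a provider by name, failing loudly on unknown names."""
    try:
        return _PROVIDERS[name]
    except KeyError:
        raise ValueError(f"Unknown translation provider: {name!r}") from None


if __name__ == "__main__":
    result = asyncio.run(get_provider("uppercase").translate("hola", "es", "en"))
    print(result)
```

Real backends would register themselves under names such as "openai" or "azure" and wrap the corresponding client; that wiring is left out here because it depends on each service's SDK.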
Data Models (cognee/tasks/translation/models.py)
class TranslatedContent(DataPoint):
    """Represents translated content with quality metrics"""
    original_chunk_id: str
    original_text: str
    translated_text: str
    source_language: str
    target_language: str = "en"
    translation_provider: str
    confidence_score: float
    translation_timestamp: datetime
    metadata: dict = {"index_fields": ["source_language", "original_chunk_id"]}

class LanguageMetadata(DataPoint):
    """Language information for content"""
    content_id: str
    detected_language: str
    language_confidence: float
    requires_translation: bool
    character_count: int
    metadata: dict = {"index_fields": ["detected_language", "content_id"]}
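The issue does not say how `requires_translation` should be derived. A minimal sketch of one possible rule follows, using a plain dataclass (`LanguageInfo`, a hypothetical stand-in, not a real `DataPoint`) so that it runs without Cognee installed. The rule itself is an assumption: a chunk is flagged for translation only when detection confidence reaches the threshold and the detected language differs from the target.

```python
from dataclasses import dataclass


@dataclass
class LanguageInfo:
    """Plain stand-in mirroring the fields of the proposed LanguageMetadata."""
    content_id: str
    detected_language: str
    language_confidence: float
    requires_translation: bool
    character_count: int


def build_language_info(
    content_id: str,
    text: str,
    detected_language: str,
    language_confidence: float,
    target_language: str = "en",
    confidence_threshold: float = 0.8,
) -> LanguageInfo:
    """Combine a detector's output with the translation decision."""
    # Assumption: low-confidence detections are treated as unreliable
    # and left untranslated rather than risk a wrong source language.
    requires_translation = (
        language_confidence >= confidence_threshold
        and detected_language != target_language
    )
    return LanguageInfo(
        content_id=content_id,
        detected_language=detected_language,
        language_confidence=language_confidence,
        requires_translation=requires_translation,
        character_count=len(text),
    )


if __name__ == "__main__":
    print(build_language_info("chunk-1", "Hola mundo", "es", 0.95))
```

Whether low-confidence chunks should instead be translated anyway, or routed to a fallback detector, is a design decision the implementer would need to settle.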
Core Task (cognee/tasks/translation/translate_content.py)
async def translate_content(
    data_chunks: List[DocumentChunk],
    target_language: str = "en",
    translation_provider: str = "openai",
    confidence_threshold: float = 0.8
) -> List[DocumentChunk]:
    """
    Translate non-English content to target language

    Args:
        data_chunks: Document chunks to process
        target_language: Target language code (default: "en")
        translation_provider: Translation service to use
        confidence_threshold: Minimum confidence for language detection

    Returns:
        Enhanced chunks with translated content and metadata
    """
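The signature above leaves the body open. As a rough sketch of the control flow it implies (detect, decide, translate, keep the original), the example below uses a hypothetical `Chunk` dataclass in place of `DocumentChunk` and injects toy `detect` and `translate` callables instead of real services. Storing results in a `metadata` dict and skipping low-confidence chunks are assumptions, not part of the issue.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Tuple


@dataclass
class Chunk:
    """Minimal stand-in for DocumentChunk."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


Detector = Callable[[str], Tuple[str, float]]           # text -> (language, confidence)
Translator = Callable[[str, str, str], Awaitable[str]]  # (text, source, target) -> translation


async def translate_chunks(
    chunks: List[Chunk],
    detect: Detector,
    translate: Translator,
    target_language: str = "en",
    confidence_threshold: float = 0.8,
) -> List[Chunk]:
    """Translate chunks in place, preserving the original text in metadata."""
    for chunk in chunks:
        language, confidence = detect(chunk.text)
        chunk.metadata["detected_language"] = language
        chunk.metadata["language_confidence"] = confidence
        # Assumption: skip chunks already in the target language or
        # whose detection confidence is below the threshold.
        if language == target_language or confidence < confidence_threshold:
            continue
        chunk.metadata["original_text"] = chunk.text
        chunk.text = await translate(chunk.text, language, target_language)
    return chunks


def toy_detect(text: str) -> Tuple[str, float]:
    # Toy rule for the demo only: inverted punctuation implies Spanish.
    return ("es", 0.95) if "¿" in text or "¡" in text else ("en", 0.99)


async def toy_translate(text: str, source: str, target: str) -> str:
    # Placeholder that tags the text instead of calling a real service.
    return f"[{source}->{target}] {text}"


if __name__ == "__main__":
    demo = [Chunk("1", "¿Qué es esto?"), Chunk("2", "What is this?")]
    for c in asyncio.run(translate_chunks(demo, toy_detect, toy_translate)):
        print(c.id, c.text, c.metadata)
```

Passing the detector and translator as arguments keeps the loop testable without network access; a real task would presumably resolve them from `translation_provider` instead.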
Usage Examples
import cognee
from cognee import SearchType
from cognee.tasks.translation import translate_content

# Auto-translate during ingestion
await cognee.add(
    "document_spanish.pdf",
    auto_translate=True,
    target_language="en"
)

# Translate existing chunks
translated_chunks = await translate_content(
    chunks,
    translation_provider="openai",
    confidence_threshold=0.9
)

# Search across languages
results = await cognee.search(
    "machine learning concepts",
    search_type=SearchType.GRAPH_COMPLETION,
    include_translations=True
)