Auto translate task #1352

@Vasilije1990

Multilingual Content Translation Task

NOTE: This issue is part of Contribute-to-Win. Please comment first to get assigned. Read the details here

Problem Statement

Cognee processes multilingual content but lacks unified language handling, making cross-language search and analysis difficult. Non-English content remains isolated and cannot be effectively compared or connected with English content in the knowledge graph.

Proposed Solution

Create a translation task that automatically translates non-English content to English while preserving original language metadata and providing translation quality metrics.
Use an existing translation service, such as Google Translate or those offered by the major cloud providers.

Requirements

Core Functionality

  • Auto-detect the language of each document or chunk
  • Translate non-English content to English using an LLM or a translation API
  • Preserve original text alongside translated versions
  • Store language metadata and translation confidence scores
  • Support configurable translation providers (OpenAI, Google Translate, Azure)
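One way the configurable-provider requirement might be approached is a small name-to-function registry. The sketch below is only an illustration: `register_provider`, `get_provider`, the `TranslateFn` signature, and the `echo` placeholder provider are all hypothetical names, not part of Cognee, and a real provider would call OpenAI, Google Translate, or Azure instead of formatting a string.

```python
from typing import Callable, Dict

# Hypothetical provider signature: (text, source_language, target_language) -> translated text
TranslateFn = Callable[[str, str, str], str]

# Hypothetical registry mapping lowercase provider names to translation functions
_PROVIDERS: Dict[str, TranslateFn] = {}


def register_provider(name: str) -> Callable[[TranslateFn], TranslateFn]:
    """Decorator that registers a translation function under a case-insensitive name."""
    def decorator(fn: TranslateFn) -> TranslateFn:
        _PROVIDERS[name.lower()] = fn
        return fn
    return decorator


def get_provider(name: str) -> TranslateFn:
    """Look up a registered provider, failing loudly on unknown names."""
    key = name.lower()
    if key not in _PROVIDERS:
        available = ", ".join(sorted(_PROVIDERS)) or "none"
        raise ValueError(f"Unknown translation provider {name!r}; available: {available}")
    return _PROVIDERS[key]


@register_provider("echo")
def echo_provider(text: str, source_language: str, target_language: str) -> str:
    """Placeholder provider for local testing: tags the text instead of translating it."""
    return f"[{source_language}->{target_language}] {text}"
```

With this shape, the `translation_provider` string from the task signature would simply be passed to `get_provider`, and adding a new backend would not require touching the task itself.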

Data Models (cognee/tasks/translation/models.py)

from datetime import datetime

from cognee.infrastructure.engine import DataPoint

class TranslatedContent(DataPoint):
    """Represents translated content with quality metrics"""
    original_chunk_id: str
    original_text: str
    translated_text: str
    source_language: str
    target_language: str = "en"
    translation_provider: str
    confidence_score: float
    translation_timestamp: datetime
    metadata: dict = {"index_fields": ["source_language", "original_chunk_id"]}

class LanguageMetadata(DataPoint):
    """Language information for content"""
    content_id: str
    detected_language: str
    language_confidence: float
    requires_translation: bool
    character_count: int
    metadata: dict = {"index_fields": ["detected_language", "content_id"]}
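To illustrate how the `LanguageMetadata` fields could be filled in from a detection result, here is a minimal sketch. It uses a plain dataclass, `LanguageInfo`, as a hypothetical stand-in for the `DataPoint` model so that it runs without Cognee installed, and `build_language_info` is a hypothetical helper. The rule that low-confidence detections are not marked for translation is an assumption about how `confidence_threshold` should behave, not something the issue specifies.

```python
from dataclasses import dataclass


@dataclass
class LanguageInfo:
    """Hypothetical stand-in mirroring the fields of LanguageMetadata."""
    content_id: str
    detected_language: str
    language_confidence: float
    requires_translation: bool
    character_count: int


def build_language_info(
    content_id: str,
    text: str,
    detected_language: str,
    language_confidence: float,
    target_language: str = "en",
    confidence_threshold: float = 0.8,
) -> LanguageInfo:
    """Derive requires_translation and character_count from a detection result."""
    # Assumption: only translate when the detector is confident enough
    is_confident = language_confidence >= confidence_threshold
    differs_from_target = detected_language.lower() != target_language.lower()
    return LanguageInfo(
        content_id=content_id,
        detected_language=detected_language,
        language_confidence=language_confidence,
        requires_translation=is_confident and differs_from_target,
        character_count=len(text),
    )
```

An alternative design would translate low-confidence chunks anyway and record the uncertainty in `confidence_score`; which behavior is preferable is left open here.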

Core Task (cognee/tasks/translation/translate_content.py)

from typing import List

from cognee.modules.chunking.models.DocumentChunk import DocumentChunk

async def translate_content(
    data_chunks: List[DocumentChunk],
    target_language: str = "en",
    translation_provider: str = "openai",
    confidence_threshold: float = 0.8
) -> List[DocumentChunk]:
    """
    Translate non-English content to the target language.

    Args:
        data_chunks: Document chunks to process
        target_language: Target language code (default: "en")
        translation_provider: Translation service to use
        confidence_threshold: Minimum confidence for language detection

    Returns:
        Enhanced chunks with translated content and metadata
    """

Usage Examples

import cognee
from cognee.api.v1.search import SearchType
from cognee.tasks.translation import translate_content

# Auto-translate during ingestion (auto_translate and target_language are proposed parameters)
await cognee.add(
    "document_spanish.pdf",
    auto_translate=True,
    target_language="en"
)

# Translate existing chunks
translated_chunks = await translate_content(
    chunks,
    translation_provider="openai",
    confidence_threshold=0.9
)

# Search across languages (include_translations is a proposed parameter)
results = await cognee.search(
    "machine learning concepts",
    search_type=SearchType.GRAPH_COMPLETION,
    include_translations=True
)
