
Conversation

0xideas (Contributor) commented on Nov 11, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced asynchronous data processing pipelines for code graphs and document chunks.
    • Added functions for extracting and managing graphs from documents and code.
    • Implemented a new FalkorDBAdapter class for enhanced database interactions.
  • Improvements

    • Enhanced type safety and clarity in various data models.
    • Streamlined error handling and logging mechanisms across multiple functions.
  • Bug Fixes

    • Resolved issues with data retrieval and processing in search functionalities.
  • Documentation

    • Updated README with clearer usage comments and examples.
    • Added new tests for document classification and chunk extraction functionalities.

coderabbitai bot (Contributor) commented on Nov 11, 2024

Walkthrough

The pull request introduces significant changes across multiple files, primarily focusing on enhancing the asynchronous processing of datasets related to code graphs. A new file code_graph_pipeline.py is added, which defines an asynchronous pipeline for dataset processing, incorporating exception handling and logging. Other modifications include the restructuring of various classes to inherit from DataPoint, updates to method signatures, and the introduction of new utility functions for managing graph data. Several files are deleted, indicating a shift in functionality and a move towards a more modular codebase.
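For orientation, here is a minimal sketch of the general shape such an asynchronous pipeline takes. The function and exception names are taken from the change summary below, but every body, parameter, and helper task shown here is an illustrative assumption rather than the actual implementation.

import asyncio
import logging

logger = logging.getLogger(__name__)

class PermissionDeniedException(Exception):
    """Raised when a user lacks permission on the requested dataset."""

def generate_dataset_name(dataset_name: str) -> str:
    # Normalize a user-supplied name into a storage-safe identifier (assumed behaviour).
    return dataset_name.strip().lower().replace(" ", "_").replace(".", "_")

async def process_dataset(dataset: str) -> None:
    # Placeholder for the real chunking / graph-extraction tasks.
    logger.info("processing %s", dataset)
    await asyncio.sleep(0)

async def run_pipeline(dataset_name: str) -> None:
    try:
        await process_dataset(generate_dataset_name(dataset_name))
    except PermissionDeniedException:
        logger.warning("permission denied for dataset %s", dataset_name)
        raise
    except Exception:
        logger.exception("pipeline failed for dataset %s", dataset_name)
        raise

async def code_graph_pipeline(dataset_names: list[str]) -> None:
    # Datasets are processed concurrently; a failure in one is logged and re-raised.
    await asyncio.gather(*(run_pipeline(name) for name in dataset_names))

# Example: asyncio.run(code_graph_pipeline(["my source repo"]))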

Changes

  • cognee/api/v1/cognify/code_graph_pipeline.py: Added functions code_graph_pipeline, run_pipeline, and generate_dataset_name; added exception class PermissionDeniedException.
  • cognee/api/v1/cognify/cognify_v2.py: Updated method signatures for cognify and run_cognify_pipeline, simplifying task handling.
  • cognee/api/v1/search/search_v2.py: Modified import statement for query_chunks to a new module path.
  • cognee/infrastructure/databases/graph/falkordb/adapter.py: Removed class FalcorDBAdapter and its methods.
  • cognee/infrastructure/databases/graph/get_graph_engine.py: Updated get_graph_engine to include new parameters and modified import statements.
  • cognee/infrastructure/databases/graph/graph_db_interface.py: Replaced abstract method graph with query.
  • cognee/infrastructure/databases/graph/neo4j_driver/adapter.py: Updated methods to use UUID for node identification and changed method signatures.
  • cognee/infrastructure/databases/graph/networkx/adapter.py: Added methods get_graph_data and query; updated existing methods to accept DataPoint objects and UUID.
  • cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py: Introduced new class FalkorDBAdapter with comprehensive graph database functionality.
  • cognee/infrastructure/databases/vector/__init__.py: Removed import for DataPoint.
  • cognee/infrastructure/databases/vector/config.py: Added attribute vector_db_port to VectorConfig.
  • cognee/infrastructure/databases/vector/create_vector_engine.py: Updated logic to handle the falkordb provider and added vector_db_port.
  • cognee/infrastructure/databases/vector/models/DataPoint.py: Removed class DataPoint.
  • cognee/modules/chunking/TextChunker.py: Updated constructor and chunk creation logic.
  • cognee/modules/data/operations/translate_text.py: Added new function translate_text for AWS Translate.
  • README.md: Enhanced comments in the "Basic Usage" section for clarity.
  • pyproject.toml: Updated lancedb version and added beartype dependency.
  • Multiple files in cognee/tasks: Various functions removed or updated, including classify_documents, check_permissions_on_documents, and others related to document processing.

Possibly related PRs

Suggested reviewers

  • Vasilije1990

🐇 In the code we weave, with threads of async delight,
A pipeline now flows, processing data just right.
With graphs and nodes, we hop and play,
Enhancing our tools, making work light as day!
So let’s celebrate this change, with a joyful cheer,
For the future of coding is bright and clear! 🌟


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

0xideas changed the base branch from feat/COG-184-add-falkordb to main on November 11, 2024, 12:48
0xideas force-pushed the test-beartype-in-cognify-pipeline branch from 5ae4302 to 7e130e3 on November 11, 2024, 12:48
0xideas changed the title from "Test beartype in cognify pipeline" to "[TEST] Test beartype in cognify pipeline" on Nov 11, 2024
coderabbitai bot (Contributor) left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

🛑 Comments failed to post (70)
cognee/modules/engine/utils/generate_node_name.py (1)

1-2: 🛠️ Refactor suggestion

Add input validation and comprehensive documentation

The function needs additional robustness and clarity improvements:

  1. Add input validation
  2. Add docstring with examples
  3. Consider handling more special characters

Apply this diff:

+from typing import Optional
+
+def generate_node_name(name: Optional[str]) -> str:
+    """Generate a standardized node name by converting to lowercase and replacing special characters.
+    
+    Args:
+        name: Input string to be converted into a node name.
+            If None or empty, returns an empty string.
+    
+    Returns:
+        A normalized string suitable for use as a node name.
+    
+    Examples:
+        >>> generate_node_name("Hello World")
+        'hello_world'
+        >>> generate_node_name("User's Data")
+        'users_data'
+    """
+    if not name:
+        return ""
     return name.lower().replace(" ", "_").replace("'", "")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from typing import Optional

def generate_node_name(name: Optional[str]) -> str:
    """Generate a standardized node name by converting to lowercase and replacing special characters.
    
    Args:
        name: Input string to be converted into a node name.
            If None or empty, returns an empty string.
    
    Returns:
        A normalized string suitable for use as a node name.
    
    Examples:
        >>> generate_node_name("Hello World")
        'hello_world'
        >>> generate_node_name("User's Data")
        'users_data'
    """
    if not name:
        return ""
    return name.lower().replace(" ", "_").replace("'", "")
cognee/modules/engine/utils/generate_node_id.py (2)

4-4: ⚠️ Potential issue

Fix return type and add input validation.

Several issues need attention:

  1. uuid5() returns a UUID object, but the function name suggests returning a string
  2. No validation for empty strings or None
 def generate_node_id(node_id: str) -> str:
+    if not node_id:
+        raise ValueError("node_id cannot be empty")
-    return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))
+    return str(uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", "")))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def generate_node_id(node_id: str) -> str:
    if not node_id:
        raise ValueError("node_id cannot be empty")
    return str(uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", "")))

3-3: 🛠️ Refactor suggestion

Add return type annotation and consider beartype integration.

Since this PR is focused on testing beartype, consider:

  1. Adding return type annotation to match the input type annotation style
  2. Adding beartype decorator for runtime type checking
+from beartype import beartype
 
+@beartype
-def generate_node_id(node_id: str) -> str:
+def generate_node_id(node_id: str) -> str:

Committable suggestion skipped: line range outside the PR's diff.
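Since the committable suggestion was skipped, an assembled sketch that combines both suggestions on this file (input validation plus the beartype decorator) is shown below; the uuid imports are assumed from the usage on line 4.

from uuid import uuid5, NAMESPACE_OID
from beartype import beartype

@beartype
def generate_node_id(node_id: str) -> str:
    if not node_id:
        raise ValueError("node_id cannot be empty")
    return str(uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", "")))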

cognee/tasks/chunks/__init__.py (1)

5-5: 💡 Codebase verification

Type hints present but beartype decorator missing

The function remove_disconnected_chunks has proper type hints (data_chunks: list[DocumentChunk] -> list[DocumentChunk]) but is missing the @beartype decorator for runtime type checking. To fully integrate with beartype as mentioned in the PR title, the decorator should be added.

  • Add from beartype import beartype import
  • Add @beartype decorator above the function definition
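A minimal sketch of those two additions, using a stand-in model so the snippet is self-contained (the real DocumentChunk import path is not shown in this comment):

from beartype import beartype
from pydantic import BaseModel

class DocumentChunk(BaseModel):
    # Minimal stand-in for cognee's DocumentChunk model, only so this sketch is self-contained.
    text: str = ""

@beartype
def remove_disconnected_chunks(data_chunks: list[DocumentChunk]) -> list[DocumentChunk]:
    # Placeholder body: the real filtering logic lives in the module under review.
    return data_chunks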
🔗 Analysis chain

Verify beartype integration for this module

Since the PR title mentions testing beartype in the cognify pipeline, we should verify if type hints and beartype decorators are properly implemented in the imported module.


Let me try a different approach to verify the beartype integration by checking the actual content of the file.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the imported module uses type hints and beartype decorators

# Check for type hints and beartype usage in the source module
echo "Checking remove_disconnected_chunks.py for type hints and beartype usage:"
rg -A 5 "from beartype" "cognee/tasks/chunks/remove_disconnected_chunks.py"
rg -A 5 "@beartype" "cognee/tasks/chunks/remove_disconnected_chunks.py"

# Check function signature for type hints
echo -e "\nChecking function signature:"
ast-grep --pattern 'def remove_disconnected_chunks($$$): $$$'

Length of output: 439


Script:

#!/bin/bash
# Check the complete content of remove_disconnected_chunks.py
echo "Content of remove_disconnected_chunks.py:"
cat cognee/tasks/chunks/remove_disconnected_chunks.py

# Also check if there are any type hints in general
echo -e "\nChecking for any type annotations:"
rg ":" cognee/tasks/chunks/remove_disconnected_chunks.py

Length of output: 1605

cognee/modules/engine/models/EntityType.py (1)

4-11: ⚠️ Potential issue

Add beartype decorator to align with PR objective

The PR objective mentions testing beartype, but the decorator is missing from this class.

Apply this diff:

+from beartype import beartype
+
+@beartype
 class EntityType(DataPoint):

Committable suggestion skipped: line range outside the PR's diff.

cognee/modules/engine/models/Entity.py (1)

5-9: 🛠️ Refactor suggestion

Add type validation using beartype

Given that this PR's objective is to test beartype in the cognify pipeline, consider decorating the class with @beartype to enforce runtime type checking for the attributes.

+from beartype import beartype

+@beartype
 class Entity(DataPoint):
     name: str
     is_a: EntityType
     description: str
     mentioned_in: DocumentChunk
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from beartype import beartype

@beartype
class Entity(DataPoint):
    name: str
    is_a: EntityType
    description: str
    mentioned_in: DocumentChunk
cognee/modules/data/processing/document_types/AudioDocument.py (1)

1-15: 🛠️ Refactor suggestion

Add type hints to support beartype testing

Since this PR is focused on testing beartype integration, the class should include proper type hints:

from typing import Iterator, Callable
from uuid import UUID

class AudioDocument(Document):
    type: str = "audio"
+   raw_data_location: str
 
-   def read(self, chunk_size: int):
+   def read(self, chunk_size: int) -> Iterator[str]:
        # Transcribe the audio file
        result = get_llm_client().create_transcript(self.raw_data_location)
        text = result.text

-       chunker = TextChunker(self, chunk_size = chunk_size, get_text = lambda: text)
+       chunker = TextChunker(self, chunk_size=chunk_size, get_text=lambda: str(text))

        yield from chunker.read()

Committable suggestion skipped: line range outside the PR's diff.

cognee/tasks/documents/extract_chunks_from_documents.py (1)

1-2: 🛠️ Refactor suggestion

Consider using standard typing module instead of beartype.typing

While beartype is great for runtime type checking, it's recommended to use the standard typing module for type hints. This ensures better compatibility and follows Python's standard practices.

-from beartype.typing import AsyncGenerator
+from typing import AsyncGenerator
 from beartype import beartype
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from typing import AsyncGenerator
from beartype import beartype
cognee/infrastructure/databases/vector/pgvector/serialize_data.py (1)

4-15: 💡 Codebase verification

Add type hints and beartype decorator to serialize_data function

The function serialize_data in cognee/infrastructure/databases/vector/pgvector/serialize_data.py lacks type hints and the @beartype decorator, while other parts of the codebase consistently use beartype for type checking. You should:

  • Add type hints for the input parameter and return type
  • Add the @beartype decorator
  • Create corresponding test cases

Example locations where beartype is used:

  • cognee/tasks/documents/classify_documents.py
  • cognee/tasks/storage/add_data_points.py
  • cognee/tasks/summarization/summarize_text.py
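As a hedged illustration only (the real body of serialize_data is not shown in this comment), a type-hinted, beartype-decorated version might look roughly like this, assuming the function's job is to make values JSON/vector-store friendly:

from datetime import datetime
from typing import Any
from uuid import UUID

from beartype import beartype

@beartype
def serialize_data(data: Any) -> Any:
    """Recursively convert values that cannot be stored directly (assumed purpose)."""
    if isinstance(data, UUID):
        return str(data)
    if isinstance(data, datetime):
        return data.isoformat()
    if isinstance(data, dict):
        return {key: serialize_data(value) for key, value in data.items()}
    if isinstance(data, list):
        return [serialize_data(item) for item in data]
    return data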
🔗 Analysis chain

Verify beartype test coverage

Since this PR is focused on testing beartype, ensure this serialization function is covered in the type checking tests.


Let me check the content of the configuration files and look for any existing test files that might be related to serialization.


Let me check for the actual implementation and test files related to serialization in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for existing beartype tests and configurations
rg -l "beartype" tests/
rg -l "serialize_data.*beartype" tests/

# Look for type checking configurations
fd "py.typed|mypy.ini|pytest.ini"

# Check for related test files
fd "test_serialize" tests/

Length of output: 148


Script:

#!/bin/bash
# Check configuration files content
echo "=== mypy.ini ==="
cat mypy.ini
echo -e "\n=== pytest.ini ==="
cat pytest.ini

# Search for any test files that might contain serialization tests
rg -l "serialize" tests/

# Look for existing type checking patterns in test files
rg "from beartype|@beartype" tests/ -A 2

# Check if the serialize_data function is imported or used in tests
rg "serialize_data" tests/

Length of output: 2469


Script:

#!/bin/bash
# Find all Python files that might contain serialization-related code
fd -e py . | xargs rg -l "serialize"

# Look for test files specifically
fd -e py . tests/ | xargs rg -l "test.*serialize"

# Check for type hints in the serialize_data function
ast-grep --pattern 'def serialize_data($$$): $$$'

# Look for beartype usage in the codebase
rg "from beartype" -A 2

Length of output: 3123

cognee/tasks/graph/extract_graph_from_code.py (2)

13-16: 🛠️ Refactor suggestion

Consider parallel storage operations and add error handling.

The storage operations are performed sequentially and lack error handling. Consider using concurrent storage and proper error management.

-    for (chunk_index, chunk) in enumerate(data_chunks):
-        chunk_graph = chunk_graphs[chunk_index]
-        await add_data_points(chunk_graph.nodes)
+    try:
+        # Process storage operations concurrently
+        await asyncio.gather(
+            *[add_data_points(chunk_graphs[i].nodes) for i in range(len(data_chunks))]
+        )
+    except Exception as e:
+        raise Exception(f"Failed to store graph data points: {str(e)}") from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    try:
        # Process storage operations concurrently
        await asyncio.gather(
            *[add_data_points(chunk_graphs[i].nodes) for i in range(len(data_chunks))]
        )
    except Exception as e:
        raise Exception(f"Failed to store graph data points: {str(e)}") from e

9-11: ⚠️ Potential issue

Add error handling for graph extraction.

The concurrent graph extraction could fail silently. Consider adding error handling and logging.

-    chunk_graphs = await asyncio.gather(
-        *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks]
-    )
+    try:
+        chunk_graphs = await asyncio.gather(
+            *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks],
+            return_exceptions=True
+        )
+        # Handle any exceptions from gather
+        for result in chunk_graphs:
+            if isinstance(result, Exception):
+                raise result
+    except Exception as e:
+        # Log the error with context
+        raise Exception(f"Failed to extract graphs: {str(e)}") from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    try:
        chunk_graphs = await asyncio.gather(
            *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks],
            return_exceptions=True
        )
        # Handle any exceptions from gather
        for result in chunk_graphs:
            if isinstance(result, Exception):
                raise result
    except Exception as e:
        # Log the error with context
        raise Exception(f"Failed to extract graphs: {str(e)}") from e
cognee/infrastructure/engine/models/DataPoint.py (2)

20-24: 🛠️ Refactor suggestion

Improve method clarity and type safety

The method could benefit from:

  1. Type hints for the return value
  2. Better error handling
  3. More explicit validation
-    def get_embeddable_data(self):
+    def get_embeddable_data(self) -> Optional[str]:
+        """Retrieve the value of the first indexed field.
+
+        Returns:
+            Optional[str]: The value of the first indexed field if it exists, None otherwise.
+        """
+        if not self._metadata or not self._metadata["index_fields"]:
+            return None
+
+        field_name = self._metadata["index_fields"][0]
+        return getattr(self, field_name, None)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def get_embeddable_data(self) -> Optional[str]:
        """Retrieve the value of the first indexed field.

        Returns:
            Optional[str]: The value of the first indexed field if it exists, None otherwise.
        """
        if not self._metadata or not self._metadata["index_fields"]:
            return None

        field_name = self._metadata["index_fields"][0]
        return getattr(self, field_name, None)

10-15: ⚠️ Potential issue

Fix mutable default value and datetime initialization

There are several issues with the current implementation:

  1. Using a mutable dictionary as a default value for _metadata is dangerous as it will be shared across instances
  2. The updated_at default should be a callable to ensure a fresh timestamp for each instance

Apply this fix:

 class DataPoint(BaseModel):
     id: UUID = Field(default_factory=uuid4)
-    updated_at: Optional[datetime] = datetime.now(timezone.utc)
+    updated_at: Optional[datetime] = Field(default_factory=lambda: datetime.now(timezone.utc))
-    _metadata: Optional[MetaData] = {
-        "index_fields": []
-    }
+    _metadata: Optional[MetaData] = Field(
+        default_factory=lambda: {"index_fields": []}
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

class DataPoint(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    updated_at: Optional[datetime] = Field(default_factory=lambda: datetime.now(timezone.utc))
    _metadata: Optional[MetaData] = Field(
        default_factory=lambda: {"index_fields": []}
    )
cognee/tasks/documents/classify_documents.py (2)

6-16: ⚠️ Potential issue

Add input validation and improve extension handling.

The current implementation has several potential issues:

  1. No validation for empty input list
  2. Case-sensitive extension matching
  3. Limited extension support (e.g., no mp3/wav for audio)

Consider adding these improvements:

 @beartype
 def classify_documents(data_documents: list[Data]) -> list[Document]:
+    if not data_documents:
+        return []
+
+    DOCUMENT_TYPES = {
+        # PDF documents
+        "pdf": PdfDocument,
+        # Audio documents
+        "mp3": AudioDocument,
+        "wav": AudioDocument,
+        "audio": AudioDocument,
+        # Image documents
+        "png": ImageDocument,
+        "jpg": ImageDocument,
+        "jpeg": ImageDocument,
+        "image": ImageDocument,
+    }
+
+    def create_document(data_item: Data) -> Document:
+        doc_class = DOCUMENT_TYPES.get(data_item.extension.lower(), TextDocument)
+        return doc_class(
+            id=data_item.id,
+            name=f"{data_item.name}.{data_item.extension}",
+            raw_data_location=data_item.raw_data_location
+        )
+
+    documents = [create_document(data_item) for data_item in data_documents]
     return documents
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

@beartype
def classify_documents(data_documents: list[Data]) -> list[Document]:
    if not data_documents:
        return []

    DOCUMENT_TYPES = {
        # PDF documents
        "pdf": PdfDocument,
        # Audio documents
        "mp3": AudioDocument,
        "wav": AudioDocument,
        "audio": AudioDocument,
        # Image documents
        "png": ImageDocument,
        "jpg": ImageDocument,
        "jpeg": ImageDocument,
        "image": ImageDocument,
    }

    def create_document(data_item: Data) -> Document:
        doc_class = DOCUMENT_TYPES.get(data_item.extension.lower(), TextDocument)
        return doc_class(
            id=data_item.id,
            name=f"{data_item.name}.{data_item.extension}",
            raw_data_location=data_item.raw_data_location
        )

    documents = [create_document(data_item) for data_item in data_documents]
    return documents

8-14: 🛠️ Refactor suggestion

Refactor complex list comprehension for better readability and maintainability.

The current implementation using chained conditionals in a list comprehension is hard to read and maintain. Consider using a mapping dictionary and separate function for better organization.

-    documents = [
-        PdfDocument(id = data_item.id, name=f"{data_item.name}.{data_item.extension}", raw_data_location=data_item.raw_data_location) if data_item.extension == "pdf" else
-        AudioDocument(id = data_item.id, name=f"{data_item.name}.{data_item.extension}", raw_data_location=data_item.raw_data_location) if data_item.extension == "audio" else
-        ImageDocument(id = data_item.id, name=f"{data_item.name}.{data_item.extension}", raw_data_location=data_item.raw_data_location) if data_item.extension == "image" else
-        TextDocument(id = data_item.id, name=f"{data_item.name}.{data_item.extension}", raw_data_location=data_item.raw_data_location)
-        for data_item in data_documents
-    ]
+    DOCUMENT_TYPES = {
+        "pdf": PdfDocument,
+        "audio": AudioDocument,
+        "image": ImageDocument
+    }
+    
+    def create_document(data_item: Data) -> Document:
+        doc_class = DOCUMENT_TYPES.get(data_item.extension.lower(), TextDocument)
+        return doc_class(
+            id=data_item.id,
+            name=f"{data_item.name}.{data_item.extension}",
+            raw_data_location=data_item.raw_data_location
+        )
+    
+    documents = [create_document(data_item) for data_item in data_documents]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    DOCUMENT_TYPES = {
        "pdf": PdfDocument,
        "audio": AudioDocument,
        "image": ImageDocument
    }
    
    def create_document(data_item: Data) -> Document:
        doc_class = DOCUMENT_TYPES.get(data_item.extension.lower(), TextDocument)
        return doc_class(
            id=data_item.id,
            name=f"{data_item.name}.{data_item.extension}",
            raw_data_location=data_item.raw_data_location
        )
    
    documents = [create_document(data_item) for data_item in data_documents]
cognee/modules/graph/utils/get_model_instance_from_graph.py (2)

6-10: 🛠️ Refactor suggestion

Add input validation for robustness.

The node mapping initialization lacks crucial validation checks:

  1. Validation for empty input lists
  2. Handling of duplicate node IDs
  3. Verification of entity_id existence in nodes

Consider adding these validations:

 def get_model_instance_from_graph(nodes: list[DataPoint], edges: list, entity_id: str):
+    if not nodes:
+        raise ValueError("Nodes list cannot be empty")
+    if not entity_id:
+        raise ValueError("Entity ID cannot be empty")
+    if entity_id not in {node.id for node in nodes}:
+        raise ValueError(f"Entity ID '{entity_id}' not found in nodes")
+
     node_map = {}
+    seen_ids = set()
     for node in nodes:
+        if node.id in seen_ids:
+            raise ValueError(f"Duplicate node ID found: {node.id}")
+        seen_ids.add(node.id)
         node_map[node.id] = node
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def get_model_instance_from_graph(nodes: list[DataPoint], edges: list, entity_id: str):
    if not nodes:
        raise ValueError("Nodes list cannot be empty")
    if not entity_id:
        raise ValueError("Entity ID cannot be empty")
    if entity_id not in {node.id for node in nodes}:
        raise ValueError(f"Entity ID '{entity_id}' not found in nodes")

    node_map = {}
    seen_ids = set()
    for node in nodes:
        if node.id in seen_ids:
            raise ValueError(f"Duplicate node ID found: {node.id}")
        seen_ids.add(node.id)
        node_map[node.id] = node

12-28: ⚠️ Potential issue

Add edge validation and error handling.

The edge processing lacks several important safety checks:

  1. Edge format validation
  2. Safe array access
  3. Error handling for model creation

Consider implementing these safety improvements:

 for edge in edges:
+    if not isinstance(edge, (list, tuple)) or len(edge) < 3:
+        raise ValueError(f"Invalid edge format: {edge}. Expected (source, target, label, [properties])")
+
+    try:
         source_node = node_map[edge[0]]
         target_node = node_map[edge[1]]
+    except KeyError as e:
+        raise ValueError(f"Edge references non-existent node: {e}")
+
         edge_label = edge[2]
         edge_properties = edge[3] if len(edge) == 4 else {}
         edge_metadata = edge_properties.get("metadata", {})
         edge_type = edge_metadata.get("type")

+        try:
             if edge_type == "list":
                 NewModel = copy_model(type(source_node), { edge_label: (list[type(target_node)], PydanticUndefined) })
                 node_map[edge[0]] = NewModel(**source_node.model_dump(), **{ edge_label: [target_node] })
             else:
                 NewModel = copy_model(type(source_node), { edge_label: (type(target_node), PydanticUndefined) })
                 node_map[edge[0]] = NewModel(**source_node.model_dump(), **{ edge_label: target_node })
+        except Exception as e:
+            raise ValueError(f"Failed to create model for edge {edge}: {str(e)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    for edge in edges:
        if not isinstance(edge, (list, tuple)) or len(edge) < 3:
            raise ValueError(f"Invalid edge format: {edge}. Expected (source, target, label, [properties])")

        try:
            source_node = node_map[edge[0]]
            target_node = node_map[edge[1]]
        except KeyError as e:
            raise ValueError(f"Edge references non-existent node: {e}")

        edge_label = edge[2]
        edge_properties = edge[3] if len(edge) == 4 else {}
        edge_metadata = edge_properties.get("metadata", {})
        edge_type = edge_metadata.get("type")

        try:
            if edge_type == "list":
                NewModel = copy_model(type(source_node), { edge_label: (list[type(target_node)], PydanticUndefined) })
                node_map[edge[0]] = NewModel(**source_node.model_dump(), **{ edge_label: [target_node] })
            else:
                NewModel = copy_model(type(source_node), { edge_label: (type(target_node), PydanticUndefined) })
                node_map[edge[0]] = NewModel(**source_node.model_dump(), **{ edge_label: target_node })
        except Exception as e:
            raise ValueError(f"Failed to create model for edge {edge}: {str(e)}")
cognee/infrastructure/databases/graph/graph_db_interface.py (2)

6-6: ⚠️ Potential issue

Add return type annotation for beartype testing

Since this PR is focused on testing beartype, the query method should specify its return type annotation to ensure proper runtime type checking.

Apply this diff to add return type annotation:

-    async def query(self, query: str, params: dict):
+    async def query(self, query: str, params: dict[str, Any]) -> Any:

Consider being more specific about the return type based on your query system's actual return type (e.g., dict, list, etc.).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def query(self, query: str, params: dict[str, Any]) -> Any:

6-6: 💡 Codebase verification

Implementation issues found in query method across adapters

  • Neo4j adapter has implemented the query method correctly with proper async/await and parameter handling
  • NetworkX adapter has an empty implementation (just pass)
  • FalkorDB adapter appears to be missing the query method implementation entirely
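For illustration, one way an adapter could satisfy the new query contract is sketched below. This is an assumption, not the repository's actual FalkorDBAdapter, and it presumes the falkordb Python client (FalkorDB, select_graph, Graph.query) is available:

import asyncio
from falkordb import FalkorDB

class FalkorDBAdapterSketch:
    def __init__(self, database_url: str, database_port: int, graph_name: str = "cognee_graph"):
        self.driver = FalkorDB(host=database_url, port=database_port)
        self.graph_name = graph_name

    async def query(self, query: str, params: dict):
        graph = self.driver.select_graph(self.graph_name)
        # The falkordb client is synchronous, so run the call off the event loop.
        result = await asyncio.to_thread(graph.query, query, params)
        return result.result_set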
🔗 Analysis chain

Verify implementations of the new query method

The interface change from graph() to query() is significant and requires verification of all implementing classes.


Let me check the implementations to verify if they have properly implemented the new query method.


Let me check the specific implementations of the query method in these adapters.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Find all potential implementations of GraphDBInterface
echo "Searching for implementations of GraphDBInterface..."
rg -l "class.*\(.*GraphDBInterface.*\)" 

# Find all potential usages of the old graph() method that need updating
echo "Checking for old graph() method usage..."
rg "\.graph\(\)" 

# Find current implementations of query method
echo "Checking current query method implementations..."
ast-grep --pattern 'class $CLASS {
  $$$
  async def query(query: str, params: dict)
  $$$
}'

Length of output: 656


Script:

#!/bin/bash
# Check Neo4j adapter implementation
echo "=== Neo4j Adapter Implementation ==="
rg -A 10 "class.*Neo4jAdapter" cognee/infrastructure/databases/graph/neo4j_driver/adapter.py

# Check FalkorDB adapter implementation
echo "=== FalkorDB Adapter Implementation ==="
rg -A 10 "class.*FalkorDBAdapter" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

# Check NetworkX adapter implementation
echo "=== NetworkX Adapter Implementation ==="
rg -A 10 "class.*NetworkXAdapter" cognee/infrastructure/databases/graph/networkx/adapter.py

Length of output: 1621


Script:

#!/bin/bash
# Search for query method implementations in each adapter
echo "=== Neo4j query implementation ==="
rg -B 2 -A 10 "async def query" cognee/infrastructure/databases/graph/neo4j_driver/adapter.py

echo "=== FalkorDB query implementation ==="
rg -B 2 -A 10 "async def query" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

echo "=== NetworkX query implementation ==="
rg -B 2 -A 10 "async def query" cognee/infrastructure/databases/graph/networkx/adapter.py

# Also check for any remaining graph() method implementations
echo "=== Checking for graph() method implementations ==="
rg -B 2 -A 5 "async def graph" cognee/infrastructure/databases/graph/

Length of output: 1532

cognee/infrastructure/databases/graph/get_graph_engine.py (1)

23-33: ⚠️ Potential issue

Add error handling for FalkorDB initialization

The error handling for the FalkorDB provider has been removed. This could lead to uncaught exceptions if there are issues with:

  • Database connection
  • Embedding engine initialization
  • Module imports

Add error handling similar to the neo4j implementation:

    elif config.graph_database_provider == "falkordb":
+       try:
            from cognee.infrastructure.databases.vector.embeddings import get_embedding_engine
            from cognee.infrastructure.databases.hybrid.falkordb.FalkorDBAdapter import FalkorDBAdapter

            embedding_engine = get_embedding_engine()

            return FalkorDBAdapter(
                database_url = config.graph_database_url,
                database_port = config.graph_database_port,
                embedding_engine = embedding_engine,
            )
+       except:
+           pass

Committable suggestion skipped: line range outside the PR's diff.

cognee/modules/storage/utils/__init__.py (3)

1-7: 🛠️ Refactor suggestion

Move create_model import to the top of the file

The create_model import should be grouped with other imports at the top of the file to maintain consistent import organization and prevent potential circular import issues.

 import json
 from uuid import UUID
 from datetime import datetime
 from pydantic_core import PydanticUndefined
+from pydantic import create_model
 
 from cognee.infrastructure.engine import DataPoint
 
-from pydantic import create_model

Also applies to: 18-18


34-46: 🛠️ Refactor suggestion

Improve type safety and simplify conditions

The function needs better type hints and could benefit from simplified condition checks.

-def get_own_properties(data_point: DataPoint):
+def get_own_properties(data_point: DataPoint) -> dict[str, any]:
+    """Extract non-nested properties from a DataPoint instance.
+
+    Args:
+        data_point: The DataPoint instance to extract properties from
+
+    Returns:
+        A dictionary containing only the primitive field values
+    """
     properties = {}
+    
+    if not hasattr(data_point, '__iter__'):
+        raise TypeError("data_point must be iterable")

     for field_name, field_value in data_point:
-        if field_name == "_metadata" \
-            or isinstance(field_value, dict) \
-            or isinstance(field_value, DataPoint) \
-            or (isinstance(field_value, list) and isinstance(field_value[0], DataPoint)):
+        # Skip metadata, nested structures, and lists of DataPoints
+        if (field_name == "_metadata" or
+            isinstance(field_value, (dict, DataPoint)) or
+            (isinstance(field_value, list) and field_value and isinstance(field_value[0], DataPoint))):
             continue

         properties[field_name] = field_value

     return properties

Also, consider adding validation for empty lists to avoid potential IndexError when checking field_value[0].

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def get_own_properties(data_point: DataPoint) -> dict[str, any]:
    """Extract non-nested properties from a DataPoint instance.

    Args:
        data_point: The DataPoint instance to extract properties from

    Returns:
        A dictionary containing only the primitive field values
    """
    properties = {}
    
    if not hasattr(data_point, '__iter__'):
        raise TypeError("data_point must be iterable")

    for field_name, field_value in data_point:
        # Skip metadata, nested structures, and lists of DataPoints
        if (field_name == "_metadata" or
            isinstance(field_value, (dict, DataPoint)) or
            (isinstance(field_value, list) and field_value and isinstance(field_value[0], DataPoint))):
            continue

        properties[field_name] = field_value

    return properties

1-46: 💡 Codebase verification

Add type hints and beartype decorators for runtime type checking

The verification reveals that:

  • The file lacks @beartype decorators while other files in the codebase consistently use them
  • While basic type hints exist (DataPoint, dict, list), they could be more specific:
    • include_fields should specify key and value types
    • exclude_fields should use list[str]
    • JSONEncoder.default method lacks type hints completely

Suggested changes:

  • Add from beartype import beartype
  • Add @beartype decorators to all functions
  • Enhance type hints to be more specific
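A hedged sketch of what those annotations could look like follows; the exact existing signatures and the encoder's behaviour are not visible in this comment, so the bodies below are assumptions:

import json
from datetime import datetime
from typing import Any, Optional, Tuple, Type
from uuid import UUID

from beartype import beartype

class JSONEncoder(json.JSONEncoder):
    def default(self, obj: Any) -> Any:
        # Assumed purpose: UUIDs and datetimes are not JSON-serializable by default.
        if isinstance(obj, UUID):
            return str(obj)
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)

@beartype
def copy_model(model: Type,
               include_fields: Optional[dict[str, Tuple[type, Any]]] = None,
               exclude_fields: Optional[list[str]] = None) -> Type:
    # Signature sketch only; the real implementation builds a new pydantic model.
    raise NotImplementedError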
🔗 Analysis chain

Verify beartype compatibility

Given that this PR is focused on testing beartype in the cognify pipeline, we should verify that all functions have proper type hints for beartype validation.


The initial results show beartype is used in several files, but we need to see the actual usage patterns and compare with DataPoint to make a proper assessment about type hints in this utility file.


We need to verify the type hints in the utility functions and compare them with beartype usage patterns in other files to make a proper assessment.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if other files in the codebase use beartype decorators
rg -l "from beartype import" .

# Check if there's a beartype configuration file
fd "beartype.*\.toml|beartype.*\.ini|beartype.*\.json|beartype.*\.yaml|beartype.*\.yml"

# Look for any existing type validation decorators
rg -l "@(beartype|typing|validate|typecheck)"

Length of output: 644


Script:

#!/bin/bash
# Let's check how beartype is used in other files for comparison
rg -A 5 "from beartype import" .

# Let's also check the DataPoint class to understand its type hints
ast-grep --pattern 'class DataPoint'

Length of output: 2770


Script:

#!/bin/bash
# Let's examine the type hints in the current file
rg -A 2 "def \w+\(" cognee/modules/storage/utils/__init__.py

# Let's also check how DataPoint is used in other files with beartype
rg -B 2 -A 5 "DataPoint.*\:" .

Length of output: 603

🧰 Tools
🪛 Ruff

18-18: Module level import not at top of file

(E402)

.github/workflows/test_python_3_9.yml (1)

11-11: 💡 Codebase verification

⚠️ Potential issue

Fix inconsistent concurrency group configuration

The codebase shows two different patterns for concurrency groups:

  1. Standard pattern (most workflows):
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  2. Non-standard pattern (in labeled workflows):
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | ${{ github.event.label.name == 'run-checks' }}

The | character in the group name is incorrect as it's treated as a literal character, not a logical operator. Since this workflow is triggered by labels, simply use the standard pattern as it already provides proper run isolation.

-  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | ${{ github.event.label.name == 'run-checks' }}
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
🔗 Analysis chain

Fix the concurrency group configuration

The current concurrency group configuration has several issues:

  1. Using '|' in the group name might not work as expected for grouping runs
  2. The label condition syntax appears incorrect and might not evaluate properly

Consider this corrected version:

-  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | ${{ github.event.label.name == 'run-checks' }}
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-${{ contains(github.event.pull_request.labels.*.name, 'run-checks') }}

Let's verify the label usage in PRs:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if 'run-checks' label is consistently used
gh api graphql -f query='
query($owner:String!, $repo:String!) {
  repository(owner:$owner, name:$repo) {
    pullRequests(last:20, states:OPEN) {
      nodes {
        number
        labels(first:10) {
          nodes {
            name
          }
        }
      }
    }
  }
}' -F owner="$(gh repo view --json owner -q .owner)" -F repo="$(gh repo view --json name -q .name)"

Length of output: 774


Script:

#!/bin/bash
# Let's check other workflow files to understand the concurrency pattern usage
fd -e yml -e yaml . .github/workflows/ --exec cat {} \; | grep -A 2 -B 2 "concurrency:"

Length of output: 1664

.github/workflows/test_python_3_10.yml (2)

11-11: 🛠️ Refactor suggestion

Simplify concurrency group configuration

The current concurrency group configuration includes a boolean expression which could lead to unexpected behavior. Consider simplifying the group name and controlling execution through job conditions instead.

- group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | ${{ github.event.label.name == 'run-checks' }}
+ group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}

Then, if needed, add a condition to the job:

jobs:
  run_common:
    if: |
      needs.get_docs_changes.outputs.changes_outside_docs == 'true' &&
      (github.event_name != 'pull_request' || github.event.label.name == 'run-checks')

4-8: 🛠️ Refactor suggestion

Consider broadening the pull request trigger events

The current configuration only runs tests when a PR is labeled, which could lead to missed testing opportunities and requires manual intervention. Consider including additional trigger events to ensure automated testing on PR creation and updates.

  pull_request:
    branches:
      - main
-    types: [labeled]
+    types: [opened, synchronize, reopened, labeled]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

  workflow_dispatch:
  pull_request:
    branches:
      - main
    types: [opened, synchronize, reopened, labeled]
.github/workflows/test_python_3_11.yml (1)

11-11: ⚠️ Potential issue

Fix concurrency group configuration

The current concurrency group configuration has potential issues:

  1. Using '|' for string concatenation in the group name is unconventional
  2. The boolean expression will be treated as a string literal
  3. Different label states will create different group names, potentially defeating the purpose of concurrency control

Consider this alternative implementation:

-  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | ${{ github.event.label.name == 'run-checks' }}
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}

Then control execution using job-level conditions:

jobs:
  run_common:
    if: |
      needs.get_docs_changes.outputs.changes_outside_docs == 'true' &&
      (github.event_name == 'workflow_dispatch' || github.event.label.name == 'run-checks')
cognee/infrastructure/engine/__tests__/model_to_graph_to_model.test.py (2)

63-72: ⚠️ Potential issue

Replace print statements with proper assertions

The current test uses print statements which don't validate the results.

Add proper assertions to verify the conversion:

     nodes, edges = get_graph_from_model(boris)
-    print(nodes)
-    print(edges)
+    assert len(nodes) > 0, "Should generate at least one node"
+    assert len(edges) > 0, "Should generate at least one edge"
 
-    person_data = nodes[len(nodes) - 1]
     parsed_person = get_model_instance_from_graph(nodes, edges, 'boris')
-    print(parsed_person)
+    assert parsed_person.id == boris.id
+    assert parsed_person.name == boris.name
+    assert len(parsed_person.owns_car) == len(boris.owns_car)
+    assert parsed_person.owns_car[0].brand == boris.owns_car[0].brand

Committable suggestion skipped: line range outside the PR's diff.


1-5: 🛠️ Refactor suggestion

Convert main block to proper test functions using pytest

The file is structured as a test but doesn't follow testing best practices. Consider restructuring using pytest fixtures and test functions.

Here's how to improve the structure:

 from enum import Enum
 from typing import Optional
+import pytest
 from cognee.infrastructure.engine import DataPoint
 from cognee.modules.graph.utils import get_graph_from_model, get_model_instance_from_graph
-
-if __name__ == "__main__":
+
+@pytest.fixture
+def sample_person():
+    # Move class definitions outside
+    return create_sample_person()
+
+def test_model_to_graph_conversion(sample_person):
+    nodes, edges = get_graph_from_model(sample_person)
+    assert nodes is not None
+    assert edges is not None
+
+def test_graph_to_model_conversion(sample_person):
+    nodes, edges = get_graph_from_model(sample_person)
+    parsed_person = get_model_instance_from_graph(nodes, edges, 'boris')
+    assert parsed_person == sample_person

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/databases/vector/create_vector_engine.py (2)

1-8: 🛠️ Refactor suggestion

Improve type hints and port handling in VectorConfig

The class definition could be enhanced for better type safety and data validation:

  1. Dict base class should specify type parameters
  2. Port should be an integer instead of string for proper validation

Consider applying these changes:

- from typing import Dict
+ from typing import Dict, Union

- class VectorConfig(Dict):
+ class VectorConfig(Dict[str, Union[str, int]]):
    vector_db_url: str
-   vector_db_port: str
+   vector_db_port: int
    vector_db_key: str
    vector_db_provider: str
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from typing import Dict, Union

class VectorConfig(Dict[str, Union[str, int]]):
    vector_db_url: str
    vector_db_port: int
    vector_db_key: str
    vector_db_provider: str

51-57: 🛠️ Refactor suggestion

Add validation and improve error handling for FalkorDB configuration

The FalkorDB integration needs additional validation and error handling:

  1. The import from hybrid package suggests a potential architecture concern
  2. Missing validation for required configuration

Consider these improvements:

 elif config["vector_db_provider"] == "falkordb":
     from ..hybrid.falkordb.FalkorDBAdapter import FalkorDBAdapter
 
+    if not config["vector_db_url"] or not config["vector_db_port"]:
+        raise EnvironmentError("FalkorDB URL and port must be configured")
+
+    try:
+        port = int(config["vector_db_port"])
+    except ValueError:
+        raise ValueError("FalkorDB port must be a valid integer")
+
     return FalkorDBAdapter(
         database_url = config["vector_db_url"],
-        database_port = config["vector_db_port"],
+        database_port = port,
         embedding_engine = embedding_engine,
     )

Consider moving FalkorDB adapter to a dedicated vector or graph database package instead of the hybrid package to maintain clear architectural boundaries.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    elif config["vector_db_provider"] == "falkordb":
        from ..hybrid.falkordb.FalkorDBAdapter import FalkorDBAdapter

        if not config["vector_db_url"] or not config["vector_db_port"]:
            raise EnvironmentError("FalkorDB URL and port must be configured")

        try:
            port = int(config["vector_db_port"])
        except ValueError:
            raise ValueError("FalkorDB port must be a valid integer")

        return FalkorDBAdapter(
            database_url = config["vector_db_url"],
            database_port = port,
            embedding_engine = embedding_engine,
cognee/modules/data/operations/translate_text.py (3)

5-5: ⚠️ Potential issue

Fix the incorrect use of async and yield with synchronous boto3 client

The boto3 client does not support asynchronous operations out of the box. Defining the function as async and using yield may lead to blocking behavior and unexpected results. Consider making the function synchronous or using an asynchronous client like aioboto3.

Apply this diff to make the function synchronous and replace yield with return:

-async def translate_text(text, source_language: str = "sr", target_language: str = "en", region_name = "eu-west-1"):
+def translate_text(text, source_language: str = "sr", target_language: str = "en", region_name = "eu-west-1"):

...

-            yield result.get("TranslatedText", "No translation found.")
+            return result.get("TranslatedText", "No translation found.")

...

-            yield None
+            return None

Also applies to: 33-33, 37-37, 41-41
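If the call genuinely needs to remain asynchronous, an alternative hedged sketch using aioboto3 (an assumption; the PR itself uses plain boto3) might look like:

from typing import Optional

import aioboto3

async def translate_text(text: str, source_language: str = "sr",
                         target_language: str = "en",
                         region_name: str = "eu-west-1") -> Optional[str]:
    session = aioboto3.Session()
    async with session.client("translate", region_name=region_name) as translate:
        result = await translate.translate_text(
            Text=text,
            SourceLanguageCode=source_language,
            TargetLanguageCode=target_language,
        )
    return result.get("TranslatedText")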


14-14: ⚠️ Potential issue

Ensure consistent return types and update docstring accordingly

The function returns None when an error occurs, but the docstring states it returns a string. Consider returning an error message or raising an exception for consistency.

If you choose to return error messages, apply this diff:

-        return None
+        return f"BotoCoreError occurred: {e}"

...

-        return None
+        return f"ClientError occurred: {e}"

And update the docstring:

-    str: Translated text or an error message.
+    str: Translated text or an error message if an error occurred.

Alternatively, if the function may return None, update the docstring and consider using Optional[str] in the return type hint.

Also applies to: 35-41


10-11: ⚠️ Potential issue

Correct the language code standard in the docstring

AWS Translate uses ISO 639-1 language codes, not ISO 639-2. Update the docstrings to prevent confusion.

Apply this diff to correct the docstrings:

-    source_language (str): The source language code (e.g., "sr" for Serbian). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
+    source_language (str): The source language code (e.g., "sr" for Serbian). ISO 639-1 Code https://www.iso.org/iso-639-language-codes.html

-    target_language (str): The target language code (e.g., "en" for English). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
+    target_language (str): The target language code (e.g., "en" for English). ISO 639-1 Code https://www.iso.org/iso-639-language-codes.html
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    source_language (str): The source language code (e.g., "sr" for Serbian). ISO 639-1 Code https://www.iso.org/iso-639-language-codes.html
    target_language (str): The target language code (e.g., "en" for English). ISO 639-1 Code https://www.iso.org/iso-639-language-codes.html
cognee/shared/SourceCodeGraph.py (2)

31-31: 🛠️ Refactor suggestion

Use List instead of list for type annotation

In the has_methods attribute, list["Function"] should be List["Function"] to ensure proper type hinting and consistency.

Apply this diff to correct it:

-has_methods: list["Function"]
+has_methods: List["Function"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    has_methods: List["Function"]

7-7: ⚠️ Potential issue

Avoid using 'type' as an attribute name to prevent shadowing built-in names

Using type as an attribute name can cause confusion and potential bugs since it shadows the built-in type function. Consider renaming it to variable_type or entity_type to improve code clarity.

Apply this diff to correct it:

-type: Literal["Variable"] = "Variable"
+entity_type: Literal["Variable"] = "Variable"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    entity_type: Literal["Variable"] = "Variable"
cognee/modules/chunking/TextChunker.py (3)

32-38: 🛠️ Refactor suggestion

Refactor duplicated DocumentChunk creation into a helper method.

The code for creating DocumentChunk instances is duplicated in multiple places. Refactoring this into a helper method will improve maintainability and reduce code duplication.

Apply this diff to refactor the code:

+    def create_document_chunk(self, id, text, word_count, cut_type):
+        return DocumentChunk(
+            id=id,
+            text=text,
+            word_count=word_count,
+            is_part_of=self.document,
+            chunk_index=self.chunk_index,
+            cut_type=cut_type,
+        )

     def read(self):
         for content_text in self.get_text():
             for chunk_data in chunk_by_paragraph(
                 content_text,
                 self.max_chunk_size,
                 batch_paragraphs = True,
             ):
                 if self.chunk_size + chunk_data["word_count"] <= self.max_chunk_size:
                     self.paragraph_chunks.append(chunk_data)
                     self.chunk_size += chunk_data["word_count"]
                 else:
                     if len(self.paragraph_chunks) == 0:
-                        yield DocumentChunk(
-                            id = chunk_data["chunk_id"],
-                            text = chunk_data["text"],
-                            word_count = chunk_data["word_count"],
-                            is_part_of = self.document,
-                            chunk_index = self.chunk_index,
-                            cut_type = chunk_data["cut_type"],
-                        )
+                        yield self.create_document_chunk(
+                            id=chunk_data["chunk_id"],
+                            text=chunk_data["text"],
+                            word_count=chunk_data["word_count"],
+                            cut_type=chunk_data["cut_type"],
+                        )
                         self.paragraph_chunks = []
                         self.chunk_size = 0
                     else:
                         chunk_text = " ".join(chunk["text"] for chunk in self.paragraph_chunks)
                         try:
-                            yield DocumentChunk(
-                                id = uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
-                                text = chunk_text,
-                                word_count = self.chunk_size,
-                                is_part_of = self.document,
-                                chunk_index = self.chunk_index,
-                                cut_type = self.paragraph_chunks[len(self.paragraph_chunks) - 1]["cut_type"],
-                            )
+                            yield self.create_document_chunk(
+                                id=uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
+                                text=chunk_text,
+                                word_count=self.chunk_size,
+                                cut_type=self.paragraph_chunks[-1]["cut_type"],
+                            )
                         except Exception as e:
                             print(e)
                         self.paragraph_chunks = [chunk_data]
                         self.chunk_size = chunk_data["word_count"]

                     self.chunk_index += 1

             if len(self.paragraph_chunks) > 0:
                 try:
-                    yield DocumentChunk(
-                        id = uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
-                        text = " ".join(chunk["text"] for chunk in self.paragraph_chunks),
-                        word_count = self.chunk_size,
-                        is_part_of = self.document,
-                        chunk_index = self.chunk_index,
-                        cut_type = self.paragraph_chunks[len(self.paragraph_chunks) - 1]["cut_type"],
-                    )
+                    yield self.create_document_chunk(
+                        id=uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
+                        text=" ".join(chunk["text"] for chunk in self.paragraph_chunks),
+                        word_count=self.chunk_size,
+                        cut_type=self.paragraph_chunks[-1]["cut_type"],
+                    )
                 except Exception as e:
                     print(e)

Also applies to: 43-51, 60-68


43-53: 🛠️ Refactor suggestion

Improve exception handling by avoiding broad exception capture.

Catching all exceptions with except Exception as e can obscure the real issues and make debugging harder. It's better to catch specific exceptions or let them propagate.

Apply this diff to improve exception handling:

             else:
                 chunk_text = " ".join(chunk["text"] for chunk in self.paragraph_chunks)
-                try:
                     yield self.create_document_chunk(
                         id=uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
                         text=chunk_text,
                         word_count=self.chunk_size,
                         cut_type=self.paragraph_chunks[-1]["cut_type"],
                     )
-                except Exception as e:
-                    print(e)
                 self.paragraph_chunks = [chunk_data]
                 self.chunk_size = chunk_data["word_count"]

             self.chunk_index += 1

         if len(self.paragraph_chunks) > 0:
-            try:
                 yield self.create_document_chunk(
                     id=uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
                     text=" ".join(chunk["text"] for chunk in self.paragraph_chunks),
                     word_count=self.chunk_size,
                     cut_type=self.paragraph_chunks[-1]["cut_type"],
                 )
-            except Exception as e:
-                print(e)

If specific exceptions are expected, catch them explicitly and handle appropriately. Otherwise, consider allowing exceptions to propagate to be handled upstream or add proper logging.

Also applies to: 60-70


7-12: ⚠️ Potential issue

Move class attributes to instance attributes within __init__.

Defining mutable default attributes at the class level can lead to shared state between instances, which can cause unexpected behaviors when multiple instances are created. It's better to initialize these attributes within the __init__ method.

Apply this diff to fix the issue:

 class TextChunker():
-    document = None
-    max_chunk_size: int
-    chunk_index = 0
-    chunk_size = 0
-    paragraph_chunks = []

     def __init__(self, document, get_text: callable, chunk_size: int = 1024):
         self.document = document
         self.max_chunk_size = chunk_size
         self.get_text = get_text
+        self.chunk_index = 0
+        self.chunk_size = 0
+        self.paragraph_chunks = []
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def __init__(self, document, get_text: callable, chunk_size: int = 1024):
        self.document = document
        self.max_chunk_size = chunk_size
        self.get_text = get_text
        self.chunk_index = 0
        self.chunk_size = 0
        self.paragraph_chunks = []
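
As a quick illustration of the shared-state problem with mutable class-level attributes (a toy class, not from this codebase):

class Collector:
    items = []  # class-level list: shared by every instance

    def add(self, value):
        self.items.append(value)

a = Collector()
b = Collector()
a.add("x")
print(b.items)  # ['x'] -- b observes a's data because the list lives on the class
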
cognee/modules/graph/utils/get_graph_from_model.py (2)

27-31: 🛠️ Refactor suggestion

Use tuples as dictionary keys instead of string concatenation for robustness.

Concatenating strings to create edge_key can lead to collisions and is less efficient. Using tuples as dictionary keys enhances readability and avoids potential issues with string manipulation.

Apply this diff to update the edge_key usage:

 # First occurrence
- edge_key = str(edge[0]) + str(edge[1]) + edge[2]
+ edge_key = (edge[0], edge[1], edge[2])

- if str(edge_key) not in added_edges:
+ if edge_key not in added_edges:

- added_edges[str(edge_key)] = True
+ added_edges[edge_key] = True

# Second occurrence
- edge_key = str(edge[0]) + str(edge[1]) + edge[2]
+ edge_key = (edge[0], edge[1], edge[2])

- if str(edge_key) not in added_edges:
+ if edge_key not in added_edges:

- added_edges[edge_key] = True
+ added_edges[edge_key] = True

Also applies to: 58-62


5-5: ⚠️ Potential issue

Avoid mutable default arguments to prevent unexpected behavior.

Using mutable default arguments like dictionaries (added_nodes={}, added_edges={}) can lead to unexpected behavior because the default mutable object is shared across all function calls. This can cause data from one call to affect another.

Apply this diff to fix the issue:

 def get_graph_from_model(data_point: DataPoint, include_root=True, added_nodes=None, added_edges=None):
+    if added_nodes is None:
+        added_nodes = {}
+    if added_edges is None:
+        added_edges = {}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def get_graph_from_model(data_point: DataPoint, include_root=True, added_nodes=None, added_edges=None):
    if added_nodes is None:
        added_nodes = {}
    if added_edges is None:
        added_edges = {}
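
For reference, a minimal standalone example of why mutable defaults leak state between calls (hypothetical function, not part of this module):

def collect(item, bucket = []):  # the same default list object is reused on every call
    bucket.append(item)
    return bucket

print(collect(1))  # [1]
print(collect(2))  # [1, 2] -- state from the first call leaks into the second

def collect_fixed(item, bucket = None):
    if bucket is None:
        bucket = []  # a fresh list per call
    bucket.append(item)
    return bucket
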
cognee/api/v1/cognify/code_graph_pipeline.py (2)

40-40: ⚠️ Potential issue

Use isinstance() for type checking to avoid potential issues

At line 40, using type(datasets[0]) == str is not recommended. It's better to use isinstance() for type checking, as it handles inheritance and is more idiomatic in Python.

Apply this diff to fix the issue:

-if type(datasets[0]) == str:
+if isinstance(datasets[0], str):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if isinstance(datasets[0], str):
🧰 Tools
🪛 Ruff

40-40: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)
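
A short illustration of the difference (the DatasetName subclass below is hypothetical and used only to show the behaviour):

class DatasetName(str):
    pass  # hypothetical str subclass, e.g. a typed dataset identifier

name = DatasetName("my_dataset")

print(type(name) == str)      # False: exact type comparison ignores subclasses
print(isinstance(name, str))  # True: isinstance respects inheritance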


36-42: ⚠️ Potential issue

Prevent potential IndexError by checking if datasets is empty before accessing

The code assumes that datasets is not empty when accessing datasets[0]. If datasets is empty, this will raise an IndexError. Ensure that datasets is not empty before accessing the first element.

Consider adding a check before line 40:

+if not datasets:
+    logger.warning("No datasets provided to process.")
+    return

Alternatively, you can adjust the existing condition to handle empty lists:

-if datasets is None or len(datasets) == 0:
+if not datasets:

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff

40-40: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

cognee/tasks/graph/extract_graph_from_data.py (2)

55-55: ⚠️ Potential issue

Use tuples instead of string concatenation for dictionary keys to prevent collisions

The keys for existing_edges_map and edge_key are created by concatenating strings (lines 55 and 102), which can lead to collisions if the combined strings are not unique. It's safer to use tuples as dictionary keys.

Apply this diff to modify the key generation:

@@ -55
-        existing_edges_map[edge[0] + edge[1] + edge[2]] = True
+        existing_edges_map[(edge[0], edge[1], edge[2])] = True

@@ -102
-            edge_key = str(source_node_id) + str(target_node_id) + relationship_name
+            edge_key = (source_node_id, target_node_id, relationship_name)

Ensure that all references to these keys are updated accordingly.

Also applies to: 102-102
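
A tiny example of the collision risk with concatenated keys (the IDs below are made up for illustration):

edge_a = ("node1", "2node", "rel")
edge_b = ("node12", "node", "rel")

print("".join(edge_a) == "".join(edge_b))  # True: the concatenated string keys collide
print(edge_a == edge_b)                    # False: tuple keys keep the edges distinct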


38-42: ⚠️ Potential issue

Accumulate graph_node_edges across all chunks to prevent data loss

Currently, graph_node_edges is re-initialized within the loop over data_chunks (lines 38-42), so only the edges from the last chunk are preserved. This leads to loss of edges from previous chunks. To fix this, initialize graph_node_edges before the loop and accumulate edges across all chunks.

Apply this diff to move the initialization and adjust the accumulation:

--- a/cognee/tasks/graph/extract_graph_from_data.py
+++ b/cognee/tasks/graph/extract_graph_from_data.py
@@ -17,0 +18 @@
+    graph_node_edges = []
@@ -23,3 +24,3 @@
     for (chunk_index, chunk) in enumerate(data_chunks):
         chunk_graph = chunk_graphs[chunk_index]
-        graph_node_edges = [
+        graph_node_edges.extend([
             (edge.target_node_id, edge.source_node_id, edge.relationship_name) \
@@ -41,0 +42 @@
+        ])

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (2)

87-90: 🛠️ Refactor suggestion

Ensure Safe Manipulation of the properties Dictionary

When modifying the properties dictionary, it's important to ensure that the 'id' key exists to prevent potential KeyError. Although data_point.id is likely to be present, adding a safeguard can enhance robustness.

Consider updating the code to safely handle the 'id' key:

 properties = data_point.model_dump()
+if 'id' in properties:
     properties["uuid"] = properties["id"]
     del properties["id"]
+else:
+    properties["uuid"] = None  # or handle the absence appropriately
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            properties = data_point.model_dump()
            if 'id' in properties:
                properties["uuid"] = properties["id"]
                del properties["id"]
            else:
                properties["uuid"] = None  # or handle the absence appropriately

120-127: ⚠️ Potential issue

Safely Access _metadata in data_point

Accessing data_point._metadata["index_fields"][0] assumes that _metadata exists and contains the key "index_fields". To prevent potential AttributeError or KeyError, ensure that these exist before accessing.

Consider modifying the code to include safety checks:

- text = getattr(data_point, data_point._metadata["index_fields"][0])
+index_fields = getattr(data_point, '_metadata', {}).get("index_fields", [])
+if index_fields:
+    text = getattr(data_point, index_fields[0])
+else:
+    # Handle the absence appropriately, e.g., set text to None or raise an error
+    text = None

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (2)

129-135: ⚠️ Potential issue

Avoid accessing protected member _metadata in index_data_points

Directly accessing data_point._metadata["index_fields"][0] breaks encapsulation and may lead to errors if the internal structure changes. Provide a public method or property in DataPoint to access index fields safely.

First, add a getter method in DataPoint:

class DataPoint:
    # Existing code...

    def get_index_field(self) -> str:
        return getattr(self, self.metadata["index_fields"][0])

Then, update the list comprehension in index_data_points:

await self.create_data_points(
    f"{index_name}_{index_property_name}",
    [
        IndexSchema(
            id=data_point.id,
-           text=getattr(data_point, data_point._metadata["index_fields"][0]),
+           text=data_point.get_index_field(),
        ) for data_point in data_points
    ]
)

11-17: 🛠️ Refactor suggestion

Avoid using protected member _metadata directly in IndexSchema

The _metadata attribute is conventionally considered protected. Directly accessing or defining it may lead to maintenance issues and breaks encapsulation. Consider using a public attribute or property to store metadata.

Modify the IndexSchema class to use a public attribute:

class IndexSchema(DataPoint):
    text: str

-    _metadata: dict = {
+    metadata: dict = {
        "index_fields": ["text"]
    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

class IndexSchema(DataPoint):
    text: str

    metadata: dict = {
        "index_fields": ["text"]
    }

cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (3)

76-78: ⚠️ Potential issue

Ensure data_points is not empty to prevent IndexError

In create_data_points, accessing data_points[0] without checking if the list is empty can lead to an IndexError if data_points is empty. Add a check to handle this case.

Apply this diff to handle empty data_points:

 async def create_data_points(self, collection_name: str, data_points: list[DataPoint]):
     connection = await self.get_connection()

+    if not data_points:
+        # Handle empty data_points list appropriately
+        return

     payload_schema = type(data_points[0])
     payload_schema = self.get_data_point_schema(payload_schema)
     ...

Committable suggestion skipped: line range outside the PR's diff.


100-108: ⚠️ Potential issue

Avoid using generics at runtime when instantiating LanceDataPoint

In the create_lance_data_point function, using LanceDataPoint[str, self.get_data_point_schema(type(data_point))] tries to parameterize the class with types at runtime, which is not valid in Python. Generics are for type hints and not meant for runtime instantiation. Use LanceDataPoint without generic parameters.

Apply this diff to correct the instantiation:

 def create_lance_data_point(data_point: DataPoint, vector: list[float]) -> LanceDataPoint:
     properties = get_own_properties(data_point)
     properties["id"] = str(properties["id"])

-    return LanceDataPoint[str, self.get_data_point_schema(type(data_point))](
+    return LanceDataPoint(
         id = str(data_point.id),
         vector = vector,
         payload = properties,
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def create_lance_data_point(data_point: DataPoint, vector: list[float]) -> LanceDataPoint:
    properties = get_own_properties(data_point)
    properties["id"] = str(properties["id"])

    return LanceDataPoint(
        id = str(data_point.id),
        vector = vector,
        payload = properties,
    )

216-220: ⚠️ Potential issue

Add error handling for accessing data_point._metadata["index_fields"]

In index_data_points, accessing data_point._metadata["index_fields"][0] without validation can lead to KeyError or IndexError if the metadata is missing or empty. Ensure that _metadata contains index_fields and that the attribute exists on data_point.

Apply this diff to include error checks:

 async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
     await self.create_data_points(f"{index_name}_{index_property_name}", [
-        IndexSchema(
-            id = str(data_point.id),
-            text = getattr(data_point, data_point._metadata["index_fields"][0]),
-        ) for data_point in data_points
+        IndexSchema(
+            id = str(data_point.id),
+            text = getattr(
+                data_point,
+                data_point._metadata.get("index_fields", [None])[0],
+                None
+            ),
+        )
+        for data_point in data_points
         if data_point._metadata.get("index_fields") and hasattr(data_point, data_point._metadata["index_fields"][0])
     ])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            IndexSchema(
                id = str(data_point.id),
                text = getattr(
                    data_point,
                    data_point._metadata.get("index_fields", [None])[0],
                    None
                ),
            )
            for data_point in data_points
            if data_point._metadata.get("index_fields") and hasattr(data_point, data_point._metadata["index_fields"][0])
        ])
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

131-131: ⚠️ Potential issue

Potential AttributeError when accessing data_point._metadata

On line 131, accessing data_point._metadata["index_fields"][0] may raise an AttributeError if data_point instances do not have a _metadata attribute or a KeyError if the "index_fields" key is missing. Ensure that all DataPoint instances have the _metadata attribute with the expected structure.

To fix this issue, verify that data_point objects include _metadata with the necessary keys, or add error handling to manage cases where _metadata is absent or improperly structured.

cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (5)

250-251: ⚠️ Potential issue

Missing collection_name argument when calling delete_data_points

In delete_nodes, you are calling delete_data_points without supplying the collection_name argument.

Apply this diff:

-        self.delete_data_points(data_point_ids)
+        await self.delete_data_points(collection_name, data_point_ids)

Also, ensure you await the asynchronous call and have the correct collection_name.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def delete_nodes(self, collection_name: str, data_point_ids: list[str]):
        await self.delete_data_points(collection_name, data_point_ids)

230-237: ⚠️ Potential issue

Build the search coroutines into a list before passing them to asyncio.gather

Unpacking a comprehension of coroutines directly inside the asyncio.gather(*[...]) call is hard to read and easy to get wrong. Collecting the coroutines into a named list first and then awaiting asyncio.gather keeps the intent clear.

Apply this diff:

-            return await asyncio.gather(
-                *[self.search(
-                    collection_name = collection_name,
-                    query_vector = query_vector,
-                    limit = limit,
-                    with_vector = with_vectors,
-                ) for query_vector in query_vectors]
-            )
+            tasks = [
+                self.search(
+                    collection_name=collection_name,
+                    query_vector=query_vector,
+                    limit=limit,
+                    with_vector=with_vectors,
+                )
+                for query_vector in query_vectors
+            ]
+            return await asyncio.gather(*tasks)

This ensures compatibility and proper execution of asynchronous tasks.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        tasks = [
            self.search(
                collection_name=collection_name,
                query_vector=query_vector,
                limit=limit,
                with_vector=with_vectors,
            )
            for query_vector in query_vectors
        ]
        return await asyncio.gather(*tasks)

24-24: ⚠️ Potential issue

Initialize embedding_engine with an instance, not the class

In the __init__ method, the default value for embedding_engine is set to the EmbeddingEngine class itself rather than an instance. This will cause issues when attempting to call instance methods like embed_text. You should instantiate EmbeddingEngine or expect an instance to be passed.

Apply this diff to fix the issue:

-    embedding_engine =  EmbeddingEngine,
+    embedding_engine = None,
 ):
     if embedding_engine is None:
         embedding_engine = EmbeddingEngine()
     self.embedding_engine = embedding_engine

Committable suggestion skipped: line range outside the PR's diff.


104-109: ⚠️ Potential issue

Avoid using bare except clauses

At line 108, a bare except is used, which can catch unexpected exceptions and make debugging difficult. It's best practice to catch specific exceptions.

Apply this diff to catch general exceptions explicitly:

     try:
         indices = graph.list_indices()
         return any([(index[0] == index_name and index_property_name in index[1]) for index in indices.result_set])
-    except:
+    except Exception as e:
+        logging.error(f"Error checking if vector index exists: {e}")
         return False

Replace Exception with a more specific exception type if possible.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        try:
            indices = graph.list_indices()

            return any([(index[0] == index_name and index_property_name in index[1]) for index in indices.result_set])
        except Exception as e:
            logging.error(f"Error checking if vector index exists: {e}")
            return False
🧰 Tools
🪛 Ruff

108-108: Do not use bare except

(E722)


248-248: ⚠️ Potential issue

Missing collection_name argument when calling delete_data_points

In delete_node, you are calling delete_data_points without providing the required collection_name argument. This will result in a TypeError.

Apply this diff:

-        return await self.delete_data_points([data_point_id])
+        return await self.delete_data_points(collection_name, [data_point_id])

Ensure that you have access to the appropriate collection_name in this context.

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/databases/graph/networkx/adapter.py (5)

83-85: ⚠️ Potential issue

Avoid using mutable default arguments

The edge_properties parameter in add_edge has a default value of an empty dictionary {}. Using mutable default arguments can lead to unexpected behavior because the same object is shared across all function calls.

Apply this diff to set the default value to None and initialize it within the function:

 async def add_edge(
     self,
     from_node: str,
     to_node: str,
     relationship_name: str,
-    edge_properties: Dict[str, Any] = {},
+    edge_properties: Optional[Dict[str, Any]] = None,
 ) -> None:
+    if edge_properties is None:
+        edge_properties = {}
     edge_properties["updated_at"] = datetime.now(timezone.utc)
     self.graph.add_edge(from_node, to_node, key=relationship_name, **edge_properties)

Remember to import Optional from typing:

from typing import Dict, Any, List, Optional

256-257: ⚠️ Potential issue

Avoid bare except clauses

Bare except clauses catch all exceptions, which can mask other issues and make debugging difficult. It's better to catch specific exceptions.

Apply this diff to catch specific exceptions:

 for node in graph_data["nodes"]:
     try:
         node["id"] = UUID(node["id"])
-    except:
+    except ValueError:
         pass
     if "updated_at" in node:
         node["updated_at"] = datetime.strptime(node["updated_at"], "%Y-%m-%dT%H:%M:%S.%f%z")
 
 for edge in graph_data["links"]:
     try:
         source_id = UUID(edge["source"])
         target_id = UUID(edge["target"])
         edge["source"] = source_id
         edge["target"] = target_id
         edge["source_node_id"] = source_id
         edge["target_node_id"] = target_id
-    except:
+    except ValueError:
         pass
     if "updated_at" in edge:
         edge["updated_at"] = datetime.strptime(edge["updated_at"], "%Y-%m-%dT%H:%M:%S.%f%z")

Also applies to: 270-271

🧰 Tools
🪛 Ruff

256-256: Do not use bare except

(E722)


285-288: 🛠️ Refactor suggestion

Optimize file existence checks

In the load_graph_from_file method, the early return when file_path == self.filename is redundant, yet later code still assumes file_path has been set correctly. Make sure the logic handles the default filename consistently and drop the unnecessary check.


37-38: ⚠️ Potential issue

Implement the query method or raise NotImplementedError

The query method is currently unimplemented with a pass statement. If this method is intended for future use, consider raising a NotImplementedError to clearly indicate that it's pending implementation.

Apply this diff to raise a NotImplementedError:

 async def query(self, query: str, params: dict):
-    pass
+    raise NotImplementedError("The 'query' method is not yet implemented.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def query(self, query: str, params: dict):
        raise NotImplementedError("The 'query' method is not yet implemented.")

93-97: ⚠️ Potential issue

Correct parameter unpacking and edge property handling

In the add_edges method, ensure that edge tuples are unpacked correctly and handle cases where edge_properties might be missing. Also, avoid mutable default arguments similar to the add_edge method.

Apply this diff to correct the implementation:

 async def add_edges(
     self,
     edges: List[Tuple[str, str, str, Optional[Dict[str, Any]]]],
 ) -> None:
-    edges = [(edge[0], edge[1], edge[2], {
-        **(edge[3] if len(edge) == 4 else {}),
-        "updated_at": datetime.now(timezone.utc),
-    }) for edge in edges]
+    processed_edges = []
+    for edge in edges:
+        from_node, to_node, relationship_name = edge[:3]
+        edge_props = edge[3] if len(edge) == 4 and edge[3] is not None else {}
+        edge_props["updated_at"] = datetime.now(timezone.utc)
+        processed_edges.append((from_node, to_node, relationship_name, edge_props))
 
-    self.graph.add_edges_from(edges)
+    self.graph.add_edges_from(processed_edges)
     await self.save_graph_to_file(self.filename)

This ensures that edge_properties are handled correctly and that updated_at is added to each edge.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        processed_edges = []
        for edge in edges:
            from_node, to_node, relationship_name = edge[:3]
            edge_props = edge[3] if len(edge) == 4 and edge[3] is not None else {}
            edge_props["updated_at"] = datetime.now(timezone.utc)
            processed_edges.append((from_node, to_node, relationship_name, edge_props))

        self.graph.add_edges_from(processed_edges)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (7)

50-53: ⚠️ Potential issue

Missing 'await' when calling the asynchronous 'query' method.

In the has_node method, the query method is asynchronous but is called without await. This will result in a coroutine object being returned instead of the actual results.

Apply this diff to fix the issue:

async def has_node(self, node_id: str) -> bool:
-    results = self.query(
+    results = await self.query(
        """
            MATCH (n)
            WHERE n.id = $node_id
            RETURN COUNT(n) > 0 AS node_exists
        """,
        {"node_id": node_id}
    )
    return results[0]["node_exists"] if len(results) > 0 else False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def has_node(self, node_id: str) -> bool:
        results = await self.query(

79-85: 🛠️ Refactor suggestion

Set timestamps correctly when adding multiple nodes.

Similar to add_node, in the add_nodes method, ensure that created_at and updated_at timestamps are set appropriately in the ON CREATE and ON MATCH clauses.

Apply this diff to update the query:

ON CREATE SET n += node.properties,
+             n.created_at = timestamp()
ON MATCH SET n += node.properties,
-ON MATCH SET n.updated_at = timestamp()
+             n.updated_at = timestamp()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def add_nodes(self, nodes: list[DataPoint]) -> None:
        query = """
        UNWIND $nodes AS node
        MERGE (n {id: node.node_id})
        ON CREATE SET n += node.properties,
                     n.created_at = timestamp()
        ON MATCH SET n += node.properties,
                     n.updated_at = timestamp()

44-46: ⚠️ Potential issue

Avoid closing the driver after each query to prevent connection issues.

Calling self.close() inside the query method closes the driver after each query. This will cause connection errors when subsequent queries attempt to use the closed driver. It's recommended to close the driver once, typically when the application is shutting down.

Apply this diff to remove the unnecessary driver closure:

async def query(
    self,
    query: str,
    params: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    try:
        async with self.get_session() as session:
            result = await session.run(query, parameters=params)
            data = await result.data()
-           await self.close()
            return data
    except Neo4jError as error:
        logger.error("Neo4j query error: %s", error, exc_info=True)
        raise error
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                result = await session.run(query, parameters = params)
                data = await result.data()
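
One way to handle driver lifetime instead is to close it once when the adapter is disposed, for example via an async context manager on the adapter (a sketch under the assumption that the adapter keeps the driver in self.driver; class and method names are illustrative):

class Neo4jAdapterLifecycle:
    def __init__(self, driver):
        self.driver = driver  # created once, reused by every query

    async def aclose(self):
        await self.driver.close()  # close exactly once, at application shutdown

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.aclose()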

92-94: ⚠️ Potential issue

Avoid overwriting the 'id' property in node properties.

When serializing node properties, ensure that the id property is not included in the properties dictionary if it's already used as an identifier. This prevents duplication and potential conflicts.

Apply this diff to exclude the 'id' property:

nodes = [{
    "node_id": str(node.id),
-   "properties": self.serialize_properties(node.model_dump()),
+   "properties": self.serialize_properties({
+       key: value for key, value in node.model_dump().items() if key != 'id'
+   }),
} for node in nodes]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            "node_id": str(node.id),
            "properties": self.serialize_properties({
                key: value for key, value in node.model_dump().items() if key != 'id'
            }),
        } for node in nodes]

63-69: 🛠️ Refactor suggestion

Ensure consistent use of 'ON CREATE' and 'ON MATCH' clauses.

In the add_node method, both ON CREATE SET and ON MATCH SET clauses are used to update the node properties. However, updated_at = timestamp() is only set in the ON MATCH clause. Consider setting created_at on node creation and updated_at on updates for better tracking.

Apply this diff to set timestamps appropriately:

query = """MERGE (node {id: $node_id})
            ON CREATE SET node += $properties,
+                          node.created_at = timestamp()
            ON MATCH SET node += $properties,
-           ON MATCH SET node.updated_at = timestamp()
+                          node.updated_at = timestamp()
            RETURN ID(node) AS internal_id, node.id AS nodeId"""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def add_node(self, node: DataPoint):
        serialized_properties = self.serialize_properties(node.model_dump())

        query = """MERGE (node {id: $node_id})
                ON CREATE SET node += $properties,
                              node.created_at = timestamp()
                ON MATCH SET node += $properties,
                              node.updated_at = timestamp()
                RETURN ID(node) AS internal_id, node.id AS nodeId"""

177-182: ⚠️ Potential issue

Incorrect use of node UUIDs as labels in 'add_edge' method.

Similar to has_edge, the add_edge method incorrectly uses node UUIDs as labels in the MATCH clause. Nodes should be matched using properties, not labels derived from UUIDs.

Apply this diff to fix the query and parameters:

async def add_edge(
    self,
    from_node: str,
    to_node: str,
    relationship_name: str,
    edge_properties: Optional[Dict[str, Any]] = {},
):
    serialized_properties = self.serialize_properties(edge_properties)

-   query = f"""MATCH (from_node:`{str(from_node)}` {{id: $from_node}}), 
-        (to_node:`{str(to_node)}` {{id: $to_node}})
+   query = f"""MATCH (from_node {{id: $from_node_id}}), 
+        (to_node {{id: $to_node_id}})
         MERGE (from_node)-[r:`{relationship_name}`]->(to_node)
         ON CREATE SET r += $properties, r.updated_at = timestamp()
         ON MATCH SET r += $properties, r.updated_at = timestamp()
         RETURN r"""
    params = {
-       "from_node": str(from_node),
-       "to_node": str(to_node),
+       "from_node_id": str(from_node),
+       "to_node_id": str(to_node),
        "properties": serialized_properties,
    }

    return await self.query(query, params)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        query = f"""MATCH (from_node {id: $from_node_id}), 
         (to_node {id: $to_node_id})
         MERGE (from_node)-[r:`{relationship_name}`]->(to_node)
         ON CREATE SET r += $properties, r.updated_at = timestamp()
         ON MATCH SET r += $properties, r.updated_at = timestamp()

139-142: ⚠️ Potential issue

Incorrect usage of UUIDs as labels in Cypher queries.

The from_node and to_node UUIDs are being used as labels in the MATCH clause. Labels represent node types, not specific node instances. Using UUIDs as labels can cause unexpected behavior and is not recommended.

Apply this diff to correct the query and parameters:

async def has_edge(self, from_node: UUID, to_node: UUID, edge_label: str) -> bool:
-    query = f"""
-        MATCH (from_node:`{str(from_node)}`)-[relationship:`{edge_label}`]->(to_node:`{str(to_node)}`)
-        RETURN COUNT(relationship) > 0 AS edge_exists
-    """
+    query = f"""
+        MATCH (from_node)-[relationship:`{edge_label}`]->(to_node)
+        WHERE from_node.id = $from_node_id AND to_node.id = $to_node_id
+        RETURN COUNT(relationship) > 0 AS edge_exists
+    """
+    params = {
+        "from_node_id": str(from_node),
+        "to_node_id": str(to_node),
+    }

-    edge_exists = await self.query(query)
+    edge_exists_result = await self.query(query, params)
+    edge_exists = edge_exists_result[0]["edge_exists"] if edge_exists_result else False
    return edge_exists
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def has_edge(self, from_node: UUID, to_node: UUID, edge_label: str) -> bool:
    query = f"""
        MATCH (from_node)-[relationship:`{edge_label}`]->(to_node)
        WHERE from_node.id = $from_node_id AND to_node.id = $to_node_id
        RETURN COUNT(relationship) > 0 AS edge_exists
    """
    params = {
        "from_node_id": str(from_node),
        "to_node_id": str(to_node),
    }

    edge_exists_result = await self.query(query, params)
    edge_exists = edge_exists_result[0]["edge_exists"] if edge_exists_result else False
    return edge_exists

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

🛑 Comments failed to post (73)
cognee/modules/engine/utils/generate_node_name.py (1)

1-2: 🛠️ Refactor suggestion

Add docstring with examples and parameter validation

The function would benefit from a clear docstring explaining its purpose, parameters, return value, and usage examples. Also consider adding input validation.

Here's a suggested implementation:

+from typing import Optional
+from beartype import beartype
+
+@beartype
 def generate_node_name(name: str) -> str:
+    """Generate a normalized node name by converting to lowercase and replacing special characters.
+
+    Args:
+        name: Input string to be converted into a node name.
+            Must be a non-empty string.
+
+    Returns:
+        Normalized node name string.
+
+    Raises:
+        ValueError: If name is empty or None.
+
+    Examples:
+        >>> generate_node_name("Hello World")
+        'hello_world'
+        >>> generate_node_name("User's Data")
+        'users_data'
+    """
+    if not name or not isinstance(name, str):
+        raise ValueError("Name must be a non-empty string")
+
     return name.lower().replace(" ", "_").replace("'", "")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from typing import Optional
from beartype import beartype

@beartype
def generate_node_name(name: str) -> str:
    """Generate a normalized node name by converting to lowercase and replacing special characters.

    Args:
        name: Input string to be converted into a node name.
            Must be a non-empty string.

    Returns:
        Normalized node name string.

    Raises:
        ValueError: If name is empty or None.

    Examples:
        >>> generate_node_name("Hello World")
        'hello_world'
        >>> generate_node_name("User's Data")
        'users_data'
    """
    if not name or not isinstance(name, str):
        raise ValueError("Name must be a non-empty string")

    return name.lower().replace(" ", "_").replace("'", "")
cognee/modules/engine/utils/generate_node_id.py (1)

3-4: ⚠️ Potential issue

Add input validation and proper type checking.

Several improvements are needed for robustness:

  1. The PR mentions beartype but lacks runtime type checking
  2. No validation for empty or None input
  3. The UUID object needs explicit string conversion

Here's the suggested implementation:

+from beartype import beartype
 from uuid import NAMESPACE_OID, uuid5
+from typing import Optional

+@beartype
-def generate_node_id(node_id: str) -> str:
+def generate_node_id(node_id: Optional[str]) -> str:
+    if not node_id:
+        raise ValueError("node_id cannot be empty or None")
+    normalized_id = node_id.lower().replace(" ", "_").replace("'", "")
+    if not normalized_id:
+        raise ValueError("normalized node_id cannot be empty")
-    return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))
+    return str(uuid5(NAMESPACE_OID, normalized_id))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from beartype import beartype
from uuid import NAMESPACE_OID, uuid5
from typing import Optional

@beartype
def generate_node_id(node_id: Optional[str]) -> str:
    if not node_id:
        raise ValueError("node_id cannot be empty or None")
    normalized_id = node_id.lower().replace(" ", "_").replace("'", "")
    if not normalized_id:
        raise ValueError("normalized node_id cannot be empty")
    return str(uuid5(NAMESPACE_OID, normalized_id))
cognee/modules/engine/models/EntityType.py (2)

4-8: 🛠️ Refactor suggestion

Consider renaming the 'type' attribute to avoid shadowing built-in type

The attribute name 'type' shadows Python's built-in type function. Consider using a more specific name like entity_type or type_name.

 class EntityType(DataPoint):
     name: str
-    type: str
+    entity_type: str
     description: str
     exists_in: DocumentChunk
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

class EntityType(DataPoint):
    name: str
    entity_type: str
    description: str
    exists_in: DocumentChunk

4-11: 🛠️ Refactor suggestion

Add class documentation and runtime type checking

The class lacks documentation explaining its purpose and usage. Also, given the PR title mentions "beartype", consider adding runtime type checking.

+from beartype import beartype
+
+@beartype
 class EntityType(DataPoint):
+    """
+    Represents a type classification for entities in the system.
+    
+    Attributes:
+        name (str): The unique identifier for this entity type
+        entity_type (str): The classification category
+        description (str): Detailed description of this entity type
+        exists_in (DocumentChunk): Reference to the source document chunk
+    """
     name: str
     type: str
     description: str
     exists_in: DocumentChunk
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from beartype import beartype

@beartype
class EntityType(DataPoint):
    """
    Represents a type classification for entities in the system.
    
    Attributes:
        name (str): The unique identifier for this entity type
        entity_type (str): The classification category
        description (str): Detailed description of this entity type
        exists_in (DocumentChunk): Reference to the source document chunk
    """
    name: str
    type: str
    description: str
    exists_in: DocumentChunk
    _metadata: dict = {
        "index_fields": ["name"],
    }
cognee/modules/data/processing/document_types/AudioDocument.py (2)

8-11: 🛠️ Refactor suggestion

Consider error handling for transcript creation

The transcript creation could fail for various reasons (network issues, invalid audio file, etc.), but there's no error handling.

Consider adding error handling:

    def read(self, chunk_size: int):
        # Transcribe the audio file
-       result = get_llm_client().create_transcript(self.raw_data_location)
-       text = result.text
+       try:
+           result = get_llm_client().create_transcript(self.raw_data_location)
+           text = result.text
+       except Exception as e:
+           raise ValueError(f"Failed to transcribe audio file: {e}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def read(self, chunk_size: int):
        # Transcribe the audio file
        try:
            result = get_llm_client().create_transcript(self.raw_data_location)
            text = result.text
        except Exception as e:
            raise ValueError(f"Failed to transcribe audio file: {e}")

1-15: ⚠️ Potential issue

Verify attribute initialization and add type hints

The class is missing several key components:

  1. No constructor is defined, yet self.raw_data_location is used in the read method
  2. Missing type hints for raw_data_location attribute

Consider adding proper initialization and type hints:

class AudioDocument(Document):
    type: str = "audio"
+   raw_data_location: str

+   def __init__(self, raw_data_location: str) -> None:
+       super().__init__()
+       self.raw_data_location = raw_data_location
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.modules.chunking.TextChunker import TextChunker
from .Document import Document

class AudioDocument(Document):
    type: str = "audio"
    raw_data_location: str

    def __init__(self, raw_data_location: str) -> None:
        super().__init__()
        self.raw_data_location = raw_data_location

    def read(self, chunk_size: int):
        # Transcribe the audio file
        result = get_llm_client().create_transcript(self.raw_data_location)
        text = result.text

        chunker = TextChunker(self, chunk_size = chunk_size, get_text = lambda: text)

        yield from chunker.read()
cognee/tasks/documents/extract_chunks_from_documents.py (1)

9-11: ⚠️ Potential issue

Add error handling and input validation.

The current implementation lacks several important safeguards:

  1. No error handling for document.read() operations
  2. No validation of chunk_size
  3. No handling of empty documents list

Consider implementing these improvements:

 @beartype
 async def extract_chunks_from_documents(documents: list[Document], chunk_size: int = 1024) -> AsyncGenerator[DocumentChunk, None]:
+    if not documents:
+        return
+    
+    if chunk_size <= 0:
+        raise ValueError("chunk_size must be positive")
+
     for document in documents:
-        for document_chunk in document.read(chunk_size = chunk_size):
-            yield document_chunk
+        try:
+            for document_chunk in document.read(chunk_size=chunk_size):
+                yield document_chunk
+        except Exception as e:
+            # Log the error and continue with next document
+            logger.error(f"Error processing document: {str(e)}")
+            continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if not documents:
        return
    
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    for document in documents:
        try:
            for document_chunk in document.read(chunk_size=chunk_size):
                yield document_chunk
        except Exception as e:
            # Log the error and continue with next document
            logger.error(f"Error processing document: {str(e)}")
            continue
cognee/modules/data/processing/document_types/ImageDocument.py (1)

7-15: 🛠️ Refactor suggestion

Consider adding error handling for image transcription

The image transcription process could fail for various reasons (invalid image, API errors, etc.), but there's no error handling in place.

Consider adding try-catch block:

 def read(self, chunk_size: int):
     # Transcribe the image file
-    result = get_llm_client().transcribe_image(self.raw_data_location)
-    text = result.choices[0].message.content
+    try:
+        result = get_llm_client().transcribe_image(self.raw_data_location)
+        text = result.choices[0].message.content
+    except Exception as e:
+        raise ValueError(f"Failed to transcribe image: {str(e)}")

     chunker = TextChunker(self, chunk_size = chunk_size, get_text = lambda: text)
     yield from chunker.read()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def read(self, chunk_size: int):
        # Transcribe the image file
        try:
            result = get_llm_client().transcribe_image(self.raw_data_location)
            text = result.choices[0].message.content
        except Exception as e:
            raise ValueError(f"Failed to transcribe image: {str(e)}")

        chunker = TextChunker(self, chunk_size = chunk_size, get_text = lambda: text)

        yield from chunker.read()
cognee/tasks/graph/extract_graph_from_code.py (4)

8-8: 🛠️ Refactor suggestion

Add docstring and input validation

The function lacks documentation and input validation. Consider adding a detailed docstring and validating input parameters.

 async def extract_graph_from_code(data_chunks: list[DocumentChunk], graph_model: Type[BaseModel]):
+    """Extract and process graphs from code chunks.
+
+    Args:
+        data_chunks (list[DocumentChunk]): List of document chunks containing code
+        graph_model (Type[BaseModel]): Pydantic model for graph structure
+
+    Returns:
+        list[DocumentChunk]: Processed document chunks
+
+    Raises:
+        ValueError: If data_chunks is empty or contains invalid items
+    """
+    if not data_chunks:
+        raise ValueError("data_chunks cannot be empty")
+    if not all(isinstance(chunk, DocumentChunk) for chunk in data_chunks):
+        raise ValueError("All items in data_chunks must be DocumentChunk instances")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def extract_graph_from_code(data_chunks: list[DocumentChunk], graph_model: Type[BaseModel]):
    """Extract and process graphs from code chunks.

    Args:
        data_chunks (list[DocumentChunk]): List of document chunks containing code
        graph_model (Type[BaseModel]): Pydantic model for graph structure

    Returns:
        list[DocumentChunk]: Processed document chunks

    Raises:
        ValueError: If data_chunks is empty or contains invalid items
    """
    if not data_chunks:
        raise ValueError("data_chunks cannot be empty")
    if not all(isinstance(chunk, DocumentChunk) for chunk in data_chunks):
        raise ValueError("All items in data_chunks must be DocumentChunk instances")

13-15: ⚠️ Potential issue

Add error handling for storage operations

The storage loop lacks error handling and could fail silently. Also, there's a potential index mismatch if previous operations failed.

-    for (chunk_index, chunk) in enumerate(data_chunks):
-        chunk_graph = chunk_graphs[chunk_index]
-        await add_data_points(chunk_graph.nodes)
+    for chunk_graph, chunk in zip(results, data_chunks):
+        try:
+            if not hasattr(chunk_graph, 'nodes'):
+                logging.error(f"Invalid graph structure for chunk: {chunk.id}")
+                continue
+            await add_data_points(chunk_graph.nodes)
+        except Exception as e:
+            logging.error(f"Failed to store nodes for chunk {chunk.id}: {str(e)}")
+            raise

Committable suggestion skipped: line range outside the PR's diff.


9-11: ⚠️ Potential issue

Add error handling and batch processing

The concurrent extraction lacks error handling and could potentially overwhelm the system with too many simultaneous tasks.

+    # Process in batches of 10 to avoid overwhelming the system
+    batch_size = 10
+    results = []
+    for i in range(0, len(data_chunks), batch_size):
+        batch = data_chunks[i:i + batch_size]
+        try:
             chunk_graphs = await asyncio.gather(
-                *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks]
+                *[extract_content_graph(chunk.text, graph_model) for chunk in batch],
+                return_exceptions=True
             )
+            # Handle any exceptions from the batch
+            for graph in chunk_graphs:
+                if isinstance(graph, Exception):
+                    # Log the error and continue with other chunks
+                    logging.error(f"Failed to extract graph: {str(graph)}")
+                else:
+                    results.append(graph)
+        except Exception as e:
+            logging.error(f"Batch processing failed: {str(e)}")
+            raise
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Process in batches of 10 to avoid overwhelming the system
    batch_size = 10
    results = []
    for i in range(0, len(data_chunks), batch_size):
        batch = data_chunks[i:i + batch_size]
        try:
            chunk_graphs = await asyncio.gather(
                *[extract_content_graph(chunk.text, graph_model) for chunk in batch],
                return_exceptions=True
            )
            # Handle any exceptions from the batch
            for graph in chunk_graphs:
                if isinstance(graph, Exception):
                    # Log the error and continue with other chunks
                    logging.error(f"Failed to extract graph: {str(graph)}")
                else:
                    results.append(graph)
        except Exception as e:
            logging.error(f"Batch processing failed: {str(e)}")
            raise

1-17: 💡 Codebase verification

Add @beartype decorator to ensure runtime type checking

The file extract_graph_from_code.py should include the beartype decorator for consistency with other task functions in the codebase. Both extract_graph_from_data.py and add_data_points.py use the @beartype decorator for runtime type checking. The type hints are already present, but the decorator is missing.

  • Add the from beartype import beartype import
  • Add the @beartype decorator above the extract_graph_from_code function (see the sketch below)
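
A minimal sketch of the change, assuming the module's existing imports (DocumentChunk, Type, BaseModel) stay as they are and only the beartype pieces are added:

from beartype import beartype

@beartype
async def extract_graph_from_code(data_chunks: list[DocumentChunk], graph_model: Type[BaseModel]):
    # Body unchanged; beartype now validates the argument types at call time.
    ...
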
🔗 Analysis chain

Verify beartype integration and pipeline compatibility

Since this PR is about testing beartype in the cognify pipeline, we should verify the type hints are compatible with beartype's runtime type checking.


Let me gather more information about the beartype integration and type hints in the related files.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if beartype is properly integrated and if there are any type-related issues

# Look for beartype usage in the codebase
echo "Checking beartype integration..."
rg -l "beartype" 

# Check for type hint usage in related files
echo "Checking type hint usage in related files..."
rg -l "from typing import|import typing" "cognee/tasks/graph/"

# Look for any existing beartype decorators
echo "Checking for beartype decorators..."
rg "@beartype" "cognee/tasks/graph/"

Length of output: 772


Script:

#!/bin/bash
# Check the content of extract_graph_from_data.py to see how beartype is used
echo "Checking extract_graph_from_data.py implementation..."
cat cognee/tasks/graph/extract_graph_from_data.py

# Check pyproject.toml for beartype configuration
echo -e "\nChecking pyproject.toml for beartype configuration..."
cat pyproject.toml | grep -A 5 -B 5 "beartype"

# Check add_data_points.py since it's imported in the file under review
echo -e "\nChecking add_data_points implementation..."
cat cognee/tasks/storage/add_data_points.py

Length of output: 6204

cognee/infrastructure/engine/models/DataPoint.py (3)

17-18: ⚠️ Potential issue

Decide on private attribute handling

The commented-out Config class suggests incomplete private attribute handling. Either:

  1. Uncomment and implement proper private attribute handling, or
  2. Remove the commented code and document the decision

If keeping private attributes:

-    # class Config:
-    #     underscore_attrs_are_private = True
+    class Config:
+        underscore_attrs_are_private = True
+        validate_assignment = True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    class Config:
        underscore_attrs_are_private = True
        validate_assignment = True

10-15: ⚠️ Potential issue

Fix mutable default and enhance metadata validation

Several potential issues to address:

  1. Using a mutable default value for _metadata is dangerous
  2. The private attribute handling is incomplete
  3. Missing validation for metadata structure

Apply these changes:

 class DataPoint(BaseModel):
     id: UUID = Field(default_factory = uuid4)
     updated_at: Optional[datetime] = datetime.now(timezone.utc)
-    _metadata: Optional[MetaData] = {
-        "index_fields": []
-    }
+    _metadata: Optional[MetaData] = Field(
+        default_factory=lambda: {"index_fields": []},
+        description="Metadata containing indexable field names"
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

class DataPoint(BaseModel):
    id: UUID = Field(default_factory = uuid4)
    updated_at: Optional[datetime] = datetime.now(timezone.utc)
    _metadata: Optional[MetaData] = Field(
        default_factory=lambda: {"index_fields": []},
        description="Metadata containing indexable field names"
    )

20-24: 🛠️ Refactor suggestion

Improve method documentation, typing, and error handling

The method needs several improvements:

  1. Missing return type hint and docstring
  2. No error handling for invalid attribute access
  3. Complex condition could be simplified
-    def get_embeddable_data(self):
+    def get_embeddable_data(self) -> Optional[Any]:
+        """Get the value of the first index field specified in metadata.
+
+        Returns:
+            Optional[Any]: Value of the first index field if it exists, None otherwise
+        """
+        if not self._metadata or not self._metadata["index_fields"]:
+            return None
+
+        first_field = self._metadata["index_fields"][0]
+        try:
+            return getattr(self, first_field)
+        except AttributeError:
+            return None
-        if self._metadata and len(self._metadata["index_fields"]) > 0 \
-            and hasattr(self, self._metadata["index_fields"][0]):
-
-            return getattr(self, self._metadata["index_fields"][0])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def get_embeddable_data(self) -> Optional[Any]:
        """Get the value of the first index field specified in metadata.

        Returns:
            Optional[Any]: Value of the first index field if it exists, None otherwise
        """
        if not self._metadata or not self._metadata["index_fields"]:
            return None

        first_field = self._metadata["index_fields"][0]
        try:
            return getattr(self, first_field)
        except AttributeError:
            return None
cognee/infrastructure/databases/vector/config.py (1)

11-11: 💡 Codebase verification

🛠️ Refactor suggestion

Port configuration needs alignment with FalkorDB's standard port

The current default port 1234 should be changed as the codebase shows:

  1. FalkorDB adapter is a supported vector database provider
  2. QDrant adapter uses its standard port (6333)
  3. The port value is passed through to the FalkorDB driver's connection

Recommend changing to:

- vector_db_port: int = 1234
+ vector_db_port: int = int(os.getenv("VECTOR_DB_PORT", "6334"))  # FalkorDB default port
+
+ @field_validator("vector_db_port")
+ def validate_port(cls, v: int) -> int:
+     if not 0 <= v <= 65535:
+         raise ValueError("Port must be between 0 and 65535")
+     return v
🔗 Analysis chain

Consider improving port configuration and validation

The hard-coded default port 1234 raises several concerns:

  1. It's not a standard port for any common database system
  2. No validation ensures the port is within valid range (0-65535)
  3. Hard-coded ports can cause conflicts in containerized environments

Consider these improvements:

- vector_db_port: int = 1234
+ vector_db_port: int = int(os.getenv("VECTOR_DB_PORT", "6334"))  # FalkorDB default port
+
+ @field_validator("vector_db_port")
+ def validate_port(cls, v: int) -> int:
+     if not 0 <= v <= 65535:
+         raise ValueError("Port must be between 0 and 65535")
+     return v

Let's check if this port is used consistently:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for port usage across the codebase
rg -A 2 "(?:port.*1234|1234.*port)" --type py

Length of output: 293


Script:

#!/bin/bash
# Let's check for any FalkorDB or vector database related configurations and ports
rg -i "falkordb|vector.*db.*port" --type py -A 2

# Also check for any environment variable usage in configuration
rg "os\.getenv|environ" cognee/infrastructure/databases/vector/config.py -A 2

# Search for any database adapter implementations to understand port usage
fd "adapter|client" -e py --exec rg -l "port.*=|connect.*port"

Length of output: 5991


Script:

#!/bin/bash
# Let's check the specific port usage in FalkorDB adapter and other relevant files
rg -A 3 "port.*=|connect.*port" ./cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py ./cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py ./cognee/api/client.py

# Check if there are any default port configurations in these adapters
rg "DEFAULT.*PORT|port.*=.*\d+" ./cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py ./cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py

# Let's also check the vector engine creation to see how ports are handled
rg -A 3 "database_port" ./cognee/infrastructure/databases/vector/create_vector_engine.py

Length of output: 2452

cognee/tasks/chunks/remove_disconnected_chunks.py (1)

1-2: 💡 Codebase verification

Update import path in cognee/modules/data/processing/has_new_chunks.py

Found one file still using the old import path:

  • cognee/modules/data/processing/has_new_chunks.py needs to be updated to use from cognee.modules.chunking.models.DocumentChunk import DocumentChunk
🔗 Analysis chain

Verify the new import path usage across the codebase

The import path for DocumentChunk has been updated. Let's ensure this change is consistent across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining old import paths and verify new import path usage
# Expected: All imports should use the new path pattern

echo "Checking for old import pattern..."
rg "from cognee.modules.chunking import DocumentChunk"

echo "Verifying new import pattern usage..."
rg "from cognee.modules.chunking.models.DocumentChunk import DocumentChunk"

Length of output: 1480

cognee/tasks/documents/classify_documents.py (3)

8-14: 🛠️ Refactor suggestion

Refactor the document classification logic for better readability and maintainability

The current implementation using chained ternary operators is hard to read and maintain. Consider using a more explicit approach with a mapping or match statement.

Here's a suggested refactor:

-    documents = [
-        PdfDocument(id = data_item.id, name=f"{data_item.name}.{data_item.extension}", raw_data_location=data_item.raw_data_location) if data_item.extension == "pdf" else
-        AudioDocument(id = data_item.id, name=f"{data_item.name}.{data_item.extension}", raw_data_location=data_item.raw_data_location) if data_item.extension == "audio" else
-        ImageDocument(id = data_item.id, name=f"{data_item.name}.{data_item.extension}", raw_data_location=data_item.raw_data_location) if data_item.extension == "image" else
-        TextDocument(id = data_item.id, name=f"{data_item.name}.{data_item.extension}", raw_data_location=data_item.raw_data_location)
-        for data_item in data_documents
-    ]
+    DOCUMENT_TYPES = {
+        "pdf": PdfDocument,
+        "audio": AudioDocument,
+        "image": ImageDocument
+    }
+    
+    def create_document(data_item: Data) -> Document:
+        document_class = DOCUMENT_TYPES.get(data_item.extension, TextDocument)
+        return document_class(
+            id=data_item.id,
+            name=f"{data_item.name}.{data_item.extension}",
+            raw_data_location=data_item.raw_data_location
+        )
+    
+    documents = [create_document(data_item) for data_item in data_documents]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    DOCUMENT_TYPES = {
        "pdf": PdfDocument,
        "audio": AudioDocument,
        "image": ImageDocument
    }
    
    def create_document(data_item: Data) -> Document:
        document_class = DOCUMENT_TYPES.get(data_item.extension, TextDocument)
        return document_class(
            id=data_item.id,
            name=f"{data_item.name}.{data_item.extension}",
            raw_data_location=data_item.raw_data_location
        )
    
    documents = [create_document(data_item) for data_item in data_documents]

6-16: ⚠️ Potential issue

Add input validation and error handling

The function lacks validation for:

  1. Input data_documents being empty or None
  2. Required attributes (id, name, extension, raw_data_location) being None
  3. Unsupported file extensions

Here's a suggested implementation with proper validation:

 @beartype
 def classify_documents(data_documents: list[Data]) -> list[Document]:
+    if not data_documents:
+        raise ValueError("data_documents cannot be empty")
+
+    SUPPORTED_EXTENSIONS = {"pdf", "audio", "image"}
+    
+    def validate_data_item(data_item: Data) -> None:
+        if not all([data_item.id, data_item.name, data_item.extension, data_item.raw_data_location]):
+            raise ValueError(f"Missing required attributes for data_item: {data_item}")
+        if data_item.extension not in SUPPORTED_EXTENSIONS and not isinstance(data_item, TextDocument):
+            raise ValueError(f"Unsupported file extension: {data_item.extension}")
+
+    for data_item in data_documents:
+        validate_data_item(data_item)

     # Rest of the implementation...

Committable suggestion skipped: line range outside the PR's diff.


9-11: 💡 Codebase verification

Extension handling needs improvement

The review comment is correct. The current implementation has several issues:

  1. There's a mismatch between file type detection and document classification:

    • guess_file_type.py only defines custom matchers for "txt" and "pdf" files
    • classify_documents.py, however, assumes generic "audio" and "image" extensions, which aren't properly validated
  2. No proper mapping exists between detected MIME types/extensions and document types

  3. The file type detection relies on the filetype library's defaults for audio/image files, but the document classification uses overly generic extensions

Recommendations:

  • Define specific file type matchers for supported audio (mp3, wav, etc.) and image (jpg, png, etc.) formats
  • Create a configuration mapping between file extensions and document types (see the sketch below)
  • Add validation to ensure only supported extensions are processed
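
A minimal sketch of what such a mapping plus validation could look like, assuming it lives in classify_documents.py where the document classes are already imported; the EXTENSION_TO_DOCUMENT name, the helper, and the exact extension list are illustrative, not existing code:

# Illustrative whitelist of concrete extensions mapped to document types.
EXTENSION_TO_DOCUMENT = {
    "pdf": PdfDocument,
    "mp3": AudioDocument,
    "wav": AudioDocument,
    "jpg": ImageDocument,
    "jpeg": ImageDocument,
    "png": ImageDocument,
    "txt": TextDocument,
}

def resolve_document_type(extension: str):
    # Reject anything outside the explicit whitelist instead of silently
    # falling back to TextDocument.
    document_class = EXTENSION_TO_DOCUMENT.get(extension.lower())
    if document_class is None:
        raise ValueError(f"Unsupported file extension: {extension}")
    return document_class
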
🔗 Analysis chain

Review the extension matching logic

The current extension matching using "audio" and "image" is too generic. Consider:

  1. Using specific file extensions (e.g., mp3, wav, jpg, png)
  2. Maintaining a configuration of supported extensions

Let's check how these extensions are used in the codebase:


Let me check the document types implementation to see if there's any validation or configuration of supported extensions.


Let me check one more thing to see how file types are determined and if there's any mapping between extensions and document types.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for file extension handling in the codebase
echo "Searching for extension-related code..."
rg -l "extension.*audio|extension.*image" --type py

echo "Searching for AudioDocument usage..."
rg "AudioDocument" --type py -C 2

echo "Searching for ImageDocument usage..."
rg "ImageDocument" --type py -C 2

Length of output: 4560


Script:

#!/bin/bash
# Check for extension validation or configuration
echo "Checking Document base class..."
rg -A 10 "class Document" --type py

echo "Checking for extension configuration files..."
fd -e yaml -e json -e toml -e ini

echo "Checking for extension validation in data models..."
rg "extension.*=|extension.*validate" --type py

Length of output: 5532


Script:

#!/bin/bash
# Check file type handling
echo "Checking file type implementations..."
cat cognee/infrastructure/files/utils/guess_file_type.py

echo "Checking file metadata extraction..."
cat cognee/infrastructure/files/utils/get_file_metadata.py

Length of output: 2017

cognee/modules/data/operations/detect_language.py (1)

8-8: 💡 Codebase verification

Croatian to Serbian mapping requires documentation and configuration

The language mapping from Croatian ("hr") to Serbian ("sr") is hardcoded in the detect_language.py function, but this significant business logic decision:

  • Is not documented in any configuration files
  • Is not explained in the codebase beyond a simple comment
  • Has implications for the translation service which explicitly uses "sr" as the default source language

Recommendations:

  • Move the language mapping to a configuration file
  • Document the business reasoning behind this mapping
  • Add tests to verify this behavior
  • Consider making the mapping configurable for different use cases (see the sketch below)
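
A minimal sketch of a configurable mapping; the LANGUAGE_CODE_OVERRIDES variable name and helper are illustrative assumptions, not existing configuration:

import json
import os

# Overrides read from an environment variable (variable name is an assumption),
# defaulting to the current hard-coded hr -> sr behaviour.
LANGUAGE_OVERRIDES: dict = json.loads(os.getenv("LANGUAGE_CODE_OVERRIDES", '{"hr": "sr"}'))

def normalize_language_code(detected_language: str) -> str:
    # Return the mapped code when an override exists, otherwise pass it through.
    return LANGUAGE_OVERRIDES.get(detected_language, detected_language)
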
🔗 Analysis chain

Verify Croatian to Serbian language mapping requirement

The automatic mapping from Croatian ("hr") to Serbian ("sr") seems to be a significant business logic decision. Please verify if this is an intentional requirement and if it should be documented elsewhere (e.g., in configuration or requirements).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other instances of this mapping to ensure consistency
rg -l "hr.*sr" --type py

# Check if this is documented in README or similar files
fd -e md -e rst -e txt | xargs rg -l "Croatian.*Serbian|hr.*sr"

Length of output: 142


Script:

#!/bin/bash
# Let's see the actual implementation and context of the language mapping
rg -A 10 -B 10 "hr.*sr" cognee/modules/data/operations/detect_language.py

# Look for any configuration files that might contain language mappings
fd -e yaml -e json -e toml -e ini | xargs rg -l "language|hr|sr|Croatian|Serbian"

# Check for any test files related to language detection
fd -e py | grep -i "test.*language" | xargs cat

# Look for any documentation strings or comments about language mapping
rg -g "*.py" -l "Croatian|Serbian" 

Length of output: 1606


Script:

#!/bin/bash
# Let's check the translate_text.py file for any related context
cat cognee/modules/data/operations/translate_text.py

# Let's also check the configuration files for any language-related settings
cat pyproject.toml
cat cognee/modules/data/extraction/texts.json

Length of output: 7598

cognee/modules/graph/utils/get_model_instance_from_graph.py (4)

29-29: ⚠️ Potential issue

Add validation for entity_id

The return statement doesn't validate if the entity_id exists in the node_map.

Apply this improvement:

-    return node_map[entity_id]
+    if entity_id not in node_map:
+        raise KeyError(f"Entity ID {entity_id} not found in node map")
+    return node_map[entity_id]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if entity_id not in node_map:
        raise KeyError(f"Entity ID {entity_id} not found in node map")
    return node_map[entity_id]

7-10: ⚠️ Potential issue

Add validation for node IDs

The node map population doesn't handle potential duplicate node IDs or validate node.id existence.

Consider applying this improvement:

 def get_model_instance_from_graph(nodes: list[DataPoint], edges: list, entity_id: str):
     node_map = {}
 
     for node in nodes:
+        if not hasattr(node, 'id'):
+            raise ValueError(f"Node {node} missing required 'id' attribute")
+        if node.id in node_map:
+            raise ValueError(f"Duplicate node ID found: {node.id}")
         node_map[node.id] = node
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    node_map = {}

    for node in nodes:
        if not hasattr(node, 'id'):
            raise ValueError(f"Node {node} missing required 'id' attribute")
        if node.id in node_map:
            raise ValueError(f"Duplicate node ID found: {node.id}")
        node_map[node.id] = node

12-19: ⚠️ Potential issue

Add edge validation and safe dictionary access

The edge processing lacks proper validation and error handling.

Apply these improvements:

     for edge in edges:
+        if len(edge) < 3:
+            raise ValueError(f"Edge {edge} missing required elements (source, target, label)")
         source_node = node_map[edge[0]]
         target_node = node_map[edge[1]]
         edge_label = edge[2]
         edge_properties = edge[3] if len(edge) == 4 else {}
-        edge_metadata = edge_properties.get("metadata", {})
-        edge_type = edge_metadata.get("type")
+        edge_metadata = edge_properties.get("metadata", {}) if isinstance(edge_properties, dict) else {}
+        edge_type = edge_metadata.get("type") if isinstance(edge_metadata, dict) else None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    for edge in edges:
        if len(edge) < 3:
            raise ValueError(f"Edge {edge} missing required elements (source, target, label)")
        source_node = node_map[edge[0]]
        target_node = node_map[edge[1]]
        edge_label = edge[2]
        edge_properties = edge[3] if len(edge) == 4 else {}
        edge_metadata = edge_properties.get("metadata", {}) if isinstance(edge_properties, dict) else {}
        edge_type = edge_metadata.get("type") if isinstance(edge_metadata, dict) else None

20-27: 🛠️ Refactor suggestion

Improve model creation logic and add error handling

The model creation logic needs validation and optimization.

Consider these improvements:

+        if edge[0] not in node_map:
+            raise KeyError(f"Source node {edge[0]} not found in node map")
+        if edge[1] not in node_map:
+            raise KeyError(f"Target node {edge[1]} not found in node map")
+
         if edge_type == "list":
             NewModel = copy_model(type(source_node), { edge_label: (list[type(target_node)], PydanticUndefined) })
-
-            node_map[edge[0]] = NewModel(**source_node.model_dump(), **{ edge_label: [target_node] })
+            try:
+                existing_list = getattr(source_node, edge_label, [])
+                if not isinstance(existing_list, list):
+                    existing_list = []
+                node_map[edge[0]] = NewModel(
+                    **source_node.model_dump(),
+                    **{edge_label: existing_list + [target_node]}
+                )
+            except Exception as e:
+                raise ValueError(f"Failed to create list model for edge {edge}: {str(e)}")
         else:
             NewModel = copy_model(type(source_node), { edge_label: (type(target_node), PydanticUndefined) })
-
-            node_map[edge[0]] = NewModel(**source_node.model_dump(), **{ edge_label: target_node })
+            try:
+                node_map[edge[0]] = NewModel(
+                    **source_node.model_dump(),
+                    **{edge_label: target_node}
+                )
+            except Exception as e:
+                raise ValueError(f"Failed to create model for edge {edge}: {str(e)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        if edge[0] not in node_map:
            raise KeyError(f"Source node {edge[0]} not found in node map")
        if edge[1] not in node_map:
            raise KeyError(f"Target node {edge[1]} not found in node map")

        if edge_type == "list":
            NewModel = copy_model(type(source_node), { edge_label: (list[type(target_node)], PydanticUndefined) })
            try:
                existing_list = getattr(source_node, edge_label, [])
                if not isinstance(existing_list, list):
                    existing_list = []
                node_map[edge[0]] = NewModel(
                    **source_node.model_dump(),
                    **{edge_label: existing_list + [target_node]}
                )
            except Exception as e:
                raise ValueError(f"Failed to create list model for edge {edge}: {str(e)}")
        else:
            NewModel = copy_model(type(source_node), { edge_label: (type(target_node), PydanticUndefined) })
            try:
                node_map[edge[0]] = NewModel(
                    **source_node.model_dump(),
                    **{edge_label: target_node}
                )
            except Exception as e:
                raise ValueError(f"Failed to create model for edge {edge}: {str(e)}")
cognee/infrastructure/databases/graph/graph_db_interface.py (1)

6-7: 💡 Codebase verification

Interface change has been properly implemented across adapters

The verification shows that the interface change has been correctly handled:

  1. Found two implementations of GraphDBInterface:
    • Neo4jAdapter: Fully implements the query method with proper error handling and type safety
    • NetworkXAdapter: Has a placeholder implementation that needs to be completed

The architectural change is consistent across the codebase, with no remaining references to the old graph() method. The Neo4j implementation properly uses the new query interface, while the NetworkX adapter acknowledges the interface requirement but still needs the implementation completed.
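
For reference, the contract both adapters implement looks roughly like this; a sketch based on the signature described above, with the abstract-base-class form assumed rather than copied from the file:

from abc import ABC, abstractmethod

class GraphDBInterface(ABC):
    @abstractmethod
    async def query(self, query: str, params: dict):
        """Run a raw query against the underlying graph store."""
        ...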

🔗 Analysis chain

Verify the impact of interface change across implementations

The change from graph() to query(query: str, params: dict) represents a significant architectural shift. While this makes the interface more flexible for different query types, it also requires careful updates to all implementing classes.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Find all potential implementations of GraphDBInterface
echo "Searching for potential implementations of GraphDBInterface..."
rg -l "class.*\(.*GraphDBInterface.*\)" 

# Look for any remaining references to the old graph() method
echo "Checking for any remaining references to the old graph() method..."
rg "\.graph\(\)" 

Length of output: 361


Script:

#!/bin/bash
# Search for implementations using ast-grep for better accuracy
ast-grep --pattern 'class $_ implements GraphDBInterface { $$$ }'
ast-grep --pattern 'class $_(GraphDBInterface): $$$ '

# Search for files importing GraphDBInterface
echo "Files importing GraphDBInterface:"
rg "from.*GraphDBInterface.*import" --type py

# Search for query method implementations
echo "Query method implementations:"
ast-grep --pattern 'async def query(self, query: str, params: dict):
    $$$'

Length of output: 70634

cognee/infrastructure/databases/graph/get_graph_engine.py (1)

23-33: ⚠️ Potential issue

Add error handling for falkordb initialization

Unlike the neo4j case, there's no error handling for the falkordb initialization. Both the get_embedding_engine() call and the FalkorDBAdapter instantiation could fail.

Consider wrapping the initialization in a try-except block similar to neo4j:

    elif config.graph_database_provider == "falkordb":
+       try:
            from cognee.infrastructure.databases.vector.embeddings import get_embedding_engine
            from cognee.infrastructure.databases.hybrid.falkordb.FalkorDBAdapter import FalkorDBAdapter

            embedding_engine = get_embedding_engine()

            return FalkorDBAdapter(
                database_url = config.graph_database_url,
                database_port = config.graph_database_port,
                embedding_engine = embedding_engine,
            )
+       except:
+           pass

Committable suggestion skipped: line range outside the PR's diff.

cognee/modules/storage/utils/__init__.py (3)

1-7: 🛠️ Refactor suggestion

Fix import ordering according to PEP 8

The pydantic import should be grouped with other third-party imports at the top of the file.

Apply this diff to fix the import ordering:

 import json
 from uuid import UUID
 from datetime import datetime
 from pydantic_core import PydanticUndefined
+from pydantic import create_model
 
 from cognee.infrastructure.engine import DataPoint
-
-from pydantic import create_model

Also applies to: 18-18


20-33: 🛠️ Refactor suggestion

Add input validation for model parameters

While the implementation is correct, it lacks validation for input parameters. This could lead to runtime errors if invalid inputs are provided.

Consider adding these validations:

 def copy_model(model: DataPoint, include_fields: dict = {}, exclude_fields: list = []):
+    if not (isinstance(model, type) and issubclass(model, DataPoint)):
+        raise TypeError("model must be a DataPoint subclass")
+    if not isinstance(include_fields, dict):
+        raise TypeError("include_fields must be a dictionary")
+    if not isinstance(exclude_fields, list):
+        raise TypeError("exclude_fields must be a list")
+
     fields = {
         name: (field.annotation, field.default if field.default is not None else PydanticUndefined)
             for name, field in model.model_fields.items()
             if name not in exclude_fields
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def copy_model(model: DataPoint, include_fields: dict = {}, exclude_fields: list = []):
    if not (isinstance(model, type) and issubclass(model, DataPoint)):
        raise TypeError("model must be a DataPoint subclass")
    if not isinstance(include_fields, dict):
        raise TypeError("include_fields must be a dictionary")
    if not isinstance(exclude_fields, list):
        raise TypeError("exclude_fields must be a list")

    fields = {
        name: (field.annotation, field.default if field.default is not None else PydanticUndefined)
            for name, field in model.model_fields.items()
            if name not in exclude_fields
    }

    final_fields = {
        **fields,
        **include_fields
    }

    return create_model(model.__name__, **final_fields)

34-46: ⚠️ Potential issue

Fix potential IndexError and simplify type checks

The current implementation has a potential bug when handling empty lists and could be simplified.

Apply these improvements:

 def get_own_properties(data_point: DataPoint):
     properties = {}
 
     for field_name, field_value in data_point:
-        if field_name == "_metadata" \
-            or isinstance(field_value, dict) \
-            or isinstance(field_value, DataPoint) \
-            or (isinstance(field_value, list) and isinstance(field_value[0], DataPoint)):
+        if field_name == "_metadata":
+            continue
+        
+        # Skip nested structures
+        if isinstance(field_value, (dict, DataPoint)):
+            continue
+            
+        # Safely check for list of DataPoints
+        if isinstance(field_value, list) and field_value and isinstance(field_value[0], DataPoint):
             continue
 
         properties[field_name] = field_value
 
     return properties
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def get_own_properties(data_point: DataPoint):
    properties = {}

    for field_name, field_value in data_point:
        if field_name == "_metadata":
            continue
        
        # Skip nested structures
        if isinstance(field_value, (dict, DataPoint)):
            continue
            
        # Safely check for list of DataPoints
        if isinstance(field_value, list) and field_value and isinstance(field_value[0], DataPoint):
            continue

        properties[field_name] = field_value

    return properties
cognee/infrastructure/databases/vector/vector_db_interface.py (1)

3-3: 💡 Codebase verification

Import path inconsistency detected

The import in vector_db_interface.py is incorrect. While DataPoint exists, it's actually located in cognee.infrastructure.engine.models.DataPoint, not directly in cognee.infrastructure.engine. The current import path from cognee.infrastructure.engine import DataPoint works only because of a re-export in __init__.py, but it should be imported from its actual location for consistency.

  • Update import in vector_db_interface.py to from cognee.infrastructure.engine.models.DataPoint import DataPoint
🔗 Analysis chain

Verify the DataPoint import path change

The import path change from relative to absolute suggests a structural reorganization. Let's verify the new location exists and is being used consistently across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the DataPoint class exists in the new location and check for any remaining old imports

# Check if DataPoint exists in the new location
echo "Checking new DataPoint location..."
ast-grep --pattern 'class DataPoint' cognee/infrastructure/engine/

# Check for any remaining old-style imports that need updating
echo "Checking for remaining old-style imports..."
rg "from .+models.DataPoint import" --type py

# Check for any direct references to the old location
echo "Checking for references to old location..."
fd "models/DataPoint.py" 

Length of output: 493


Script:

#!/bin/bash
# Let's verify the actual location and content of DataPoint class

# Check the content of engine/__init__.py to understand the import chain
echo "Checking engine/__init__.py content..."
cat cognee/infrastructure/engine/__init__.py

# Check if DataPoint exists in engine/models directory
echo -e "\nChecking for DataPoint in engine/models..."
fd DataPoint.py cognee/infrastructure/engine/models/

# If it exists, check its content
echo -e "\nChecking DataPoint implementation..."
find cognee/infrastructure/engine/models/ -name "DataPoint.py" -exec cat {} \;

Length of output: 1300

cognee/modules/data/operations/translate_text.py (3)

20-24: 🛠️ Refactor suggestion

Enhance input validation for AWS Translate constraints.

AWS Translate has specific limitations that should be validated:

  • Maximum text length (5,000 bytes per request)
  • Supported language pairs
+from typing import Optional
+
+MAX_TEXT_BYTES = 5000
+
 async def translate_text(...):
     if not text:
         raise ValueError("No text to translate.")
+    
+    text_bytes = text.encode('utf-8')
+    if len(text_bytes) > MAX_TEXT_BYTES:
+        raise ValueError(f"Text exceeds maximum length of {MAX_TEXT_BYTES} bytes")
 
     if not source_language or not target_language:
         raise ValueError("Source and target language codes are required.")
+    
+    # Add supported language pair validation
+    supported_languages = await get_supported_languages(region_name)
+    if source_language not in supported_languages or target_language not in supported_languages:
+        raise ValueError("Unsupported language pair")

Committable suggestion skipped: line range outside the PR's diff.


26-41: 🛠️ Refactor suggestion

Enhance error handling and add resilience mechanisms.

Consider adding:

  1. Retry mechanism for transient failures
  2. Timeout handling
  3. More specific error messages
  4. Proper AWS session management
+from boto3.session import Session
+from botocore.config import Config
+
 async def translate_text(...):
     try:
-        translate = boto3.client(service_name="translate", region_name=region_name, use_ssl=True)
+        session = Session()
+        config = Config(
+            retries=dict(
+                max_attempts=3,
+                mode='adaptive'
+            ),
+            connect_timeout=5,
+            read_timeout=10
+        )
+        translate = session.client(
+            service_name="translate",
+            region_name=region_name,
+            config=config
+        )
         result = translate.translate_text(
             Text=text,
             SourceLanguageCode=source_language,
             TargetLanguageCode=target_language,
         )
-        yield result.get("TranslatedText", "No translation found.")
+        translated_text = result.get("TranslatedText")
+        if not translated_text:
+            logger.warning("No translation returned from AWS Translate")
+            yield None
+        else:
+            yield translated_text
 
     except BotoCoreError as e:
-        logger.error(f"BotoCoreError occurred: {e}")
+        logger.error(f"AWS service error occurred: {str(e)}", exc_info=True)
         yield None
 
     except ClientError as e:
-        logger.error(f"ClientError occurred: {e}")
+        error_code = e.response['Error']['Code']
+        error_message = e.response['Error']['Message']
+        logger.error(f"AWS client error occurred: {error_code} - {error_message}", exc_info=True)
         yield None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    try:
        session = Session()
        config = Config(
            retries=dict(
                max_attempts=3,
                mode='adaptive'
            ),
            connect_timeout=5,
            read_timeout=10
        )
        translate = session.client(
            service_name="translate",
            region_name=region_name,
            config=config
        )
        result = translate.translate_text(
            Text=text,
            SourceLanguageCode=source_language,
            TargetLanguageCode=target_language,
        )
        translated_text = result.get("TranslatedText")
        if not translated_text:
            logger.warning("No translation returned from AWS Translate")
            yield None
        else:
            yield translated_text

    except BotoCoreError as e:
        logger.error(f"AWS service error occurred: {str(e)}", exc_info=True)
        yield None

    except ClientError as e:
        error_code = e.response['Error']['Code']
        error_message = e.response['Error']['Message']
        logger.error(f"AWS client error occurred: {error_code} - {error_message}", exc_info=True)
        yield None

5-15: 💡 Codebase verification

🛠️ Refactor suggestion

AWS region configuration should be externalized to environment variables

The codebase shows a pattern of using environment variables and .env files for configuration, but the AWS region is hardcoded in both translate_text.py and fetch_secret.py. Align with the existing configuration pattern by:

  • Moving AWS region to environment variables
  • Using the established pydantic settings pattern for AWS configuration
  • Defaulting to environment variable if provided, falling back to "eu-west-1" if not set
-async def translate_text(text, source_language: str = "sr", target_language: str = "en", region_name = "eu-west-1"):
+async def translate_text(
+    text: str,
+    source_language: str = "sr",
+    target_language: str = "en",
+    region_name: str = os.getenv("AWS_REGION", "eu-west-1")
+):
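
If the pydantic settings pattern mentioned above is preferred over a bare os.getenv call, a minimal sketch could look like this, assuming pydantic-settings is available as the existing config classes suggest; the AWSConfig name and its field are illustrative, not existing code:

from pydantic_settings import BaseSettings, SettingsConfigDict

class AWSConfig(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    # Reads AWS_REGION from the environment or the .env file and falls back
    # to "eu-west-1" when it is not set.
    aws_region: str = "eu-west-1"

aws_config = AWSConfig()
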
🔗 Analysis chain

Consider externalizing AWS configuration.

The AWS region is hardcoded. Consider moving it to environment variables or configuration file for better maintainability and flexibility across different environments.

Let's check if AWS configuration exists elsewhere in the codebase:

Also, consider validating language codes against ISO 639-2 standards:

+from typing import Optional
+
+def is_valid_iso639_2(lang_code: str) -> bool:
+    # Add ISO 639-2 validation logic
+    pass
+
 async def translate_text(
-    text, 
-    source_language: str = "sr", 
-    target_language: str = "en", 
-    region_name = "eu-west-1"
+    text: str,
+    source_language: str = "sr",
+    target_language: str = "en",
+    region_name: Optional[str] = None
 ):
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for AWS configuration patterns
rg -l "AWS_REGION|aws_region|region_name" --type py

Length of output: 125


Script:

#!/bin/bash
# Check the content of fetch_secret.py for AWS configuration patterns
cat cognee/fetch_secret.py

Length of output: 1806


Script:

#!/bin/bash
# Check for environment variable usage in both files
rg "os\.environ|getenv" cognee/fetch_secret.py cognee/modules/data/operations/translate_text.py

Length of output: 162


Script:

#!/bin/bash
# Look for any config files or settings modules
fd -e py "config|settings"

Length of output: 720


Script:

#!/bin/bash
# Check base_config.py for AWS configuration
cat cognee/base_config.py

Length of output: 867


Script:

#!/bin/bash
# Check infrastructure config files for AWS settings
cat cognee/infrastructure/databases/vector/config.py
cat cognee/infrastructure/databases/relational/config.py

Length of output: 1998

.github/workflows/test_python_3_9.yml (1)

11-11: ⚠️ Potential issue

Fix concurrency group syntax and improve label handling

There are several issues with the current concurrency configuration:

  1. The | separator in the group name might cause parsing issues
  2. The label check might fail for non-labeled events (like workflow_dispatch)

Apply this fix:

-  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | ${{ github.event.label.name == 'run-checks' }}
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-${{ github.event.label.name == 'run-checks' }}

Also consider:

  1. Adding a condition to handle non-labeled events:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-${{ github.event.label && github.event.label.name == 'run-checks' || 'manual' }}
  2. Moving the label name to an environment variable for better maintainability
.github/workflows/test_python_3_10.yml (1)

4-8: 💡 Codebase verification

Workflow trigger configuration is consistent but restrictive

The verification confirms that all Python test workflows (3.9, 3.10, and 3.11) have identical trigger configurations:

  • All workflows only trigger on labeled events
  • Manual triggers via workflow_dispatch are available
  • Consistent branch restriction to 'main'

This consistent implementation across all test workflows indicates an intentional design choice rather than an oversight. However, the original concern about delayed feedback due to requiring labels remains valid.

  • All test workflows (test_python_3.9.yml, test_python_3.10.yml, test_python_3.11.yml) require the 'labeled' event trigger
  • This configuration may delay test feedback as it depends on manual label application
  • Consider implementing one of these alternatives:
    1. Add [opened, synchronize] to the types array to also run on PR creation and updates
    2. Move the label check to job-level conditions while keeping automatic PR triggers
🔗 Analysis chain

Verify the impact of restricted PR triggers

The workflow now only triggers on labeled events, which means tests won't run automatically when PRs are created or updated. This could lead to delays in feedback if the 'run-checks' label isn't applied promptly.

Consider one of these alternatives:

  1. Keep automatic triggers and use the label as an additional condition in job-level if statements
  2. Document the required label in the repository's contributing guidelines

Run this script to verify other workflow files for consistency:


This script focuses specifically on the 'on' section of each workflow file, giving a clearer picture of whether the trigger restrictions are consistent across all Python test workflows.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if other test workflows have similar trigger restrictions
# Expected: All test workflows should have consistent trigger configurations

echo "Checking workflow trigger configurations:"
for workflow in $(fd "test_python" .github/workflows/); do
  echo "=== $workflow ==="
  rg "on:|workflow_dispatch:|pull_request:|types:" "$workflow" -A 5
done

Length of output: 1405


Script:

#!/bin/bash
# Let's check the workflow trigger configurations with a different approach
echo "=== Checking workflow triggers ==="
for workflow in .github/workflows/test_python_*.yml; do
    echo -e "\n## $workflow ##"
    # Extract the 'on' section and its immediate contents
    awk '/^on:/{p=1;print;next} p&&/^[a-z]/{p=0}p' "$workflow"
done

Length of output: 1086

.github/workflows/test_python_3_11.yml (2)

11-11: ⚠️ Potential issue

Fix the concurrency group configuration syntax

The current concurrency group configuration has syntax issues that could cause unexpected behavior:

  1. The | character in the group name might be interpreted as an OR operator
  2. The label condition is appended as a string rather than being part of the condition
-  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | ${{ github.event.label.name == 'run-checks' }}
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-${{ github.event.label.name == 'run-checks' }}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-${{ github.event.label.name == 'run-checks' }}

4-8: 💡 Codebase verification

🛠️ Refactor suggestion

Add missing label condition to match other workflow files

The workflow should include the same label condition as other test workflows in the repository. Add:

  pull_request:
    branches:
      - main
    types: [labeled]
+   if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' | ${{ github.event.label.name == 'run-checks' }}
  • The workflow runs on the 'run-checks' label, consistent with other test workflows
  • It also includes the docs-change condition to skip tests when only docs are modified
🔗 Analysis chain

Specify the required label in the workflow trigger configuration

The workflow is configured to run only on labeled events, but it doesn't specify which label triggers the workflow. This could lead to confusion or accidental triggers with the wrong label.

  pull_request:
    branches:
      - main
    types: [labeled]
+   if: github.event.label.name == 'run-checks'

Let's verify if other workflows use consistent label triggers:


Let me check if any of these workflows specify the required label in their configuration.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistency in label-based triggers across workflows
# Expected: All test workflows should use the same label trigger pattern

# Search for label-related configurations in all workflow files
rg -U "types.*labeled" .github/workflows/

Length of output: 502


Script:

#!/bin/bash
# Check for label conditions in workflows
rg -U "if:.*label" .github/workflows/

Length of output: 809

cognee/infrastructure/engine/__tests__/model_to_graph_to_model.test.py (3)

7-8: 🛠️ Refactor suggestion

Convert to proper test framework structure

This file is located in a __tests__ directory but uses a main guard pattern instead of a proper test framework. Consider restructuring this as a proper test case using pytest or unittest.

Example structure:

import pytest

def test_model_to_graph_conversion():
    # Setup test data
    boris = Person(...)
    
    # Test conversion
    nodes, edges = get_graph_from_model(boris)
    parsed_person = get_model_instance_from_graph(nodes, edges, 'boris')
    
    # Assertions
    assert parsed_person.id == boris.id
    assert parsed_person.name == boris.name
    # Add more assertions

65-72: ⚠️ Potential issue

Add proper assertions for conversion validation

The test only prints results without validating the conversion. Add proper assertions to verify the graph structure and parsed data.

# Verify nodes structure
assert len(nodes) == 3  # Person, Car, and CarType nodes
assert any(node.get('id') == 'boris' for node in nodes)

# Verify edges
assert any(edge['from'] == 'boris' and edge['relationship'] == 'owns_car' for edge in edges)

# Verify parsed data matches original
assert parsed_person.id == boris.id
assert parsed_person.name == boris.name
assert parsed_person.age == boris.age
assert len(parsed_person.owns_car) == len(boris.owns_car)
assert parsed_person.owns_car[0].brand == boris.owns_car[0].brand

68-70: ⚠️ Potential issue

Avoid fragile node access pattern

Accessing the last node by index is fragile and could break if the node order changes. Consider finding the node by id or type.

- person_data = nodes[len(nodes) - 1]
+ person_data = next(node for node in nodes if node.get('id') == 'boris')
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    person_data = next(node for node in nodes if node.get('id') == 'boris')

    parsed_person = get_model_instance_from_graph(nodes, edges, 'boris')
cognee/infrastructure/databases/vector/create_vector_engine.py (1)

51-57: 🛠️ Refactor suggestion

Multiple improvements needed for FalkorDB integration

  1. The deep relative import (..hybrid.falkordb) suggests suboptimal module organization. Consider restructuring to maintain consistency with other adapters.
  2. Unlike other providers (e.g., Weaviate), there's no validation for missing configuration.
  3. Parameter naming is inconsistent with other adapters (database_url vs url).

Consider applying these improvements:

 elif config["vector_db_provider"] == "falkordb":
-    from ..hybrid.falkordb.FalkorDBAdapter import FalkorDBAdapter
+    from .falkordb.FalkorDBAdapter import FalkorDBAdapter
+
+    if config["vector_db_url"] is None or config["vector_db_port"] is None:
+        raise EnvironmentError("FalkorDB is not configured!")
 
     return FalkorDBAdapter(
-        database_url = config["vector_db_url"],
-        database_port = config["vector_db_port"],
+        url = config["vector_db_url"],
+        port = config["vector_db_port"],
         embedding_engine = embedding_engine,
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    elif config["vector_db_provider"] == "falkordb":
        from .falkordb.FalkorDBAdapter import FalkorDBAdapter

        if config["vector_db_url"] is None or config["vector_db_port"] is None:
            raise EnvironmentError("FalkorDB is not configured!")

        return FalkorDBAdapter(
            url = config["vector_db_url"],
            port = config["vector_db_port"],
            embedding_engine = embedding_engine,
        )
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)

117-119: 🛠️ Refactor suggestion

Add error handling for collection creation

The method should handle cases where the collection might already exist and validate input parameters.

Consider adding error handling:

async def create_vector_index(self, index_name: str, index_property_name: str):
+   if not index_name or not index_property_name:
+       raise ValueError("index_name and index_property_name must not be empty")
+   collection_name = f"{index_name}_{index_property_name}"
+   if await self.has_collection(collection_name):
+       logger.warning(f"Collection {collection_name} already exists")
+       return
    await self.create_collection(f"{index_name}_{index_property_name}")

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (2)

126-127: ⚠️ Potential issue

Add error handling and input validation.

The method should validate inputs and handle potential errors:

  1. Validate index_name and index_property_name length and characters
  2. Add error handling for collection creation failures

Example implementation:

     async def create_vector_index(self, index_name: str, index_property_name: str):
+        if not index_name or not index_property_name:
+            raise ValueError("Index name and property name cannot be empty")
+        
+        collection_name = f"{index_name}_{index_property_name}"
+        if len(collection_name) > 255:  # Adjust limit based on Qdrant's requirements
+            raise ValueError("Combined index name exceeds maximum length")
+
+        try:
+            await self.create_collection(collection_name)
+        except Exception as e:
+            logger.error(f"Failed to create vector index: {str(e)}")
+            raise
-        await self.create_collection(f"{index_name}_{index_property_name}")

Committable suggestion skipped: line range outside the PR's diff.


129-135: ⚠️ Potential issue

Add validation for index fields and error handling.

The method makes assumptions about the data structure without proper validation:

  1. No check if data_point._metadata exists
  2. No validation of "index_fields" in metadata
  3. No error handling if the specified field doesn't exist

Suggested implementation:

     async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
+        def get_index_field(data_point: DataPoint) -> str:
+            if not hasattr(data_point, '_metadata'):
+                raise ValueError(f"Data point {data_point.id} missing _metadata")
+            if 'index_fields' not in data_point._metadata:
+                raise ValueError(f"Data point {data_point.id} missing index_fields in metadata")
+            field_name = data_point._metadata["index_fields"][0]
+            if not hasattr(data_point, field_name):
+                raise ValueError(f"Data point {data_point.id} missing required field {field_name}")
+            return getattr(data_point, field_name)
+
+        try:
             await self.create_data_points(f"{index_name}_{index_property_name}", [
                 IndexSchema(
                     id = data_point.id,
-                    text = getattr(data_point, data_point._metadata["index_fields"][0]),
+                    text = get_index_field(data_point),
                 ) for data_point in data_points
             ])
+        except Exception as e:
+            logger.error(f"Failed to index data points: {str(e)}")
+            raise
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
        def get_index_field(data_point: DataPoint) -> str:
            if not hasattr(data_point, '_metadata'):
                raise ValueError(f"Data point {data_point.id} missing _metadata")
            if 'index_fields' not in data_point._metadata:
                raise ValueError(f"Data point {data_point.id} missing index_fields in metadata")
            field_name = data_point._metadata["index_fields"][0]
            if not hasattr(data_point, field_name):
                raise ValueError(f"Data point {data_point.id} missing required field {field_name}")
            return getattr(data_point, field_name)

        try:
            await self.create_data_points(f"{index_name}_{index_property_name}", [
                IndexSchema(
                    id = data_point.id,
                    text = get_index_field(data_point),
                ) for data_point in data_points
            ])
        except Exception as e:
            logger.error(f"Failed to index data points: {str(e)}")
            raise
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (3)

159-163: ⚠️ Potential issue

Add UUID validation for ID conversion

The code assumes all IDs can be converted to UUID. This could raise ValueError if an ID string is not a valid UUID format.

Consider adding error handling:

# In retrieve method
return [
    ScoredResult(
-       id = UUID(result.id),
+       id = UUID(result.id) if isinstance(result.id, str) else result.id,
        payload = result.payload,
        score = 0
    ) for result in results
]

# In search method
return [
    ScoredResult(
-       id = UUID(str(row.id)),
+       id = UUID(str(row.id)) if not isinstance(row.id, UUID) else row.id,
        payload = row.payload,
        score = row.similarity
    ) for row in vector_list
]

Also applies to: 206-209


85-87: ⚠️ Potential issue

Add validation for empty data_points list

The code assumes data_points list is non-empty when accessing data_points[0]. This could lead to an IndexError if the list is empty.

Consider adding validation:

async def create_data_points(
    self, collection_name: str, data_points: List[DataPoint]
):
+   if not data_points:
+       raise ValueError("data_points list cannot be empty")
    async with self.get_async_session() as session:
        if not await self.has_collection(collection_name):
            await self.create_collection(
                collection_name=collection_name,
                payload_schema=type(data_points[0]),
            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def create_data_points(
    self, collection_name: str, data_points: List[DataPoint]
):
    if not data_points:
        raise ValueError("data_points list cannot be empty")
    async with self.get_async_session() as session:
        if not await self.has_collection(collection_name):
            await self.create_collection(
                collection_name=collection_name,
                payload_schema=type(data_points[0]),
            )

124-134: ⚠️ Potential issue

Add error handling for index field access

The index_data_points method assumes the first index field exists in data_point._metadata["index_fields"]. This could raise KeyError or IndexError if metadata is missing or malformed.

Consider adding validation:

async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
+   for data_point in data_points:
+       if not hasattr(data_point, '_metadata') or 'index_fields' not in data_point._metadata:
+           raise ValueError(f"Data point {data_point.id} missing required index metadata")
+       if not data_point._metadata['index_fields']:
+           raise ValueError(f"Data point {data_point.id} has empty index_fields")
+       if not hasattr(data_point, data_point._metadata['index_fields'][0]):
+           raise ValueError(f"Data point {data_point.id} missing required index field: {data_point._metadata['index_fields'][0]}")

    await self.create_data_points(f"{index_name}_{index_property_name}", [
        IndexSchema(
            id = data_point.id,
            text = getattr(data_point, data_point._metadata["index_fields"][0]),
        ) for data_point in data_points
    ])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def create_vector_index(self, index_name: str, index_property_name: str):
        await self.create_collection(f"{index_name}_{index_property_name}")

    async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
        for data_point in data_points:
            if not hasattr(data_point, '_metadata') or 'index_fields' not in data_point._metadata:
                raise ValueError(f"Data point {data_point.id} missing required index metadata")
            if not data_point._metadata['index_fields']:
                raise ValueError(f"Data point {data_point.id} has empty index_fields")
            if not hasattr(data_point, data_point._metadata['index_fields'][0]):
                raise ValueError(f"Data point {data_point.id} missing required index field: {data_point._metadata['index_fields'][0]}")

        await self.create_data_points(f"{index_name}_{index_property_name}", [
            IndexSchema(
                id = data_point.id,
                text = getattr(data_point, data_point._metadata["index_fields"][0]),
            ) for data_point in data_points
        ])
cognee/shared/SourceCodeGraph.py (1)

31-31: 🛠️ Refactor suggestion

Use List instead of list for type annotations.

In type annotations, it's recommended to use List from the typing module instead of the built-in list to support static type checking and maintain consistency.

Apply this diff to update the type annotation:

-has_methods: list["Function"]
+has_methods: List["Function"]

Committable suggestion skipped: line range outside the PR's diff.

cognee/modules/chunking/TextChunker.py (3)

43-53: ⚠️ Potential issue

Improve exception handling by logging and specifying exceptions

Catching all exceptions with except Exception as e can mask unexpected errors and hinder debugging. Additionally, using print(e) is inadequate for error reporting in production code. It's better to catch specific exceptions you anticipate and use the logging module to log errors appropriately. Re-raise the exception after logging to ensure that calling code can handle it.

Apply this diff to enhance error handling:

 try:
     yield DocumentChunk(
         ...
     )
-except Exception as e:
-    print(e)
+except SpecificException as e:
+    logger.exception("Error yielding DocumentChunk: %s", e)
+    raise

Remember to import and configure the logging module at the top of your file:

import logging
logger = logging.getLogger(__name__)

Replace SpecificException with the actual exception(s) you expect could occur.


7-11: ⚠️ Potential issue

Avoid using class variables for instance-specific state

Defining document, max_chunk_size, chunk_index, chunk_size, and paragraph_chunks as class variables can lead to shared state across all instances of TextChunker, resulting in unexpected behavior when multiple instances are used concurrently. These should be instance variables initialized within the __init__ method.

Apply this diff to move these variables to instance variables:

 class TextChunker():
-    document = None
-    max_chunk_size: int
-    chunk_index = 0
-    chunk_size = 0
-    paragraph_chunks = []

     def __init__(self, document, get_text: callable, chunk_size: int = 1024):
+        self.document = document
+        self.max_chunk_size = chunk_size
+        self.get_text = get_text
+        self.chunk_index = 0
+        self.chunk_size = 0
+        self.paragraph_chunks = []
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def __init__(self, document, get_text: callable, chunk_size: int = 1024):
        self.document = document
        self.max_chunk_size = chunk_size
        self.get_text = get_text
        self.chunk_index = 0
        self.chunk_size = 0
        self.paragraph_chunks = []
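
For context, a minimal example of the shared-state problem with a mutable class attribute (illustrative only, not the actual TextChunker):

class Chunker:
    paragraph_chunks = []   # class attribute, shared by every instance

a = Chunker()
b = Chunker()
a.paragraph_chunks.append("chunk from a")
print(b.paragraph_chunks)   # ['chunk from a']: b observes a's data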

60-70: ⚠️ Potential issue

Ensure consistent and robust exception handling in final chunk processing

Similar to the previous comment, avoid catching all exceptions indiscriminately and improve logging in the final chunk processing. This practice ensures errors are properly logged and can be debugged effectively.

Apply this diff:

 try:
     yield DocumentChunk(
         ...
     )
-except Exception as e:
-    print(e)
+except SpecificException as e:
+    logger.exception("Error yielding final DocumentChunk: %s", e)
+    raise

Ensure you handle specific exceptions and log them appropriately.

Committable suggestion skipped: line range outside the PR's diff.

cognee/modules/graph/utils/get_graph_from_model.py (1)

5-5: ⚠️ Potential issue

Avoid mutable default arguments in function definition

Using mutable default arguments like {} for added_nodes and added_edges can lead to unexpected behavior because the default dictionaries are shared across all function calls. This can cause unintended side effects when the function is called multiple times.

Apply this diff to fix the issue:

-def get_graph_from_model(data_point: DataPoint, include_root = True, added_nodes = {}, added_edges = {}):
+def get_graph_from_model(data_point: DataPoint, include_root = True, added_nodes = None, added_edges = None):

Then, initialize the dictionaries inside the function:

if added_nodes is None:
    added_nodes = {}
if added_edges is None:
    added_edges = {}
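
A minimal, standalone sketch of why the shared default matters (illustrative only, not code from this repository):

def collect(item, bucket = []):
    bucket.append(item)
    return bucket

print(collect("a"))  # ['a']
print(collect("b"))  # ['a', 'b'], because the same list object is reused across calls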
cognee/api/v1/cognify/code_graph_pipeline.py (1)

40-40: 🛠️ Refactor suggestion

Use isinstance() for type checking instead of type()

Using type() for type checking is discouraged because it does not account for inheritance and can lead to issues with subclasses. It's better to use isinstance() for more reliable and Pythonic type checking.

Here's the suggested change:

-if type(datasets[0]) == str:
+if isinstance(datasets[0], str):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if isinstance(datasets[0], str):
🧰 Tools
🪛 Ruff

40-40: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

cognee/tasks/graph/extract_graph_from_data.py (4)

55-55: 🛠️ Refactor suggestion

Prevent key collisions in existing_edges_map

Using string concatenation for keys may lead to collisions. Use tuples as keys instead to ensure uniqueness.

Apply this diff:

-    existing_edges_map[edge[0] + edge[1] + edge[2]] = True
+    existing_edges_map[(edge[0], edge[1], edge[2])] = True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        existing_edges_map[(edge[0], edge[1], edge[2])] = True
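
A small illustration of the collision risk with concatenated keys (the IDs are hypothetical, not taken from the codebase):

# Two different edges collapse to the same string key...
key_a = "node1" + "2" + "owns"   # "node12owns"
key_b = "node" + "12" + "owns"   # "node12owns"
assert key_a == key_b

# ...while tuple keys keep them distinct.
assert ("node1", "2", "owns") != ("node", "12", "owns")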

102-105: 🛠️ Refactor suggestion

Use tuples for edge_key to ensure uniqueness

Concatenating IDs and relationship names into a string for edge_key can cause collisions. Use a tuple to represent the key.

Apply this diff:

-    edge_key = str(source_node_id) + str(target_node_id) + relationship_name
+    edge_key = (source_node_id, target_node_id, relationship_name)

Ensure that existing_edges_map checks use this tuple key.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            edge_key = (source_node_id, target_node_id, relationship_name)

            if edge_key not in existing_edges_map:
                graph_edges.append((

24-25: ⚠️ Potential issue

Check for None before accessing chunk_graph.nodes

If chunk_graph is None, accessing chunk_graph.nodes will raise an AttributeError. Add a check to ensure chunk_graph is not None before iterating over its nodes.

Apply this diff to fix the issue:

 for (chunk_index, chunk) in enumerate(data_chunks):
     chunk_graph = chunk_graphs[chunk_index]
+    if chunk_graph is None:
+        continue
     for node in chunk_graph.nodes:

Committable suggestion skipped: line range outside the PR's diff.


38-41: 💡 Codebase verification

Based on the Edge class definition in cognee/shared/data_models.py, I can now provide a definitive response about the edge construction order.

The edge construction order in the code is incorrect

The Edge class clearly defines the order as source_node_id followed by target_node_id, but in the list comprehension, the order is reversed: (edge.target_node_id, edge.source_node_id, edge.relationship_name). This reversal changes the direction of the relationships in the graph.

  • Location: cognee/tasks/graph/extract_graph_from_data.py:38-41
  • Current: (edge.target_node_id, edge.source_node_id, edge.relationship_name)
  • Should be: (edge.source_node_id, edge.target_node_id, edge.relationship_name)
🔗 Analysis chain

Verify the order of source and target node IDs in edge construction

In graph_node_edges, the source and target node IDs seem to be reversed. Currently, edge.target_node_id is used as the source, and edge.source_node_id as the target. Confirm if this is intentional.

Run the following script to verify the consistency of edge constructions:


Let me gather more context about the edge construction and usage pattern in the codebase.


Let me search for the Edge class definition and its usage in the codebase to understand the intended direction of edges.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of source and target node IDs in edge constructions.

# Search for all instances where edges are constructed.
rg -A 2 'graph_node_edges = \['

# Expected: Edge tuples should be in the format (source_node_id, target_node_id, relationship_name)

Length of output: 335


Script:

#!/bin/bash
# Search for the Edge class definition and its usage
ast-grep --pattern 'class Edge'

# Search for any methods that use these edges
rg -A 5 'def.*edge' 

# Search for any documentation or type hints related to edges
rg -A 3 '(Edge|edge).*:' 

Length of output: 47555


Script:

#!/bin/bash
# Search for Edge class definition and its usage
rg -A 10 'class Edge'

# Search for where chunk_graph is created/modified
rg -A 5 'chunk_graph.*='

Length of output: 2520

cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (4)

15-22: 🛠️ Refactor suggestion

Ensure consistent ID type in IndexSchema

The id field in IndexSchema is defined as str, but elsewhere in the code IDs are handled as UUID. To maintain consistency and prevent type-related issues, consider defining id as UUID or ensuring that IDs are consistently handled as strings throughout the codebase.
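
If the string representation is kept, one hedged option is to normalize at the adapter boundary (the helper names below are hypothetical, not existing code):

from typing import Union
from uuid import UUID

def to_storage_id(value: Union[UUID, str]) -> str:
    # LanceDB rows keep ids as plain strings.
    return str(value)

def from_storage_id(value: str) -> UUID:
    # Convert back to UUID before results leave the adapter.
    return UUID(value)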


100-109: ⚠️ Potential issue

Verify generic type usage in create_lance_data_point

When creating LanceDataPoint, specifying generic parameters with self.get_data_point_schema(type(data_point)) might lead to unexpected behavior or type issues. Ensure that the type parameters are correctly applied and compatible with the Generic type usage in LanceDataPoint.


167-176: ⚠️ Potential issue

Handle empty search results to prevent exceptions

In the search method, computing min() and max() on an empty result_values list will raise a ValueError if no results are found. Add a check for empty result_values before calculating min_value and max_value to avoid potential exceptions.

Apply this diff to handle empty results:

 result_values = list(results.to_dict("index").values())

+if not result_values:
+    return []

 min_value = min(result["_distance"] for result in result_values)
 max_value = max(result["_distance"] for result in result_values)

Committable suggestion skipped: line range outside the PR's diff.


214-221: ⚠️ Potential issue

Ensure safe access to data_point._metadata in index_data_points

Accessing data_point._metadata["index_fields"][0] may raise a KeyError if index_fields is not present. Confirm that all DataPoint instances have _metadata with index_fields, or add error handling to prevent exceptions.

Consider adding a default or error handling:

 IndexSchema(
     id = str(data_point.id),
-    text = getattr(data_point, data_point._metadata["index_fields"][0]),
+    text = getattr(
+        data_point,
+        data_point._metadata.get("index_fields", ["default_field"])[0],
+        ""
+    ),
 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
        await self.create_data_points(f"{index_name}_{index_property_name}", [
            IndexSchema(
                id = str(data_point.id),
                text = getattr(
                    data_point,
                    data_point._metadata.get("index_fields", ["default_field"])[0],
                    ""
                ),
            ) for data_point in data_points
        ])
cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (3)

33-33: ⚠️ Potential issue

Avoid using mutable default arguments

Using a mutable default argument like params: dict = {} can lead to unexpected behavior because the same dictionary is shared across all function calls. It's better to use None as the default value and initialize a new dictionary inside the function.

Apply this diff:

-        def query(self, query: str, params: dict = {}):
+        def query(self, query: str, params: dict = None):
+            if params is None:
+                params = {}
             graph = self.driver.select_graph(self.graph_name)

Committable suggestion skipped: line range outside the PR's diff.


108-109: 🛠️ Refactor suggestion

Specify exception type instead of using a bare except

Using a bare except clause can catch unexpected exceptions and make debugging difficult. Specify the exception type you want to handle to avoid masking other exceptions.

Apply this diff:

-            except:
+            except Exception:

Or, if you know the specific exception that might occur, replace Exception with that specific exception type.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff

108-108: Do not use bare except

(E722)


24-24: ⚠️ Potential issue

Fix the default value of embedding_engine parameter

The default value for embedding_engine is set to EmbeddingEngine, which is likely a class, not an instance. This can cause runtime errors when embed_text is called on a class instead of an instance. Consider changing the default value to None and instantiating EmbeddingEngine inside the __init__ method if no instance is provided.

Apply this diff:

-            embedding_engine =  EmbeddingEngine,
+            embedding_engine: EmbeddingEngine = None,

Then, update the __init__ method:

             self.driver = FalkorDB(
                 host = database_url,
                 port = database_port,
             )
-            self.embedding_engine = embedding_engine
+            if embedding_engine is None:
+                self.embedding_engine = EmbeddingEngine()
+            else:
+                self.embedding_engine = embedding_engine

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/databases/graph/networkx/adapter.py (4)

37-38: ⚠️ Potential issue

Implement the query method or raise NotImplementedError

The query method is currently empty and contains only a pass statement. To clearly indicate that this method is intended to be implemented in the future, consider raising a NotImplementedError.

Apply this diff to raise NotImplementedError:

 async def query(self, query: str, params: dict):
-    pass
+    raise NotImplementedError("The 'query' method is not implemented yet.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def query(self, query: str, params: dict):
        raise NotImplementedError("The 'query' method is not implemented yet.")

255-257: ⚠️ Potential issue

Avoid using bare except; specify the exception type

Using a bare except clause can catch unexpected exceptions like KeyboardInterrupt or SystemExit. It's better to catch specific exceptions to prevent masking other errors.

Apply this diff to specify the exception type:

 try:
     node["id"] = UUID(node["id"])
-except:
+except ValueError:
     pass

If multiple exception types are possible, consider catching Exception or the specific exceptions that can occur during UUID conversion.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                          node["id"] = UUID(node["id"])
                        except ValueError:
                          pass
🧰 Tools
🪛 Ruff

256-256: Do not use bare except

(E722)


273-275: ⚠️ Potential issue

Correct the condition to check 'updated_at' in edge instead of node

The condition is currently checking if "updated_at" in node but then attempts to access edge["updated_at"], which may lead to a KeyError. The condition should check for "updated_at" in edge.

Apply this diff to correct the condition:

-            if "updated_at" in node:
+            if "updated_at" in edge:
                 edge["updated_at"] = datetime.strptime(edge["updated_at"], "%Y-%m-%dT%H:%M:%S.%f%z")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                        if "updated_at" in edge:
                            edge["updated_at"] = datetime.strptime(edge["updated_at"], "%Y-%m-%dT%H:%M:%S.%f%z")


270-271: ⚠️ Potential issue

Avoid using bare except; specify the exception type

Similar to the previous comment, replace the bare except with a specific exception to improve error handling.

Apply this diff to specify the exception type:

 try:
     # UUID conversion logic
-except:
+except ValueError:
     pass

Ensure that you catch the appropriate exceptions that may arise during the conversion.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff

270-270: Do not use bare except

(E722)

cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (4)

204-210: 🛠️ Refactor suggestion

⚠️ Potential issue

Ensure correct node identification and property handling in batch edge addition

In the add_edges method, verify that nodes are matched based on their id properties rather than labels derived from UUIDs. Also, ensure that the properties source_node_id and target_node_id are correctly assigned without duplication.

Suggested modifications:

  • Match nodes by id property in the query.
  • Avoid overwriting existing properties when adding source_node_id and target_node_id.

Updated query:

 query = """
 UNWIND $edges AS edge
 MATCH (from_node {id: edge.from_node})
 MATCH (to_node {id: edge.to_node})
 CALL apoc.create.relationship(from_node, edge.relationship_name, edge.properties, to_node) YIELD rel
 RETURN rel
 """

Update edge construction:

 edges = [{
     "from_node": str(edge[0]),
     "to_node": str(edge[1]),
     "relationship_name": edge[2],
     "properties": {
         **(edge[3] if edge[3] else {}),
-        "source_node_id": str(edge[0]),
-        "target_node_id": str(edge[1]),
     },
 } for edge in edges]

Committable suggestion skipped: line range outside the PR's diff.


63-69: 🛠️ Refactor suggestion

Set 'created_at' and 'updated_at' timestamps correctly when adding nodes

In the add_node method, the updated_at timestamp is currently set only during the ON MATCH clause. To accurately record both the creation and update times of a node, consider setting created_at during the ON CREATE clause and updated_at during both ON CREATE and ON MATCH clauses.

Proposed query modification:

 query = """MERGE (node {id: $node_id})
             ON CREATE SET node += $properties,
+                         node.created_at = timestamp(),
+                         node.updated_at = timestamp()
             ON MATCH SET node += $properties,
+                        node.updated_at = timestamp()
             RETURN ID(node) AS internal_id, node.id AS nodeId"""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def add_node(self, node: DataPoint):
        serialized_properties = self.serialize_properties(node.model_dump())

        query = """MERGE (node {id: $node_id})
                ON CREATE SET node += $properties,
                            node.created_at = timestamp(),
                            node.updated_at = timestamp()
                ON MATCH SET node += $properties,
                           node.updated_at = timestamp()
                RETURN ID(node) AS internal_id, node.id AS nodeId"""

79-85: 🛠️ Refactor suggestion

Set 'created_at' and 'updated_at' timestamps correctly in batch node addition

In the add_nodes method, the updated_at timestamp is set only during the ON MATCH clause. For consistent timestamping and to track node creation times, set created_at during ON CREATE and updated_at during both ON CREATE and ON MATCH clauses.

Proposed query modification:

 query = """
 UNWIND $nodes AS node
 MERGE (n {id: node.node_id})
 ON CREATE SET n += node.properties,
+              n.created_at = timestamp(),
+              n.updated_at = timestamp()
 ON MATCH SET n += node.properties,
+             n.updated_at = timestamp()
 WITH n, node.node_id AS label
 CALL apoc.create.addLabels(n, [label]) YIELD node AS labeledNode
 RETURN ID(labeledNode) AS internal_id, labeledNode.id AS nodeId
 """

Committable suggestion skipped: line range outside the PR's diff.


139-143: ⚠️ Potential issue

Correct node matching by using properties instead of labels

In the has_edge method, UUIDs are incorrectly used as labels in the MATCH clause. Labels in Neo4j are meant for node categories, not unique identifiers like UUIDs. To match nodes by their id, use property matching instead.

Proposed query correction:

 query = """
-    MATCH (from_node:`{str(from_node)}`)-[relationship:`{edge_label}`]->(to_node:`{str(to_node)}`)
+    MATCH (from_node {{id: $from_node_id}})-[relationship:`{edge_label}`]->(to_node {{id: $to_node_id}})
     RETURN COUNT(relationship) > 0 AS edge_exists
 """

Update parameters:

 params = {
+    "from_node_id": str(from_node),
+    "to_node_id": str(to_node)
 }

Committable suggestion skipped: line range outside the PR's diff.

@0xideas 0xideas force-pushed the test-beartype-in-cognify-pipeline branch from 97f1a5a to 2da4e5d Compare November 11, 2024 14:33
@0xideas 0xideas force-pushed the test-beartype-in-cognify-pipeline branch from 2da4e5d to 4ee7c6f Compare November 11, 2024 14:38
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🧹 Outside diff range and nitpick comments (16)
cognee/modules/engine/utils/generate_edge_name.py (1)

1-2: Add function documentation

Add a docstring to explain the function's purpose, parameters, return value, and provide usage examples.

Add documentation:

 def generate_edge_name(name: str) -> str:
+    """Generate a standardized edge name by converting input to lowercase and replacing special characters.
+
+    Args:
+        name (str): The input string to be converted into an edge name
+
+    Returns:
+        str: A standardized edge name with spaces converted to underscores and special characters removed
+
+    Examples:
+        >>> generate_edge_name("User's Profile")
+        "users_profile"
+        >>> generate_edge_name("Product Category")
+        "product_category"
+    """
     return name.lower().replace(" ", "_").replace("'", "")
cognee/tasks/storage/add_data_points.py (3)

1-7: Consider using absolute imports for better maintainability

While relative imports work, consider using absolute imports (e.g., from cognee.tasks.storage.index_data_points import index_data_points) to make the code more maintainable and less prone to breakage if the module structure changes.


8-9: Add docstring to document function behavior

The function would benefit from a docstring explaining:

  • Purpose of the function
  • Parameters and their expected format
  • Return value description
  • Any side effects (e.g., database operations)
  • Potential exceptions that might be raised

Example:

@beartype
async def add_data_points(data_points: list[DataPoint]) -> list[DataPoint]:
    """Add data points to the graph database.

    Args:
        data_points: List of DataPoint objects to be processed and stored

    Returns:
        The same list of DataPoint objects that was passed in

    Raises:
        Any exceptions from graph_engine operations
    """

26-26: Consider enhancing return value with operation status

The function returns the original data_points without indicating success/failure of the graph operations. Consider returning a tuple of (data_points, success_status) or raising exceptions on failure to help callers handle errors appropriately.

Example:

from typing import Tuple

async def add_data_points(data_points: list[DataPoint]) -> Tuple[list[DataPoint], bool]:
    success = False
    try:
        # ... existing operations ...
        success = True
    except Exception:
        # log error
        raise
    return data_points, success
cognee/tests/test_qdrant.py (2)

40-41: Enhance test coverage for vector search

While the test verifies basic search functionality, consider adding test cases for:

  1. Edge cases (empty search terms, non-existent entities)
  2. Multiple search results
  3. Different entity types

Would you like me to provide examples of additional test cases?
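
As a possible starting point, a hedged sketch of one such case; it assumes the engine returns an empty list (rather than raising) when nothing matches, which should be verified against the adapter's actual behavior:

async def check_search_edge_cases(vector_engine):
    # Non-existent entity: assumed to yield an empty result list.
    results = await vector_engine.search("Entity_name", "entity that was never ingested")
    assert results == []

    # Broad term: assumed to return zero or more results, each carrying a text payload.
    results = await vector_engine.search("Entity_name", "computer")
    assert all("text" in node.payload for node in results)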


40-41: Document schema changes in docstrings or comments

The change in payload structure and search fields would benefit from documentation explaining the new schema.

Add a comment explaining the expected payload structure and search fields, for example:

# Search for entities using Entity_name field
# Expected payload structure: {"text": str, ...}
random_node = (await vector_engine.search("Entity_name", "Quantum computer"))[0]
random_node_name = random_node.payload["text"]
cognee/tests/test_weaviate.py (2)

Line range hint 1-58: Enhance test coverage and documentation

While the test demonstrates basic functionality, it could be improved in several areas:

  1. The test lacks assertions about the content of the search results
  2. There's no error handling for failed searches
  3. The test data setup could be moved to fixtures

Consider adding:

  • Content validation in assertions
  • Error cases
  • Cleanup in a teardown phase

Example improvements:

# Add content validation
assert "quantum" in random_node_name, "Search result should contain the search term"

# Add error handling test
with pytest.raises(SomeException):
    await vector_engine.search("Entity_name", "")  # Empty search term

# Add cleanup
def teardown_module():
    asyncio.run(cognee.prune.prune_data())
    asyncio.run(cognee.prune.prune_system(metadata=True))

38-39: Consider using test constants for search parameters

The hardcoded search parameters could be moved to test constants for better maintainability.

+# At the top of the file
+ENTITY_TYPE = "Entity_name"
+SEARCH_TERM = "quantum computer"
+
 # In the test
-random_node = (await vector_engine.search("Entity_name", "quantum computer"))[0]
+random_node = (await vector_engine.search(ENTITY_TYPE, SEARCH_TERM))[0]
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1)

170-179: Add UUID validation and document score inversion

Two potential improvements for the search results handling:

  1. The UUID conversion could fail with invalid IDs
  2. The score inversion (1 - score) needs documentation explaining its purpose

Consider adding validation and documentation:

         return [
             ScoredResult(
-                id = UUID(result.id),
+                id = UUID(result.id) if is_valid_uuid(result.id) else result.id,
                 payload = {
                     **result.payload,
-                    "id": UUID(result.id),
+                    "id": UUID(result.id) if is_valid_uuid(result.id) else result.id,
                 },
+                # Invert score because Qdrant returns similarity scores (1=similar, 0=different)
+                # while our application expects distance scores (0=similar, 1=different)
                 score = 1 - result.score,
             ) for result in results
         ]

+def is_valid_uuid(uuid_str: str) -> bool:
+    try:
+        UUID(uuid_str)
+        return True
+    except ValueError:
+        return False
cognee/infrastructure/databases/graph/networkx/adapter.py (5)

8-8: Remove unused import

The import from re import A is not used anywhere in the code.

-from re import A

44-48: Add return type hints for better IDE support

The method signatures would benefit from explicit return type hints.

-    async def add_node(
-        self,
-        node: DataPoint,
-    ) -> None:
-        self.graph.add_node(node.id, **node.model_dump())
+    async def add_node(
+        self,
+        node: DataPoint,
+    ) -> None:
+        """Add a single node to the graph.
+        
+        Args:
+            node (DataPoint): The node to add
+        """
+        self.graph.add_node(node.id, **node.model_dump())

Line range hint 183-210: Consider optimizing connection retrieval

The get_connections method could be optimized to reduce complexity and improve performance:

  1. Combine predecessor and successor loops
  2. Use list comprehension for better readability

     async def get_connections(self, node_id: UUID) -> list:
         if not self.graph.has_node(node_id):
             return []
 
         node = self.graph.nodes[node_id]
         if "id" not in node:
             return []
 
-        predecessors, successors = await asyncio.gather(
-            self.get_predecessors(node_id),
-            self.get_successors(node_id),
-        )
-
-        connections = []
-
-        for neighbor in predecessors:
-            if "id" in neighbor:
-                edge_data = self.graph.get_edge_data(neighbor["id"], node["id"])
-                for edge_properties in edge_data.values():
-                    connections.append((neighbor, edge_properties, node))
-
-        for neighbor in successors:
-            if "id" in neighbor:
-                edge_data = self.graph.get_edge_data(node["id"], neighbor["id"])
-                for edge_properties in edge_data.values():
-                    connections.append((node, edge_properties, neighbor))
-
-        return connections
+        connections = [
+            (neighbor, edge_props, node)
+            for neighbor in await self.get_predecessors(node_id)
+            if "id" in neighbor
+            for edge_props in self.graph.get_edge_data(neighbor["id"], node["id"]).values()
+        ] + [
+            (node, edge_props, neighbor)
+            for neighbor in await self.get_successors(node_id)
+            if "id" in neighbor
+            for edge_props in self.graph.get_edge_data(node["id"], neighbor["id"]).values()
+        ]
+        return connections

Line range hint 235-239: Add error handling for file operations

The save_graph_to_file method should include error handling to manage potential I/O errors.

     async def save_graph_to_file(self, file_path: str=None) -> None:
         """Asynchronously save the graph to a file in JSON format."""
         if not file_path:
             file_path = self.filename
-
-        graph_data = nx.readwrite.json_graph.node_link_data(self.graph)
-
-        async with aiofiles.open(file_path, "w") as file:
-            await file.write(json.dumps(graph_data, cls = JSONEncoder))
+        try:
+            graph_data = nx.readwrite.json_graph.node_link_data(self.graph)
+            async with aiofiles.open(file_path, "w") as file:
+                await file.write(json.dumps(graph_data, cls = JSONEncoder))
+        except Exception as e:
+            logger.error(f"Failed to save graph to file {file_path}: {str(e)}")
+            raise

Line range hint 1-307: Consider architectural improvements for better maintainability

While the implementation is solid, consider the following architectural improvements:

  1. Add comprehensive docstrings to all public methods
  2. Consider implementing connection pooling for better performance with large graphs
  3. Add metrics/monitoring for graph operations
  4. Consider implementing a caching layer for frequently accessed nodes/edges

Would you like assistance in implementing any of these architectural improvements?
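
As a rough illustration of point 4, a minimal read-through cache for node lookups; the class and method names are illustrative and not part of NetworkXAdapter:

class NodeCache:
    def __init__(self, graph):
        self.graph = graph
        self._cache = {}

    def get_node(self, node_id):
        # Read-through: fall back to the underlying networkx graph on a miss.
        if node_id not in self._cache:
            self._cache[node_id] = (
                self.graph.nodes[node_id] if self.graph.has_node(node_id) else None
            )
        return self._cache[node_id]

    def invalidate(self, node_id = None):
        # Call after add/update/delete so reads never serve stale data.
        if node_id is None:
            self._cache.clear()
        else:
            self._cache.pop(node_id, None)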

cognee/tasks/summarization/summarize_text.py (1)

12-12: Consider returning the generated summaries instead of input data chunks

The function currently returns the original data_chunks after generating summaries and adding them to the storage. If the intention is to provide the generated summaries to the caller for further processing, consider returning summaries instead.

Apply the following changes to return the summaries and update the return type annotation:

-async def summarize_text(data_chunks: list[DocumentChunk], summarization_model: Type[BaseModel]) -> list[DocumentChunk]:
+async def summarize_text(data_chunks: list[DocumentChunk], summarization_model: Type[BaseModel]) -> list[TextSummary]:
...
-    return data_chunks
+    return summaries
cognee/tasks/graph/extract_graph_from_data.py (1)

14-16: Handle exceptions in asynchronous graph extraction for robustness.

Using asyncio.gather without exception handling may cause the entire operation to fail if one coroutine raises an exception. To improve resilience, consider using the return_exceptions=True parameter and handle exceptions individually.

Modify the asyncio.gather call:

     chunk_graphs = await asyncio.gather(
-        *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks]
+        *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks],
+        return_exceptions = True

Then, handle exceptions after gathering results:

for idx, result in enumerate(chunk_graphs):
    if isinstance(result, Exception):
        # Log the exception and handle accordingly
        # For example, set the result to None or remove the corresponding chunk
        chunk_graphs[idx] = None

This ensures that exceptions in one chunk do not interrupt the processing of others.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 97f1a5a and 4ee7c6f.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (16)
  • README.md (1 hunks)
  • cognee/infrastructure/databases/graph/networkx/adapter.py (9 hunks)
  • cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (5 hunks)
  • cognee/modules/engine/utils/__init__.py (1 hunks)
  • cognee/modules/engine/utils/generate_edge_name.py (1 hunks)
  • cognee/modules/engine/utils/generate_node_name.py (1 hunks)
  • cognee/shared/utils.py (4 hunks)
  • cognee/tasks/documents/check_permissions_on_documents.py (1 hunks)
  • cognee/tasks/documents/classify_documents.py (1 hunks)
  • cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
  • cognee/tasks/graph/extract_graph_from_data.py (1 hunks)
  • cognee/tasks/storage/add_data_points.py (1 hunks)
  • cognee/tasks/summarization/summarize_text.py (1 hunks)
  • cognee/tests/test_qdrant.py (1 hunks)
  • cognee/tests/test_weaviate.py (1 hunks)
  • pyproject.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • README.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • cognee/modules/engine/utils/generate_node_name.py
  • cognee/shared/utils.py
  • cognee/tasks/documents/check_permissions_on_documents.py
  • cognee/tasks/documents/classify_documents.py
  • cognee/tasks/documents/extract_chunks_from_documents.py
🧰 Additional context used
🪛 Ruff
cognee/infrastructure/databases/graph/networkx/adapter.py

256-256: Do not use bare except

(E722)


270-270: Do not use bare except

(E722)

🔇 Additional comments (14)
cognee/modules/engine/utils/generate_edge_name.py (1)

1-2: Verify usage in graph operations

Since this is a utility function for graph edge names, we should verify its usage in graph-related operations.

✅ Verification successful

Let me check the actual usage in the graph extraction file to understand how this function is being used.


Function usage is consistent and properly integrated

The verification shows that generate_edge_name is used appropriately in graph operations:

  • It's used in extract_graph_from_data.py to normalize relationship names before creating edges
  • The function's behavior (lowercasing and replacing spaces/apostrophes) aligns with similar utilities like generate_node_id
  • The normalized edge names are used consistently in both Neo4j and NetworkX graph adapters
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of generate_edge_name in graph-related files
echo "Searching for generate_edge_name usage in graph operations..."
rg -l "generate_edge_name" | grep -i "graph"

# Check for existing edge name generation patterns that might conflict
echo "Checking for existing edge name generation patterns..."
rg -A 2 "\.lower\(\)\.replace\(['\"] ['\"], ['\"]_['\"]\)"

Length of output: 653


Script:

#!/bin/bash
# Check how generate_edge_name is used in graph extraction
echo "Checking usage in graph extraction:"
rg -A 5 "generate_edge_name" cognee/tasks/graph/extract_graph_from_data.py

# Also check if there are any other edge name manipulations in graph-related files
echo -e "\nChecking other edge name patterns in graph files:"
rg -l "edge.*name" | grep -i "graph" | xargs rg "edge.*name" -A 2

Length of output: 5676

cognee/modules/engine/utils/__init__.py (2)

1-3: LGTM! Clean and well-organized imports.

The imports are well-structured with clear separation of concerns, making the utility functions easily accessible from the package level.


1-3: Verify the module structure and dependencies.

Let's ensure all required modules are present in the new location and the imports resolve correctly.

✅ Verification successful

Let me verify the imports in the files that use these utility functions to ensure they're using the correct paths.


Module structure and dependencies are correctly configured

All utility modules are present in the new location (cognee/modules/engine/utils/), and the imports in dependent files (extract_graph_from_data.py and infer_data_ontology.py) are correctly using the new path cognee.modules.engine.utils. No references to the old location were found.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the presence and structure of the utility modules

# Check if all referenced modules exist
echo "Checking for utility modules..."
for module in generate_node_id.py generate_node_name.py generate_edge_name.py; do
  if fd -p "cognee/modules/engine/utils/$module"; then
    echo "✓ Found $module"
  else
    echo "✗ Missing $module"
  fi
done

# Check for any remaining references to the old location
echo -e "\nChecking for references to old location..."
rg -l "cognee/modules/graph/utils" || echo "No references to old location found"

# Look for any imports of these functions in the codebase
echo -e "\nChecking usage of utility functions..."
rg -l "import.*generate_(node|edge)_(id|name)" || echo "No imports found"

Length of output: 1322


Script:

#!/bin/bash
# Check the import statements in the files using these utilities
echo "Checking import statements in extract_graph_from_data.py:"
rg "import.*generate_(node|edge)_(id|name)" cognee/tasks/graph/extract_graph_from_data.py -B1 -A1

echo -e "\nChecking import statements in infer_data_ontology.py:"
rg "import.*generate_(node|edge)_(id|name)" cognee/tasks/graph/infer_data_ontology.py -B1 -A1

# Also check if there are any direct references to the old path
echo -e "\nChecking for any remaining old path references in these files:"
rg "graph/utils" cognee/tasks/graph/extract_graph_from_data.py cognee/tasks/graph/infer_data_ontology.py

Length of output: 1026

cognee/tasks/storage/add_data_points.py (1)

19-24: ⚠️ Potential issue

Add error handling for async operations

The async operations lack error handling which could lead to inconsistent state if any operation fails.

Consider wrapping the operations in try-except blocks:

try:
    graph_engine = await get_graph_engine()
except Exception as e:
    logger.error(f"Failed to get graph engine: {e}")
    raise

try:
    await index_data_points(data_points)
    await graph_engine.add_nodes(nodes)
    await graph_engine.add_edges(edges)
except Exception as e:
    logger.error(f"Failed to process graph data: {e}")
    # Consider cleanup or rollback operations
    raise

Verify the order of operations

Ensure that indexing must occur before adding nodes/edges to the graph.

✅ Verification successful

Let me analyze the search results to verify the order of operations and generate a final response.

Error handling is needed, but operation order is correct

The search results confirm that the order of operations (indexing before graph operations) is consistent across the codebase:

  • In add_data_points.py, indexing happens before graph operations
  • The index_data_points function is implemented in vector database adapters (QDrant, PGVector, Weaviate, LanceDB) and handles vector indexing
  • Graph operations (add_nodes, add_edges) are implemented in graph database adapters (NetworkX, Neo4j) and handle graph persistence

The original error handling suggestion remains valid as these are critical persistence operations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other usages of these operations to verify the correct order
rg -A 5 "index_data_points|add_nodes|add_edges" --type py

Length of output: 13477

pyproject.toml (2)

73-73: Consider performance implications of runtime type checking

Adding beartype for runtime type checking aligns with the PR objective but could impact performance. Consider:

  1. Implementing selective type checking for critical paths only
  2. Adding performance benchmarks before full deployment
  3. Using it primarily in development/testing environments
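
One hedged way to keep the overhead out of production, sketched here around a hypothetical COGNEE_ENV setting that is not part of the repository:

import os
from beartype import beartype

def typechecked(func):
    # Apply beartype only outside production-like environments (COGNEE_ENV is assumed).
    if os.getenv("COGNEE_ENV", "dev") == "prod":
        return func
    return beartype(func)

The task functions could then use @typechecked in place of @beartype, keeping the runtime checks in development and CI while skipping the wrapping entirely in production.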
✅ Verification successful

Based on the results, I can now generate a final response since I have a clear picture of how beartype is being used in the codebase.

Runtime type checking is appropriately scoped to critical data processing tasks

The current implementation already follows best practices by selectively applying @beartype decorators to key data processing functions:

  • Data classification (classify_documents)
  • Document chunking (extract_chunks_from_documents)
  • Storage operations (add_data_points)
  • Summarization (summarize_text)
  • Graph extraction (extract_graph_from_data)

These are high-level task functions where type validation is valuable for data integrity, and the performance overhead is justified given their operational scope.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check current usage of beartype decorators

# Search for beartype usage patterns
rg "@beartype" -A 2

# Look for performance-critical paths
ast-grep --pattern 'from beartype import beartype'

Length of output: 1877


61-61: Verify compatibility with LanceDB major version upgrade

The upgrade from LanceDB 0.8.0 to 0.15.0 is a significant jump that might introduce breaking changes.

✅ Verification successful

Let's analyze the actual usage patterns to understand potential breaking changes impact.


LanceDB upgrade appears safe to proceed

The codebase uses LanceDB's stable APIs that have remained consistent:

  • Core functionality like connect_async, create_table, vector_search, and merge_insert are standard features
  • The implementation follows LanceDB's recommended patterns using LanceModel and Vector
  • No deprecated or removed features are being used

The adapter implementation is well-isolated and follows a clean interface pattern, making it easier to handle any minor compatibility issues if they arise.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for breaking changes in LanceDB upgrade

# Check changelog or release notes
curl -s https://raw.githubusercontent.com/lancedb/lancedb/main/CHANGELOG.md | grep -A 10 "0.15.0"

# Search for LanceDB usage in codebase
rg "lancedb" -l

Length of output: 585


Script:

#!/bin/bash
# Check LanceDB usage patterns in the codebase
rg -A 5 "from lancedb|import lancedb" 

# Check LanceDBAdapter implementation
cat cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py

Length of output: 9555

cognee/tests/test_qdrant.py (1)

40-41: Verify schema changes across the codebase

The changes to the search field ("Entity_name") and payload access ("text") suggest a schema modification. Let's ensure these changes are consistent throughout the codebase.

✅ Verification successful

The previous search was too restrictive with exact quotes. Let's broaden the search to find all payload access patterns and search method calls, along with the full test file content for better context.


Schema changes are consistent across the codebase

The search results show that the schema changes are consistently implemented across all test files and core functionality:

  • All vector engine searches for entities use "Entity_name" as the collection name:

    • test_qdrant.py, test_weaviate.py, test_neo4j.py, test_library.py, test_pgvector.py
    • tasks/graph/query_graph_connections.py
  • All payload access uses the "text" field consistently:

    • All test files access node.payload["text"]
    • No instances of the old "name" field were found

The changes in test_qdrant.py align with the established patterns across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any remaining uses of the old schema
echo "Checking for old payload access patterns..."
rg -l '"name".*payload' --type py

echo "Checking for old entity search patterns..."
rg -l '"entities".*search' --type py

# Verify the new schema is used consistently
echo "Checking new schema usage..."
rg -l '"Entity_name".*search' --type py
rg -l '"text".*payload' --type py

Length of output: 411


Script:

#!/bin/bash
# Let's try a broader search without exact string matches
echo "Checking payload access patterns..."
rg "payload\[" --type py -A 1 -B 1

echo "Checking search patterns..."
rg "\.search\(" --type py -A 1 -B 1

echo "Checking test file content..."
cat cognee/tests/test_qdrant.py

Length of output: 14598

cognee/tests/test_weaviate.py (1)

38-39: Verify schema changes across the codebase

The changes in search parameters ("Entity_name") and payload access ("text") suggest a schema modification. Let's ensure these changes are consistent across the codebase.

✅ Verification successful

Let me try another search pattern to verify the payload access consistency.


Based on the search results, I can now generate the final response:

Schema changes are consistent across the codebase

The changes in the test file align with the rest of the codebase:

  • All test files consistently use "Entity_name" as the collection name for vector search
  • All payload access patterns consistently use payload["text"] across different vector engine implementations (Weaviate, Qdrant, PGVector, etc.)
  • The search term variation ("quantum computer" vs "AI") is just test-specific data and doesn't impact the schema
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other occurrences of vector_engine.search to verify consistent usage
rg -A 2 'vector_engine\.search\(' --type py

# Search for payload access patterns to verify schema consistency
rg -A 2 'payload\[(["'\''])(?:name|text)\1\]' --type py

# Search for class/type definitions to verify entity schema
ast-grep --pattern 'class Entity_name {
  $$$
}'

Length of output: 2428


Script:

#!/bin/bash
# Search for payload access patterns without backreference
rg -A 2 'payload\[["'\''][a-z]+["'\'']\]' --type py

# Search for Entity_name class/type definitions in Python files
rg -A 5 'class Entity_name' --type py

Length of output: 1235

cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (3)

13-18: LGTM: Well-structured schema definition

The IndexSchema class is well-designed with clear field definition and metadata structure.


108-109: LGTM: Good use of model_dump() and proper ID handling

The changes to use model_dump() and string IDs are appropriate and follow best practices.


128-129: Verify consistent collection naming across codebase

The new index naming convention {index_name}_{index_property_name} should be used consistently throughout the codebase.

✅ Verification successful

Let me gather more information about collection name usage with different patterns.


Let me try one more search with a modified pattern to specifically look for the collection name string formations.


Collection naming convention is consistently implemented across vector adapters

Based on the search results, the collection naming pattern f"{index_name}_{index_property_name}" is consistently used across all vector database adapters:

  • QDrantAdapter.py
  • WeaviateAdapter.py
  • PGVectorAdapter.py
  • LanceDBAdapter.py

Each adapter implements the same naming convention in both their create_vector_index and subsequent collection operations, ensuring consistency throughout the codebase.
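
For reference, the shared convention reduces to a single f-string; a minimal sketch (the helper name here is illustrative, not something the adapters actually define):

def vector_collection_name(index_name: str, index_property_name: str) -> str:
    # Convention observed across the adapters: "<index_name>_<index_property_name>".
    return f"{index_name}_{index_property_name}"

assert vector_collection_name("Entity", "name") == "Entity_name"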

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for collection name usage patterns to ensure consistency
echo "Searching for collection name patterns..."
rg -g '!*.pyc' -g '!*test*' 'collection_name.*=.*_.*_' .

echo "Searching for direct collection operations..."
ast-grep --pattern 'await client.create_collection(collection_name = $_)' .

Length of output: 516


Script:

#!/bin/bash
# Search for collection name patterns and collection operations more broadly
echo "Searching for collection name references..."
rg -g '!*.pyc' -g '!*test*' 'collection.*=|collection_name.*=' -A 2 .

echo "Searching for create_collection calls..."
rg -g '!*.pyc' -g '!*test*' 'create_collection\(' -A 2 .

echo "Searching for collection name string formations..."
rg -g '!*.pyc' -g '!*test*' 'f"{.*}_{.*}"' .

Length of output: 15279


Script:

#!/bin/bash
# Search for f-string collection name patterns
echo "Searching for collection name string formations..."
rg -g '!*.pyc' -g '!*test*' 'f".*_.*"' .

echo "Searching specifically in vector adapter implementations..."
rg -g '!*.pyc' -g '!*test*' 'create_vector_index' -A 3 .

Length of output: 16309

cognee/tasks/summarization/summarize_text.py (3)

2-3: Addition of runtime type checking with Beartype

The inclusion of the beartype and Type imports enables runtime type enforcement, which can help catch type-related errors early.


24-24: Ensure correct access to the summary attribute

In text = chunk_summaries[chunk_index].summary, confirm that extract_summary returns an object with a summary attribute. If it returns a plain string, this line may raise an AttributeError.

If extract_summary returns a string, update the code as follows:

-            text = chunk_summaries[chunk_index].summary,
+            text = chunk_summaries[chunk_index],

11-12: Ensure all function calls are updated to match the new signature

The collection_name parameter has been removed from the summarize_text function signature. Please verify that all calls to this function have been updated accordingly to prevent any runtime errors.

Run the following script to find any calls to summarize_text that still use the collection_name parameter:

✅ Verification successful

Function calls are already compatible with the new signature

Based on the verification results, the only usage of summarize_text is in cognee/api/v1/cognify/cognify_v2.py and it's already using the correct parameter signature without the collection_name parameter. The function is called with only the required parameters: the data chunks (implicit in the context) and the summarization model.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all occurrences where 'summarize_text' is called with 'collection_name'.

rg -A 2 $'summarize_text\(.+, .+, collection_name='

Length of output: 52


Script:

#!/bin/bash
# Let's search for all calls to summarize_text function to verify the parameter usage
rg -A 2 "summarize_text\(" --type python

Length of output: 73


Script:

#!/bin/bash
# Let's also check for any imports of summarize_text to ensure we find all usage
rg "from.*summarize_text.*import.*summarize_text" --type python

Length of output: 96


Script:

#!/bin/bash
# Let's also use ast-grep to find function calls
ast-grep --pattern 'summarize_text($$$)'

Length of output: 42


Script:

#!/bin/bash
# Search for summarize_text function calls without file type restriction
rg -A 2 "summarize_text\("

Length of output: 341


Script:

#!/bin/bash
# Search for imports of summarize_text
rg "from.*summarize_text.*import"

Length of output: 117


Script:

#!/bin/bash
# Find all Python files that might contain the function call
fd "\.py$" -x grep -l "summarize_text"

Length of output: 164


Script:

#!/bin/bash
# Let's check the actual usage in the API file
rg -A 5 "summarize_text" ./cognee/api/v1/cognify/cognify_v2.py

Length of output: 381

Comment on lines +1 to +2
def generate_edge_name(name: str) -> str:
    return name.lower().replace(" ", "_").replace("'", "")

⚠️ Potential issue

Add beartype runtime type checking

Given that this PR is specifically about testing beartype in the cognify pipeline, the function should implement runtime type checking using beartype.

Apply this diff:

+from beartype import beartype

+@beartype
 def generate_edge_name(name: str) -> str:
     return name.lower().replace(" ", "_").replace("'", "")
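
For context, a small sketch of what the decorator buys at call time (assuming beartype is installed; the sample inputs are illustrative):

from beartype import beartype
from beartype.roar import BeartypeCallHintParamViolation

@beartype
def generate_edge_name(name: str) -> str:
    return name.lower().replace(" ", "_").replace("'", "")

print(generate_edge_name("Has Part"))  # has_part

try:
    generate_edge_name(42)  # wrong argument type on purpose
except BeartypeCallHintParamViolation:
    print("beartype rejected the non-str argument at call time")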

🛠️ Refactor suggestion

Add input validation and handle edge cases

The function should validate input and handle additional edge cases for more robust edge name generation.

Consider this improved implementation:

+import re
+
 def generate_edge_name(name: str) -> str:
+    if not name:
+        raise ValueError("Edge name cannot be empty")
+    # Collapse runs of whitespace into a single underscore and strip quotes
+    # and other common special characters.
-    return name.lower().replace(" ", "_").replace("'", "")
+    return re.sub(r"\s+", "_", name.lower()).translate(str.maketrans("", "", "'\"!@#$%^&*()+={}[]|\\:;<>,.?/"))

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +10 to +18
nodes = []
edges = []

for data_point in data_points:
    property_nodes, property_edges = get_graph_from_model(data_point)

    nodes.extend(property_nodes)
    edges.extend(property_edges)


🛠️ Refactor suggestion

Add error handling and consider memory optimization

The current implementation has two potential issues:

  1. No error handling for get_graph_from_model which could fail for invalid data points
  2. For large datasets, collecting all nodes and edges in memory before processing could cause memory pressure

Consider this approach:

nodes = []
edges = []

for data_point in data_points:
    try:
        property_nodes, property_edges = get_graph_from_model(data_point)
        nodes.extend(property_nodes)
        edges.extend(property_edges)
        
        # Optional: Process in batches to reduce memory usage
        if len(nodes) >= BATCH_SIZE:
            await graph_engine.add_nodes(nodes)
            await graph_engine.add_edges(edges)
            nodes = []
            edges = []
    except Exception as e:
        logger.error(f"Failed to process data point: {e}")
        # Decide whether to continue or raise
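
If the batching variant is adopted, note that the loop leaves a final partial batch behind; it still has to be flushed afterwards (a sketch under the same assumptions as above, i.e. BATCH_SIZE, logger, and graph_engine exist in scope):

# After the loop: write out whatever did not fill a complete batch.
if nodes:
    await graph_engine.add_nodes(nodes)
if edges:
    await graph_engine.add_edges(edges)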

Comment on lines +88 to +97
if not await client.collection_exists(collection_name):
    await client.create_collection(
        collection_name = collection_name,
        vectors_config = {
            "text": models.VectorParams(
                size = self.embedding_engine.get_vector_size(),
                distance = "Cosine"
            )
        }
    )

🛠️ Refactor suggestion

Add error handling and ensure proper resource cleanup

While the collection existence check is good, the create_collection operation needs proper error handling.

Consider wrapping the operation in a try-finally block:

     async def create_collection(
         self,
         collection_name: str,
         payload_schema = None,
     ):
         client = self.get_qdrant_client()
-        if not await client.collection_exists(collection_name):
-            await client.create_collection(
-                collection_name = collection_name,
-                vectors_config = {
-                    "text": models.VectorParams(
-                        size = self.embedding_engine.get_vector_size(),
-                        distance = "Cosine"
-                    )
-                }
-            )
-        await client.close()
+        try:
+            if not await client.collection_exists(collection_name):
+                await client.create_collection(
+                    collection_name = collection_name,
+                    vectors_config = {
+                        "text": models.VectorParams(
+                            size = self.embedding_engine.get_vector_size(),
+                            distance = "Cosine"
+                        )
+                    }
+                )
+        finally:
+            await client.close()
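
If the acquire/close pattern repeats across adapter methods, one option is a small async context manager so the cleanup lives in one place; a sketch, assuming get_qdrant_client() returns a client object with an async close():

from contextlib import asynccontextmanager

@asynccontextmanager
async def qdrant_client_session(adapter):
    # Hypothetical helper: acquire a client and always close it, even on errors.
    client = adapter.get_qdrant_client()
    try:
        yield client
    finally:
        await client.close()

# Usage inside a method:
#     async with qdrant_client_session(self) as client:
#         ...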

Comment on lines +128 to +138
async def create_vector_index(self, index_name: str, index_property_name: str):
    await self.create_collection(f"{index_name}_{index_property_name}")

async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
    await self.create_data_points(f"{index_name}_{index_property_name}", [
        IndexSchema(
            id = data_point.id,
            text = getattr(data_point, data_point._metadata["index_fields"][0]),
        ) for data_point in data_points
    ])


⚠️ Potential issue

Add validation and error handling for index operations

The current implementation has several potential failure points:

  1. No validation of index_name/index_property_name for valid collection name characters
  2. No error handling for missing fields in getattr()
  3. No validation of data_points list being non-empty

Consider adding these safety checks:

     async def create_vector_index(self, index_name: str, index_property_name: str):
+        if not index_name or not index_property_name:
+            raise ValueError("Index name and property name must not be empty")
+        if not all(c.isalnum() or c in '_-' for c in f"{index_name}_{index_property_name}"):
+            raise ValueError("Index name and property name must contain only alphanumeric characters, underscores, or hyphens")
         await self.create_collection(f"{index_name}_{index_property_name}")

     async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
+        if not data_points:
+            raise ValueError("data_points list must not be empty")
+
+        for data_point in data_points:
+            index_field = data_point._metadata["index_fields"][0]
+            if getattr(data_point, index_field, None) is None:
+                raise ValueError(f"Missing required index field {index_field} in data point")
+
         await self.create_data_points(f"{index_name}_{index_property_name}", [
             IndexSchema(
                 id = data_point.id,
                 text = getattr(data_point, data_point._metadata["index_fields"][0]),
             ) for data_point in data_points
         ])

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +37 to +38
async def query(self, query: str, params: dict):
    pass

⚠️ Potential issue

Implement the query method

The query method is currently a placeholder. Since this method is part of the GraphDBInterface, it should be implemented to handle graph queries.

Would you like help implementing the query method with NetworkX's graph querying capabilities?
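
As a starting point, one possible shape for such a method, treating params as simple attribute filters rather than a Cypher-style query (purely illustrative and not part of this PR; it assumes the adapter keeps its NetworkX graph on self.graph):

async def query(self, query: str, params: dict):
    # Illustrative only: NetworkX has no query language, so interpret
    # params as attribute filters and ignore the query string itself.
    return [
        (node_id, attrs)
        for node_id, attrs in self.graph.nodes(data = True)
        if all(attrs.get(key) == value for key, value in params.items())
    ]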

Comment on lines +254 to +271
try:
    node["id"] = UUID(node["id"])
except:
    pass
if "updated_at" in node:
    node["updated_at"] = datetime.strptime(node["updated_at"], "%Y-%m-%dT%H:%M:%S.%f%z")

for edge in graph_data["links"]:
    try:
        source_id = UUID(edge["source"])
        target_id = UUID(edge["target"])

        edge["source"] = source_id
        edge["target"] = target_id
        edge["source_node_id"] = source_id
        edge["target_node_id"] = target_id
    except:
        pass

⚠️ Potential issue

Improve error handling for UUID conversion

The bare except blocks could mask important errors. Consider catching specific exceptions and logging them for better debugging.

-                        try:
-                          node["id"] = UUID(node["id"])
-                        except:
-                          pass
+                        try:
+                            node["id"] = UUID(node["id"])
+                        except ValueError as e:
+                            logger.warning(f"Failed to convert node ID to UUID: {e}")
+                            pass

-                        try:
-                          source_id = UUID(edge["source"])
-                          target_id = UUID(edge["target"])
-
-                          edge["source"] = source_id
-                          edge["target"] = target_id
-                          edge["source_node_id"] = source_id
-                          edge["target_node_id"] = target_id
-                        except:
-                          pass
+                        try:
+                            source_id = UUID(edge["source"])
+                            target_id = UUID(edge["target"])
+
+                            edge["source"] = source_id
+                            edge["target"] = target_id
+                            edge["source_node_id"] = source_id
+                            edge["target_node_id"] = target_id
+                        except ValueError as e:
+                            logger.warning(f"Failed to convert edge IDs to UUID: {e}")
+                            pass

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff

256-256: Do not use bare except

(E722)


270-270: Do not use bare except

(E722)

vector_engine = get_vector_engine()
summaries = [
    TextSummary(
        id = uuid5(chunk.id, "TextSummary"),

⚠️ Potential issue

Verify the use of uuid5 for generating unique IDs

In the line id = uuid5(chunk.id, "TextSummary"), ensure that chunk.id is a UUID object, as uuid5 expects a UUID namespace as its first argument. If chunk.id is not a UUID, this may result in an error.

If chunk.id is not a UUID, consider using a valid UUID namespace like NAMESPACE_DNS or converting chunk.id to a UUID. For example:

from uuid import uuid5, NAMESPACE_DNS
# ...

id = uuid5(NAMESPACE_DNS, f"{chunk.id}-TextSummary")
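
Conversely, if chunk.id already is a UUID, the original call is valid and deterministic; a quick illustration with a stand-in ID:

from uuid import uuid5, NAMESPACE_DNS

chunk_id = uuid5(NAMESPACE_DNS, "example-chunk")      # stand-in for a UUID chunk.id
summary_id = uuid5(chunk_id, "TextSummary")           # valid: the namespace is a UUID
assert summary_id == uuid5(chunk_id, "TextSummary")   # same inputs, same ID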

Comment on lines +54 to +56
for edge in existing_edges:
    existing_edges_map[edge[0] + edge[1] + edge[2]] = True


🛠️ Refactor suggestion

Prevent potential key collisions by using tuples as dictionary keys.

Currently, keys in existing_edges_map are created by concatenating strings:

existing_edges_map[edge[0] + edge[1] + edge[2]] = True

and

edge_key = str(source_node_id) + str(target_node_id) + relationship_name

String concatenation may lead to key collisions and is less efficient. It is recommended to use tuples as dictionary keys to ensure uniqueness and improve code clarity.

Update the code as follows:

At lines 55-56:

-        existing_edges_map[edge[0] + edge[1] + edge[2]] = True
+        existing_edges_map[(edge[0], edge[1], edge[2])] = True

At lines 102-103:

-        edge_key = str(source_node_id) + str(target_node_id) + relationship_name
+        edge_key = (source_node_id, target_node_id, relationship_name)

The existing membership check, if edge_key not in existing_edges_map:, then works unchanged with tuple keys.

Also applies to: 102-105
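
To make the collision risk concrete, a minimal illustration with made-up IDs:

# Two different edges whose concatenated keys collide, while tuple keys stay distinct.
edge_a = ("node1", "2node3", "rel")
edge_b = ("node12", "node3", "rel")

assert edge_a[0] + edge_a[1] + edge_a[2] == edge_b[0] + edge_b[1] + edge_b[2]  # "node12node3rel" twice
assert (edge_a[0], edge_a[1], edge_a[2]) != (edge_b[0], edge_b[1], edge_b[2])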

Comment on lines +38 to +41
graph_node_edges = [
    (edge.target_node_id, edge.source_node_id, edge.relationship_name) \
    for edge in chunk_graph.edges
]

⚠️ Potential issue

Ensure correct ordering of source and target node IDs in graph_node_edges.

In the construction of graph_node_edges, the tuples are currently formed with edge.target_node_id before edge.source_node_id. This may lead to confusion or errors if the expected convention is (source_node_id, target_node_id, relationship_name). Please verify and adjust the order to maintain consistency throughout the codebase.

Apply this diff to correct the ordering:

-        (edge.target_node_id, edge.source_node_id, edge.relationship_name) \
+        (edge.source_node_id, edge.target_node_id, edge.relationship_name) \

@borisarzentar
Member

borisarzentar commented Nov 12, 2024

@0xideas Should we close this PR?
Something looks weird here, too many changed files.

@0xideas 0xideas closed this Nov 14, 2024
@soobrosa soobrosa deleted the test-beartype-in-cognify-pipeline branch May 28, 2025 12:54