Conversation

alekszievr (Contributor) commented on Dec 12, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Added _metadata attributes to multiple classes, enhancing data structure with type information.
    • Introduced new classes Car and Person with metadata for improved data modeling.
    • Enhanced search methods across various adapters to include a default limit parameter.
  • Bug Fixes

    • Improved error handling in various database adapters, ensuring better input validation and logging.
  • Documentation

    • Updated class signatures and metadata descriptions for clarity across multiple modules.
  • Refactor

    • Reorganized import statements for better readability and maintenance across several files.

coderabbitai bot (Contributor) commented on Dec 12, 2024

Walkthrough

The pull request introduces several changes across multiple database adapter classes, enhancing the _metadata dictionary in the IndexSchema class to include a "type" key. Additionally, it reorganizes import statements for clarity, refines methods for better error handling and data processing, and introduces new classes with associated metadata. Overall, the changes improve code organization and maintain functionality across the database adapters.
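
To make the pattern concrete, here is a minimal sketch of what a DataPoint subclass looks like after this change (the Product class is hypothetical and DataPoint is stubbed with pydantic so the snippet runs standalone; real examples in the PR include Entity and DocumentChunk):

from pydantic import BaseModel

# Stand-in for cognee.infrastructure.engine.DataPoint, stubbed for illustration.
class DataPoint(BaseModel):
    _metadata: dict = {"index_fields": [], "type": "DataPoint"}

# Hypothetical subclass showing the pattern this PR applies across the codebase.
class Product(DataPoint):
    name: str
    _metadata: dict = {
        "index_fields": ["name"],  # fields the vector store indexes for search
        "type": "Product",         # runtime type information added by this PR
    }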

Changes

File Path | Change Summary
cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py | Updated _metadata in IndexSchema, reorganized imports, refined the create_data_points and delete_nodes methods, enhanced error handling in delete_graph.
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py | Updated _metadata in IndexSchema, rearranged imports, defined LanceDataPoint as a generic class in create_collection and create_data_points.
cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py | Updated _metadata in IndexSchema, reorganized imports, clarified embedding_engine initialization, enhanced error handling in create_collection and create_data_points.
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py | Updated _metadata in IndexSchema, reorganized imports, added a PGVectorDataPoint class in create_collection and create_data_points, enhanced error handling in get_table and search.
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py | Updated _metadata in IndexSchema, updated method signatures for create_data_points and index_data_points, added a limit parameter to search.
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py | Updated _metadata in IndexSchema, modified the create_collection, create_data_points, and index_data_points methods for better handling and error logging.
cognee/infrastructure/engine/models/DataPoint.py | Updated _metadata in DataPoint, reorganized imports, set a default for the updated_at field.
cognee/modules/chunking/models/DocumentChunk.py | Added _metadata with "type": "DocumentChunk" to the DocumentChunk class.
cognee/modules/data/processing/document_types/Document.py | Removed the type attribute, added _metadata with "type": "Document" to the Document class.
cognee/modules/engine/models/Entity.py | Added _metadata with "type": "Entity" to the Entity class.
cognee/modules/engine/models/EntityType.py | Removed the type attribute, added _metadata with "type": "EntityType" to the EntityType class.
cognee/modules/graph/models/EdgeType.py | Added _metadata with "type": "EdgeType" to the EdgeType class.
cognee/modules/graph/utils/convert_node_to_data_point.py | Changed access of "type" from node_data to _metadata["type"].
cognee/shared/CodeGraphEntities.py | Removed type attributes, replaced them with _metadata dictionaries for Repository, CodeFile, CodePart, and CodeRelationship.
cognee/shared/SourceCodeGraph.py | Removed Literal type annotations, added _metadata dictionaries for various classes.
cognee/tasks/storage/index_data_points.py | Modified the index_data_points and get_data_points_from_model functions, added Car and Person classes with _metadata.
cognee/tasks/summarization/models.py | Added _metadata with "type": "TextSummary" and "type": "CodeSummary" to the respective classes.
cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py | Added _metadata to the Repository, CodeFile, and CodePart classes.
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py | Added _metadata to the Document, DocumentChunk, EntityType, and Entity classes.

Suggested labels

run-checks

Suggested reviewers

  • dexters1
  • hajdul88

🐇 In the code we hop and play,
With metadata brightening the way.
Each class now knows its type,
Enhancing structure, oh so ripe!
As we code, let joy abound,
In every line, new wonders found! 🐇


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d9d7cc and 313ca9b.

📒 Files selected for processing (1)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py


coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (17)
cognee/modules/data/processing/document_types/Document.py (1)

15-16: Implement or document the unimplemented read method.

The read method is currently a placeholder. Consider either implementing it or adding a docstring explaining why it's left unimplemented.

Would you like me to help implement this method or create an issue to track this task?
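
If the documented-placeholder route is taken, a minimal sketch could look like this (the exact signature of read is assumed here, not taken from the file):

def read(self) -> str:
    """Intentionally unimplemented on the base Document.

    Concrete document types (audio, unstructured, text, ...) are expected
    to override this with a format-specific implementation.
    """
    raise NotImplementedError("Subclasses of Document must implement read()")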

cognee/shared/CodeGraphEntities.py (1)

2-2: Remove unnecessary empty lines.

There are several unnecessary empty lines that could be removed to improve code organization.

-
-
 from cognee.infrastructure.engine import DataPoint
-
-
 class Repository(DataPoint):

Also applies to: 5-5, 32-32

cognee/infrastructure/engine/models/DataPoint.py (1)

45-45: Consider adding null safety check

The simplified return statement assumes _metadata is always present. While this is set in the class definition, derived classes might override it.

-        return data_point._metadata["index_fields"] or []
+        return data_point._metadata.get("index_fields", []) if data_point._metadata else []
cognee/shared/SourceCodeGraph.py (1)

1-1: Architectural improvement: Runtime type information

Good architectural decision to move from compile-time Literal types to runtime metadata. This change:

  1. Maintains type information at runtime
  2. Aligns with the DataPoint base class pattern
  3. Provides consistent type identification across the system

Consider documenting this architectural decision in the project's ADR (Architecture Decision Records) to explain the rationale for future maintainers.

Also applies to: 15-15, 24-24, 36-36, 48-48, 60-60, 69-69, 79-79, 97-97
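
As a self-contained illustration of what runtime type information enables (plain classes stand in for cognee's pydantic models; the lookup helper is hypothetical, loosely modeled on the PR's change to convert_node_to_data_point):

# Illustrative only: dispatch on the runtime type stored in _metadata.
class DataPoint:  # plain stand-in for cognee's DataPoint base class
    _metadata = {"index_fields": [], "type": "DataPoint"}

class Entity(DataPoint):
    _metadata = {"index_fields": ["name"], "type": "Entity"}

def find_subclass_by_type(type_name: str) -> type:
    for subclass in DataPoint.__subclasses__():
        if subclass._metadata["type"] == type_name:
            return subclass
    raise ValueError(f"no DataPoint subclass for type {type_name!r}")

assert find_subclass_by_type("Entity") is Entity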

cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (1)

7-12: Improve import organization

Consider grouping related imports together and adding a blank line between different groups:

 from __future__ import annotations
 
 import asyncio
 import logging
 from typing import List, Optional
 from uuid import UUID
-
-from cognee.infrastructure.engine import DataPoint
-
-from ..embeddings.EmbeddingEngine import EmbeddingEngine
-from ..models.ScoredResult import ScoredResult
-from ..vector_db_interface import VectorDBInterface
+
+from cognee.infrastructure.engine import DataPoint
+from ..embeddings.EmbeddingEngine import EmbeddingEngine
+from ..models.ScoredResult import ScoredResult
+from ..vector_db_interface import VectorDBInterface
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (4)

8-11: Consider grouping related imports together

The imports could be better organized by grouping related imports:

  1. Standard library imports (typing, uuid, etc.)
  2. Third-party imports (embeddings, models)
  3. Local imports (exceptions, infrastructure)
from ..embeddings.EmbeddingEngine import EmbeddingEngine
-from ..models.ScoredResult import ScoredResult
-from ..vector_db_interface import VectorDBInterface
+from ..vector_db_interface import VectorDBInterface
+from ..models.ScoredResult import ScoredResult

Line range hint 119-134: Enhance error handling in batch operations

The batch operation error handling could be improved by:

  1. Adding specific error types
  2. Providing more context in error messages
  3. Ensuring proper cleanup in case of partial failures
 try:
     if len(data_points) > 1:
         with collection.batch.dynamic() as batch:
             for data_point in data_points:
+                try:
                     batch.add_object(
                         uuid = data_point.uuid,
                         vector = data_point.vector,
                         properties = data_point.properties,
                         references = data_point.references,
                     )
+                except Exception as e:
+                    logger.error("Failed to add data point %s: %s", data_point.uuid, str(e))
+                    raise
     else:
         data_point: DataObject = data_points[0]
         if collection.data.exists(data_point.uuid):

Line range hint 187-193: Consider adding retry logic for search operations

The search operation could benefit from retry logic to handle temporary network issues or service unavailability.

+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
     async def search(
             self,
             collection_name: str,
             query_text: Optional[str] = None,
             query_vector: Optional[List[float]] = None,
             limit: int = None,
             with_vector: bool = False
     ):
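
Note that retry, stop_after_attempt, and wait_exponential are not defined in the adapter today; the decorator presumably comes from a retry library such as tenacity, which would need to be imported and added as a dependency:

# Assumes the tenacity library; it is not a confirmed cognee dependency.
from tenacity import retry, stop_after_attempt, wait_exponential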

Line range hint 234-236: Add parameter validation in batch_search

The batch_search method should validate its input parameters similar to the single search method.

     async def batch_search(self, collection_name: str, query_texts: List[str], limit: int, with_vectors: bool = False):
+        if not query_texts:
+            raise InvalidValueError(message="query_texts cannot be empty")
         query_vectors = await self.embed_data(query_texts)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (2)

Line range hint 91-121: Consider simplifying generic type implementation

The generic type implementation could be simplified while maintaining type safety.

-        IdType = TypeVar("IdType")
-        PayloadSchema = TypeVar("PayloadSchema")
-        vector_size = self.embedding_engine.get_vector_size()
-
-        class LanceDataPoint(LanceModel, Generic[IdType, PayloadSchema]):
+        class LanceDataPoint(LanceModel):
             id: str
-            vector: Vector(vector_size)
+            vector: Vector(self.embedding_engine.get_vector_size())
-            payload: PayloadSchema
+            payload: dict  # PayloadSchema is undefined once its TypeVar is removed

Line range hint 142-156: Add error handling for collection operations

The retrieve operation should include error handling for cases where the collection doesn't exist.

     async def retrieve(self, collection_name: str, data_point_ids: list[str]):
         connection = await self.get_connection()
+        if not await self.has_collection(collection_name):
+            raise InvalidValueError(f"Collection {collection_name} does not exist")
         collection = await connection.open_table(collection_name)
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (3)

Line range hint 28-41: Consider using dataclasses for configuration

The configuration handling could be improved using dataclasses for better type safety and validation.

+from dataclasses import dataclass
+
+@dataclass
+class HnswConfig:
+    m: int = 16
+    ef_construct: int = 100
+
 def create_hnsw_config(hnsw_config: Dict):
     if hnsw_config is not None:
-        return models.HnswConfig()
+        return models.HnswConfig(**HnswConfig(**hnsw_config).__dict__)
     return None

Line range hint 95-112: Add connection pooling for better resource management

The client connection handling could be improved with connection pooling.

Consider implementing a connection pool to manage client connections more efficiently and prevent resource exhaustion during high concurrent loads.
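
A minimal sketch of what such a pool could look like (the factory-based design and default pool size are assumptions, not cognee's API; client_factory would construct e.g. an AsyncQdrantClient):

import asyncio
from contextlib import asynccontextmanager

class ClientPool:
    def __init__(self, client_factory, pool_size: int = 4):
        # Pre-create a bounded number of clients and hand them out on demand.
        self._pool: asyncio.Queue = asyncio.Queue(maxsize=pool_size)
        for _ in range(pool_size):
            self._pool.put_nowait(client_factory())

    @asynccontextmanager
    async def acquire(self):
        client = await self._pool.get()  # waits when all clients are in use
        try:
            yield client
        finally:
            self._pool.put_nowait(client)  # return the client to the pool

Callers would then wrap each request in async with pool.acquire() as client, so concurrent searches share a bounded set of connections.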


Line range hint 251-253: Improve batch search result filtering

The current filtering approach in batch_search might drop important results.

-        return [filter(lambda result: result.score > 0.9, result_group) for result_group in results]
+        return [list(filter(lambda result: result.score > 0.9, result_group)) for result_group in results]

The current implementation:

  1. Returns filter objects instead of lists
  2. Uses a hard-coded threshold that might not be suitable for all use cases
  3. Doesn't provide a way to customize the filtering threshold
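
For the latter two points, a parameterized sketch (the score_threshold keyword is illustrative, not an existing parameter):

# Illustrative alternative: configurable threshold, concrete lists returned.
def filter_result_groups(results, score_threshold: float = 0.9):
    return [
        [result for result in group if result.score > score_threshold]
        for group in results
    ]
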
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)

Line range hint 82-94: Eliminate duplicate PGVectorDataPoint class definition

The PGVectorDataPoint class is defined identically in both create_collection and create_data_points methods. This duplication could lead to maintenance issues if the class definition needs to be updated.

Consider moving the class definition to module level or creating a factory method:

+ def create_pgvector_data_point_class(collection_name: str, vector_size: int):
+     from pgvector.sqlalchemy import Vector
+     class PGVectorDataPoint(Base):
+         __tablename__ = collection_name
+         __table_args__ = {"extend_existing": True}
+         primary_key: Mapped[int] = mapped_column(
+             primary_key=True, autoincrement=True
+         )
+         id: Mapped[Any]  # Type will be set based on data_points
+         payload = Column(JSON)
+         vector = Column(Vector(vector_size))
+
+         def __init__(self, id, payload, vector):
+             self.id = id
+             self.payload = payload
+             self.vector = vector
+     return PGVectorDataPoint

  async def create_collection(self, collection_name: str, payload_schema=None):
      data_point_types = get_type_hints(DataPoint)
      vector_size = self.embedding_engine.get_vector_size()

      if not await self.has_collection(collection_name):
-         from pgvector.sqlalchemy import Vector
-         class PGVectorDataPoint(Base):
-             __tablename__ = collection_name
-             ...
+         PGVectorDataPoint = create_pgvector_data_point_class(collection_name, vector_size)

Also applies to: 124-136


17-17: Implement similarity score normalization using the imported utility

The normalize_distances utility is imported but not used, while there are TODO comments about normalizing similarity scores.

Consider implementing the normalization:

  # Extract distances and find min/max for normalization
+ distances = [vector.similarity for vector in closest_items]
+ normalized_distances = normalize_distances(distances)
  for vector in closest_items:
-     # TODO: Add normalization of similarity score
      vector_list.append(vector)

  # Create and return ScoredResult objects
  return [
      ScoredResult(
          id = UUID(str(row.id)),
          payload = row.payload,
-         score = row.similarity
+         score = normalized_distances[i]
      ) for i, row in enumerate(vector_list)
  ]

Also applies to: 214-214, 267-267
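
For context, helpers of this kind typically min-max scale the raw distances; an illustrative version (cognee's actual normalize_distances may differ):

def normalize_distances(distances: list[float]) -> list[float]:
    lo, hi = min(distances), max(distances)
    if hi == lo:
        return [0.0] * len(distances)  # all candidates equally distant
    return [(d - lo) / (hi - lo) for d in distances]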

cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (1)

Line range hint 359-360: Fix missing await in delete_nodes method

The delete_data_points call is not awaited, which could lead to unhandled promises and race conditions.

Apply this fix:

  async def delete_nodes(self, collection_name: str, data_point_ids: list[str]):
-     self.delete_data_points(data_point_ids)
+     await self.delete_data_points(data_point_ids)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0bb54e and 81c355e.

📒 Files selected for processing (19)
  • cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (2 hunks)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1 hunks)
  • cognee/infrastructure/engine/models/DataPoint.py (3 hunks)
  • cognee/modules/chunking/models/DocumentChunk.py (2 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/modules/engine/models/Entity.py (1 hunks)
  • cognee/modules/engine/models/EntityType.py (1 hunks)
  • cognee/modules/graph/models/EdgeType.py (1 hunks)
  • cognee/modules/graph/utils/convert_node_to_data_point.py (1 hunks)
  • cognee/shared/CodeGraphEntities.py (1 hunks)
  • cognee/shared/SourceCodeGraph.py (2 hunks)
  • cognee/tasks/storage/index_data_points.py (2 hunks)
  • cognee/tasks/summarization/models.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • cognee/modules/engine/models/Entity.py
  • cognee/modules/chunking/models/DocumentChunk.py
🔇 Additional comments (15)
cognee/modules/engine/models/EntityType.py (1)

8-12: LGTM! Clean metadata implementation.

The metadata structure properly includes both the type information and index fields, following the standardized pattern across the codebase.

cognee/tasks/summarization/models.py (1)

13-13: LGTM! Consistent type metadata implementation.

The addition of the "type" key to _metadata is consistent with the PR objective and follows the same pattern across both TextSummary and CodeSummary classes.

Let's verify the DataPoint base class implementation:

Also applies to: 24-24

✅ Verification successful

LGTM! The type metadata implementation is consistent with the base class.

The DataPoint base class already includes the "type" key in its _metadata dictionary with a value of "DataPoint", and the TextSummary and CodeSummary classes correctly override this value with their respective class names. The implementation maintains consistency across the inheritance hierarchy.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if DataPoint base class has consistent metadata implementation
ast-grep --pattern 'class DataPoint {
  $$$
  _metadata: dict = {
    $$$
    "type": $_
    $$$
  }
  $$$
}'

Length of output: 111


Script:

#!/bin/bash
# Let's try a different approach to find the DataPoint class implementation
ast-grep --pattern 'class DataPoint'

# Also search for any metadata-related code in DataPoint
rg -A 5 "_metadata.*=.*{" 

Length of output: 160


Script:

#!/bin/bash
# Let's try a simpler approach to find the DataPoint class and its metadata
rg "_metadata" --type py

# Also search for DataPoint class definition
rg "class DataPoint" --type py

Length of output: 15720


Script:

#!/bin/bash
# Let's examine the DataPoint class implementation
cat cognee/infrastructure/engine/models/DataPoint.py

Length of output: 1524

cognee/shared/CodeGraphEntities.py (1)

9-12: LGTM! Consistent metadata implementation across entities.

The implementation of type metadata is consistent across all classes. The inclusion of index_fields where appropriate (Repository, CodeFile, CodePart) and its omission in CodeRelationship (which doesn't need indexed fields) shows good design consideration.

Let's verify the consistency of index_fields usage across the codebase:

Also applies to: 24-25, 34-35, 42-44

✅ Verification successful

Metadata implementation is indeed consistent across the codebase

The verification confirms that:

  • All entities that need indexed fields (Repository, CodeFile, CodePart) correctly include "index_fields" in their metadata
  • CodeRelationship correctly omits "index_fields" as it doesn't require indexed fields
  • This pattern is consistently followed across the entire codebase, with other entities like TextSummary, CodeSummary, and various database adapters also properly implementing their metadata with appropriate index fields
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any inconsistencies in index_fields usage
ast-grep --pattern 'class $_ {
  $$$
  _metadata: dict = {
    "index_fields": $_,
    $$$
  }
  $$$
}'

Length of output: 105


Script:

#!/bin/bash
# Let's try a different approach to find metadata definitions in Python files
rg -U "_metadata.*=.*{.*index_fields" --type py -A 3

Length of output: 181


Script:

#!/bin/bash
# Search for _metadata dictionary definitions in Python files
rg "_metadata: dict = \{" --type py -A 4

Length of output: 5143

cognee/infrastructure/engine/models/DataPoint.py (1)

18-19: LGTM: Consistent metadata type addition

The addition of "type": "DataPoint" to the _metadata dictionary aligns with the PR objectives and establishes a consistent pattern for type identification across derived classes.

cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1)

12-14: LGTM: Consistent type metadata across test classes

The addition of type metadata to all test classes follows the established pattern and matches class names appropriately. This ensures test data accurately reflects the production behavior.

Also applies to: 20-22, 26-28, 33-35

cognee/shared/SourceCodeGraph.py (1)

99-104: LGTM: Circular reference handling maintained

The model_rebuild() calls are correctly maintained for classes with circular references, ensuring proper model initialization.

cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (3)

14-16: LGTM: Repository metadata type addition

The type metadata accurately reflects the class name.


23-25: LGTM: CodeFile metadata type addition

The type metadata accurately reflects the class name.


30-32: LGTM: CodePart metadata type addition

The type metadata accurately reflects the class name.

cognee/tasks/storage/index_data_points.py (2)

84-86: LGTM: Car metadata type addition

The type metadata accurately reflects the class name.


93-95: LGTM: Person metadata type addition

The type metadata accurately reflects the class name.

cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (1)

21-22: LGTM: IndexSchema metadata type addition

The type metadata accurately reflects the class name while preserving the existing index_fields configuration.

cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)

24-27: LGTM: Type metadata addition

The addition of the "type" field to metadata is consistent with other adapters and improves type identification.

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

3-20: LGTM! Import organization and metadata changes look good.

The changes improve code organization and implement the PR objective of adding type metadata.

Also applies to: 26-27

cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (1)

5-16: LGTM! Import organization and metadata changes look good.

The changes improve code organization and implement the PR objective of adding type metadata consistently with other adapters.

Also applies to: 22-23

Comment on lines 11 to 14
_metadata: dict = {
"type": "Document"
}

💡 Codebase verification

Add index_fields to Document metadata for consistency

Looking at the codebase, I can confirm that most DataPoint subclasses, especially those dealing with text or searchable content, include "index_fields" in their metadata. Specifically:

  • EntityType has index_fields: ["name"]
  • DocumentChunk has index_fields: ["text"]
  • Entity has index_fields: ["name"]
  • CodeFile/Repository/CodePart have index_fields: ["source_code"]

Since Document is also a searchable entity that contains text content, it should follow this pattern for consistency.

🔗 Analysis chain

Consider adding index_fields to metadata.

Other DataPoint subclasses (e.g., EntityType) include "index_fields" in their metadata. Consider whether Document should also specify its indexed fields for consistency.

Let's check if other DataPoint subclasses consistently use index_fields:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for _metadata declarations with index_fields
ast-grep --pattern '$$ = {
  "index_fields": $$$,
  $$$
}'

Length of output: 60


Script:

#!/bin/bash
# Let's search for all _metadata declarations in DataPoint subclasses
rg "_metadata.*=.*{" --type py -A 3

Length of output: 131


Script:

#!/bin/bash
# Let's search for _metadata declarations with proper escaping
rg "_metadata.*=.*\{" --type py -A 3

Length of output: 10977
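
Returning to the suggestion: a sketch of the proposed shape for Document (the choice of indexed field is an assumption; the follow-up commit "Add missing index_fields" addressed this):

from cognee.infrastructure.engine import DataPoint

class Document(DataPoint):
    _metadata: dict = {
        "index_fields": ["name"],  # assumed field; depends on Document's schema
        "type": "Document",
    }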

alekszievr force-pushed the feat/COG-651-add-type-to-datapoint-metadata branch from 86b6146 to 6a99fc3 on December 12, 2024 at 17:41

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1)

36-39: Consider indexing the EntityType relationship.

While indexing the name field is appropriate, consider also including is_type in the index_fields list to optimize queries that filter entities by their type.

     _metadata = {
-        "index_fields": ["name"],
+        "index_fields": ["name", "is_type"],
         "type": "Entity"
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86b6146 and 6a99fc3.

📒 Files selected for processing (5)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/shared/SourceCodeGraph.py (2 hunks)
  • cognee/tasks/storage/index_data_points.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • cognee/modules/data/processing/document_types/Document.py
  • cognee/tasks/storage/index_data_points.py
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py
  • cognee/shared/SourceCodeGraph.py
🔇 Additional comments (3)
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (3)

4-4: LGTM!

The import reordering is acceptable and doesn't affect functionality.


21-24: LGTM!

The metadata structure is well-defined with appropriate indexing on the text field.


28-31: LGTM!

The metadata structure is well-defined with appropriate indexing on the name field.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

Line range hint 191-193: Consider implementing similarity score normalization and reducing code duplication.

  1. The TODO comments indicate missing score normalization. The normalize_distances utility is already imported but not used.
  2. There's significant code duplication between get_distance_from_collection_elements and search methods.

Consider extracting the common functionality into a private method:

async def _search_vectors(
    self,
    collection_name: str,
    query_vector: List[float],
    limit: Optional[int] = None
) -> List[ScoredResult]:
    PGVectorDataPoint = await self.get_table(collection_name)
    
    async with self.get_async_session() as session:
        query = select(
            PGVectorDataPoint,
            PGVectorDataPoint.c.vector.cosine_distance(query_vector).label("similarity"),
        ).order_by("similarity")
        
        if limit:
            query = query.limit(limit)
            
        results = await session.execute(query)
        vector_list = [row for row in results]
        
        # Normalize similarity scores
        scores = [v.similarity for v in vector_list]
        normalized_scores = normalize_distances(scores)
        
        return [
            ScoredResult(
                id=UUID(str(row.id)),
                payload=row.payload,
                score=norm_score
            ) for row, norm_score in zip(vector_list, normalized_scores)
        ]

Then update both methods to use this common implementation:

async def get_distance_from_collection_elements(
    self,
    collection_name: str,
    query_text: str = None,
    query_vector: List[float] = None,
    with_vector: bool = False
) -> List[ScoredResult]:
    if query_text is None and query_vector is None:
        raise ValueError("One of query_text or query_vector must be provided!")

    if query_text and not query_vector:
        query_vector = (await self.embedding_engine.embed_text([query_text]))[0]

    return await self._search_vectors(collection_name, query_vector)

async def search(
    self,
    collection_name: str,
    query_text: Optional[str] = None,
    query_vector: Optional[List[float]] = None,
    limit: int = 5,
    with_vector: bool = False,
) -> List[ScoredResult]:
    if query_text is None and query_vector is None:
        raise InvalidValueError(message="One of query_text or query_vector must be provided!")

    if query_text and not query_vector:
        query_vector = (await self.embedding_engine.embed_text([query_text]))[0]

    return await self._search_vectors(collection_name, query_vector, limit)

Also applies to: 249-251

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a99fc3 and 9682b5f.

📒 Files selected for processing (1)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
🔇 Additional comments (2)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)

3-20: LGTM! Import organization looks good.

The imports are well-organized with system imports followed by relative imports, and all imports are being used in the code.


26-27: LGTM! Type metadata addition is consistent.

The addition of the "type" field to IndexSchema metadata aligns with the PR objective and maintains consistency across different adapters.

alekszievr force-pushed the feat/COG-651-add-type-to-datapoint-metadata branch from 9682b5f to c097aee on December 13, 2024 at 09:18

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
cognee/shared/SourceCodeGraph.py (4)

73-76: Consider using an enum for called_by field

While the metadata changes look good, consider replacing Literal["main"] with an enum to maintain consistency with the new type system approach.

from enum import Enum

class CallerType(Enum):
    MAIN = "main"

# Then update the type hint:
called_by: Union[Function, CallerType]

84-87: Consider type alias for complex Union type

While the implementation is correct, consider creating a type alias for the complex Union type in members for better readability.

from typing import TypeAlias

ExpressionMember: TypeAlias = Union[Variable, Function, Operator, "Expression"]
# Then use:
members: List[ExpressionMember]

103-106: Consider type alias for nodes Union type

Similar to the previous suggestion, consider creating a type alias for the nodes Union type to improve code readability.

from typing import TypeAlias

GraphNode: TypeAlias = Union[
    Class,
    ClassInstance,
    Function,
    FunctionCall,
    Variable,
    Operator,
    Expression,
]
# Then use:
nodes: List[GraphNode]

Line range hint 1-112: Well-structured type system migration

The migration from Literal types to metadata maintains a clear hierarchy while improving consistency in type handling. The approach aligns well with object-oriented principles and maintains clear separation of concerns.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9682b5f and 23fc77a.

📒 Files selected for processing (19)
  • cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (2 hunks)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (3 hunks)
  • cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1 hunks)
  • cognee/infrastructure/engine/models/DataPoint.py (3 hunks)
  • cognee/modules/chunking/models/DocumentChunk.py (2 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/modules/engine/models/Entity.py (1 hunks)
  • cognee/modules/engine/models/EntityType.py (1 hunks)
  • cognee/modules/graph/models/EdgeType.py (1 hunks)
  • cognee/modules/graph/utils/convert_node_to_data_point.py (1 hunks)
  • cognee/shared/CodeGraphEntities.py (1 hunks)
  • cognee/shared/SourceCodeGraph.py (2 hunks)
  • cognee/tasks/storage/index_data_points.py (2 hunks)
  • cognee/tasks/summarization/models.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (18)
  • cognee/modules/chunking/models/DocumentChunk.py
  • cognee/modules/graph/utils/convert_node_to_data_point.py
  • cognee/modules/engine/models/Entity.py
  • cognee/modules/graph/models/EdgeType.py
  • cognee/modules/engine/models/EntityType.py
  • cognee/infrastructure/engine/models/DataPoint.py
  • cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py
  • cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py
  • cognee/modules/data/processing/document_types/Document.py
  • cognee/tasks/summarization/models.py
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
  • cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py
  • cognee/shared/CodeGraphEntities.py
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
  • cognee/tasks/storage/index_data_points.py
  • cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py
🔇 Additional comments (3)
cognee/shared/SourceCodeGraph.py (3)

24-27: LGTM! Consistent metadata structure

The metadata structure is consistently implemented across these classes with appropriate index fields for searchable entities.

Also applies to: 38-40, 51-53, 64-66


14-17: Verify type attribute usage in dependent code

The migration from a type Literal to metadata dictionary is a breaking change that could affect code that directly accesses the type attribute.

✅ Verification successful

Migration to metadata dictionary is safe to proceed

The type attribute usage in the codebase is safe for this migration as:

  1. The .type accesses in extract_code_parts.py and get_local_dependencies.py are for AST node types, not for our DataPoint types.
  2. The .type usage in graph utils is for node type generation and naming, which will continue to work as the type information is preserved in the metadata dictionary.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any direct access to .type attribute that might break
rg -l '\.type\s*[=!]=\s*["\']Variable["\']' || \
rg -l '\.type\s*[=!]=\s*["\']Class["\']' || \
rg -l '\.type\s*[=!]=\s*["\']Function["\']'

Length of output: 116


Script:

#!/bin/bash
# Search for direct access to type attribute with proper quote handling
rg -l "\.type\s*[=!]=\s*['\"]Variable['\"]" || \
rg -l "\.type\s*[=!]=\s*['\"]Class['\"]" || \
rg -l "\.type\s*[=!]=\s*['\"]Function['\"]"

# Also search for type attribute access in method calls
rg -l "\.type\s+in\s+|\.type\s+==\s+|\.type\s+!=\s+" 

# Search for any other potential type attribute usage
ast-grep --pattern 'node.type'

Length of output: 684


Script:

#!/bin/bash
# Check the specific usage in the identified files
rg -A 5 "\.type" cognee/tasks/repo_processor/extract_code_parts.py cognee/tasks/repo_processor/get_local_dependencies.py

# Check if these files are accessing metadata instead
rg -A 5 "_metadata" cognee/tasks/repo_processor/extract_code_parts.py cognee/tasks/repo_processor/get_local_dependencies.py

# Check the node type usage in graph utils
cat cognee/modules/graph/utils/retrieve_existing_edges.py
cat cognee/modules/graph/utils/expand_with_nodes_and_edges.py

Length of output: 7051


108-112: Verify model_rebuild coverage

Some classes (Variable, Operator) don't have model_rebuild calls. Please verify if this is intentional.

✅ Verification successful

Let me gather more information about the model_rebuild implementations and their usage patterns.


Let me search for the inheritance hierarchy and model definitions to understand if Variable and Operator are meant to have model_rebuild.


Based on the gathered information, I can now provide the final response:

model_rebuild() calls are correctly implemented

The absence of model_rebuild() calls for Variable and Operator classes is correct because:

  1. Both Variable and Operator inherit from DataPoint, which is a base class for data entities
  2. The model_rebuild() is only called after create_model() operations for dynamic model modifications
  3. The current calls in SourceCodeGraph.py are specifically for classes that need dynamic model rebuilding
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for model_rebuild implementation to understand its purpose
rg -A 5 'def model_rebuild' 

# Search for other usage patterns of model_rebuild
rg -l '\.model_rebuild\(\)'

Length of output: 362


Script:

#!/bin/bash
# Search for class definitions and their model_rebuild methods
ast-grep --pattern 'class $_ {
  $$$
  def model_rebuild($$$) {
    $$$
  }
  $$$
}'

# Search for Variable and Operator class definitions
ast-grep --pattern 'class Variable'
ast-grep --pattern 'class Operator'

# Get more context around model_rebuild usage
rg -B 2 -A 2 '\.model_rebuild\(\)'

Length of output: 2452


Script:

#!/bin/bash
# Search for BaseModel inheritance and model definitions
rg -B 2 -A 2 'class.*\(.*BaseModel.*\)'

# Search specifically for Variable and Operator class definitions with more context
rg -B 2 -A 5 'class (Variable|Operator)'

# Look for model_rebuild usage in model creation/initialization
rg -B 2 -A 5 'create_model'

Length of output: 17769

borisarzentar (Member) left a comment

Nice job with this implementation!

alekszievr merged commit bfa0f06 into dev on Dec 16, 2024
24 checks passed
alekszievr deleted the feat/COG-651-add-type-to-datapoint-metadata branch on December 16, 2024 at 15:27
borisarzentar added a commit that referenced this pull request Jan 10, 2025
* feat: Add error handling in case user is already part of database and permission already given to group

Added error handling in case permission is already given to group and user is already part of group

Feature COG-656

* feat: Add user verification for accessing data

Verify user has access to data before returning it

Feature COG-656

* feat: Add compute search to cognee

Add compute search to cognee which makes searches human readable

Feature COG-656

* feat: Add simple instruction for system prompt

Add simple instruction for system prompt

Feature COG-656

* pass pydantic model tocognify

* feat: Add unauth access error to getting data

Raise unauth access error when trying to read data without access

Feature COG-656

* refactor: Rename query compute to query completion

Rename searching type from compute to completion

Refactor COG-656

* chore: Update typo in code

Update typo in string in code

Chore COG-656

* Add mcp to cognee

* Add simple README

* Update cognee-mcp/mcpcognee/__main__.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Create dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve reflection issue when running cognee a second time after pruning data

When running cognee a second time after pruning data some metadata doesn't get pruned.
This makes cognee believe some tables exist that have been deleted

Fix

* fix: Add metadata reflection fix to sqlite as well

Added fix when reflecting metadata to sqlite as well

Fix

* update

* Revert "fix: Add metadata reflection fix to sqlite as well"

This reverts commit 394a0b2.

* COG-810 Implement a top-down dependency graph builder tool (#268)

* feat: parse repo to call graph

* Update/repo_processor/top_down_repo_parse.py task

* fix: minor improvements

* feat: file parsing jedi script optimisation

---------

* Add type to DataPoint metadata (#364)

* Add type to DataPoint metadata

* Add missing index_fields

* Use DataPoint UUID type in pgvector create_data_points

* Make _metadata mandatory everywhere

* Fixes

* Fixes to our demo

* feat: Add search by dataset for cognee

Added ability to search by datasets for cognee users

Feature COG-912

* feat: outsources chunking parameters to extract chunk from documents … (#289)

* feat: outsources chunking parameters to extract chunk from documents task

* fix: Remove backend lock from UI

Removed lock that prevented using multiple datasets in cognify

Fix COG-912

* COG 870 Remove duplicate edges from the code graph (#293)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

---------

Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* test: Added test for getting of documents for search

Added test to verify getting documents related to datasets intended for search

Test COG-912

* Structured code summarization (#375)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

* Structured code summarization

* add missing prompt file

* Remove summarization_model argument from summarize_code and fix typehinting

* minor refactors

---------

Co-authored-by: lxobr <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* fix: Resolve issue with cognify router graph model default value

Resolve issue with default value for graph model in cognify endpoint

Fix

* chore: Resolve typo in getting documents code

Resolve typo in code

chore COG-912

* Update .github/workflows/dockerhub.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve syntax issue with cognify router

Resolve syntax issue with cognify router

Fix

* feat: Add ruff pre-commit hook for linting and formatting

Added formatting and linting on pre-commit hook

Feature COG-650

* chore: Update ruff lint options in pyproject file

Update ruff lint options in pyproject file

Chore

* test: Add ruff linter github action

Added linting check with ruff in github actions

Test COG-650

* feat: deletes executor limit from get_repo_file_dependencies

* feat: implements mock feature in LiteLLM engine

* refactor: Remove changes to cognify router

Remove changes to cognify router

Refactor COG-650

* fix: fixing boolean env for github actions

* test: Add test for ruff format for cognee code

Test if code is formatted for cognee

Test COG-650

* refactor: Rename ruff gh actions

Rename ruff gh actions to be more understandable

Refactor COG-650

* chore: Remove checking of ruff lint and format on push

Remove checking of ruff lint and format on push

Chore COG-650

* feat: Add deletion of local files when deleting data

Delete local files when deleting data from cognee

Feature COG-475

* fix: changes back the max workers to 12

* feat: Adds mock summary for codegraph pipeline

* refactor: Add current development status

Save current development status

Refactor

* Fix langfuse

* Fix langfuse

* Fix langfuse

* Add evaluation notebook

* Rename eval notebook

* chore: Add temporary state of development

Add temp development state to branch

Chore

* fix: Add poetry.lock file, make langfuse mandatory

Added langfuse as mandatory dependency, added poetry.lock file

Fix

* Fix: fixes langfuse config settings

* feat: Add deletion of local files made by cognee through data endpoint

Delete local files made by cognee when deleting data from database through endpoint

Feature COG-475

* test: Revert changes on test_pgvector

Revert changes on test_pgvector which were made to test deletion of local files

Test COG-475

* chore: deletes the old test for the codegraph pipeline

* test: Add test to verify deletion of local files

Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475

* chore: deletes unused old version of the codegraph

* chore: deletes unused imports from code_graph_pipeline

* Ingest non-code files

* Fixing review findings

* Ingest non-code files (#395)

* Ingest non-code files

* Fixing review findings

* test: Update test regarding message

Update assertion message, add veryfing of file existence

* Handle retryerrors in code summary (#396)

* Handle retryerrors in code summary

* Log instead of print

* fix: updates the acreate_structured_output

* chore: Add logging to sentry when file which should exist can't be found

Log to sentry that a file which should exist can't be found

Chore COG-475

* Fix diagram

* fix: refactor mcp

* Add Smithery CLI installation instructions and badge

* Move readme

* Update README.md

* Update README.md

* Cog 813 source code chunks (#383)

* fix: pass the list of all CodeFiles to enrichment task

* feat: introduce SourceCodeChunk, update metadata

* feat: get_source_code_chunks code graph pipeline task

* feat: integrate get_source_code_chunks task, comment out summarize_code

* Fix code summarization (#387)

* feat: update data models

* feat: naive parse long strings in source code

* fix: get_non_py_files instead of get_non_code_files

* fix: limit recursion, add comment

* handle embedding empty input error (#398)

* feat: robustly handle CodeFile source code

* refactor: sort imports

* todo: add support for other embedding models

* feat: add custom logger

* feat: add robustness to get_source_code_chunks

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat: improve embedding exceptions

* refactor: format indents, rename module

---------

Co-authored-by: alekszievr <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Fix diagram

* Fix instructions

* adding and fixing files

* Update README.md

* ruff format

* Fix linter issues

* Implement PR review

* Comment out profiling

* fix: add allowed extensions

* fix: adhere UnstructuredDocument.read() to Document

* feat: time code graph run and add mock support

* Fix ollama, work on visualization

* fix: Fixes faulty logging format and sets up error logging in dynamic steps example

* Overcome ContextWindowExceededError by checking token count while chunking (#413)

* fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints

* Adjust AudioDocument and handle None token limit

* Handle azure models as well

* Add clean logging to code graph example

* Remove setting envvars from arg

* fix: fixes create_cognee_style_network_with_logo unit test

* fix: removes accidental remained print

* Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.

* Fix visualization

* Get embedding engine instead of passing it in code chunking.

* Fix poetry issues

* chore: Update version of poetry install action

* chore: Update action to trigger on pull request for any branch

* chore: Remove if in github action to allow triggering on push

* chore: Remove if condition to allow gh actions to trigger on push to PR

* chore: Update poetry version in github actions

* chore: Set fixed ubuntu version to 22.04

* chore: Update py lint to use ubuntu 22.04

* chore: update ubuntu version to 22.04

* feat: implements the first version of graph based completion in search

* chore: Update python 3.9 gh action to use 3.12 instead

* chore: Update formatting of utils.py

* Fix poetry issues

* Adjust integration tests

* fix: Fixes ruff formatting

* Handle circular import

* fix: Resolve profiler issue with partial and recursive logger imports

Resolve issue for profiler with partial and recursive logger imports

* fix: Remove logger from __init__.py file

* test: Test profiling on HEAD branch

* test: Return profiler to base branch

* Set max_tokens in config

* Adjust SWE-bench script to code graph pipeline call

* Adjust SWE-bench script to code graph pipeline call

* fix: Add fix for accessing dictionary elements that don't exits

Using get for the text key instead of direct access to handle situation if the text key doesn't exist

* feat: Add ability to change graph database configuration through cognee

* feat: adds pydantic types to graph layer models

* feat: adds basic retriever for swe bench

* Match Ruff version in config to the one in github actions

* feat: implements code retreiver

* Fix: fixes unit test for codepart search

* Format with Ruff 0.9.0

* Fix: deleting incorrect repo path

* fix: resolve issue with langfuse dependency installation when integrating cognee in different packages

* version: Increase version to 0.1.21

---------

Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Rita Aleksziev <[email protected]>
Co-authored-by: vasilije <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: lxobr <[email protected]>
Co-authored-by: alekszievr <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Henry Mao <[email protected]>
borisarzentar added a commit that referenced this pull request Jan 13, 2025
* Revert "fix: Add metadata reflection fix to sqlite as well"

This reverts commit 394a0b2.

* COG-810 Implement a top-down dependency graph builder tool (#268)

* feat: parse repo to call graph

* Update/repo_processor/top_down_repo_parse.py task

* fix: minor improvements

* feat: file parsing jedi script optimisation

---------

* Add type to DataPoint metadata (#364)

* Add missing index_fields

* Use DataPoint UUID type in pgvector create_data_points

* Make _metadata mandatory everywhere

* feat: Add search by dataset for cognee

Added ability to search by datasets for cognee users

Feature COG-912

* feat: outsources chunking parameters to extract chunk from documents … (#289)

* feat: outsources chunking parameters to extract chunk from documents task

* fix: Remove backend lock from UI

Removed lock that prevented using multiple datasets in cognify

Fix COG-912

* COG 870 Remove duplicate edges from the code graph (#293)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

---------

Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* test: Added test for getting of documents for search

Added test to verify getting documents related to datasets intended for search

Test COG-912

* Structured code summarization (#375)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

* Structured code summarization

* add missing prompt file

* Remove summarization_model argument from summarize_code and fix typehinting

* minor refactors

---------

Co-authored-by: lxobr <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* fix: Resolve issue with cognify router graph model default value

Resolve issue with default value for graph model in cognify endpoint

Fix

* chore: Resolve typo in getting documents code

Resolve typo in code

chore COG-912

* Update .github/workflows/dockerhub.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve syntax issue with cognify router

Resolve syntax issue with cognify router

Fix

* feat: Add ruff pre-commit hook for linting and formatting

Added formatting and linting on pre-commit hook

Feature COG-650

* chore: Update ruff lint options in pyproject file

Update ruff lint options in pyproject file

Chore

* test: Add ruff linter github action

Added linting check with ruff in github actions

Test COG-650

* feat: deletes executor limit from get_repo_file_dependencies

* feat: implements mock feature in LiteLLM engine

* refactor: Remove changes to cognify router

Remove changes to cognify router

Refactor COG-650

* fix: fixing boolean env for github actions

* test: Add test for ruff format for cognee code

Test if code is formatted for cognee

Test COG-650

* refactor: Rename ruff gh actions

Rename ruff gh actions to be more understandable

Refactor COG-650

* chore: Remove checking of ruff lint and format on push

Remove checking of ruff lint and format on push

Chore COG-650

* feat: Add deletion of local files when deleting data

Delete local files when deleting data from cognee

Feature COG-475

* fix: changes back the max workers to 12

* feat: Adds mock summary for codegraph pipeline

* refacotr: Add current development status

Save current development status

Refactor

* Fix langfuse

* Fix langfuse

* Fix langfuse

* Add evaluation notebook

* Rename eval notebook

* chore: Add temporary state of development

Add temp development state to branch

Chore

* fix: Add poetry.lock file, make langfuse mandatory

Added langfuse as mandatory dependency, added poetry.lock file

Fix

* Fix: fixes langfuse config settings

* feat: Add deletion of local files made by cognee through data endpoint

Delete local files made by cognee when deleting data from database through endpoint

Feature COG-475

* test: Revert changes on test_pgvector

Revert changes on test_pgvector which were made to test deletion of local files

Test COG-475

* chore: deletes the old test for the codegraph pipeline

* test: Add test to verify deletion of local files

Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475

* chore: deletes unused old version of the codegraph

* chore: deletes unused imports from code_graph_pipeline

* Ingest non-code files

* Fixing review findings

* Ingest non-code files (#395)

* Ingest non-code files

* Fixing review findings

* test: Update test regarding message

Update assertion message, add veryfing of file existence

* Handle retryerrors in code summary (#396)

* Handle retryerrors in code summary

* Log instead of print

* fix: updates the acreate_structured_output

* chore: Add logging to sentry when file which should exist can't be found

Log to sentry that a file which should exist can't be found

Chore COG-475

* Fix diagram

* fix: refactor mcp

* Add Smithery CLI installation instructions and badge

* Move readme

* Update README.md

* Update README.md

* Cog 813 source code chunks (#383)

* fix: pass the list of all CodeFiles to enrichment task

* feat: introduce SourceCodeChunk, update metadata

* feat: get_source_code_chunks code graph pipeline task

* feat: integrate get_source_code_chunks task, comment out summarize_code

* Fix code summarization (#387)

* feat: update data models

* feat: naive parse long strings in source code

* fix: get_non_py_files instead of get_non_code_files

* fix: limit recursion, add comment

* handle embedding empty input error (#398)

* feat: robustly handle CodeFile source code

* refactor: sort imports

* todo: add support for other embedding models

* feat: add custom logger

* feat: add robustness to get_source_code_chunks

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat: improve embedding exceptions

* refactor: format indents, rename module

---------

Co-authored-by: alekszievr <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Fix diagram

* Fix diagram

* Fix instructions

* Fix instructions

* adding and fixing files

* Update README.md

* ruff format

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Implement PR review

* Comment out profiling

* Comment out profiling

* Comment out profiling

* fix: add allowed extensions

* fix: adhere UnstructuredDocument.read() to Document

* feat: time code graph run and add mock support

* Fix ollama, work on visualization

* fix: Fixes faulty logging format and sets up error logging in dynamic steps example

* Overcome ContextWindowExceededError by checking token count while chunking (#413)

* fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints

* Adjust AudioDocument and handle None token limit

* Handle azure models as well

* Fix visualization

* Fix visualization

* Fix visualization

* Add clean logging to code graph example

* Remove setting envvars from arg

* fix: fixes create_cognee_style_network_with_logo unit test

* fix: removes accidental remained print

* Fix visualization

* Fix visualization

* Fix visualization

* Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.

* Fix visualization

* Fix visualization

* Fix poetry issues

* Get embedding engine instead of passing it in code chunking.

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* chore: Update version of poetry install action

* chore: Update action to trigger on pull request for any branch

* chore: Remove if in github action to allow triggering on push

* chore: Remove if condition to allow gh actions to trigger on push to PR

* chore: Update poetry version in github actions

* chore: Set fixed ubuntu version to 22.04

* chore: Update py lint to use ubuntu 22.04

* chore: update ubuntu version to 22.04

* feat: implements the first version of graph based completion in search

* chore: Update python 3.9 gh action to use 3.12 instead

* chore: Update formatting of utils.py

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Adjust integration tests

* fix: Fixes ruff formatting

* Handle circular import

* fix: Resolve profiler issue with partial and recursive logger imports

Resolve issue for profiler with partial and recursive logger imports

* fix: Remove logger from __init__.py file

* test: Test profiling on HEAD branch

* test: Return profiler to base branch

* Set max_tokens in config

* Adjust SWE-bench script to code graph pipeline call

* Adjust SWE-bench script to code graph pipeline call

* fix: Add fix for accessing dictionary elements that don't exits

Using get for the text key instead of direct access to handle situation if the text key doesn't exist

* feat: Add ability to change graph database configuration through cognee

* feat: adds pydantic types to graph layer models

* test: Test ubuntu 24.04

* test: change all actions to ubuntu-latest

* feat: adds basic retriever for swe bench

* Match Ruff version in config to the one in github actions

* feat: implements code retreiver

* Fix: fixes unit test for codepart search

* Format with Ruff 0.9.0

* Fix: deleting incorrect repo path

* docs: Add LlamaIndex Cognee integration notebook

Added LlamaIndex Cognee integration notebook

* test: Add github action for testing llama index cognee integration notebook

* fix: resolve issue with langfuse dependency installation when integrating cognee in different packages

* version: Increase version to 0.1.21

* fix: update dependencies of the mcp server

* Update README.md

* Fix: Fixes logging setup

* feat: deletes on the fly embeddings as uses edge collections

* fix: Change nbformat on llama index integration notebook

* fix: Resolve api key issue with llama index integration notebook

* fix: Attempt to resolve issue with Ubuntu 24.04 segmentation fault

* version: Increase version to 0.1.22

---------

Co-authored-by: vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: lxobr <[email protected]>
Co-authored-by: alekszievr <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Rita Aleksziev <[email protected]>
Co-authored-by: Henry Mao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants