Conversation

alekszievr (Contributor) commented on Dec 12, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Added _metadata attributes to multiple classes, enhancing data structure with type information.
    • Introduced new classes Car and Person with metadata for improved data modeling.
    • Enhanced search methods across various adapters to include a default limit parameter.
  • Bug Fixes

    • Improved error handling in various database adapters, ensuring better input validation and logging.
  • Documentation

    • Updated class signatures and metadata descriptions for clarity across multiple modules.
  • Refactor

    • Reorganized import statements for better readability and maintenance across several files.

coderabbitai bot (Contributor) commented on Dec 12, 2024

Walkthrough

The pull request introduces several changes across multiple database adapter classes, enhancing the _metadata dictionary in the IndexSchema class to include a "type" key. Additionally, it reorganizes import statements for clarity, refines methods for better error handling and data processing, and introduces new classes with associated metadata. Overall, the changes improve code organization and maintain functionality across the database adapters.
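
To make the pattern concrete, here is a minimal sketch of what a DataPoint subclass looks like after this change (the Product class is hypothetical and DataPoint is stubbed with pydantic so the snippet runs standalone; real examples in the PR include Entity and DocumentChunk):

from pydantic import BaseModel

# Stand-in for cognee.infrastructure.engine.DataPoint, stubbed for illustration.
class DataPoint(BaseModel):
    _metadata: dict = {"index_fields": [], "type": "DataPoint"}

# Hypothetical subclass showing the pattern this PR applies across the codebase.
class Product(DataPoint):
    name: str
    _metadata: dict = {
        "index_fields": ["name"],  # fields the vector store indexes for search
        "type": "Product",         # runtime type information added by this PR
    }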

Changes

File Path | Change Summary
cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py | Updated _metadata in IndexSchema, reorganized imports, refined the create_data_points and delete_nodes methods, enhanced error handling in delete_graph.
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py | Updated _metadata in IndexSchema, rearranged imports, defined LanceDataPoint as a generic class in create_collection and create_data_points.
cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py | Updated _metadata in IndexSchema, reorganized imports, clarified embedding_engine initialization, enhanced error handling in create_collection and create_data_points.
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py | Updated _metadata in IndexSchema, reorganized imports, added a PGVectorDataPoint class in create_collection and create_data_points, enhanced error handling in get_table and search.
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py | Updated _metadata in IndexSchema, updated method signatures for create_data_points and index_data_points, added a limit parameter to search.
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py | Updated _metadata in IndexSchema, modified the create_collection, create_data_points, and index_data_points methods for better handling and error logging.
cognee/infrastructure/engine/models/DataPoint.py | Updated _metadata in DataPoint, reorganized imports, set a default for the updated_at field.
cognee/modules/chunking/models/DocumentChunk.py | Added _metadata with "type": "DocumentChunk" to the DocumentChunk class.
cognee/modules/data/processing/document_types/Document.py | Removed the type attribute, added _metadata with "type": "Document" to the Document class.
cognee/modules/engine/models/Entity.py | Added _metadata with "type": "Entity" to the Entity class.
cognee/modules/engine/models/EntityType.py | Removed the type attribute, added _metadata with "type": "EntityType" to the EntityType class.
cognee/modules/graph/models/EdgeType.py | Added _metadata with "type": "EdgeType" to the EdgeType class.
cognee/modules/graph/utils/convert_node_to_data_point.py | Changed access of "type" from node_data to _metadata["type"].
cognee/shared/CodeGraphEntities.py | Removed type attributes, replaced them with _metadata dictionaries for Repository, CodeFile, CodePart, and CodeRelationship.
cognee/shared/SourceCodeGraph.py | Removed Literal type annotations, added _metadata dictionaries for various classes.
cognee/tasks/storage/index_data_points.py | Modified the index_data_points and get_data_points_from_model functions, added Car and Person classes with _metadata.
cognee/tasks/summarization/models.py | Added _metadata with "type": "TextSummary" and "type": "CodeSummary" to the respective classes.
cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py | Added _metadata to the Repository, CodeFile, and CodePart classes.
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py | Added _metadata to the Document, DocumentChunk, EntityType, and Entity classes.

Suggested labels

run-checks

Suggested reviewers

  • dexters1
  • hajdul88

🐇 In the code we hop and play,
With metadata brightening the way.
Each class now knows its type,
Enhancing structure, oh so ripe!
As we code, let joy abound,
In every line, new wonders found! 🐇


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d9d7cc and 313ca9b.

📒 Files selected for processing (1)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py


coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (17)
cognee/modules/data/processing/document_types/Document.py (1)

15-16: Implement or document the unimplemented read method.

The read method is currently a placeholder. Consider either implementing it or adding a docstring explaining why it's left unimplemented.

Would you like me to help implement this method or create an issue to track this task?
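
If the documented-placeholder route is taken, a minimal sketch could look like this (the exact signature of read is assumed here, not taken from the file):

def read(self) -> str:
    """Intentionally unimplemented on the base Document.

    Concrete document types (audio, unstructured, text, ...) are expected
    to override this with a format-specific implementation.
    """
    raise NotImplementedError("Subclasses of Document must implement read()")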

cognee/shared/CodeGraphEntities.py (1)

2-2: Remove unnecessary empty lines.

There are several unnecessary empty lines that could be removed to improve code organization.

-
-
 from cognee.infrastructure.engine import DataPoint
-
-
 class Repository(DataPoint):

Also applies to: 5-5, 32-32

cognee/infrastructure/engine/models/DataPoint.py (1)

45-45: Consider adding null safety check

The simplified return statement assumes _metadata is always present. While this is set in the class definition, derived classes might override it.

-        return data_point._metadata["index_fields"] or []
+        return data_point._metadata.get("index_fields", []) if data_point._metadata else []
cognee/shared/SourceCodeGraph.py (1)

1-1: Architectural improvement: Runtime type information

Good architectural decision to move from compile-time Literal types to runtime metadata. This change:

  1. Maintains type information at runtime
  2. Aligns with the DataPoint base class pattern
  3. Provides consistent type identification across the system

Consider documenting this architectural decision in the project's ADR (Architecture Decision Records) to explain the rationale for future maintainers.

Also applies to: 15-15, 24-24, 36-36, 48-48, 60-60, 69-69, 79-79, 97-97
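
As a self-contained illustration of what runtime type information enables (plain classes stand in for cognee's pydantic models; the lookup helper is hypothetical, loosely modeled on the PR's change to convert_node_to_data_point):

# Illustrative only: dispatch on the runtime type stored in _metadata.
class DataPoint:  # plain stand-in for cognee's DataPoint base class
    _metadata = {"index_fields": [], "type": "DataPoint"}

class Entity(DataPoint):
    _metadata = {"index_fields": ["name"], "type": "Entity"}

def find_subclass_by_type(type_name: str) -> type:
    for subclass in DataPoint.__subclasses__():
        if subclass._metadata["type"] == type_name:
            return subclass
    raise ValueError(f"no DataPoint subclass for type {type_name!r}")

assert find_subclass_by_type("Entity") is Entity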

cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (1)

7-12: Improve import organization

Consider grouping related imports together and adding a blank line between different groups:

 from __future__ import annotations
 
 import asyncio
 import logging
 from typing import List, Optional
 from uuid import UUID
-
-from cognee.infrastructure.engine import DataPoint
-
-from ..embeddings.EmbeddingEngine import EmbeddingEngine
-from ..models.ScoredResult import ScoredResult
-from ..vector_db_interface import VectorDBInterface
+
+from cognee.infrastructure.engine import DataPoint
+from ..embeddings.EmbeddingEngine import EmbeddingEngine
+from ..models.ScoredResult import ScoredResult
+from ..vector_db_interface import VectorDBInterface
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (4)

8-11: Consider grouping related imports together

The imports could be better organized by grouping related imports:

  1. Standard library imports (typing, uuid, etc.)
  2. Third-party imports (embeddings, models)
  3. Local imports (exceptions, infrastructure)
from ..embeddings.EmbeddingEngine import EmbeddingEngine
-from ..models.ScoredResult import ScoredResult
-from ..vector_db_interface import VectorDBInterface
+from ..vector_db_interface import VectorDBInterface
+from ..models.ScoredResult import ScoredResult

Line range hint 119-134: Enhance error handling in batch operations

The batch operation error handling could be improved by:

  1. Adding specific error types
  2. Providing more context in error messages
  3. Ensuring proper cleanup in case of partial failures
 try:
     if len(data_points) > 1:
         with collection.batch.dynamic() as batch:
             for data_point in data_points:
+                try:
                     batch.add_object(
                         uuid = data_point.uuid,
                         vector = data_point.vector,
                         properties = data_point.properties,
                         references = data_point.references,
                     )
+                except Exception as e:
+                    logger.error("Failed to add data point %s: %s", data_point.uuid, str(e))
+                    raise
     else:
         data_point: DataObject = data_points[0]
         if collection.data.exists(data_point.uuid):

Line range hint 187-193: Consider adding retry logic for search operations

The search operation could benefit from retry logic to handle temporary network issues or service unavailability.

+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
     async def search(
             self,
             collection_name: str,
             query_text: Optional[str] = None,
             query_vector: Optional[List[float]] = None,
             limit: int = None,
             with_vector: bool = False
     ):
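
Note that retry, stop_after_attempt, and wait_exponential are not defined in the adapter today; the decorator presumably comes from a retry library such as tenacity, which would need to be imported and added as a dependency:

# Assumes the tenacity library; it is not a confirmed cognee dependency.
from tenacity import retry, stop_after_attempt, wait_exponential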

Line range hint 234-236: Add parameter validation in batch_search

The batch_search method should validate its input parameters similar to the single search method.

     async def batch_search(self, collection_name: str, query_texts: List[str], limit: int, with_vectors: bool = False):
+        if not query_texts:
+            raise InvalidValueError(message="query_texts cannot be empty")
         query_vectors = await self.embed_data(query_texts)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (2)

Line range hint 91-121: Consider simplifying generic type implementation

The generic type implementation could be simplified while maintaining type safety.

-        IdType = TypeVar("IdType")
-        PayloadSchema = TypeVar("PayloadSchema")
-        vector_size = self.embedding_engine.get_vector_size()
-
-        class LanceDataPoint(LanceModel, Generic[IdType, PayloadSchema]):
+        class LanceDataPoint(LanceModel):
             id: str
-            vector: Vector(vector_size)
+            vector: Vector(self.embedding_engine.get_vector_size())
-            payload: PayloadSchema
+            payload: dict  # PayloadSchema is undefined once its TypeVar is removed

Line range hint 142-156: Add error handling for collection operations

The retrieve operation should include error handling for cases where the collection doesn't exist.

     async def retrieve(self, collection_name: str, data_point_ids: list[str]):
         connection = await self.get_connection()
+        if not await self.has_collection(collection_name):
+            raise InvalidValueError(f"Collection {collection_name} does not exist")
         collection = await connection.open_table(collection_name)
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (3)

Line range hint 28-41: Consider using dataclasses for configuration

The configuration handling could be improved using dataclasses for better type safety and validation.

+from dataclasses import dataclass
+
+@dataclass
+class HnswConfig:
+    m: int = 16
+    ef_construct: int = 100
+
 def create_hnsw_config(hnsw_config: Dict):
     if hnsw_config is not None:
-        return models.HnswConfig()
+        return models.HnswConfig(**HnswConfig(**hnsw_config).__dict__)
     return None

Line range hint 95-112: Add connection pooling for better resource management

The client connection handling could be improved with connection pooling.

Consider implementing a connection pool to manage client connections more efficiently and prevent resource exhaustion during high concurrent loads.
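
A minimal sketch of what such a pool could look like (the factory-based design and default pool size are assumptions, not cognee's API; client_factory would construct e.g. an AsyncQdrantClient):

import asyncio
from contextlib import asynccontextmanager

class ClientPool:
    def __init__(self, client_factory, pool_size: int = 4):
        # Pre-create a bounded number of clients and hand them out on demand.
        self._pool: asyncio.Queue = asyncio.Queue(maxsize=pool_size)
        for _ in range(pool_size):
            self._pool.put_nowait(client_factory())

    @asynccontextmanager
    async def acquire(self):
        client = await self._pool.get()  # waits when all clients are in use
        try:
            yield client
        finally:
            self._pool.put_nowait(client)  # return the client to the pool

Callers would then wrap each request in async with pool.acquire() as client, so concurrent searches share a bounded set of connections.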


Line range hint 251-253: Improve batch search result filtering

The current filtering approach in batch_search might drop important results.

-        return [filter(lambda result: result.score > 0.9, result_group) for result_group in results]
+        return [list(filter(lambda result: result.score > 0.9, result_group)) for result_group in results]

The current implementation:

  1. Returns filter objects instead of lists
  2. Uses a hard-coded threshold that might not be suitable for all use cases
  3. Doesn't provide a way to customize the filtering threshold
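
For the latter two points, a parameterized sketch (the score_threshold keyword is illustrative, not an existing parameter):

# Illustrative alternative: configurable threshold, concrete lists returned.
def filter_result_groups(results, score_threshold: float = 0.9):
    return [
        [result for result in group if result.score > score_threshold]
        for group in results
    ]
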
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)

Line range hint 82-94: Eliminate duplicate PGVectorDataPoint class definition

The PGVectorDataPoint class is defined identically in both create_collection and create_data_points methods. This duplication could lead to maintenance issues if the class definition needs to be updated.

Consider moving the class definition to module level or creating a factory method:

+ def create_pgvector_data_point_class(collection_name: str, vector_size: int):
+     from pgvector.sqlalchemy import Vector
+     class PGVectorDataPoint(Base):
+         __tablename__ = collection_name
+         __table_args__ = {"extend_existing": True}
+         primary_key: Mapped[int] = mapped_column(
+             primary_key=True, autoincrement=True
+         )
+         id: Mapped[Any]  # Type will be set based on data_points
+         payload = Column(JSON)
+         vector = Column(Vector(vector_size))
+
+         def __init__(self, id, payload, vector):
+             self.id = id
+             self.payload = payload
+             self.vector = vector
+     return PGVectorDataPoint

  async def create_collection(self, collection_name: str, payload_schema=None):
      data_point_types = get_type_hints(DataPoint)
      vector_size = self.embedding_engine.get_vector_size()

      if not await self.has_collection(collection_name):
-         from pgvector.sqlalchemy import Vector
-         class PGVectorDataPoint(Base):
-             __tablename__ = collection_name
-             ...
+         PGVectorDataPoint = create_pgvector_data_point_class(collection_name, vector_size)

Also applies to: 124-136


17-17: Implement similarity score normalization using the imported utility

The normalize_distances utility is imported but not used, while there are TODO comments about normalizing similarity scores.

Consider implementing the normalization:

  # Extract distances and find min/max for normalization
+ distances = [vector.similarity for vector in closest_items]
+ normalized_distances = normalize_distances(distances)
  for vector in closest_items:
-     # TODO: Add normalization of similarity score
      vector_list.append(vector)

  # Create and return ScoredResult objects
  return [
      ScoredResult(
          id = UUID(str(row.id)),
          payload = row.payload,
-         score = row.similarity
+         score = normalized_distances[i]
      ) for i, row in enumerate(vector_list)
  ]

Also applies to: 214-214, 267-267
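
For context, helpers of this kind typically min-max scale the raw distances; an illustrative version (cognee's actual normalize_distances may differ):

def normalize_distances(distances: list[float]) -> list[float]:
    lo, hi = min(distances), max(distances)
    if hi == lo:
        return [0.0] * len(distances)  # all candidates equally distant
    return [(d - lo) / (hi - lo) for d in distances]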

cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (1)

Line range hint 359-360: Fix missing await in delete_nodes method

The delete_data_points call is not awaited, which could lead to unhandled promises and race conditions.

Apply this fix:

  async def delete_nodes(self, collection_name: str, data_point_ids: list[str]):
-     self.delete_data_points(data_point_ids)
+     await self.delete_data_points(data_point_ids)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0bb54e and 81c355e.

📒 Files selected for processing (19)
  • cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (2 hunks)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1 hunks)
  • cognee/infrastructure/engine/models/DataPoint.py (3 hunks)
  • cognee/modules/chunking/models/DocumentChunk.py (2 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/modules/engine/models/Entity.py (1 hunks)
  • cognee/modules/engine/models/EntityType.py (1 hunks)
  • cognee/modules/graph/models/EdgeType.py (1 hunks)
  • cognee/modules/graph/utils/convert_node_to_data_point.py (1 hunks)
  • cognee/shared/CodeGraphEntities.py (1 hunks)
  • cognee/shared/SourceCodeGraph.py (2 hunks)
  • cognee/tasks/storage/index_data_points.py (2 hunks)
  • cognee/tasks/summarization/models.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • cognee/modules/engine/models/Entity.py
  • cognee/modules/chunking/models/DocumentChunk.py
🔇 Additional comments (15)
cognee/modules/engine/models/EntityType.py (1)

8-12: LGTM! Clean metadata implementation.

The metadata structure properly includes both the type information and index fields, following the standardized pattern across the codebase.

cognee/tasks/summarization/models.py (1)

13-13: LGTM! Consistent type metadata implementation.

The addition of the "type" key to _metadata is consistent with the PR objective and follows the same pattern across both TextSummary and CodeSummary classes.

Let's verify the DataPoint base class implementation:

Also applies to: 24-24

✅ Verification successful

LGTM! The type metadata implementation is consistent with the base class.

The DataPoint base class already includes the "type" key in its _metadata dictionary with a value of "DataPoint", and the TextSummary and CodeSummary classes correctly override this value with their respective class names. The implementation maintains consistency across the inheritance hierarchy.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if DataPoint base class has consistent metadata implementation
ast-grep --pattern 'class DataPoint {
  $$$
  _metadata: dict = {
    $$$
    "type": $_
    $$$
  }
  $$$
}'

Length of output: 111


Script:

#!/bin/bash
# Let's try a different approach to find the DataPoint class implementation
ast-grep --pattern 'class DataPoint'

# Also search for any metadata-related code in DataPoint
rg -A 5 "_metadata.*=.*{" 

Length of output: 160


Script:

#!/bin/bash
# Let's try a simpler approach to find the DataPoint class and its metadata
rg "_metadata" --type py

# Also search for DataPoint class definition
rg "class DataPoint" --type py

Length of output: 15720


Script:

#!/bin/bash
# Let's examine the DataPoint class implementation
cat cognee/infrastructure/engine/models/DataPoint.py

Length of output: 1524

cognee/shared/CodeGraphEntities.py (1)

9-12: LGTM! Consistent metadata implementation across entities.

The implementation of type metadata is consistent across all classes. The inclusion of index_fields where appropriate (Repository, CodeFile, CodePart) and its omission in CodeRelationship (which doesn't need indexed fields) shows good design consideration.

Let's verify the consistency of index_fields usage across the codebase:

Also applies to: 24-25, 34-35, 42-44

✅ Verification successful

Metadata implementation is indeed consistent across the codebase

The verification confirms that:

  • All entities that need indexed fields (Repository, CodeFile, CodePart) correctly include "index_fields" in their metadata
  • CodeRelationship correctly omits "index_fields" as it doesn't require indexed fields
  • This pattern is consistently followed across the entire codebase, with other entities like TextSummary, CodeSummary, and various database adapters also properly implementing their metadata with appropriate index fields
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any inconsistencies in index_fields usage
ast-grep --pattern 'class $_ {
  $$$
  _metadata: dict = {
    "index_fields": $_,
    $$$
  }
  $$$
}'

Length of output: 105


Script:

#!/bin/bash
# Let's try a different approach to find metadata definitions in Python files
rg -U "_metadata.*=.*{.*index_fields" --type py -A 3

Length of output: 181


Script:

#!/bin/bash
# Search for _metadata dictionary definitions in Python files
rg "_metadata: dict = \{" --type py -A 4

Length of output: 5143

cognee/infrastructure/engine/models/DataPoint.py (1)

18-19: LGTM: Consistent metadata type addition

The addition of "type": "DataPoint" to the _metadata dictionary aligns with the PR objectives and establishes a consistent pattern for type identification across derived classes.

cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1)

12-14: LGTM: Consistent type metadata across test classes

The addition of type metadata to all test classes follows the established pattern and matches class names appropriately. This ensures test data accurately reflects the production behavior.

Also applies to: 20-22, 26-28, 33-35

cognee/shared/SourceCodeGraph.py (1)

99-104: LGTM: Circular reference handling maintained

The model_rebuild() calls are correctly maintained for classes with circular references, ensuring proper model initialization.

cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (3)

14-16: LGTM: Repository metadata type addition

The type metadata accurately reflects the class name.


23-25: LGTM: CodeFile metadata type addition

The type metadata accurately reflects the class name.


30-32: LGTM: CodePart metadata type addition

The type metadata accurately reflects the class name.

cognee/tasks/storage/index_data_points.py (2)

84-86: LGTM: Car metadata type addition

The type metadata accurately reflects the class name.


93-95: LGTM: Person metadata type addition

The type metadata accurately reflects the class name.

cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (1)

21-22: LGTM: IndexSchema metadata type addition

The type metadata accurately reflects the class name while preserving the existing index_fields configuration.

cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)

24-27: LGTM: Type metadata addition

The addition of the "type" field to metadata is consistent with other adapters and improves type identification.

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

3-20: LGTM! Import organization and metadata changes look good.

The changes improve code organization and implement the PR objective of adding type metadata.

Also applies to: 26-27

cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (1)

5-16: LGTM! Import organization and metadata changes look good.

The changes improve code organization and implement the PR objective of adding type metadata consistently with other adapters.

Also applies to: 22-23

Comment on lines 11 to 14
_metadata: dict = {
"type": "Document"
}

💡 Codebase verification

Add index_fields to Document metadata for consistency

Looking at the codebase, I can confirm that most DataPoint subclasses, especially those dealing with text or searchable content, include "index_fields" in their metadata. Specifically:

  • EntityType has index_fields: ["name"]
  • DocumentChunk has index_fields: ["text"]
  • Entity has index_fields: ["name"]
  • CodeFile/Repository/CodePart have index_fields: ["source_code"]

Since Document is also a searchable entity that contains text content, it should follow this pattern for consistency.

🔗 Analysis chain

Consider adding index_fields to metadata.

Other DataPoint subclasses (e.g., EntityType) include "index_fields" in their metadata. Consider whether Document should also specify its indexed fields for consistency.

Let's check if other DataPoint subclasses consistently use index_fields:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for _metadata declarations with index_fields
ast-grep --pattern '$$ = {
  "index_fields": $$$,
  $$$
}'

Length of output: 60


Script:

#!/bin/bash
# Let's search for all _metadata declarations in DataPoint subclasses
rg "_metadata.*=.*{" --type py -A 3

Length of output: 131


Script:

#!/bin/bash
# Let's search for _metadata declarations with proper escaping
rg "_metadata.*=.*\{" --type py -A 3

Length of output: 10977
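
Returning to the suggestion: a sketch of the proposed shape for Document (the choice of indexed field is an assumption; the follow-up commit "Add missing index_fields" addressed this):

from cognee.infrastructure.engine import DataPoint

class Document(DataPoint):
    _metadata: dict = {
        "index_fields": ["name"],  # assumed field; depends on Document's schema
        "type": "Document",
    }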

alekszievr force-pushed the feat/COG-651-add-type-to-datapoint-metadata branch from 86b6146 to 6a99fc3 on December 12, 2024 at 17:41

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1)

36-39: Consider indexing the EntityType relationship.

While indexing the name field is appropriate, consider also including is_type in the index_fields list to optimize queries that filter entities by their type.

     _metadata = {
-        "index_fields": ["name"],
+        "index_fields": ["name", "is_type"],
         "type": "Entity"
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86b6146 and 6a99fc3.

📒 Files selected for processing (5)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/shared/SourceCodeGraph.py (2 hunks)
  • cognee/tasks/storage/index_data_points.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • cognee/modules/data/processing/document_types/Document.py
  • cognee/tasks/storage/index_data_points.py
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py
  • cognee/shared/SourceCodeGraph.py
🔇 Additional comments (3)
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (3)

4-4: LGTM!

The import reordering is acceptable and doesn't affect functionality.


21-24: LGTM!

The metadata structure is well-defined with appropriate indexing on the text field.


28-31: LGTM!

The metadata structure is well-defined with appropriate indexing on the name field.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

Line range hint 191-193: Consider implementing similarity score normalization and reducing code duplication.

  1. The TODO comments indicate missing score normalization. The normalize_distances utility is already imported but not used.
  2. There's significant code duplication between get_distance_from_collection_elements and search methods.

Consider extracting the common functionality into a private method:

async def _search_vectors(
    self,
    collection_name: str,
    query_vector: List[float],
    limit: Optional[int] = None
) -> List[ScoredResult]:
    PGVectorDataPoint = await self.get_table(collection_name)
    
    async with self.get_async_session() as session:
        query = select(
            PGVectorDataPoint,
            PGVectorDataPoint.c.vector.cosine_distance(query_vector).label("similarity"),
        ).order_by("similarity")
        
        if limit:
            query = query.limit(limit)
            
        results = await session.execute(query)
        vector_list = [row for row in results]
        
        # Normalize similarity scores
        scores = [v.similarity for v in vector_list]
        normalized_scores = normalize_distances(scores)
        
        return [
            ScoredResult(
                id=UUID(str(row.id)),
                payload=row.payload,
                score=norm_score
            ) for row, norm_score in zip(vector_list, normalized_scores)
        ]

Then update both methods to use this common implementation:

async def get_distance_from_collection_elements(
    self,
    collection_name: str,
    query_text: str = None,
    query_vector: List[float] = None,
    with_vector: bool = False
) -> List[ScoredResult]:
    if query_text is None and query_vector is None:
        raise ValueError("One of query_text or query_vector must be provided!")

    if query_text and not query_vector:
        query_vector = (await self.embedding_engine.embed_text([query_text]))[0]

    return await self._search_vectors(collection_name, query_vector)

async def search(
    self,
    collection_name: str,
    query_text: Optional[str] = None,
    query_vector: Optional[List[float]] = None,
    limit: int = 5,
    with_vector: bool = False,
) -> List[ScoredResult]:
    if query_text is None and query_vector is None:
        raise InvalidValueError(message="One of query_text or query_vector must be provided!")

    if query_text and not query_vector:
        query_vector = (await self.embedding_engine.embed_text([query_text]))[0]

    return await self._search_vectors(collection_name, query_vector, limit)

Also applies to: 249-251

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a99fc3 and 9682b5f.

📒 Files selected for processing (1)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
🔇 Additional comments (2)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)

3-20: LGTM! Import organization looks good.

The imports are well-organized with system imports followed by relative imports, and all imports are being used in the code.


26-27: LGTM! Type metadata addition is consistent.

The addition of the "type" field to IndexSchema metadata aligns with the PR objective and maintains consistency across different adapters.

alekszievr force-pushed the feat/COG-651-add-type-to-datapoint-metadata branch from 9682b5f to c097aee on December 13, 2024 at 09:18

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
cognee/shared/SourceCodeGraph.py (4)

73-76: Consider using an enum for called_by field

While the metadata changes look good, consider replacing Literal["main"] with an enum to maintain consistency with the new type system approach.

from enum import Enum

class CallerType(Enum):
    MAIN = "main"

# Then update the type hint:
called_by: Union[Function, CallerType]

84-87: Consider type alias for complex Union type

While the implementation is correct, consider creating a type alias for the complex Union type in members for better readability.

from typing import TypeAlias

ExpressionMember: TypeAlias = Union[Variable, Function, Operator, "Expression"]
# Then use:
members: List[ExpressionMember]

103-106: Consider type alias for nodes Union type

Similar to the previous suggestion, consider creating a type alias for the nodes Union type to improve code readability.

from typing import TypeAlias

GraphNode: TypeAlias = Union[
    Class,
    ClassInstance,
    Function,
    FunctionCall,
    Variable,
    Operator,
    Expression,
]
# Then use:
nodes: List[GraphNode]

Line range hint 1-112: Well-structured type system migration

The migration from Literal types to metadata maintains a clear hierarchy while improving consistency in type handling. The approach aligns well with object-oriented principles and maintains clear separation of concerns.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9682b5f and 23fc77a.

📒 Files selected for processing (19)
  • cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (2 hunks)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (3 hunks)
  • cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1 hunks)
  • cognee/infrastructure/engine/models/DataPoint.py (3 hunks)
  • cognee/modules/chunking/models/DocumentChunk.py (2 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/modules/engine/models/Entity.py (1 hunks)
  • cognee/modules/engine/models/EntityType.py (1 hunks)
  • cognee/modules/graph/models/EdgeType.py (1 hunks)
  • cognee/modules/graph/utils/convert_node_to_data_point.py (1 hunks)
  • cognee/shared/CodeGraphEntities.py (1 hunks)
  • cognee/shared/SourceCodeGraph.py (2 hunks)
  • cognee/tasks/storage/index_data_points.py (2 hunks)
  • cognee/tasks/summarization/models.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (2 hunks)
  • cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (18)
  • cognee/modules/chunking/models/DocumentChunk.py
  • cognee/modules/graph/utils/convert_node_to_data_point.py
  • cognee/modules/engine/models/Entity.py
  • cognee/modules/graph/models/EdgeType.py
  • cognee/modules/engine/models/EntityType.py
  • cognee/infrastructure/engine/models/DataPoint.py
  • cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py
  • cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py
  • cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py
  • cognee/modules/data/processing/document_types/Document.py
  • cognee/tasks/summarization/models.py
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
  • cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py
  • cognee/shared/CodeGraphEntities.py
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
  • cognee/tasks/storage/index_data_points.py
  • cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py
🔇 Additional comments (3)
cognee/shared/SourceCodeGraph.py (3)

24-27: LGTM! Consistent metadata structure

The metadata structure is consistently implemented across these classes with appropriate index fields for searchable entities.

Also applies to: 38-40, 51-53, 64-66


14-17: Verify type attribute usage in dependent code

The migration from a type Literal to metadata dictionary is a breaking change that could affect code that directly accesses the type attribute.

✅ Verification successful

Migration to metadata dictionary is safe to proceed

The type attribute usage in the codebase is safe for this migration as:

  1. The .type accesses in extract_code_parts.py and get_local_dependencies.py are for AST node types, not for our DataPoint types.
  2. The .type usage in graph utils is for node type generation and naming, which will continue to work as the type information is preserved in the metadata dictionary.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any direct access to .type attribute that might break
rg -l '\.type\s*[=!]=\s*["\']Variable["\']' || \
rg -l '\.type\s*[=!]=\s*["\']Class["\']' || \
rg -l '\.type\s*[=!]=\s*["\']Function["\']'

Length of output: 116


Script:

#!/bin/bash
# Search for direct access to type attribute with proper quote handling
rg -l "\.type\s*[=!]=\s*['\"]Variable['\"]" || \
rg -l "\.type\s*[=!]=\s*['\"]Class['\"]" || \
rg -l "\.type\s*[=!]=\s*['\"]Function['\"]"

# Also search for type attribute access in method calls
rg -l "\.type\s+in\s+|\.type\s+==\s+|\.type\s+!=\s+" 

# Search for any other potential type attribute usage
ast-grep --pattern 'node.type'

Length of output: 684


Script:

#!/bin/bash
# Check the specific usage in the identified files
rg -A 5 "\.type" cognee/tasks/repo_processor/extract_code_parts.py cognee/tasks/repo_processor/get_local_dependencies.py

# Check if these files are accessing metadata instead
rg -A 5 "_metadata" cognee/tasks/repo_processor/extract_code_parts.py cognee/tasks/repo_processor/get_local_dependencies.py

# Check the node type usage in graph utils
cat cognee/modules/graph/utils/retrieve_existing_edges.py
cat cognee/modules/graph/utils/expand_with_nodes_and_edges.py

Length of output: 7051


108-112: Verify model_rebuild coverage

Some classes (Variable, Operator) don't have model_rebuild calls. Please verify if this is intentional.

✅ Verification successful

Let me gather more information about the model_rebuild implementations and their usage patterns.


Let me search for the inheritance hierarchy and model definitions to understand if Variable and Operator are meant to have model_rebuild.


Based on the gathered information, I can now provide the final response:

model_rebuild() calls are correctly implemented

The absence of model_rebuild() calls for Variable and Operator classes is correct because:

  1. Both Variable and Operator inherit from DataPoint, which is a base class for data entities
  2. The model_rebuild() is only called after create_model() operations for dynamic model modifications
  3. The current calls in SourceCodeGraph.py are specifically for classes that need dynamic model rebuilding
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for model_rebuild implementation to understand its purpose
rg -A 5 'def model_rebuild' 

# Search for other usage patterns of model_rebuild
rg -l '\.model_rebuild\(\)'

Length of output: 362


Script:

#!/bin/bash
# Search for class definitions and their model_rebuild methods
ast-grep --pattern 'class $_ {
  $$$
  def model_rebuild($$$) {
    $$$
  }
  $$$
}'

# Search for Variable and Operator class definitions
ast-grep --pattern 'class Variable'
ast-grep --pattern 'class Operator'

# Get more context around model_rebuild usage
rg -B 2 -A 2 '\.model_rebuild\(\)'

Length of output: 2452


Script:

#!/bin/bash
# Search for BaseModel inheritance and model definitions
rg -B 2 -A 2 'class.*\(.*BaseModel.*\)'

# Search specifically for Variable and Operator class definitions with more context
rg -B 2 -A 5 'class (Variable|Operator)'

# Look for model_rebuild usage in model creation/initialization
rg -B 2 -A 5 'create_model'

Length of output: 17769

borisarzentar (Member) left a comment

Nice job with this implementation!

alekszievr merged commit bfa0f06 into dev on Dec 16, 2024
24 checks passed
alekszievr deleted the feat/COG-651-add-type-to-datapoint-metadata branch on December 16, 2024 at 15:27
borisarzentar added a commit that referenced this pull request Jan 10, 2025
* feat: Add error handling in case user is already part of database and permission already given to group

Added error handling in case permission is already given to group and user is already part of group

Feature COG-656

* feat: Add user verification for accessing data

Verify user has access to data before returning it

Feature COG-656

* feat: Add compute search to cognee

Add compute search to cognee which makes searches human readable

Feature COG-656

* feat: Add simple instruction for system prompt

Add simple instruction for system prompt

Feature COG-656

* pass pydantic model tocognify

* feat: Add unauth access error to getting data

Raise unauth access error when trying to read data without access

Feature COG-656

* refactor: Rename query compute to query completion

Rename searching type from compute to completion

Refactor COG-656

* chore: Update typo in code

Update typo in string in code

Chore COG-656

* Add mcp to cognee

* Add simple README

* Update cognee-mcp/mcpcognee/__main__.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Create dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve reflection issue when running cognee a second time after pruning data

When running cognee a second time after pruning data some metadata doesn't get pruned.
This makes cognee believe some tables exist that have been deleted

Fix

* fix: Add metadata reflection fix to sqlite as well

Added fix when reflecting metadata to sqlite as well

Fix

* update

* Revert "fix: Add metadata reflection fix to sqlite as well"

This reverts commit 394a0b2.

* COG-810 Implement a top-down dependency graph builder tool (#268)

* feat: parse repo to call graph

* Update/repo_processor/top_down_repo_parse.py task

* fix: minor improvements

* feat: file parsing jedi script optimisation

---------

* Add type to DataPoint metadata (#364)

* Add type to DataPoint metadata

* Add missing index_fields

* Use DataPoint UUID type in pgvector create_data_points

* Make _metadata mandatory everywhere

* Fixes

* Fixes to our demo

* feat: Add search by dataset for cognee

Added ability to search by datasets for cognee users

Feature COG-912

* feat: outsources chunking parameters to extract chunk from documents … (#289)

* feat: outsources chunking parameters to extract chunk from documents task

* fix: Remove backend lock from UI

Removed lock that prevented using multiple datasets in cognify

Fix COG-912

* COG 870 Remove duplicate edges from the code graph (#293)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

---------

Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* test: Added test for getting of documents for search

Added test to verify getting documents related to datasets intended for search

Test COG-912

* Structured code summarization (#375)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

* Structured code summarization

* add missing prompt file

* Remove summarization_model argument from summarize_code and fix typehinting

* minor refactors

---------

Co-authored-by: lxobr <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* fix: Resolve issue with cognify router graph model default value

Resolve issue with default value for graph model in cognify endpoint

Fix

* chore: Resolve typo in getting documents code

Resolve typo in code

chore COG-912

* Update .github/workflows/dockerhub.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve syntax issue with cognify router

Resolve syntax issue with cognify router

Fix

* feat: Add ruff pre-commit hook for linting and formatting

Added formatting and linting on pre-commit hook

Feature COG-650

* chore: Update ruff lint options in pyproject file

Update ruff lint options in pyproject file

Chore

* test: Add ruff linter github action

Added linting check with ruff in github actions

Test COG-650

* feat: deletes executor limit from get_repo_file_dependencies

* feat: implements mock feature in LiteLLM engine

* refactor: Remove changes to cognify router

Remove changes to cognify router

Refactor COG-650

* fix: fixing boolean env for github actions

* test: Add test for ruff format for cognee code

Test if code is formatted for cognee

Test COG-650

* refactor: Rename ruff gh actions

Rename ruff gh actions to be more understandable

Refactor COG-650

* chore: Remove checking of ruff lint and format on push

Remove checking of ruff lint and format on push

Chore COG-650

* feat: Add deletion of local files when deleting data

Delete local files when deleting data from cognee

Feature COG-475

* fix: changes back the max workers to 12

* feat: Adds mock summary for codegraph pipeline

* refactor: Add current development status

Save current development status

Refactor

* Fix langfuse

* Fix langfuse

* Fix langfuse

* Add evaluation notebook

* Rename eval notebook

* chore: Add temporary state of development

Add temp development state to branch

Chore

* fix: Add poetry.lock file, make langfuse mandatory

Added langfuse as mandatory dependency, added poetry.lock file

Fix

* Fix: fixes langfuse config settings

* feat: Add deletion of local files made by cognee through data endpoint

Delete local files made by cognee when deleting data from database through endpoint

Feature COG-475

* test: Revert changes on test_pgvector

Revert changes on test_pgvector which were made to test deletion of local files

Test COG-475

* chore: deletes the old test for the codegraph pipeline

* test: Add test to verify deletion of local files

Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475

* chore: deletes unused old version of the codegraph

* chore: deletes unused imports from code_graph_pipeline

* Ingest non-code files

* Fixing review findings

* Ingest non-code files (#395)

* Ingest non-code files

* Fixing review findings

* test: Update test regarding message

Update assertion message, add veryfing of file existence

* Handle retryerrors in code summary (#396)

* Handle retryerrors in code summary

* Log instead of print

* fix: updates the acreate_structured_output

* chore: Add logging to sentry when file which should exist can't be found

Log to sentry that a file which should exist can't be found

Chore COG-475

* Fix diagram

* fix: refactor mcp

* Add Smithery CLI installation instructions and badge

* Move readme

* Update README.md

* Update README.md

* Cog 813 source code chunks (#383)

* fix: pass the list of all CodeFiles to enrichment task

* feat: introduce SourceCodeChunk, update metadata

* feat: get_source_code_chunks code graph pipeline task

* feat: integrate get_source_code_chunks task, comment out summarize_code

* Fix code summarization (#387)

* feat: update data models

* feat: naive parse long strings in source code

* fix: get_non_py_files instead of get_non_code_files

* fix: limit recursion, add comment

* handle embedding empty input error (#398)

* feat: robustly handle CodeFile source code

* refactor: sort imports

* todo: add support for other embedding models

* feat: add custom logger

* feat: add robustness to get_source_code_chunks

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat: improve embedding exceptions

* refactor: format indents, rename module

---------

Co-authored-by: alekszievr <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Fix diagram

* Fix instructions

* adding and fixing files

* Update README.md

* ruff format

* Fix linter issues

* Implement PR review

* Comment out profiling

* fix: add allowed extensions

* fix: adhere UnstructuredDocument.read() to Document

* feat: time code graph run and add mock support

* Fix ollama, work on visualization

* fix: Fixes faulty logging format and sets up error logging in dynamic steps example

* Overcome ContextWindowExceededError by checking token count while chunking (#413)

* fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints

* Adjust AudioDocument and handle None token limit

* Handle azure models as well

* Add clean logging to code graph example

* Remove setting envvars from arg

* fix: fixes create_cognee_style_network_with_logo unit test

* fix: removes accidental remained print

* Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.

* Fix visualization

* Get embedding engine instead of passing it in code chunking.

* Fix poetry issues

* chore: Update version of poetry install action

* chore: Update action to trigger on pull request for any branch

* chore: Remove if in github action to allow triggering on push

* chore: Remove if condition to allow gh actions to trigger on push to PR

* chore: Update poetry version in github actions

* chore: Set fixed ubuntu version to 22.04

* chore: Update py lint to use ubuntu 22.04

* chore: update ubuntu version to 22.04

* feat: implements the first version of graph based completion in search

* chore: Update python 3.9 gh action to use 3.12 instead

* chore: Update formatting of utils.py

* Fix poetry issues

* Adjust integration tests

* fix: Fixes ruff formatting

* Handle circular import

* fix: Resolve profiler issue with partial and recursive logger imports

Resolve issue for profiler with partial and recursive logger imports

* fix: Remove logger from __init__.py file

* test: Test profiling on HEAD branch

* test: Return profiler to base branch

* Set max_tokens in config

* Adjust SWE-bench script to code graph pipeline call

* Adjust SWE-bench script to code graph pipeline call

* fix: Add fix for accessing dictionary elements that don't exits

Using get for the text key instead of direct access to handle situation if the text key doesn't exist

* feat: Add ability to change graph database configuration through cognee

* feat: adds pydantic types to graph layer models

* feat: adds basic retriever for swe bench

* Match Ruff version in config to the one in github actions

* feat: implements code retreiver

* Fix: fixes unit test for codepart search

* Format with Ruff 0.9.0

* Fix: deleting incorrect repo path

* fix: resolve issue with langfuse dependency installation when integrating cognee in different packages

* version: Increase version to 0.1.21

---------

Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Rita Aleksziev <[email protected]>
Co-authored-by: vasilije <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: lxobr <[email protected]>
Co-authored-by: alekszievr <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Henry Mao <[email protected]>
borisarzentar added a commit that referenced this pull request Jan 13, 2025
* Revert "fix: Add metadata reflection fix to sqlite as well"

This reverts commit 394a0b2.

* COG-810 Implement a top-down dependency graph builder tool (#268)

* feat: parse repo to call graph

* Update/repo_processor/top_down_repo_parse.py task

* fix: minor improvements

* feat: file parsing jedi script optimisation

---------

* Add type to DataPoint metadata (#364)

* Add missing index_fields

* Use DataPoint UUID type in pgvector create_data_points

* Make _metadata mandatory everywhere

* feat: Add search by dataset for cognee

Added ability to search by datasets for cognee users

Feature COG-912

* feat: outsources chunking parameters to extract chunk from documents … (#289)

* feat: outsources chunking parameters to extract chunk from documents task

* fix: Remove backend lock from UI

Removed lock that prevented using multiple datasets in cognify

Fix COG-912

* COG 870 Remove duplicate edges from the code graph (#293)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

---------

Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* test: Added test for getting of documents for search

Added test to verify getting documents related to datasets intended for search

Test COG-912

* Structured code summarization (#375)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

* Structured code summarization

* add missing prompt file

* Remove summarization_model argument from summarize_code and fix typehinting

* minor refactors

---------

Co-authored-by: lxobr <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* fix: Resolve issue with cognify router graph model default value

Resolve issue with default value for graph model in cognify endpoint

Fix

* chore: Resolve typo in getting documents code

Resolve typo in code

chore COG-912

* Update .github/workflows/dockerhub.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve syntax issue with cognify router

Resolve syntax issue with cognify router

Fix

* feat: Add ruff pre-commit hook for linting and formatting

Added formatting and linting on pre-commit hook

Feature COG-650

* chore: Update ruff lint options in pyproject file

Update ruff lint options in pyproject file

Chore

* test: Add ruff linter github action

Added linting check with ruff in github actions

Test COG-650

* feat: deletes executor limit from get_repo_file_dependencies

* feat: implements mock feature in LiteLLM engine

* refactor: Remove changes to cognify router

Remove changes to cognify router

Refactor COG-650

* fix: fixing boolean env for github actions

* test: Add test for ruff format for cognee code

Test if code is formatted for cognee

Test COG-650

* refactor: Rename ruff gh actions

Rename ruff gh actions to be more understandable

Refactor COG-650

* chore: Remove checking of ruff lint and format on push

Remove checking of ruff lint and format on push

Chore COG-650

* feat: Add deletion of local files when deleting data

Delete local files when deleting data from cognee

Feature COG-475

* fix: changes back the max workers to 12

* feat: Adds mock summary for codegraph pipeline

* refacotr: Add current development status

Save current development status

Refactor

* Fix langfuse

* Fix langfuse

* Fix langfuse

* Add evaluation notebook

* Rename eval notebook

* chore: Add temporary state of development

Add temp development state to branch

Chore

* fix: Add poetry.lock file, make langfuse mandatory

Added langfuse as mandatory dependency, added poetry.lock file

Fix

* Fix: fixes langfuse config settings

* feat: Add deletion of local files made by cognee through data endpoint

Delete local files made by cognee when deleting data from database through endpoint

Feature COG-475

* test: Revert changes on test_pgvector

Revert changes on test_pgvector which were made to test deletion of local files

Test COG-475

* chore: deletes the old test for the codegraph pipeline

* test: Add test to verify deletion of local files

Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475

* chore: deletes unused old version of the codegraph

* chore: deletes unused imports from code_graph_pipeline

* Ingest non-code files

* Fixing review findings

* Ingest non-code files (#395)

* Ingest non-code files

* Fixing review findings

* test: Update test regarding message

Update assertion message, add veryfing of file existence

* Handle retryerrors in code summary (#396)

* Handle retryerrors in code summary

* Log instead of print

* fix: updates the acreate_structured_output

* chore: Add logging to sentry when file which should exist can't be found

Log to sentry that a file which should exist can't be found

Chore COG-475

* Fix diagram

* fix: refactor mcp

* Add Smithery CLI installation instructions and badge

* Move readme

* Update README.md

* Update README.md

* Cog 813 source code chunks (#383)

* fix: pass the list of all CodeFiles to enrichment task

* feat: introduce SourceCodeChunk, update metadata

* feat: get_source_code_chunks code graph pipeline task

* feat: integrate get_source_code_chunks task, comment out summarize_code

* Fix code summarization (#387)

* feat: update data models

* feat: naive parse long strings in source code

* fix: get_non_py_files instead of get_non_code_files

* fix: limit recursion, add comment

* handle embedding empty input error (#398)

* feat: robustly handle CodeFile source code

* refactor: sort imports

* todo: add support for other embedding models

* feat: add custom logger

* feat: add robustness to get_source_code_chunks

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat: improve embedding exceptions

* refactor: format indents, rename module

---------

Co-authored-by: alekszievr <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Fix diagram

* Fix diagram

* Fix instructions

* Fix instructions

* adding and fixing files

* Update README.md

* ruff format

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Implement PR review

* Comment out profiling

* Comment out profiling

* Comment out profiling

* fix: add allowed extensions

* fix: adhere UnstructuredDocument.read() to Document

* feat: time code graph run and add mock support

* Fix ollama, work on visualization

* fix: Fixes faulty logging format and sets up error logging in dynamic steps example

* Overcome ContextWindowExceededError by checking token count while chunking (#413)

* fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints

* Adjust AudioDocument and handle None token limit

* Handle azure models as well

* Fix visualization

* Fix visualization

* Fix visualization

* Add clean logging to code graph example

* Remove setting envvars from arg

* fix: fixes create_cognee_style_network_with_logo unit test

* fix: removes accidental remained print

* Fix visualization

* Fix visualization

* Fix visualization

* Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.

* Fix visualization

* Fix visualization

* Fix poetry issues

* Get embedding engine instead of passing it in code chunking.

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* chore: Update version of poetry install action

* chore: Update action to trigger on pull request for any branch

* chore: Remove if in github action to allow triggering on push

* chore: Remove if condition to allow gh actions to trigger on push to PR

* chore: Update poetry version in github actions

* chore: Set fixed ubuntu version to 22.04

* chore: Update py lint to use ubuntu 22.04

* chore: update ubuntu version to 22.04

* feat: implements the first version of graph based completion in search

* chore: Update python 3.9 gh action to use 3.12 instead

* chore: Update formatting of utils.py

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Adjust integration tests

* fix: Fixes ruff formatting

* Handle circular import

* fix: Resolve profiler issue with partial and recursive logger imports

Resolve issue for profiler with partial and recursive logger imports

* fix: Remove logger from __init__.py file

* test: Test profiling on HEAD branch

* test: Return profiler to base branch

* Set max_tokens in config

* Adjust SWE-bench script to code graph pipeline call

* Adjust SWE-bench script to code graph pipeline call

* fix: Add fix for accessing dictionary elements that don't exits

Using get for the text key instead of direct access to handle situation if the text key doesn't exist

* feat: Add ability to change graph database configuration through cognee

* feat: adds pydantic types to graph layer models

* test: Test ubuntu 24.04

* test: change all actions to ubuntu-latest

* feat: adds basic retriever for swe bench

* Match Ruff version in config to the one in github actions

* feat: implements code retreiver

* Fix: fixes unit test for codepart search

* Format with Ruff 0.9.0

* Fix: deleting incorrect repo path

* docs: Add LlamaIndex Cognee integration notebook

Added LlamaIndex Cognee integration notebook

* test: Add github action for testing llama index cognee integration notebook

* fix: resolve issue with langfuse dependency installation when integrating cognee in different packages

* version: Increase version to 0.1.21

* fix: update dependencies of the mcp server

* Update README.md

* Fix: Fixes logging setup

* feat: deletes on the fly embeddings as uses edge collections

* fix: Change nbformat on llama index integration notebook

* fix: Resolve api key issue with llama index integration notebook

* fix: Attempt to resolve issue with Ubuntu 24.04 segmentation fault

* version: Increase version to 0.1.22

---------

Co-authored-by: vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: lxobr <[email protected]>
Co-authored-by: alekszievr <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Rita Aleksziev <[email protected]>
Co-authored-by: Henry Mao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants