
Conversation

@dexters1
Collaborator

Description

Allow embedding of big strings to support full row embedding in SQL databases

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@gitguardian

gitguardian bot commented Apr 15, 2025

✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend revoking them.
While these secrets were previously flagged, we no longer have a reference to the specific commits where they were detected. Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately. Find more information about the risks here.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update introduces a wide range of changes including new features, configuration enhancements, test additions, prompt and documentation updates, and significant codebase cleanups. Major additions include new prompt templates for knowledge graph extraction and QA benchmarking, new scripts for workflow management, and comprehensive evaluation pipelines for multiple AI systems. Configuration is made more flexible with environment-based default user credentials and dynamic prompt paths. Test coverage is expanded with server startup, telemetry, and relational database migration tests. Several obsolete or demonstration notebooks and example scripts are removed, and batch sizes and error handling are refined in various modules. Enum and function signatures are updated for clarity and consistency.

Changes

| File(s) / Path(s) | Change Summary |
| --- | --- |
| .env.template, cognee/base_config.py, cognee/modules/users/methods/create_default_user.py, cognee/modules/users/methods/get_default_user.py | Added DEFAULT_USER_EMAIL and DEFAULT_USER_PASSWORD to config and used them in default user creation/retrieval. |
| .github/README_WORKFLOW_MIGRATION.md, .github/workflows/disable_independent_workflows.sh | Added documentation and script for migrating GitHub workflows to centralized execution. |
| cognee-mcp/src/server.py | Modified search function to handle new search types GRAPH_COMPLETION and RAG_COMPLETION. |
| cognee/api/v1/config/config.py | Added set_migration_db_config static method for migration DB config management. |
| cognee/eval_framework/corpus_builder/run_corpus_builder.py | Reformatted function signature for readability (no logic change). |
| cognee/eval_framework/corpus_builder/task_getters/get_default_tasks_by_indices.py | Refactored task selection to explicit construction; updated signatures to accept more parameters. |
| cognee/eval_framework/evaluation/deep_eval_adapter.py | Changed how retrieval_context is set in evaluate_answers. |
| cognee/infrastructure/databases/relational/config.py | Updated type annotations for migration DB fields to allow None. |
| cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py | Improved handling of context window errors and pooling logic for embeddings. |
| cognee/infrastructure/databases/vector/embeddings/get_embedding_engine.py | Pass embedding endpoint to OllamaEmbeddingEngine. |
| cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py | Changed data point creation to update existing records or insert new ones. |
| cognee/infrastructure/llm/config.py | Added graph_prompt_path config field and included in serialization. |
| cognee/infrastructure/llm/prompts/answer_simple_question_benchmark2.txt, ...benchmark3.txt, ...benchmark4.txt | Added new benchmark QA prompt templates. |
| cognee/infrastructure/llm/prompts/generate_graph_prompt_guided.txt, ...oneshot.txt, ...simple.txt, ...strict.txt | Added new knowledge graph extraction prompt templates. |
| cognee/modules/data/extraction/knowledge_graph/extract_content_graph.py | Made prompt path dynamic based on config. |
| cognee/modules/search/methods/search.py, cognee/modules/search/types/SearchType.py, cognee/tests/test_custom_model.py | Renamed COMPLETION to RAG_COMPLETION in search types and updated usage. |
| cognee/shared/logging_utils.py | Added attribute check before assigning exception type name. |
| cognee/tasks/graph/extract_graph_from_data.py | Added Optional to imports (no logic change). |
| cognee/tasks/ingestion/migrate_relational_database.py | Added deduplication of edges before adding to graph DB. |
| cognee/tasks/storage/index_data_points.py | Reduced batch size for data point indexing from 1000 to 100. |
| cognee/tests/test_cognee_server_start.py | Added server startup and health check test. |
| cognee/tests/test_relational_db_migration.py | Added comprehensive test for relational DB migration and verification. |
| cognee/tests/test_telemetry.py | Added telemetry unit tests with environment and HTTP request mocking. |
| cognee/tests/unit/entity_extraction/regex_entity_extraction_test.py | Disabled failing money extraction test with TODO comment. |
| evals/README.md | Added documentation for AI memory system accuracy analysis. |
| evals/falkor_01042025/hotpot_qa_falkor_graphrag_sdk.py, evals/graphiti_01042025/hotpot_qa_graphiti.py, evals/mem0_01042025/hotpot_qa_mem0.py | Added new evaluation pipelines for different QA systems. |
| evals/plot_metrics.py, evals/requirements.txt | Added metric plotting script and requirements for evaluation. |
| examples/python/pokemon_datapoints_example.py | Simplified add_data_points task instantiation (removed explicit batch size config). |
| .env.template, cognee/base_config.py, cognee/modules/users/methods/create_default_user.py, cognee/modules/users/methods/get_default_user.py | Added support for configurable default user credentials. |
| notebooks/cognee_simple_demo.ipynb | Updated cognee package install version. |
| evals/eval_swe_bench.py, evals/eval_utils.py, evals/test_datasets/initial_test/natural_language_processing.txt, evals/test_datasets/initial_test/trump.txt, examples/node/fetch.js, examples/node/handleServerErrors.js, examples/node/main.js, notebooks/cognee_code_graph_demo.ipynb, notebooks/cognee_hotpot_eval.ipynb, notebooks/hr_demo.ipynb, notebooks/pokemon_datapoints_notebook.ipynb | Deleted obsolete scripts, datasets, and demo notebooks. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Config
    participant System
    participant DB

    User->>Config: Set DEFAULT_USER_EMAIL/PASSWORD in env
    System->>Config: get_base_config()
    System->>Config: Read default_user_email/password
    alt If not set
        System->>System: Use hardcoded defaults
    end
    System->>DB: Create or retrieve default user with credentials
    DB-->>System: User object
```
```mermaid
sequenceDiagram
    participant Workflow Script
    participant Workflow YAML

    Workflow Script->>Workflow YAML: Backup file
    Workflow Script->>Workflow YAML: Check for workflow_call trigger
    alt Not present
        Workflow Script->>Workflow YAML: Replace on: triggers with workflow_call
    else Already present
        Workflow Script->>Workflow YAML: Skip modification
    end
```

Possibly related PRs

  • topoteretes/cognee#421: Adds a similar static method for graph DB config, conceptually parallel to the new migration DB config method.
  • topoteretes/cognee#682: Introduces the same configurable default user credentials and related logic—directly related.
  • topoteretes/cognee#707: Makes the same exception handler attribute check as in this PR—directly related.

Suggested labels

run-checks

Poem

In the warren of code, a rabbit hopped,
New prompts and configs, the carrots he chopped.
Tests now abound, and workflows align,
Default users set—credentials divine!
Old scripts are gone, the garden is neat,
With graphs and QA, the harvest is sweet.
🐇✨ Hooray for the code—so nimble, so fleet!


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@dexters1 dexters1 changed the base branch from main to dev April 15, 2025 12:40
@dexters1 dexters1 self-assigned this Apr 15, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently, it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 12

🔭 Outside diff range comments (2)
cognee/api/v1/config/config.py (1)

157-158: 💡 Verification agent

🧩 Analysis chain

Inconsistent error handling detected

This is unrelated to your current changes, but I noticed that this error handling is different from other similar methods. This line uses AttributeError directly with a message parameter, while other methods use InvalidAttributeError.


🏁 Script executed:

#!/bin/bash
# Check for similar error handling patterns in the file
rg -A 3 "raise.*AttributeError" --context 2 cognee/api/v1/config/config.py

Length of output: 1187


Action Required: Align Exception Type for Consistency

The verification confirms that while most configuration blocks in cognee/api/v1/config/config.py are using InvalidAttributeError to signal invalid attributes, the graph_db_config block is still using AttributeError. To maintain consistency:

  • File: cognee/api/v1/config/config.py
  • Location: The block handling graph_db_config (currently at lines 157-158)
  • Suggested Change: Replace
              else:
                  raise AttributeError(message=f"'{key}' is not a valid attribute of the config.")
    with
              else:
                  raise InvalidAttributeError(
                      message=f"'{key}' is not a valid attribute of the config."
                  )

Please update the error handling accordingly.

cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt (1)

1-28: 💡 Verification agent

🧩 Analysis chain

Address handling of large string embeddings.

Since the PR objective mentions "allowing embedding of large strings to support full row embedding in SQL databases," consider adding specific guidance on handling large text fields or complex data structures.


🌐 Web query:

What is the maximum string length for embedding in SQL databases?

💡 Result:

The maximum string length for embedding in SQL databases varies significantly depending on the database system and the data type used. Below are the details for some common database systems:

SQL Server

  • VARCHAR(MAX) or NVARCHAR(MAX): These types can store up to 2 GB of data (approximately 1,073,741,823 characters for NVARCHAR and 2,147,483,647 characters for VARCHAR)[1][6][9].
  • VARCHAR(N): Supports strings up to 8,000 bytes in length; exceeding this limit requires using VARCHAR(MAX)[9].
  • Other types like CHAR are fixed-length and generally smaller, with a maximum of 8,000 bytes[9].

MySQL

  • VARCHAR: The maximum size is constrained by the row size limit (65,535 bytes), including storage overhead and character set encoding. For example, with utf8mb4, a VARCHAR column can hold fewer characters than its byte limit[5][9].
  • TEXT Types (e.g., LONGTEXT): Can store up to 4 GB of data, but these are not directly comparable to VARCHAR for embedding strings[5].

SQLite

  • The maximum string length is capped at 1 billion bytes, based on the SQLITE_MAX_LENGTH parameter[7].

Oracle

  • VARCHAR2: By default, supports up to 4,000 bytes. With the MAX_STRING_SIZE parameter set to EXTENDED, it allows up to 32,767 bytes[10].

BigQuery

  • SQL queries, including embedded text, are limited to 1 MB in size (1,024 KB)[2].

Caché (InterSystems)

  • String: Maximum embedded string length is 32,767 characters, though certain configurations (like enabling long strings or using STREAM data types) allow higher limits[8].

PowerBuilder with SQL Anywhere

  • While the theoretical limit can be 2 GB for types like LONG VARCHAR, safe embedded string lengths vary significantly. Developers report issues with strings exceeding 10 million characters[3].

Each database has different practical and theoretical limits depending on the specific data types, so it's essential to choose appropriate types (e.g., VARCHAR(MAX), LONGTEXT, or BLOB) when working with large strings.



Enhance SQL Embedding Guidelines

  • The current prompt focuses on generating a knowledge graph but does not address the potential challenges when embedding large text fields for full row SQL embedding.
  • Since SQL databases have widely varying string length limits (e.g., up to 2 GB in SQL Server with VARCHAR(MAX), row size limits in MySQL, 1 billion bytes in SQLite, etc.), please add explicit guidance. For example:
    • Instruct the extraction algorithm to verify the target SQL engine’s string length limits.
    • Specify whether to truncate, split, or use alternative data types (e.g., TEXT, BLOB) when processing large text embeddings.
  • This addition will help ensure robust handling of complex or oversized data structures.
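For illustration only, here is a minimal sketch (not part of this PR) of the kind of guard such guidance could ask the pipeline to apply before embedding a full row. The helper name and the 8000-byte budget are assumptions and would need to match the target engine's actual column limits:

```python
def fit_to_byte_budget(text: str, max_bytes: int = 8000, encoding: str = "utf-8") -> str:
    """Truncate text so its encoded size stays within a column's byte budget.

    The 8000-byte default mirrors SQL Server's VARCHAR(N) ceiling and is only
    an example; VARCHAR(MAX), TEXT, or BLOB columns allow far larger values.
    """
    encoded = text.encode(encoding)
    if len(encoded) <= max_bytes:
        return text
    # Cut at the byte boundary, dropping any trailing partial multi-byte character.
    return encoded[:max_bytes].decode(encoding, errors="ignore")
```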
🧹 Nitpick comments (31)
cognee/infrastructure/llm/prompts/answer_simple_question_benchmark4.txt (1)

1-15: Prompt is clear, precise, and well-structured for benchmarking.

The instructions are explicit, unambiguous, and well-aligned with the goal of minimal, context-bound QA benchmarking. No issues found.

If you want to be even more explicit, you could clarify whether "no punctuation" applies to numbers/dates (e.g., "2024.04.01") or just to terminal punctuation, but this is a minor nitpick.

cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2)

81-89: Recursive splitting for multi-element lists.

Splitting the text array in half and embedding each side in parallel is a workable approach to reduce context size. However, if there's a very large input list, this recursive approach can rapidly spawn multiple async tasks. You may consider implementing a maximum recursion depth or a chunking mechanism to avoid excessive concurrency in extreme cases.
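As a rough illustration of the chunking alternative (the embed_batch callable, chunk size, and concurrency limit below are hypothetical, not the engine's current interface):

```python
import asyncio
from typing import Awaitable, Callable, List


async def embed_in_chunks(
    texts: List[str],
    embed_batch: Callable[[List[str]], Awaitable[List[List[float]]]],
    chunk_size: int = 64,
    max_concurrency: int = 4,
) -> List[List[float]]:
    """Embed texts in fixed-size chunks with bounded concurrency instead of recursive halving."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(chunk: List[str]) -> List[List[float]]:
        async with semaphore:
            return await embed_batch(chunk)

    chunks = [texts[i : i + chunk_size] for i in range(0, len(texts), chunk_size)]
    results = await asyncio.gather(*(run(chunk) for chunk in chunks))
    # Flatten per-chunk results back into a single list of vectors.
    return [vector for chunk_vectors in results for vector in chunk_vectors]
```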


93-108: Handling single oversize strings with overlap-splitting.

Splitting one large string into overlapping thirds, embedding each, then pooling the results helps reduce context size while retaining context overlap. If the remaining segments are still too large, subsequent recursions will continue. Although it will shrink geometrically, consider a safeguard for unbounded recursion (e.g., a configurable recursion cap) and log the recursion depth for performance monitoring.
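A possible shape for that safeguard, sketched with a stand-in exception type and embed callable; none of these names come from LiteLLMEmbeddingEngine itself:

```python
import logging
from typing import Awaitable, Callable, List

logger = logging.getLogger(__name__)


class ContextWindowExceeded(Exception):
    """Stand-in for the provider's context-window error."""


async def embed_with_overlap(
    text: str,
    embed_text: Callable[[str], Awaitable[List[float]]],
    depth: int = 0,
    max_depth: int = 5,
) -> List[float]:
    """Split an oversized string into overlapping halves, embed each, and mean-pool the vectors."""
    if depth >= max_depth:
        raise RuntimeError(f"Exceeded max embedding recursion depth ({max_depth})")
    logger.debug("Embedding at recursion depth %d, text length %d", depth, len(text))
    try:
        return await embed_text(text)
    except ContextWindowExceeded:
        mid, overlap = len(text) // 2, len(text) // 10
        left = await embed_with_overlap(text[: mid + overlap], embed_text, depth + 1, max_depth)
        right = await embed_with_overlap(text[mid - overlap :], embed_text, depth + 1, max_depth)
        # Mean-pool the two halves so the result stays a single vector.
        return [(a + b) / 2 for a, b in zip(left, right)]
```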

evals/requirements.txt (1)

1-6: LGTM! Dependencies for the plot_metrics.py script are properly defined.

The requirements file correctly specifies the exact dependencies needed for the metrics plotting functionality, with properly pinned versions for reproducibility.

Note: pathlib has been part of the Python standard library since Python 3.4, so it doesn't need to be explicitly installed. Consider removing this dependency unless you're supporting Python versions older than 3.4.

cognee/tests/unit/entity_extraction/regex_entity_extraction_test.py (1)

85-96: Consider adding more details to the TODO comment

The test for money extraction has been temporarily disabled with a TODO comment. While this is a reasonable short-term solution to prevent CI failures, consider adding more specific information about what's failing with the regex and the deadline for fixing it.

Also, ensure you have a tracking issue in your issue tracking system for this fix, since disabled tests are often forgotten.

Would you like me to help diagnose the regex issues or create a tracking issue for this task?

cognee/infrastructure/llm/prompts/answer_simple_question_benchmark2.txt (1)

1-7: Clear and well-structured prompt format

The prompt template provides clear instructions for generating minimal, essential answers extracted from context with specific formatting requirements for different question types. This aligns with the PR objective by supporting efficient embedding of answers in SQL databases through consistent, minimalistic formatting.

The bullet point structure and consistent "For x questions: y" format makes the instructions easy to follow, though you might consider varying the beginning of some sentences to improve readability as suggested by the static analysis.

 You are a benchmark-optimized QA system. Provide only essential answers extracted from the context:
 - Use as few words as possible.
 - For yes/no questions: answer with "yes" or "no".
 - For what/who/where questions: reply with a single word or brief phrase.
-For when questions: return only the relevant date/time.
-For how/why questions: use the briefest phrase.
+- When asked about time: return only the relevant date/time.
+- Questions about how/why: use the briefest phrase.
 No punctuation, lowercase answers only.
🧰 Tools
🪛 LanguageTool

[style] ~5-~5: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...y with a single word or brief phrase. - For when questions: return only the relevan...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~6-~6: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: return only the relevant date/time. - For how/why questions: use the briefest phr...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

cognee/api/v1/config/config.py (1)

134-147: Well-implemented migration configuration method

The new set_migration_db_config method follows the established pattern for configuration methods in this class. It properly validates attributes and raises appropriate exceptions for invalid keys.

However, there's a minor documentation issue in the docstring.

-        """
-        Updates the relational db config with values from config_dict.
-        """
+        """
+        Updates the migration db config with values from config_dict.
+        """
cognee/eval_framework/corpus_builder/run_corpus_builder.py (1)

3-3: Remove the unused import Optional.

According to the static analysis hint, Optional is imported but not actually used. Consider removing it to keep the import list clean.

-from typing import List, Optional
+from typing import List
🧰 Tools
🪛 Ruff (0.8.2)

3-3: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

128-152: Optimize lookups for improved performance.

Currently, the code performs a SELECT query for each data point to check for existence before inserting or updating. This can be costly for large lists of data points. Consider querying once for all IDs, building a dictionary of existing records, and then iterating through the data points to update or insert as needed. This avoids repeated round trips and speeds up batch processing.

Below is an illustrative diff:

 async with self.get_async_session() as session:
-    pgvector_data_points = []
-    for data_index, data_point in enumerate(data_points):
-        data_point_db = (
-            await session.execute(
-                select(PGVectorDataPoint).filter(PGVectorDataPoint.id == data_point.id)
-            )
-        ).scalar_one_or_none()
-        if data_point_db:
-            data_point_db.id = data_point.id
-            data_point_db.vector = data_vectors[data_index]
-            data_point_db.payload = serialize_data(data_point.model_dump())
-            pgvector_data_points.append(data_point_db)
-        else:
-            pgvector_data_points.append(
-                PGVectorDataPoint(
-                    id=data_point.id,
-                    vector=data_vectors[data_index],
-                    payload=serialize_data(data_point.model_dump()),
-                )
-            )

+    # Gather all IDs first
+    data_point_ids = [dp.id for dp in data_points]
+    existing_records = (
+        await session.execute(
+            select(PGVectorDataPoint).filter(PGVectorDataPoint.id.in_(data_point_ids))
+        )
+    ).scalars().all()
+
+    existing_map = {record.id: record for record in existing_records}
+    pgvector_data_points = []
+
+    for data_index, dp in enumerate(data_points):
+        if dp.id in existing_map:
+            # Update existing record
+            record = existing_map[dp.id]
+            record.vector = data_vectors[data_index]
+            record.payload = serialize_data(dp.model_dump())
+            pgvector_data_points.append(record)
+        else:
+            # Create a new record
+            new_record = PGVectorDataPoint(
+                id=dp.id,
+                vector=data_vectors[data_index],
+                payload=serialize_data(dp.model_dump()),
+            )
+            pgvector_data_points.append(new_record)

     session.add_all(pgvector_data_points)
     await session.commit()
.github/workflows/disable_independent_workflows.sh (2)

65-70: Add comments explaining the AWK command logic.

The AWK command is used to find the next section after the 'on:' block, but this logic is complex and would benefit from more detailed comments explaining how it identifies the next section.

      # Find where to continue after the original 'on:' section
-     next_section=$(awk "NR > $on_line && /^[a-z]/ {print NR; exit}" "$workflow")
+     # Find the next line that starts with a lowercase letter (indicating a new YAML section)
+     # after the line containing 'on:' section
+     next_section=$(awk "NR > $on_line && /^[a-z]/ {print NR; exit}" "$workflow")

53-74: Consider error handling for file operations.

The script creates and moves files but doesn't check if these operations succeed. In a production environment, it would be better to add error handling for these operations.

      # Create a new file with the modified content
      {
        # Copy the part before 'on:'
        head -n $((on_line-1)) "$workflow"
        
        # Add the new on: section that only includes workflow_call
        echo "on:"
        echo "  workflow_call:"
        echo "    secrets:"
        echo "      inherit: true"
        
        # Find where to continue after the original 'on:' section
        next_section=$(awk "NR > $on_line && /^[a-z]/ {print NR; exit}" "$workflow")
        
        if [ -z "$next_section" ]; then
          next_section=$(wc -l < "$workflow")
          next_section=$((next_section+1))
        fi
        
        # Copy the rest of the file starting from the next section
        tail -n +$next_section "$workflow"
-     } > "${workflow}.new"
+     } > "${workflow}.new" || { echo "Error creating new file for $workflow"; continue; }
      
      # Replace the original with the new version
-     mv "${workflow}.new" "$workflow"
+     mv "${workflow}.new" "$workflow" || echo "Error replacing $workflow with modified version"
cognee/tests/test_cognee_server_start.py (1)

28-28: Add platform-specific code for Windows compatibility.

Using preexec_fn=os.setsid is only supported on Unix-like systems and will cause issues on Windows. Consider adding platform-specific code to handle this.

+       # Platform-specific process group handling
+       kwargs = {}
+       if os.name != 'nt':  # If not Windows
+           kwargs['preexec_fn'] = os.setsid
+       else:
+           # Windows uses a different approach for process groups
+           # For simplicity, we'll omit this in Windows and manage the process differently
+           pass
+           
        cls.server_process = subprocess.Popen(
            [
                sys.executable,
                "-m",
                "uvicorn",
                "cognee.api.client:app",
                "--host",
                "0.0.0.0",
                "--port",
                "8000",
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
-           preexec_fn=os.setsid,
+           **kwargs,
        )
cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt (2)

8-12: Consider adding examples for Node Labeling & IDs.

The rules for node labeling are good, but would benefit from concrete examples to illustrate the expected format.

1. **Node Labeling & IDs**
   - Use basic types only (e.g., "Person", "Date", "Organization").
   - Avoid overly specific or generic terms (e.g., no "Mathematician" or "Entity").
   - Node IDs must be human-readable names from the text (no numbers).
+   - *Example*: "Person:John_Smith", "Organization:Acme_Corp", "Place:New_York_City"

13-15: Add examples for date and property formatting.

The date and property formatting instructions would be clearer with specific examples.

2. **Dates & Numbers**
   - Label dates as **"Date"** in "YYYY-MM-DD" format (use available parts if incomplete).
   - Properties are key-value pairs; do not use escaped quotes.
+   - *Example*: "Date:2023-04-15", "Date:2023-04" (if only year and month are available)
+   - *Example Properties*: {"age": 42, "location": "San Francisco"}
.github/README_WORKFLOW_MIGRATION.md (2)

47-54: Add troubleshooting section.

The document would benefit from a troubleshooting section to help users resolve common issues that might arise during or after migration.

3. No tests are left out of the test-suites.yml orchestrator

+## Troubleshooting
+
+If you encounter issues after migration:
+
+1. **Workflow Not Running**: Make sure it's properly called in test-suites.yml
+2. **Missing Secrets**: Ensure secrets are properly inherited
+3. **Rollback**: You can restore from .bak files created by the script:
+   ```bash
+   cp .github/workflows/your_workflow.yml.bak .github/workflows/your_workflow.yml
+   ```
+4. **Script Errors**: Check for error messages and verify file paths

93-98: Add information about testing migrated workflows.

The document mentions verifying that workflows run correctly when called by test-suites.yml, but doesn't provide specific instructions on how to test this.

- **Infrastructure Workflows**: Don't modify workflows that handle infrastructure or deployments

+## Testing Migrated Workflows
+
+To test a migrated workflow:
+
+1. Push your changes to a branch
+2. Go to the Actions tab in GitHub
+3. Select the test-suites workflow
+4. Click "Run workflow" and select your branch
+5. Verify that your migrated workflow executes successfully when called by test-suites
evals/README.md (2)

70-70: Use more concise wording.

Replace "In order to ensure the highest possible accuracy" with "To ensure the highest possible accuracy" for brevity.

🧰 Tools
🪛 LanguageTool

[style] ~70-~70: Consider a shorter alternative to avoid wordiness.
Context: ...ect_llm.json` #### Human Evaluation In order to ensure the highest possible accuracy of...

(IN_ORDER_TO_PREMIUM)


87-87: Spell "labor-intensive" with a hyphen.

Use "labor-intensive" instead of "labor intensive" to conform to standard spelling.

🧰 Tools
🪛 LanguageTool

[misspelling] ~87-~87: This word is normally spelled with a hyphen.
Context: ...memory evaluation - Human as a judge is labor intensive and does not scale - Hotpot is not the ...

(EN_COMPOUNDS_LABOR_INTENSIVE)

cognee/infrastructure/llm/prompts/generate_graph_prompt_oneshot.txt (2)

83-83: Add a comma after the year.

In a month-day-year date format (e.g., "September 4, 1998,"), include a comma after the day to follow standard style conventions.

🧰 Tools
🪛 LanguageTool

[style] ~83-~83: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...**: "Google was founded on September 4, 1998 and has a market cap of 800000000000." ...

(MISSING_COMMA_AFTER_YEAR)


127-127: Use a shorter phrase.

"absolutely essential" might be wordy. Consider using "essential" instead.

🧰 Tools
🪛 LanguageTool

[style] ~127-~127: ‘absolutely essential’ might be wordy. Consider a shorter alternative.
Context: ...y edges (e.g., "X is a concept") unless absolutely essential. ### 4.3 Inferred Facts - Rule: On...

(EN_WORDINESS_PREMIUM_ABSOLUTELY_ESSENTIAL)

cognee/tests/test_relational_db_migration.py (1)

111-111: Rename unused variable.

edge_data is never used inside this loop. Rename it to _edge_data to signal it's unused and suppress warnings.

🧰 Tools
🪛 Ruff (0.8.2)

111-111: Loop control variable edge_data not used within loop body

Rename unused edge_data to _edge_data

(B007)

cognee/eval_framework/corpus_builder/task_getters/get_default_tasks_by_indices.py (2)

25-31: Optional parameters are a good addition for extensibility.

If user is not used inside the function, consider removing it or leveraging it for user-specific logic to avoid confusion.


53-54: Same caution with task indices.

Consider adding dynamic references or named constants for task indices to avoid possible confusion in the future.
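One way named constants could look; the task names below are purely illustrative and would need to match the actual default task order:

```python
from enum import IntEnum


class DefaultTaskIndex(IntEnum):
    """Illustrative names for positions in the default task list."""

    CLASSIFY_DOCUMENTS = 0
    CHECK_PERMISSIONS = 1
    EXTRACT_CHUNKS = 2
    EXTRACT_GRAPH = 3
    SUMMARIZE = 4
    ADD_DATA_POINTS = 5


# Selecting tasks by name instead of bare integers:
selected_indices = [DefaultTaskIndex.EXTRACT_CHUNKS, DefaultTaskIndex.EXTRACT_GRAPH]
```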

evals/falkor_01042025/hotpot_qa_falkor_graphrag_sdk.py (2)

1-10: Remove unused import URL.

-from graphrag_sdk.source import URL, STRING
+from graphrag_sdk.source import STRING
🧰 Tools
🪛 Ruff (0.8.2)

5-5: graphrag_sdk.source.URL imported but unused

Remove unused import: graphrag_sdk.source.URL

(F401)


66-121: Knowledge graph creation logic is comprehensive.

Consider adding a small retry mechanism around database connection steps to handle transient connection issues.
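A minimal retry sketch, assuming a synchronous connection callable; the actual FalkorDB connection call is left as a placeholder:

```python
import logging
import time
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call a flaky connection function, retrying with exponential backoff on transient errors."""
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except (ConnectionError, OSError) as error:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("Connection attempt %d failed (%s); retrying in %.1fs", attempt, error, delay)
            time.sleep(delay)


# Usage (placeholder connection callable):
# graph_db = connect_with_retry(lambda: open_falkor_connection(host="localhost", port=6379))
```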

evals/graphiti_01042025/hotpot_qa_graphiti.py (2)

1-11: Remove unused import OpenAI.

-from openai import OpenAI
🧰 Tools
🪛 Ruff (0.8.2)

9-9: openai.OpenAI imported but unused

Remove unused import: openai.OpenAI

(F401)


51-122: Retrieval-based QA logic is well-articulated.

Consider parameterizing the system prompt content to allow flexible instructions for minimal or detailed answers.
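A small sketch of what that parameterization might look like; the function name and prompt wording are invented for illustration:

```python
def build_system_prompt(style: str = "minimal") -> str:
    """Return a QA system prompt tuned for either minimal or detailed answers."""
    prompts = {
        "minimal": "Answer with the shortest possible phrase extracted from the provided context.",
        "detailed": "Answer in one or two full sentences, citing the relevant context.",
    }
    return prompts.get(style, prompts["minimal"])


# messages = [{"role": "system", "content": build_system_prompt("minimal")}, ...]
```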

evals/plot_metrics.py (1)

210-212: Use dictionary iteration directly rather than .keys().
This minor fix enhances readability and aligns with Python best practices.

- for system in all_systems_metrics.keys():
+ for system in all_systems_metrics:
    if system not in systems:
        systems.append(system)
🧰 Tools
🪛 Ruff (0.8.2)

210-210: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

evals/mem0_01042025/hotpot_qa_mem0.py (2)

28-28: Rename the unused loop variable to _i.
This avoids confusion and aligns with Python convention for unused variables.

- for i, document in enumerate(tqdm(corpus, desc="Adding documents")):
+ for _i, document in enumerate(tqdm(corpus, desc="Adding documents")):
🧰 Tools
🪛 Ruff (0.8.2)

28-28: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


81-83: Consider adding error handling for potential API failures.
Using try/except around the OpenAI API call can help handle network or rate-limiting errors gracefully.

    try:
        response = openai_client.chat.completions.create(model=model_name, messages=messages)
        answer = response.choices[0].message.content
    except Exception as e:
        print(f"OpenAI API error: {e}")
        answer = "Error: Unable to fetch answer"
cognee/tests/test_telemetry.py (1)

87-98: Add test for telemetry failure handling

The tests cover the cases where telemetry is enabled or disabled, but not when telemetry sending fails (e.g., network error).

Consider adding a test case for network failure:

@patch("cognee.shared.utils.requests.post")
def test_telemetry_network_failure(self, mock_post):
    # Setup mock to raise an exception
    mock_post.side_effect = requests.exceptions.RequestException("Network error")
    
    # Ensure telemetry is enabled
    if "TELEMETRY_DISABLED" in os.environ:
        del os.environ["TELEMETRY_DISABLED"]
    
    os.environ["ENV"] = "prod"
    
    try:
        # Send telemetry - this should not raise an exception
        send_telemetry("failure_test", "user123", {"key": "value"})
        # If we get here, the function handled the error properly
        success = True
    except Exception:
        success = False
    
    # Verify the function caught the exception without propagating it
    self.assertTrue(success, "Telemetry should handle network failures gracefully")
    
    # Clean up
    del os.environ["ENV"]

This test ensures that telemetry failures don't impact the application's functionality.

🛑 Comments failed to post (12)
cognee/shared/logging_utils.py (1)

216-217: ⚠️ Potential issue

Critical bug: __name__ is not quoted in the hasattr check

The current code will raise a NameError because __name__ is treated as a variable instead of a string literal attribute name.

Apply this fix:

-            if hasattr(exc_type, __name__):
+            if hasattr(exc_type, '__name__'):
                event_dict["exception_type"] = exc_type.__name__
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            if hasattr(exc_type, '__name__'):
                event_dict["exception_type"] = exc_type.__name__
cognee/tasks/graph/extract_graph_from_data.py (1)

2-2: ⚠️ Potential issue

Remove unused import

The Optional type is imported but not used anywhere in this file. To maintain clean imports and follow best practices, consider removing it.

-from typing import Type, List, Optional
+from typing import Type, List
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from typing import Type, List
🧰 Tools
🪛 Ruff (0.8.2)

2-2: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

cognee/modules/data/extraction/knowledge_graph/extract_content_graph.py (1)

1-1: ⚠️ Potential issue

Remove unused import

The Optional type is imported but not used anywhere in the code.

-from typing import Type, Optional
+from typing import Type
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from typing import Type
🧰 Tools
🪛 Ruff (0.8.2)

1-1: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

.github/workflows/disable_independent_workflows.sh (1)

4-4: ⚠️ Potential issue

Add error handling to the CD command.

The cd command could fail if the directory doesn't exist, but the script would continue executing in the wrong directory, potentially causing unexpected behavior.

-cd "$(dirname "$0")"
+cd "$(dirname "$0")" || exit 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

cd "$(dirname "$0")" || exit 1
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

cognee/tests/test_cognee_server_start.py (5)

48-49: ⚠️ Potential issue

Update endpoint URLs to use the dynamic port.

If implementing dynamic port selection, the test methods need to use the stored port instead of the hardcoded value.

        # Test health endpoint
-       health_response = requests.get("http://localhost:8000/health", timeout=10)
+       health_response = requests.get(f"http://localhost:{self.__class__.port}/health", timeout=10)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        health_response = requests.get(f"http://localhost:{self.__class__.port}/health", timeout=10)
        self.assertEqual(health_response.status_code, 200)

52-55: ⚠️ Potential issue

Update root endpoint test to use the dynamic port.

Similar to the health endpoint, the root endpoint test should use the dynamic port if implemented.

        # Test root endpoint
-       root_response = requests.get("http://localhost:8000/", timeout=10)
+       root_response = requests.get(f"http://localhost:{self.__class__.port}/", timeout=10)
        self.assertEqual(root_response.status_code, 200)
        self.assertIn("message", root_response.json())
        self.assertEqual(root_response.json()["message"], "Hello, World, I am alive!")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        # Test root endpoint
        root_response = requests.get(f"http://localhost:{self.__class__.port}/", timeout=10)
        self.assertEqual(root_response.status_code, 200)
        self.assertIn("message", root_response.json())
        self.assertEqual(root_response.json()["message"], "Hello, World, I am alive!")

30-30: 🛠️ Refactor suggestion

Implement a more robust waiting mechanism for server startup.

Using a fixed sleep time of 20 seconds may be too long or too short depending on the environment. Consider implementing a polling mechanism to check when the server is actually ready.

-       # Give the server some time to start
-       time.sleep(20)
+       # Wait for the server to start with timeout
+       max_retries = 20
+       retry_interval = 1.0
+       server_ready = False
+       
+       for _ in range(max_retries):
+           if cls.server_process.poll() is not None:
+               # Server process has terminated
+               break
+               
+           try:
+               response = requests.get(f"http://localhost:{cls.port}/health", timeout=1)
+               if response.status_code == 200:
+                   server_ready = True
+                   break
+           except (requests.RequestException, ConnectionError):
+               pass
+               
+           time.sleep(retry_interval)
+           
+       if not server_ready and cls.server_process.poll() is None:
+           # Server is running but not responding
+           print("Server started but not responding to health checks", file=sys.stderr)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

       # Wait for the server to start with timeout
       max_retries = 20
       retry_interval = 1.0
       server_ready = False
       
       for _ in range(max_retries):
           if cls.server_process.poll() is not None:
               # Server process has terminated
               break
               
           try:
               response = requests.get(f"http://localhost:{cls.port}/health", timeout=1)
               if response.status_code == 200:
                   server_ready = True
                   break
           except (requests.RequestException, ConnectionError):
               pass
               
           time.sleep(retry_interval)
           
       if not server_ready and cls.server_process.poll() is None:
           # Server is running but not responding
           print("Server started but not responding to health checks", file=sys.stderr)

14-28: 🛠️ Refactor suggestion

Add port conflict detection and dynamic port selection.

The test uses a hardcoded port 8000, which could be in use by other processes during testing, causing the test to fail. Consider checking if the port is available and/or using a dynamic port.

        # Start the Cognee server - just check if the server can start without errors
+       port = self._find_available_port(8000, 8100)
        cls.server_process = subprocess.Popen(
            [
                sys.executable,
                "-m",
                "uvicorn",
                "cognee.api.client:app",
                "--host",
                "0.0.0.0",
                "--port",
-               "8000",
+               str(port),
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            preexec_fn=os.setsid,
        )
+       cls.port = port
        
+    @staticmethod
+    def _find_available_port(start_port, end_port):
+        """Find an available port in the given range."""
+        import socket
+        for port in range(start_port, end_port):
+            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+                if s.connect_ex(('localhost', port)) != 0:
+                    return port
+        raise RuntimeError(f"No available ports in range {start_port}-{end_port}")

Committable suggestion skipped: line range outside the PR's diff.


41-43: ⚠️ Potential issue

Implement platform-independent process termination.

The tearDown method uses Unix-specific os.killpg to terminate the process group, which won't work on Windows. Add platform-specific code for terminating the process on all platforms.

        # Terminate the server process
        if hasattr(cls, "server_process") and cls.server_process:
-           os.killpg(os.getpgid(cls.server_process.pid), signal.SIGTERM)
-           cls.server_process.wait()
+           if os.name != 'nt':  # Unix-like systems
+               try:
+                   os.killpg(os.getpgid(cls.server_process.pid), signal.SIGTERM)
+               except (ProcessLookupError, OSError):
+                   # Process might already be gone
+                   pass
+           else:  # Windows
+               import subprocess
+               try:
+                   subprocess.call(['taskkill', '/F', '/T', '/PID', str(cls.server_process.pid)])
+               except (subprocess.SubprocessError, OSError):
+                   # Process might already be gone
+                   pass
+           
+           try:
+               cls.server_process.wait(timeout=5)
+           except subprocess.TimeoutExpired:
+               # Force kill if it doesn't terminate gracefully
+               if cls.server_process.poll() is None:
+                   cls.server_process.kill()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        if hasattr(cls, "server_process") and cls.server_process:
            if os.name != 'nt':  # Unix-like systems
                try:
                    os.killpg(os.getpgid(cls.server_process.pid), signal.SIGTERM)
                except (ProcessLookupError, OSError):
                    # Process might already be gone
                    pass
            else:  # Windows
                import subprocess
                try:
                    subprocess.call(['taskkill', '/F', '/T', '/PID', str(cls.server_process.pid)])
                except (subprocess.SubprocessError, OSError):
                    # Process might already be gone
                    pass
           
            try:
                cls.server_process.wait(timeout=5)
            except subprocess.TimeoutExpired:
                # Force kill if it doesn't terminate gracefully
                if cls.server_process.poll() is None:
                    cls.server_process.kill()
cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt (1)

26-27: 🛠️ Refactor suggestion

Specify expected output format.

The prompt doesn't specify the exact format in which the knowledge graph should be returned (JSON, XML, etc.). This could lead to inconsistent outputs.

5. **Strict Compliance**
   - Follow these rules exactly. Non-compliance results in termination.
+
+6. **Output Format**
+   - Return the knowledge graph as a JSON object with "nodes" and "edges" arrays.
+   - Each node should have "id", "type", and optional "properties".
+   - Each edge should have "source", "target", "label", and optional "properties".
+   - *Example Output*:
+   ```json
+   {
+     "nodes": [
+       {"id": "John_Smith", "type": "Person", "properties": {"age": 42}},
+       {"id": "Acme_Corp", "type": "Organization"}
+     ],
+     "edges": [
+       {"source": "John_Smith", "target": "Acme_Corp", "label": "works_for"}
+     ]
+   }
+   ```
.github/README_WORKFLOW_MIGRATION.md (1)

24-46: 🛠️ Refactor suggestion

Add instruction to update test-suites.yml.

The manual migration section explains how to modify individual workflows but doesn't mention how to update the test-suites.yml file to actually call those workflows after migration.

4. Save the file

+## Updating test-suites.yml
+
+After migrating a workflow, you need to ensure it's called by the test-suites.yml file:
+
+1. Open `.github/workflows/test-suites.yml`
+2. Find the appropriate job or create a new one
+3. Add a step to call your migrated workflow:
+   ```yaml
+   test_your_workflow:
+     needs: [previous_job]  # If there are dependencies
+     uses: ./.github/workflows/your_workflow.yml
+     secrets: inherit
+   ```
+4. Save the file
cognee/tests/test_telemetry.py (1)

19-29: 🛠️ Refactor suggestion

Improve file handling in tests

The test creates a .anon_id file in the project root, which could cause issues:

  1. It modifies the filesystem outside the test directory
  2. It doesn't clean up after test execution
  3. There's no error handling for file operations

Consider using a temporary directory and patching the path:

-        # Check if .anon_id exists in the project root
-        anon_id_path = os.path.join(
-            os.path.dirname(os.path.dirname(os.path.dirname(__file__))), ".anon_id"
-        )
-
-        if not os.path.exists(anon_id_path):
-            # Create the file with a test ID if it doesn't exist
-            with open(anon_id_path, "w") as f:
-                f.write("test-machine")
-            print(f"Created .anon_id file at {anon_id_path}", file=sys.stderr)
+        # Use a temporary file for testing
+        with tempfile.NamedTemporaryFile(mode='w+', delete=False) as temp_file:
+            temp_file.write("test-machine")
+            anon_id_path = temp_file.name
+        
+        # Patch the path used by the telemetry function
+        with patch('cognee.shared.utils.ANON_ID_PATH', anon_id_path):

Also add cleanup after the test:

    def tearDown(self):
        # Clean up temporary file if it exists
        if hasattr(self, 'anon_id_path') and os.path.exists(self.anon_id_path):
            os.unlink(self.anon_id_path)

Don't forget to import tempfile at the top of the file.

@dexters1 dexters1 requested review from borisarzentar and lxobr April 15, 2025 13:34
@Vasilije1990 Vasilije1990 self-requested a review April 16, 2025 12:16
@dexters1 dexters1 merged commit a036787 into dev Apr 16, 2025
46 of 49 checks passed
@dexters1 dexters1 deleted the embedding-string-fix branch April 16, 2025 20:39
