Embedding string fix [COG-1900] #742
Conversation
Please make sure all the checkboxes are checked:
✅ There are no secrets present in this pull request anymore. If these secrets were true positives and are still valid, we highly recommend you revoke them. 🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the repository's CodeRabbit configuration file. You can disable this status message via the CodeRabbit configuration file.

Walkthrough

This update introduces a wide range of changes including new features, configuration enhancements, test additions, prompt and documentation updates, and significant codebase cleanups. Major additions include new prompt templates for knowledge graph extraction and QA benchmarking, new scripts for workflow management, and comprehensive evaluation pipelines for multiple AI systems. Configuration is made more flexible with environment-based default user credentials and dynamic prompt paths. Test coverage is expanded with server startup, telemetry, and relational database migration tests. Several obsolete or demonstration notebooks and example scripts are removed, and batch sizes and error handling are refined in various modules. Enum and function signatures are updated for clarity and consistency.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Config
    participant System
    participant DB
    User->>Config: Set DEFAULT_USER_EMAIL/PASSWORD in env
    System->>Config: get_base_config()
    System->>Config: Read default_user_email/password
    alt If not set
        System->>System: Use hardcoded defaults
    end
    System->>DB: Create or retrieve default user with credentials
    DB-->>System: User object
```
```mermaid
sequenceDiagram
    participant Workflow Script
    participant Workflow YAML
    Workflow Script->>Workflow YAML: Backup file
    Workflow Script->>Workflow YAML: Check for workflow_call trigger
    alt Not present
        Workflow Script->>Workflow YAML: Replace on: triggers with workflow_call
    else Already present
        Workflow Script->>Workflow YAML: Skip modification
    end
```
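As a rough illustration of the env-based default-user flow in the first diagram, here is a minimal sketch; the function name and the fallback literals are placeholders, not cognee's actual defaults.

```python
import os

# Hypothetical sketch of the credential flow shown above; cognee's real
# get_base_config() may expose these fields differently.
def get_default_user_credentials():
    # Fall back to hardcoded defaults when the env vars are not set
    # (these fallback values are illustrative placeholders).
    email = os.getenv("DEFAULT_USER_EMAIL", "default_user@example.com")
    password = os.getenv("DEFAULT_USER_PASSWORD", "default_password")
    return email, password
```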
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 12
🔭 Outside diff range comments (2)
cognee/api/v1/config/config.py (1)
157-158: 💡 Verification agent · 🧩 Analysis chain
Inconsistent error handling detected

This is unrelated to your current changes, but I noticed that this error handling is different from other similar methods. This line uses `AttributeError` directly with a `message` parameter, while other methods use `InvalidAttributeError`.
🏁 Script executed:

```bash
#!/bin/bash
# Check for similar error handling patterns in the file
rg -A 3 "raise.*AttributeError" --context 2 cognee/api/v1/config/config.py
```

Length of output: 1187
Action Required: Align Exception Type for Consistency

The verification confirms that while most configuration blocks in `cognee/api/v1/config/config.py` use `InvalidAttributeError` to signal invalid attributes, the `graph_db_config` block is still using `AttributeError`. To maintain consistency:

- File: `cognee/api/v1/config/config.py`
- Location: the block handling `graph_db_config` (currently at lines 157-158)
- Suggested change: replace

```python
else:
    raise AttributeError(message=f"'{key}' is not a valid attribute of the config.")
```

with

```python
else:
    raise InvalidAttributeError(
        message=f"'{key}' is not a valid attribute of the config."
    )
```

Please update the error handling accordingly.
cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt (1)
1-28: 💡 Verification agent · 🧩 Analysis chain
Address handling of large string embeddings.
Since the PR objective mentions "allowing embedding of large strings to support full row embedding in SQL databases," consider adding specific guidance on handling large text fields or complex data structures.
🌐 Web query: What is the maximum string length for embedding in SQL databases?

💡 Result:
The maximum string length for embedding in SQL databases varies significantly depending on the database system and the data type used. Below are the details for some common database systems:

SQL Server
- VARCHAR(MAX) or NVARCHAR(MAX): can store up to 2 GB of data (approximately 1,073,741,823 characters for NVARCHAR and 2,147,483,647 characters for VARCHAR)[1][6][9].
- VARCHAR(N): supports strings up to 8,000 bytes in length; exceeding this limit requires using VARCHAR(MAX)[9].
- Other types like CHAR are fixed-length and generally smaller, with a maximum of 8,000 bytes[9].

MySQL
- VARCHAR: the maximum size is constrained by the row size limit (65,535 bytes), including storage overhead and character set encoding. For example, with utf8mb4, a VARCHAR column can hold fewer characters than its byte limit[5][9].
- TEXT types (e.g., LONGTEXT): can store up to 4 GB of data, but these are not directly comparable to VARCHAR for embedding strings[5].

SQLite
- The maximum string length is capped at 1 billion bytes, based on the SQLITE_MAX_LENGTH parameter[7].

Oracle
- VARCHAR2: by default, supports up to 4,000 bytes. With the MAX_STRING_SIZE parameter set to EXTENDED, it allows up to 32,767 bytes[10].

BigQuery
- SQL queries, including embedded text, are limited to 1 MB in size (1,024 KB)[2].

Caché (InterSystems)
- String: the maximum embedded string length is 32,767 characters, though certain configurations (like enabling long strings or using STREAM data types) allow higher limits[8].

PowerBuilder with SQL Anywhere
- While the theoretical limit can be 2 GB for types like LONG VARCHAR, safe embedded string lengths vary significantly. Developers report issues with strings exceeding 10 million characters[3].

Each database has different practical and theoretical limits depending on the specific data types, so it's essential to choose appropriate types (e.g., VARCHAR(MAX), LONGTEXT, or BLOB) when working with large strings.

Citations:
- 1: https://learn.microsoft.com/en-us/sql/sql-server/maximum-capacity-specifications-for-sql-server?view=sql-server-ver16
- 2: https://www.googlecloudcommunity.com/gc/Data-Analytics/The-query-is-too-large-The-maximum-standard-SQL-query-length-is/m-p/699275
- 3: https://community.appeon.com/qna/q-a/embedded-sql-input-variable-value-max-length
- 4: https://www.sqlservercentral.com/blogs/the-length-limitations-of-len
- 5: https://dev.mysql.com/doc/en/char.html
- 6: https://learn.microsoft.com/en-us/answers/questions/264575/character-limit-exceeding-db-limit
- 7: https://www.sqlite.org/limits.html
- 8: https://community.intersystems.com/post/cach%C3%A9-sql-maximum-supported-length
- 9: https://www.red-gate.com/simple-talk/databases/sql-server/learn/when-use-char-varchar-varcharmax/
- 10: https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/datatype-limits.html
Enhance SQL Embedding Guidelines

- The current prompt focuses on generating a knowledge graph but does not address the potential challenges when embedding large text fields for full row SQL embedding.
- Since SQL databases have widely varying string length limits (e.g., up to 2 GB in SQL Server with VARCHAR(MAX), row size limits in MySQL, 1 billion bytes in SQLite, etc.), please add explicit guidance (see the sketch below). For example:
  - Instruct the extraction algorithm to verify the target SQL engine's string length limits.
  - Specify whether to truncate, split, or use alternative data types (e.g., TEXT, BLOB) when processing large text embeddings.
- This addition will help ensure robust handling of complex or oversized data structures.
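For instance, a minimal pre-insert guard might look like the following sketch; the limits are illustrative values taken from the survey above, not authoritative for every engine configuration.

```python
# Illustrative engine limits in bytes; real limits depend on column type
# and configuration, as surveyed above.
ENGINE_TEXT_LIMITS = {
    "sqlserver_varchar": 8_000,      # before VARCHAR(MAX) is required
    "oracle_varchar2": 4_000,        # default, without EXTENDED MAX_STRING_SIZE
    "sqlite": 1_000_000_000,         # SQLITE_MAX_LENGTH default
}

def fits_column(text: str, engine_key: str) -> bool:
    """Return True if the UTF-8 encoded text fits the engine's limit."""
    return len(text.encode("utf-8")) <= ENGINE_TEXT_LIMITS[engine_key]
```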
🧹 Nitpick comments (31)
cognee/infrastructure/llm/prompts/answer_simple_question_benchmark4.txt (1)
1-15: Prompt is clear, precise, and well-structured for benchmarking.

The instructions are explicit, unambiguous, and well-aligned with the goal of minimal, context-bound QA benchmarking. No issues found.
If you want to be even more explicit, you could clarify whether "no punctuation" applies to numbers/dates (e.g., "2024.04.01") or just to terminal punctuation, but this is a minor nitpick.
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2)
81-89: Recursive splitting for multi-element lists.

Splitting the text array in half and embedding each side in parallel is a workable approach to reduce context size. However, if there's a very large input list, this recursive approach can rapidly spawn multiple async tasks. You may consider implementing a maximum recursion depth or a chunking mechanism to avoid excessive concurrency in extreme cases.
93-108: Handling single oversize strings with overlap-splitting.

Splitting one large string into overlapping thirds, embedding each, then pooling the results helps reduce context size while retaining context overlap. If the remaining segments are still too large, subsequent recursions will continue. Although it will shrink geometrically, consider a safeguard for unbounded recursion (e.g., a configurable recursion cap) and log the recursion depth for performance monitoring.
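To make the two recursion paths concrete, here is a hedged sketch of the overall strategy with the suggested recursion cap; `embed_fn` stands in for the engine's actual embedding call, and the broad exception handling and split sizes are assumptions rather than the adapter's real code.

```python
import asyncio

MAX_DEPTH = 8  # configurable safeguard against unbounded recursion

async def embed_with_split(texts, embed_fn, depth=0):
    """Try to embed; on overflow, halve multi-element batches or
    overlap-split a single oversize string and mean-pool the chunks."""
    if depth > MAX_DEPTH:
        raise RuntimeError("embedding split exceeded recursion cap")
    try:
        return await embed_fn(texts)
    except Exception:  # the real adapter catches a specific context-size error
        if len(texts) > 1:
            # Halve the batch and embed both halves concurrently.
            mid = len(texts) // 2
            left, right = await asyncio.gather(
                embed_with_split(texts[:mid], embed_fn, depth + 1),
                embed_with_split(texts[mid:], embed_fn, depth + 1),
            )
            return left + right
        # Single oversize string: three overlapping chunks, then mean-pool.
        text = texts[0]
        third = max(1, len(text) // 3)
        overlap = third // 2
        chunks = [
            text[max(0, i * third - overlap): (i + 1) * third + overlap]
            for i in range(3)
        ]
        vectors = await embed_with_split(chunks, embed_fn, depth + 1)
        pooled = [sum(dim) / len(vectors) for dim in zip(*vectors)]
        return [pooled]
```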
evals/requirements.txt (1)
1-6: LGTM! Dependencies for the plot_metrics.py script are properly defined.

The requirements file correctly specifies the exact dependencies needed for the metrics plotting functionality, with properly pinned versions for reproducibility.
Note: since `pathlib` is part of the Python standard library since Python 3.4, it doesn't need to be explicitly installed. Consider removing this dependency unless you're supporting Python versions older than 3.4.

cognee/tests/unit/entity_extraction/regex_entity_extraction_test.py (1)
85-96: Consider adding more details to the TODO comment

The test for money extraction has been temporarily disabled with a TODO comment. While this is a reasonable short-term solution to prevent CI failures, consider adding more specific information about what's failing with the regex and the deadline for fixing it.
Also, ensure you have a tracking issue in your issue tracking system for this fix, since disabled tests are often forgotten.
Would you like me to help diagnose the regex issues or create a tracking issue for this task?
cognee/infrastructure/llm/prompts/answer_simple_question_benchmark2.txt (1)
1-7: Clear and well-structured prompt format

The prompt template provides clear instructions for generating minimal, essential answers extracted from context with specific formatting requirements for different question types. This aligns with the PR objective by supporting efficient embedding of answers in SQL databases through consistent, minimalistic formatting.
The bullet point structure and consistent "For x questions: y" format makes the instructions easy to follow, though you might consider varying the beginning of some sentences to improve readability as suggested by the static analysis.
```diff
 You are a benchmark-optimized QA system. Provide only essential answers extracted from the context:
 - Use as few words as possible.
 - For yes/no questions: answer with "yes" or "no".
 - For what/who/where questions: reply with a single word or brief phrase.
-- For when questions: return only the relevant date/time.
-- For how/why questions: use the briefest phrase.
+- When asked about time: return only the relevant date/time.
+- Questions about how/why: use the briefest phrase.
 No punctuation, lowercase answers only.
```

🧰 Tools
🪛 LanguageTool

[style] ~5-~5: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...y with a single word or brief phrase. - For when questions: return only the relevan... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)

[style] ~6-~6: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: return only the relevant date/time. - For how/why questions: use the briefest phr... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
cognee/api/v1/config/config.py (1)
134-147: Well-implemented migration configuration method

The new `set_migration_db_config` method follows the established pattern for configuration methods in this class. It properly validates attributes and raises appropriate exceptions for invalid keys. However, there's a minor documentation issue in the docstring.
- """ - Updates the relational db config with values from config_dict. - """ + """ + Updates the migration db config with values from config_dict. + """cognee/eval_framework/corpus_builder/run_corpus_builder.py (1)
cognee/eval_framework/corpus_builder/run_corpus_builder.py (1)

3-3: Remove the unused import `Optional`.

According to the static analysis hint, `Optional` is imported but not actually used. Consider removing it to keep the import list clean.

```diff
-from typing import List, Optional
+from typing import List
```

🧰 Tools
🪛 Ruff (0.8.2)

3-3: `typing.Optional` imported but unused

Remove unused import: `typing.Optional` (F401)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
128-152: Optimize lookups for improved performance.

Currently, the code performs a SELECT query for each data point to check for existence before inserting or updating. This can be costly for large lists of data points. Consider querying once for all IDs, building a dictionary of existing records, and then iterating through the data points to update or insert as needed. This avoids repeated round trips and speeds up batch processing.
Below is an illustrative diff:

```diff
 async with self.get_async_session() as session:
-    pgvector_data_points = []
-    for data_index, data_point in enumerate(data_points):
-        data_point_db = (
-            await session.execute(
-                select(PGVectorDataPoint).filter(PGVectorDataPoint.id == data_point.id)
-            )
-        ).scalar_one_or_none()
-        if data_point_db:
-            data_point_db.id = data_point.id
-            data_point_db.vector = data_vectors[data_index]
-            data_point_db.payload = serialize_data(data_point.model_dump())
-            pgvector_data_points.append(data_point_db)
-        else:
-            pgvector_data_points.append(
-                PGVectorDataPoint(
-                    id=data_point.id,
-                    vector=data_vectors[data_index],
-                    payload=serialize_data(data_point.model_dump()),
-                )
-            )
+    # Gather all IDs first
+    data_point_ids = [dp.id for dp in data_points]
+    existing_records = (
+        await session.execute(
+            select(PGVectorDataPoint).filter(PGVectorDataPoint.id.in_(data_point_ids))
+        )
+    ).scalars().all()
+
+    existing_map = {record.id: record for record in existing_records}
+    pgvector_data_points = []
+
+    for data_index, dp in enumerate(data_points):
+        if dp.id in existing_map:
+            # Update existing record
+            record = existing_map[dp.id]
+            record.vector = data_vectors[data_index]
+            record.payload = serialize_data(dp.model_dump())
+            pgvector_data_points.append(record)
+        else:
+            # Create a new record
+            new_record = PGVectorDataPoint(
+                id=dp.id,
+                vector=data_vectors[data_index],
+                payload=serialize_data(dp.model_dump()),
+            )
+            pgvector_data_points.append(new_record)

     session.add_all(pgvector_data_points)
     await session.commit()
```

.github/workflows/disable_independent_workflows.sh (2)
65-70: Add comments explaining the AWK command logic.

The AWK command is used to find the next section after the `on:` block, but this logic is complex and would benefit from more detailed comments explaining how it identifies the next section.

```diff
 # Find where to continue after the original 'on:' section
+# Find the next line that starts with a lowercase letter (indicating a new YAML section)
+# after the line containing the 'on:' section
 next_section=$(awk "NR > $on_line && /^[a-z]/ {print NR; exit}" "$workflow")
```
53-74: Consider error handling for file operations.

The script creates and moves files but doesn't check if these operations succeed. In a production environment, it would be better to add error handling for these operations.

```diff
 # Create a new file with the modified content
 {
     # Copy the part before 'on:'
     head -n $((on_line-1)) "$workflow"

     # Add the new on: section that only includes workflow_call
     echo "on:"
     echo "  workflow_call:"
     echo "    secrets:"
     echo "      inherit: true"

     # Find where to continue after the original 'on:' section
     next_section=$(awk "NR > $on_line && /^[a-z]/ {print NR; exit}" "$workflow")
     if [ -z "$next_section" ]; then
         next_section=$(wc -l < "$workflow")
         next_section=$((next_section+1))
     fi

     # Copy the rest of the file starting from the next section
     tail -n +$next_section "$workflow"
-} > "${workflow}.new"
+} > "${workflow}.new" || { echo "Error creating new file for $workflow"; continue; }

 # Replace the original with the new version
-mv "${workflow}.new" "$workflow"
+mv "${workflow}.new" "$workflow" || echo "Error replacing $workflow with modified version"
```

cognee/tests/test_cognee_server_start.py (1)
28-28: Add platform-specific code for Windows compatibility.

Using `preexec_fn=os.setsid` is only supported on Unix-like systems and will cause issues on Windows. Consider adding platform-specific code to handle this.

```diff
+# Platform-specific process group handling
+kwargs = {}
+if os.name != 'nt':  # If not Windows
+    kwargs['preexec_fn'] = os.setsid
+else:
+    # Windows uses a different approach for process groups.
+    # For simplicity, we'll omit this on Windows and manage the process differently.
+    pass
+
 cls.server_process = subprocess.Popen(
     [
         sys.executable,
         "-m",
         "uvicorn",
         "cognee.api.client:app",
         "--host",
         "0.0.0.0",
         "--port",
         "8000",
     ],
     stdout=subprocess.PIPE,
     stderr=subprocess.PIPE,
-    preexec_fn=os.setsid,
+    **kwargs,
 )
```

cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt (2)
8-12: Consider adding examples for Node Labeling & IDs.

The rules for node labeling are good, but would benefit from concrete examples to illustrate the expected format.

```diff
 1. **Node Labeling & IDs**
    - Use basic types only (e.g., "Person", "Date", "Organization").
    - Avoid overly specific or generic terms (e.g., no "Mathematician" or "Entity").
    - Node IDs must be human-readable names from the text (no numbers).
+   - *Example*: "Person:John_Smith", "Organization:Acme_Corp", "Place:New_York_City"
```
13-15: Add examples for date and property formatting.

The date and property formatting instructions would be clearer with specific examples.

```diff
 2. **Dates & Numbers**
    - Label dates as **"Date"** in "YYYY-MM-DD" format (use available parts if incomplete).
    - Properties are key-value pairs; do not use escaped quotes.
+   - *Example*: "Date:2023-04-15", "Date:2023-04" (if only year and month are available)
+   - *Example Properties*: {"age": 42, "location": "San Francisco"}
```

.github/README_WORKFLOW_MIGRATION.md (2)
47-54: Add troubleshooting section.

The document would benefit from a troubleshooting section to help users resolve common issues that might arise during or after migration.

````diff
 3. No tests are left out of the test-suites.yml orchestrator
+
+## Troubleshooting
+
+If you encounter issues after migration:
+
+1. **Workflow Not Running**: Make sure it's properly called in test-suites.yml
+2. **Missing Secrets**: Ensure secrets are properly inherited
+3. **Rollback**: You can restore from .bak files created by the script:
+   ```bash
+   cp .github/workflows/your_workflow.yml.bak .github/workflows/your_workflow.yml
+   ```
+4. **Script Errors**: Check for error messages and verify file paths
````
93-98: Add information about testing migrated workflows.

The document mentions verifying that workflows run correctly when called by test-suites.yml, but doesn't provide specific instructions on how to test this.

```diff
 - **Infrastructure Workflows**: Don't modify workflows that handle infrastructure or deployments
+
+## Testing Migrated Workflows
+
+To test a migrated workflow:
+
+1. Push your changes to a branch
+2. Go to the Actions tab in GitHub
+3. Select the test-suites workflow
+4. Click "Run workflow" and select your branch
+5. Verify that your migrated workflow executes successfully when called by test-suites
```

evals/README.md (2)
70-70: Use more concise wording.

Replace "In order to ensure the highest possible accuracy" with "To ensure the highest possible accuracy" for brevity.

🧰 Tools
🪛 LanguageTool

[style] ~70-~70: Consider a shorter alternative to avoid wordiness.
Context: ...ect_llm.json` #### Human Evaluation In order to ensure the highest possible accuracy of... (IN_ORDER_TO_PREMIUM)

87-87: Spell "labor-intensive" with a hyphen.

Use "labor-intensive" instead of "labor intensive" to conform to standard spelling.

🧰 Tools
🪛 LanguageTool

[misspelling] ~87-~87: This word is normally spelled with a hyphen.
Context: ...memory evaluation - Human as a judge is labor intensive and does not scale - Hotpot is not the ... (EN_COMPOUNDS_LABOR_INTENSIVE)
cognee/infrastructure/llm/prompts/generate_graph_prompt_oneshot.txt (2)
83-83: Add a comma after the year.

In a month-day-year date format (e.g., "September 4, 1998,"), include a comma after the day to follow standard style conventions.

🧰 Tools
🪛 LanguageTool

[style] ~83-~83: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...**: "Google was founded on September 4, 1998 and has a market cap of 800000000000." ... (MISSING_COMMA_AFTER_YEAR)

127-127: Use a shorter phrase.

"absolutely essential" might be wordy. Consider using "essential" instead.

🧰 Tools
🪛 LanguageTool

[style] ~127-~127: 'absolutely essential' might be wordy. Consider a shorter alternative.
Context: ...y edges (e.g., "X is a concept") unless absolutely essential. ### 4.3 Inferred Facts - Rule: On... (EN_WORDINESS_PREMIUM_ABSOLUTELY_ESSENTIAL)
cognee/tests/test_relational_db_migration.py (1)
111-111: Rename unused variable.

`edge_data` is never used inside this loop. Rename it to `_edge_data` to signal it's unused and suppress warnings.

🧰 Tools
🪛 Ruff (0.8.2)

111-111: Loop control variable `edge_data` not used within loop body

Rename unused `edge_data` to `_edge_data` (B007)
cognee/eval_framework/corpus_builder/task_getters/get_default_tasks_by_indices.py (2)
25-31: Optional parameters are a good addition for extensibility.

If `user` is not used inside the function, consider removing it or leveraging it for user-specific logic to avoid confusion.

53-54: Same caution with task indices.

Consider adding dynamic references or named constants for task indices to avoid possible confusion in the future; a small sketch follows.
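A sketch of what named constants for the task indices could look like; the names and positions below are hypothetical, since the real ordering depends on cognee's default task list.

```python
from enum import IntEnum

class DefaultTaskIndex(IntEnum):
    # Hypothetical positions; verify against the default task ordering.
    CLASSIFY_DOCUMENTS = 0
    EXTRACT_CHUNKS = 1
    EXTRACT_GRAPH = 2
    SUMMARIZE = 3

# Usage sketch:
# tasks = get_default_tasks_by_indices([DefaultTaskIndex.EXTRACT_GRAPH])
```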
evals/falkor_01042025/hotpot_qa_falkor_graphrag_sdk.py (2)
1-10: Remove unused import `URL`.

```diff
-from graphrag_sdk.source import URL, STRING
+from graphrag_sdk.source import STRING
```

🧰 Tools
🪛 Ruff (0.8.2)

5-5: `graphrag_sdk.source.URL` imported but unused

Remove unused import: `graphrag_sdk.source.URL` (F401)
66-121: Knowledge graph creation logic is comprehensive.

Consider adding a small retry mechanism around database connection steps to handle transient connection issues.
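For example, a generic retry wrapper along these lines could guard the connection step; `connect_fn` is a placeholder for the actual FalkorDB connection call, which is not shown here, and the error type may need adjusting to the SDK's own exceptions.

```python
import time

def connect_with_retry(connect_fn, attempts: int = 3, backoff: float = 2.0):
    """Retry a connection callable with linear backoff on transient errors."""
    for attempt in range(1, attempts + 1):
        try:
            return connect_fn()
        except ConnectionError as error:  # adjust to the SDK's actual error type
            if attempt == attempts:
                raise
            print(f"Connection attempt {attempt} failed: {error}; retrying...")
            time.sleep(backoff * attempt)
```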
evals/graphiti_01042025/hotpot_qa_graphiti.py (2)
1-11: Remove unused import `OpenAI`.

```diff
-from openai import OpenAI
```

🧰 Tools
🪛 Ruff (0.8.2)

9-9: `openai.OpenAI` imported but unused

Remove unused import: `openai.OpenAI` (F401)
51-122: Retrieval-based QA logic is well-articulated.

Consider parameterizing the system prompt content to allow flexible instructions for minimal or detailed answers.
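One way to do that, as a sketch: pass the system prompt in as a parameter with a minimal default. The default text and function name here are illustrative, not the script's actual code.

```python
MINIMAL_PROMPT = "Answer with the shortest phrase that answers the question."

def build_messages(question: str, context: str, system_prompt: str = MINIMAL_PROMPT):
    # Swap system_prompt for a more detailed instruction set when needed.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```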
evals/plot_metrics.py (1)
210-212: Use dictionary iteration directly rather than `.keys()`.

This minor fix enhances readability and aligns with Python best practices.

```diff
-    for system in all_systems_metrics.keys():
+    for system in all_systems_metrics:
         if system not in systems:
             systems.append(system)
```

🧰 Tools
🪛 Ruff (0.8.2)

210-210: Use `key in dict` instead of `key in dict.keys()`

Remove `.keys()` (SIM118)
evals/mem0_01042025/hotpot_qa_mem0.py (2)
28-28: Rename the unused loop variable to `_i`.

This avoids confusion and aligns with Python convention for unused variables.

```diff
-for i, document in enumerate(tqdm(corpus, desc="Adding documents")):
+for _i, document in enumerate(tqdm(corpus, desc="Adding documents")):
```

🧰 Tools
🪛 Ruff (0.8.2)

28-28: Loop control variable `i` not used within loop body

Rename unused `i` to `_i` (B007)
81-83: Consider adding error handling for potential API failures.

Using `try`/`except` around the OpenAI API call can help handle network or rate-limiting errors gracefully.

```python
try:
    response = openai_client.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content
except Exception as e:
    print(f"OpenAI API error: {e}")
    answer = "Error: Unable to fetch answer"
```

cognee/tests/test_telemetry.py (1)
87-98: Add test for telemetry failure handling

The tests cover the cases where telemetry is enabled or disabled, but not when telemetry sending fails (e.g., network error). Consider adding a test case for network failure:

```python
@patch("cognee.shared.utils.requests.post")
def test_telemetry_network_failure(self, mock_post):
    # Setup mock to raise an exception
    mock_post.side_effect = requests.exceptions.RequestException("Network error")

    # Ensure telemetry is enabled
    if "TELEMETRY_DISABLED" in os.environ:
        del os.environ["TELEMETRY_DISABLED"]
    os.environ["ENV"] = "prod"

    try:
        # Send telemetry - this should not raise an exception
        send_telemetry("failure_test", "user123", {"key": "value"})
        # If we get here, the function handled the error properly
        success = True
    except Exception:
        success = False

    # Verify the function caught the exception without propagating it
    self.assertTrue(success, "Telemetry should handle network failures gracefully")

    # Clean up
    del os.environ["ENV"]
```

This test ensures that telemetry failures don't impact the application's functionality.
🛑 Comments failed to post (12)
cognee/shared/logging_utils.py (1)
216-217: ⚠️ Potential issue

Critical bug: `__name__` is not quoted in the hasattr check

The current code will raise a NameError because `__name__` is treated as a variable instead of a string literal attribute name. Apply this fix:

```diff
-if hasattr(exc_type, __name__):
+if hasattr(exc_type, '__name__'):
     event_dict["exception_type"] = exc_type.__name__
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
if hasattr(exc_type, '__name__'):
    event_dict["exception_type"] = exc_type.__name__
```

cognee/tasks/graph/extract_graph_from_data.py (1)
2-2: ⚠️ Potential issue

Remove unused import

The `Optional` type is imported but not used anywhere in this file. To maintain clean imports and follow best practices, consider removing it.

```diff
-from typing import Type, List, Optional
+from typing import Type, List
```

📝 Committable suggestion

```python
from typing import Type, List
```

🧰 Tools
🪛 Ruff (0.8.2)

2-2: `typing.Optional` imported but unused

Remove unused import: `typing.Optional` (F401)
cognee/modules/data/extraction/knowledge_graph/extract_content_graph.py (1)
1-1: ⚠️ Potential issue

Remove unused import

The `Optional` type is imported but not used anywhere in the code.

```diff
-from typing import Type, Optional
+from typing import Type
```

📝 Committable suggestion

```python
from typing import Type
```

🧰 Tools
🪛 Ruff (0.8.2)

1-1: `typing.Optional` imported but unused

Remove unused import: `typing.Optional` (F401)
.github/workflows/disable_independent_workflows.sh (1)
4-4: ⚠️ Potential issue

Add error handling to the cd command.

The `cd` command could fail if the directory doesn't exist, but the script would continue executing in the wrong directory, potentially causing unexpected behavior.

```diff
-cd "$(dirname "$0")"
+cd "$(dirname "$0")" || exit 1
```

📝 Committable suggestion

```bash
cd "$(dirname "$0")" || exit 1
```

🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
cognee/tests/test_cognee_server_start.py (5)
48-49: ⚠️ Potential issue

Update endpoint URLs to use the dynamic port.

If implementing dynamic port selection, the test methods need to use the stored port instead of the hardcoded value.

```diff
 # Test health endpoint
-health_response = requests.get("http://localhost:8000/health", timeout=10)
+health_response = requests.get(f"http://localhost:{self.__class__.port}/health", timeout=10)
```

📝 Committable suggestion

```python
health_response = requests.get(f"http://localhost:{self.__class__.port}/health", timeout=10)
self.assertEqual(health_response.status_code, 200)
```
52-55: ⚠️ Potential issue

Update root endpoint test to use the dynamic port.

Similar to the health endpoint, the root endpoint test should use the dynamic port if implemented.

```diff
 # Test root endpoint
-root_response = requests.get("http://localhost:8000/", timeout=10)
+root_response = requests.get(f"http://localhost:{self.__class__.port}/", timeout=10)
 self.assertEqual(root_response.status_code, 200)
 self.assertIn("message", root_response.json())
 self.assertEqual(root_response.json()["message"], "Hello, World, I am alive!")
```

📝 Committable suggestion

```python
# Test root endpoint
root_response = requests.get(f"http://localhost:{self.__class__.port}/", timeout=10)
self.assertEqual(root_response.status_code, 200)
self.assertIn("message", root_response.json())
self.assertEqual(root_response.json()["message"], "Hello, World, I am alive!")
```
30-30: 🛠️ Refactor suggestion

Implement a more robust waiting mechanism for server startup.

Using a fixed sleep time of 20 seconds may be too long or too short depending on the environment. Consider implementing a polling mechanism to check when the server is actually ready.

```diff
-# Give the server some time to start
-time.sleep(20)
+# Wait for the server to start with timeout
+max_retries = 20
+retry_interval = 1.0
+server_ready = False
+
+for _ in range(max_retries):
+    if cls.server_process.poll() is not None:
+        # Server process has terminated
+        break
+
+    try:
+        response = requests.get(f"http://localhost:{cls.port}/health", timeout=1)
+        if response.status_code == 200:
+            server_ready = True
+            break
+    except (requests.RequestException, ConnectionError):
+        pass
+
+    time.sleep(retry_interval)
+
+if not server_ready and cls.server_process.poll() is None:
+    # Server is running but not responding
+    print("Server started but not responding to health checks", file=sys.stderr)
```

📝 Committable suggestion

```python
# Wait for the server to start with timeout
max_retries = 20
retry_interval = 1.0
server_ready = False

for _ in range(max_retries):
    if cls.server_process.poll() is not None:
        # Server process has terminated
        break

    try:
        response = requests.get(f"http://localhost:{cls.port}/health", timeout=1)
        if response.status_code == 200:
            server_ready = True
            break
    except (requests.RequestException, ConnectionError):
        pass

    time.sleep(retry_interval)

if not server_ready and cls.server_process.poll() is None:
    # Server is running but not responding
    print("Server started but not responding to health checks", file=sys.stderr)
```
14-28: 🛠️ Refactor suggestion

Add port conflict detection and dynamic port selection.

The test uses a hardcoded port 8000, which could be in use by other processes during testing, causing the test to fail. Consider checking if the port is available and/or using a dynamic port.

```diff
 # Start the Cognee server - just check if the server can start without errors
+port = cls._find_available_port(8000, 8100)
 cls.server_process = subprocess.Popen(
     [
         sys.executable,
         "-m",
         "uvicorn",
         "cognee.api.client:app",
         "--host",
         "0.0.0.0",
         "--port",
-        "8000",
+        str(port),
     ],
     stdout=subprocess.PIPE,
     stderr=subprocess.PIPE,
     preexec_fn=os.setsid,
 )
+cls.port = port
+
+@staticmethod
+def _find_available_port(start_port, end_port):
+    """Find an available port in the given range."""
+    import socket
+    for port in range(start_port, end_port):
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+            if s.connect_ex(('localhost', port)) != 0:
+                return port
+    raise RuntimeError(f"No available ports in range {start_port}-{end_port}")
```

Committable suggestion skipped: line range outside the PR's diff.
41-43: ⚠️ Potential issue

Implement platform-independent process termination.

The tearDown method uses Unix-specific `os.killpg` to terminate the process group, which won't work on Windows. Add platform-specific code for terminating the process on all platforms.

```diff
 # Terminate the server process
 if hasattr(cls, "server_process") and cls.server_process:
-    os.killpg(os.getpgid(cls.server_process.pid), signal.SIGTERM)
-    cls.server_process.wait()
+    if os.name != 'nt':  # Unix-like systems
+        try:
+            os.killpg(os.getpgid(cls.server_process.pid), signal.SIGTERM)
+        except (ProcessLookupError, OSError):
+            # Process might already be gone
+            pass
+    else:  # Windows
+        import subprocess
+        try:
+            subprocess.call(['taskkill', '/F', '/T', '/PID', str(cls.server_process.pid)])
+        except (subprocess.SubprocessError, OSError):
+            # Process might already be gone
+            pass
+
+    try:
+        cls.server_process.wait(timeout=5)
+    except subprocess.TimeoutExpired:
+        # Force kill if it doesn't terminate gracefully
+        if cls.server_process.poll() is None:
+            cls.server_process.kill()
```

📝 Committable suggestion

```python
if hasattr(cls, "server_process") and cls.server_process:
    if os.name != 'nt':  # Unix-like systems
        try:
            os.killpg(os.getpgid(cls.server_process.pid), signal.SIGTERM)
        except (ProcessLookupError, OSError):
            # Process might already be gone
            pass
    else:  # Windows
        import subprocess
        try:
            subprocess.call(['taskkill', '/F', '/T', '/PID', str(cls.server_process.pid)])
        except (subprocess.SubprocessError, OSError):
            # Process might already be gone
            pass

    try:
        cls.server_process.wait(timeout=5)
    except subprocess.TimeoutExpired:
        # Force kill if it doesn't terminate gracefully
        if cls.server_process.poll() is None:
            cls.server_process.kill()
```

cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt (1)
26-27: 🛠️ Refactor suggestion

Specify expected output format.

The prompt doesn't specify the exact format in which the knowledge graph should be returned (JSON, XML, etc.). This could lead to inconsistent outputs.

````diff
 5. **Strict Compliance**
    - Follow these rules exactly. Non-compliance results in termination.
+
+6. **Output Format**
+   - Return the knowledge graph as a JSON object with "nodes" and "edges" arrays.
+   - Each node should have "id", "type", and optional "properties".
+   - Each edge should have "source", "target", "label", and optional "properties".
+   - *Example Output*:
+     ```json
+     {
+       "nodes": [
+         {"id": "John_Smith", "type": "Person", "properties": {"age": 42}},
+         {"id": "Acme_Corp", "type": "Organization"}
+       ],
+       "edges": [
+         {"source": "John_Smith", "target": "Acme_Corp", "label": "works_for"}
+       ]
+     }
+     ```
````

.github/README_WORKFLOW_MIGRATION.md (1)
24-46: 🛠️ Refactor suggestion

Add instruction to update test-suites.yml.

The manual migration section explains how to modify individual workflows but doesn't mention how to update the test-suites.yml file to actually call those workflows after migration.

````diff
 4. Save the file
+
+## Updating test-suites.yml
+
+After migrating a workflow, you need to ensure it's called by the test-suites.yml file:
+
+1. Open `.github/workflows/test-suites.yml`
+2. Find the appropriate job or create a new one
+3. Add a step to call your migrated workflow:
+   ```yaml
+   test_your_workflow:
+     needs: [previous_job]  # If there are dependencies
+     uses: ./.github/workflows/your_workflow.yml
+     secrets: inherit
+   ```
+4. Save the file
````

cognee/tests/test_telemetry.py (1)
19-29: 🛠️ Refactor suggestion

Improve file handling in tests

The test creates a `.anon_id` file in the project root, which could cause issues:

- It modifies the filesystem outside the test directory
- It doesn't clean up after test execution
- There's no error handling for file operations

Consider using a temporary directory and patching the path:

```diff
-# Check if .anon_id exists in the project root
-anon_id_path = os.path.join(
-    os.path.dirname(os.path.dirname(os.path.dirname(__file__))), ".anon_id"
-)
-
-if not os.path.exists(anon_id_path):
-    # Create the file with a test ID if it doesn't exist
-    with open(anon_id_path, "w") as f:
-        f.write("test-machine")
-    print(f"Created .anon_id file at {anon_id_path}", file=sys.stderr)
+# Use a temporary file for testing
+with tempfile.NamedTemporaryFile(mode='w+', delete=False) as temp_file:
+    temp_file.write("test-machine")
+    anon_id_path = temp_file.name
+
+# Patch the path used by the telemetry function
+with patch('cognee.shared.utils.ANON_ID_PATH', anon_id_path):
```

Also add cleanup after the test:

```python
def tearDown(self):
    # Clean up temporary file if it exists
    if hasattr(self, 'anon_id_path') and os.path.exists(self.anon_id_path):
        os.unlink(self.anon_id_path)
```

Don't forget to import `tempfile` at the top of the file.
Description
Allow embedding of large strings to support full row embedding in SQL databases.
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.