@borisarzentar borisarzentar commented Feb 26, 2025

Description

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

Summary by CodeRabbit

  • New Features
    • Introduced an automated deployment workflow to build and push container images.
    • Updated dependency management to include additional database support.
  • Refactor
    • Enhanced asynchronous operations and logging in the server for improved performance.
    • Optimized extraction and retrieval processes for code-related data.
  • Chores
    • Streamlined build configurations and startup scripts for greater reliability.

@borisarzentar borisarzentar self-assigned this Feb 26, 2025
coderabbitai bot commented Feb 26, 2025

📥 Commits

Reviewing files that changed from the base of the PR and between 828fce9 and 0e06ffc.

📒 Files selected for processing (5)
  • .github/workflows/dockerhub-mcp.yml (1 hunks)
  • .github/workflows/dockerhub.yml (1 hunks)
  • cognee/api/v1/cognify/code_graph_pipeline.py (3 hunks)
  • cognee/tasks/repo_processor/get_local_dependencies.py (4 hunks)
  • cognee/tasks/repo_processor/get_repo_file_dependencies.py (2 hunks)

Walkthrough

The changes introduce a series of updates across CI/CD configurations, containerization, dependency management, asynchronous processing, and code retrieval. A new GitHub Actions workflow is added to build and push Docker images. The project’s Dockerfile and dependency file have been updated for enhanced deployment and database support. Several Python modules have been refactored to improve asynchronous handling, error management, and structured code extraction. New classes and functions have been added to optimize file parsing and retrieval, and minor modifications were made to startup scripts.

Changes

File(s) | Change Summary
.github/workflows/dockerhub-mcp.yml | Added a new GitHub Actions workflow to build & push Docker images on pushes to the main branch.
cognee-mcp/Dockerfile | New multi-stage Dockerfile using base images for building and running the application with efficient dependency installation.
cognee-mcp/pyproject.toml | Updated dependency "cognee[codegraph]" to include PostgreSQL and Neo4j support; added a dev dependency group for debugpy.
cognee-mcp/src/server.py | Introduced asynchronous handling with await for cognify, added startup logging, and reorganized import statements.
cognee/api/v1/cognify/code_graph_pipeline.py | Modified task configuration (fixed batch_size and detailed extraction flag), updated repository path, removed graph rendering, and added async search.
cognee/infrastructure/llm/prompts/codegraph_retriever_system.txt | New file detailing instructions for extracting file names and code snippets from text.
cognee/modules/retrieval/code_graph_retrieval.py, cognee/shared/CodeGraphEntities.py | Restructured retrieval logic: updated function signature to return a list of dicts, introduced CodeQueryInfo, improved error handling, and refined metadata/attributes.
cognee/tasks/repo_processor/get_local_dependencies.py, cognee/tasks/repo_processor/get_repo_file_dependencies.py | Added class FileParser and related helper functions for file and module path resolution, updated dependency extraction to avoid duplicates, and expanded the file range for dependency processing.
entrypoint.sh, entrypoint-old.sh | Updated entrypoint.sh to check if DEBUG equals "true" instead of a boolean; removed the outdated entrypoint-old.sh script.

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub Actions
    participant Checkout as actions/checkout
    participant Buildx as docker/setup-buildx-action
    participant DockerLogin as docker/login-action
    participant Metadata as docker/metadata-action
    participant BuildPush as docker/build-push-action
    Dev->>GH: Push commit to main
    GH->>Checkout: Checkout repository
    GH->>Buildx: Set up Docker Buildx
    GH->>DockerLogin: Log in to Docker Hub
    GH->>Metadata: Extract image metadata
    GH->>BuildPush: Build and push Docker image
    BuildPush-->>GH: Return image digest
    GH->>Dev: CI/CD process complete
sequenceDiagram
    participant User as User Query
    participant CR as code_graph_retrieval()
    participant Prompt as System Prompt File
    participant LLM as LLM Client
    participant FileIO as Async File Reader
    User->>CR: Sends query
    CR->>CR: Validate input
    CR->>Prompt: Read system prompt
    CR->>LLM: Send query with prompt
    LLM-->>CR: Return structured output
    CR->>CR: Process search results and query vector engine
    CR->>FileIO: Asynchronously read file contents
    FileIO-->>CR: Return file data
    CR-->>User: Return list of file information

Poem

I’m a bunny hopping through these lines of code,
From Docker builds to async streams, down each winding road.
Dependencies updated, workflows set to race,
My whiskers twitch with every bug we chase.
With leaps of joy for each new feature shown,
I celebrate these changes – let our project bloom! 🐰✨


@Vasilije1990 Vasilije1990 self-requested a review February 26, 2025 18:40
@Vasilije1990 Vasilije1990 marked this pull request as ready for review February 26, 2025 18:40
@hajdul88 hajdul88 self-requested a review February 26, 2025 18:42
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🔭 Outside diff range comments (3)
cognee-mcp/src/server.py (1)

122-134: ⚠️ Potential issue

Cognify function may not be fully asynchronous

While cognify() is now being awaited in call_tools(), the function itself doesn't seem to fully utilize async patterns. It awaits cognee.add(text) but then creates an asyncio task for cognee.cognify() without awaiting it.

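This means that cognify() returns before the actual cognification is complete, which could lead to race conditions or incorrect behavior. It also means the surrounding try/except cannot catch exceptions raised inside the task, because they occur after cognify() has already returned.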

Consider either:

  1. Awaiting the created task, or
  2. Documenting clearly that this is intentional background processing
async def cognify(text: str, graph_model_file: str = None, graph_model_name: str = None) -> str:
    """Build knowledge graph from the input text"""
    if graph_model_file and graph_model_name:
        graph_model = load_class(graph_model_file, graph_model_name)
    else:
        graph_model = KnowledgeGraph

    await cognee.add(text)

    try:
-        asyncio.create_task(cognee.cognify(graph_model=graph_model))
+        # Option 1: wait for cognify to complete
+        await cognee.cognify(graph_model=graph_model)
+
+        # Option 2 (use instead of Option 1, not together with it):
+        # run in the background and keep a reference to the task
+        # task = asyncio.create_task(cognee.cognify(graph_model=graph_model))
    except Exception as e:
        raise ValueError(f"Failed to cognify: {str(e)}")
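The difference between the two options can be sketched in isolation. This is a minimal illustration, not the server's code: slow_job is a hypothetical stand-in for cognee.cognify(), and the background_tasks set is an assumed way of keeping task references alive.

```python
import asyncio


async def slow_job(log: list[str]) -> None:
    """Hypothetical stand-in for cognee.cognify(); it only simulates work."""
    await asyncio.sleep(0.01)
    log.append("job finished")


async def awaited(log: list[str]) -> None:
    # Option 1: the caller resumes only after the job has completed.
    await slow_job(log)
    log.append("caller returned")


# Strong references, so pending background tasks are not garbage-collected.
background_tasks: set[asyncio.Task] = set()


async def fire_and_forget(log: list[str]) -> None:
    # Option 2: schedule the job and return immediately.
    task = asyncio.create_task(slow_job(log))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    log.append("caller returned")


async def main() -> tuple[list[str], list[str], list[str]]:
    awaited_log: list[str] = []
    background_log: list[str] = []
    await awaited(awaited_log)
    await fire_and_forget(background_log)
    snapshot = list(background_log)  # state right after the caller returned
    await asyncio.gather(*background_tasks)  # let the background job finish
    return awaited_log, snapshot, background_log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With Option 2 the caller's log entry is written before the job's, which is the ordering the review comment warns about.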
cognee/tasks/repo_processor/get_local_dependencies.py (1)

104-203: ⚠️ Potential issue

Correct the default parameter type for existing_nodes.

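The signature annotates existing_nodes as list[DataPoint], yet the default is {}, a dictionary. This is misleading and can cause runtime errors. Also, a mutable default argument is created once, when the function is defined, and shared across calls, which can introduce subtle bugs. The suggested signature below needs Optional imported from typing.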

Additionally, the logic that splits import lines might have edge cases (e.g., multiple imports, “from … import …, …”). Validate that these statements parse robustly.

-async def extract_code_parts(
-    tree_root: Node, script_path: str, existing_nodes: list[DataPoint] = {}
-) -> AsyncGenerator[DataPoint, None]:
+async def extract_code_parts(
+    tree_root: Node, script_path: str, existing_nodes: Optional[dict[str, DataPoint]] = None
+) -> AsyncGenerator[DataPoint, None]:
+    if existing_nodes is None:
+        existing_nodes = {}
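The shared-default pitfall can be reproduced with two small functions. This is an illustrative sketch; collect_shared and collect_fresh are hypothetical names and are not part of the PR.

```python
from typing import Optional


def collect_shared(name: str, seen: dict = {}) -> dict:
    # The default dict is created once, at definition time,
    # so every call without `seen` mutates the same object.
    seen[name] = True
    return seen


def collect_fresh(name: str, seen: Optional[dict] = None) -> dict:
    # The None sentinel gives each call its own dict unless one is passed in.
    if seen is None:
        seen = {}
    seen[name] = True
    return seen
```

Calling collect_shared twice without a second argument returns the same dictionary both times, carrying state from the first call into the second; collect_fresh does not.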
cognee/modules/retrieval/code_graph_retrieval.py (1)

20-129: ⚠️ Potential issue

Check iteration usage on files_and_codeparts.sourcecode.

sourcecode is declared as a single string in CodeQueryInfo, yet you're iterating over it (lines 76–79), which will process each character instead of separate code blocks. This likely causes incorrect matches. Correct the type or iteration logic. The suggested change below needs Field imported from pydantic.

-class CodeQueryInfo(BaseModel):
-    filenames: List[str] = []
-    sourcecode: str
+class CodeQueryInfo(BaseModel):
+    filenames: List[str] = []
+    sourcecode: List[str] = Field(default_factory=list)
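The effect of iterating over a string rather than a list can be shown without Pydantic. The sketch below uses standard-library dataclasses as a stand-in for the CodeQueryInfo model; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QueryInfoAsString:
    # Mirrors the original declaration: one string.
    sourcecode: str = ""


@dataclass
class QueryInfoAsList:
    # Mirrors the suggested declaration: a list of code fragments.
    sourcecode: list[str] = field(default_factory=list)


def fragments(info) -> list[str]:
    # Iterating a str yields single characters; iterating a list yields its items.
    return [part for part in info.sourcecode]
```

With the string-typed field, fragments returns one entry per character; with the list-typed field it returns one entry per code fragment.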
🧹 Nitpick comments (9)
cognee/infrastructure/llm/prompts/codegraph_retriever_system.txt (1)

1-23: New system prompt for code extraction

The prompt provides clear instructions for extracting filenames and code snippets from text with specific examples. The instruction set is well-structured, focusing on accuracy and relevance of extracted code.

Consider capitalizing "Markdown" in instruction #1 since it's a proper noun.

🧰 Tools
🪛 LanguageTool

[grammar] ~6-~6: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...filenames from inline text, headers, or markdown formatting. Empty list of filenames is ...

(MARKDOWN_NNP)

cognee-mcp/Dockerfile (2)

33-37: Consider removing commented installation code or documenting its purpose

These commented-out installation commands might confuse future developers. Either remove them if not needed or add a comment explaining when they should be uncommented.


39-49: Consider adding a non-root user for security

Currently, the container will run as root. For better security, consider creating and switching to a non-root user.

FROM python:3.12-slim-bookworm

WORKDIR /app

+# Create a non-root user
+RUN groupadd -r appuser && useradd -r -g appuser appuser

COPY --from=uv /root/.local /root/.local
COPY --from=uv --chown=appuser:appuser /app/.venv /app/.venv

# Place executables in the environment at the front of the path
ENV PATH="/app/.venv/bin:$PATH"

+# Switch to non-root user
+USER appuser

ENTRYPOINT ["cognee"]
cognee/api/v1/cognify/code_graph_pipeline.py (1)

47-47: Consider documenting the impact of enabling detailed extraction

Changing detailed_extraction from False to True may have performance implications. Consider adding a comment explaining the impact and why this change was made.

cognee/shared/CodeGraphEntities.py (1)

11-11: Consider adding usage notes for module attribute.

This new module: str property in ImportStatement looks good for distinguishing import sources. Just ensure you document its intended use and validate that references to this attribute won't cause confusion with the existing name field for imports.

cognee/tasks/repo_processor/get_local_dependencies.py (3)

1-2: Validate the new imports.

You’ve introduced os and importlib. Make sure both are necessary. If importlib is required only in a small part of the code, consider a more localized import to avoid polluting the global namespace.


30-37: Handle missing or unreadable files robustly.

Currently, you track None if the file can’t be read, and the code returns (None, None) if source_code is None. Ensure subsequent logic gracefully handles None results to avoid errors when calling source_code_parser.parse.


49-58: Enhance error-handling in resolve_module_path.

While ModuleNotFoundError is caught, other exceptions such as importlib-related errors may occur. Consider broader exception handling or fallback logic for unexpected failures.
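One way to widen the handling is sketched below. It assumes the lookup goes through importlib.util.find_spec, which the review does not confirm; resolve_module_file is a hypothetical helper, not the PR's resolve_module_path.

```python
import importlib.util
from typing import Optional


def resolve_module_file(module_name: str) -> Optional[str]:
    """Return the file a module would be loaded from, or None if unknown."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # ImportError covers ModuleNotFoundError, raised for example when the
        # parent package of a dotted name is missing. ValueError is raised
        # when a cached module has its __spec__ set to None.
        return None
    # Built-in, frozen, and namespace modules have no source file on disk.
    if spec is None or spec.origin in (None, "built-in", "frozen"):
        return None
    return spec.origin
```

Returning None for every failure keeps the caller's fallback logic in one place; whether that suits the real function depends on how its callers treat unresolved modules.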

cognee/modules/retrieval/code_graph_retrieval.py (1)

4-10: Avoid overshadowing local imports with broad project-specific imports.

You import from cognee.modules.graph.cognee_graph.CogneeGraph and other internal modules. If local variable names or classes partially match, it may cause confusion. Keep an eye on naming collisions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b777cf and 828fce9.

⛔ Files ignored due to path filters (1)
  • cognee-mcp/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .github/workflows/dockerhub-mcp.yml (1 hunks)
  • cognee-mcp/Dockerfile (1 hunks)
  • cognee-mcp/pyproject.toml (2 hunks)
  • cognee-mcp/src/server.py (3 hunks)
  • cognee/api/v1/cognify/code_graph_pipeline.py (3 hunks)
  • cognee/infrastructure/llm/prompts/codegraph_retriever_system.txt (1 hunks)
  • cognee/modules/retrieval/code_graph_retrieval.py (1 hunks)
  • cognee/shared/CodeGraphEntities.py (3 hunks)
  • cognee/tasks/repo_processor/get_local_dependencies.py (4 hunks)
  • cognee/tasks/repo_processor/get_repo_file_dependencies.py (1 hunks)
  • entrypoint-old.sh (0 hunks)
  • entrypoint.sh (1 hunks)
💤 Files with no reviewable changes (1)
  • entrypoint-old.sh
🧰 Additional context used
🪛 Hadolint (2.12.0)
cognee-mcp/Dockerfile

[error] 29-29: Use COPY instead of ADD for files and folders

(DL3020)

🪛 actionlint (1.7.4)
.github/workflows/dockerhub-mcp.yml

47-47: property "build" is not defined in object type {meta: {conclusion: string; outcome: string; outputs: {annotations: string; bake-file: string; bake-file-annotations: string; bake-file-labels: string; bake-file-tags: string; json: string; labels: string; tags: string; version: string}}}

(expression)

🔇 Additional comments (19)
entrypoint.sh (1)

34-34: String comparison update for DEBUG environment variable check

The comparison for the DEBUG environment variable has been updated to use a string comparison "true" instead of just true. This ensures proper string-based comparison which is more reliable in shell scripts.

cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)

77-77: Range adjustment to include the end file

The slicing range now includes the end file by using end_range + 1 as the upper bound. This is a proper fix to ensure that all files are processed, particularly the file at the end_range index.
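Python slices exclude their upper bound, which is why the + 1 is needed. A minimal sketch follows; select_files and start_range are hypothetical names, and only end_range appears in the review.

```python
def select_files(paths: list[str], start_range: int, end_range: int) -> list[str]:
    """Return the files from start_range through end_range, inclusive."""
    # A slice stops before its upper bound, so add 1 to keep the end file.
    return paths[start_range : end_range + 1]
```

For a four-element list, paths[1:3] returns the items at indices 1 and 2 only, while select_files(paths, 1, 3) also returns the item at index 3.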

cognee-mcp/pyproject.toml (2)

9-9: Additional database integrations for cognee

The dependency for cognee has been updated to include postgres and neo4j support, which enables more database options for the project.


24-27: ❓ Verification inconclusive

New development dependency group with debugpy

Added a development dependency group with debugpy, which aligns with the debugging functionality in the entrypoint.sh script. This is a good practice for separating development dependencies.

Make sure the debugpy version is compatible with your Python version. You might want to run the following to verify compatibility:


🏁 Script executed:

#!/bin/bash
# Check Python version and debugpy compatibility
python --version
pip show debugpy

Length of output: 90


Action Required: Verify Debugpy Installation & Compatibility
The newly added dev dependency group with debugpy in cognee-mcp/pyproject.toml (lines 24-27) aligns well with the debugging functionality in entrypoint.sh. However, our verification script indicated that while Python 3.11.2 is used, debugpy is not currently installed (i.e., pip show debugpy returned a warning). This could be because development dependencies haven’t been installed by default.

  • Review Location: cognee-mcp/pyproject.toml, lines 24-27.
  • Action Items:
    • Confirm that the development environment installs the dev dependencies (e.g., using pip install .[dev]).
    • Manually verify that the installed version of debugpy is compatible with Python 3.11.2.
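The presence check can also be done from Python with the standard library. This sketch only reports whether a distribution is installed and which interpreter is running; it does not judge compatibility. The helper names are hypothetical.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe(package: str) -> str:
    python = ".".join(str(part) for part in sys.version_info[:3])
    version = installed_version(package)
    if version is None:
        return f"{package} is not installed for Python {python}"
    return f"{package} {version} is installed for Python {python}"


if __name__ == "__main__":
    print(describe("debugpy"))
```

Running it inside the development environment shows whether the dev dependencies were installed there.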
cognee-mcp/Dockerfile (1)

1-49: Best practice: Multi-stage Docker build structure looks good

The multi-stage build pattern used here is excellent for minimizing the final image size while properly managing dependencies.

cognee/api/v1/cognify/code_graph_pipeline.py (1)

55-55: Task configuration was simplified with fixed batch size

The batch size for add_data_points is now fixed at 500, which is more straightforward than previous conditional logic.

.github/workflows/dockerhub-mcp.yml (1)

1-45: CI/CD workflow for Docker image looks good

The workflow correctly sets up Docker Buildx, handles authentication, and builds for multiple platforms (amd64 and arm64). Good use of caching to optimize build times.

cognee-mcp/src/server.py (3)

1-1: Import organization improvement

Moving the asyncio import to the top of the file improves code organization.


96-100: Added await keyword for cognify function

The code now correctly awaits the cognify function, which is good for asynchronous flow.


165-166: Added helpful startup log message

Adding a log message at server startup improves observability.

cognee/shared/CodeGraphEntities.py (4)

24-24: Confirm whether removing name from index fields is intentional.

Previously, the metadata included ["name", "source_code"], but now it’s reduced to only ["source_code"]. This may reduce search precision if the function’s name was used for indexing. Verify other parts of the pipeline don’t depend on it.


33-33: Check class name indexing.

Similarly, ClassDefinition now indexes only the source_code and omits name. Ensure that search or classification processes that rely on class names are still functional.


37-37: Good addition of name attribute in CodeFile.

Attaching a dedicated name field to the file entity can simplify file-based lookups and indexing.


44-44: Ensure feasibility of indexing by name alone.

This change removes source_code from the metadata["index_fields"] in CodeFile. If code-based searches rely on source_code, re-check that the shift to indexing only by name is the intended approach.

cognee/tasks/repo_processor/get_local_dependencies.py (3)

6-6: Use direct references rather than wildcard imports.

from tree_sitter import Language, Node, Parser, Tree is typically fine, but ensure that these references don’t conflict with other similarly named imports in your codebase. It's a good practice to confirm that no naming collisions occur.


26-29: Encapsulating file parsing in FileParser is a good design move.

The dedicated class approach is clearer and prevents re-parsing. This also helps keep code modular and testable.


83-87: Validate concurrency when handling shared parsing.

Reusing the same FileParser instance is efficient, but confirm this is safe in all concurrent contexts (e.g., multiple tasks calling parse_file at once). If concurrency is expected, consider adding locks or reassessing concurrency usage.
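If concurrent calls are expected, one possible guard is a per-path lock around the cache fill. The sketch below is an assumption about how that could look; CachedParser is a hypothetical stand-in for FileParser, and its _parse method only simulates work.

```python
import asyncio


class CachedParser:
    """Hypothetical stand-in for FileParser with a per-path lock."""

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}
        self._locks: dict[str, asyncio.Lock] = {}
        self.parse_calls = 0

    async def _parse(self, path: str) -> str:
        # Placeholder for the real read-and-parse step.
        self.parse_calls += 1
        await asyncio.sleep(0.01)
        return f"parsed:{path}"

    async def parse_file(self, path: str) -> str:
        if path in self._cache:
            return self._cache[path]
        lock = self._locks.setdefault(path, asyncio.Lock())
        async with lock:
            # Re-check inside the lock: another task may have filled the cache.
            if path not in self._cache:
                self._cache[path] = await self._parse(path)
        return self._cache[path]


async def demo() -> tuple[list[str], int]:
    parser = CachedParser()
    results = await asyncio.gather(*(parser.parse_file("a.py") for _ in range(5)))
    return list(results), parser.parse_calls
```

Five concurrent requests for the same path trigger a single parse here; without the lock, each task that misses the cache before the first parse finishes would parse the file again.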

cognee/modules/retrieval/code_graph_retrieval.py (2)

1-2: Confirm usage of asyncio and aiofiles.

These asynchronous modules are widely used, but ensure that all file I/O is awaited properly and that any synchronous file operations are replaced if you intend full async performance.
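Non-blocking file reads can be sketched with the standard library alone. The example swaps aiofiles for asyncio.to_thread (Python 3.9+), which runs the blocking read in a worker thread; read_files is a hypothetical helper, not the PR's code.

```python
import asyncio
from pathlib import Path


def _read_text(path: str) -> str:
    # Blocking read; it runs in a worker thread, not on the event loop.
    return Path(path).read_text(encoding="utf-8")


async def read_files(paths: list[str]) -> dict[str, str]:
    """Read several files concurrently and map each path to its contents."""
    contents = await asyncio.gather(
        *(asyncio.to_thread(_read_text, path) for path in paths)
    )
    return dict(zip(paths, contents))
```

asyncio.gather returns results in the order the awaitables were passed, so zipping them back onto paths is safe.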


13-17: CodeQueryInfo adds clarity for the return structure.

Defining a Pydantic model is a solid approach for typed outputs. Just ensure the sourcecode field is typed or explained if you plan to parse it further (e.g., as single string vs. a collection of code fragments).

@borisarzentar borisarzentar changed the title from "Feat/cog 1351 codegraph improvements" to "feat: codegraph improvements and new CODE search [COG-1351]" Feb 26, 2025
@borisarzentar borisarzentar merged commit 711ae8e into dev Feb 26, 2025
36 checks passed
@borisarzentar borisarzentar deleted the feat/COG-1351-codegraph-improvements branch February 26, 2025 19:15
5 participants