-
Notifications
You must be signed in to change notification settings - Fork 966
feat: codegraph improvements and new CODE search [COG-1351] #581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Warning Rate limit exceeded@borisarzentar has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 24 minutes and 19 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (5)
WalkthroughThe changes introduce a series of updates across CI/CD configurations, containerization, dependency management, asynchronous processing, and code retrieval. A new GitHub Actions workflow is added to build and push Docker images. The project’s Dockerfile and dependency file have been updated for enhanced deployment and database support. Several Python modules have been refactored to improve asynchronous handling, error management, and structured code extraction. New classes and functions have been added to optimize file parsing and retrieval, and minor modifications were made to startup scripts. Changes
Sequence Diagram(s)sequenceDiagram
participant Dev as Developer
participant GH as GitHub Actions
participant Checkout as actions/checkout
participant Buildx as docker/setup-buildx-action
participant DockerLogin as docker/login-action
participant Metadata as docker/metadata-action
participant BuildPush as docker/build-push-action
Dev->>GH: Push commit to main
GH->>Checkout: Checkout repository
GH->>Buildx: Set up Docker Buildx
GH->>DockerLogin: Log in to Docker Hub
GH->>Metadata: Extract image metadata
GH->>BuildPush: Build and push Docker image
BuildPush-->>GH: Return image digest
GH->>Dev: CI/CD process complete
sequenceDiagram
participant User as User Query
participant CR as code_graph_retrieval()
participant Prompt as System Prompt File
participant LLM as LLM Client
participant FileIO as Async File Reader
User->>CR: Sends query
CR->>CR: Validate input
CR->>Prompt: Read system prompt
CR->>LLM: Send query with prompt
LLM-->>CR: Return structured output
CR->>CR: Process search results and query vector engine
CR->>FileIO: Asynchronously read file contents
FileIO-->>CR: Return file data
CR-->>User: Return list of file information
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🔭 Outside diff range comments (3)
cognee-mcp/src/server.py (1)
122-134:⚠️ Potential issueCognify function may not be fully asynchronous
While
cognify()is now being awaited incall_tools(), the function itself doesn't seem to fully utilize async patterns. It awaitscognee.add(text)but then creates an asyncio task forcognee.cognify()without awaiting it.This means that
cognify()will return before the actual cognification is complete, which could lead to race conditions or incorrect behavior.Consider either:
- Awaiting the created task, or
- Documenting clearly that this is intentional background processing
async def cognify(text: str, graph_model_file: str = None, graph_model_name: str = None) -> str: """Build knowledge graph from the input text""" if graph_model_file and graph_model_name: graph_model = load_class(graph_model_file, graph_model_name) else: graph_model = KnowledgeGraph await cognee.add(text) try: - asyncio.create_task(cognee.cognify(graph_model=graph_model)) + # Option 1: Wait for cognify to complete + await cognee.cognify(graph_model=graph_model) + + # OR Option 2: Run in background but track the task + task = asyncio.create_task(cognee.cognify(graph_model=graph_model)) + # Optional: Store task reference somewhere if you need to track it except Exception as e: raise ValueError(f"Failed to cognify: {str(e)}")cognee/tasks/repo_processor/get_local_dependencies.py (1)
104-203:⚠️ Potential issueCorrect the default parameter type for
existing_nodes.The signature uses
existing_nodes: list[DataPoint] = {}, yet the default is a dictionary. This is misleading and can cause runtime errors. Also, mutable default arguments can introduce subtle bugs.Additionally, the logic that splits import lines might have edge cases (e.g., multiple imports, “from … import …, …”). Validate that these statements parse robustly.
-async def extract_code_parts( - tree_root: Node, script_path: str, existing_nodes: list[DataPoint] = {} -) -> AsyncGenerator[DataPoint, None]: +async def extract_code_parts( + tree_root: Node, script_path: str, existing_nodes: Optional[dict[str, DataPoint]] = None +) -> AsyncGenerator[DataPoint, None]: + if existing_nodes is None: + existing_nodes = {}cognee/modules/retrieval/code_graph_retrieval.py (1)
20-129:⚠️ Potential issueCheck iteration usage on
files_and_codeparts.sourcecode.
sourcecodeis declared as a single string inCodeQueryInfo, yet you're iterating over it (lines 76–79), which will process each character instead of separate code blocks. This likely causes incorrect matches. Correct the type or iteration logic.-class CodeQueryInfo(BaseModel): - filenames: List[str] = [] - sourcecode: str +class CodeQueryInfo(BaseModel): + filenames: List[str] = [] + sourcecode: List[str] = Field(default_factory=list)
🧹 Nitpick comments (9)
cognee/infrastructure/llm/prompts/codegraph_retriever_system.txt (1)
1-23: New system prompt for code extractionThe prompt provides clear instructions for extracting filenames and code snippets from text with specific examples. The instruction set is well-structured, focusing on accuracy and relevance of extracted code.
Consider capitalizing "Markdown" in instruction #1 since it's a proper noun.
🧰 Tools
🪛 LanguageTool
[grammar] ~6-~6: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...filenames from inline text, headers, or markdown formatting. Empty list of filenames is ...(MARKDOWN_NNP)
cognee-mcp/Dockerfile (2)
33-37: Consider removing commented installation code or documenting its purposeThese commented-out installation commands might confuse future developers. Either remove them if not needed or add a comment explaining when they should be uncommented.
39-49: Consider adding a non-root user for securityCurrently, the container will run as root. For better security, consider creating and switching to a non-root user.
FROM python:3.12-slim-bookworm WORKDIR /app +# Create a non-root user +RUN groupadd -r appuser && useradd -r -g appuser appuser COPY --from=uv /root/.local /root/.local COPY --from=uv --chown=appuser:appuser /app/.venv /app/.venv # Place executables in the environment at the front of the path ENV PATH="/app/.venv/bin:$PATH" +# Switch to non-root user +USER appuser ENTRYPOINT ["cognee"]cognee/api/v1/cognify/code_graph_pipeline.py (1)
47-47: Consider documenting the impact of enabling detailed extractionChanging
detailed_extractionfrom False to True may have performance implications. Consider adding a comment explaining the impact and why this change was made.cognee/shared/CodeGraphEntities.py (1)
11-11: Consider adding usage notes formoduleattribute.This new
module: strproperty inImportStatementlooks good for distinguishing import sources. Just ensure you document its intended use and validate that references to this attribute won't cause confusion with the existingnamefield for imports.cognee/tasks/repo_processor/get_local_dependencies.py (3)
1-2: Validate the new imports.You’ve introduced
osandimportlib. Make sure both are necessary. Ifimportlibis required only in a small part of the code, consider a more localized import to avoid polluting the global namespace.
30-37: Handle missing or unreadable files robustly.Currently, you track
Noneif the file can’t be read, but the code returns(None, None)ifsource_codeis nil. Ensure subsequent logic gracefully handlesNoneresults to avoid errors when callingsource_code_parser.parse.
49-58: Enhance error-handling inresolve_module_path.While
ModuleNotFoundErroris caught, other exceptions such as importlib-related errors may occur. Consider broader exception handling or fallback logic for unexpected failures.cognee/modules/retrieval/code_graph_retrieval.py (1)
4-10: Avoid overshadowing local imports with broad project-specific imports.You import from
cognee.modules.graph.cognee_graph.CogneeGraphand other internal modules. If local variable names or classes partially match, it may cause confusion. Keep an eye on naming collisions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
cognee-mcp/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (12)
.github/workflows/dockerhub-mcp.yml(1 hunks)cognee-mcp/Dockerfile(1 hunks)cognee-mcp/pyproject.toml(2 hunks)cognee-mcp/src/server.py(3 hunks)cognee/api/v1/cognify/code_graph_pipeline.py(3 hunks)cognee/infrastructure/llm/prompts/codegraph_retriever_system.txt(1 hunks)cognee/modules/retrieval/code_graph_retrieval.py(1 hunks)cognee/shared/CodeGraphEntities.py(3 hunks)cognee/tasks/repo_processor/get_local_dependencies.py(4 hunks)cognee/tasks/repo_processor/get_repo_file_dependencies.py(1 hunks)entrypoint-old.sh(0 hunks)entrypoint.sh(1 hunks)
💤 Files with no reviewable changes (1)
- entrypoint-old.sh
🧰 Additional context used
🪛 LanguageTool
cognee/infrastructure/llm/prompts/codegraph_retriever_system.txt
[grammar] ~6-~6: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...filenames from inline text, headers, or markdown formatting. Empty list of filenames is ...
(MARKDOWN_NNP)
🪛 Hadolint (2.12.0)
cognee-mcp/Dockerfile
[error] 29-29: Use COPY instead of ADD for files and folders
(DL3020)
🪛 actionlint (1.7.4)
.github/workflows/dockerhub-mcp.yml
47-47: property "build" is not defined in object type {meta: {conclusion: string; outcome: string; outputs: {annotations: string; bake-file: string; bake-file-annotations: string; bake-file-labels: string; bake-file-tags: string; json: string; labels: string; tags: string; version: string}}}
(expression)
🔇 Additional comments (19)
entrypoint.sh (1)
34-34: String comparison update for DEBUG environment variable checkThe comparison for the DEBUG environment variable has been updated to use a string comparison
"true"instead of justtrue. This ensures proper string-based comparison which is more reliable in shell scripts.cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)
77-77: Range adjustment to include the end fileThe slicing range now includes the end file by using
end_range + 1as the upper bound. This is a proper fix to ensure that all files are processed, particularly the file at theend_rangeindex.cognee-mcp/pyproject.toml (2)
9-9: Additional database integrations for cogneeThe dependency for cognee has been updated to include postgres and neo4j support, which enables more database options for the project.
24-27:Details
❓ Verification inconclusive
New development dependency group with debugpy
Added a development dependency group with debugpy, which aligns with the debugging functionality in the entrypoint.sh script. This is a good practice for separating development dependencies.
Make sure the debugpy version is compatible with your Python version. You might want to run the following to verify compatibility:
🏁 Script executed:
#!/bin/bash # Check Python version and debugpy compatibility python --version pip show debugpyLength of output: 90
Action Required: Verify Debugpy Installation & Compatibility
The newly addeddevdependency group withdebugpyincognee-mcp/pyproject.toml(lines 24-27) aligns well with the debugging functionality inentrypoint.sh. However, our verification script indicated that while Python 3.11.2 is used,debugpyis not currently installed (i.e.,pip show debugpyreturned a warning). This could be because development dependencies haven’t been installed by default.
- Review Location:
cognee-mcp/pyproject.toml, lines 24-27.- Action Items:
- Confirm that the development environment installs the
devdependencies (e.g., usingpip install .[dev]).- Manually verify that the installed version of
debugpyis compatible with Python 3.11.2.cognee-mcp/Dockerfile (1)
1-49: Best practice: Multi-stage Docker build structure looks goodThe multi-stage build pattern used here is excellent for minimizing the final image size while properly managing dependencies.
🧰 Tools
🪛 Hadolint (2.12.0)
[error] 29-29: Use COPY instead of ADD for files and folders
(DL3020)
cognee/api/v1/cognify/code_graph_pipeline.py (1)
55-55: Task configuration was simplified with fixed batch sizeThe batch size for
add_data_pointsis now fixed at 500, which is more straightforward than previous conditional logic..github/workflows/dockerhub-mcp.yml (1)
1-45: CI/CD workflow for Docker image looks goodThe workflow correctly sets up Docker Buildx, handles authentication, and builds for multiple platforms (amd64 and arm64). Good use of caching to optimize build times.
cognee-mcp/src/server.py (3)
1-1: Import organization improvementMoving the asyncio import to the top of the file improves code organization.
96-100: Added await keyword for cognify functionThe code now correctly awaits the cognify function, which is good for asynchronous flow.
165-166: Added helpful startup log messageAdding a log message at server startup improves observability.
cognee/shared/CodeGraphEntities.py (4)
24-24: Confirm whether removingnamefrom index fields is intentional.Previously, the metadata included
["name", "source_code"], but now it’s reduced to only["source_code"]. This may reduce search precision if the function’s name was used for indexing. Verify other parts of the pipeline don’t depend on it.
33-33: Check class name indexing.Similarly,
ClassDefinitionnow indexes only thesource_codeand omitsname. Ensure that search or classification processes that rely on class names are still functional.
37-37: Good addition ofnameattribute inCodeFile.Attaching a dedicated
namefield to the file entity can simplify file-based lookups and indexing.
44-44: Ensure feasibility of indexing bynamealone.This change removes
source_codefrom themetadata["index_fields"]inCodeFile. If code-based searches rely onsource_code, re-check that the shift to indexing only bynameis the intended approach.cognee/tasks/repo_processor/get_local_dependencies.py (3)
6-6: Use direct references rather than wildcard imports.
from tree_sitter import Language, Node, Parser, Treeis typically fine, but ensure that these references don’t conflict with other similarly named imports in your codebase. It's a good practice to confirm that no naming collisions occur.
26-29: Encapsulating file parsing inFileParseris a good design move.The dedicated class approach is clearer and prevents re-parsing. This also helps keep code modular and testable.
83-87: Validate concurrency when handling shared parsing.Reusing the same
FileParserinstance is efficient, but confirm this is safe in all concurrent contexts (e.g., multiple tasks callingparse_fileat once). If concurrency is expected, consider adding locks or reassessing concurrency usage.cognee/modules/retrieval/code_graph_retrieval.py (2)
1-2: Confirm usage ofasyncioandaiofiles.These asynchronous modules are widely used, but ensure that all file I/O is awaited properly and that any synchronous file operations are replaced if you intend full async performance.
13-17:CodeQueryInfoadds clarity for the return structure.Defining a Pydantic model is a solid approach for typed outputs. Just ensure the
sourcecodefield is typed or explained if you plan to parse it further (e.g., as single string vs. a collection of code fragments).
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit