Updating cognify pipeline documentation #181
Conversation
Walkthrough

The changes in this pull request include the addition of a new section in `docs/pipelines.md`.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (2)
docs/pipelines.md (2)
`5-11`: **Enhance the documentation of task execution modes.**

While the documentation mentions both `run_tasks` and `run_tasks_parallel`, only `run_tasks` is demonstrated. Consider adding:

- An example of `run_tasks_parallel` usage
- Guidelines on when to use sequential vs parallel execution
- Performance implications of each approach
Example addition:
### Sequential vs Parallel Execution

Choose between `run_tasks` and `run_tasks_parallel` based on your needs:

```python
# Sequential execution - for tasks that must run in order
pipeline = run_tasks(tasks, documents)

# Parallel execution - for independent tasks
pipeline = run_tasks_parallel(tasks, documents)
```
- Use sequential execution when tasks depend on each other's results
- Use parallel execution for independent tasks to improve performance
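To make the performance point concrete, here is a minimal, self-contained sketch using plain asyncio stand-ins (not cognee's actual runners) that contrasts sequential and concurrent execution of independent I/O-bound tasks:

```python
import asyncio
import time

async def task(name: str) -> str:
    await asyncio.sleep(0.2)  # simulate I/O-bound work (e.g., an LLM call)
    return name

async def main() -> None:
    # Sequential: each task waits for the previous one (~0.6s total).
    start = time.perf_counter()
    for name in ("a", "b", "c"):
        await task(name)
    print(f"sequential: {time.perf_counter() - start:.2f}s")

    # Concurrent: independent tasks overlap (~0.2s total).
    start = time.perf_counter()
    await asyncio.gather(*(task(name) for name in ("a", "b", "c")))
    print(f"parallel:   {time.perf_counter() - start:.2f}s")

asyncio.run(main())
```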
---

`77-82`: **Enhance task documentation and fix grammar.**

The task descriptions could be more detailed, and there's a grammar issue with "an llm". Suggest updating to:

```markdown
The pipeline executes the following tasks in order:

1. `classify_documents`: Analyzes and categorizes documents into specific types (PdfDocument, AudioDocument, ImageDocument, or TextDocument) based on their content and metadata.
2. `check_permissions_on_documents`: Validates user permissions, ensuring "write" access is available for all operations.
3. `extract_chunks_from_documents`: Processes documents to extract meaningful text segments, with chunk size and overlap configurable per document type.
4. `add_data_points`: Constructs the knowledge graph by:
   - Creating nodes from document chunks
   - Establishing relationships between nodes
   - Storing metadata and properties
5. `extract_graph_from_data`: Generates a structured knowledge representation by analyzing relationships between chunks.
6. `summarize_text`: Creates concise summaries of each chunk using a Large Language Model (LLM), preserving key information while reducing content size.

Note: These tasks must execute in the specified order due to dependencies between steps.
```
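To illustrate why ordering matters, here is a minimal sketch with a hypothetical `run_tasks`-style generator and two stub tasks (stand-ins for illustration, not cognee's actual implementations), where the second step consumes the first step's output:

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Iterable

# Hypothetical stand-in for cognee's pipeline runner; the real interface may differ.
TaskFn = Callable[[Any], Awaitable[Any]]

async def run_tasks(tasks: Iterable[TaskFn], data: Any) -> AsyncIterator[Any]:
    # Feed each task's output into the next, yielding intermediate results.
    result = data
    for task in tasks:
        result = await task(result)
        yield result

async def classify_documents(docs: list[str]) -> list[str]:
    # Stub: tag every input as a TextDocument.
    return [f"TextDocument:{doc}" for doc in docs]

async def extract_chunks_from_documents(docs: list[str]) -> list[str]:
    # Stub: split each classified document into whitespace-delimited chunks.
    return [chunk for doc in docs for chunk in doc.split()]

async def main() -> None:
    # Chunk extraction depends on classification output, so the order is fixed.
    steps = [classify_documents, extract_chunks_from_documents]
    async for result in run_tasks(steps, ["alpha beta", "gamma"]):
        print(result)

asyncio.run(main())
```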
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
docs/pipelines.md (1 hunks)
docs/templates.md (0 hunks)
💤 Files with no reviewable changes (1)
- docs/templates.md
🧰 Additional context used
🪛 LanguageTool
docs/pipelines.md
[uncategorized] ~22-~22: Loose punctuation mark.
Context: ... dataset: Union[str, list[str]] = None: A string or list of dataset names to be...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~23-~23: Loose punctuation mark.
Context: ...s to be processed. - user: User = None: The user requesting the processing. If ...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~77-~77: Loose punctuation mark.
Context: ... the processing: - classify_documents: Converts each of the documents into one...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~79-~79: Loose punctuation mark.
Context: ...ssion. - extract_chunks_from_documents: Extracts text chunks based on the docum...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~80-~80: Loose punctuation mark.
Context: ...n the document type. - add_data_points: Creates nodes and edges from the chunks...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~81-~81: Loose punctuation mark.
Context: ...raph engine. - extract_graph_from_data: Generates knowledge graphs from the doc...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~82-~82: Loose punctuation mark.
Context: ... the document chunks. - summarize_text: Extracts a summary for each chunk using...
(UNLIKELY_OPENING_PUNCTUATION)
[misspelling] ~82-~82: Use “a” instead of ‘an’ if the following word doesn’t start with a vowel sound, e.g. ‘a sentence’, ‘a university’.
Context: ...Extracts a summary for each chunk using an llm.
(EN_A_VS_AN)
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
docs/pipelines.md (3)
`7-11`: **Add error handling and type hints to the code example.**

The code example would be more helpful with error handling and type hints to demonstrate best practices.
Consider updating to:
```diff
-pipeline = run_tasks(tasks, documents)
-async for result in pipeline:
-    print(result)
+from typing import List, AsyncIterator
+from cognee.modules.pipelines.tasks import Task, Document
+
+async def process_documents(tasks: List[Task], documents: List[Document]) -> None:
+    try:
+        pipeline: AsyncIterator = run_tasks(tasks, documents)
+        async for result in pipeline:
+            print(f"Processed result: {result}")
+    except Exception as e:
+        print(f"Error processing documents: {e}")
```
`22-23`: **Enhance parameter documentation.**

The parameter documentation could be more specific about types and requirements.
Consider updating to:
```diff
-- `datasets: Union[str, list[str]] = None`: A string or list of dataset names to be processed.
-- `user: User = None`: The user requesting the processing. If not provided, the default user is retrieved.
+- `datasets: Union[str, list[str]] = None`: A string or list of dataset names to be processed. If None or empty, all datasets owned by the user will be processed.
+- `user: User = None`: The user requesting the processing (import from `cognee.models.user`). If not provided, the default user is retrieved using `get_default_user()`.
```
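For reference, a short usage sketch of these parameters, based on the documented `cognee.cognify` signature (the dataset name is a placeholder; the defaulting behavior follows the docs' description):

```python
import asyncio
import cognee  # assumes the cognee package is installed

async def main() -> None:
    # "my_dataset" is a placeholder dataset name.
    # Omitting `user` falls back to the default user, per the parameter docs.
    await cognee.cognify(datasets="my_dataset")

asyncio.run(main())
```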
`82-82`: **Fix acronym casing in "an llm".**

Since "LLM" is pronounced as individual letters and begins with a vowel sound ("el"), the article "an" is correct; the acronym itself should be capitalized.

```diff
-- `summarize_text`: Extracts a summary for each chunk using an llm.
+- `summarize_text`: Extracts a summary for each chunk using an LLM.
```
🪛 LanguageTool
[uncategorized] ~82-~82: Loose punctuation mark.
Context: ... the document chunks. -summarize_text: Extracts a summary for each chunk using...(UNLIKELY_OPENING_PUNCTUATION)
[misspelling] ~82-~82: Use “a” instead of ‘an’ if the following word doesn’t start with a vowel sound, e.g. ‘a sentence’, ‘a university’.
Context: ...Extracts a summary for each chunk using an llm.(EN_A_VS_AN)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
docs/pipelines.md (1 hunks)
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
docs/pipelines.md (3)
`7-11`: **Enhance code example with more context.**

The code example would be more helpful with additional context about the `tasks` and `documents` parameters, along with imports and type hints. Consider expanding the example:

```python
from typing import List, AsyncIterator
from cognee.modules.pipelines.tasks import Task
from cognee.modules.documents import Document

async def process_documents(tasks: List[Task], documents: List[Document]) -> AsyncIterator:
    pipeline = run_tasks(tasks, documents)
    async for result in pipeline:
        print(result)
```
`76-81`: **Enhance task documentation with configuration details.**

The task descriptions would be more helpful with additional information about:
- Required configuration for each task type
- Input/output expectations
- Any dependencies or prerequisites
- Example configuration snippets
Would you like me to help generate detailed documentation for each task type?
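As a starting point, here is a hypothetical sketch of what per-task configuration documentation could show; the parameter names and values are invented for illustration and do not reflect cognee's actual API:

```python
# Hypothetical per-document-type chunking configuration; names and values
# are illustrative only and do not reflect cognee's actual interface.
chunk_config = {
    "PdfDocument": {"chunk_size": 1024, "chunk_overlap": 128},
    "AudioDocument": {"chunk_size": 512, "chunk_overlap": 64},
    "TextDocument": {"chunk_size": 512, "chunk_overlap": 64},
}

def chunk_params(document_type: str) -> dict:
    # Fall back to the text settings for unknown document types.
    return chunk_config.get(document_type, chunk_config["TextDocument"])

print(chunk_params("PdfDocument"))  # {'chunk_size': 1024, 'chunk_overlap': 128}
```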
`1-81`: **Consider adding advanced usage patterns and best practices.**

To make the documentation more comprehensive, consider adding sections on:
- Error handling and troubleshooting
- Performance optimization tips
- Common usage patterns and anti-patterns
- Integration examples with different document types
This would help users better understand how to effectively use the pipeline system in production environments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
docs/pipelines.md (1 hunks)
hajdul88 left a comment
I think it is a great improvement of the documentation.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
docs/pipelines.md (1)
`7-11`: **Enhance code example with context and error handling.**

The code example would be more helpful with:

- Example of what `tasks` and `documents` contain
- Error handling for async operations

```diff
+# Example tasks and documents
+tasks = [
+    ClassifyDocumentsTask(),
+    ExtractChunksTask()
+]
+documents = [
+    Document(content="Example content")
+]
+
 pipeline = run_tasks(tasks, documents)
-async for result in pipeline:
-    print(result)
+try:
+    async for result in pipeline:
+        print(f"Task result: {result}")
+except Exception as e:
+    print(f"Error processing pipeline: {e}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
docs/pipelines.md (1 hunks)
🔇 Additional comments (4)
docs/pipelines.md (4)
13-19: LGTM!
The main pipeline description is clear and well-documented with appropriate source code references.
`20-24`: **LGTM - Parameter documentation is now consistent.**

The parameter documentation correctly uses `datasets` instead of `dataset`, addressing the previous inconsistency issue.
`69-69`: **Fix markdown code block closure.**
The line appears to be part of the code block above but contains documentation text. It should be outside the code block.
```diff
-The `run_cognify_pipeline` function is defined within `cognify` and is responsible for processing a single dataset. This is where most of the heavy lifting occurs. The function processes multiple datasets concurrently using `asyncio.gather`.
```

Move this line outside the code block.
Likely invalid or redundant comment.
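As context for the `asyncio.gather` sentence above, here is a minimal sketch of the concurrency pattern described; the function bodies are stubs, not cognee's actual `run_cognify_pipeline`:

```python
import asyncio

async def run_cognify_pipeline(dataset: str) -> str:
    # Stand-in for the per-dataset processing described in the docs.
    await asyncio.sleep(0.1)
    return f"processed {dataset}"

async def cognify(datasets: list[str]) -> list[str]:
    # Independent datasets are processed concurrently with asyncio.gather.
    return await asyncio.gather(*(run_cognify_pipeline(d) for d in datasets))

print(asyncio.run(cognify(["dataset_a", "dataset_b"])))
```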
76-81: Consider adding links to task implementations.
The task descriptions are clear, but adding links to their implementations would make it easier for developers to explore the codebase.
Example format:
- [`classify_documents`](link-to-implementation): Converts each of the documents...

✅ Verification successful
Links to task implementations can be added as suggested
Based on the search results, we can confirm the implementations exist at these locations:
- `classify_documents`
- `check_permissions_on_documents`
- `extract_chunks_from_documents`
- `add_data_points`
- `extract_graph_from_data`
- `summarize_text`
The suggestion to add implementation links is valid and feasible as all the task implementations are present in the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Find implementation files for the listed tasks
echo "Searching for task implementations..."
rg -l "class.*Task" | grep -i "classify\|permission\|chunk\|data_point\|graph\|summarize"
```
Length of output: 176
Script:
```shell
#!/bin/bash
# Let's try a broader search for these functions/methods
echo "Searching for implementations..."
rg -l "classify_documents|check_permissions_on_documents|extract_chunks_from_documents|add_data_points|extract_graph_from_data|summarize_text"

# Also search for potential Python files containing these tasks
echo -e "\nSearching for Python files..."
fd -e py | grep -i "classify\|permission\|chunk\|data\|graph\|summarize"

# Search for potential task definitions in any format
echo -e "\nSearching for task definitions..."
rg "def (classify_documents|check_permissions|extract_chunks|add_data_points|extract_graph|summarize_text)"
```
Length of output: 9776
Updated the cognify pipeline documentation to match the current code and removed outdated docs.
Summary by CodeRabbit
- Documentation updates covering `cognee.cognify`, `run_tasks`, and `run_tasks_parallel`.