Updating cognify pipeline documentation #181
Merged
Commits:
2cbbcc5 Updating cognify pipeline documentation (alekszievr)
8317eec typo fix (alekszievr)
2870528 Update docs/pipelines.md (borisarzentar)
8156c90 removing a minor confusing part (alekszievr)
eb54830 Merge branch 'main' into docs-COG-515-update-pipeline-documentation (alekszievr)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| # PIPELINES | ||
|
|
||
| Cognee uses [tasks](https://github.com/topoteretes/cognee/blob/main/cognee/modules/pipelines/tasks/Task.py) grouped into pipelines that populate graph and vector stores. [These tasks](https://github.com/topoteretes/cognee/tree/main/cognee/tasks) analyze and enrich data, enhancing the quality of answers produced by Large Language Models (LLMs). | ||
|
|
||
| The tasks are managed and executed asynchronously using the `run_tasks` and `run_tasks_parallel` functions. | ||
|
|
||
| ```python | ||
| pipeline = run_tasks(tasks, documents) | ||
| async for result in pipeline: | ||
|     print(result) | ||
| ``` | ||
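To make the consumption pattern above concrete, here is a minimal, self-contained sketch of an async-generator pipeline. It is not cognee's implementation: `run_tasks_sketch`, `to_upper`, `add_length`, and `collect` are hypothetical names, and the sketch assumes each task is a plain async function applied to one item at a time.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Iterable

# Hypothetical task type: an async function that transforms one item.
Task = Callable[[Any], Awaitable[Any]]


async def run_tasks_sketch(tasks: Iterable[Task], items: Iterable[Any]) -> AsyncIterator[Any]:
    """Pass each item through every task in order and yield the final value."""
    task_list = list(tasks)
    for item in items:
        for task in task_list:
            item = await task(item)
        yield item


async def to_upper(text: str) -> str:
    return text.upper()


async def add_length(text: str) -> tuple:
    return text, len(text)


async def collect(documents: list) -> list:
    # Same consumption pattern as above: iterate the pipeline with `async for`.
    pipeline = run_tasks_sketch([to_upper, add_length], documents)
    return [result async for result in pipeline]


if __name__ == "__main__":
    print(asyncio.run(collect(["graph", "vector"])))
```

Because the pipeline is an async generator, nothing runs until it is iterated, and results become available one item at a time.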
|
|
||
| ## Main pipeline: [cognee.cognify](https://github.com/topoteretes/cognee/blob/168cb5d1bf1964b5b0c645b2f3d8638d84554fda/cognee/api/v1/cognify/cognify_v2.py#L38) | ||
|
|
||
| This is the main pipeline currently implemented in cognee. It is designed to process data in a structured way and populate the graph and vector stores. | ||
|
|
||
|
|
||
| This function is the entry point for processing datasets. It handles dataset retrieval, user authorization, and manages the execution of a pipeline of tasks that process documents. | ||
|
|
||
| ### Parameters | ||
|
|
||
| - `datasets: Union[str, list[str]] = None`: A string or list of dataset names to be processed. | ||
| - `user: User = None`: The user requesting the processing. If not provided, the default user is retrieved. | ||
|
|
||
| ### Steps in the Function | ||
|
|
||
| #### User Authentication | ||
|
|
||
| ```python | ||
| if user is None: | ||
|     user = await get_default_user() | ||
| ``` | ||
|
|
||
| If no user is provided, the function retrieves the default user. | ||
|
|
||
| #### Handling Empty or String Dataset Input | ||
|
|
||
| ```python | ||
| existing_datasets = await get_datasets(user.id) | ||
| if datasets is None or len(datasets) == 0: | ||
|     datasets = existing_datasets | ||
| if type(datasets[0]) == str: | ||
|     datasets = await get_datasets_by_name(datasets, user.id) | ||
| ``` | ||
|
|
||
| If no datasets are provided, the function retrieves all datasets owned by the user. If a list of dataset names (strings) is provided, they are converted into dataset objects. | ||
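The input handling described here can be sketched in isolation. Everything below is a hypothetical stand-in: the `Dataset` class, the in-memory `_STORE`, and the stubbed `get_datasets` and `get_datasets_by_name` only mimic the shape of the calls in the excerpt. Wrapping a single string in a list is an addition of this sketch so that a bare name is not indexed character by character; the excerpt does not show that step.

```python
import asyncio
from dataclasses import dataclass
from typing import Union


@dataclass
class Dataset:
    """Hypothetical stand-in for a stored dataset record."""
    name: str


# Hypothetical in-memory store: user id -> datasets owned by that user.
_STORE = {"user-1": [Dataset("papers"), Dataset("notes")]}


async def get_datasets(user_id: str) -> list:
    return list(_STORE.get(user_id, []))


async def get_datasets_by_name(names: list, user_id: str) -> list:
    wanted = set(names)
    return [d for d in await get_datasets(user_id) if d.name in wanted]


async def resolve_datasets(datasets: Union[str, list, None], user_id: str) -> list:
    """Return dataset objects for None, one name, a list of names, or a list of objects."""
    if isinstance(datasets, str):
        datasets = [datasets]  # assumption of this sketch: treat a bare name as a one-item list
    if not datasets:
        return await get_datasets(user_id)  # nothing requested: use everything the user owns
    if isinstance(datasets[0], str):
        return await get_datasets_by_name(datasets, user_id)
    return datasets
```

Returning early on empty input also avoids indexing into an empty list when the user owns no datasets.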
|
|
||
| #### Selecting datasets from the input list that are owned by the user | ||
|
|
||
| ```python | ||
| existing_datasets_map = { | ||
|     generate_dataset_name(dataset.name): True for dataset in existing_datasets | ||
| } | ||
| ``` | ||
|
|
||
| #### Run Cognify Pipeline for Each Dataset | ||
|
|
||
| ```python | ||
| awaitables = [] | ||
|
|
||
| for dataset in datasets: | ||
|     dataset_name = generate_dataset_name(dataset.name) | ||
|
|
||
|     if dataset_name in existing_datasets_map: | ||
|         awaitables.append(run_cognify_pipeline(dataset, user)) | ||
|
|
||
| return await asyncio.gather(*awaitables) | ||
| ``` | ||
|
|
||
| The `run_cognify_pipeline` function is defined within `cognify` and is responsible for processing a single dataset. This is where most of the heavy lifting occurs. `cognify` itself schedules one `run_cognify_pipeline` call per owned dataset and runs them concurrently using `asyncio.gather`. | ||
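The ownership filter and the concurrent fan-out can be tried with a small, self-contained sketch. The names here are assumptions for illustration: `run_pipeline_stub` stands in for the per-dataset pipeline, and this `generate_dataset_name` is a guessed normalization, not the library's helper.

```python
import asyncio


def generate_dataset_name(name: str) -> str:
    # Hypothetical normalization; the real helper may behave differently.
    return name.replace(".", "_").replace(" ", "_")


async def run_pipeline_stub(dataset_name: str) -> str:
    # Placeholder for per-dataset work; sleep(0) yields control like real I/O would.
    await asyncio.sleep(0)
    return f"processed:{dataset_name}"


async def process_owned(requested: list, owned: list) -> list:
    # Build a lookup of normalized names the user owns.
    owned_map = {generate_dataset_name(name): True for name in owned}
    awaitables = [
        run_pipeline_stub(name)
        for name in requested
        if generate_dataset_name(name) in owned_map
    ]
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*awaitables)
```

Requested datasets that the user does not own are skipped silently, which matches the membership check in the loop above.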
|
|
||
|
|
||
| #### Pipeline Tasks | ||
|
|
||
| The pipeline consists of several tasks, each responsible for different parts of the processing: | ||
|
|
||
| - `classify_documents`: Converts each document into one of the specific document types: `PdfDocument`, `AudioDocument`, `ImageDocument`, or `TextDocument`. | ||
| - `check_permissions_on_documents`: Checks if the user has the necessary permissions to access the documents. In this case, it checks for "write" permission. | ||
| - `extract_chunks_from_documents`: Extracts text chunks based on the document type. | ||
| - `add_data_points`: Creates nodes and edges from the chunks and their properties, then adds them to the graph engine. | ||
| - `extract_graph_from_data`: Generates knowledge graphs from the document chunks. | ||
| - `summarize_text`: Generates a summary for each chunk using an LLM. | ||
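As a rough illustration of how chunking feeds summarization, the toy sketch below splits text into fixed-size word chunks and "summarizes" each by truncation. All names (`Chunk`, `extract_chunks`, `summarize`, `process`) are hypothetical, the word-count chunking rule is an assumption, and the truncation is only a placeholder for the LLM call that the real task makes.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    document: str
    text: str


def extract_chunks(name: str, text: str, max_words: int = 4) -> list:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        Chunk(name, " ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]


def summarize(chunk: Chunk, max_words: int = 2) -> str:
    # Placeholder for an LLM call: keep only the first few words.
    return " ".join(chunk.text.split()[:max_words])


def process(name: str, text: str) -> list:
    # Chunk first, then summarize each chunk, mirroring the task order above.
    return [(chunk.text, summarize(chunk)) for chunk in extract_chunks(name, text)]
```

Calling `process("a.txt", "one two three four five six")` yields one `(chunk, summary)` pair per chunk, so later stages can store both side by side.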