feat: Add search by dataset for cognee #376
Conversation
Added ability to search by datasets for cognee users (Feature COG-912)
Walkthrough
The pull request introduces modifications across several files in the Cognee project, focusing on enhancing dataset management, search functionality, and user permissions. The changes primarily involve updating function signatures to support more flexible dataset handling.
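For orientation, the new call shape looks roughly like this. The sketch below is assembled from the test usage quoted later in this review; the `SearchType` import path and the sample content are assumptions, not verified against the codebase:

```python
import asyncio

import cognee
from cognee.api.v1.search import SearchType  # import path is an assumption


async def main():
    # Ingest into a named dataset, process it, then restrict the search to it.
    await cognee.add(["Quantum computers operate on qubits."], "quantum")
    await cognee.cognify(["quantum"])

    # The new `datasets` parameter limits search to the listed datasets;
    # a single string is also accepted and normalized to a list internally.
    results = await cognee.search(
        SearchType.CHUNKS,
        query_text="qubits",
        datasets=["quantum"],
    )
    print(results)


asyncio.run(main())
```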
Removed lock that prevented using multiple datasets in cognify (Fix COG-912)
Added test to verify getting documents related to datasets intended for search (Test COG-912)
Resolve issue with default value for graph model in cognify endpoint (Fix)
Actionable comments posted: 6
🔭 Outside diff range comments (1)
cognee/tests/test_pgvector.py (1)
Line range hint 92-96: Enhance search test coverage
While the current test verifies basic dataset filtering for CHUNKS search, consider:
- Adding dataset filtering tests for INSIGHTS and SUMMARIES search types
- Including negative test cases (e.g., searching in non-existent dataset)
- Testing with multiple datasets in the filter
Example additional test cases:
```python
# Test search with non-existent dataset
search_results = await cognee.search(
    SearchType.CHUNKS,
    query_text=random_node_name,
    datasets=["non_existent_dataset"]
)
assert len(search_results) == 0, "Should return empty results for non-existent dataset"

# Test search with multiple datasets
search_results = await cognee.search(
    SearchType.CHUNKS,
    query_text=random_node_name,
    datasets=[dataset_name_1, dataset_name_2]
)
assert len(search_results) > 0, "Should return results from both datasets"
```
🧹 Nitpick comments (3)
cognee/modules/users/permissions/methods/get_document_ids_for_user.py (2)
24-24: Typo in variable name 'documnets_ids_in_dataset'
The variable `documnets_ids_in_dataset` contains a typo; it should be `document_ids_in_dataset`. Apply this diff to fix the typo:

```diff
-documnets_ids_in_dataset = set()
+document_ids_in_dataset = set()
```

Make sure to update all references to this variable within the function.
46-46: Fix grammatical error in comment
There's a grammatical error in the comment. It should read: "If document is related to dataset, add it to return value"
Apply this diff to correct the comment:
```diff
-# If document is related to dataset added it to return value
+# If document is related to dataset, add it to return value
```

cognee/tests/test_pgvector.py (1)
48-49: Consider improving test data organization
The dataset setup is functional but could be better organized. Consider:
- Using constants for dataset names at the module level
- Moving test data to separate fixture files for better maintainability
The current implementation works well for testing multiple dataset functionality.
```diff
+# At the top of the file
+DATASET_NATURAL_LANGUAGE = "natural_language"
+DATASET_QUANTUM = "quantum"

-dataset_name_1 = "natural_language"
-dataset_name_2 = "quantum"
+dataset_name_1 = DATASET_NATURAL_LANGUAGE
+dataset_name_2 = DATASET_QUANTUM
```

Also applies to: 54-54, 64-64, 66-66
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cognee/api/v1/cognify/cognify_v2.py (1 hunks)
- cognee/api/v1/cognify/routers/get_cognify_router.py (1 hunks)
- cognee/api/v1/search/search_v2.py (3 hunks)
- cognee/modules/data/models/__init__.py (1 hunks)
- cognee/modules/users/permissions/methods/get_document_ids_for_user.py (2 hunks)
- cognee/tests/test_pgvector.py (4 hunks)
🔇 Additional comments (2)
cognee/modules/data/models/__init__.py (1)
3-3: Addition of 'DatasetData' import is appropriate
Including `DatasetData` in the `__init__.py` file correctly exposes it for use in other parts of the application.
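For reference, the change is a one-line re-export; a minimal sketch, assuming the model lives in a sibling module (the neighboring export names are illustrative):

```python
# cognee/modules/data/models/__init__.py
from .Data import Data                # pre-existing exports (names assumed)
from .Dataset import Dataset
from .DatasetData import DatasetData  # newly exposed dataset-to-data association model
```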
cognee/tests/test_pgvector.py (1)
7-7: LGTM: Import added for user permission testing
The import of get_default_user is appropriately placed and supports the new dataset permission testing functionality.
```python
if datasets:
    documnets_ids_in_dataset = set()
    # If datasets are specified filter out documents that aren't part of the specified datasets
    for dataset in datasets:
        # Find dataset id for dataset element
        dataset_id = (await session.scalars(
            select(Dataset.id)
            .where(
                Dataset.name == dataset,
                Dataset.owner_id == user_id,
            )
        )).one_or_none()

        # Check which documents are connected to this dataset
        for document_id in document_ids:
            data_id = (await session.scalars(
                select(DatasetData.data_id)
                .where(
                    DatasetData.dataset_id == dataset_id,
                    DatasetData.data_id == document_id,
                )
            )).one_or_none()

            # If document is related to dataset added it to return value
            if data_id:
                documnets_ids_in_dataset.add(document_id)
    return list(documnets_ids_in_dataset)
```
🛠️ Refactor suggestion
Potential performance issue due to multiple database queries inside loops
The current implementation executes database queries within nested loops, which can significantly degrade performance due to the large number of database calls.
Consider refactoring to reduce the number of queries by fetching data in bulk. Here's a suggested approach:
- Retrieve all dataset IDs for the specified datasets in a single query.
- Retrieve all document IDs associated with these datasets in a single query.
- Intersect these document IDs with the user's document IDs.
Apply this refactor:
```python
# Get dataset IDs for the specified datasets
dataset_ids = (await session.scalars(
    select(Dataset.id).where(
        Dataset.name.in_(datasets),
        Dataset.owner_id == user_id
    )
)).all()

# If no matching datasets are found, return an empty list
if not dataset_ids:
    return []

# Get all document IDs associated with these datasets
document_ids_in_datasets = (await session.scalars(
    select(DatasetData.data_id).where(
        DatasetData.dataset_id.in_(dataset_ids),
        DatasetData.data_id.in_(document_ids)
    )
)).all()

return list(document_ids_in_datasets)
```

This refactor minimizes the number of database queries and improves performance.
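To see the suggestion in context, here is a self-contained sketch of how the whole helper could look after the refactor. The function name and the two bulk queries follow the review; the engine accessor, session handling, and the simplified permission lookup are assumptions:

```python
from uuid import UUID

from sqlalchemy import select

# Import paths below are assumptions based on the files touched in this PR.
from cognee.infrastructure.databases.relational import get_relational_engine
from cognee.modules.data.models import Dataset, DatasetData


async def get_document_ids_for_user(user_id: UUID, datasets: list[str] = None) -> list[UUID]:
    db_engine = get_relational_engine()

    async with db_engine.get_async_session() as session:
        # Simplified stand-in: the real function derives document_ids from the
        # user's permissions; here we take every document in the user's datasets.
        document_ids = (await session.scalars(
            select(DatasetData.data_id)
            .join(Dataset, Dataset.id == DatasetData.dataset_id)
            .where(Dataset.owner_id == user_id)
        )).all()

        if not datasets:
            return list(document_ids)

        # Bulk query 1: resolve the requested dataset names to IDs.
        dataset_ids = (await session.scalars(
            select(Dataset.id).where(
                Dataset.name.in_(datasets),
                Dataset.owner_id == user_id,
            )
        )).all()

        if not dataset_ids:
            return []

        # Bulk query 2: keep only documents that belong to those datasets.
        return list((await session.scalars(
            select(DatasetData.data_id).where(
                DatasetData.dataset_id.in_(dataset_ids),
                DatasetData.data_id.in_(document_ids),
            )
        )).all())
```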
```python
async def search(query_type: SearchType, query_text: str, user: User = None,
                 datasets: Union[list[str], str, None] = None) -> list:
    # We use lists from now on for datasets
    if isinstance(datasets, str):
        datasets = [datasets]
```
🛠️ Refactor suggestion
Improve type annotation and handling of 'datasets' parameter
The datasets parameter is annotated as Union[list[str], str, None]. To enhance type clarity and consistency, consider updating the type annotation and handling.
Apply this diff to improve the type annotation:
```diff
-from typing import Callable, Dict, Union
+from typing import Callable, Dict, Union, Optional, List

-async def search(query_type: SearchType, query_text: str, user: User = None,
-                 datasets: Union[list[str], str, None] = None) -> list:
+async def search(query_type: SearchType, query_text: str, user: User = None,
+                 datasets: Optional[Union[List[str], str]] = None) -> list:
     # We use lists from now on for datasets
     if isinstance(datasets, str):
         datasets = [datasets]
```

Alternatively, if you prefer to accept only lists:
```diff
-async def search(query_type: SearchType, query_text: str, user: User = None,
-                 datasets: Union[list[str], str, None] = None) -> list:
+async def search(query_type: SearchType, query_text: str, user: User = None,
+                 datasets: Optional[List[str]] = None) -> list:
     # Ensure datasets is a list
     if datasets is None:
         datasets = []
```

This approach simplifies type checking and makes the code more predictable.
Committable suggestion skipped: line range outside the PR's diff.
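As a standalone illustration of what the guard buys callers, a tiny runnable sketch (the helper name `normalize_datasets` is hypothetical, pulled out here only for demonstration):

```python
from typing import List, Optional, Union


def normalize_datasets(datasets: Optional[Union[List[str], str]]) -> Optional[List[str]]:
    # Mirrors the guard at the top of search(): a bare string becomes a one-element list.
    if isinstance(datasets, str):
        return [datasets]
    return datasets


assert normalize_datasets("quantum") == ["quantum"]
assert normalize_datasets(["a", "b"]) == ["a", "b"]
assert normalize_datasets(None) is None
```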
```python
# Test getting of documents for search per dataset
from cognee.modules.users.permissions.methods import get_document_ids_for_user
user = await get_default_user()
document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"

# Test getting of documents for search when no dataset is provided
from cognee.modules.users.permissions.methods import get_document_ids_for_user
user = await get_default_user()
document_ids = await get_document_ids_for_user(user.id)
assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
```
🛠️ Refactor suggestion
Remove duplicate import and improve test structure
The permission tests are comprehensive but have some issues:
- The `get_document_ids_for_user` import is duplicated (lines 71 and 77)
- The tests could be better structured as separate test functions
```diff
-# Test getting of documents for search per dataset
-from cognee.modules.users.permissions.methods import get_document_ids_for_user
-user = await get_default_user()
-document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
-assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"

-# Test getting of documents for search when no dataset is provided
-from cognee.modules.users.permissions.methods import get_document_ids_for_user
-user = await get_default_user()
-document_ids = await get_document_ids_for_user(user.id)
-assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"

+async def test_get_documents_with_dataset_filter():
+    user = await get_default_user()
+    document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
+    assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"

+async def test_get_documents_without_dataset_filter():
+    user = await get_default_user()
+    document_ids = await get_document_ids_for_user(user.id)
+    assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
```
+ assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"Committable suggestion skipped: line range outside the PR's diff.
Resolve typo in code (chore COG-912)
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/tests/test_pgvector.py (1)
70-80: 🛠️ Refactor suggestion
Restructure permission tests and remove duplicate import
The permission tests have the same issues as previously noted:
- The `get_document_ids_for_user` import is duplicated
- The tests should be structured as separate test functions for better maintainability
```diff
-# Test getting of documents for search per dataset
-from cognee.modules.users.permissions.methods import get_document_ids_for_user
-user = await get_default_user()
-document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
-assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"
-# Test getting of documents for search when no dataset is provided
-from cognee.modules.users.permissions.methods import get_document_ids_for_user
-user = await get_default_user()
-document_ids = await get_document_ids_for_user(user.id)
-assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
+async def test_get_documents_with_dataset_filter():
+    """Test document retrieval when filtered by dataset"""
+    user = await get_default_user()
+    document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
+    assert len(document_ids) == 1, f"Expected 1 document, got {len(document_ids)}"
+async def test_get_documents_without_dataset_filter():
+    """Test document retrieval without dataset filter"""
+    user = await get_default_user()
+    document_ids = await get_document_ids_for_user(user.id)
+    assert len(document_ids) == 2, f"Expected 2 documents, got {len(document_ids)}"
```
🧹 Nitpick comments (2)
cognee/tests/test_pgvector.py (2)
54-54: Add assertions to verify dataset operations
The dataset operations (add and cognify) lack assertions to verify their success. Consider adding checks to ensure the operations completed as expected.
```diff
 await cognee.add([explanation_file_path], dataset_name_1)
+assert await cognee.get_dataset_size(dataset_name_1) > 0, "Dataset 1 should not be empty"
 await cognee.add([text], dataset_name_2)
+assert await cognee.get_dataset_size(dataset_name_2) > 0, "Dataset 2 should not be empty"
 await cognee.cognify([dataset_name_2, dataset_name_1])
+assert await cognee.is_dataset_cognified(dataset_name_1), "Dataset 1 should be cognified"
+assert await cognee.is_dataset_cognified(dataset_name_2), "Dataset 2 should be cognified"
```

Also applies to: 64-64, 66-66
Line range hint 91-95: Enhance search test coverage
While the current test verifies basic search functionality with dataset filtering, consider adding:
- A test case for searching across multiple datasets
- More specific assertions about the search results content
```diff
 search_results = await cognee.search(SearchType.CHUNKS, query_text = random_node_name, datasets=[dataset_name_2])
-assert len(search_results) != 0, "The search results list is empty."
+assert len(search_results) != 0, "The search results list is empty."
+assert all("quantum" in result.lower() for result in search_results), "Search results should contain the query term"
+
+# Test search across multiple datasets
+multi_dataset_results = await cognee.search(
+    SearchType.CHUNKS,
+    query_text=random_node_name,
+    datasets=[dataset_name_1, dataset_name_2]
+)
+assert len(multi_dataset_results) >= len(search_results), "Multi-dataset search should return at least as many results"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cognee/modules/users/permissions/methods/get_document_ids_for_user.py (2 hunks)
- cognee/tests/test_pgvector.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/modules/users/permissions/methods/get_document_ids_for_user.py
🔇 Additional comments (1)
cognee/tests/test_pgvector.py (1)
7-7: LGTM: Clean import and clear dataset naming
The import is properly placed and the dataset names are descriptive and well-defined.
Also applies to: 48-49
Added ability to search by datasets for cognee users
Feature COG-912
Summary by CodeRabbit
- New Features
- Bug Fixes
- Documentation
- Tests