Merged
Changes from 1 commit
test: Add test to verify deletion of local files
Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475
dexters1 committed Dec 19, 2024
commit 1406b6f0781baa1dfab0b3348e36bc30d7368ee5
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py

@@ -128,15 +128,15 @@ async def delete_data_entity(self, data_id: UUID):

         # Check if other data objects point to the same raw data location
         raw_data_location_entities = (await session.execute(
-            select(Data).where(Data.raw_data_location == data_entity.raw_data_location))).all()
+            select(Data.raw_data_location).where(Data.raw_data_location == data_entity.raw_data_location))).all()

         # Don't delete local file unless this is the only reference to the file in the database
         if len(raw_data_location_entities) == 1:
             # delete local file only if it's created by cognee
             from cognee.base_config import get_base_config
             config = get_base_config()
             if config.data_root_directory in raw_data_location_entities[0].raw_data_location:
-                os.remove(raw_data_location_entities[0])
+                os.remove(raw_data_location_entities[0].raw_data_location)
Contributor:
Should we check if the file exists? If it doesn't, then we have a bug somewhere and it should be reported to sentry.


         await session.execute(delete(Data).where(Data.id == data_id))
         await session.commit()
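One possible shape for the reviewer's suggestion — check existence before removal and report a missing file rather than raise. The helper name and the use of `logging` in place of an actual Sentry call are assumptions for illustration; this is a sketch, not the adapter's implementation:

```python
import logging
import os

logger = logging.getLogger(__name__)

def remove_local_file(file_path: str) -> None:
    """Remove a local file, reporting (rather than raising) when it is missing."""
    if not os.path.exists(file_path):
        # A file missing at this point indicates a bug elsewhere; report it
        # instead of crashing. With Sentry configured, this could instead call
        # sentry_sdk.capture_message(...) (assumed integration, not shown here).
        logger.error("Expected local file is missing: %s", file_path)
        return
    os.remove(file_path)
```

The second call on an already-deleted path then becomes a reported anomaly instead of an unhandled `FileNotFoundError`.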
Comment on lines +114 to +152
Contributor:
💡 Codebase verification

Add error handling and validation for file deletion operations

The verification revealed several issues with file deletion implementations:

  • In SqlAlchemyAdapter.py, file deletion lacks both existence validation and error handling, which could lead to unhandled exceptions.
  • While LocalStorage.py and networkx/adapter.py properly check file existence before deletion, only the networkx adapter implements proper error handling.

Key locations requiring attention:

  • cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py: Add file existence check and try-catch block around os.remove
  • cognee/infrastructure/files/storage/LocalStorage.py: Add error handling around file deletion operation
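A hedged sketch of the pattern this comment asks for — an existence check plus a try/except around `os.remove`. The function name and return-value convention are illustrative assumptions, not code from the repository:

```python
import os

def safe_delete_file(file_path: str) -> bool:
    """Delete a file if it exists; return True only when a file was removed."""
    if not os.path.isfile(file_path):
        return False
    try:
        os.remove(file_path)
    except OSError:
        # Deletion can still fail after the existence check (permissions,
        # concurrent removal); callers decide how to report the failure.
        return False
    return True
```

Returning a boolean lets call sites such as `delete_data_entity` decide whether a failed deletion warrants logging or error reporting.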
🔗 Analysis chain

Verify file deletion implementation across codebase.

Let's verify that all file deletion scenarios are properly handled and there are no orphaned files.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential file handling issues in the codebase

# Look for other file deletion operations that might need similar handling
echo "Checking for other file deletion operations:"
rg -l "os\.remove|unlink" --type py

# Look for file path handling that might need validation
echo -e "\nChecking for file path handling:"
rg -l "raw_data_location|file_path|path\.join" --type py

# Look for TODO/FIXME comments related to file handling
echo -e "\nChecking for TODO/FIXME comments related to file handling:"
rg -i "todo|fixme|hack|xxx" -C 2 --type py | rg -i "file|path|delete"

Length of output: 3747


Script:

#!/bin/bash
# Let's examine the file deletion implementations in the identified files

echo "Checking SqlAlchemyAdapter.py implementation:"
rg "os\.remove" -B 5 -A 5 cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py

echo -e "\nChecking LocalStorage.py implementation:"
rg "os\.remove" -B 5 -A 5 cognee/infrastructure/files/storage/LocalStorage.py

echo -e "\nChecking networkx adapter implementation:"
rg "os\.remove" -B 5 -A 5 cognee/infrastructure/databases/graph/networkx/adapter.py

echo -e "\nChecking for file existence validation before deletion:"
rg "os\.path\.exists|os\.path\.isfile" -B 2 -A 2 cognee/infrastructure/files/storage/LocalStorage.py cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py cognee/infrastructure/databases/graph/networkx/adapter.py

echo -e "\nChecking for exception handling around file operations:"
rg "try.*:.*os\.remove|except.*os\." -B 2 -A 2 cognee/infrastructure/files/storage/LocalStorage.py cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py cognee/infrastructure/databases/graph/networkx/adapter.py

Length of output: 5352

53 changes: 43 additions & 10 deletions cognee/tests/test_pgvector.py
@@ -2,12 +2,52 @@
import logging
import pathlib
import cognee

from cognee.modules.data.models import Data
from cognee.api.v1.search import SearchType
from cognee.modules.retrieval.brute_force_triplet_search import brute_force_triplet_search
from cognee.modules.users.methods import get_default_user

logging.basicConfig(level=logging.DEBUG)

async def test_local_file_deletion(data_text, file_location):
    from sqlalchemy import select
    import hashlib
    from cognee.infrastructure.databases.relational import get_relational_engine

    engine = get_relational_engine()

    async with engine.get_async_session() as session:
        # Get hash of data contents
        encoded_text = data_text.encode("utf-8")
        data_hash = hashlib.md5(encoded_text).hexdigest()
Comment on lines +22 to +23
Contributor:
🛠️ Refactor suggestion

Consider using a more secure hashing algorithm

MD5 is cryptographically weak and susceptible to hash collisions. Consider using a more secure algorithm like SHA-256.

-        encoded_text = data_text.encode("utf-8")
-        data_hash = hashlib.md5(encoded_text).hexdigest()
+        encoded_text = data_text.encode("utf-8")
+        data_hash = hashlib.sha256(encoded_text).hexdigest()

        # Get data entry from database based on hash contents
        data = (await session.scalars(select(Data).where(Data.content_hash == data_hash))).one()
        assert os.path.isfile(data.raw_data_location), f"Data location doesn't exist: {data.raw_data_location}"
        # Test deletion of data along with local files created by cognee
        await engine.delete_data_entity(data.id)
        assert not os.path.exists(data.raw_data_location), f"Data location still exists: {data.raw_data_location}"

    async with engine.get_async_session() as session:
        # Get data entry from database based on file path
        data = (await session.scalars(select(Data).where(Data.raw_data_location == file_location))).one()
        assert os.path.isfile(data.raw_data_location), f"Data location doesn't exist: {data.raw_data_location}"
        # Test that local files not created by cognee won't get deleted
        await engine.delete_data_entity(data.id)
        assert os.path.exists(data.raw_data_location), f"Data location doesn't exist: {data.raw_data_location}"

async def test_getting_of_documents(dataset_name_1):
    # Test getting of documents for search per dataset
    from cognee.modules.users.permissions.methods import get_document_ids_for_user
    user = await get_default_user()
    document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
    assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"

    # Test getting of documents for search when no dataset is provided
    user = await get_default_user()
    document_ids = await get_document_ids_for_user(user.id)
    assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
Comment on lines +40 to +50
Contributor:
🛠️ Refactor suggestion

Improve test function structure and assertions

Several improvements needed:

  1. Add docstring
  2. Improve assertion messages
  3. Remove duplicate user retrieval
 async def test_getting_of_documents(dataset_name_1):
+    """
+    Test document retrieval functionality:
+    1. Verify correct document count when filtering by dataset
+    2. Verify correct document count when no dataset filter is applied
+    """
     # Test getting of documents for search per dataset
     from cognee.modules.users.permissions.methods import get_document_ids_for_user
     user = await get_default_user()
     document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
-    assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"
+    assert len(document_ids) == 1, f"Expected 1 document when filtering by dataset, but got {len(document_ids)}"

     # Test getting of documents for search when no dataset is provided
-    user = await get_default_user()  # Duplicate user retrieval
     document_ids = await get_document_ids_for_user(user.id)
-    assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
+    assert len(document_ids) == 2, f"Expected 2 documents without dataset filter, but got {len(document_ids)}"



async def main():
cognee.config.set_vector_db_config(
@@ -67,16 +107,7 @@ async def main():

from cognee.infrastructure.databases.vector import get_vector_engine

-    # Test getting of documents for search per dataset
-    from cognee.modules.users.permissions.methods import get_document_ids_for_user
-    user = await get_default_user()
-    document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
-    assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"
-
-    # Test getting of documents for search when no dataset is provided
-    user = await get_default_user()
-    document_ids = await get_document_ids_for_user(user.id)
-    assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
+    await test_getting_of_documents(dataset_name_1)

vector_engine = get_vector_engine()
random_node = (await vector_engine.search("entity_name", "Quantum computer"))[0]
@@ -106,6 +137,8 @@ async def main():
results = await brute_force_triplet_search('What is a quantum computer?')
assert len(results) > 0

await test_local_file_deletion(text, explanation_file_path)

await cognee.prune.prune_data()
assert not os.path.isdir(data_directory_path), "Local data files are not deleted"
