fix: Resolve issue with text classification #1704
Important: Review skipped
Review was skipped due to path filters.
⛔ Files ignored due to path filters (2)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters.
Walkthrough
File type detection logic is enhanced by introducing an optional `name` parameter through the utility chain. This enables the metadata retrieval and type-guessing functions to use the file name when it is available. Explicit handling of text extensions is added to reduce ambiguity for text files. One WAV audio MIME type is added to the audio loader's supported formats.
Changes
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cognee/infrastructure/files/utils/guess_file_type.py (2)
28-44: Update docstring to document the `name` parameter.
The function signature now accepts an optional `name` parameter, but the docstring doesn't document it.
Apply this diff to update the docstring:
     Parameters:
     -----------
         - file (BinaryIO): A binary file stream to analyze for determining the file type.
    +    - name (Optional[str]): Optional filename to use for extension-based type inference.

     Returns:
58-66: Remove unreachable code.
Lines 64-65 are unreachable because line 62 ensures `file_type` is never `None` at that point. The exception can never be raised.
Apply this diff to remove the dead code:
     file_type = filetype.guess(file)

     # If file type could not be determined consider it a plain text file as they don't have magic number encoding
     if file_type is None:
         file_type = Type("text/plain", "txt")

    -    if file_type is None:
    -        raise FileTypeException(f"Unknown file detected: {file.name}.")
    -
     return file_type
cognee/infrastructure/files/utils/get_file_metadata.py (1)
30-56: Update docstring to document the `name` parameter.
The implementation correctly forwards the `name` parameter to `guess_file_type`, enabling name-aware file type detection. However, the docstring doesn't document this new parameter.
Apply this diff to update the docstring:
     Parameters:
     -----------
         - file (BinaryIO): A file-like object from which to extract metadata.
    +    - name (Optional[str]): Optional filename to use for extension-based type inference.

     Returns:
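The name-aware detection discussed in the comments above can be sketched roughly as follows. This is a stdlib-only illustration, not the cognee implementation: `guess_type_sketch`, `GuessedType`, `TEXT_EXTENSIONS`, and the small `MAGIC_NUMBERS` table are hypothetical names, and the table stands in for the `filetype` library that the real code calls.

```python
import io
from pathlib import Path
from typing import BinaryIO, NamedTuple, Optional


class GuessedType(NamedTuple):
    mime: str
    extension: str


# Hypothetical extension list for illustration; the PR's actual list is not shown here.
TEXT_EXTENSIONS = {".txt"}

# A few well-known magic numbers, standing in for a real detection library.
MAGIC_NUMBERS = {
    b"%PDF-": GuessedType("application/pdf", "pdf"),
    b"\x89PNG\r\n\x1a\n": GuessedType("image/png", "png"),
}


def guess_type_sketch(file: BinaryIO, name: Optional[str] = None) -> GuessedType:
    """Guess a file type, preferring an explicit text extension when a name is given."""
    # 1. Trust an explicit text extension when a name is available.
    if name is not None and Path(name).suffix.lower() in TEXT_EXTENSIONS:
        return GuessedType("text/plain", "txt")

    # 2. Otherwise inspect the leading bytes, then restore the stream position.
    position = file.tell()
    header = file.read(16)
    file.seek(position)
    for magic, guessed in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return guessed

    # 3. No magic number matched: fall back to plain text.
    return GuessedType("text/plain", "txt")


# A .txt file whose content happens to start with "%PDF-" stays plain text
# when the name is supplied, and is detected as PDF when it is not.
stream = io.BytesIO(b"%PDF-1.7 pasted into a note")
print(guess_type_sketch(stream, name="notes.txt").mime)
print(guess_type_sketch(stream).mime)
```

Checking the extension first is what removes the ambiguity for text files: without a name, the only signal is the leading bytes, and plain text has no magic number of its own.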
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cognee/infrastructure/files/utils/get_file_metadata.py (3 hunks)
cognee/infrastructure/files/utils/guess_file_type.py (3 hunks)
cognee/infrastructure/loaders/core/audio_loader.py (1 hunks)
cognee/modules/ingestion/data_types/BinaryData.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
{cognee,cognee-mcp,distributed,examples,alembic}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
{cognee,cognee-mcp,distributed,examples,alembic}/**/*.py: Use 4-space indentation; name modules and functions in snake_case; name classes in PascalCase (Python)
Adhere to ruff rules, including import hygiene and configured line length (100)
Keep Python lines ≤ 100 characters
Files:
cognee/infrastructure/files/utils/get_file_metadata.py
cognee/infrastructure/loaders/core/audio_loader.py
cognee/infrastructure/files/utils/guess_file_type.py
cognee/modules/ingestion/data_types/BinaryData.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/**/*.py: Public APIs in the core library should be type-annotated where practical
Prefer explicit, structured error handling and use shared logging utilities from cognee.shared.logging_utils
Files:
cognee/infrastructure/files/utils/get_file_metadata.py
cognee/infrastructure/loaders/core/audio_loader.py
cognee/infrastructure/files/utils/guess_file_type.py
cognee/modules/ingestion/data_types/BinaryData.py
🧬 Code graph analysis (3)
cognee/infrastructure/files/utils/get_file_metadata.py (2)
cognee/infrastructure/files/storage/FileBufferedReader.py (1)
name (11-12)
cognee/infrastructure/files/utils/guess_file_type.py (1)
guess_file_type (28-67)
cognee/infrastructure/files/utils/guess_file_type.py (1)
cognee/infrastructure/files/storage/FileBufferedReader.py (1)
name (11-12)
cognee/modules/ingestion/data_types/BinaryData.py (2)
cognee/infrastructure/files/utils/get_file_metadata.py (1)
get_file_metadata (30-79)
cognee/infrastructure/files/storage/FileBufferedReader.py (1)
name (11-12)
🪛 Pylint (4.0.2)
cognee/infrastructure/files/utils/guess_file_type.py
[error] 1-1: Unrecognized option found: suggestion-mode
(E0015)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (40)
- GitHub Check: Example Tests / Run Agentic Reasoning Tests
- GitHub Check: Operating System and Python Tests / Integration tests 3.12.x on windows-latest
- GitHub Check: Operating System and Python Tests / Unit tests 3.12.x on macos-15
- GitHub Check: Operating System and Python Tests / Hard Delete test 3.10.x on ubuntu-22.04
🔇 Additional comments (3)
cognee/infrastructure/loaders/core/audio_loader.py (1)
45-45: LGTM! Good addition for WAV compatibility.
Adding "audio/x-wav" alongside "audio/wav" improves compatibility, as WAV files may be served with either MIME type.
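A minimal sketch of why accepting both spellings helps, assuming a loader that checks an incoming MIME type against a set of supported values. `SUPPORTED_AUDIO_MIME_TYPES` and `is_supported_audio` are hypothetical names; only the two WAV entries come from the comment above, and the real loader's list is not reproduced here.

```python
# Hypothetical set for illustration; only the two WAV entries come from the review comment.
SUPPORTED_AUDIO_MIME_TYPES = {"audio/wav", "audio/x-wav"}


def is_supported_audio(mime_type: str) -> bool:
    """Return True when the MIME type, ignoring parameters and case, is supported."""
    # Drop parameters such as "; codecs=1" and normalize case before comparing.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return base_type in SUPPORTED_AUDIO_MIME_TYPES


# Both spellings of the WAV MIME type are accepted.
print(is_supported_audio("audio/wav"))
print(is_supported_audio("audio/x-wav"))
```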
cognee/modules/ingestion/data_types/BinaryData.py (1)
31-36: LGTM! Correct integration of name-aware metadata retrieval.
The change properly passes `self.name` to `get_file_metadata`, enabling name-based file type detection. The existing fallback logic on lines 35-36 ensures the metadata name is always set.
cognee/infrastructure/files/utils/guess_file_type.py (1)
1-6: Remove unused imports: `io`, `SpooledTemporaryFile`, and `Any`.
These imports are not used in the file and should be removed to comply with import hygiene rules. Keep only the imports that are actively used: `Path`, `BinaryIO`, `Optional`, `filetype`, and `Type`.
Description
Resolve issue with text file classification
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.