-
Notifications
You must be signed in to change notification settings - Fork 952
backport: Adds lance-namespace version fix to toml (fixes lancedb issue with 0.2.0 lance-namespace version) + crawler ingetration test url fix #1842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Please make sure all the checkboxes are checked:
|
|
Important Review skippedReview was skipped due to path filters ⛔ Files ignored due to path filters (3)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including You can disable this status message by setting the WalkthroughThree test files in the web URL crawler integration test suite had their hardcoded test URLs replaced from "https://en.wikipedia.org/wiki/Large_language_model" to "http://example.com/". No test logic, assertions, or control flow were modified; only input data changed. Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings, 1 inconclusive)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cognee/tests/integration/web_url_crawler/test_tavily_crawler.py (1)
14-19: Test coverage reduced by switching to example.com.Replacing Wikipedia's complex HTML structure with example.com's minimal page may reduce test effectiveness. Example.com has basic HTML whereas Wikipedia has rich content, nested structures, tables, and complex formatting that better exercises the scraper's capabilities.
Consider:
- Keep example.com for basic connectivity/smoke tests
- Add additional tests with more complex HTML to ensure robust parsing
cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py (1)
45-63: HTML validation test may be less meaningful with example.com.This test validates that saved content contains parseable HTML with common elements (html, head, body, div, p). While example.com does have basic HTML structure, it's significantly simpler than Wikipedia's content. The test might pass but provide less confidence that complex real-world pages are handled correctly.
♻️ Duplicate comments (3)
cognee/tests/integration/web_url_crawler/test_default_url_crawler.py (1)
8-13: Duplicate concern: test coverage reduced.Same issue as test_tavily_crawler.py - switching from Wikipedia's complex HTML to example.com's minimal structure reduces test effectiveness for the crawler.
cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py (2)
17-17: Duplicate concern: test coverage reduced.Same issue flagged in test_tavily_crawler.py.
71-71: Duplicate concern: test coverage reduced.Same URL change concern flagged in earlier files applies to all these test functions.
Also applies to: 87-87, 97-97, 111-111, 124-124, 162-162, 193-193, 220-220, 256-256, 293-293
🧹 Nitpick comments (1)
cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py (1)
143-147: Extraction rules test needs adequate HTML structure.This test validates BeautifulSoup's extraction_rules for titles, headings (h1/h2/h3), links, and paragraphs. Example.com has minimal content, so the extraction rules may not be thoroughly exercised. Consider using a test fixture with richer HTML structure to better validate the extraction functionality.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
poetry.lockis excluded by!**/*.lock,!**/*.lockpyproject.tomlis excluded by!**/*.tomluv.lockis excluded by!**/*.lock,!**/*.lock
📒 Files selected for processing (3)
cognee/tests/integration/web_url_crawler/test_default_url_crawler.py(1 hunks)cognee/tests/integration/web_url_crawler/test_tavily_crawler.py(1 hunks)cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py(13 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tests/integration/web_url_crawler/test_tavily_crawler.pycognee/tests/integration/web_url_crawler/test_url_adding_e2e.pycognee/tests/integration/web_url_crawler/test_default_url_crawler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tests/integration/web_url_crawler/test_tavily_crawler.pycognee/tests/integration/web_url_crawler/test_url_adding_e2e.pycognee/tests/integration/web_url_crawler/test_default_url_crawler.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/integration/web_url_crawler/test_tavily_crawler.pycognee/tests/integration/web_url_crawler/test_url_adding_e2e.pycognee/tests/integration/web_url_crawler/test_default_url_crawler.py
🧠 Learnings (2)
📚 Learning: 2024-11-13T14:55:05.912Z
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
Applied to files:
cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py
📚 Learning: 2025-11-24T16:45:09.996Z
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:45:09.996Z
Learning: Applies to cognee/tests/**/*.py : Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Applied to files:
cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py
🧬 Code graph analysis (1)
cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py (1)
cognee/tasks/ingestion/save_data_item_to_storage.py (1)
save_data_item_to_storage(26-99)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: End-to-End Tests / Conversation sessions test
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Formatting Check
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Integration Tests
🔇 Additional comments (1)
cognee/tests/integration/web_url_crawler/test_default_url_crawler.py (1)
8-8: Now let me search more broadly for HTTPS tests and check the git history to understand what changed:
<function_calls>
#!/bin/bashSearch for HTTPS URLs in the entire test suite
rg -n --type=py 'https://' cognee/tests/ | head -20
</function_calls>#!/bin/bash # Check git diff to see what actually changed at line 8 git diff HEAD^ HEAD -- cognee/tests/integration/web_url_crawler/test_default_url_crawler.py </parameter> </invoke> </function_calls> </blockquote></details> </blockquote></details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
Description
Implements a quick fix for the lance-namespace 0.0.21 to 0.2.0 release issue with lancedb. Later this has to be revisited if they fix it on their side, for now we fixed the lance-namespace version to the previous one.
If Lancedb fixes the issue on their side this can be closed
Additionally cherry picking crawler integration test fixes from dev
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.