@hajdul88
Collaborator

@hajdul88 hajdul88 commented Nov 27, 2025

Description

Implements a quick fix for the lancedb issue introduced by the lance-namespace 0.0.21 to 0.2.0 release. For now, the lance-namespace version is pinned to the previous release; this has to be revisited if the issue is fixed upstream.

If LanceDB fixes the issue on their side, this can be closed.

Additionally, cherry-picks the crawler integration test fixes from dev.
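
The pin itself lives in pyproject.toml, which is not shown in this conversation. As a minimal sketch of the same idea, the check below flags an installed lance-namespace that is newer than the last working release. It assumes that 0.0.21 is the version being pinned to and that versions are plain dotted numbers; the helper names are hypothetical and not part of cognee.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as "0.0.21" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Assumption: 0.0.21 is the last lance-namespace release that works with lancedb.
LAST_KNOWN_GOOD = parse_version("0.0.21")


def is_allowed(installed: str) -> bool:
    """Return True when the given version is not newer than the pinned release."""
    return parse_version(installed) <= LAST_KNOWN_GOOD


def check_installed(package: str = "lance-namespace") -> str:
    """Describe whether the installed package respects the pin."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed"
    state = "ok" if is_allowed(installed) else "newer than the pinned release"
    return f"{package} {installed}: {state}"


if __name__ == "__main__":
    print(check_installed())
```

A check like this could run at import time or in CI while the pin is in place, and be deleted once the upstream issue is resolved.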

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Other (please specify):

Screenshots/Videos (if applicable)

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented Nov 27, 2025

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (3)
  • poetry.lock is excluded by !**/*.lock, !**/*.lock
  • pyproject.toml is excluded by !**/*.toml
  • uv.lock is excluded by !**/*.lock, !**/*.lock

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Three test files in the web URL crawler integration test suite had their hardcoded test URL changed from "https://en.wikipedia.org/wiki/Large_language_model" to "http://example.com/". No test logic, assertions, or control flow were modified; only the input data changed.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Test URL substitution: cognee/tests/integration/web_url_crawler/test_default_url_crawler.py, cognee/tests/integration/web_url_crawler/test_tavily_crawler.py, cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py | Replaced hardcoded test URL from Wikipedia article link to http://example.com/ across all web URL crawler integration tests. All test logic and assertions remain unchanged. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

  • Changes are purely test data substitution with no logic modifications across three related test files
  • Simple find-and-replace pattern; consistent application throughout all affected files

Poem

🐰 A rabbit's ode to simpler tests
URLs swap from wiki's nest,
To example.com so plain,
Tests still pass, logic's not in vain,
Hop along, all looks best! 🌟

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Description check | ⚠️ Warning | PR description lacks human-generated detail about lance-namespace version fix and test changes, relying on brief statements; Type of Change boxes are unchecked; Pre-submission checklist items are unchecked. | Provide detailed explanation of the lance-namespace 0.0.21 to 0.2.0 issue, why test URLs were changed to example.com, check relevant Type of Change box, and verify Pre-submission checklist items before submitting. |
| Title check | ❓ Inconclusive | The PR title mentions two distinct changes: a lance-namespace version fix and a crawler integration test URL fix, but the summary shows only URL updates in test files without evidence of the primary lance-namespace TOML fix. | Clarify whether the lance-namespace TOML version fix is included in this changeset, or update the title to focus solely on the test URL updates if the main fix is in a separate commit. |

@hajdul88 hajdul88 self-assigned this Nov 27, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cognee/tests/integration/web_url_crawler/test_tavily_crawler.py (1)

14-19: Test coverage reduced by switching to example.com.

Replacing Wikipedia's complex HTML structure with example.com's minimal page may reduce test effectiveness. Example.com has basic HTML whereas Wikipedia has rich content, nested structures, tables, and complex formatting that better exercises the scraper's capabilities.

Consider:

  • Keep example.com for basic connectivity/smoke tests
  • Add additional tests with more complex HTML to ensure robust parsing
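
One way to act on the second suggestion without depending on an external site is to serve a richer HTML page from a local server during the test. The following is a minimal sketch using only the standard library; the fixture markup, `serve_fixture`, and `fetch` are hypothetical names for illustration, and a real test would pass the yielded URL to the crawler under test rather than to `fetch`.

```python
import threading
import urllib.request
from contextlib import contextmanager
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical fixture: richer than a one-paragraph page, still fully under test control.
RICH_HTML = b"""<!DOCTYPE html>
<html><head><title>Fixture page</title></head>
<body>
  <h1>Heading</h1>
  <div><p>First paragraph with a <a href="/next">link</a>.</p></div>
  <table><tr><td>cell</td></tr></table>
</body></html>"""


class FixtureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the same fixture page for every path.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(RICH_HTML)))
        self.end_headers()
        self.wfile.write(RICH_HTML)

    def log_message(self, format, *args):
        pass  # keep test output quiet


@contextmanager
def serve_fixture():
    """Start a throwaway server on a free local port and yield its base URL."""
    server = HTTPServer(("127.0.0.1", 0), FixtureHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_port}/"
    finally:
        server.shutdown()
        server.server_close()
        thread.join()


def fetch(url: str) -> str:
    """Fetch a URL directly, ignoring any proxy settings in the environment."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

With this arrangement, example.com can remain the connectivity smoke test while parsing behaviour is checked against markup that does not change or go offline.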
cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py (1)

45-63: HTML validation test may be less meaningful with example.com.

This test validates that saved content contains parseable HTML with common elements (html, head, body, div, p). While example.com does have basic HTML structure, it's significantly simpler than Wikipedia's content. The test might pass but provide less confidence that complex real-world pages are handled correctly.
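
To illustrate the kind of check described here, the sketch below reports which of the listed elements are absent from a page. It is an assumption-laden stand-in: it uses the standard library's `html.parser` instead of the project's actual parsing code, and `missing_tags` is a hypothetical helper, not the test in this PR.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def missing_tags(html, required=("html", "head", "body", "div", "p")):
    """Return the required tag names that never appear in the document."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return [tag for tag in required if tag not in collector.tags]


if __name__ == "__main__":
    simple = "<html><head></head><body><p>Hello</p></body></html>"
    print(missing_tags(simple))
```

Running a check like this against both a minimal page and a richer fixture makes it visible which elements the minimal page never exercises.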

♻️ Duplicate comments (3)
cognee/tests/integration/web_url_crawler/test_default_url_crawler.py (1)

8-13: Duplicate concern: test coverage reduced.

Same issue as test_tavily_crawler.py - switching from Wikipedia's complex HTML to example.com's minimal structure reduces test effectiveness for the crawler.

cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py (2)

17-17: Duplicate concern: test coverage reduced.

Same issue flagged in test_tavily_crawler.py.


71-71: Duplicate concern: test coverage reduced.

Same URL change concern flagged in earlier files applies to all these test functions.

Also applies to: 87-87, 97-97, 111-111, 124-124, 162-162, 193-193, 220-220, 256-256, 293-293

🧹 Nitpick comments (1)
cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py (1)

143-147: Extraction rules test needs adequate HTML structure.

This test validates BeautifulSoup's extraction_rules for titles, headings (h1/h2/h3), links, and paragraphs. Example.com has minimal content, so the extraction rules may not be thoroughly exercised. Consider using a test fixture with richer HTML structure to better validate the extraction functionality.
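
The concern can be made concrete with a small extractor run against a richer fixture. This is only a sketch: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and the rule names and `extract` function are hypothetical rather than cognee's extraction_rules format.

```python
from html.parser import HTMLParser


class RuleExtractor(HTMLParser):
    """Collect title, heading, and paragraph text plus link targets."""

    # Map each text-bearing tag to the result bucket it feeds.
    TEXT_TAGS = {"title": "title", "h1": "headings", "h2": "headings",
                 "h3": "headings", "p": "paragraphs"}

    def __init__(self):
        super().__init__()
        self.result = {"title": [], "headings": [], "links": [], "paragraphs": []}
        self._open = []  # stack of (tag, text chunks) for open text-bearing tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.result["links"].append(href)
        if tag in self.TEXT_TAGS:
            self._open.append((tag, []))

    def handle_data(self, data):
        for _, chunks in self._open:
            chunks.append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, chunks = self._open.pop()
            text = " ".join("".join(chunks).split())  # normalise whitespace
            if text:
                self.result[self.TEXT_TAGS[tag]].append(text)


def extract(html):
    """Parse an HTML string and return the collected fields as a dict."""
    parser = RuleExtractor()
    parser.feed(html)
    parser.close()
    return parser.result


RICH = ("<html><head><title>Demo</title></head><body><h1>Top</h1><h2>Sub</h2>"
        "<p>Hello <a href='/x'>world</a>.</p></body></html>")
MINIMAL = "<html><body><h1>Only</h1><p>Text</p></body></html>"
```

Here `extract(RICH)` fills all four fields, while `extract(MINIMAL)` leaves `title` and `links` empty, which is the gap a richer fixture is meant to close.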

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d99a7ff and 0fd939c.

📒 Files selected for processing (3)
  • cognee/tests/integration/web_url_crawler/test_default_url_crawler.py (1 hunks)
  • cognee/tests/integration/web_url_crawler/test_tavily_crawler.py (1 hunks)
  • cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py (13 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/tests/integration/web_url_crawler/test_tavily_crawler.py
  • cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py
  • cognee/tests/integration/web_url_crawler/test_default_url_crawler.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/tests/integration/web_url_crawler/test_tavily_crawler.py
  • cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py
  • cognee/tests/integration/web_url_crawler/test_default_url_crawler.py
cognee/tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests

Files:

  • cognee/tests/integration/web_url_crawler/test_tavily_crawler.py
  • cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py
  • cognee/tests/integration/web_url_crawler/test_default_url_crawler.py
🧠 Learnings (2)
📚 Learning: 2024-11-13T14:55:05.912Z
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.

Applied to files:

  • cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py
📚 Learning: 2025-11-24T16:45:09.996Z
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:45:09.996Z
Learning: Applies to cognee/tests/**/*.py : Name Python test files test_*.py and use pytest.mark.asyncio for async tests

Applied to files:

  • cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py
🧬 Code graph analysis (1)
cognee/tests/integration/web_url_crawler/test_url_adding_e2e.py (1)
cognee/tasks/ingestion/save_data_item_to_storage.py (1)
  • save_data_item_to_storage (26-99)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: End-to-End Tests / Conversation sessions test
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Basic Tests / Run Formatting Check
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Basic Tests / Run Integration Tests
🔇 Additional comments (1)
cognee/tests/integration/web_url_crawler/test_default_url_crawler.py (1)

8-8: Now let me search more broadly for HTTPS tests and check the git history to understand what changed:

#!/bin/bash
# Search for HTTPS URLs in the entire test suite
rg -n --type=py 'https://' cognee/tests/ | head -20

#!/bin/bash
# Check git diff to see what actually changed at line 8
git diff HEAD^ HEAD -- cognee/tests/integration/web_url_crawler/test_default_url_crawler.py

@hajdul88 hajdul88 changed the title from "backport: Adds lance-namespace version fix to toml (fixes lancedb issue with 0.2.0 lance-namespace version)" to "backport: Adds lance-namespace version fix to toml (fixes lancedb issue with 0.2.0 lance-namespace version) + crawler ingetration test url fix" Nov 27, 2025
@hajdul88 hajdul88 requested review from dexters1 and pazone November 27, 2025 12:46
pazone previously approved these changes Nov 27, 2025
@hajdul88 hajdul88 closed this Nov 27, 2025
@hajdul88 hajdul88 requested a review from pazone November 27, 2025 13:51
@dexters1 dexters1 reopened this Nov 27, 2025
@Vasilije1990 Vasilije1990 merged commit 00b60ae into main Nov 27, 2025
133 of 136 checks passed
@Vasilije1990 Vasilije1990 deleted the backport-lance-namespace-error-fix branch November 27, 2025 18:47

5 participants