feat: web scraping connector task #1501
Conversation
Please make sure all the checkboxes are checked:
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI. You can disable this status message in the CodeRabbit configuration.

Walkthrough

Adds a web-scraping and URL-ingestion subsystem: context-level configs for Tavily and a soup crawler, a BeautifulSoup-based async crawler (with optional Playwright and robots.txt support), Tavily integration, fetch utilities, storage updates to save fetched HTML, scraping tasks (including cron), tests, and package exports.

Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant API as add(...)
participant Ctx as ContextVars
participant Ingest as save_data_item_to_storage
participant Utils as fetch_page_content
participant BS4 as BeautifulSoupCrawler
participant Tav as Tavily Client
Note over API,Ctx: context wiring (tavily_config / soup_crawler_config)
User->>API: call add(data, extraction_rules?, tavily_config?, soup_crawler_config?)
API->>Ctx: set tavily_config, set soup_crawler_config
API->>Ingest: enqueue/save data_item
alt data_item is HTTP/HTTPS URL
Ingest->>Utils: fetch_page_content(url, preferred_tool from context/config)
alt preferred_tool = beautifulsoup
Utils->>BS4: fetch_with_bs4(url, extraction_rules)
BS4-->>Utils: {url: extracted_text}
else preferred_tool = tavily
Utils->>Tav: AsyncTavilyClient.extract(urls, depth/timeout)
Tav-->>Utils: {url: content}
end
Utils-->>Ingest: {url: content}
Ingest-->>API: save serialized content to storage
else non-HTTP path
Ingest-->>API: save original content/path
end
sequenceDiagram
autonumber
participant Caller
participant Task as web_scraper_task
participant Utils as fetch_page_content
participant Graph as GraphStore
Caller->>Task: web_scraper_task(urls, configs...)
Task->>Utils: fetch_page_content(urls, configs)
Utils-->>Task: {url: content}
loop each URL
Task->>Graph: upsert WebSite
Task->>Graph: upsert WebPage (content, hash, metadata)
Task->>Graph: upsert ScrapingJob
Task->>Graph: create edges (is_part_of, is_scraping)
Task->>Graph: index datapoints & edges
end
Task-->>Caller: return graph data
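To make the diagrams concrete, here is a hedged usage sketch assembled only from names that appear in this review (cognee.add, cognee.cognify, TavilyConfig, SoupCrawlerConfig, and the cognee.tasks.web_scraper.config module); exact signatures, import paths, and the shape of extraction_rules may differ in the merged code.

import asyncio

import cognee
from cognee.tasks.web_scraper.config import SoupCrawlerConfig, TavilyConfig


async def main():
    # BeautifulSoup path: extraction rules drive fetch_with_bs4 (rule shape assumed here).
    soup_config = SoupCrawlerConfig(extraction_rules={"title": "h1", "body": "p"})

    # Tavily path: TavilyConfig is reported to default api_key from TAVILY_API_KEY.
    tavily_config = TavilyConfig()

    # URL data items are detected in save_data_item_to_storage and fetched via
    # fetch_page_content using the preferred tool taken from these configs.
    await cognee.add(
        "https://example.com",
        soup_crawler_config=soup_config,
        tavily_config=tavily_config,
    )
    await cognee.cognify()


if __name__ == "__main__":
    asyncio.run(main())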
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

PR addresses issue
Actionable comments posted: 11
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
pyproject.toml is excluded by !**/*.toml
uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (12)
cognee/api/v1/add/add.py (7 hunks)
cognee/context_global_variables.py (1 hunks)
cognee/tasks/ingestion/ingest_data.py (2 hunks)
cognee/tasks/ingestion/save_data_item_to_storage.py (3 hunks)
cognee/tasks/web_scraper/__init__.py (1 hunks)
cognee/tasks/web_scraper/bs4_crawler.py (1 hunks)
cognee/tasks/web_scraper/config.py (1 hunks)
cognee/tasks/web_scraper/models.py (1 hunks)
cognee/tasks/web_scraper/utils.py (1 hunks)
cognee/tasks/web_scraper/web_scraper_task.py (1 hunks)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1 hunks)
notebooks/data/graphrag (0 hunks)
💤 Files with no reviewable changes (1)
- notebooks/data/graphrag
🧰 Additional context used
🧬 Code graph analysis (9)
cognee/tasks/ingestion/save_data_item_to_storage.py (3)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
  cognee/infrastructure/files/storage/S3FileStorage.py (1)
    save_data_to_file (62-74)
  cognee/modules/ingestion/save_data_to_file.py (1)
    save_data_to_file (7-28)
cognee/tasks/web_scraper/__init__.py (3)
  cognee/tasks/web_scraper/bs4_crawler.py (1)
    BeautifulSoupCrawler (69-549)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
  cognee/tasks/web_scraper/web_scraper_task.py (2)
    web_scraper_task (111-327)
    cron_web_scraper_task (37-108)
cognee/tests/tasks/web_scraping/web_scraping_test.py (4)
  cognee/tasks/web_scraper/config.py (1)
    SoupCrawlerConfig (13-23)
  cognee/tasks/web_scraper/web_scraper_task.py (1)
    cron_web_scraper_task (37-108)
  cognee/api/v1/add/add.py (1)
    add (23-218)
  cognee/modules/search/types/SearchType.py (1)
    SearchType (4-19)
cognee/tasks/web_scraper/web_scraper_task.py (5)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-20)
    WebSite (23-34)
    ScrapingJob (37-47)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-23)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
cognee/tasks/web_scraper/models.py (1)
  cognee/infrastructure/engine/models/DataPoint.py (1)
    DataPoint (20-220)
cognee/api/v1/add/add.py (3)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-23)
  cognee/tasks/ingestion/resolve_data_directories.py (1)
    resolve_data_directories (10-81)
  cognee/tasks/ingestion/ingest_data.py (1)
    ingest_data (25-194)
cognee/tasks/web_scraper/utils.py (3)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/web_scraper/bs4_crawler.py (2)
    BeautifulSoupCrawler (69-549)
    fetch_with_bs4 (463-549)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-23)
cognee/tasks/web_scraper/bs4_crawler.py (1)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
cognee/tasks/ingestion/ingest_data.py (1)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-23)
    TavilyConfig (6-10)
🪛 Pylint (3.3.8)
cognee/tasks/ingestion/save_data_item_to_storage.py
[refactor] 31-31: Too many local variables (16/15)
(R0914)
[refactor] 31-31: Too many return statements (8/6)
(R0911)
[refactor] 31-31: Too many branches (16/12)
(R0912)
[refactor] 59-59: Consider merging these comparisons with 'in' by using 'parsed_url.scheme in ('http', 'https')'. Use a set instead if elements are hashable.
(R1714)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 37-37: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many local variables (33/15)
(R0914)
[refactor] 111-111: Too many statements (52/50)
(R0915)
[refactor] 389-392: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
cognee/tasks/web_scraper/models.py
[refactor] 6-6: Too few public methods (0/2)
(R0903)
[refactor] 23-23: Too few public methods (0/2)
(R0903)
[refactor] 37-37: Too few public methods (0/2)
(R0903)
cognee/tasks/web_scraper/utils.py
[refactor] 48-76: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
cognee/tasks/web_scraper/bs4_crawler.py
[refactor] 69-69: Too many instance attributes (12/7)
(R0902)
[refactor] 85-85: Too many arguments (8/5)
(R0913)
[refactor] 441-461: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 406-406: Too many branches (15/12)
(R0912)
[refactor] 463-463: Too many arguments (6/5)
(R0913)
[refactor] 463-463: Too many local variables (18/15)
(R0914)
Actionable comments posted: 4
♻️ Duplicate comments (1)
cognee/tasks/web_scraper/utils.py (1)
62-72: Guard soup_crawler_config before dereferencing it

The condition if not soup_crawler_config and soup_crawler_config.extraction_rules is None: still dereferences soup_crawler_config when it is None, raising AttributeError and preventing BeautifulSoup scraping from ever running without an explicit config object. Please short-circuit the check before touching .extraction_rules, e.g.:

- if not soup_crawler_config and soup_crawler_config.extraction_rules is None:
+ if not soup_crawler_config or soup_crawler_config.extraction_rules is None:
      raise ValueError("extraction_rules must be provided when not using Tavily")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (7)
cognee/api/v1/add/add.py (7 hunks)
cognee/tasks/ingestion/save_data_item_to_storage.py (3 hunks)
cognee/tasks/web_scraper/config.py (1 hunks)
cognee/tasks/web_scraper/models.py (1 hunks)
cognee/tasks/web_scraper/utils.py (1 hunks)
cognee/tasks/web_scraper/web_scraper_task.py (1 hunks)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
cognee/tests/tasks/web_scraping/web_scraping_test.py (6)
  cognee/tasks/web_scraper/config.py (1)
    SoupCrawlerConfig (12-22)
  cognee/tasks/web_scraper/web_scraper_task.py (1)
    cron_web_scraper_task (37-108)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/api/v1/visualize/visualize.py (1)
    visualize_graph (14-27)
  cognee/api/v1/add/add.py (1)
    add (23-225)
  cognee/modules/search/types/SearchType.py (1)
    SearchType (4-19)
cognee/tasks/ingestion/save_data_item_to_storage.py (2)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
  cognee/modules/ingestion/save_data_to_file.py (1)
    save_data_to_file (7-28)
cognee/api/v1/add/add.py (3)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-9)
    SoupCrawlerConfig (12-22)
  cognee/tasks/ingestion/resolve_data_directories.py (1)
    resolve_data_directories (10-81)
  cognee/tasks/ingestion/ingest_data.py (1)
    ingest_data (25-194)
cognee/tasks/web_scraper/models.py (1)
  cognee/infrastructure/engine/models/DataPoint.py (1)
    DataPoint (20-220)
cognee/tasks/web_scraper/utils.py (3)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/web_scraper/bs4_crawler.py (2)
    BeautifulSoupCrawler (69-549)
    fetch_with_bs4 (463-549)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-9)
    SoupCrawlerConfig (12-22)
cognee/tasks/web_scraper/web_scraper_task.py (6)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (12-22)
    TavilyConfig (6-9)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
🪛 Pylint (3.3.8)
cognee/tests/tasks/web_scraping/web_scraping_test.py
[refactor] 149-169: Unnecessary "else" after "break", remove the "else" and de-indent the code inside it
(R1723)
cognee/tasks/ingestion/save_data_item_to_storage.py
[refactor] 31-31: Too many local variables (17/15)
(R0914)
[refactor] 31-31: Too many return statements (8/6)
(R0911)
[refactor] 31-31: Too many branches (19/12)
(R0912)
[refactor] 59-59: Consider merging these comparisons with 'in' by using 'parsed_url.scheme in ('http', 'https')'. Use a set instead if elements are hashable.
(R1714)
cognee/tasks/web_scraper/models.py
[refactor] 6-6: Too few public methods (0/2)
(R0903)
[refactor] 22-22: Too few public methods (0/2)
(R0903)
[refactor] 36-36: Too few public methods (0/2)
(R0903)
cognee/tasks/web_scraper/utils.py
[refactor] 48-76: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 37-37: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many local variables (34/15)
(R0914)
[refactor] 111-111: Too many statements (54/50)
(R0915)
[refactor] 387-390: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
Actionable comments posted: 2
♻️ Duplicate comments (1)
cognee/tasks/web_scraper/bs4_crawler.py (1)
519-524: Enforce rate limiting in the Playwright path.

Lines 519-522 call _render_with_playwright without enforcing rate limiting, whereas the HTTPX path (line 524) calls _fetch_httpx, which internally enforces rate limiting via _respect_rate_limit (line 313). This inconsistency means Playwright requests ignore robots.txt crawl-delay directives and the configured per-domain throttling, potentially violating site policies and risking IP blocks. Apply this diff to enforce rate limiting before Playwright rendering:

  if use_playwright:
+     crawl_delay = await self._get_crawl_delay(url)
+     await self._respect_rate_limit(url, crawl_delay)
      html = await self._render_with_playwright(
          url, js_wait=playwright_js_wait, timeout=self.timeout
      )
🧹 Nitpick comments (1)
cognee/tasks/web_scraper/config.py (1)
20-20: Consider making extraction_rules optional.

The extraction_rules field has no default value, making it a required field at instantiation. However, the validation logic in utils.py (line 61) checks whether extraction_rules is None before using it, suggesting the field should be optional at the config level. Making this field optional would allow partial config construction and improve flexibility. Apply this diff:

- extraction_rules: Dict[str, Any]
+ extraction_rules: Optional[Dict[str, Any]] = None
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cognee/tasks/web_scraper/bs4_crawler.py (1 hunks)
cognee/tasks/web_scraper/config.py (1 hunks)
cognee/tasks/web_scraper/utils.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-06T14:08:59.915Z
Learnt from: Geoff-Robin
PR: topoteretes/cognee#1501
File: cognee/tasks/web_scraper/utils.py:108-108
Timestamp: 2025-10-06T14:08:59.915Z
Learning: In the Tavily Python library, AsyncTavilyClient does not accept a timeout parameter at the constructor level. Instead, the timeout parameter should be passed to the extract method along with other parameters like extract_depth and proxy.
Applied to files:
cognee/tasks/web_scraper/utils.py
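A minimal sketch of the call pattern this learning describes, assuming the tavily-python AsyncTavilyClient; the extract keyword names follow the learning above, and the response shape shown in the comments is an assumption rather than a documented guarantee.

import asyncio
import os

from tavily import AsyncTavilyClient  # tavily-python package


async def extract_pages(urls):
    # Per the learning: no timeout at the constructor; pass it to extract() instead.
    client = AsyncTavilyClient(api_key=os.environ["TAVILY_API_KEY"])
    response = await client.extract(
        urls,
        extract_depth="basic",  # "advanced" is also reported to be supported
        timeout=30,
    )
    # Assumed response shape: {"results": [{"url": ..., "raw_content": ...}], ...}
    return {item["url"]: item["raw_content"] for item in response.get("results", [])}


if __name__ == "__main__":
    print(asyncio.run(extract_pages(["https://example.com"])))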
📚 Learning: 2025-10-07T15:48:53.427Z
Learnt from: Geoff-Robin
PR: topoteretes/cognee#1501
File: cognee/tasks/web_scraper/bs4_crawler.py:140-167
Timestamp: 2025-10-07T15:48:53.427Z
Learning: In cognee/tasks/web_scraper/bs4_crawler.py, the helper methods `_domain_from_url`, `_get_domain_root`, and `_normalize_rule` should be instance methods with `self` as the first parameter, not static methods.
Applied to files:
cognee/tasks/web_scraper/bs4_crawler.py
🧬 Code graph analysis (2)
cognee/tasks/web_scraper/utils.py (3)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/web_scraper/bs4_crawler.py (3)
    BeautifulSoupCrawler (69-546)
    fetch_with_bs4 (460-546)
    close (125-129)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-24)
cognee/tasks/web_scraper/bs4_crawler.py (1)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/bs4_crawler.py
[refactor] 69-69: Too many instance attributes (12/7)
(R0902)
[refactor] 85-85: Too many arguments (8/5)
(R0913)
[refactor] 438-458: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 403-403: Too many branches (15/12)
(R0912)
[refactor] 460-460: Too many arguments (6/5)
(R0913)
[refactor] 460-460: Too many local variables (18/15)
(R0914)
… allow None value
<!-- .github/pull_request_template.md -->

## Description

Cognee release v0.3.5

## Type of Change

<!-- Please check the relevant option -->

- [ ] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation update
- [ ] Code refactoring
- [ ] Performance improvement
- [ ] Other (please specify):

## Screenshots/Videos (if applicable)

<!-- Add screenshots or videos to help explain your changes -->

## Pre-submission Checklist

<!-- Please check all boxes that apply before submitting your PR -->

- [ ] **I have tested my changes thoroughly before submitting this PR**
- [ ] **This PR contains minimal changes necessary to address the issue/feature**
- [ ] My code follows the project's coding standards and style guidelines
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added necessary documentation (if applicable)
- [ ] All new and existing tests pass
- [ ] I have searched existing PRs to ensure this change hasn't been submitted already
- [ ] I have linked any relevant issues in the description
- [ ] My commits have clear and descriptive messages

## DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Actionable comments posted: 2
♻️ Duplicate comments (1)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1)
50-59: Remove unsupported structured parameter.

Line 58 passes structured=True to SoupCrawlerConfig, but this field doesn't exist in the model definition (see cognee/tasks/web_scraper/config.py lines 12-23). This will raise a Pydantic ValidationError at runtime. Apply this diff to remove the unsupported parameter:

  soup_config = SoupCrawlerConfig(
      concurrency=1,
      crawl_delay=0.1,
      timeout=10.0,
      max_retries=1,
      retry_delay_factor=0.5,
      extraction_rules=rules,
      use_playwright=False,
-     structured=True,
  )
🧹 Nitpick comments (2)
cognee/tasks/web_scraper/web_scraper_task.py (2)
159-183: Clarify variable naming and conditional logic.

Line 159 assigns scraping_job_created = await graph_db.get_node(...), but the variable name is misleading: it actually holds the existing job node (or None if it doesn't exist). The conditional at lines 182-183 adds the node only if it already exists, which is confusing since line 319 performs a batch add_nodes() for all nodes anyway. This pattern repeats at lines 226-227 and 240-241 for website nodes. The incremental add_node() calls appear redundant given the final batch operation. Rename the variable to clarify its purpose and remove redundant conditional updates:

- scraping_job_created = await graph_db.get_node(uuid5(NAMESPACE_OID, name=job_name))
+ existing_scraping_job = await graph_db.get_node(uuid5(NAMESPACE_OID, name=job_name))

  # Create description for ScrapingJob
  scraping_job_description = (
      # ... description content ...
  )

  scraping_job = ScrapingJob(
      # ... scraping job fields ...
  )

- if scraping_job_created:
-     await graph_db.add_node(scraping_job)  # Update existing scraping job

If incremental updates are intentional for performance or real-time visibility, add a comment explaining the rationale. Otherwise, rely solely on the batch add_nodes() at line 319.
386-389: Remove unnecessary else after return.

Lines 386-389 contain an else block following a return statement. The else is unnecessary since the code in that block will only execute if the if condition is false. As per coding guidelines, apply this diff to simplify the code:

  if full_path.startswith(base_path):
      return full_path[len(base_path) :].lstrip("/")
- else:
-     return full_path.lstrip("/")
+ return full_path.lstrip("/")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/tasks/web_scraper/web_scraper_task.py (1 hunks)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
cognee/tests/tasks/web_scraping/web_scraping_test.py (3)
  cognee/tasks/web_scraper/config.py (1)
    SoupCrawlerConfig (13-24)
  cognee/tasks/web_scraper/web_scraper_task.py (1)
    cron_web_scraper_task (37-108)
  cognee/api/v1/add/add.py (1)
    add (23-227)
cognee/tasks/web_scraper/web_scraper_task.py (5)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-24)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-86)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 37-37: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many local variables (34/15)
(R0914)
[refactor] 111-111: Too many statements (53/50)
(R0915)
[refactor] 386-389: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
🔇 Additional comments (4)
cognee/tasks/web_scraper/web_scraper_task.py (3)
27-32: LGTM: APScheduler import and initialization.

The import block correctly guards against missing APScheduler dependency and provides a clear error message. The BackgroundScheduler initialization is appropriate for this use case, though consider switching to AsyncIOScheduler for better async integration (see separate comment on lines 77-88).
327-359: LGTM: Argument validation logic.

The check_arguments function correctly validates and configures the scraping tools, with clear precedence logic (Tavily fallback when extraction_rules are absent) and appropriate error handling.
362-384: LGTM: URL path extraction with domain validation.

The get_path_after_base function correctly validates that both URLs share the same domain and extracts the relative path, preventing cross-domain path confusion.

cognee/tests/tasks/web_scraping/web_scraping_test.py (1)
119-152: Re-run cron web scraper test locally with proper environment set up
Install pytest, set PYTHONPATH to the repo root, and run:

  PYTHONPATH=. pytest cognee/tests/tasks/web_scraping/web_scraping_test.py::test_cron_web_scraper -q

If it fails, add await cognee.cognify() immediately after the cron_web_scraper_task call.
Actionable comments posted: 3
🧹 Nitpick comments (2)
cognee/tasks/web_scraper/web_scraper_task.py (2)
271-271: Refine status logic to handle partial failures.

Line 271 sets the job status to "completed" if any webpages were scraped, otherwise "failed". This doesn't account for partial failures: if some URLs succeed and others fail, the job will still be marked "completed", masking issues. Consider tracking success/failure counts:

+ successful_pages = len(webpages)
+ failed_pages = len(url) - successful_pages
+
+ if failed_pages == 0:
+     scraping_job.status = "completed"
+ elif successful_pages == 0:
+     scraping_job.status = "failed"
+ else:
+     scraping_job.status = "partial"
- scraping_job.status = "completed" if webpages else "failed"

Update the description to include these counts:

  scraping_job.description = (
      f"Scraping job: {job_name}\n"
      f"URLs: {', '.join(url)}\n"
      f"Status: {scraping_job.status}\n"
+     f"Successful: {successful_pages}/{len(url)}\n"
      f"Last run: {now.strftime('%Y-%m-%d %H:%M:%S')}\n"
      f"Next run: {next_run.strftime('%Y-%m-%d %H:%M:%S') if next_run else 'Not scheduled'}"
  )
360-387: Remove unused function or document its intended use.

The get_path_after_base function is defined but never called within this file or imported in the __init__.py (based on the provided context). This appears to be dead code. Additionally, the else clause at lines 384-387 is unnecessary after the return statement. If this function is not needed, remove it:

- def get_path_after_base(base_url: str, url: str) -> str:
-     """Extract the path after the base URL.
-
-     Args:
-         base_url: The base URL (e.g., "https://example.com").
-         url: The full URL to extract the path from.
-
-     Returns:
-         str: The path after the base URL, with leading slashes removed.
-
-     Raises:
-         ValueError: If the base URL and target URL are from different domains.
-     """
-     parsed_base = urlparse(base_url)
-     parsed_url = urlparse(url)
-
-     # Ensure they have the same netloc (domain)
-     if parsed_base.netloc != parsed_url.netloc:
-         raise ValueError("Base URL and target URL are from different domains")
-
-     # Return everything after base_url path
-     base_path = parsed_base.path.rstrip("/")
-     full_path = parsed_url.path
-
-     if full_path.startswith(base_path):
-         return full_path[len(base_path) :].lstrip("/")
-     else:
-         return full_path.lstrip("/")

If it's intended for future use or external import, add a docstring note explaining its purpose and remove the unnecessary else:

  if full_path.startswith(base_path):
      return full_path[len(base_path) :].lstrip("/")
- else:
-     return full_path.lstrip("/")
+
+ return full_path.lstrip("/")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (1)
cognee/tasks/web_scraper/web_scraper_task.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/tasks/web_scraper/web_scraper_task.py (5)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-24)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-86)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 36-36: Too many arguments (7/5)
(R0913)
[refactor] 109-109: Too many arguments (7/5)
(R0913)
[refactor] 109-109: Too many local variables (34/15)
(R0914)
[refactor] 109-109: Too many statements (53/50)
(R0915)
[refactor] 384-387: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
🔇 Additional comments (2)
cognee/tasks/web_scraper/web_scraper_task.py (2)
92-93: LGTM: Scheduler lifecycle correctly managed.

The conditional check if not scheduler.running: before calling scheduler.start() correctly prevents SchedulerAlreadyRunningError on subsequent job registrations.
325-358: LGTM: Argument validation and config construction is correct.

The function correctly handles the various parameter combinations:

- Constructs SoupCrawlerConfig from extraction_rules if needed.
- Constructs or updates TavilyConfig from tavily_api_key if provided.
- Selects preferred_tool based on available configurations.
- Raises TypeError if neither config can be constructed.
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/tasks/web_scraper/web_scraper_task.py (4)
29-29: Update the APScheduler version recommendation.

The import error message suggests installing APScheduler>=3.10, but the latest stable version is 3.11.0 (released November 2024). Apply this diff to update the version recommendation:

- raise ImportError("Please install apscheduler by pip install APScheduler>=3.10")
+ raise ImportError("Please install apscheduler by pip install APScheduler>=3.11")

Based on learnings.
98-98: Remove redundant fallback expression.

Line 77 ensures job_name is always truthy (either from the parameter or a timestamp-based default), so the fallback job_name or f"WebScraper_{uuid5(...)}" at line 98 is redundant and could never use the UUID-based fallback. Apply this diff to simplify:

- name=job_name or f"WebScraper_{uuid5(NAMESPACE_OID, name=job_name)}",
+ name=job_name,
118-331: Consider breaking down this large function.

The web_scraper_task function is quite long, with 34 local variables and 53 statements. It handles multiple responsibilities: validation, fetching content, creating nodes, managing edges, and indexing. This makes the function harder to test and maintain. Consider extracting logical sub-operations into helper functions (a sketch follows this list):

- _create_scraping_job_node(...) → lines 166-187
- _process_scraped_content(...) → lines 201-278
- _build_graph_structure(...) → lines 290-324

This would improve readability and make unit testing easier.
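A decomposition sketch along the lines suggested above; the helper names come from this comment, while the signatures, payload shapes, and bodies are placeholders rather than the actual cognee implementation.

from typing import Any, Dict, List, Tuple
from urllib.parse import urlparse


async def _create_scraping_job_node(job_name: str, urls: List[str]) -> Dict[str, Any]:
    # Placeholder for building/upserting the ScrapingJob node for this run.
    return {"name": job_name, "urls": urls, "status": "running"}


async def _process_scraped_content(scraped: Dict[str, str]) -> Tuple[List[dict], List[dict]]:
    # Placeholder for turning {url: content} into WebPage / WebSite payloads.
    webpages = [{"url": url, "content": content} for url, content in scraped.items()]
    domains = sorted({urlparse(url).netloc for url in scraped})
    return webpages, [{"domain": domain} for domain in domains]


async def _build_graph_structure(job: dict, websites: List[dict], webpages: List[dict]) -> dict:
    # Placeholder for creating is_part_of / is_scraping edges and indexing datapoints.
    return {"job": job, "websites": websites, "webpages": webpages}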
393-396: Remove unnecessary else after return.

The else clause at line 395 is redundant since the if block at line 393 returns early. Apply this diff:

  if full_path.startswith(base_path):
      return full_path[len(base_path) :].lstrip("/")
- else:
-     return full_path.lstrip("/")
+ return full_path.lstrip("/")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tasks/web_scraper/web_scraper_task.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/tasks/web_scraper/web_scraper_task.py (6)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-24)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-86)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 44-44: Too many arguments (7/5)
(R0913)
[refactor] 118-118: Too many arguments (7/5)
(R0913)
[refactor] 118-118: Too many local variables (34/15)
(R0914)
[refactor] 118-118: Too many statements (53/50)
(R0915)
[refactor] 393-396: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/tasks/web_scraper/web_scraper_task.py (4)
166-190: Clarify conditional node update pattern.

The logic checks scraping_job_created (which is truthy when the job already exists) and then calls add_node() to update it. While this is correct per the learning that add_node() performs upserts, the pattern might cause confusion. However, line 326 performs a bulk add_nodes() operation for all nodes, which also upserts. The intermediate conditional updates (lines 189-190, 233-234, 247-248) become redundant since the final bulk operation will handle both creation and updates. Consider removing the intermediate conditional add_node() calls and relying solely on the bulk operation at line 326, unless the intermediate updates serve a specific purpose (e.g., immediate consistency for long-running scrapes).
393-396: Remove unnecessary else after return.

The else block after the return statement is unnecessary and reduces readability. Apply this diff:

  if full_path.startswith(base_path):
      return full_path[len(base_path) :].lstrip("/")
- else:
-     return full_path.lstrip("/")
+ return full_path.lstrip("/")
101-103: Add scheduler shutdown cleanup.

No calls to scheduler.shutdown() or scheduler.stop() were found; register a cleanup handler (e.g. via atexit.register or an explicit teardown) to properly release scheduler resources.
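One possible shape of that cleanup, assuming the lazily created AsyncIOScheduler and get_scheduler() helper described elsewhere in this review; this is an illustrative sketch, not the project's actual code.

import atexit
from typing import Optional

from apscheduler.schedulers.asyncio import AsyncIOScheduler

_scheduler: Optional[AsyncIOScheduler] = None


def get_scheduler() -> AsyncIOScheduler:
    global _scheduler
    if _scheduler is None:
        _scheduler = AsyncIOScheduler()
        # Release scheduler resources when the interpreter exits.
        atexit.register(_shutdown_scheduler)
    return _scheduler


def _shutdown_scheduler() -> None:
    if _scheduler is not None and _scheduler.running:
        _scheduler.shutdown(wait=False)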
326-329: Avoid indexing full WebPage.content in vector embeddings.

WebPage.metadata["index_fields"] includes "content", so index_data_points will embed the entire HTML payload (potentially megabytes) and may cause performance or memory issues. Restrict index_fields to ["name", "description"] or truncate/summarize content before embedding.
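A sketch of the suggested restriction, assuming cognee's DataPoint convention of a class-level metadata dict carrying index_fields (the import path is taken from the code-graph listings above); only the fields relevant to this comment are shown.

from typing import Optional

from cognee.infrastructure.engine.models.DataPoint import DataPoint


class WebPage(DataPoint):
    name: Optional[str] = None
    description: Optional[str] = None
    content: Optional[str] = None  # full page text/HTML stays stored, just not embedded

    # Embed only the short fields; drop "content" to avoid megabyte-sized embeddings.
    metadata: dict = {"index_fields": ["name", "description"]}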
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/tasks/web_scraper/models.py (1 hunks)
cognee/tasks/web_scraper/web_scraper_task.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
cognee/tasks/web_scraper/web_scraper_task.py (5)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-24)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-86)
cognee/tasks/web_scraper/models.py (1)
cognee/infrastructure/engine/models/DataPoint.py (1)
DataPoint(20-220)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 44-44: Too many arguments (7/5)
(R0913)
[refactor] 118-118: Too many arguments (7/5)
(R0913)
[refactor] 118-118: Too many local variables (35/15)
(R0914)
[refactor] 118-118: Too many statements (54/50)
(R0915)
[refactor] 393-396: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
cognee/tasks/web_scraper/models.py
[refactor] 6-6: Too few public methods (0/2)
(R0903)
[refactor] 22-22: Too few public methods (0/2)
(R0903)
[refactor] 36-36: Too few public methods (0/2)
(R0903)
🔇 Additional comments (6)
cognee/tasks/web_scraper/models.py (3)
6-19: LGTM! Models are well-structured.

The WebPage model fields are appropriate for representing scraped web pages. The metadata index_fields now correctly reference existing attributes (name, description, content), addressing the previous concern about misaligned field names.

Minor observation: name is typed as Optional[str] but is always populated with the page URL in practice (see web_scraper_task.py line 267). Consider making it non-optional (name: str) to better reflect actual usage and prevent potential None values in the embeddable data.
22-33: LGTM! Website model is well-defined.

The WebSite model appropriately captures domain-level metadata. The index_fields align with actual attributes, ensuring proper vector indexing.
36-46: LGTM! Scraping job model is appropriate.

The ScrapingJob model correctly represents job configuration with scheduling support. The index_fields are properly aligned with the model's attributes.

cognee/tasks/web_scraper/web_scraper_task.py (3)
34-41: Good lazy initialization pattern.

The get_scheduler() function correctly implements lazy initialization, addressing the previous concern about module-level scheduler instantiation. This ensures the AsyncIOScheduler is created only when needed.
253-263: LGTM! Content preview avoids performance issues.

The description correctly uses a truncated content_preview (first 500 characters) instead of embedding the full content, as addressed in previous reviews. The full content remains in the content field where it belongs.
265-267: LGTM! UUID generation prevents collisions.

Using the full page_url for UUID generation (instead of just the path) correctly prevents ID collisions across different domains, as addressed in previous reviews.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/api/v1/add/add.py (1)
153-153: Fix the typo in the documentation comment.

The documentation still contains a grammatical error flagged in a previous review. The line should read "Make sure to set TAVILY_API_KEY=YOUR_TAVILY_API_KEY as an environment variable".

Apply this diff:

- Make sure to set TAVILY_API_KEY = YOUR_TAVILY_API_KEY as a environment variable
+ Make sure to set TAVILY_API_KEY=YOUR_TAVILY_API_KEY as an environment variable
🧹 Nitpick comments (1)
cognee/api/v1/add/add.py (1)
175-181: Use default TavilyConfig constructor and remove redundant env var guard.

ContextVar.set accepts None without error. Since TavilyConfig already defaults its api_key from TAVILY_API_KEY, simplify in cognee/api/v1/add/add.py:

- if not tavily_config and os.getenv("TAVILY_API_KEY"):
-     tavily_config = TavilyConfig(api_key=os.getenv("TAVILY_API_KEY"))
+ if not tavily_config:
+     tavily_config = TavilyConfig()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
cognee-mcp/pyproject.toml is excluded by !**/*.toml
pyproject.toml is excluded by !**/*.toml
uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (1)
cognee/api/v1/add/add.py(7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/api/v1/add/add.py (4)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-24)
  cognee/modules/pipelines/tasks/task.py (1)
    Task (5-97)
  cognee/tasks/ingestion/resolve_data_directories.py (1)
    resolve_data_directories (10-84)
  cognee/tasks/ingestion/ingest_data.py (1)
    ingest_data (24-193)
🔇 Additional comments (3)
cognee/api/v1/add/add.py (3)
2-3: LGTM! The new imports are appropriate for the web scraping functionality and are all used in the code.
Also applies to: 15-20
33-35: LGTM! The new parameters are properly typed and integrate well with the existing signature. Note that a previous review raised a question about whether extraction_rules and soup_crawler_config should both exist, but this remains a design decision for the maintainers.
183-191: LGTM! The URL detection logic properly handles type safety and works correctly for both single URLs and lists. The node_set modification happens before task construction, which addresses a previous review concern. The _is_http_url helper with an isinstance check prevents the TypeError that was flagged in an earlier review.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/api/v1/add/add.py (1)
153-153: Fix grammatical error in documentation.

The phrase should read "as an environment variable", not "as a environment variable".

Apply this diff:

- Make sure to set TAVILY_API_KEY = YOUR_TAVILY_API_KEY as a environment variable
+ Make sure to set TAVILY_API_KEY = YOUR_TAVILY_API_KEY as an environment variable
🧹 Nitpick comments (1)
cognee/api/v1/add/add.py (1)
33-35: Clarify precedence when both extraction_rules and soup_crawler_config are provided.

Lines 175-176 show that if soup_crawler_config is provided, extraction_rules is ignored. This could confuse users who pass both parameters. Consider either:
- Raising a clear error when both are provided, or
- Documenting the precedence explicitly in the docstring
Example validation:
+ if extraction_rules and soup_crawler_config:
+     raise ValueError(
+         "Cannot specify both 'extraction_rules' and 'soup_crawler_config'. "
+         "Pass extraction_rules via soup_crawler_config.extraction_rules instead."
+     )
  if not soup_crawler_config and extraction_rules:
      soup_crawler_config = SoupCrawlerConfig(extraction_rules=extraction_rules)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/api/v1/add/add.py (7 hunks)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/tasks/web_scraping/web_scraping_test.py
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/api/v1/add/add.py (3)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-24)
  cognee/tasks/ingestion/resolve_data_directories.py (1)
    resolve_data_directories (10-84)
  cognee/tasks/ingestion/ingest_data.py (1)
    ingest_data (24-193)
🔇 Additional comments (8)
cognee/api/v1/add/add.py (8)
2-3: LGTM! The import additions are necessary for environment variable access and type annotations used throughout the updated function.
15-20: LGTM! All imported modules are utilized in the function for web scraping configuration and URL handling.
41-41: LGTM! Documentation correctly reflects the new URL ingestion capability.
81-81: LGTM! The parameter documentation clearly explains the purpose and usage of URL inputs and web scraping configurations.
Also applies to: 92-94
171-171: LGTM! The environment variable documentation is correctly updated to include TAVILY_API_KEY.
183-191: LGTM! Past review concerns addressed.

The URL detection logic correctly:

- Guards against non-string payloads using an isinstance(item, str) check
- Updates node_set before task construction (moved from after, per previous feedback)
- Handles both single URLs and lists containing URLs
195-202: LGTM! The task construction correctly uses the updated node_set value after URL detection, ensuring web content is properly tagged during ingestion.
175-181: Context global handling and thread safety verified
ContextVar supports None defaults and per-context isolation, so no changes are required.
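A minimal standard-library illustration of the two properties this relies on (a None default and per-context isolation); the variable name mirrors the config discussed above but is otherwise just an example.

import contextvars

tavily_config_var = contextvars.ContextVar("tavily_config", default=None)


def read():
    return tavily_config_var.get()


assert read() is None  # a None default is perfectly valid

ctx = contextvars.copy_context()
ctx.run(tavily_config_var.set, "per-request config")

assert ctx.run(read) == "per-request config"  # visible inside that context
assert read() is None  # the outer context is untouched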
@borisarzentar Is it fine if I add some more support to extend it for serverless environments?

@Geoff-Robin this PR seems like it went through a review. How about we open an additional 2 tickets for serverless and refactor, and I merge this. Appreciate your help on this one!
Description
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.