feat: web scraping connector task #1501
Conversation
Please make sure all the checkboxes are checked:
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI. You can disable this status message in the CodeRabbit configuration.

Walkthrough

Adds a web-scraping and URL-ingestion subsystem: context-level configs for Tavily and a soup crawler, a BeautifulSoup-based async crawler (with optional Playwright and robots.txt support), Tavily integration, fetch utilities, storage updates to save fetched HTML, scraping tasks (including cron), tests, and package exports.

Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant API as add(...)
participant Ctx as ContextVars
participant Ingest as save_data_item_to_storage
participant Utils as fetch_page_content
participant BS4 as BeautifulSoupCrawler
participant Tav as Tavily Client
Note over API,Ctx: context wiring (tavily_config / soup_crawler_config)
User->>API: call add(data, extraction_rules?, tavily_config?, soup_crawler_config?)
API->>Ctx: set tavily_config, set soup_crawler_config
API->>Ingest: enqueue/save data_item
alt data_item is HTTP/HTTPS URL
Ingest->>Utils: fetch_page_content(url, preferred_tool from context/config)
alt preferred_tool = beautifulsoup
Utils->>BS4: fetch_with_bs4(url, extraction_rules)
BS4-->>Utils: {url: extracted_text}
else preferred_tool = tavily
Utils->>Tav: AsyncTavilyClient.extract(urls, depth/timeout)
Tav-->>Utils: {url: content}
end
Utils-->>Ingest: {url: content}
Ingest-->>API: save serialized content to storage
else non-HTTP path
Ingest-->>API: save original content/path
end
sequenceDiagram
autonumber
participant Caller
participant Task as web_scraper_task
participant Utils as fetch_page_content
participant Graph as GraphStore
Caller->>Task: web_scraper_task(urls, configs...)
Task->>Utils: fetch_page_content(urls, configs)
Utils-->>Task: {url: content}
loop each URL
Task->>Graph: upsert WebSite
Task->>Graph: upsert WebPage (content, hash, metadata)
Task->>Graph: upsert ScrapingJob
Task->>Graph: create edges (is_part_of, is_scraping)
Task->>Graph: index datapoints & edges
end
Task-->>Caller: return graph data
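To make the diagrams concrete, here is a hedged usage sketch assembled only from names that appear in this review (cognee.add, cognee.cognify, TavilyConfig, SoupCrawlerConfig, and the cognee.tasks.web_scraper.config module); exact signatures, import paths, and the shape of extraction_rules may differ in the merged code.

import asyncio

import cognee
from cognee.tasks.web_scraper.config import SoupCrawlerConfig, TavilyConfig


async def main():
    # BeautifulSoup path: extraction rules drive fetch_with_bs4 (rule shape assumed here).
    soup_config = SoupCrawlerConfig(extraction_rules={"title": "h1", "body": "p"})

    # Tavily path: TavilyConfig is reported to default api_key from TAVILY_API_KEY.
    tavily_config = TavilyConfig()

    # URL data items are detected in save_data_item_to_storage and fetched via
    # fetch_page_content using the preferred tool taken from these configs.
    await cognee.add(
        "https://example.com",
        soup_crawler_config=soup_config,
        tavily_config=tavily_config,
    )
    await cognee.cognify()


if __name__ == "__main__":
    asyncio.run(main())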
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

PR addresses issue
Actionable comments posted: 11
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
pyproject.toml is excluded by !**/*.toml
uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (12)
cognee/api/v1/add/add.py (7 hunks)
cognee/context_global_variables.py (1 hunks)
cognee/tasks/ingestion/ingest_data.py (2 hunks)
cognee/tasks/ingestion/save_data_item_to_storage.py (3 hunks)
cognee/tasks/web_scraper/__init__.py (1 hunks)
cognee/tasks/web_scraper/bs4_crawler.py (1 hunks)
cognee/tasks/web_scraper/config.py (1 hunks)
cognee/tasks/web_scraper/models.py (1 hunks)
cognee/tasks/web_scraper/utils.py (1 hunks)
cognee/tasks/web_scraper/web_scraper_task.py (1 hunks)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1 hunks)
notebooks/data/graphrag (0 hunks)
💤 Files with no reviewable changes (1)
- notebooks/data/graphrag
🧰 Additional context used
🧬 Code graph analysis (9)
cognee/tasks/ingestion/save_data_item_to_storage.py (3)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
  cognee/infrastructure/files/storage/S3FileStorage.py (1)
    save_data_to_file (62-74)
  cognee/modules/ingestion/save_data_to_file.py (1)
    save_data_to_file (7-28)
cognee/tasks/web_scraper/__init__.py (3)
  cognee/tasks/web_scraper/bs4_crawler.py (1)
    BeautifulSoupCrawler (69-549)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
  cognee/tasks/web_scraper/web_scraper_task.py (2)
    web_scraper_task (111-327)
    cron_web_scraper_task (37-108)
cognee/tests/tasks/web_scraping/web_scraping_test.py (4)
  cognee/tasks/web_scraper/config.py (1)
    SoupCrawlerConfig (13-23)
  cognee/tasks/web_scraper/web_scraper_task.py (1)
    cron_web_scraper_task (37-108)
  cognee/api/v1/add/add.py (1)
    add (23-218)
  cognee/modules/search/types/SearchType.py (1)
    SearchType (4-19)
cognee/tasks/web_scraper/web_scraper_task.py (5)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-20)
    WebSite (23-34)
    ScrapingJob (37-47)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-23)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
cognee/tasks/web_scraper/models.py (1)
  cognee/infrastructure/engine/models/DataPoint.py (1)
    DataPoint (20-220)
cognee/api/v1/add/add.py (3)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-23)
  cognee/tasks/ingestion/resolve_data_directories.py (1)
    resolve_data_directories (10-81)
  cognee/tasks/ingestion/ingest_data.py (1)
    ingest_data (25-194)
cognee/tasks/web_scraper/utils.py (3)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/web_scraper/bs4_crawler.py (2)
    BeautifulSoupCrawler (69-549)
    fetch_with_bs4 (463-549)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-23)
cognee/tasks/web_scraper/bs4_crawler.py (1)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
cognee/tasks/ingestion/ingest_data.py (1)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-23)
    TavilyConfig (6-10)
🪛 Pylint (3.3.8)
cognee/tasks/ingestion/save_data_item_to_storage.py
[refactor] 31-31: Too many local variables (16/15)
(R0914)
[refactor] 31-31: Too many return statements (8/6)
(R0911)
[refactor] 31-31: Too many branches (16/12)
(R0912)
[refactor] 59-59: Consider merging these comparisons with 'in' by using 'parsed_url.scheme in ('http', 'https')'. Use a set instead if elements are hashable.
(R1714)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 37-37: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many local variables (33/15)
(R0914)
[refactor] 111-111: Too many statements (52/50)
(R0915)
[refactor] 389-392: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
cognee/tasks/web_scraper/models.py
[refactor] 6-6: Too few public methods (0/2)
(R0903)
[refactor] 23-23: Too few public methods (0/2)
(R0903)
[refactor] 37-37: Too few public methods (0/2)
(R0903)
cognee/tasks/web_scraper/utils.py
[refactor] 48-76: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
cognee/tasks/web_scraper/bs4_crawler.py
[refactor] 69-69: Too many instance attributes (12/7)
(R0902)
[refactor] 85-85: Too many arguments (8/5)
(R0913)
[refactor] 441-461: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 406-406: Too many branches (15/12)
(R0912)
[refactor] 463-463: Too many arguments (6/5)
(R0913)
[refactor] 463-463: Too many local variables (18/15)
(R0914)
Actionable comments posted: 4
♻️ Duplicate comments (1)
cognee/tasks/web_scraper/utils.py (1)
62-72: Guard soup_crawler_config before dereferencing it

The condition if not soup_crawler_config and soup_crawler_config.extraction_rules is None: still dereferences soup_crawler_config when it is None, raising AttributeError and preventing BeautifulSoup scraping from ever running without an explicit config object. Please short-circuit the check before touching .extraction_rules, e.g.:

- if not soup_crawler_config and soup_crawler_config.extraction_rules is None:
+ if not soup_crawler_config or soup_crawler_config.extraction_rules is None:
      raise ValueError("extraction_rules must be provided when not using Tavily")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (7)
cognee/api/v1/add/add.py (7 hunks)
cognee/tasks/ingestion/save_data_item_to_storage.py (3 hunks)
cognee/tasks/web_scraper/config.py (1 hunks)
cognee/tasks/web_scraper/models.py (1 hunks)
cognee/tasks/web_scraper/utils.py (1 hunks)
cognee/tasks/web_scraper/web_scraper_task.py (1 hunks)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
cognee/tests/tasks/web_scraping/web_scraping_test.py (6)
  cognee/tasks/web_scraper/config.py (1)
    SoupCrawlerConfig (12-22)
  cognee/tasks/web_scraper/web_scraper_task.py (1)
    cron_web_scraper_task (37-108)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/api/v1/visualize/visualize.py (1)
    visualize_graph (14-27)
  cognee/api/v1/add/add.py (1)
    add (23-225)
  cognee/modules/search/types/SearchType.py (1)
    SearchType (4-19)
cognee/tasks/ingestion/save_data_item_to_storage.py (2)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
  cognee/modules/ingestion/save_data_to_file.py (1)
    save_data_to_file (7-28)
cognee/api/v1/add/add.py (3)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-9)
    SoupCrawlerConfig (12-22)
  cognee/tasks/ingestion/resolve_data_directories.py (1)
    resolve_data_directories (10-81)
  cognee/tasks/ingestion/ingest_data.py (1)
    ingest_data (25-194)
cognee/tasks/web_scraper/models.py (1)
  cognee/infrastructure/engine/models/DataPoint.py (1)
    DataPoint (20-220)
cognee/tasks/web_scraper/utils.py (3)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/web_scraper/bs4_crawler.py (2)
    BeautifulSoupCrawler (69-549)
    fetch_with_bs4 (463-549)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-9)
    SoupCrawlerConfig (12-22)
cognee/tasks/web_scraper/web_scraper_task.py (6)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (12-22)
    TavilyConfig (6-9)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-76)
🪛 Pylint (3.3.8)
cognee/tests/tasks/web_scraping/web_scraping_test.py
[refactor] 149-169: Unnecessary "else" after "break", remove the "else" and de-indent the code inside it
(R1723)
cognee/tasks/ingestion/save_data_item_to_storage.py
[refactor] 31-31: Too many local variables (17/15)
(R0914)
[refactor] 31-31: Too many return statements (8/6)
(R0911)
[refactor] 31-31: Too many branches (19/12)
(R0912)
[refactor] 59-59: Consider merging these comparisons with 'in' by using 'parsed_url.scheme in ('http', 'https')'. Use a set instead if elements are hashable.
(R1714)
cognee/tasks/web_scraper/models.py
[refactor] 6-6: Too few public methods (0/2)
(R0903)
[refactor] 22-22: Too few public methods (0/2)
(R0903)
[refactor] 36-36: Too few public methods (0/2)
(R0903)
cognee/tasks/web_scraper/utils.py
[refactor] 48-76: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 37-37: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many local variables (34/15)
(R0914)
[refactor] 111-111: Too many statements (54/50)
(R0915)
[refactor] 387-390: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
Actionable comments posted: 2
♻️ Duplicate comments (1)
cognee/tasks/web_scraper/bs4_crawler.py (1)
519-524: Enforce rate limiting in the Playwright path.

Lines 519-522 call _render_with_playwright without enforcing rate limiting, whereas the HTTPX path (line 524) calls _fetch_httpx, which internally enforces rate limiting via _respect_rate_limit (line 313). This inconsistency means Playwright requests ignore robots.txt crawl-delay directives and the configured per-domain throttling, potentially violating site policies and risking IP blocks. Apply this diff to enforce rate limiting before Playwright rendering:

  if use_playwright:
+     crawl_delay = await self._get_crawl_delay(url)
+     await self._respect_rate_limit(url, crawl_delay)
      html = await self._render_with_playwright(
          url, js_wait=playwright_js_wait, timeout=self.timeout
      )
🧹 Nitpick comments (1)
cognee/tasks/web_scraper/config.py (1)
20-20: Consider making extraction_rules optional.

The extraction_rules field has no default value, making it a required field at instantiation. However, the validation logic in utils.py (line 61) checks whether extraction_rules is None before using it, suggesting the field should be optional at the config level. Making this field optional would allow partial config construction and improve flexibility. Apply this diff:

- extraction_rules: Dict[str, Any]
+ extraction_rules: Optional[Dict[str, Any]] = None
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cognee/tasks/web_scraper/bs4_crawler.py (1 hunks)
cognee/tasks/web_scraper/config.py (1 hunks)
cognee/tasks/web_scraper/utils.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-06T14:08:59.915Z
Learnt from: Geoff-Robin
PR: topoteretes/cognee#1501
File: cognee/tasks/web_scraper/utils.py:108-108
Timestamp: 2025-10-06T14:08:59.915Z
Learning: In the Tavily Python library, AsyncTavilyClient does not accept a timeout parameter at the constructor level. Instead, the timeout parameter should be passed to the extract method along with other parameters like extract_depth and proxy.
Applied to files:
cognee/tasks/web_scraper/utils.py
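A minimal sketch of the call pattern this learning describes, assuming the tavily-python AsyncTavilyClient; the extract keyword names follow the learning above, and the response shape shown in the comments is an assumption rather than a documented guarantee.

import asyncio
import os

from tavily import AsyncTavilyClient  # tavily-python package


async def extract_pages(urls):
    # Per the learning: no timeout at the constructor; pass it to extract() instead.
    client = AsyncTavilyClient(api_key=os.environ["TAVILY_API_KEY"])
    response = await client.extract(
        urls,
        extract_depth="basic",  # "advanced" is also reported to be supported
        timeout=30,
    )
    # Assumed response shape: {"results": [{"url": ..., "raw_content": ...}], ...}
    return {item["url"]: item["raw_content"] for item in response.get("results", [])}


if __name__ == "__main__":
    print(asyncio.run(extract_pages(["https://example.com"])))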
📚 Learning: 2025-10-07T15:48:53.427Z
Learnt from: Geoff-Robin
PR: topoteretes/cognee#1501
File: cognee/tasks/web_scraper/bs4_crawler.py:140-167
Timestamp: 2025-10-07T15:48:53.427Z
Learning: In cognee/tasks/web_scraper/bs4_crawler.py, the helper methods `_domain_from_url`, `_get_domain_root`, and `_normalize_rule` should be instance methods with `self` as the first parameter, not static methods.
Applied to files:
cognee/tasks/web_scraper/bs4_crawler.py
🧬 Code graph analysis (2)
cognee/tasks/web_scraper/utils.py (3)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/web_scraper/bs4_crawler.py (3)
    BeautifulSoupCrawler (69-546)
    fetch_with_bs4 (460-546)
    close (125-129)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-24)
cognee/tasks/web_scraper/bs4_crawler.py (1)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/bs4_crawler.py
[refactor] 69-69: Too many instance attributes (12/7)
(R0902)
[refactor] 85-85: Too many arguments (8/5)
(R0913)
[refactor] 438-458: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 403-403: Too many branches (15/12)
(R0912)
[refactor] 460-460: Too many arguments (6/5)
(R0913)
[refactor] 460-460: Too many local variables (18/15)
(R0914)
… allow None value
<!-- .github/pull_request_template.md -->

## Description

Cognee release v0.3.5

## Type of Change

<!-- Please check the relevant option -->

- [ ] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation update
- [ ] Code refactoring
- [ ] Performance improvement
- [ ] Other (please specify):

## Screenshots/Videos (if applicable)

<!-- Add screenshots or videos to help explain your changes -->

## Pre-submission Checklist

<!-- Please check all boxes that apply before submitting your PR -->

- [ ] **I have tested my changes thoroughly before submitting this PR**
- [ ] **This PR contains minimal changes necessary to address the issue/feature**
- [ ] My code follows the project's coding standards and style guidelines
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added necessary documentation (if applicable)
- [ ] All new and existing tests pass
- [ ] I have searched existing PRs to ensure this change hasn't been submitted already
- [ ] I have linked any relevant issues in the description
- [ ] My commits have clear and descriptive messages

## DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Actionable comments posted: 2
♻️ Duplicate comments (1)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1)
50-59: Remove unsupported structured parameter.

Line 58 passes structured=True to SoupCrawlerConfig, but this field doesn't exist in the model definition (see cognee/tasks/web_scraper/config.py lines 12-23). This will raise a Pydantic ValidationError at runtime. Apply this diff to remove the unsupported parameter:

  soup_config = SoupCrawlerConfig(
      concurrency=1,
      crawl_delay=0.1,
      timeout=10.0,
      max_retries=1,
      retry_delay_factor=0.5,
      extraction_rules=rules,
      use_playwright=False,
-     structured=True,
  )
🧹 Nitpick comments (2)
cognee/tasks/web_scraper/web_scraper_task.py (2)
159-183: Clarify variable naming and conditional logic.

Line 159 assigns scraping_job_created = await graph_db.get_node(...), but the variable name is misleading: it actually holds the existing job node (or None if it doesn't exist). The conditional at lines 182-183 adds the node only if it already exists, which is confusing since line 319 performs a batch add_nodes() for all nodes anyway. This pattern repeats at lines 226-227 and 240-241 for website nodes. The incremental add_node() calls appear redundant given the final batch operation. Rename the variable to clarify its purpose and remove redundant conditional updates:

- scraping_job_created = await graph_db.get_node(uuid5(NAMESPACE_OID, name=job_name))
+ existing_scraping_job = await graph_db.get_node(uuid5(NAMESPACE_OID, name=job_name))

  # Create description for ScrapingJob
  scraping_job_description = (
      # ... description content ...
  )

  scraping_job = ScrapingJob(
      # ... scraping job fields ...
  )

- if scraping_job_created:
-     await graph_db.add_node(scraping_job)  # Update existing scraping job

If incremental updates are intentional for performance or real-time visibility, add a comment explaining the rationale. Otherwise, rely solely on the batch add_nodes() at line 319.
386-389: Remove unnecessary else after return.

Lines 386-389 contain an else block following a return statement. The else is unnecessary since the code in that block will only execute if the if condition is false. As per coding guidelines, apply this diff to simplify the code:

  if full_path.startswith(base_path):
      return full_path[len(base_path) :].lstrip("/")
- else:
-     return full_path.lstrip("/")
+ return full_path.lstrip("/")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/tasks/web_scraper/web_scraper_task.py (1 hunks)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
cognee/tests/tasks/web_scraping/web_scraping_test.py (3)
  cognee/tasks/web_scraper/config.py (1)
    SoupCrawlerConfig (13-24)
  cognee/tasks/web_scraper/web_scraper_task.py (1)
    cron_web_scraper_task (37-108)
  cognee/api/v1/add/add.py (1)
    add (23-227)
cognee/tasks/web_scraper/web_scraper_task.py (5)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-24)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-86)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 37-37: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many arguments (7/5)
(R0913)
[refactor] 111-111: Too many local variables (34/15)
(R0914)
[refactor] 111-111: Too many statements (53/50)
(R0915)
[refactor] 386-389: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
🔇 Additional comments (4)
cognee/tasks/web_scraper/web_scraper_task.py (3)
27-32: LGTM: APScheduler import and initialization.

The import block correctly guards against missing APScheduler dependency and provides a clear error message. The BackgroundScheduler initialization is appropriate for this use case, though consider switching to AsyncIOScheduler for better async integration (see separate comment on lines 77-88).
327-359: LGTM: Argument validation logic.

The check_arguments function correctly validates and configures the scraping tools, with clear precedence logic (Tavily fallback when extraction_rules are absent) and appropriate error handling.
362-384: LGTM: URL path extraction with domain validation.

The get_path_after_base function correctly validates that both URLs share the same domain and extracts the relative path, preventing cross-domain path confusion.

cognee/tests/tasks/web_scraping/web_scraping_test.py (1)
119-152: Re-run cron web scraper test locally with proper environment set up
Install pytest, set PYTHONPATH to the repo root, and run:

  PYTHONPATH=. pytest cognee/tests/tasks/web_scraping/web_scraping_test.py::test_cron_web_scraper -q

If it fails, add await cognee.cognify() immediately after the cron_web_scraper_task call.
Actionable comments posted: 3
🧹 Nitpick comments (2)
cognee/tasks/web_scraper/web_scraper_task.py (2)
271-271: Refine status logic to handle partial failures.

Line 271 sets the job status to "completed" if any webpages were scraped, otherwise "failed". This doesn't account for partial failures: if some URLs succeed and others fail, the job will still be marked "completed", masking issues. Consider tracking success/failure counts:

+ successful_pages = len(webpages)
+ failed_pages = len(url) - successful_pages
+
+ if failed_pages == 0:
+     scraping_job.status = "completed"
+ elif successful_pages == 0:
+     scraping_job.status = "failed"
+ else:
+     scraping_job.status = "partial"
- scraping_job.status = "completed" if webpages else "failed"

Update the description to include these counts:

  scraping_job.description = (
      f"Scraping job: {job_name}\n"
      f"URLs: {', '.join(url)}\n"
      f"Status: {scraping_job.status}\n"
+     f"Successful: {successful_pages}/{len(url)}\n"
      f"Last run: {now.strftime('%Y-%m-%d %H:%M:%S')}\n"
      f"Next run: {next_run.strftime('%Y-%m-%d %H:%M:%S') if next_run else 'Not scheduled'}"
  )
360-387: Remove unused function or document its intended use.

The get_path_after_base function is defined but never called within this file or imported in the __init__.py (based on the provided context). This appears to be dead code. Additionally, the else clause at lines 384-387 is unnecessary after the return statement. If this function is not needed, remove it:

- def get_path_after_base(base_url: str, url: str) -> str:
-     """Extract the path after the base URL.
-
-     Args:
-         base_url: The base URL (e.g., "https://example.com").
-         url: The full URL to extract the path from.
-
-     Returns:
-         str: The path after the base URL, with leading slashes removed.
-
-     Raises:
-         ValueError: If the base URL and target URL are from different domains.
-     """
-     parsed_base = urlparse(base_url)
-     parsed_url = urlparse(url)
-
-     # Ensure they have the same netloc (domain)
-     if parsed_base.netloc != parsed_url.netloc:
-         raise ValueError("Base URL and target URL are from different domains")
-
-     # Return everything after base_url path
-     base_path = parsed_base.path.rstrip("/")
-     full_path = parsed_url.path
-
-     if full_path.startswith(base_path):
-         return full_path[len(base_path) :].lstrip("/")
-     else:
-         return full_path.lstrip("/")

If it's intended for future use or external import, add a docstring note explaining its purpose and remove the unnecessary else:

  if full_path.startswith(base_path):
      return full_path[len(base_path) :].lstrip("/")
- else:
-     return full_path.lstrip("/")
+
+ return full_path.lstrip("/")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (1)
cognee/tasks/web_scraper/web_scraper_task.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/tasks/web_scraper/web_scraper_task.py (5)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-24)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-86)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 36-36: Too many arguments (7/5)
(R0913)
[refactor] 109-109: Too many arguments (7/5)
(R0913)
[refactor] 109-109: Too many local variables (34/15)
(R0914)
[refactor] 109-109: Too many statements (53/50)
(R0915)
[refactor] 384-387: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
🔇 Additional comments (2)
cognee/tasks/web_scraper/web_scraper_task.py (2)
92-93: LGTM: Scheduler lifecycle correctly managed.

The conditional check if not scheduler.running: before calling scheduler.start() correctly prevents SchedulerAlreadyRunningError on subsequent job registrations.
325-358: LGTM: Argument validation and config construction is correct.

The function correctly handles the various parameter combinations:

- Constructs SoupCrawlerConfig from extraction_rules if needed.
- Constructs or updates TavilyConfig from tavily_api_key if provided.
- Selects preferred_tool based on available configurations.
- Raises TypeError if neither config can be constructed.
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/tasks/web_scraper/web_scraper_task.py (4)
29-29: Update the APScheduler version recommendation.

The import error message suggests installing APScheduler>=3.10, but the latest stable version is 3.11.0 (released November 2024). Apply this diff to update the version recommendation:

- raise ImportError("Please install apscheduler by pip install APScheduler>=3.10")
+ raise ImportError("Please install apscheduler by pip install APScheduler>=3.11")

Based on learnings.
98-98: Remove redundant fallback expression.

Line 77 ensures job_name is always truthy (either from the parameter or a timestamp-based default), so the fallback job_name or f"WebScraper_{uuid5(...)}" at line 98 is redundant and could never use the UUID-based fallback. Apply this diff to simplify:

- name=job_name or f"WebScraper_{uuid5(NAMESPACE_OID, name=job_name)}",
+ name=job_name,
118-331: Consider breaking down this large function.

The web_scraper_task function is quite long, with 34 local variables and 53 statements. It handles multiple responsibilities: validation, fetching content, creating nodes, managing edges, and indexing. This makes the function harder to test and maintain. Consider extracting logical sub-operations into helper functions (a sketch follows this list):

- _create_scraping_job_node(...) → lines 166-187
- _process_scraped_content(...) → lines 201-278
- _build_graph_structure(...) → lines 290-324

This would improve readability and make unit testing easier.
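A decomposition sketch along the lines suggested above; the helper names come from this comment, while the signatures, payload shapes, and bodies are placeholders rather than the actual cognee implementation.

from typing import Any, Dict, List, Tuple
from urllib.parse import urlparse


async def _create_scraping_job_node(job_name: str, urls: List[str]) -> Dict[str, Any]:
    # Placeholder for building/upserting the ScrapingJob node for this run.
    return {"name": job_name, "urls": urls, "status": "running"}


async def _process_scraped_content(scraped: Dict[str, str]) -> Tuple[List[dict], List[dict]]:
    # Placeholder for turning {url: content} into WebPage / WebSite payloads.
    webpages = [{"url": url, "content": content} for url, content in scraped.items()]
    domains = sorted({urlparse(url).netloc for url in scraped})
    return webpages, [{"domain": domain} for domain in domains]


async def _build_graph_structure(job: dict, websites: List[dict], webpages: List[dict]) -> dict:
    # Placeholder for creating is_part_of / is_scraping edges and indexing datapoints.
    return {"job": job, "websites": websites, "webpages": webpages}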
393-396: Remove unnecessary else after return.

The else clause at line 395 is redundant since the if block at line 393 returns early. Apply this diff:

  if full_path.startswith(base_path):
      return full_path[len(base_path) :].lstrip("/")
- else:
-     return full_path.lstrip("/")
+ return full_path.lstrip("/")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tasks/web_scraper/web_scraper_task.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/tasks/web_scraper/web_scraper_task.py (6)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/shared/logging_utils.py (1)
    get_logger (182-194)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-24)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-86)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 44-44: Too many arguments (7/5)
(R0913)
[refactor] 118-118: Too many arguments (7/5)
(R0913)
[refactor] 118-118: Too many local variables (34/15)
(R0914)
[refactor] 118-118: Too many statements (53/50)
(R0915)
[refactor] 393-396: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/tasks/web_scraper/web_scraper_task.py (4)
166-190: Clarify conditional node update pattern.

The logic checks scraping_job_created (which is truthy when the job already exists) and then calls add_node() to update it. While this is correct per the learning that add_node() performs upserts, the pattern might cause confusion. However, line 326 performs a bulk add_nodes() operation for all nodes, which also upserts. The intermediate conditional updates (lines 189-190, 233-234, 247-248) become redundant since the final bulk operation will handle both creation and updates. Consider removing the intermediate conditional add_node() calls and relying solely on the bulk operation at line 326, unless the intermediate updates serve a specific purpose (e.g., immediate consistency for long-running scrapes).
393-396: Remove unnecessary else after return.

The else block after the return statement is unnecessary and reduces readability. Apply this diff:

  if full_path.startswith(base_path):
      return full_path[len(base_path) :].lstrip("/")
- else:
-     return full_path.lstrip("/")
+ return full_path.lstrip("/")
101-103: Add scheduler shutdown cleanup.

No calls to scheduler.shutdown() or scheduler.stop() were found; register a cleanup handler (e.g. via atexit.register or an explicit teardown) to properly release scheduler resources.
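One possible shape of that cleanup, assuming the lazily created AsyncIOScheduler and get_scheduler() helper described elsewhere in this review; this is an illustrative sketch, not the project's actual code.

import atexit
from typing import Optional

from apscheduler.schedulers.asyncio import AsyncIOScheduler

_scheduler: Optional[AsyncIOScheduler] = None


def get_scheduler() -> AsyncIOScheduler:
    global _scheduler
    if _scheduler is None:
        _scheduler = AsyncIOScheduler()
        # Release scheduler resources when the interpreter exits.
        atexit.register(_shutdown_scheduler)
    return _scheduler


def _shutdown_scheduler() -> None:
    if _scheduler is not None and _scheduler.running:
        _scheduler.shutdown(wait=False)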
326-329: Avoid indexing full WebPage.content in vector embeddings.

WebPage.metadata["index_fields"] includes "content", so index_data_points will embed the entire HTML payload (potentially megabytes) and may cause performance or memory issues. Restrict index_fields to ["name", "description"] or truncate/summarize content before embedding.
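A sketch of the suggested restriction, assuming cognee's DataPoint convention of a class-level metadata dict carrying index_fields (the import path is taken from the code-graph listings above); only the fields relevant to this comment are shown.

from typing import Optional

from cognee.infrastructure.engine.models.DataPoint import DataPoint


class WebPage(DataPoint):
    name: Optional[str] = None
    description: Optional[str] = None
    content: Optional[str] = None  # full page text/HTML stays stored, just not embedded

    # Embed only the short fields; drop "content" to avoid megabyte-sized embeddings.
    metadata: dict = {"index_fields": ["name", "description"]}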
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/tasks/web_scraper/models.py (1 hunks)
cognee/tasks/web_scraper/web_scraper_task.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
cognee/tasks/web_scraper/web_scraper_task.py (5)
  cognee/infrastructure/databases/graph/get_graph_engine.py (1)
    get_graph_engine (10-24)
  cognee/tasks/storage/index_graph_edges.py (1)
    index_graph_edges (12-81)
  cognee/tasks/web_scraper/models.py (3)
    WebPage (6-19)
    WebSite (22-33)
    ScrapingJob (36-46)
  cognee/tasks/web_scraper/config.py (2)
    SoupCrawlerConfig (13-24)
    TavilyConfig (6-10)
  cognee/tasks/web_scraper/utils.py (1)
    fetch_page_content (15-86)
cognee/tasks/web_scraper/models.py (1)
cognee/infrastructure/engine/models/DataPoint.py (1)
DataPoint(20-220)
🪛 Pylint (3.3.8)
cognee/tasks/web_scraper/web_scraper_task.py
[refactor] 44-44: Too many arguments (7/5)
(R0913)
[refactor] 118-118: Too many arguments (7/5)
(R0913)
[refactor] 118-118: Too many local variables (35/15)
(R0914)
[refactor] 118-118: Too many statements (54/50)
(R0915)
[refactor] 393-396: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
cognee/tasks/web_scraper/models.py
[refactor] 6-6: Too few public methods (0/2)
(R0903)
[refactor] 22-22: Too few public methods (0/2)
(R0903)
[refactor] 36-36: Too few public methods (0/2)
(R0903)
🔇 Additional comments (6)
cognee/tasks/web_scraper/models.py (3)
6-19: LGTM! Models are well-structured.

The WebPage model fields are appropriate for representing scraped web pages. The metadata index_fields now correctly reference existing attributes (name, description, content), addressing the previous concern about misaligned field names.

Minor observation: name is typed as Optional[str] but is always populated with the page URL in practice (see web_scraper_task.py line 267). Consider making it non-optional (name: str) to better reflect actual usage and prevent potential None values in the embeddable data.
22-33: LGTM! Website model is well-defined.

The WebSite model appropriately captures domain-level metadata. The index_fields align with actual attributes, ensuring proper vector indexing.
36-46: LGTM! Scraping job model is appropriate.

The ScrapingJob model correctly represents job configuration with scheduling support. The index_fields are properly aligned with the model's attributes.

cognee/tasks/web_scraper/web_scraper_task.py (3)
34-41: Good lazy initialization pattern.

The get_scheduler() function correctly implements lazy initialization, addressing the previous concern about module-level scheduler instantiation. This ensures the AsyncIOScheduler is created only when needed.
253-263: LGTM! Content preview avoids performance issues.

The description correctly uses a truncated content_preview (first 500 characters) instead of embedding the full content, as addressed in previous reviews. The full content remains in the content field where it belongs.
265-267: LGTM! UUID generation prevents collisions.

Using the full page_url for UUID generation (instead of just the path) correctly prevents ID collisions across different domains, as addressed in previous reviews.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/api/v1/add/add.py (1)
153-153: Fix the typo in the documentation comment.

The documentation still contains a grammatical error flagged in a previous review. The line should read "Make sure to set TAVILY_API_KEY=YOUR_TAVILY_API_KEY as an environment variable".

Apply this diff:

- Make sure to set TAVILY_API_KEY = YOUR_TAVILY_API_KEY as a environment variable
+ Make sure to set TAVILY_API_KEY=YOUR_TAVILY_API_KEY as an environment variable
🧹 Nitpick comments (1)
cognee/api/v1/add/add.py (1)
175-181: Use default TavilyConfig constructor and remove redundant env var guard.

ContextVar.set accepts None without error. Since TavilyConfig already defaults its api_key from TAVILY_API_KEY, simplify in cognee/api/v1/add/add.py:

- if not tavily_config and os.getenv("TAVILY_API_KEY"):
-     tavily_config = TavilyConfig(api_key=os.getenv("TAVILY_API_KEY"))
+ if not tavily_config:
+     tavily_config = TavilyConfig()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
cognee-mcp/pyproject.toml is excluded by !**/*.toml
pyproject.toml is excluded by !**/*.toml
uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (1)
cognee/api/v1/add/add.py(7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/api/v1/add/add.py (4)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-24)
  cognee/modules/pipelines/tasks/task.py (1)
    Task (5-97)
  cognee/tasks/ingestion/resolve_data_directories.py (1)
    resolve_data_directories (10-84)
  cognee/tasks/ingestion/ingest_data.py (1)
    ingest_data (24-193)
🔇 Additional comments (3)
cognee/api/v1/add/add.py (3)
2-3: LGTM! The new imports are appropriate for the web scraping functionality and are all used in the code.
Also applies to: 15-20
33-35: LGTM! The new parameters are properly typed and integrate well with the existing signature. Note that a previous review raised a question about whether extraction_rules and soup_crawler_config should both exist, but this remains a design decision for the maintainers.
183-191: LGTM! The URL detection logic properly handles type safety and works correctly for both single URLs and lists. The node_set modification happens before task construction, which addresses a previous review concern. The _is_http_url helper with an isinstance check prevents the TypeError that was flagged in an earlier review.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/api/v1/add/add.py (1)
153-153: Fix grammatical error in documentation.

The phrase should read "as an environment variable", not "as a environment variable".

Apply this diff:

- Make sure to set TAVILY_API_KEY = YOUR_TAVILY_API_KEY as a environment variable
+ Make sure to set TAVILY_API_KEY = YOUR_TAVILY_API_KEY as an environment variable
🧹 Nitpick comments (1)
cognee/api/v1/add/add.py (1)
33-35: Clarify precedence when both extraction_rules and soup_crawler_config are provided.

Lines 175-176 show that if soup_crawler_config is provided, extraction_rules is ignored. This could confuse users who pass both parameters. Consider either:
- Raising a clear error when both are provided, or
- Documenting the precedence explicitly in the docstring
Example validation:
+ if extraction_rules and soup_crawler_config:
+     raise ValueError(
+         "Cannot specify both 'extraction_rules' and 'soup_crawler_config'. "
+         "Pass extraction_rules via soup_crawler_config.extraction_rules instead."
+     )
  if not soup_crawler_config and extraction_rules:
      soup_crawler_config = SoupCrawlerConfig(extraction_rules=extraction_rules)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/api/v1/add/add.py (7 hunks)
cognee/tests/tasks/web_scraping/web_scraping_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/tasks/web_scraping/web_scraping_test.py
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/api/v1/add/add.py (3)
  cognee/tasks/web_scraper/config.py (2)
    TavilyConfig (6-10)
    SoupCrawlerConfig (13-24)
  cognee/tasks/ingestion/resolve_data_directories.py (1)
    resolve_data_directories (10-84)
  cognee/tasks/ingestion/ingest_data.py (1)
    ingest_data (24-193)
🔇 Additional comments (8)
cognee/api/v1/add/add.py (8)
2-3: LGTM! The import additions are necessary for environment variable access and type annotations used throughout the updated function.
15-20: LGTM! All imported modules are utilized in the function for web scraping configuration and URL handling.
41-41: LGTM! Documentation correctly reflects the new URL ingestion capability.
81-81: LGTM! The parameter documentation clearly explains the purpose and usage of URL inputs and web scraping configurations.
Also applies to: 92-94
171-171: LGTM! The environment variable documentation is correctly updated to include TAVILY_API_KEY.
183-191: LGTM! Past review concerns addressed.

The URL detection logic correctly:

- Guards against non-string payloads using an isinstance(item, str) check
- Updates node_set before task construction (moved from after, per previous feedback)
- Handles both single URLs and lists containing URLs
195-202: LGTM! The task construction correctly uses the updated node_set value after URL detection, ensuring web content is properly tagged during ingestion.
175-181: Context global handling and thread safety verified
ContextVar supports None defaults and per-context isolation, so no changes are required.
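A minimal standard-library illustration of the two properties this relies on (a None default and per-context isolation); the variable name mirrors the config discussed above but is otherwise just an example.

import contextvars

tavily_config_var = contextvars.ContextVar("tavily_config", default=None)


def read():
    return tavily_config_var.get()


assert read() is None  # a None default is perfectly valid

ctx = contextvars.copy_context()
ctx.run(tavily_config_var.set, "per-request config")

assert ctx.run(read) == "per-request config"  # visible inside that context
assert read() is None  # the outer context is untouched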
@borisarzentar Is it fine if I add some more support to extend it for serverless environments?

@Geoff-Robin this PR seems like it went through a review. How about we open an additional 2 tickets for serverless and refactor, and I merge this. Appreciate your help on this one!
Description
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.