Merged
Changes from 1 commit
58 commits
510926f
included scraping dependencies
Geoff-Robin Sep 30, 2025
6348c9d
Created models.py
Geoff-Robin Sep 30, 2025
70a2cc9
removed scrapy and added bs4
Geoff-Robin Oct 1, 2025
925bd38
Setup models.py and utils.py
Geoff-Robin Oct 1, 2025
60499c4
Added logging
Geoff-Robin Oct 1, 2025
c283977
switched httpx AsyncClient to fetch webpage
Geoff-Robin Oct 1, 2025
4979f43
Added playwright as a dependency
Geoff-Robin Oct 1, 2025
edd119e
first iteration of bs4_connector.py done
Geoff-Robin Oct 2, 2025
1ab9d24
Changed bs4_connector.py to bs4_crawler.py
Geoff-Robin Oct 3, 2025
20fb773
Done with integration with add workflow when incremental_loading is s…
Geoff-Robin Oct 4, 2025
fbef667
removed unused Dict import from typing
Geoff-Robin Oct 4, 2025
da7ebc4
Removed asyncio import
Geoff-Robin Oct 4, 2025
ab6fc65
Added global context for bs4crawler and tavily config
Geoff-Robin Oct 4, 2025
2cba31a
Tested and Debugged scraping usage in cognee.add() pipeline
Geoff-Robin Oct 4, 2025
c2aa955
removed structured argument
Geoff-Robin Oct 5, 2025
77ea7c4
Added APScheduler
Geoff-Robin Oct 5, 2025
f148b1d
Added support for multiple base_url extraction
Geoff-Robin Oct 5, 2025
f449fce
Done with scraping_task successfully
Geoff-Robin Oct 5, 2025
e5633bc
corrected F402 error pointed out by ruff check
Geoff-Robin Oct 5, 2025
0f64f68
Done adding cron job web scraping
Geoff-Robin Oct 5, 2025
4d5146c
Added Documentation
Geoff-Robin Oct 5, 2025
667bbd7
Added cron job and removed obvious comments
Geoff-Robin Oct 5, 2025
ae740ed
Added related documentation
Geoff-Robin Oct 5, 2025
1b5c099
CodeRabbit reviews solved
Geoff-Robin Oct 6, 2025
791e38b
Solved more nitpick comments
Geoff-Robin Oct 6, 2025
3c9e5f8
Solved more nitpick comments
Geoff-Robin Oct 6, 2025
0a9b624
changed return type for fetch_page_content to Dict[str,str]
Geoff-Robin Oct 6, 2025
7fe1de7
Remove assignment to unused variable graph_db'
Geoff-Robin Oct 6, 2025
d4ce340
Removed unused imports
Geoff-Robin Oct 6, 2025
1c0e0f0
Solved more nitpick comments
Geoff-Robin Oct 6, 2025
54f2580
Solved more nitpick comments
Geoff-Robin Oct 6, 2025
1f36dd3
Solved nitpick comments
Geoff-Robin Oct 6, 2025
5dcd7e5
Changes uv.lock
Geoff-Robin Oct 6, 2025
b5a1957
Regenerate uv.lock after merge
Geoff-Robin Oct 6, 2025
902f9a3
Changed cognee-mcp\pyproject.toml
Geoff-Robin Oct 6, 2025
f71cf77
.
Geoff-Robin Oct 6, 2025
fdf8562
Added uv.lock again
Geoff-Robin Oct 6, 2025
d91ffa2
Removed staticmethod decorator from bs4_crawler.py, kwargs from the f…
Geoff-Robin Oct 7, 2025
3d53e8d
Removed print statement that I used for debugging
Geoff-Robin Oct 7, 2025
fcd91a9
Added self as an argument to all previous methods that were static me…
Geoff-Robin Oct 7, 2025
fc660e4
Closed crawler instance in a finally block
Geoff-Robin Oct 7, 2025
0fd55a7
ruff formatted
Geoff-Robin Oct 7, 2025
f59c278
Added await
Geoff-Robin Oct 7, 2025
49858c5
Made api_key field in TavilyConfig models to be Optional[str] type to…
Geoff-Robin Oct 7, 2025
760a9de
Release v0.3.5 (#1515)
borisarzentar Oct 7, 2025
ea33854
Removed print statement logging and used cognee inbuilt logger and up…
Geoff-Robin Oct 8, 2025
8d27da6
removed dotenv imports
Geoff-Robin Oct 8, 2025
af71cba
Trying to resolve uv.lock
Geoff-Robin Oct 8, 2025
a3fbbdf
Solved nitpick comments
Geoff-Robin Oct 8, 2025
599ef4a
solved nitpick comments
Geoff-Robin Oct 8, 2025
6602275
Addressed code rabbit comment on shortening content
Geoff-Robin Oct 8, 2025
a9d410e
resolving uv.lock conflict
Geoff-Robin Oct 8, 2025
e934f80
Merge branch 'main' into feature/web_scraping_connector_task
Geoff-Robin Oct 8, 2025
f82dfbe
solved nitpick comments
Geoff-Robin Oct 9, 2025
4058d63
Added better selectors for testing
Geoff-Robin Oct 9, 2025
4e5c681
Merge branch 'dev' into feature/web_scraping_connector_task
Geoff-Robin Oct 10, 2025
f316128
Merge branch 'dev' into feature/web_scraping_connector_task
Geoff-Robin Oct 10, 2025
5e69438
Merge branch 'dev' into feature/web_scraping_connector_task
Geoff-Robin Oct 11, 2025
Added support for multiple base_url extraction
Geoff-Robin committed Oct 5, 2025
commit f148b1df89c9f55cc9f5caac4a57dda92415f949
4 changes: 2 additions & 2 deletions cognee/api/v1/add/add.py
@@ -1,6 +1,6 @@
from uuid import UUID
import os
-from typing import Union, BinaryIO, List, Optional, Dict, Literal
+from typing import Union, BinaryIO, List, Optional, Dict, Any

from cognee.modules.users.models import User
from cognee.modules.pipelines import Task, run_pipeline
@@ -30,7 +30,7 @@ async def add(
dataset_id: Optional[UUID] = None,
preferred_loaders: List[str] = None,
incremental_loading: bool = True,
-extraction_rules: Optional[Dict[str, str]] = None,
+extraction_rules: Optional[Dict[str, Any]] = None,
Contributor
Since soup_crawler_config already carries extraction_rules, why not always pass just the soup_crawler_config?
We want to keep the number of arguments to add minimal, and this would remove extraction_rules from the add arguments.
@dexters1 @hajdul88 Should we introduce a config dict here as well, like we do for cognify? The number of arguments keeps growing.

Contributor Author
I thought the user might prefer just passing the extraction_rules if that's the only setting that needs to be configured.
But if it's preferred that soup_crawler_config is enough, I can change that. Just let me know if the change needs to be made now.
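
One way the two positions in this thread could be reconciled is to accept both arguments for now and fold them into a single config object internally. The sketch below is only an illustration of that idea: `CrawlerConfigSketch`, its `timeout_seconds` field, and `resolve_crawler_config` are hypothetical names that do not appear in this PR, and the real `SoupCrawlerConfig` may have different fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class CrawlerConfigSketch:
    """Hypothetical stand-in for a crawler config; not cognee's SoupCrawlerConfig."""

    extraction_rules: Dict[str, Any] = field(default_factory=dict)
    timeout_seconds: float = 15.0  # illustrative extra setting, assumed for this sketch


def resolve_crawler_config(
    extraction_rules: Optional[Dict[str, Any]] = None,
    crawler_config: Optional[CrawlerConfigSketch] = None,
) -> Optional[CrawlerConfigSketch]:
    """Fold the two arguments into one config object.

    An explicit config wins; bare rules are wrapped in a default config;
    None is returned when neither argument was supplied.
    """
    if crawler_config is not None:
        return crawler_config
    if extraction_rules is not None:
        return CrawlerConfigSketch(extraction_rules=extraction_rules)
    return None
```

With a helper like this, dropping the standalone `extraction_rules` argument later would only change the caller-facing signature, since everything downstream already reads from the config object.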

tavily_config: Optional[TavilyConfig] = None,
soup_crawler_config: Optional[SoupCrawlerConfig] = None,
):
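
The hunk above widens `extraction_rules` from `Dict[str, str]` to `Dict[str, Any]`. The diff does not show which value shapes the crawler accepts, so the following is a minimal sketch under an assumption: that a rule may be either a plain CSS selector string or a dict with extra options. The `selector` and `all` keys and the `normalize_rules` helper are hypothetical and only illustrate why a `str`-only value type would be too narrow.

```python
from typing import Any, Dict


def normalize_rules(extraction_rules: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Convert mixed rule values into one uniform dict shape (assumed shape)."""
    normalized: Dict[str, Dict[str, Any]] = {}
    for field_name, rule in extraction_rules.items():
        if isinstance(rule, str):
            # A bare string is treated as a CSS selector matching one element.
            normalized[field_name] = {"selector": rule, "all": False}
        elif isinstance(rule, dict) and "selector" in rule:
            # A dict may carry options next to the selector.
            normalized[field_name] = {
                "selector": rule["selector"],
                "all": bool(rule.get("all", False)),
            }
        else:
            raise ValueError(f"Unsupported rule for {field_name!r}: {rule!r}")
    return normalized


rules = normalize_rules({"title": "h1", "links": {"selector": "a", "all": True}})
```

Under `Dict[str, str]`, the `links` entry in this example would fail type checking, while `Dict[str, Any]` accepts both forms and leaves validation to runtime.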