feat: web scraping connector task #1501
Merged
Vasilije1990 merged 58 commits into topoteretes:dev from Geoff-Robin:feature/web_scraping_connector_task on Oct 11, 2025
Commits
510926f included scraping dependencies (Geoff-Robin)
6348c9d Created models.py (Geoff-Robin)
70a2cc9 removed scrapy and added bs4 (Geoff-Robin)
925bd38 Setup models.py and utils.py (Geoff-Robin)
60499c4 Added logging (Geoff-Robin)
c283977 switched httpx AsyncClient to fetch webpage (Geoff-Robin)
4979f43 Added playwright as a dependency (Geoff-Robin)
edd119e first iteration of bs4_connector.py done (Geoff-Robin)
1ab9d24 Changed bs4_connector.py to bs4_crawler.py (Geoff-Robin)
20fb773 Done with integration with add workflow when incremental_loading is s… (Geoff-Robin)
fbef667 removed unused Dict import from typing (Geoff-Robin)
da7ebc4 Removed asyncio import (Geoff-Robin)
ab6fc65 Added global context for bs4crawler and tavily config (Geoff-Robin)
2cba31a Tested and Debugged scraping usage in cognee.add() pipeline (Geoff-Robin)
c2aa955 removed structured argument (Geoff-Robin)
77ea7c4 Added APScheduler (Geoff-Robin)
f148b1d Added support for multiple base_url extraction (Geoff-Robin)
f449fce Done with scraping_task successfully (Geoff-Robin)
e5633bc corrected F402 error pointed out by ruff check (Geoff-Robin)
0f64f68 Done adding cron job web scraping (Geoff-Robin)
4d5146c Added Documentation (Geoff-Robin)
667bbd7 Added cron job and removed obvious comments (Geoff-Robin)
ae740ed Added related documentation (Geoff-Robin)
1b5c099 CodeRabbit reviews solved (Geoff-Robin)
791e38b Solved more nitpick comments (Geoff-Robin)
3c9e5f8 Solved more nitpick comments (Geoff-Robin)
0a9b624 changed return type for fetch_page_content to Dict[str,str] (Geoff-Robin)
7fe1de7 Remove assignment to unused variable graph_db' (Geoff-Robin)
d4ce340 Removed unused imports (Geoff-Robin)
1c0e0f0 Solved more nitpick comments (Geoff-Robin)
54f2580 Solved more nitpick comments (Geoff-Robin)
1f36dd3 Solved nitpick comments (Geoff-Robin)
5dcd7e5 Changes uv.lock (Geoff-Robin)
b5a1957 Regenerate uv.lock after merge (Geoff-Robin)
902f9a3 Changed cognee-mcp\pyproject.toml (Geoff-Robin)
f71cf77 . (Geoff-Robin)
fdf8562 Added uv.lock again (Geoff-Robin)
d91ffa2 Removed staticmethod decorator from bs4_crawler.py, kwargs from the f… (Geoff-Robin)
3d53e8d Removed print statement that I used for debugging (Geoff-Robin)
fcd91a9 Added self as an argument to all previous methods that were static me… (Geoff-Robin)
fc660e4 Closed crawler instance in a finally block (Geoff-Robin)
0fd55a7 ruff formatted (Geoff-Robin)
f59c278 Added await (Geoff-Robin)
49858c5 Made api_key field in TavilyConfig models to be Optional[str] type to… (Geoff-Robin)
760a9de Release v0.3.5 (#1515) (borisarzentar)
ea33854 Removed print statement logging and used cognee inbuilt logger and up… (Geoff-Robin)
8d27da6 removed dotenv imports (Geoff-Robin)
af71cba Trying to resolve uv.lock (Geoff-Robin)
a3fbbdf Solved nitpick comments (Geoff-Robin)
599ef4a solved nitpick comments (Geoff-Robin)
6602275 Addressed code rabbit comment on shortening content (Geoff-Robin)
a9d410e resolving uv.lock conflict (Geoff-Robin)
e934f80 Merge branch 'main' into feature/web_scraping_connector_task (Geoff-Robin)
f82dfbe solved nitpick comments (Geoff-Robin)
4058d63 Added better selectors for testing (Geoff-Robin)
4e5c681 Merge branch 'dev' into feature/web_scraping_connector_task (Geoff-Robin)
f316128 Merge branch 'dev' into feature/web_scraping_connector_task (Geoff-Robin)
5e69438 Merge branch 'dev' into feature/web_scraping_connector_task (Geoff-Robin)
Added support for multiple base_url extraction
commit f148b1df89c9f55cc9f5caac4a57dda92415f949
Since `soup_crawler_config` already uses `extraction_rules`, why not always pass just the `soup_crawler_config`? We want to keep the number of arguments to `add` minimal; this would remove `extraction_rules` from the `add` arguments.

@dexters1 @hajdul88 Should we introduce a `config` dict here as well, like we do for cognify? The number of arguments keeps growing.
I thought the user might prefer just passing the `extraction_rules` if that's the only setting that needs to be configured.

But if it's preferred that `soup_crawler_config` is enough, I can change that. Just let me know if the change needs to be made now.
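
For illustration only, here is a minimal sketch of how the two call styles discussed above could be reconciled. Everything in it is an assumption: `SoupCrawlerConfig`, its `timeout` field, and `resolve_crawler_config` are hypothetical stand-ins, not the code in this PR or cognee's actual API. The idea is that a bare `extraction_rules` dict is folded into a config object, so downstream code only ever handles one type.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SoupCrawlerConfig:
    """Hypothetical stand-in for the crawler config discussed in this thread."""

    extraction_rules: Dict[str, Any] = field(default_factory=dict)
    timeout: float = 15.0  # assumed extra setting, present only for illustration


def resolve_crawler_config(
    soup_crawler_config: Optional[SoupCrawlerConfig] = None,
    extraction_rules: Optional[Dict[str, Any]] = None,
) -> SoupCrawlerConfig:
    """Normalize both call styles into a single config object (hypothetical helper)."""
    if soup_crawler_config is not None and extraction_rules is not None:
        # Refuse ambiguous input instead of silently preferring one argument.
        raise ValueError("Pass either soup_crawler_config or extraction_rules, not both")
    if soup_crawler_config is not None:
        return soup_crawler_config
    # Wrap the bare rules (or nothing at all) in a config with default settings.
    return SoupCrawlerConfig(extraction_rules=extraction_rules or {})


rules = {"title": "h1"}
from_rules = resolve_crawler_config(extraction_rules=rules)
from_config = resolve_crawler_config(
    soup_crawler_config=SoupCrawlerConfig(extraction_rules=rules)
)
```

Under these assumptions both calls yield equal config objects. Dropping the `extraction_rules` argument, as the reviewer suggests, would amount to keeping only the second call style.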