# Crawl.py
Crawl.py is a threaded web crawler that crawls a specified domain (or collection of domains) and collects metadata
about each page it visits. It uses BeautifulSoup to gather all URLs from each fetched page and crawls outward from a
start URL, honoring whitelist/blacklist criteria that are populated in crawl.py.
This crawler can be used to analyze the performance of a web server, analyze relationships between pages, or traverse
a site and scrape pages.
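The sketch below shows the general shape of such a threaded crawl loop. It is illustrative only; the names used here
(`WHITELIST`, `BLACKLIST`, `crawl_worker`, and so on) are assumptions and do not reflect crawl.py's actual structure.

```python
# Minimal sketch of a threaded crawl loop with whitelist/blacklist filtering.
# Names here are illustrative assumptions, not crawl.py's API.
import queue
import threading
import urllib.request
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

WHITELIST = {"example.com"}   # domains the crawler is allowed to visit
BLACKLIST = {"/logout"}       # path fragments to skip

url_queue = queue.Queue()
seen = set()
seen_lock = threading.Lock()

def allowed(url):
    parts = urlparse(url)
    return parts.netloc in WHITELIST and not any(b in parts.path for b in BLACKLIST)

def crawl_worker():
    while True:
        url = url_queue.get()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                soup = BeautifulSoup(resp.read(), "html.parser")
        except Exception:
            url_queue.task_done()
            continue
        # Queue every new, allowed link found on the page.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            with seen_lock:
                if allowed(link) and link not in seen:
                    seen.add(link)
                    url_queue.put(link)
        url_queue.task_done()

url_queue.put("http://example.com/")
for _ in range(4):  # a handful of worker threads
    threading.Thread(target=crawl_worker, daemon=True).start()
url_queue.join()
```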
Data collected includes:
- url & parsed url (scheme/netloc/path/etc...)
- page load time in milliseconds
- page size in bytes
- link addresses on the page
- number of links on the page
- number of links within the page domain
- number of links targeting external domains
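As a rough illustration, the per-page data above could be gathered along these lines. This is a sketch only; the
function and field names are assumptions rather than the ones used in crawl.py.

```python
# Sketch of per-page metadata collection (names are illustrative assumptions).
import time
import urllib.request
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

def collect_page_metadata(url):
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
    load_ms = (time.monotonic() - start) * 1000.0

    soup = BeautifulSoup(body, "html.parser")
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    netloc = urlparse(url).netloc
    internal = [link for link in links if urlparse(link).netloc == netloc]

    return {
        "url": url,
        "parsed_url": urlparse(url),       # scheme/netloc/path/...
        "load_time_ms": load_ms,
        "page_size_bytes": len(body),
        "links": links,
        "link_count": len(links),
        "internal_link_count": len(internal),
        "external_link_count": len(links) - len(internal),
    }
```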
Links on each page are recorded in the `url_canonical` table.
Visits to each link are recorded in the `visit_metadata` table.
Relationships between a visited link and all the links in the response are recorded in the `page_rel` table.
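One possible shape for these tables, sketched with sqlite3 purely for illustration; the actual column names and
database backend used by crawl.py may differ.

```python
# Illustrative table layout only; crawl.py's real schema and backend may differ.
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS url_canonical (
    id  INTEGER PRIMARY KEY,
    url TEXT UNIQUE                        -- every link discovered on any page
);
CREATE TABLE IF NOT EXISTS visit_metadata (
    id                  INTEGER PRIMARY KEY,
    url_id              INTEGER REFERENCES url_canonical(id),
    load_time_ms        REAL,              -- page load time in milliseconds
    page_size_bytes     INTEGER,           -- response size in bytes
    link_count          INTEGER,
    internal_link_count INTEGER,
    external_link_count INTEGER
);
CREATE TABLE IF NOT EXISTS page_rel (
    visited_url_id INTEGER REFERENCES url_canonical(id),
    linked_url_id  INTEGER REFERENCES url_canonical(id)
);
""")
conn.commit()
```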
Planned functionality:
- addition of an optional 'scrape page' function
- global toggle to allow already-visited URLs to be requeued
- usage guidelines and documentation
Known issues:
- URL encoding is sometimes incorrect, which can cause duplicate entries and results in errors when opening requests
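One possible mitigation, sketched below, is to normalize and percent-encode URLs once before they are queued. This is
an assumption about a potential fix, not something crawl.py currently does.

```python
# Sketch of URL normalization to reduce encoding errors and duplicate entries
# (a possible mitigation, not part of crawl.py today).
from urllib.parse import urlsplit, urlunsplit, quote

def normalize_url(url):
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        quote(parts.path, safe="/%"),      # percent-encode the path once
        quote(parts.query, safe="=&%"),    # keep query structure intact
        "",                                # drop fragments; they never reach the server
    ))
```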