Python-driven web crawler and scraper. Uses BeautifulSoup to gather all URLs from each page it visits, crawling outward from a start URL while honoring the whitelist/blacklist criteria populated in crawl.py.

Crawl.py

Crawl.py is a threaded web crawler that crawls a specified domain (or collection of domains) and collects metadata
about each page it visits.

This crawler can be used to analyze the performance of a web server, analyze relationships between pages, or traverse a site and scrape its pages.
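
The core gathering step — fetch a page, collect every anchor with BeautifulSoup, and filter the results against the whitelist/blacklist — might look roughly like this. It is a minimal sketch under assumed names: WHITELIST, BLACKLIST, and gather_links are illustrative, not identifiers from crawl.py.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

WHITELIST = {"example.com"}        # domains the crawl may enter (hypothetical)
BLACKLIST = ("/logout", "/admin")  # path prefixes to skip (hypothetical)

def gather_links(page_url):
    """Fetch a page and return the absolute, filtered URLs it links to."""
    response = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        url = urljoin(page_url, anchor["href"])  # resolve relative hrefs
        parsed = urlparse(url)
        if parsed.netloc not in WHITELIST:       # enforce the whitelist
            continue
        if parsed.path.startswith(BLACKLIST):    # enforce the blacklist
            continue
        links.add(url)
    return links
```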

Data collected includes:
    - url & parsed url (scheme/netloc/path/etc.)
    - page load time in milliseconds
    - page size in bytes
    - link addresses on the page
    - number of links on the page
    - number of links within the page's domain
    - number of links targeting external domains
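
A per-visit collection routine covering those fields could be sketched as below. The visit function and the dictionary keys are assumptions for illustration, and relative hrefs (which have an empty netloc) are counted as in-domain.

```python
import time
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def visit(url):
    """Fetch one page and return the metadata fields listed above."""
    start = time.monotonic()
    response = requests.get(url, timeout=10)
    load_ms = (time.monotonic() - start) * 1000.0  # load time in milliseconds

    soup = BeautifulSoup(response.content, "html.parser")
    links = [a["href"] for a in soup.find_all("a", href=True)]
    domain = urlparse(url).netloc
    # Relative hrefs have an empty netloc, so they count as in-domain here.
    internal = [link for link in links if urlparse(link).netloc in ("", domain)]

    return {
        "url": url,
        "parsed_url": urlparse(url),          # scheme/netloc/path/...
        "load_time_ms": load_ms,
        "size_bytes": len(response.content),  # page size in bytes
        "links": links,
        "link_count": len(links),
        "internal_link_count": len(internal),
        "external_link_count": len(links) - len(internal),
    }
```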

Links on each page are recorded in the 'url_canonical' table.
Visits to each link are recorded in the 'visit_metadata' table.
Relationships between a visited link and all the links in the response are recorded in the 'page_rel' table.
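
The README does not spell out the storage backend or the column layout, so the following is only one plausible shape for the three tables, sketched with SQLite; the actual schema in crawl.py may differ.

```python
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS url_canonical (   -- every link seen, deduplicated
    id  INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS visit_metadata (  -- one row per visit to a link
    url_id       INTEGER REFERENCES url_canonical(id),
    load_time_ms REAL,
    size_bytes   INTEGER,
    link_count   INTEGER
);
CREATE TABLE IF NOT EXISTS page_rel (        -- visited page -> links it contains
    source_id INTEGER REFERENCES url_canonical(id),
    target_id INTEGER REFERENCES url_canonical(id)
);
""")
conn.commit()
```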


Planned functionality:
- addition of an optional 'scrape page' function
- global toggle to allow already-visited URLs to be requeued
- usage guidelines and documentation

Known issues:
- URL encoding is sometimes incorrect, which can create duplicate queue entries and does cause errors when opening requests; one possible mitigation is sketched below.
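
One way to reduce both symptoms would be to normalize every URL before queueing it, so equivalent forms compare equal and requests open cleanly. This is a suggested mitigation, not existing crawl.py code, and normalize is a hypothetical helper.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def normalize(url):
    """Re-encode the path and query so equivalent URLs compare equal."""
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")     # '%' stays safe so existing
    query = quote(parts.query, safe="=&%")  # escapes are not double-encoded
    # Lowercase the host and drop the fragment, which a crawler never needs.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))
```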
