Python-driven web crawler and scraper. Uses BeautifulSoup to gather all URLs from each page it visits, crawling outward from a start URL while honoring the whitelist/blacklist criteria populated in crawl.py.

Crawl.py

Crawl.py is a threaded web crawler that crawls a specified domain (or collection of domains) and collects metadata
about each page it visits.

This crawler can be used to analyze the performance of a web server, analyze relationships between pages, or traverse a site and scrape its pages.
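
The core gathering step — fetch a page, collect every anchor with BeautifulSoup, and filter the results against the whitelist/blacklist — might look roughly like this. It is a minimal sketch under assumed names: WHITELIST, BLACKLIST, and gather_links are illustrative, not identifiers from crawl.py.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

WHITELIST = {"example.com"}        # domains the crawl may enter (hypothetical)
BLACKLIST = ("/logout", "/admin")  # path prefixes to skip (hypothetical)

def gather_links(page_url):
    """Fetch a page and return the absolute, filtered URLs it links to."""
    response = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        url = urljoin(page_url, anchor["href"])  # resolve relative hrefs
        parsed = urlparse(url)
        if parsed.netloc not in WHITELIST:       # enforce the whitelist
            continue
        if parsed.path.startswith(BLACKLIST):    # enforce the blacklist
            continue
        links.add(url)
    return links
```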

Data collected includes:
    - url & parsed url (scheme/netloc/path/etc.)
    - page load time in milliseconds
    - page size in bytes
    - link addresses on the page
    - number of links on the page
    - number of links within the page's domain
    - number of links targeting external domains
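
A per-visit collection routine covering those fields could be sketched as below. The visit function and the dictionary keys are assumptions for illustration, and relative hrefs (which have an empty netloc) are counted as in-domain.

```python
import time
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def visit(url):
    """Fetch one page and return the metadata fields listed above."""
    start = time.monotonic()
    response = requests.get(url, timeout=10)
    load_ms = (time.monotonic() - start) * 1000.0  # load time in milliseconds

    soup = BeautifulSoup(response.content, "html.parser")
    links = [a["href"] for a in soup.find_all("a", href=True)]
    domain = urlparse(url).netloc
    # Relative hrefs have an empty netloc, so they count as in-domain here.
    internal = [link for link in links if urlparse(link).netloc in ("", domain)]

    return {
        "url": url,
        "parsed_url": urlparse(url),          # scheme/netloc/path/...
        "load_time_ms": load_ms,
        "size_bytes": len(response.content),  # page size in bytes
        "links": links,
        "link_count": len(links),
        "internal_link_count": len(internal),
        "external_link_count": len(links) - len(internal),
    }
```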

Links on each page are recorded in the 'url_canonical' table.
Visits to each link are recorded in the 'visit_metadata' table.
Relationships between a visited link and all the links in the response are recorded in the 'page_rel' table.
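
The README does not spell out the storage backend or the column layout, so the following is only one plausible shape for the three tables, sketched with SQLite; the actual schema in crawl.py may differ.

```python
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS url_canonical (   -- every link seen, deduplicated
    id  INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS visit_metadata (  -- one row per visit to a link
    url_id       INTEGER REFERENCES url_canonical(id),
    load_time_ms REAL,
    size_bytes   INTEGER,
    link_count   INTEGER
);
CREATE TABLE IF NOT EXISTS page_rel (        -- visited page -> links it contains
    source_id INTEGER REFERENCES url_canonical(id),
    target_id INTEGER REFERENCES url_canonical(id)
);
""")
conn.commit()
```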


Planned functionality:
- addition of an optional 'scrape page' function
- global toggle to allow already-visited URLs to be requeued
- usage guidelines and documentation

Known issues:
- URL encoding is sometimes incorrect, which can create duplicate queue entries and does cause errors when opening requests; one possible mitigation is sketched below.
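
One way to reduce both symptoms would be to normalize every URL before queueing it, so equivalent forms compare equal and requests open cleanly. This is a suggested mitigation, not existing crawl.py code, and normalize is a hypothetical helper.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def normalize(url):
    """Re-encode the path and query so equivalent URLs compare equal."""
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")     # '%' stays safe so existing
    query = quote(parts.query, safe="=&%")  # escapes are not double-encoded
    # Lowercase the host and drop the fragment, which a crawler never needs.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))
```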
