- Setup environment using `Pipenv` (Python 3.9)
- Add `.env` & `.env.template` for environment variables
- Make CLI using `click` or `Typer`
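If `click` is adopted, the entry point could be sketched roughly like this (the command and option names are hypothetical placeholders):

```python
import click


@click.group()
def cli():
    """Job-postings scraper CLI (hypothetical command layout)."""


@cli.command()
@click.option("--pages", default=1, show_default=True,
              help="Number of listing pages to crawl")
def scrape(pages):
    """Run a scrape over the given number of pages (placeholder body)."""
    click.echo(f"Scraping {pages} page(s)")


if __name__ == "__main__":
    cli()
```

`Typer` would look nearly identical, with type-hinted function parameters instead of decorators for options.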
- Add `mypy` tests
- Fully utilize `scrapy` Spiders (currently works with `requests`)
- Setup `cron` jobs for automatic daily scraping
- Add type hints
- Stylize code to be max 80 characters per line
- Convert all `str.format()` syntax to f-strings (`f"{...}"`)
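The conversion is mechanical; a before/after sketch with a made-up record:

```python
job = {"title": "Data Engineer", "company": "Acme"}  # hypothetical record

# Before: str.format()
msg_old = "Scraped {} at {}".format(job["title"], job["company"])

# After: f-string, same output, easier to read
msg_new = f"Scraped {job['title']} at {job['company']}"

assert msg_old == msg_new
```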
- Setup proper architecture for: Scraper - DB - DAE - API - Dashboard
- Add docstrings
- When saving to `.json`, stylize/prettify the content
- Correct the encoding for Armenian characters (UTF-8)
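Both `.json` items above can likely be covered by `json.dump` arguments alone; a sketch with a made-up record containing Armenian text:

```python
import io
import json

# Hypothetical record; field names follow the TODO items above
posting = {"Company_Title": "Օրինակ ՍՊԸ", "Job_views": 42}

buf = io.StringIO()
# indent=2 prettifies the file; ensure_ascii=False writes Armenian
# characters as-is instead of \uXXXX escapes
json.dump(posting, buf, indent=2, ensure_ascii=False)
pretty = buf.getvalue()
```

When writing to an actual file, opening it with `encoding="utf-8"` keeps the non-ASCII characters intact.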
- Add URL to extracted data fields
- Add Company URL to extracted data fields
- Add `Foundation_date` and `Telephone` fields to companies
- Scrape all companies so that new companies can be detected properly
- Add functionality for scraping company information
- Change data storing directory (currently in `notebooks/`)
- Add progress bars
- Utilize `collections.defaultdict`
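One place `collections.defaultdict` tends to help in a scraper is grouping postings by company without key-existence checks; a sketch with made-up records:

```python
from collections import defaultdict

# Hypothetical scraped records; field names follow the TODO items
postings = [
    {"Company_Title": "Acme", "Job_Title": "Engineer"},
    {"Company_Title": "Acme", "Job_Title": "Analyst"},
    {"Company_Title": "Beta", "Job_Title": "Manager"},
]

# Missing keys start as an empty list automatically
by_company = defaultdict(list)
for p in postings:
    by_company[p["Company_Title"]].append(p["Job_Title"])
```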
- Utilize `urllib.parse.urljoin` for joining the base URL with relative pages
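`urljoin` avoids hand-rolled slash handling; for example, with a hypothetical base URL:

```python
from urllib.parse import urljoin

BASE_URL = "https://example.com/jobs/"  # hypothetical base

# Relative path: appended to the base's directory
assert urljoin(BASE_URL, "view/123") == "https://example.com/jobs/view/123"

# Absolute path: replaces everything after the host
assert urljoin(BASE_URL, "/companies/7") == "https://example.com/companies/7"
```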
- Need to update the `Job_views` field daily
- Implement saving to `.csv` functionality
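The `.csv` saving could be a thin wrapper around `csv.DictWriter`; a sketch with made-up postings (written to a string buffer here, a file in practice):

```python
import csv
import io

# Hypothetical records; field names follow the TODO items
postings = [
    {"Job_Title": "Engineer", "Company_Title": "Acme", "Job_views": 42},
    {"Job_Title": "Analyst", "Company_Title": "Beta", "Job_views": 7},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["Job_Title", "Company_Title", "Job_views"]
)
writer.writeheader()      # one header row
writer.writerows(postings)  # one row per record
csv_text = buf.getvalue()
```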
- Save logs for the daily crawl (appending messages in the `main` function)
- Need to check previously scraped data to avoid duplicate crawling
- Add summarizing daily logs with the following fields: `{"date": {"date": "datetime", "weekday": "str", "postings": "int", "new_postings": "int", "new_companies": "int"}}`
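One possible shape for building such a summary entry (the helper name and its inputs are assumptions; the output fields follow the TODO above):

```python
from datetime import date


def summarize_day(day, postings, previous_ids, previous_companies):
    """Build one daily summary entry (hypothetical helper)."""
    new_postings = [p for p in postings if p["id"] not in previous_ids]
    new_companies = {p["Company_Title"] for p in postings} - previous_companies
    return {
        day.isoformat(): {
            "date": day.isoformat(),
            "weekday": day.strftime("%A"),
            "postings": len(postings),
            "new_postings": len(new_postings),
            "new_companies": len(new_companies),
        }
    }
```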
- Change behaviour of the crawler to store the `new h3 field` message's info in the `Additional_Info` field, instead of printing to the console
- Change the `tqdm` message so that it prints the URL being scraped
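`tqdm` supports updating the bar prefix per iteration, which fits this item; a sketch with hypothetical URLs:

```python
from tqdm import tqdm

urls = [  # hypothetical listing URLs
    "https://example.com/jobs/1",
    "https://example.com/jobs/2",
]

scraped = []
bar = tqdm(urls)
for url in bar:
    # Show the current URL in the progress-bar prefix
    bar.set_description(f"Scraping {url}")
    scraped.append(url)  # the actual fetch/parse would go here
```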
- Email notification if scraping fails for some reason
- Fix company title fetching (currently `None`)
- Store `int`-s & `float`-s properly in `.json` files (currently stored as `str`)
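One way to fix the `str`-typed numbers is to coerce them before dumping to `.json`; a sketch (the helper is hypothetical):

```python
import json

raw = {"Job_views": "42", "Rating": "4.5"}  # scraped values arrive as str


def coerce_number(value):
    """Convert numeric-looking strings to int/float, leave other text alone."""
    try:
        return int(value)
    except ValueError:
        try:
            return float(value)
        except ValueError:
            return value


typed = {k: coerce_number(v) for k, v in raw.items()}
```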
- Recover URLs for previously scraped data
- Fix company data storing (`list.extend()` instead of `list.append()`)
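The `append`/`extend` difference in one snippet, with made-up company records:

```python
batch = [{"Company_Title": "Acme"}, {"Company_Title": "Beta"}]

# Bug: append() adds the whole batch as ONE nested element
buggy = []
buggy.append(batch)  # buggy == [[{...}, {...}]]

# Fix: extend() adds each company individually
companies = []
companies.extend(batch)  # companies == [{...}, {...}]
```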
- Finalize Company info crawling in the `main()` function
- Fix bug related to `crawl_all_companies()` output (`list` instead of `dict`)
- Fix Company `Info` field scraping (appends all companies together)
- Strip scraped `str` data (e.g. `Company_Title`)
- Fix scraping of the `Additional_information` field
- Setup RDBMS or NoSQL (`PostgreSQL`/`SQLite` or `MongoDB`/`Redis`)
- Setup ORM (Object Relational Mapper: `SQLAlchemy.orm`/`PeeWee`)
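If `SQLAlchemy` is chosen, a minimal model sketch for the companies table could look like this (column names are assumptions based on the fields mentioned above; an in-memory SQLite database stands in for PostgreSQL):

```python
from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Company(Base):
    __tablename__ = "companies"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    url = Column(String)
    foundation_date = Column(Date)   # from the Foundation_date TODO item
    telephone = Column(String)       # from the Telephone TODO item


engine = create_engine("sqlite:///:memory:")  # swap for PostgreSQL later
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Company(title="Acme", telephone="+374 10 000000"))
    session.commit()
```

Requires SQLAlchemy 1.4+, where `declarative_base` and the `Session` context manager live in `sqlalchemy.orm`.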