BME Software Architectures assignment, 2022. It scrapes the MÁV RSS feed and analyses the data with NLP.
To install the software, first set the environment variables, either in the .env file or in your operating system.
The program expects the following environment variables:
- COGNITIVE_SERVICE_KEY: required; an Azure API key.
- COGNITIVE_SERVICE_BASE (default='bme-mav-nlp'): the API endpoint name to use; it is automatically substituted into the full endpoint URL.
- COGNITIVE_SERVICE_ENDPOINT (default='https://COGNITIVE_SERVICE_BASE.cognitiveservices.azure.com/'): the actual endpoint to use for Azure NLP.
- RSS_FEED_STORAGE_LOCATION (default='rss_feed_collection.csv'): where data from the RSS feed is stored.
- INCIDENTS_STORAGE_LOCATION (default='incidents.csv'): where NLP results are stored.
- INCIDENTS_STORAGE_LOCATION (default='feed.log'): incident storage creates logs, which are stored here.
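As a rough illustration of how these variables and their defaults fit together, here is a minimal sketch. It is not the project's actual configuration code: the function name `load_settings` and the keys of the returned dictionary are hypothetical, and only the variable names and defaults come from the list above. It covers the API key, the endpoint, and the two CSV locations.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Resolve settings from environment variables (hypothetical helper)."""
    key = env.get("COGNITIVE_SERVICE_KEY")
    if not key:
        # The key is the only variable without a default.
        raise RuntimeError("COGNITIVE_SERVICE_KEY is required")

    # The base name is only used to build the default endpoint URL.
    base = env.get("COGNITIVE_SERVICE_BASE", "bme-mav-nlp")
    endpoint = env.get(
        "COGNITIVE_SERVICE_ENDPOINT",
        f"https://{base}.cognitiveservices.azure.com/",
    )

    return {
        "key": key,
        "endpoint": endpoint,
        "rss_feed_storage": env.get(
            "RSS_FEED_STORAGE_LOCATION", "rss_feed_collection.csv"
        ),
        "incidents_storage": env.get(
            "INCIDENTS_STORAGE_LOCATION", "incidents.csv"
        ),
    }
```

Under this sketch, setting only COGNITIVE_SERVICE_KEY would be enough to get a working configuration, and an explicit COGNITIVE_SERVICE_ENDPOINT would take precedence over the one derived from COGNITIVE_SERVICE_BASE.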
The package requires Python 3.10 or later.
To install the packages simply run:
pip install -r requirements.txt
python -m nltk.downloader stopwords
python -m spacy download en_core_web_md

To run the main loop, use the command-line interface:
python -m scr main-scrape-loop --sleep_time <seconds>
where the --sleep_time parameter sets how many seconds to wait between RSS fetches.
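The fetch-then-wait behaviour described above can be pictured with the following sketch. It is an assumption about the general shape of such a loop, not the repository's implementation: `scrape_loop`, `fetch`, and `max_iterations` are hypothetical names, and the `sleep` argument exists only so the loop can be exercised without real delays.

```python
import time
from typing import Callable, Optional


def scrape_loop(
    fetch: Callable[[], None],
    sleep_time: float,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch repeatedly, waiting sleep_time seconds between calls.

    Runs forever when max_iterations is None; returns the number of fetches.
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        fetch()  # one RSS fetch (hypothetical callback)
        done += 1
        if max_iterations is not None and done >= max_iterations:
            break  # no need to wait after the final fetch
        sleep(sleep_time)
    return done
```

For example, `scrape_loop(fetch_feed, sleep_time=60, max_iterations=3)` would call a user-supplied `fetch_feed` three times with a 60-second pause between calls.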
This repository uses pip-tools.
To set up a development environment, run:
pip install -r dev-requirements.txt

To update the requirements of the project, modify the appropriate *requirements.in file, and then run:
pip-compile requirements.in --upgrade --resolver=backtracking

To upgrade your virtual environment, run:
pip-sync

To set up production, use requirements.txt.