A comprehensive Apache Airflow-based ETL pipeline for extracting financial data from SEC EDGAR filings. This pipeline automatically downloads, processes, and stores financial statements and key metrics for S&P 500 companies.
- Automated Data Extraction: Downloads 10-K and 10-Q filings from SEC EDGAR
- Financial Statements: Extracts balance sheets, income statements, and statements of equity
- Key Metrics: Captures EPS (basic/diluted), revenue, and other financial facts
- Incremental Processing: Tracks last execution date to avoid duplicate extractions
- S&P 500 Coverage: Processes all S&P 500 companies
- Structured Output: Saves data in organized CSV format with proper directory structure
```
Financial_ETL/
├── dags/
│   └── extraction_dag.py        # Main Airflow DAG for financial data extraction
├── data/
│   ├── sp500_companies.csv      # S&P 500 company list
│   ├── extraction_log.json      # Execution tracking log
│   └── [TICKER]/                # Company-specific data
│       ├── 10k/                 # 10-K filings
│       │   ├── balance_sheet/
│       │   ├── income_statement/
│       │   ├── statement_of_equity/
│       │   ├── basic_eps/
│       │   ├── diluted_eps/
│       │   └── revenue/
│       └── 10q/                 # 10-Q filings
│           ├── balance_sheet/
│           ├── income_statement/
│           ├── statement_of_equity/
│           ├── basic_eps/
│           ├── diluted_eps/
│           └── revenue/
├── models/
│   └── staging/                 # Staging area for processed data
├── airflow/                     # Airflow home directory
├── logs/                        # Airflow logs
├── venv/                        # Python virtual environment
├── requirements.txt             # Python dependencies
├── setup_env.py                 # Environment setup script
├── extraction.env               # Environment variables configuration
├── sp500_companies.csv          # S&P 500 companies list
└── README.md                    # This file
```
- Python 3.10+
- Apache Airflow 3.0.2+
- Access to SEC EDGAR API
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd Financial_ETL
  ```

- Create and activate a virtual environment:

  ```bash
  python3.10 -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Install Airflow (the constraints file should match the Python version used above, here 3.10):

  ```bash
  pip install "apache-airflow[celery]==3.0.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.0.2/constraints-3.10.txt"
  ```
- Set up the environment:

  ```bash
  # Run the environment setup script
  python setup_env.py
  ```

  This script will:
  - Load environment variables from `extraction.env`
  - Create the necessary data directory structure
  - Set up ticker-specific directories for all S&P 500 companies
- Set up Airflow:

  ```bash
  airflow db init
  airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email [email protected] \
    --password admin
  ```
The project uses `extraction.env` to manage all environment variables. This file contains:
```
# Airflow Configuration
AIRFLOW_HOME=/Users/kuot/Documents/Financial_ETL/airflow
AIRFLOW__CORE__DAGS_FOLDER=/Users/kuot/Documents/Financial_ETL/dags
AIRFLOW__EMAIL__FROM_EMAIL=David Kuo <[email protected]>

# Data Directory Configuration
OUTPUT_DATA_DIR=./data
```

Important:

- Update the `AIRFLOW_HOME` and `AIRFLOW__CORE__DAGS_FOLDER` paths in `extraction.env` to match your system
- Update `AIRFLOW__EMAIL__FROM_EMAIL` with your identity for SEC EDGAR API access
- Set the `sec_identity` Airflow Variable with your identity (e.g., 'John Doe [email protected]')
Set the required Airflow Variable for SEC API access:

```bash
airflow variables set sec_identity "Your Name [email protected]"
```

The `setup_env.py` script automatically:

- Loads environment variables from `extraction.env`
- Creates the data directory structure
- Sets up ticker-specific directories for all S&P 500 companies
- Provides fallback values if environment variables are not set
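For illustration, here is one way the extraction code could consume the `sec_identity` Variable before downloading filings. This is a minimal sketch assuming edgartools' `set_identity` helper and the Airflow 3 `airflow.sdk` Variable API, not a copy of the code in `extraction_dag.py`:

```python
# Illustrative sketch: feed the sec_identity Airflow Variable to edgartools.
from airflow.sdk import Variable   # Airflow 3.x task SDK
from edgar import set_identity     # edgartools

# SEC EDGAR expects a declared identity ("Name email") on every request.
set_identity(Variable.get("sec_identity"))
```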
- Start the Airflow webserver:

  ```bash
  airflow webserver --port 8080
  ```

- Start the Airflow scheduler:

  ```bash
  airflow scheduler
  ```
- Set up the environment (if not already done):

  ```bash
  python setup_env.py
  ```

- Start Airflow services (if not already running):

  ```bash
  airflow webserver --port 8080 &
  airflow scheduler &
  ```

- Access the Airflow UI: open http://localhost:8080 in your browser

- Enable the DAG: in the Airflow UI, find `financial_data_extraction` and toggle it on

- Trigger a manual run (optional): click "Trigger DAG" to run immediately, or use the CLI command shown below
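The same run can also be started from the command line (assuming the Airflow CLI is pointed at this instance):

```bash
airflow dags trigger financial_data_extraction
```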
The pipeline consists of several tasks:
- `get_tickers`: Reads the S&P 500 company list from CSV
- `get_filing_data`: Manages the extraction log and determines the filing date range
- `extract_financial_data`: Downloads and processes financial data
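As a rough illustration of how these tasks fit together, here is a minimal TaskFlow-style skeleton. The task bodies, signatures, and imports are assumptions for illustration, not the contents of `extraction_dag.py`:

```python
# Illustrative skeleton only; the real DAG contains the full extraction logic.
from datetime import datetime

from airflow.sdk import dag, task  # Airflow 3.x TaskFlow API


@dag(
    dag_id="financial_data_extraction",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["extraction_dag"],
)
def financial_data_extraction():
    @task
    def get_tickers() -> list[str]:
        # Read data/sp500_companies.csv and return the ticker symbols.
        ...

    @task
    def get_filing_data() -> dict:
        # Read/update data/extraction_log.json and compute the filing date range.
        ...

    @task
    def extract_financial_data(filing_data_info: dict, tickers: list[str]) -> None:
        # Download 10-K/10-Q filings from SEC EDGAR and write per-ticker CSVs.
        ...

    tickers = get_tickers()
    filing_data_info = get_filing_data()
    extract_financial_data(filing_data_info, tickers)


financial_data_extraction()
```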
The pipeline extracts the following data for each company:
- Balance Sheet: Assets, liabilities, and equity
- Income Statement: Revenue, expenses, and net income
- Statement of Equity: Changes in shareholders' equity
- Basic EPS: Earnings per share (basic)
- Diluted EPS: Earnings per share (diluted)
- Revenue: Total revenue
Each financial statement is saved in two formats:

- Concept-based: Raw financial concepts and values
- Label-based: Human-readable financial labels and values
```
data/
└── AAPL/
    ├── 10k/
    │   ├── balance_sheet/
    │   │   ├── 2023-09-30_concept.csv
    │   │   └── 2023-09-30_label.csv
    │   ├── income_statement/
    │   ├── statement_of_equity/
    │   ├── basic_eps/
    │   │   └── 2023-09-30.csv
    │   ├── diluted_eps/
    │   └── revenue/
    └── 10q/
        └── [similar structure]
```
- Concept files: Raw XBRL concepts with date columns
- Label files: Human-readable labels with date columns
- Single CSV per filing date: Contains concept, value, and metadata
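For example, one of the extracted files could be loaded with pandas like this (the path matches the sample tree above; the exact column names depend on the filing):

```python
import pandas as pd

# Concept-level balance sheet extracted from Apple's FY2023 10-K.
df = pd.read_csv("data/AAPL/10k/balance_sheet/2023-09-30_concept.csv")
print(df.head())
```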
- `edgartools`: SEC EDGAR data extraction library
- `pandas>=1.5.0`: Data manipulation and analysis
- `requests>=2.28.0`: HTTP library for API calls
- `python-dotenv>=0.19.0`: Environment variable management
- `apache-airflow[celery]==3.0.2`: Workflow orchestration platform
The pipeline maintains `data/extraction_log.json` to track:
- Last execution date
- Creation and update timestamps
- Execution history
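The file might look roughly like the following; only `last_execution_date` is referenced elsewhere in this README, so the other field names are illustrative:

```json
{
  "last_execution_date": "2025-06-30",
  "created_at": "2025-01-01T00:00:00",
  "updated_at": "2025-06-30T06:00:00",
  "executions": [
    {"date": "2025-06-30", "status": "success"}
  ]
}
```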
- Task-specific logs available in Airflow UI
- Error handling and retry mechanisms
- Execution status tracking
The pipeline includes comprehensive error handling:
- Missing Environment Variables: Clear error messages for configuration issues
- API Failures: Graceful handling of SEC API timeouts and errors
- Data Processing: Skips problematic filings while continuing with others
- File System: Handles missing directories and file permission issues
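The skip-and-continue behavior can be pictured like this; the helper name comes from the Customization notes below, but its signature and the surrounding loop are assumptions, not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)


def extract_financial_facts(filing, ticker):
    """Placeholder for the real extraction logic in extraction_dag.py."""
    ...


def process_filings(filings, ticker):
    # Process each filing independently so one bad filing does not abort the run.
    for filing in filings:
        try:
            extract_financial_facts(filing, ticker)
        except Exception:
            # Log the failure and keep going with the remaining filings.
            logger.exception("Skipping filing %s for %s", filing, ticker)
```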
- DAG ID: `financial_data_extraction`
- Schedule: `@daily` (runs every day)
- Start Date: January 1, 2025
- Catchup: Disabled
- Tags: `["extraction_dag"]`
The DAG is currently configured to process only AAPL for testing purposes. To process all S&P 500 companies, update the last line in `extraction_dag.py`:

```python
# Current (test mode):
extract_financial_data(filing_data_info, ['AAPL'])

# For full S&P 500 processing:
extract_financial_data(filing_data_info, tickers)
```

- Update `data/sp500_companies.csv` with new ticker symbols
- The pipeline will automatically process new companies
- Modify the `extract_financial_facts()` function
- Add new concepts to the extraction list
- Update the directory structure as needed
The pipeline uses incremental processing based on the last execution date (sketched below). To modify the date range:

- Edit the `extraction_log.json` file
- Update the `last_execution_date` field
- Or delete the file to start fresh from 2015-01-01
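A minimal sketch of that incremental logic, using a hypothetical helper name; the log path and the 2015-01-01 default come from this README, but the real `get_filing_data` task may differ:

```python
import json
from datetime import date
from pathlib import Path

LOG_PATH = Path("data/extraction_log.json")
DEFAULT_START = "2015-01-01"  # used when no extraction log exists yet


def get_filing_date_range(today: date | None = None) -> tuple[str, str]:
    """Return the (start, end) filing date range for the next run."""
    today = today or date.today()
    if LOG_PATH.exists():
        log = json.loads(LOG_PATH.read_text())
        start = log.get("last_execution_date", DEFAULT_START)
    else:
        start = DEFAULT_START
    return start, today.isoformat()
```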