A comprehensive Apache Airflow-based ETL pipeline for extracting financial data from SEC EDGAR filings. This pipeline automatically downloads, processes, and stores financial statements and key metrics for S&P 500 companies.
- Automated Data Extraction: Downloads 10-K and 10-Q filings from SEC EDGAR
- Financial Statements: Extracts balance sheets, income statements, and statements of equity
- Key Metrics: Captures EPS (basic/diluted), revenue, and other financial facts
- Incremental Processing: Tracks last execution date to avoid duplicate extractions
- S&P 500 Coverage: Processes all S&P 500 companies
- Structured Output: Saves data in organized CSV format with proper directory structure
```
Financial_ETL/
├── dags/
│   └── extraction_dag.py        # Main Airflow DAG for financial data extraction
├── data/
│   ├── sp500_companies.csv      # S&P 500 company list
│   ├── extraction_log.json      # Execution tracking log
│   └── [TICKER]/                # Company-specific data
│       ├── 10k/                 # 10-K filings
│       │   ├── balance_sheet/
│       │   ├── income_statement/
│       │   ├── statement_of_equity/
│       │   ├── basic_eps/
│       │   ├── diluted_eps/
│       │   └── revenue/
│       └── 10q/                 # 10-Q filings
│           ├── balance_sheet/
│           ├── income_statement/
│           ├── statement_of_equity/
│           ├── basic_eps/
│           ├── diluted_eps/
│           └── revenue/
├── models/
│   └── staging/                 # Staging area for processed data
├── airflow/                     # Airflow home directory
├── logs/                        # Airflow logs
├── venv/                        # Python virtual environment
├── requirements.txt             # Python dependencies
├── setup_env.py                 # Environment setup script
├── extraction.env               # Environment variables configuration
├── sp500_companies.csv          # S&P 500 companies list
└── README.md                    # This file
```
- Python 3.10+
- Apache Airflow 3.0.2+
- Access to SEC EDGAR API
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd Financial_ETL
  ```

- Create and activate a virtual environment:

  ```bash
  python3.10 -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Install Airflow (the constraints file should match the Python version used above, here 3.10):

  ```bash
  pip install "apache-airflow[celery]==3.0.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.0.2/constraints-3.10.txt"
  ```
- Set up the environment:

  ```bash
  # Run the environment setup script
  python setup_env.py
  ```

  This script will:
  - Load environment variables from `extraction.env`
  - Create the necessary data directory structure
  - Set up ticker-specific directories for all S&P 500 companies
- Set up Airflow:

  ```bash
  airflow db init
  airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email [email protected] \
    --password admin
  ```
The project uses `extraction.env` to manage all environment variables. This file contains:
```
# Airflow Configuration
AIRFLOW_HOME=/Users/kuot/Documents/Financial_ETL/airflow
AIRFLOW__CORE__DAGS_FOLDER=/Users/kuot/Documents/Financial_ETL/dags
AIRFLOW__EMAIL__FROM_EMAIL=David Kuo <[email protected]>

# Data Directory Configuration
OUTPUT_DATA_DIR=./data
```

Important:

- Update the `AIRFLOW_HOME` and `AIRFLOW__CORE__DAGS_FOLDER` paths in `extraction.env` to match your system
- Update `AIRFLOW__EMAIL__FROM_EMAIL` with your identity for SEC EDGAR API access
- Set the `sec_identity` Airflow Variable with your identity (e.g., 'John Doe [email protected]')
Set the required Airflow Variable for SEC API access:

```bash
airflow variables set sec_identity "Your Name [email protected]"
```

The `setup_env.py` script automatically:

- Loads environment variables from `extraction.env`
- Creates the data directory structure
- Sets up ticker-specific directories for all S&P 500 companies
- Provides fallback values if environment variables are not set
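For illustration, here is one way the extraction code could consume the `sec_identity` Variable before downloading filings. This is a minimal sketch assuming edgartools' `set_identity` helper and the Airflow 3 `airflow.sdk` Variable API, not a copy of the code in `extraction_dag.py`:

```python
# Illustrative sketch: feed the sec_identity Airflow Variable to edgartools.
from airflow.sdk import Variable   # Airflow 3.x task SDK
from edgar import set_identity     # edgartools

# SEC EDGAR expects a declared identity ("Name email") on every request.
set_identity(Variable.get("sec_identity"))
```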
- Start the Airflow webserver:

  ```bash
  airflow webserver --port 8080
  ```

- Start the Airflow scheduler:

  ```bash
  airflow scheduler
  ```
- Set up the environment (if not already done):

  ```bash
  python setup_env.py
  ```

- Start Airflow services (if not already running):

  ```bash
  airflow webserver --port 8080 &
  airflow scheduler &
  ```

- Access the Airflow UI: open http://localhost:8080 in your browser

- Enable the DAG: in the Airflow UI, find `financial_data_extraction` and toggle it on

- Trigger a manual run (optional): click "Trigger DAG" to run immediately, or use the CLI command shown below
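The same run can also be started from the command line (assuming the Airflow CLI is pointed at this instance):

```bash
airflow dags trigger financial_data_extraction
```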
The pipeline consists of several tasks:
- `get_tickers`: Reads the S&P 500 company list from CSV
- `get_filing_data`: Manages the extraction log and determines the filing date range
- `extract_financial_data`: Downloads and processes financial data
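As a rough illustration of how these tasks fit together, here is a minimal TaskFlow-style skeleton. The task bodies, signatures, and imports are assumptions for illustration, not the contents of `extraction_dag.py`:

```python
# Illustrative skeleton only; the real DAG contains the full extraction logic.
from datetime import datetime

from airflow.sdk import dag, task  # Airflow 3.x TaskFlow API


@dag(
    dag_id="financial_data_extraction",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["extraction_dag"],
)
def financial_data_extraction():
    @task
    def get_tickers() -> list[str]:
        # Read data/sp500_companies.csv and return the ticker symbols.
        ...

    @task
    def get_filing_data() -> dict:
        # Read/update data/extraction_log.json and compute the filing date range.
        ...

    @task
    def extract_financial_data(filing_data_info: dict, tickers: list[str]) -> None:
        # Download 10-K/10-Q filings from SEC EDGAR and write per-ticker CSVs.
        ...

    tickers = get_tickers()
    filing_data_info = get_filing_data()
    extract_financial_data(filing_data_info, tickers)


financial_data_extraction()
```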
The pipeline extracts the following data for each company:
- Balance Sheet: Assets, liabilities, and equity
- Income Statement: Revenue, expenses, and net income
- Statement of Equity: Changes in shareholders' equity
- Basic EPS: Earnings per share (basic)
- Diluted EPS: Earnings per share (diluted)
- Revenue: Total revenue
Each financial statement is saved in two formats:

- Concept-based: Raw financial concepts and values
- Label-based: Human-readable financial labels and values
```
data/
└── AAPL/
    ├── 10k/
    │   ├── balance_sheet/
    │   │   ├── 2023-09-30_concept.csv
    │   │   └── 2023-09-30_label.csv
    │   ├── income_statement/
    │   ├── statement_of_equity/
    │   ├── basic_eps/
    │   │   └── 2023-09-30.csv
    │   ├── diluted_eps/
    │   └── revenue/
    └── 10q/
        └── [similar structure]
```
- Concept files: Raw XBRL concepts with date columns
- Label files: Human-readable labels with date columns
- Single CSV per filing date: Contains concept, value, and metadata
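For example, one of the extracted files could be loaded with pandas like this (the path matches the sample tree above; the exact column names depend on the filing):

```python
import pandas as pd

# Concept-level balance sheet extracted from Apple's FY2023 10-K.
df = pd.read_csv("data/AAPL/10k/balance_sheet/2023-09-30_concept.csv")
print(df.head())
```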
- `edgartools`: SEC EDGAR data extraction library
- `pandas>=1.5.0`: Data manipulation and analysis
- `requests>=2.28.0`: HTTP library for API calls
- `python-dotenv>=0.19.0`: Environment variable management
- `apache-airflow[celery]==3.0.2`: Workflow orchestration platform
The pipeline maintains `data/extraction_log.json` to track:
- Last execution date
- Creation and update timestamps
- Execution history
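The file might look roughly like the following; only `last_execution_date` is referenced elsewhere in this README, so the other field names are illustrative:

```json
{
  "last_execution_date": "2025-06-30",
  "created_at": "2025-01-01T00:00:00",
  "updated_at": "2025-06-30T06:00:00",
  "executions": [
    {"date": "2025-06-30", "status": "success"}
  ]
}
```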
- Task-specific logs available in Airflow UI
- Error handling and retry mechanisms
- Execution status tracking
The pipeline includes comprehensive error handling:
- Missing Environment Variables: Clear error messages for configuration issues
- API Failures: Graceful handling of SEC API timeouts and errors
- Data Processing: Skips problematic filings while continuing with others
- File System: Handles missing directories and file permission issues
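The skip-and-continue behavior can be pictured like this; the helper name comes from the Customization notes below, but its signature and the surrounding loop are assumptions, not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)


def extract_financial_facts(filing, ticker):
    """Placeholder for the real extraction logic in extraction_dag.py."""
    ...


def process_filings(filings, ticker):
    # Process each filing independently so one bad filing does not abort the run.
    for filing in filings:
        try:
            extract_financial_facts(filing, ticker)
        except Exception:
            # Log the failure and keep going with the remaining filings.
            logger.exception("Skipping filing %s for %s", filing, ticker)
```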
- DAG ID: `financial_data_extraction`
- Schedule: `@daily` (runs every day)
- Start Date: January 1, 2025
- Catchup: Disabled
- Tags: `["extraction_dag"]`
The DAG is currently configured to process only AAPL for testing purposes. To process all S&P 500 companies, update the last line in `extraction_dag.py`:

```python
# Current (test mode):
extract_financial_data(filing_data_info, ['AAPL'])

# For full S&P 500 processing:
extract_financial_data(filing_data_info, tickers)
```

- Update `data/sp500_companies.csv` with new ticker symbols
- The pipeline will automatically process new companies
- Modify the `extract_financial_facts()` function
- Add new concepts to the extraction list
- Update the directory structure as needed
The pipeline uses incremental processing based on the last execution date (sketched below). To modify the date range:

- Edit the `extraction_log.json` file
- Update the `last_execution_date` field
- Or delete the file to start fresh from 2015-01-01
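A minimal sketch of that incremental logic, using a hypothetical helper name; the log path and the 2015-01-01 default come from this README, but the real `get_filing_data` task may differ:

```python
import json
from datetime import date
from pathlib import Path

LOG_PATH = Path("data/extraction_log.json")
DEFAULT_START = "2015-01-01"  # used when no extraction log exists yet


def get_filing_date_range(today: date | None = None) -> tuple[str, str]:
    """Return the (start, end) filing date range for the next run."""
    today = today or date.today()
    if LOG_PATH.exists():
        log = json.loads(LOG_PATH.read_text())
        start = log.get("last_execution_date", DEFAULT_START)
    else:
        start = DEFAULT_START
    return start, today.isoformat()
```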