Financial ETL Pipeline

A comprehensive Apache Airflow-based ETL pipeline for extracting financial data from SEC EDGAR filings. This pipeline automatically downloads, processes, and stores financial statements and key metrics for S&P 500 companies.

Features

  • Automated Data Extraction: Downloads 10-K and 10-Q filings from SEC EDGAR
  • Financial Statements: Extracts balance sheets, income statements, and statements of equity
  • Key Metrics: Captures EPS (basic/diluted), revenue, and other financial facts
  • Incremental Processing: Tracks last execution date to avoid duplicate extractions
  • S&P 500 Coverage: Processes all S&P 500 companies
  • Structured Output: Saves data in organized CSV format with proper directory structure

Project Structure

Financial_ETL/
β”œβ”€β”€ dags/
β”‚   └── extraction_dag.py         # Main Airflow DAG for financial data extraction
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ sp500_companies.csv       # S&P 500 company list
β”‚   β”œβ”€β”€ extraction_log.json       # Execution tracking log
β”‚   └── [TICKER]/                 # Company-specific data
β”‚       β”œβ”€β”€ 10k/                  # 10-K filings
β”‚       β”‚   β”œβ”€β”€ balance_sheet/
β”‚       β”‚   β”œβ”€β”€ income_statement/
β”‚       β”‚   β”œβ”€β”€ statement_of_equity/
β”‚       β”‚   β”œβ”€β”€ basic_eps/
β”‚       β”‚   β”œβ”€β”€ diluted_eps/
β”‚       β”‚   └── revenue/
β”‚       └── 10q/                  # 10-Q filings
β”‚           β”œβ”€β”€ balance_sheet/
β”‚           β”œβ”€β”€ income_statement/
β”‚           β”œβ”€β”€ statement_of_equity/
β”‚           β”œβ”€β”€ basic_eps/
β”‚           β”œβ”€β”€ diluted_eps/
β”‚           └── revenue/
β”œβ”€β”€ models/
β”‚   └── staging/                  # Staging area for processed data
β”œβ”€β”€ airflow/                      # Airflow home directory
β”œβ”€β”€ logs/                         # Airflow logs
β”œβ”€β”€ venv/                         # Python virtual environment
β”œβ”€β”€ requirements.txt              # Python dependencies
β”œβ”€β”€ setup_env.py                  # Environment setup script
β”œβ”€β”€ extraction.env                # Environment variables configuration
β”œβ”€β”€ sp500_companies.csv           # S&P 500 companies list
└── README.md                     # This file

Prerequisites

  • Python 3.10+
  • Apache Airflow 3.0.2+
  • Access to SEC EDGAR API

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd Financial_ETL
  2. Create and activate virtual environment:

    python3.10 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Install Airflow:

    pip install "apache-airflow[celery]==3.0.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.0.2/constraints-3.9.txt"
  5. Set up environment:

    # Run the environment setup script
    python setup_env.py

    This script will:

    • Load environment variables from extraction.env
    • Create the necessary data directory structure
    • Set up ticker-specific directories for all S&P 500 companies
  6. Set up Airflow:

    airflow db migrate
    airflow users create \
        --username admin \
        --firstname Admin \
        --lastname User \
        --role Admin \
        --email [email protected] \
        --password admin

    Note: the airflow users commands require the FAB auth manager (the apache-airflow-providers-fab package); Airflow 3's default simple auth manager generates an admin login automatically.

Configuration

Environment Variables

The project uses extraction.env to manage all environment variables. This file contains:

# Airflow Configuration
AIRFLOW_HOME=/Users/kuot/Documents/Financial_ETL/airflow
AIRFLOW__CORE__DAGS_FOLDER=/Users/kuot/Documents/Financial_ETL/dags
AIRFLOW__EMAIL__FROM_EMAIL=David Kuo <[email protected]>

# Data Directory Configuration
OUTPUT_DATA_DIR=./data

Important:

  1. Update the AIRFLOW_HOME and AIRFLOW__CORE__DAGS_FOLDER paths in extraction.env to match your system
  2. Update the AIRFLOW__EMAIL__FROM_EMAIL with your identity for SEC EDGAR API access
  3. Set the sec_identity Airflow Variable with your identity (e.g., 'John Doe [email protected]')

Airflow Variables

Set the required Airflow variable for SEC API access:

airflow variables set sec_identity "Your Name [email protected]"
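
The DAG can then read this variable at run time. A minimal sketch of how the identity might be handed to edgartools (the import path assumes the Airflow 3 task SDK; the exact wiring in extraction_dag.py may differ):

from airflow.sdk import Variable
from edgar import set_identity

# Look up the variable configured above; raises if it was never set.
identity = Variable.get("sec_identity")

# edgartools sends this identity with each request, as required by
# SEC EDGAR's fair-access policy.
set_identity(identity)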

Environment Setup

The setup_env.py script automatically:

  • Loads environment variables from extraction.env
  • Creates the data directory structure
  • Sets up ticker-specific directories for all S&P 500 companies
  • Provides fallback values if environment variables are not set
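
As a rough illustration of that logic (assuming python-dotenv and a Symbol column in the company CSV; see setup_env.py for the actual implementation):

import os
from pathlib import Path

import pandas as pd
from dotenv import load_dotenv

# Load AIRFLOW_HOME, OUTPUT_DATA_DIR, etc. from extraction.env.
load_dotenv("extraction.env")

# Fall back to ./data when OUTPUT_DATA_DIR is not set.
data_dir = Path(os.getenv("OUTPUT_DATA_DIR", "./data"))

FORMS = ["10k", "10q"]
SUBDIRS = ["balance_sheet", "income_statement", "statement_of_equity",
           "basic_eps", "diluted_eps", "revenue"]

# One directory tree per S&P 500 ticker ("Symbol" column name is an assumption).
tickers = pd.read_csv("sp500_companies.csv")["Symbol"]
for ticker in tickers:
    for form in FORMS:
        for sub in SUBDIRS:
            (data_dir / ticker / form / sub).mkdir(parents=True, exist_ok=True)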

Airflow Configuration

  1. Start the Airflow API server (in Airflow 3 it serves the web UI):

    airflow api-server --port 8080
  2. Start Airflow scheduler:

    airflow scheduler

Usage

Running the Pipeline

  1. Set up the environment (if not already done):

    python setup_env.py
  2. Start Airflow services (if not already running):

    airflow api-server --port 8080 &
    airflow scheduler &
  3. Access Airflow UI: Open http://localhost:8080 in your browser

  4. Enable the DAG: In the Airflow UI, find financial_data_extraction and toggle it on

  5. Trigger manual run (optional): Click "Trigger DAG" to run immediately

Pipeline Components

The pipeline consists of several tasks:

  1. get_tickers: Reads S&P 500 company list from CSV
  2. get_filing_data: Manages extraction log and determines filing date range
  3. extract_financial_data: Downloads and processes financial data
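
A minimal TaskFlow-style sketch of how these tasks are chained (task names come from the list above; the bodies and exact signatures in extraction_dag.py may differ):

from airflow.sdk import dag, task  # Airflow 3 TaskFlow API


@dag()  # scheduling parameters are listed under "DAG Configuration" below
def financial_data_extraction():

    @task
    def get_tickers() -> list[str]:
        # Read ticker symbols from data/sp500_companies.csv.
        ...

    @task
    def get_filing_data() -> dict:
        # Read extraction_log.json and compute the filing date range.
        ...

    @task
    def extract_financial_data(filing_data_info: dict, tickers: list[str]):
        # Download filings and write statement/fact CSVs for each ticker.
        ...

    tickers = get_tickers()
    filing_data_info = get_filing_data()
    extract_financial_data(filing_data_info, tickers)


financial_data_extraction()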

Data Extraction

The pipeline extracts the following data for each company:

Financial Statements

  • Balance Sheet: Assets, liabilities, and equity
  • Income Statement: Revenue, expenses, and net income
  • Statement of Equity: Changes in shareholders' equity

Key Metrics

  • Basic EPS: Earnings per share (basic)
  • Diluted EPS: Earnings per share (diluted)
  • Revenue: Total revenue

File Formats

  • Concept-based: Raw financial concepts and values
  • Label-based: Human-readable financial labels and values
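
The distinction amounts to which column the statement is keyed on. A hypothetical illustration of how one statement could be split into the two files (column names and values are placeholders, not the pipeline's actual schema):

import pandas as pd

# Hypothetical statement extract; real column names come from the XBRL data.
statement = pd.DataFrame({
    "concept": ["us-gaap:Assets", "us-gaap:Liabilities"],
    "label": ["Total assets", "Total liabilities"],
    "2023-09-30": [1_000, 600],  # placeholder values
})

# Concept-based file: raw XBRL concept names plus date columns.
statement.drop(columns="label").to_csv("2023-09-30_concept.csv", index=False)

# Label-based file: human-readable labels plus the same date columns.
statement.drop(columns="concept").to_csv("2023-09-30_label.csv", index=False)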

Data Output

Directory Structure

data/
β”œβ”€β”€ AAPL/
β”‚   β”œβ”€β”€ 10k/
β”‚   β”‚   β”œβ”€β”€ balance_sheet/
β”‚   β”‚   β”‚   β”œβ”€β”€ 2023-09-30_concept.csv
β”‚   β”‚   β”‚   └── 2023-09-30_label.csv
β”‚   β”‚   β”œβ”€β”€ income_statement/
β”‚   β”‚   β”œβ”€β”€ statement_of_equity/
β”‚   β”‚   β”œβ”€β”€ basic_eps/
β”‚   β”‚   β”‚   └── 2023-09-30.csv
β”‚   β”‚   β”œβ”€β”€ diluted_eps/
β”‚   β”‚   └── revenue/
β”‚   └── 10q/
β”‚       └── [similar structure]

File Formats

Statement Files

  • Concept files: Raw XBRL concepts with date columns
  • Label files: Human-readable labels with date columns

Fact Files

  • Single CSV per filing date: Contains concept, value, and metadata

Dependencies

Core Dependencies

  • edgartools: SEC EDGAR data extraction library
  • pandas>=1.5.0: Data manipulation and analysis
  • requests>=2.28.0: HTTP library for API calls
  • python-dotenv>=0.19.0: Environment variable management

Airflow Dependencies

  • apache-airflow[celery]==3.0.2: Workflow orchestration platform

Monitoring and Logging

Extraction Log

The pipeline maintains data/extraction_log.json to track:

  • Last execution date
  • Creation and update timestamps
  • Execution history
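
A sketch of the log's assumed shape and how a run might update it (last_execution_date is the field named under "Customization" below; the created_at, updated_at, and history keys are assumptions matching the list above):

import json
from datetime import date, datetime
from pathlib import Path

LOG_PATH = Path("data/extraction_log.json")


def update_extraction_log() -> None:
    # Load the existing log, or start fresh on the first run.
    log = json.loads(LOG_PATH.read_text()) if LOG_PATH.exists() else {
        "created_at": datetime.now().isoformat(),
        "history": [],
    }

    # Record this run so the next execution only fetches newer filings.
    log["last_execution_date"] = date.today().isoformat()
    log["updated_at"] = datetime.now().isoformat()
    log.setdefault("history", []).append(log["last_execution_date"])

    LOG_PATH.write_text(json.dumps(log, indent=2))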

Airflow Logs

  • Task-specific logs available in Airflow UI
  • Error handling and retry mechanisms
  • Execution status tracking

Error Handling

The pipeline includes comprehensive error handling:

  • Missing Environment Variables: Clear error messages for configuration issues
  • API Failures: Graceful handling of SEC API timeouts and errors
  • Data Processing: Skips problematic filings while continuing with others
  • File System: Handles missing directories and file permission issues
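
The skip-and-continue behavior follows the usual pattern sketched below (extract_one is a hypothetical stand-in for the per-ticker logic, not a function from the repository):

import logging

logger = logging.getLogger(__name__)


def extract_one(ticker: str, filing_data_info: dict) -> None:
    """Hypothetical stand-in for the per-ticker download/parse logic."""
    ...


def extract_all(tickers: list[str], filing_data_info: dict) -> None:
    for ticker in tickers:
        try:
            extract_one(ticker, filing_data_info)
        except Exception:
            # Log the full traceback and keep going, so one problematic
            # filing does not abort the rest of the run.
            logger.exception("Extraction failed for %s; skipping", ticker)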

Current Configuration

DAG Configuration

  • DAG ID: financial_data_extraction
  • Schedule: @daily (runs every day)
  • Start Date: January 1, 2025
  • Catchup: Disabled
  • Tags: ["extraction_dag"]
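
Expressed as a TaskFlow decorator, this configuration would look roughly like the following (parameter names assume the Airflow 3 @dag API):

from datetime import datetime

from airflow.sdk import dag


@dag(
    dag_id="financial_data_extraction",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["extraction_dag"],
)
def financial_data_extraction():
    ...


financial_data_extraction()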

Current Test Configuration

The DAG is currently configured to process only AAPL for testing purposes. To process all S&P 500 companies, update the last line in extraction_dag.py:

# Current (test mode):
extract_financial_data(filing_data_info, ['AAPL'])

# For full S&P 500 processing:
extract_financial_data(filing_data_info, tickers)

Customization

Adding New Companies

  1. Update data/sp500_companies.csv with new ticker symbols
  2. The pipeline will automatically process new companies

Adding New Financial Metrics

  1. Modify extract_financial_facts() function
  2. Add new concepts to the extraction list
  3. Update directory structure as needed
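
For example, the extension often reduces to adding an XBRL concept name and a matching output directory. A hedged sketch (the mapping structure is hypothetical; the US-GAAP concept names are real, but how extract_financial_facts() organizes them may differ):

# Metrics the pipeline already extracts, plus a hypothetical new one.
FACT_CONCEPTS = {
    "basic_eps": "EarningsPerShareBasic",
    "diluted_eps": "EarningsPerShareDiluted",
    "revenue": "Revenues",
    # New metric: also create data/[TICKER]/10k/net_income/ and 10q/net_income/.
    "net_income": "NetIncomeLoss",
}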

Modifying Extraction Date Range

The pipeline uses incremental processing based on the last execution date. To modify the date range:

  1. Edit the extraction_log.json file
  2. Update the last_execution_date field
  3. Or delete the file to start fresh from 2015-01-01
