
LGBM Hurdle Model for Time Series Forecasting


πŸ“œ Project Overview

This project is engineered to tackle the complex challenge of forecasting time series data characterized by intermittent demand. Intermittent demand, where sales data is dominated by frequent zero values, poses a significant hurdle for traditional forecasting models. Standard algorithms often struggle to accurately predict sporadic, non-zero events.

To address this, we employ a sophisticated Hurdle Model architecture. This approach strategically decomposes the forecasting problem into two distinct, more manageable sub-problems:

  1. A Binary Classification task to predict whether a sale will occur (i.e., whether demand clears the zero-sales "hurdle").
  2. A Regression task to predict how much will be sold, conditioned on a sale actually happening.

Both models are implemented using LightGBM, a high-performance gradient boosting framework renowned for its speed and accuracy, ensuring a robust and efficient solution.
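
Conceptually, the two stages compose as in the minimal sketch below. It is illustrative only: the function names are not the repo's API, and it assumes a ready-made feature matrix X and a zero-inflated target y.

import numpy as np
import lightgbm as lgb

# Minimal two-stage hurdle sketch (illustrative, not the repo's exact API).
def fit_hurdle(X, y):
    # Stage 1: binary target -- did any sale occur?
    clf = lgb.LGBMClassifier(objective="binary")
    clf.fit(X, (y > 0).astype(int))
    # Stage 2: regression fit on the non-zero rows only.
    nonzero = y > 0
    reg = lgb.LGBMRegressor(objective="regression")
    reg.fit(X[nonzero], y[nonzero])
    return clf, reg

def predict_hurdle(clf, reg, X, threshold=0.5):
    # Quantity is predicted only where the classifier clears the hurdle.
    p_sale = clf.predict_proba(X)[:, 1]
    return np.where(p_sale > threshold, reg.predict(X), 0.0)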


πŸš€ Key Features

  • Hurdle Model Architecture: By decoupling the prediction of sale occurrence from sale quantity, the model architecture is specifically tailored to handle zero-inflated datasets, maximizing predictive accuracy for intermittent demand patterns.
  • High-Performance LightGBM Engine: Utilizes LightGBM as the backbone for both classification and regression tasks, leveraging its exceptional speed and state-of-the-art performance in gradient boosting.
  • Advanced Feature Engineering: Implements a comprehensive suite of engineered features to capture complex temporal dynamics (see the sketch after this list):
    • Calendar-based Features: Extracts signals from timestamps (e.g., day of week, week of year, month, quarter).
    • Lag and Rolling Window Statistics: Incorporates historical trends and patterns using lagged values and rolling statistical aggregations (mean, std, min, max).
    • Fourier Features: Models complex seasonality (e.g., weekly, yearly cycles) with Fourier terms, i.e., sine/cosine functions of the date index.
  • Recursive Forecasting Pipeline: Employs a recursive, multi-step forecasting strategy. For long prediction horizons, the model uses its own previous predictions as inputs to generate features for subsequent steps, ensuring stable and accurate long-term forecasts.
  • Efficient Caching System: A built-in caching mechanism serializes the results of computationally expensive feature engineering steps. This dramatically reduces runtime during iterative experiments and model tuning, accelerating the development cycle.
  • Configuration-Driven Workflow: The entire pipeline is managed via YAML configuration files. This approach decouples logic from parameters, ensuring experimental reproducibility and simplifying hyperparameter tuning and feature selection.
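
To make the three feature families concrete, here is a minimal single-series sketch. Column names, lag offsets, and window lengths are assumptions, and the per-series grouping a real pipeline needs is omitted for brevity.

import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame, date_col="Date", target_col="Sales_Quantity"):
    df = df.sort_values(date_col).copy()
    dt = pd.to_datetime(df[date_col])

    # Calendar-based features
    df["dow"] = dt.dt.dayofweek
    df["month"] = dt.dt.month
    df["weekofyear"] = dt.dt.isocalendar().week.astype(int)

    # Lag and rolling-window statistics; shift(1) keeps the current value out
    for lag in (1, 7, 28):
        df[f"lag_{lag}"] = df[target_col].shift(lag)
    df["roll_mean_7"] = df[target_col].shift(1).rolling(7).mean()
    df["roll_std_7"] = df[target_col].shift(1).rolling(7).std()

    # Fourier terms for weekly seasonality
    t = np.arange(len(df))
    for k in (1, 2):
        df[f"sin_w{k}"] = np.sin(2 * np.pi * k * t / 7)
        df[f"cos_w{k}"] = np.cos(2 * np.pi * k * t / 7)
    return df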

πŸ“Š Data Schema Definition

For the model to operate correctly, the input and output data must adhere to the following specifications.

1. Input Data Schema

All source data for training (train.csv) and inference (TEST_*.csv) must be structured in a long format. Each row should represent a single observation for a specific item on a specific date.

Required Features:

  • Date (Date or String): The date of the sales record (e.g., YYYY-MM-DD).
  • Series_Identifier (String): A unique identifier for each distinct time series (e.g., a combination of store and item ID).
  • Sales_Quantity (Numeric, Integer): The actual quantity sold on that date; this is the target variable the model predicts.
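
For instance, a few rows conforming to this long format could be built as follows (values purely illustrative):

import pandas as pd

# One observation per (Date, Series_Identifier) pair.
train = pd.DataFrame({
    "Date": ["2025-09-14", "2025-09-14", "2025-09-15"],
    "Series_Identifier": ["StoreA_ItemX", "StoreB_ItemZ", "StoreA_ItemX"],
    "Sales_Quantity": [15, 0, 12],
})
train["Date"] = pd.to_datetime(train["Date"])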

2. Output Data Schema

The final prediction output (sample_submission.csv) must be pivoted into a wide-format structure. In this format, each time series (Series_Identifier) becomes a separate column.

Output Table Structure:

  • Index: The rows are indexed by Date.
  • Column Headers: The first column is Date, followed by columns for every unique Series_Identifier.
  • Values: Each cell contains the predicted Sales_Quantity for the corresponding Date (row) and Series_Identifier (column).

Example Structure:

Date StoreA_ItemX StoreA_ItemY StoreB_ItemZ ...
2025-09-16 15 0 21 ...
2025-09-17 12 3 18 ...
... ... ... ... ...

This standard format facilitates easy analysis and submission of time series forecasting results.
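
In pandas, the long-to-wide conversion is a single pivot. A minimal sketch, with column names following the schema above and an illustrative output path:

import pandas as pd

# Long-format predictions, as produced by the model.
long_preds = pd.DataFrame({
    "Date": ["2025-09-16", "2025-09-16", "2025-09-17", "2025-09-17"],
    "Series_Identifier": ["StoreA_ItemX", "StoreA_ItemY"] * 2,
    "Sales_Quantity": [15, 0, 12, 3],
})

# Pivot so each Series_Identifier becomes its own column.
wide = (
    long_preds.pivot(index="Date", columns="Series_Identifier", values="Sales_Quantity")
    .reset_index()
)
wide.to_csv("submission.csv", index=False)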


βš™οΈ Code Pipeline Analysis

The codebase is modularly structured into three core stages: Data Preprocessing & Feature Engineering, Model Training, and Recursive Inference.

1. End-to-End Pipeline Flow

  1. Configuration Loading (configs/): The pipeline initializes by loading hyperparameters, file paths, and feature definitions from YAML configuration files (e.g., base.yaml, korean.yaml).
  2. Data Ingestion & Preprocessing (fe/preprocess.py): Raw data is loaded, and initial cleaning is performed, including data type casting and handling of missing values.
  3. Feature Engineering (fe/): A rich set of features is constructed from the preprocessed data.
    • calendar.py: Generates calendar-based features.
    • lags_rolling.py: Creates lag and rolling window statistical features.
    • fourier.py: Produces Fourier term features for seasonality modeling.
  4. Model Training (pipeline/train.py):
    • Data Splitting: Utilizes Time Series Cross-Validation (cv/tscv.py) to create robust training and validation splits that respect the temporal order of the data.
    • Classifier Training (model/classifier.py): A LightGBM classifier is trained on a binary target (1 if sales > 0, else 0).
    • Regressor Training (model/regressor.py): A LightGBM regressor is trained exclusively on data where sales > 0 to predict the actual sales quantity.
  5. Model Serialization: The trained classifier and regressor model artifacts are saved to disk.
  6. Inference (pipeline/predict.py & recursion.py):
    • The pipeline iterates one step at a time over the prediction horizon.
    • Classification Prediction: The classifier predicts the probability of a sale occurring on the next day.
    • Hurdle Application: If the predicted probability exceeds a predefined threshold, the regressor predicts the sales quantity. Otherwise, the prediction is set to 0.
    • Recursive Update: The new prediction is used to update the lag and rolling features for the subsequent day's forecast. This loop continues until the entire prediction horizon is covered (see the sketch after this list).
  7. Submission Generation: The final predictions are formatted to match the required sample_submission.csv schema.
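
The recursive loop in step 6 can be sketched as follows. Here build_features and append_prediction are hypothetical stand-ins for the repo's feature pipeline, not its actual functions.

import numpy as np

def recursive_forecast(history, clf, reg, horizon, threshold=0.5):
    # history holds the observed series and grows by one predicted day per step.
    preds = []
    for _ in range(horizon):
        X_next = build_features(history)           # features for the next day (hypothetical helper)
        p_sale = clf.predict_proba(X_next)[:, 1]   # probability that a sale occurs
        qty = np.where(p_sale > threshold, reg.predict(X_next), 0.0)
        preds.append(qty)
        history = append_prediction(history, qty)  # feed the prediction back (hypothetical helper)
    return np.vstack(preds)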

2. Module Contributions to Performance

  • g2_hurdle/fe/ (Feature Engineering): Fundamentally improves model accuracy by translating complex temporal patterns (trends, seasonality, autocorrelation) into a format the model can learn. Lag and rolling features are critical.
  • g2_hurdle/utils/cache.py (Feature Caching): Drastically reduces experiment turnaround time by caching the results of expensive feature computations, avoiding redundant work during development and hyperparameter tuning.
  • g2_hurdle/model/ (Hurdle Model Implementation): Directly addresses the intermittent demand problem by splitting it into classification and regression. This prevents the regression model from being biased by zero-inflation, significantly improving the final wSMAPE score.
  • g2_hurdle/cv/tscv.py (Time Series Cross-Validation): Ensures a robust and reliable evaluation of the model's generalization performance by preventing data leakage (i.e., training on future data).
  • g2_hurdle/pipeline/recursion.py (Recursive Forecasting Logic): Enables accurate multi-step forecasting by dynamically updating features at each step with the latest available information (i.e., previous predictions), yielding more stable long-range forecasts.
  • g2_hurdle/configs/ (Configuration Management): Guarantees experimental reproducibility and flexibility by externalizing all hyperparameters and settings, allowing rapid, code-free iteration on model configurations.

πŸ“Š Results

Rigorous evaluation of this codebase demonstrated a highly competitive performance, achieving a wSMAPE score of 0.5550080421. This result validates the effectiveness of the Hurdle Model architecture combined with sophisticated feature engineering for solving intermittent demand forecasting problems.

  • wSMAPE (Weighted Symmetric Mean Absolute Percentage Error): An industry-standard metric that weights errors based on the actual demand volume, preventing high-percentage errors on low-volume items from disproportionately penalizing the model's score.
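
One common formulation of the metric is sketched below; the evaluation's exact weighting scheme is not specified here, so treat the details as assumptions.

import numpy as np

def smape(y_true, y_pred, eps=1e-8):
    # Symmetric MAPE with a small epsilon guarding zero denominators.
    denom = np.maximum((np.abs(y_true) + np.abs(y_pred)) / 2.0, eps)
    return np.mean(np.abs(y_true - y_pred) / denom)

def wsmape(actuals, preds):
    # actuals/preds: dicts mapping series id -> np.ndarray.
    # Each series' SMAPE is weighted by its share of total actual demand.
    totals = {k: np.abs(v).sum() for k, v in actuals.items()}
    grand = sum(totals.values())
    return sum(totals[k] / grand * smape(actuals[k], preds[k]) for k in actuals)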

πŸš€ Getting Started on Colab

1. Prerequisites

  • Python 3.8+
  • Nvidia GPU (Recommended for training)

2. Quickstart (Colab)

# 1. Clone the repository
!git clone https://github.com/shindongwoon/lgbmhurdle.git
%cd lgbmhurdle

# 2. Install dependencies
!python dependency.py

# 3. Run model training
!python train.py

# 4. Run prediction
!python predict.py

πŸ“œ ν”„λ‘œμ νŠΈ κ°œμš” (Project Overview)

λ³Έ ν”„λ‘œμ νŠΈλŠ” 간헐적 μˆ˜μš”(Intermittent Demand) νŠΉμ„±μ„ κ°€μ§„ μ‹œκ³„μ—΄ λ°μ΄ν„°μ˜ 미래 νŒλ§€λŸ‰μ„ μ˜ˆμΈ‘ν•˜λŠ” 것을 λͺ©ν‘œλ‘œ ν•©λ‹ˆλ‹€. 간헐적 μˆ˜μš”λž€ '0' 값이 λΉˆλ²ˆν•˜κ²Œ λ‚˜νƒ€λ‚˜λŠ” 데이터λ₯Ό μ˜λ―Έν•˜λ©°, 일반적인 μ‹œκ³„μ—΄ 예츑 λͺ¨λΈλ‘œλŠ” μ •ν™•ν•œ 예츑이 μ–΄λ ΅μŠ΅λ‹ˆλ‹€.

μ΄λŸ¬ν•œ 문제λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ λ³Έ ν”„λ‘œμ νŠΈμ—μ„œλŠ” ν—ˆλ“€ λͺ¨λΈ(Hurdle Model) 접근법을 μ±„νƒν–ˆμŠ΅λ‹ˆλ‹€. ν—ˆλ“€ λͺ¨λΈμ€ 'νŒλ§€κ°€ λ°œμƒν• μ§€ μ—¬λΆ€'λ₯Ό μ˜ˆμΈ‘ν•˜λŠ” 이진 λΆ„λ₯˜(Binary Classification) λ¬Έμ œμ™€ 'νŒλ§€κ°€ λ°œμƒν–ˆμ„ λ•Œ μ–Όλ§ˆλ‚˜ νŒ”λ¦΄μ§€'λ₯Ό μ˜ˆμΈ‘ν•˜λŠ” νšŒκ·€(Regression) 문제둜 λ‚˜λˆ„μ–΄ μ ‘κ·Όν•©λ‹ˆλ‹€. 두 λͺ¨λΈ λͺ¨λ‘ κ°•λ ₯ν•œ μ„±λŠ₯을 μžλž‘ν•˜λŠ” LightGBM을 기반으둜 κ΅¬ν˜„λ˜μ—ˆμŠ΅λ‹ˆλ‹€.


πŸš€ Key Features

  • Hurdle Model Architecture: Maximizes predictive performance on zero-heavy data by predicting sale occurrence and sales quantity separately.
  • High-Performance LightGBM: Uses LightGBM, with its fast training speed and high predictive accuracy, as the backbone of both the classification and regression models.
  • Advanced Feature Engineering:
    • Calendar features: features built from temporal information such as date, day of week, month, and week of year
    • Lag and rolling statistics: lag and rolling-window features for learning trends and patterns in historical data
    • Fourier features: Fourier-term features for modeling seasonality precisely
  • Recursive Prediction Pipeline: For long prediction horizons, adopts a recursive structure that feeds previous predictions back in as features for the next time step, enabling stable long-range forecasts.
  • Efficient Caching System: Caches the data generated during feature engineering, dramatically cutting the computation cost of repeated training runs and experiments.
  • Configuration File Management: Manages model hyperparameters, feature lists, and other settings systematically through YAML configuration files, guaranteeing experiment reproducibility.


πŸ“Š Data Schema Definition

For the model to operate correctly, the input and output data must be structured exactly according to the following specification.

1. Input Data Schema

All source data used for training (train.csv) and evaluation (TEST_*.csv) must follow a long-format structure. Each row is an individual observation representing the sales record of a specific item on a specific business day.

Required features:

  • μ˜μ—…μΌμž (Date or String): The date on which sales occurred (e.g., YYYY-MM-DD).
  • μ˜μ—…μž₯λͺ…_메뉴λͺ… (String): The identifier that uniquely distinguishes each item.
  • λ§€μΆœμˆ˜λŸ‰ (Numeric, Integer): The actual quantity of the item sold on that business day; the target variable the model predicts.

2. Output Data Schema

Unlike the input data, the final prediction file, sample_submission.csv, must be pivoted into a wide-format structure: a matrix in which each item (μ˜μ—…μž₯λͺ…_메뉴λͺ…) becomes its own column.

Output table structure:

  • Index: Rows are keyed by μ˜μ—…μΌμž.
  • Column Headers: The first column is μ˜μ—…μΌμž; from the second column onward, every unique value of μ˜μ—…μž₯λͺ…_메뉴λͺ… appears as a column name.
  • Values: Each cell holds the predicted λ§€μΆœμˆ˜λŸ‰ for the corresponding μ˜μ—…μΌμž (row) and μ˜μ—…μž₯λͺ…_메뉴λͺ… (column).

Example structure:

μ˜μ—…μΌμž StoreA_ItemX StoreA_ItemY StoreB_ItemZ ...
2025-09-16 15 0 21 ...
2025-09-17 12 3 18 ...
... ... ... ... ...

μ΄λŸ¬ν•œ κ΅¬μ‘°λŠ” 각 μƒν’ˆμ˜ 일별 예츑 νŒλ§€λŸ‰μ„ ν•œλˆˆμ— νŒŒμ•…ν•˜κΈ° μš©μ΄ν•œ ν‘œμ€€μ μΈ μ‹œκ³„μ—΄ 예츑 κ²°κ³Ό 제좜 ν˜•μ‹μž…λ‹ˆλ‹€.


βš™οΈ μ½”λ“œ νŒŒμ΄ν”„λΌμΈ 뢄석 (Code Pipeline Analysis)

λ³Έ μ½”λ“œλ² μ΄μŠ€λŠ” 크게 데이터 μ „μ²˜λ¦¬ 및 ν”Όμ²˜ μ—”μ§€λ‹ˆμ–΄λ§, λͺ¨λΈ ν•™μŠ΅, μž¬κ·€μ  예츑의 μ„Έ λ‹¨κ³„λ‘œ κ΅¬μ„±λ©λ‹ˆλ‹€.

1. 전체 μ½”λ“œ νŒŒμ΄ν”„λΌμΈ 흐름

  1. Configuration loading (configs/): Loads configuration files such as base.yaml and korean.yaml, which define the project-wide hyperparameters, data paths, and the list of features to use.
  2. Data loading and preprocessing (fe/preprocess.py): Loads the raw data and performs basic preprocessing such as data type conversion and missing-value handling.
  3. Feature engineering (fe/): Generates the various features used for model training from the preprocessed data.
    • calendar.py: generates time-related features
    • lags_rolling.py: generates lag and rolling statistical features
    • fourier.py: generates seasonality features
  4. Model training (pipeline/train.py):
    • Data splitting: splits the data into training/validation sets with Time Series Cross-Validation, respecting the temporal nature of the data (cv/tscv.py).
    • Classifier training (model/classifier.py): trains a LightGBM classifier with "is the target value zero or not" as the label.
    • Regressor training (model/regressor.py): trains a LightGBM regression model to predict the actual sales quantity, using only the rows whose target value is non-zero.
  5. Model saving: saves the trained classifier and regressor objects to the configured path.
  6. Inference (pipeline/predict.py & recursion.py):
    • Predictions over the test period are made one day (one step) at a time.
    • Classifier prediction: predicts the probability that a sale occurs on the next day.
    • Hurdle application: if the predicted probability exceeds a predefined threshold, the regression model predicts the quantity; otherwise the prediction is 0.
    • Recursive update: the predicted value is used to update the lag and rolling features needed for the next day's prediction, and this process repeats until the end of the prediction horizon.
  7. Submission generation: formats the final predictions to match the sample_submission.csv schema.

2. 각 λΆ€λΆ„μ˜ κΈ°λŠ₯ 및 μ„±λŠ₯ ν–₯상 κΈ°μ—¬

λͺ¨λ“ˆ 경둜 (Module Path) 핡심 κΈ°λŠ₯ (Core Function) μ„±λŠ₯ ν–₯상 κΈ°μ—¬ 방식 (Contribution to Performance)
g2_hurdle/fe/ ν”Όμ²˜ μ—”μ§€λ‹ˆμ–΄λ§ μ‹œκ³„μ—΄ λ°μ΄ν„°μ˜ λ³΅μž‘ν•œ νŒ¨ν„΄(μΆ”μ„Έ, κ³„μ ˆμ„±, μžκΈ°μƒκ΄€μ„±)을 λͺ¨λΈμ΄ ν•™μŠ΅ν•  수 μžˆλŠ” ν˜•νƒœλ‘œ λ³€ν™˜ν•˜μ—¬ 예츑 정확도λ₯Ό 근본적으둜 ν–₯μƒμ‹œν‚΅λ‹ˆλ‹€. 특히 Lag, Rolling ν”Όμ²˜λŠ” μ‹œκ³„μ—΄ 예츑의 ν•΅μ‹¬μž…λ‹ˆλ‹€.
g2_hurdle/utils/cache.py ν”Όμ²˜ 캐싱 λŒ€μš©λŸ‰ 데이터에 λŒ€ν•œ ν”Όμ²˜ μ—”μ§€λ‹ˆμ–΄λ§μ€ λ§Žμ€ μ‹œκ°„μ΄ μ†Œμš”λ©λ‹ˆλ‹€. μƒμ„±λœ ν”Όμ²˜λ₯Ό 파일둜 μ €μž₯ν•˜κ³  μž¬μ‚¬μš©ν•¨μœΌλ‘œμ¨, 반볡 μ‹€ν—˜ μ‹œ 전체 μ‹€ν–‰ μ‹œκ°„μ„ 극적으둜 λ‹¨μΆ•μ‹œμΌœ 개발 νš¨μœ¨μ„±μ„ λ†’μž…λ‹ˆλ‹€.
g2_hurdle/model/ ν—ˆλ“€ λͺ¨λΈ κ΅¬ν˜„ Classifier와 Regressor둜 역할을 λΆ„λ‹΄ν•˜μ—¬ 간헐적 μˆ˜μš” λ¬Έμ œμ— νŠΉν™”λœ 접근을 ν•©λ‹ˆλ‹€. μ΄λŠ” 단일 νšŒκ·€ λͺ¨λΈμ΄ '0' 값에 μ˜ν•΄ ν•™μŠ΅μ΄ μ™œκ³‘λ˜λŠ” 것을 λ°©μ§€ν•˜κ³ , 두 κ°€μ§€ 문제λ₯Ό 각각 μ΅œμ ν™”ν•˜μ—¬ wSMAPE 점수λ₯Ό 크게 κ°œμ„ ν•©λ‹ˆλ‹€.
g2_hurdle/cv/tscv.py μ‹œκ³„μ—΄ ꡐ차 검증 미래의 데이터가 κ³Όκ±° λ°μ΄ν„°μ˜ ν•™μŠ΅μ— μ‚¬μš©λ˜λŠ” 것을 λ°©μ§€(Data Leakage λ°©μ§€)ν•©λ‹ˆλ‹€. 이λ₯Ό 톡해 λͺ¨λΈμ˜ μΌλ°˜ν™” μ„±λŠ₯을 보닀 μ •ν™•ν•˜κ³  μ‹ λ’°μ„± 있게 평가할 수 μžˆμŠ΅λ‹ˆλ‹€.
g2_hurdle/pipeline/recursion.py μž¬κ·€μ  예츑 둜직 닀쀑 μ‹œμ  예츑(Multi-step Forecasting) μ‹œ, λ§€ μ‹œμ λ§ˆλ‹€ μ΅œμ‹  정보λ₯Ό λ°˜μ˜ν•œ ν”Όμ²˜λ₯Ό μƒμ„±ν•˜μ—¬ μ˜ˆμΈ‘μ„ μˆ˜ν–‰ν•©λ‹ˆλ‹€. μ΄λŠ” λ‹¨μˆœνžˆ λͺ¨λΈ ν•˜λ‚˜λ‘œ 전체 기간을 μ˜ˆμΈ‘ν•˜λŠ” 것보닀 훨씬 μ •κ΅ν•˜κ³  μ•ˆμ •μ μΈ μž₯κΈ° μ˜ˆμΈ‘μ„ κ°€λŠ₯ν•˜κ²Œ ν•©λ‹ˆλ‹€.
g2_hurdle/configs/ μ„€μ • 관리 λͺ¨λ“  ν•˜μ΄νΌνŒŒλΌλ―Έν„°μ™€ 섀정을 μ½”λ“œκ°€ μ•„λ‹Œ μ™ΈλΆ€ 파일둜 λΆ„λ¦¬ν•˜μ—¬ κ΄€λ¦¬ν•©λ‹ˆλ‹€. 이λ₯Ό 톡해 μ‹€ν—˜μ˜ μž¬ν˜„μ„±μ„ ν™•λ³΄ν•˜κ³ , μ½”λ“œ μˆ˜μ • 없이 λ‹€μ–‘ν•œ 쑰건으둜 μ†μ‰½κ²Œ μ‹€ν—˜μ„ μ§„ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

πŸ“Š Results

Objective performance validation of this codebase achieved a wSMAPE score of 0.5550080421, demonstrating that the hurdle model combined with careful feature engineering is highly effective for intermittent demand forecasting.

  • wSMAPE (Weighted Symmetric Mean Absolute Percentage Error): an evaluation metric that weights errors by demand volume, preventing excessive penalties for errors on low-demand items.

πŸš€ Getting Started on Colab

1. Prerequisites

  • Python 3.8+
  • Nvidia GPU

2. Quickstart (Colab)

# 1. Clone the repository
!git clone https://github.com/shindongwoon/lgbmhurdle.git
%cd lgbmhurdle

# 2. μ˜μ‘΄μ„± μ„€μΉ˜
!python dependency.py

# 3. λͺ¨λΈ ν›ˆλ ¨ μ§„ν–‰
!python train.py

# 4. λͺ¨λΈ 예츑 μ§„ν–‰
!python predict.py

All imports are package-relative; place this folder at the project root and run the commands from there.

By default, both scripts load the configuration from g2_hurdle/configs/korean.yaml. train.py reads data/train.csv and stores model artifacts in ./artifacts. predict.py consumes the artifacts, expects test files in data/test with a data/sample_submission.csv, and writes predictions to outputs/submission.csv.

GPU configuration

To enable GPU acceleration, set runtime.use_gpu to true in the YAML configuration. The pipeline will automatically set device_type: gpu for the LightGBM models. The older device parameter is deprecated and should not be used.
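
Based on the description above, the setting would look like this in the YAML file (key layout assumed from the runtime.use_gpu path named in the text):

runtime:
  use_gpu: true   # the pipeline then passes device_type: gpu to LightGBM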

Data configuration

Columns used by the toolkit can be provided directly in the data section of the YAML config:

data:
  date_col: ds
  target_col: y
  id_cols: [series_id]

If these keys are omitted, resolve_schema falls back to the corresponding *_col_candidates lists to infer column names.
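
The fallback might look roughly like the following, a hypothetical illustration of the described behavior rather than the repo's actual resolve_schema code:

def resolve_column(data_cfg: dict, key: str, columns) -> str:
    # Prefer an explicit setting such as data.date_col ...
    explicit = data_cfg.get(key)
    if explicit:
        return explicit
    # ... otherwise fall back to the matching *_col_candidates list.
    for candidate in data_cfg.get(f"{key}_candidates", []):
        if candidate in columns:
            return candidate
    raise KeyError(f"no column resolved for {key!r}")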

To clip negative values before feature engineering, list the columns under non_negative_cols:

data:
  non_negative_cols: [sales]

Any negative values in these columns will be replaced with zero during both training and prediction.
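
The equivalent pandas operation is a simple clip; a sketch assuming the frame is named df:

# Negatives in each configured column are replaced with zero.
for col in ["sales"]:  # non_negative_cols from the config
    df[col] = df[col].clip(lower=0)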
