This project is engineered to tackle the complex challenge of forecasting time series data characterized by intermittent demand. Intermittent demand, where sales data is dominated by frequent zero values, poses a significant hurdle for traditional forecasting models. Standard algorithms often struggle to accurately predict sporadic, non-zero events.
To address this, we employ a sophisticated Hurdle Model architecture. This approach strategically decomposes the forecasting problem into two distinct, more manageable sub-problems:
- A Binary Classification task to predict whether a sale will occur (i.e., clear the zero-demand hurdle).
- A Regression task to predict how much will be sold, conditioned on a sale actually happening.
Both models are implemented using LightGBM, a high-performance gradient boosting framework renowned for its speed and accuracy, ensuring a robust and efficient solution.
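The combination of the two stages can be pictured with the following minimal sketch, which assumes LightGBM's scikit-learn API and pre-built feature matrices (function and variable names here are illustrative, not the project's actual code):

```python
import numpy as np
import lightgbm as lgb

def fit_hurdle(X_train, y_train, threshold=0.5):
    # Stage 1: binary classifier -- will any sale occur?
    clf = lgb.LGBMClassifier()
    clf.fit(X_train, (y_train > 0).astype(int))

    # Stage 2: regressor trained only on rows with positive sales.
    mask = y_train > 0
    reg = lgb.LGBMRegressor()
    reg.fit(X_train[mask], y_train[mask])

    def predict(X):
        p_sale = clf.predict_proba(X)[:, 1]
        qty = reg.predict(X)
        # Hurdle step: zero out forecasts when a sale is deemed unlikely.
        return np.where(p_sale > threshold, qty, 0.0)

    return predict
```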
- Hurdle Model Architecture: By decoupling the prediction of sale occurrence from sale quantity, the model architecture is specifically tailored to handle zero-inflated datasets, maximizing predictive accuracy for intermittent demand patterns.
- High-Performance LightGBM Engine: Utilizes LightGBM as the backbone for both classification and regression tasks, leveraging its exceptional speed and state-of-the-art performance in gradient boosting.
- Advanced Feature Engineering: Implements a comprehensive suite of engineered features to capture complex temporal dynamics (see the sketch after this list):
- Calendar-based Features: Extracts signals from timestamps (e.g., day of week, week of year, month, quarter).
- Lag and Rolling Window Statistics: Incorporates historical trends and patterns using lagged values and rolling statistical aggregations (mean, std, min, max).
- Fourier Transform Features: Precisely models complex seasonality (e.g., weekly, yearly cycles) by transforming temporal features into the frequency domain.
- Recursive Forecasting Pipeline: Employs a recursive, multi-step forecasting strategy. For long prediction horizons, the model uses its own previous predictions as inputs to generate features for subsequent steps, ensuring stable and accurate long-term forecasts.
- Efficient Caching System: A built-in caching mechanism serializes the results of computationally expensive feature engineering steps. This dramatically reduces runtime during iterative experiments and model tuning, accelerating the development cycle.
- Configuration-Driven Workflow: The entire pipeline is managed via YAML configuration files. This approach decouples logic from parameters, ensuring experimental reproducibility and simplifying hyperparameter tuning and feature selection.
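To make the feature families concrete, here is a minimal single-series sketch (the real modules under `fe/` operate per series and cover many more variants; names below are illustrative):

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    # Assumes a single series sorted by date; per-series grouping is omitted.
    df = df.copy()
    d = pd.to_datetime(df["Date"])

    # Calendar-based features
    df["dayofweek"] = d.dt.dayofweek
    df["weekofyear"] = d.dt.isocalendar().week.astype(int)
    df["month"] = d.dt.month
    df["quarter"] = d.dt.quarter

    # Lag and rolling-window statistics (shifted to avoid target leakage)
    df["lag_7"] = df["Sales_Quantity"].shift(7)
    df["roll_mean_28"] = df["Sales_Quantity"].shift(1).rolling(28).mean()
    df["roll_std_28"] = df["Sales_Quantity"].shift(1).rolling(28).std()

    # First-harmonic Fourier terms for yearly seasonality
    t = d.dt.dayofyear / 365.25
    df["fourier_sin_1"] = np.sin(2 * np.pi * t)
    df["fourier_cos_1"] = np.cos(2 * np.pi * t)
    return df
```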
For the model to operate correctly, the input and output data must adhere to the following specifications.
All source data for training (`train.csv`) and inference (`TEST_*.csv`) must be structured in long format. Each row should represent a single observation for a specific item on a specific date.
Required Features:
| Column Name | Data Type | Description |
|---|---|---|
| `Date` | Date or String | The date of the sales record (e.g., YYYY-MM-DD). |
| `Series_Identifier` | String | A unique identifier for each distinct time series (e.g., a combination of store and item ID). |
| `Sales_Quantity` | Numeric (Integer) | The actual quantity sold on that date. This serves as the target variable for the model to predict. |
The final prediction output (`sample_submission.csv`) must be pivoted into a wide-format structure. In this format, each time series (`Series_Identifier`) becomes a separate column.
Output Table Structure:
- Index: The rows are indexed by `Date`.
- Column Headers: The first column is `Date`, followed by columns for every unique `Series_Identifier`.
- Values: Each cell contains the predicted `Sales_Quantity` for the corresponding `Date` (row) and `Series_Identifier` (column).
Example Structure:
| Date | StoreA_ItemX | StoreA_ItemY | StoreB_ItemZ | ... |
|---|---|---|---|---|
| 2025-09-16 | 15 | 0 | 21 | ... |
| 2025-09-17 | 12 | 3 | 18 | ... |
| ... | ... | ... | ... | ... |
This standard format facilitates easy analysis and submission of time series forecasting results.
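Assuming the long-format column names from the table above, the pivot to the submission layout is essentially a one-liner in pandas (a sketch, not the project's exact code):

```python
import pandas as pd

def to_submission(long_df: pd.DataFrame) -> pd.DataFrame:
    # Rows: dates; columns: one per series; cells: predicted quantities.
    wide = long_df.pivot(index="Date",
                         columns="Series_Identifier",
                         values="Sales_Quantity")
    return wide.reset_index()
```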
The codebase is modularly structured into three core stages: Data Preprocessing & Feature Engineering, Model Training, and Recursive Inference.
- Configuration Loading (`configs/`): The pipeline initializes by loading hyperparameters, file paths, and feature definitions from YAML configuration files (e.g., `base.yaml`, `korean.yaml`).
- Data Ingestion & Preprocessing (`fe/preprocess.py`): Raw data is loaded, and initial cleaning is performed, including data type casting and handling of missing values.
- Feature Engineering (`fe/`): A rich set of features is constructed from the preprocessed data:
  - `calendar.py`: Generates calendar-based features.
  - `lags_rolling.py`: Creates lag and rolling window statistical features.
  - `fourier.py`: Produces Fourier term features for seasonality modeling.
- Model Training (`pipeline/train.py`):
  - Data Splitting: Utilizes Time Series Cross-Validation (`cv/tscv.py`) to create robust training and validation splits that respect the temporal order of the data.
  - Classifier Training (`model/classifier.py`): A LightGBM classifier is trained on a binary target (1 if sales > 0, else 0).
  - Regressor Training (`model/regressor.py`): A LightGBM regressor is trained exclusively on data where sales > 0 to predict the actual sales quantity.
- Model Serialization: The trained classifier and regressor model artifacts are saved to disk.
- Inference (`pipeline/predict.py` & `recursion.py`):
  - The pipeline iterates one step at a time over the prediction horizon.
  - Classification Prediction: The classifier predicts the probability of a sale occurring on the next day.
  - Hurdle Application: If the predicted probability exceeds a predefined threshold, the regressor predicts the sales quantity. Otherwise, the prediction is set to 0.
  - Recursive Update: The new prediction is used to update the lag and rolling features for the subsequent day's forecast. This loop continues until the entire prediction horizon is covered.
- Submission Generation: The final predictions are formatted to match the required `sample_submission.csv` schema.
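The recursive loop can be pictured as follows. This self-contained sketch uses only two lag features, whereas the real `recursion.py` rebuilds the full feature set at every step (`clf` and `reg` stand for the fitted LightGBM models):

```python
import numpy as np

def recursive_forecast(history, clf, reg, horizon, threshold=0.5):
    # `history` holds at least 7 past observations for one series.
    history = list(history)
    preds = []
    for _ in range(horizon):
        # Rebuild lag features from history, which now includes prior predictions.
        x_next = np.array([[history[-1], history[-7]]])
        p_sale = clf.predict_proba(x_next)[0, 1]
        qty = float(reg.predict(x_next)[0]) if p_sale > threshold else 0.0
        preds.append(qty)
        history.append(qty)  # recursive update: feed the prediction back in
    return preds
```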
| Module Path | Core Function | Contribution to Performance |
|---|---|---|
| `g2_hurdle/fe/` | Feature Engineering | Fundamentally improves model accuracy by translating complex temporal patterns (trends, seasonality, autocorrelation) into a format the model can learn. Lag and rolling features are critical. |
| `g2_hurdle/utils/cache.py` | Feature Caching | Drastically reduces experiment turnaround time by caching the results of expensive feature computations. This accelerates development and hyperparameter tuning by avoiding redundant work. |
| `g2_hurdle/model/` | Hurdle Model Implementation | Directly addresses the intermittent demand problem by splitting it into classification and regression. This prevents the regression model from being biased by zero-inflation, significantly improving the final wSMAPE score. |
| `g2_hurdle/cv/tscv.py` | Time Series Cross-Validation | Ensures a robust and reliable evaluation of the model's generalization performance by preventing data leakage (i.e., training on future data). |
| `g2_hurdle/pipeline/recursion.py` | Recursive Forecasting Logic | Enables accurate multi-step forecasting by dynamically updating features at each step with the latest available information (i.e., previous predictions). This yields more stable and precise long-range predictions. |
| `g2_hurdle/configs/` | Configuration Management | Guarantees experimental reproducibility and flexibility by externalizing all hyperparameters and settings. This allows for rapid, code-free iteration on different model configurations. |
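As a rough illustration of the caching pattern (the actual `g2_hurdle/utils/cache.py` may key and store results differently), expensive feature builds can be memoized to disk like this:

```python
import hashlib
import pickle
from pathlib import Path

def cached_features(build_fn, cfg: dict, cache_dir: str = ".cache"):
    # Key the cache on the configuration that drove the computation.
    key = hashlib.md5(repr(sorted(cfg.items())).encode()).hexdigest()
    path = Path(cache_dir) / f"features_{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())  # cache hit: skip recomputation
    result = build_fn(cfg)                      # cache miss: compute once
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(result))
    return result
```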
Rigorous evaluation of this codebase demonstrated highly competitive performance, achieving a wSMAPE score of 0.5550080421. This result validates the effectiveness of the Hurdle Model architecture combined with sophisticated feature engineering for intermittent demand forecasting problems.
- wSMAPE (Weighted Symmetric Mean Absolute Percentage Error): An industry-standard metric that weights errors based on the actual demand volume, preventing high-percentage errors on low-volume items from disproportionately penalizing the model's score.
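One common formulation of the metric, with per-point SMAPE terms weighted by actual demand volume, looks like the following sketch (the exact weighting used in this evaluation may differ):

```python
import numpy as np

def wsmape(y_true, y_pred, eps=1e-8):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Symmetric percentage error per point ...
    smape_terms = np.abs(y_true - y_pred) / ((np.abs(y_true) + np.abs(y_pred)) / 2 + eps)
    # ... weighted by each point's share of total actual demand.
    weights = np.abs(y_true) / (np.abs(y_true).sum() + eps)
    return float((weights * smape_terms).sum())
```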
- Python 3.8+
- Nvidia GPU (Recommended for training)
```bash
# 1. Clone the repository
!git clone https://github.com/shindongwoon/lgbmhurdle.git
%cd lgbmhurdle

# 2. Install dependencies
!python dependency.py

# 3. Run model training
!python train.py

# 4. Run prediction
!python predict.py
```
All imports are relative; drop this folder as project root and run the commands.
By default, both scripts load the configuration from `g2_hurdle/configs/korean.yaml`. `train.py` reads `data/train.csv` and stores model artifacts in `./artifacts`. `predict.py` consumes the artifacts, expects test files in `data/test` alongside a `data/sample_submission.csv`, and writes predictions to `outputs/submission.csv`.
To enable GPU acceleration, set `runtime.use_gpu` to `true` in the YAML configuration. The pipeline will automatically set `device_type: gpu` for the LightGBM models. The older `device` parameter is deprecated and should not be used.
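In effect, enabling the flag amounts to constructing the models with the modern parameter name, roughly equivalent to this illustrative snippet:

```python
import lightgbm as lgb

# What runtime.use_gpu: true translates to for the underlying models;
# `device_type` is the current LightGBM parameter, the old `device`
# alias is deprecated.
clf = lgb.LGBMClassifier(device_type="gpu")
reg = lgb.LGBMRegressor(device_type="gpu")
```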
Columns used by the toolkit can be provided directly in the `data` section of the YAML config:
```yaml
data:
  date_col: ds
  target_col: y
  id_cols: [series_id]
```

If these keys are omitted, `resolve_schema` falls back to the corresponding `*_col_candidates` lists to infer column names.
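As a rough illustration of that fallback (the actual `resolve_schema` may differ, and the candidate-key name below is an assumption following the `*_col_candidates` pattern):

```python
import pandas as pd

# Illustrative sketch only: infer the date column from configured candidates
# when data.date_col is absent.
def resolve_date_col(data_cfg: dict, df: pd.DataFrame) -> str:
    if "date_col" in data_cfg:
        return data_cfg["date_col"]
    for candidate in data_cfg.get("date_col_candidates", []):
        if candidate in df.columns:
            return candidate
    raise KeyError("No date column found among the configured candidates")
```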
To clip negative values before feature engineering, list the columns under `non_negative_cols`:

```yaml
data:
  non_negative_cols: [sales]
```

Any negative values in these columns will be replaced with zero during both training and prediction.
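The clipping described above amounts to something like the following pandas operation (a sketch under the assumption of a standard DataFrame, not the project's exact code):

```python
import pandas as pd

def clip_non_negative(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    # Replace negative values with zero in the configured columns.
    out = df.copy()
    out[cols] = out[cols].clip(lower=0)
    return out
```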