- This project aims to predict customer satisfaction from the Santander Customer Satisfaction Kaggle Challenge.
- The goal is to predict the probability that a customer is unsatisfied, using the area under the ROC curve (AUC) as the evaluation metric. This project formulates the problem as a binary classification task. A Logistic Regression was applied as a baseline model. To improve performance, more advanced models such as XGBoost and Random Forest were used. The best model was able to predict customer dissatisfaction 67% of the time. Currently the best performance on Kaggle is 82%.
Data
- A train CSV file containing anonymized features.
- Target: A binary classification indicating customer satisfaction. TARGET column (1 for dissatisfied, 0 for satisfied).
- Size
- Data: 76020 rows, 371 columns
- Train, Test Split & Cross Validation:
- 80% of the data (60723 samples) for training), 20% of data (15181 samples) used for testing.
GridSearchCVwas used for hyperparameter tuning to find the best model.
1. Handling Missing Data
- Columns with missing values were replaced with the mode, since missing values were categorized as -99999.
2. Feature Selection/Scaling
- Dropped irrelevant ID column.
- Used
StandardScalerto scale numeric features.
3. Encoding
- Converted categorical features into numeric using One-Hot Encoding.
Class Balance
- The dataset is imbalanced — most customers are satisfied (TARGET = 0), and only a small portion are dissatisfied (TARGET = 1).
Categorical Features
- Table to better represent categorical features:

- Observation:
- Certain features such as num_var1 are dominated by a single values (0.0 appears in over 98% of cases). This suggests that some features have low variability and may not carry strong predictive power.
- Although certain columns include the name 'num' this does not mean the data is truly numeric in nature. For this project, categorical features where selected based on number of unique values (3-9).
XGBoost Top Features
- The top features XGBoost used for interpreting the dataset.

- Observation: var15 was also a top feature for Logistic Regression and Random Forest.
var15 Histogram
- Distribution for var 15 (age):

- Observation: The majority of satisfied customers have lower var15 values, especially clustered around 20-30. Unsatisfied customers are also concentrated in the same area but show more spread across higher values.
- Predicting Customer dissatisfaction using binary target variable.
- Models:
- Logistic Regression:
- A baseline model for binary classification. Used because it is quick to train and simple.
- Random Forest:
- Used to handle many features, most of which are anonymized and possibly irrelevant.
- XGBoosting:
- Handles large data sets well and does not need outliers removed to perform well.
- Logistic Regression:
Optimizer
- XGBoost:
binary:logisticobjective
Hyperparameters
- Logistic Regression:
class_weight='balanced' - Random Forest:
class_weight=None,max_depth=20,n_estimator=200 - XGBoosting:
learning_rate=0.1,max_depth=3,n_estimator=50
-
Software: Python 3.x
-
Train Stopping:
- Logistic Regression: maximum of 3000 iterations.
- XGBoost: 50 trees.
- Random Forest: 200 trees.
- None of the models had early stopping.
-
Problems:
- Over 90% of customers in the dataset were labeled as satisfied, making it difficult to accurately predict the minority class (unsatisfied).
- I attempted hyperparameter tuning using GridSearchCV to improve model performance, but the imbalance still posed a major challenge.
- Logistic Regression:
- Resulted in moderate performance based on AUC score, but struggled the most with class imbalance.
- Random Forest:
- Resulted in a better performance than logistic regression, but was slower training.
- AUC improved slightly, but sensitivity to the minority class (unsatisfied customers) was poor.
- XGBoosting:
- Best AUC score among models.
- Class imbalance still affected performance; most predictions leaned toward the majority class imbalance.
- AUC Score Comparison before Hyperparameter tuning:

- AUC Score after Hyperparameter tuning (XGBoost & Random Forest)

- XGBoost outperformed both Logistic Regression and Random Forest in terms of AUC score, making it the most effective model for identifying dissatisfied customers.
- Handle Class Imbalance More Effectively:
- Although using GridSearch helped somewhat, class imbalance remains the key challenge. In the future, exploring techniques like SMOTE to create a more balanced dataset may help with this.
- Prepare the data
- Format your data as a pandas DataFrame. Ensure that your dataset has features (input variables) and a target column (output variable) that the model will predict.
- Modify Configuration
- Adjust the code, hyperparameters, (e.g., in train_model.py or model_training.ipynb) to reflect your dataset's structure and target variable.
- Train the Model
- Use the provided training csv to train models on your data.
- Evaluate the Model
- Use the provided evaluation metrics (e.g., AUC, accuracy, confusion matrix) to assess your model’s performance.
- Resources
- For running and training the models Google Colab was used, a free cloud-based environment.
- Data-Visualization-Cleaning.ipynb: Data analysis and visualization. Cleaning train.csv file from Kaggle project
- Models_Before_Tuning.ipynb: Contains all models used on a clean the cleaned trained csv file. Includes GridSearchCV.
- Updated_Models_with_Tuning.ipynb: Includes retrained XGBoosting and Random Forest models based on best parameters from GridSearchCV.
- Images: Contains all images used in this README.md file.
- pandas, numpy, xgboost, scikit-learn, matplotlib
- The Santander Customer Satisfaction datasets can be downloaded on Kaggle right here.
- Training the model involves selecting a model (e.g., Logistic Regression, Random Forest, XGBoost), training it on the training data, and optionally tuning hyperparameters for better performance.
- Evaluating performance includes using AUC to assess how well the model predicts the target variable.