CS-433 Machine Learning Project 1

This README describes the codebase for the first project of the EPFL course CS-433 Machine Learning, Fall 2024 semester.

Placement of the Data Folder

To reproduce the results presented in the report, place the files x_train.csv, x_test.csv, and y_train.csv in a folder named dataset inside the project repository. The data loading step expects to find them there.
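As a quick sanity check before running anything, the expected layout can be verified with a few lines of Python. This is only a sketch: the helper name missing_dataset_files is hypothetical and not part of the repository; the folder and file names are the ones listed above.

```python
from pathlib import Path

# File names come from this README; the helper itself is hypothetical.
REQUIRED_FILES = ("x_train.csv", "x_test.csv", "y_train.csv")


def missing_dataset_files(repo_root="."):
    """Return the required CSV files that are absent from <repo_root>/dataset."""
    dataset_dir = Path(repo_root) / "dataset"
    return [name for name in REQUIRED_FILES if not (dataset_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing from dataset/: " + ", ".join(missing))
    else:
        print("All dataset files found.")
```

Run from the repository root, it lists any file that still needs to be copied into dataset.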

Breakdown of the Files Inside the Codebase

Here is a quick breakdown of the files in the codebase, each listed with its purpose.

implementations.py

This file contains the implementations of the six machine learning methods specified in the project description document.

utils.py

utils.py includes some utility functions that are needed by the methods in implementations.py for certain computations.

helpers.py

This file is identical to the helpers.py provided with the project announcement and contains some useful helper functions.

preprocess.py

This file defines the Python class Preprocess, which cleans and preprocesses the project dataset using several strategies, each explained in the method docstrings of the class. To name a few, it handles both continuous and categorical features, removes unrelated columns from the dataset, and replaces NaN values with values sampled from a Gaussian distribution whose mean and standard deviation are those of the column containing the NaN values.
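The Gaussian imputation step can be illustrated with a short NumPy sketch. It is not the code of the Preprocess class: the function name fill_nans_gaussian, the seed argument, and the choice to estimate mean and standard deviation from the non-NaN entries of each column are assumptions made for this example.

```python
import numpy as np


def fill_nans_gaussian(x, seed=0):
    """Replace NaNs in each column with draws from N(column mean, column std).

    Hypothetical helper: statistics are estimated from the observed
    (non-NaN) entries of the column, which is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x, dtype=float)  # np.array copies, so the input is untouched
    for j in range(x.shape[1]):
        col = x[:, j]  # view into x, so writes below modify x
        mask = np.isnan(col)
        if not mask.any() or mask.all():
            continue  # nothing to fill, or no observed values to estimate from
        mean = col[~mask].mean()
        std = col[~mask].std()
        col[mask] = rng.normal(mean, std, size=mask.sum())
    return x
```

Columns that consist only of NaN values are left unchanged here, since no mean or standard deviation can be estimated for them.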

Before starting a training or a hyperparameter optimization run, a Preprocess object is created just after loading training and testing data. Then, the dataset is analyzed and feature engineering is performed via the method Preprocess.process().

KCV.py

This file defines the Python class K_Fold_CV, which performs K-fold cross-validation on the training set with some additional options.

Since we found that regularized logistic regression is the top-performing model among the simple models implemented in implementations.py, K_Fold_CV uses regularized logistic regression by default.
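For readers unfamiliar with the model, here is a minimal sketch of regularized logistic regression trained with gradient descent. It is not the code in implementations.py. The function name, the labels in {0, 1}, and the loss convention (mean log-loss plus lambda_ times the squared L2 norm of w) are assumptions of this example.

```python
import numpy as np


def sigmoid(t):
    """Logistic function applied elementwise."""
    return 1.0 / (1.0 + np.exp(-t))


def reg_logistic_regression_sketch(y, tx, lambda_, initial_w, max_iters, gamma):
    """Gradient descent on mean log-loss + lambda_ * ||w||^2 (assumed convention).

    y holds labels in {0, 1}, tx is the (n_samples, n_features) design matrix,
    gamma is the step size. Returns the final weight vector.
    """
    w = np.asarray(initial_w, dtype=float).copy()
    n = len(y)
    for _ in range(max_iters):
        # Gradient of the mean log-loss plus gradient of the L2 penalty.
        grad = tx.T @ (sigmoid(tx @ w) - y) / n + 2.0 * lambda_ * w
        w = w - gamma * grad
    return w
```

Here gamma and lambda_ play the roles of the step size and the regularization strength that the hyperparameter search varies.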

To create an instance of K_Fold_CV, pass the preprocessed versions of x_train and y_train together with the hyperparameter k (the number of folds). Once the instance is created, call K_Fold_CV.train_and_evaluate() to run the cross-validation, then call K_Fold_CV.write_mean_metrics() to write the training results to a file named metrics.txt.
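The splitting step behind K-fold cross-validation can be sketched as follows. This shows the general technique, not the internals of K_Fold_CV. The generator name k_fold_indices and its seed argument are hypothetical.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, near-equal folds."""
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices once, then cut them into k folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
```

Each sample lands in the validation set exactly once, and the metrics from the k runs are then averaged.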

hyperparameter_optimization.py

As the name suggests, this file starts a hyperparameter optimization run with grid search for our best-performing model, regularized logistic regression. The hyperparameter values used for the search are set inside the file:

K = 3
max_iters = 600
gammas = [1e-3,1e-2,1e-1,1]
lambdas = [1e-4,1e-3,1e-2,1e-1]

To obtain the hyperparameter search results presented in the project report, run this file with python hyperparameter_optimization.py. This assumes that the files x_train.csv, x_test.csv, and y_train.csv are placed in a folder named dataset inside the project repository so that data loading succeeds. After the run terminates, a file named metrics.txt is created in the project directory, containing the cross-validation metrics for each hyperparameter combination.
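With four values each for gamma and lambda, the grid covers 16 combinations. A minimal sketch of such a grid search loop is given below. It is not the code of hyperparameter_optimization.py: grid_search and score_fn are hypothetical names, and score_fn stands in for a cross-validation run that returns a higher-is-better metric.

```python
from itertools import product

# Values copied from the list above.
gammas = [1e-3, 1e-2, 1e-1, 1]
lambdas = [1e-4, 1e-3, 1e-2, 1e-1]


def grid_search(score_fn, gammas, lambdas):
    """Evaluate score_fn on every (gamma, lambda) pair and return the best one.

    score_fn(gamma, lambda_) is a hypothetical callable, e.g. the mean
    validation metric of one K-fold cross-validation run.
    """
    results = {
        (gamma, lam): score_fn(gamma, lam) for gamma, lam in product(gammas, lambdas)
    }
    best = max(results, key=results.get)
    return best, results
```

In practice score_fn would wrap a full cross-validation run, so its cost dominates the loop.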

run.py

This file generates submission.csv, which contains the predictions of our best-performing model with the hyperparameters found via the hyperparameter optimization run. Running the file without changing anything reproduces the predictions we submitted to the aicrowd.com online competition system.
