Update 04_preprocessing_and_training_LinaAbdullahi.ipynb
Lina-abd committed Aug 3, 2023
commit fcf0b67f3f18ee7192c676799646db23f2a6096d
20 changes: 13 additions & 7 deletions Notebooks/04_preprocessing_and_training_LinaAbdullahi.ipynb
@@ -3988,7 +3988,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 110,
"metadata": {},
"outputs": [],
"source": [
@@ -4013,9 +4013,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 111,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Directory ../models was created.\n",
"Writing file. \"../models\\ski_resort_pricing_model.pkl\"\n"
]
}
],
"source": [
"# save the model\n",
"\n",
@@ -4050,14 +4059,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this preprocessing step, we first evaluated the usefulness of the mean value as a predictor. Then, constructed a data processing pipeline that imputed missing values, scaled the data, selected the most relevant features using the `select k best`strategy, and trained a linear regression model while employing cross-validation to estimate its performance.\n",
"\n",
"In our preprocessing, we later compared imputing missing values with both the median and mean approaches and assessed their performance using a linear regression model. Interestingly, the choice of imputation method (median or mean) didn't have a significant impact on the model's accuracy. On average, the model could estimate a ticket price within approximately 9 dollars of the actual price, which was much better than a simple guess based on the average value (with a variance of 19 dollars).\n",
"In this preprocessing step, we first evaluated the usefulness of the mean value as a predictor. we then compared imputing missing values with both the median and mean approaches and assessed their performance using a linear regression model. Interestingly, the choice of imputation method (median or mean) didn't have a significant impact on the model's accuracy. On average, the model could estimate a ticket price within approximately 9 dollars of the actual price, which was much better than a simple guess based on the average value (with a variance of 19 dollars).\n",
"\n",
"Next, we employed a pipeline that involved imputing missing values, scaling the data, and performing linear regression in a single process. To identify the most influential/dominant features, we utilized the `select k best` strategy. After experimenting with setting different values of K, we did a hyperparameter search using GridSearchCV and found that 8 was the best value for k (8 number of features). The top 8 features that the linear regression model identified as the most important for predicting ticket prices were: `vertical_drop`, `Snow Making_ac`, `total_chairs`, `fastQuads`, `Runs`, `LongestRun_mi`, `trams`, and `SkiableTerrain_ac`.\n",
"\n",
"Subsequently, we explored the random forest model using cross-validation and found that it aligned with the linear model, highlighting the importance of four features: `fastQuads`, `Runs`, `Snow Making_ac`, and `vertical_drop`.\n",
"\n",
"Comparing the performance of the `linear regression` and `random forest models`, we observed that the random forest model exhibited a lower cross-validation mean absolute error, almost 1 dollar less. As a result, we conclude that the preferred choice for this project is the `random forest model`, selected for its superior stability and lower cross-validation mean absolute error.\n",
"\n",
"Validating the performance on the test set yielded consistent results, affirming the reliability and effectiveness of the random forest model for predicting ticket prices."
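
For readers who want to see the shape of the pipeline and hyperparameter search the summary describes, here is a minimal sketch using scikit-learn. It runs on synthetic data from `make_regression`, not the ski resort dataset, and the choice of `f_regression` as the scoring function and `neg_mean_absolute_error` as the search metric are assumptions; the notebook's actual settings are not visible in this diff.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 12 numeric features.
X, y = make_regression(
    n_samples=200, n_features=12, n_informative=5, noise=10.0, random_state=0
)

# Blank out roughly 5% of the values so the imputer has something to do.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Impute, scale, select k features, and fit a linear model in one estimator.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SelectKBest(score_func=f_regression),
    LinearRegression(),
)

# Search over both the imputation strategy and the number of features kept.
param_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "selectkbest__k": list(range(1, X.shape[1] + 1)),
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)

best_k = search.best_params_["selectkbest__k"]
best_strategy = search.best_params_["simpleimputer__strategy"]
cv_mae = -search.best_score_  # scores are negated MAE, so flip the sign
print(f"best k: {best_k}, imputation: {best_strategy}, CV MAE: {cv_mae:.2f}")
```

Keeping the imputer, scaler, and selector inside the pipeline means each cross-validation fold refits them on its own training split, so nothing leaks from the held-out fold into preprocessing.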