Update 04_preprocessing_and_training.ipynb
is7649 committed Apr 1, 2023
commit 7b14d857e1f4e24f513a10d64f454903b839460c
19 changes: 14 additions & 5 deletions Notebooks/04_preprocessing_and_training.ipynb
@@ -3783,14 +3783,14 @@
},
{
"cell_type": "code",
-"execution_count": 128,
+"execution_count": 134,
"metadata": {},
"outputs": [],
"source": [
"#Just what version model have you just loaded to reuse? What version of `sklearn` created it? \n",
"#Let's call this model version '1.0'\n",
"best_model = rf_grid_cv.best_estimator_\n",
-"best_model.version = 1.0\n",
+"best_model.version = '1.0'\n",
"#Assign the pandas version number (`pd.__version__`) to the `pandas_version` attribute,\n",
"best_model.pandas_version = pd.__version__\n",
"#the numpy version (`np.__version__`) to the `numpy_version` attribute,\n",
@@ -3804,14 +3804,16 @@
},
{
"cell_type": "code",
-"execution_count": 129,
+"execution_count": 135,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
-"Directory ../models was created.\n",
+"A file already exists with this name.\n",
+"\n",
+"Do you want to overwrite? (Y/N)Y\n",
"Writing file. \"../models/ski_resort_pricing_model.pkl\"\n"
]
}
@@ -3841,8 +3843,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"The data was divided into two parts (70/30). The 70% portion of the data was named ‘X_train’, and the 30% portion of the data was named ‘X_test’. The mean of ‘y_train’, which is part of the ‘X_train’ data, was found to be approximately 63.81. A dummy regressor was used to fit the ‘X_train’ and ‘y_train’ data. It returned a mean of approximately 63.81 as well. When using the average value as the prediction, R-squared was 0 on the training set and approximately -0.00312 on the test set. To determine how close the predictions are, the mean absolute error and mean squared error were calculated. The mean absolute error was approximately 19.14. This means that on average it is expected that actual ticket prices are ~$19.14 off from the predicted ticket prices based on the average of known values. When using the mean to fill in missing values, the mean absolute error was ~$9.00, which is significantly better than the MAE with guessing using average. Another linear model was built by imputing values using the mean. The MAE did not appear to be much different from the model using the median. When using cross validation with CV=5 and default k, the mean r-squared was found to be ~0.633 +/- 0.095, which is consistent with the previous models. The cross validation model expects 95% of the r-squares are expected to be between 0.44 and 0.82. Further analysis showed that using a k=8 would ensure a higher r-squared with an even smaller error. A random forest regressor was tried. It found that the best parameter was median and that scaling the data did not help. Due to its lower cross validation mean and smaller variability and consistent test results with the cross validation data, the random forest regression model was chosen. "
+"'''The data was divided into two parts (70/30). The 70% portion of the data was named ‘X_train’, and the 30% portion of the data was named ‘X_test’. The mean of ‘y_train’, which is part of the ‘X_train’ data, was found to be approximately 63.81. A dummy regressor was used to fit the ‘X_train’ and ‘y_train’ data. It returned a mean of approximately 63.81 as well. When using the average value as the prediction, R-squared was 0 on the training set and approximately -0.00312 on the test set. To determine how close the predictions are, the mean absolute error and mean squared error were calculated. The mean absolute error was approximately 19.14. This means that on average it is expected that actual ticket prices are approximately 19.14 dollars away from the predicted ticket prices based on the average of known values. When using the mean to fill in missing values, the mean absolute error was approximately $9.00, which is significantly better than the MAE with guessing using average. Another linear model was built by imputing values using the mean. The MAE did not appear to be much different from the model using the median. When using cross validation with CV=5 and default k, the mean r-squared was found to be ~0.633 +/- 0.095, which is consistent with the previous models. The cross validation model expects 95% of the r-squares are expected to be between 0.44 and 0.82. Further analysis showed that using a k=8 would ensure a higher r-squared with an even smaller error. A random forest regressor was tried. It found that the best parameter was median and that scaling the data did not help. Due to its lower cross validation mean and smaller variability and consistent test results with the cross validation data, the random forest regression model was chosen. '''"
]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
}
],
"metadata": {
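The first hunk changes `best_model.version` from the float `1.0` to the string `'1.0'`. The sketch below is one way to see why a string is the safer choice for a version label, and how such attributes survive a pickle round trip. It is an illustration under assumptions, not the notebook's own code: `stamp`, `save_and_reload`, and the file name `stand_in_model.pkl` are hypothetical, and a `SimpleNamespace` stands in for the fitted estimator so the example runs without scikit-learn.

```python
import pickle
import platform
import tempfile
from pathlib import Path
from types import SimpleNamespace


def stamp(model, version):
    """Attach provenance metadata as plain attributes (hypothetical helper)."""
    if not isinstance(version, str):
        raise TypeError("version should be a string such as '1.0'")
    model.version = version
    model.python_version = platform.python_version()
    return model


def save_and_reload(model, path):
    """Pickle `model` to `path`, creating parent directories, then load it back."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)
    with path.open("rb") as f:
        return pickle.load(f)


# Floats cannot tell version 1.10 from version 1.1; strings can.
floats_collide = (1.10 == 1.1)
strings_differ = ("1.10" != "1.1")

with tempfile.TemporaryDirectory() as tmp:
    # SimpleNamespace is a stand-in for the fitted estimator (assumption).
    model = stamp(SimpleNamespace(), "1.0")
    reloaded = save_and_reload(model, Path(tmp) / "models" / "stand_in_model.pkl")

print(reloaded.version, type(reloaded.version).__name__)
```

Attributes set this way are stored in the pickle along with the object, so the version label is still available after loading. A real scikit-learn estimator also accepts ad hoc attributes like these, but they are not part of its documented parameters.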