Update 04_preprocessing_and_training.ipynb

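Note on the `version` change in this commit: the attribute moves from the float `1.0` to the string `'1.0'`. The sketch below shows one reason a string is usually the safer choice for a version label. It is a minimal illustration, not the notebook's code: `Model` is a hypothetical stand-in for the fitted estimator, and `parse_version` is an illustrative helper.

```python
class Model:
    """Hypothetical stand-in for a fitted estimator that accepts ad hoc attributes."""


float_model = Model()
float_model.version = 1.10    # stored as a float, so the trailing zero is lost

str_model = Model()
str_model.version = '1.10'    # stored as a string, kept exactly as written

float_version_text = str(float_model.version)
str_version_text = str_model.version


def parse_version(text):
    """Split a dotted version string into a tuple of ints for ordering."""
    return tuple(int(part) for part in text.split('.'))


# As floats, 1.10 < 1.9; as parsed version tuples, (1, 10) > (1, 9).
newer = parse_version('1.10') > parse_version('1.9')
```

With a float, a label such as 1.10 cannot be told apart from 1.1, while a string round-trips unchanged through pickling and can be parsed when ordering matters.
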
springboard-curriculum · is7649 · Feb 21, 2023 · Feb 24, 2023 · Mar 28, 2023 · Apr 1, 2023
commit 7b14d857e1f4e24f513a10d64f454903b839460c
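The summary cell changed below quotes a cross-validation R-squared of ~0.633 +/- 0.095 and a 95% range of 0.44 to 0.82. A small sketch of the arithmetic behind that range, assuming 0.095 is the standard deviation of the fold scores and the range is the mean plus or minus two standard deviations (variable names are illustrative):

```python
cv_mean = 0.633   # mean R-squared across the 5 folds, as reported in the summary
cv_std = 0.095    # assumed to be the standard deviation of the fold scores

# Approximate 95% range: mean plus or minus two standard deviations
lower = round(cv_mean - 2 * cv_std, 2)
upper = round(cv_mean + 2 * cv_std, 2)
print(lower, upper)
```
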
diff --git a/Notebooks/04_preprocessing_and_training.ipynb b/Notebooks/04_preprocessing_and_training.ipynb
@@ -3783,14 +3783,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 128,
+   "execution_count": 134,
    "metadata": {},
    "outputs": [],
    "source": [
     "#Just what version model have you just loaded to reuse? What version of `sklearn` created it? \n",
     "#Let's call this model version '1.0'\n",
     "best_model = rf_grid_cv.best_estimator_\n",
-    "best_model.version = 1.0\n",
+    "best_model.version = '1.0'\n",
     "#Assign the pandas version number (`pd.__version__`) to the `pandas_version` attribute,\n",
     "best_model.pandas_version = pd.__version__\n",
     "#the numpy version (`np.__version__`) to the `numpy_version` attribute,\n",
@@ -3804,14 +3804,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 129,
+   "execution_count": 135,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Directory ../models was created.\n",
+      "A file already exists with this name.\n",
+      "\n",
+      "Do you want to overwrite? (Y/N)Y\n",
       "Writing file.  \"../models/ski_resort_pricing_model.pkl\"\n"
      ]
     }
@@ -3841,8 +3843,15 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The data was divided into two parts (70/30). The 70% portion of the data was named ‘X_train’, and the 30% portion of the data was named ‘X_test’. The mean of ‘y_train’, which is part of the ‘X_train’ data, was found to be approximately 63.81. A dummy regressor was used to fit the ‘X_train’ and ‘y_train’ data. It returned a mean of approximately 63.81 as well. When using the average value as the prediction, R-squared was 0 on the training set and approximately -0.00312 on the test set. To determine how close the predictions are, the mean absolute error and mean squared error were calculated. The mean absolute error was approximately 19.14. This means that on average it is expected that actual ticket prices are ~$19.14 off from the predicted ticket prices based on the average of known values. When using the mean to fill in missing values, the mean absolute error was ~$9.00, which is significantly better than the MAE with guessing using average.  Another linear model was built by imputing values using the mean. The MAE did not appear to be much different from the model using the median. When using cross validation with CV=5 and default k, the mean r-squared was found to be ~0.633 +/- 0.095, which is consistent with the previous models. The cross validation model expects 95% of the r-squares are expected to be between 0.44 and 0.82. Further analysis showed that using a k=8 would ensure a higher r-squared with an even smaller error. A random forest regressor was tried. It found that the best parameter was median and that scaling the data did not help. Due to its lower cross validation mean and smaller variability and consistent test results with the cross validation data, the random forest regression model was chosen. "
+    "'''The data was split 70/30: the 70% training portion was named ‘X_train’ and the 30% test portion was named ‘X_test’. The mean of ‘y_train’, the target values for the training portion, was approximately 63.81. A dummy regressor fit on ‘X_train’ and ‘y_train’ returned the same mean of approximately 63.81. With this mean as the prediction, R-squared was 0 on the training set and approximately -0.00312 on the test set. To measure how close the predictions are, the mean absolute error (MAE) and mean squared error were calculated. The MAE was approximately 19.14, meaning that a prediction based on the average of known values is expected to be off from the actual ticket price by about 19.14 dollars on average. When a linear model was fit with missing values filled in by the median, the MAE was approximately $9.00, which is significantly better than the MAE from guessing the average. Another linear model was built by imputing values using the mean; its MAE was not much different from that of the model using the median. Using cross validation with CV=5 and the default k, the mean R-squared was ~0.633 +/- 0.095, which is consistent with the previous models. That is, about 95% of the R-squared values are expected to fall between 0.44 and 0.82 (the mean plus or minus two standard deviations). Further analysis suggested that k=8 gives a higher R-squared with an even smaller error. A random forest regressor was also tried. The grid search found that the best imputation strategy was the median and that scaling the data did not help. Because of its lower cross-validation mean absolute error, its smaller variability, and test results consistent with the cross-validation results, the random forest regression model was chosen.'''"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {