assignment06/assignment07.qmd at main · BenjaminFReese/assignment06 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
---
title: "Assignment07: Machine Learning"
author: "Benjamin Reese"
format: html
self-contained: true
---

# Exercise 01

## Runing Code to Create Dataset

```{r datasetupchunk, warning=FALSE, message=FALSE}
## Loading Packages
library(tidyverse)
library(lubridate)
library(tidymodels)

# use this url to download the data directly into R
df <- read_csv("https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv")

# clean names with janitor
sampled_df <- df %>%
  janitor::clean_names()

# create an inspection year variable
sampled_df <- sampled_df %>%
  mutate(inspection_date = mdy(inspection_date)) %>%
  mutate(inspection_year = year(inspection_date))

# get most-recent inspection
sampled_df <- sampled_df %>%
  group_by(camis) %>%
  filter(inspection_date == max(inspection_date)) %>%
  ungroup()

# subset the data
sampled_df <- sampled_df %>%
  select(camis, boro, zipcode, cuisine_description, inspection_date,
         action, violation_code, violation_description, grade,
         inspection_type, latitude, longitude, council_district,
         census_tract, inspection_year, critical_flag) %>%
  filter(complete.cases(.)) %>%
  filter(inspection_year >= 2017) %>%
  filter(grade %in% c("A", "B", "C"))

# create the binary target variable
sampled_df <- sampled_df %>%
  mutate(grade = if_else(grade == "A", "A", "Not A")) %>%
  mutate(grade = as.factor(grade))

# create extra predictors
sampled_df <- sampled_df %>%
  group_by(boro, zipcode, cuisine_description, inspection_date,
           action, violation_code, violation_description, grade,
           inspection_type, latitude, longitude, council_district,
           census_tract, inspection_year)  %>%
  mutate(vermin = str_detect(violation_description, pattern = "mice|rats|vermin|roaches")) %>%
  summarize(violations = n(),
            vermin_types = sum(vermin),
            critical_flags = sum(critical_flag == "Y")) %>%
  ungroup()

# write the data
write_csv(sampled_df, "restaurant_grades.csv")

```


## 1.

In this exercise, I split `sampled_df`, the restaurant data, into training and testing datasets, create a recipe, and estimate a decision tree model.

### a. Setting the Seed and Splitting the Dataset

```{r seedsetting, warning=FALSE, message=FALSE}
## Setting Seed
set.seed(20201020)

## Splitting the Sample
split <- initial_split(sampled_df, prop = 0.8)

## Training and Testing
restaurant_train <- training(split)
restaurant_test <- testing(split)
```

### b. Creating the Recipe

```{r restaurantrecipe, warning=FALSE, message=FALSE}
## Creating the Recipe
restaurant_rec <-
  recipe(grade ~ ., data = restaurant_train) %>%
  themis::step_downsample(grade)
```

### c. Code for the Decision Tree

```{r decisiontree, warning=FALSE, message=FALSE}
## Creating the CART Model
cart_mod <-
  decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode(mode = "classification")

## Workflow
cart_wf <- workflow() %>%
  add_recipe(restaurant_rec) %>%
  add_model(cart_mod)

## Estimating the Model
cart_fit <- cart_wf %>%
  fit(data = restaurant_train)
```

## 2.

The code below evaluates the model by generating a confusion matrix and calculates precision and sensitivity. A written description of the quality of the model is also included.

### Creating predictions tibble

```{r gradepredictions, warning=FALSE, message=FALSE}
## Selecting the Predictions
predictions <- bind_cols(
  restaurant_test,
  predict(object = cart_fit, new_data = restaurant_test),
  predict(object = cart_fit, new_data = restaurant_test, type = "prob")
) %>%
  mutate(color=as.factor(grade))

## The Predictions
select(predictions, grade, starts_with(".pred"))
```

### a. Confusion Matrix

```{r confusion, warning=FALSE, message=FALSE}
## Confusion matrix
conf_mat(data = predictions, truth = grade, estimate = .pred_class)
```

### b. Calculating Precision & Recall

```{r precisionrecall, warning=FALSE, message=FALSE}
## Precision
precision(data = predictions, truth = grade, estimate = .pred_class)

## Recall/Sensitivity
recall(data = predictions, truth = grade, estimate = .pred_class)

```

### c.

The model's precision, or how often the classifier is correct when it predicts an event, is very high, .998. This high level of precision means that the model is very good at predicting an event when there is an event. In this case, our model is highly precise when predicting a restaurant has an A rating when it actually has an A rating. The model's sensitivity, or how often the classifier is correct when there is an event is not as high as its precision. Overall, the quality of the model depends on its intended use and the relative costs associated with its use. If the model predicts an A rating, then we can be pretty sure the result is not a false positive, but we will see many false negatives.

## 3.

Using a *KNN*, or K-nearest neighbors, algorithm instead of the decision tree algorithm may improve the quality of the model. Instead of probabilistically dividing the data into two sets, based on predictors, as the decision tree model does, a *KNN* algorithm stores all available cases and classifies new cases by taking a majority vote of its *k* closest neighbors. This majority vote may prove more accurate than the binary dividing of the decision tree model.

## 4. Plot for Variable Importance

```{r gradeimportance, warning=FALSE, message=FALSE}
library(vip)

cart_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

```

Variable, or feature, importance in decision tree models is calculated by finding the decrease in the Gini impurity - the quality of the split - weighted by the probability of reaching a node. Node probability is the proportion of samples that reach a node out of the total samples. The most important variables in a decision tree model are the variables associated with the most Gini impurity being eliminated - highest quality split - at each branch of the decision tree. In other words, the variable with the highest value of the decrease in impurity, weighted by the probability of the sample making it to a node, is the most important feature.

In the model above, the most important feature is the inspection type. Location, defined by the Census tract the restaurant is located in is the second most important variable. These two are followed by a series of variables related to the types of violations associated with a restaurant, the type of cuisine, and the date of the inspection. The most important feature, inspection type, is a categorical variable largely divided between initial and re-inspections. It is unsurprising that the most important predictor of a rating is the type of inspection because the rating is a direct result of an inspection, and the exact circumstances of an inspection surely dictate the ensuing ratings. It is also possible that restaurants fix violations and receive A ratings on subsequent visits and re-inspections, meaning re-inspection will predict A ratings.

## 5.

Restaurants that receive low ratings will want to correct their violations and try to raise their grade, but this requires an initial inspection. If restaurants correct after a rating, then the focus should be on casting a large net and reaching as many restaurants as possible, even foregoing continued visits at established businesses, in order to encourage restaurants to improve their health safety conditions. Also, location is one of the most important predictors of grade as well, so spending time inspecting restaurants in locations that are associated with low health inspection ratings will be an optimal use of health department resources.

# Exercise 2

## Running Code From Assignment

```{r copyingcode, warning=FALSE, message=FALSE}
Chicago_modeling <- Chicago %>%
slice(1:5678)

Chicago_implementation <- Chicago %>%
slice(5679:5698) %>%
select(-ridership)
```

## 1. Converting Date into Useable Variable

```{r convertingdate, warning=FALSE, message=FALSE}
## Loading Lubridate
library(lubridate)

## Converting Dates
Chicago_modeling <- Chicago_modeling %>%
  mutate(weekday = wday(date, label = TRUE),
         month = month(date, label = TRUE),
         yearday = yday(date)
  ) %>%
  janitor::clean_names() ## cleaning names

```

## 2. Setting up the testing enviroment

### a. Setting the Seed and Splitting Modeling Data into Training & Testing

```{r settingseed2, warning=FALSE, message=FALSE}
## Setting seed
set.seed(20211101)

## Splitting
split <- initial_split(data = Chicago_modeling)

## Training and Testing
chicago_train <- training(x = split)
chicago_test <- testing(x = split)
```

### b. Exploratory Data Analysis

```{r eda, warning=FALSE, message=FALSE}
## Counting Daily Ridership
chicago_train %>%
  group_by(weekday) %>%
  summarise(avg_daily_ridership = mean(ridership))


## Daily Average Ridership Plot
chicago_train %>%
  group_by(weekday) %>%
  summarise(avg_daily_ridership = mean(ridership)) %>%
  ggplot(aes(x=weekday, y=avg_daily_ridership, fill=weekday)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Daily Ridership for Chicago's Clark-Lake Station",
       x = "Weekday", y = "Average Daily Ridership", fill="Weekday",
       subtitle = "Ridership Peaks During Working Week, Drops off During Weekend",
       caption = "Data Source: Chicago Transity Authority")


## Ridership During Games Plot
chicago_train %>%
  mutate(game = case_when(
    blackhawks_home == 1 ~ "Blackhawks",
    bulls_home == 1 ~ "Bulls",
    bears_home == 1 ~ "Bears",
    white_sox_home == 1 ~ "White Sox",
    cubs_home == 1 ~ "Cubs",
    blackhawks_home == 0 ~ "Away",
    bulls_home == 0 ~ "Away",
    bears_home == 0 ~ "Away",
    white_sox_home == 0 ~ "Away",
    cubs_home == 0 ~ "Away",
  )) %>%
  group_by(game) %>%
  summarise(avg_gametime_ridership = mean(ridership)) %>%
  ggplot(aes(x=game, y=avg_gametime_ridership, fill=game)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Gametime Ridership for Chicago's Clark-Lake Station",
       x = "Team", y = "Average Ridership", fill="Team",
       caption = "Data Source: Chicago Transity Authority")

## Ridership During Bears Games
chicago_train %>%
  mutate(game = case_when(
    blackhawks_home == 1 ~ "Blackhawks",
    bulls_home == 1 ~ "Bulls",
    bears_home == 1 ~ "Bears",
    white_sox_home == 1 ~ "White Sox",
    cubs_home == 1 ~ "Cubs",
    blackhawks_home == 0 ~ "Away",
    bulls_home == 0 ~ "Away",
    bears_home == 0 ~ "Away",
    white_sox_home == 0 ~ "Away",
    cubs_home == 0 ~ "Away",
  )) %>%
  filter(weekday=="Sun") %>%
  group_by(bears_home) %>%
  summarise(avg_gametime_ridership = mean(ridership))

## Ridership During Inclement Weather
chicago_train %>%
  mutate(weather = case_when(
    weather_rain > 0 ~ "Rain",
    weather_snow > 0 ~ "Snow",
    weather_cloud > 0 ~ "Cloud",
    weather_storm > 0 ~ "Storm",
    weather_rain == 0 ~ "Sunny",
    weather_snow == 0 ~ "Sunny",
    weather_cloud == 0 ~ "Sunny",
    weather_storm == 0 ~ "Sunny",
  )) %>%
  group_by(weather) %>%
  summarise(weather_ridership = mean(ridership))

chicago_train %>%
  mutate(weather = case_when(
    weather_rain > 0 ~ "Rain",
    weather_snow > 0 ~ "Snow",
    weather_cloud > 0 ~ "Cloud",
    weather_storm > 0 ~ "Storm",
    weather_rain == 0 ~ "Sunny",
    weather_snow == 0 ~ "Sunny",
    weather_cloud == 0 ~ "Sunny",
    weather_storm == 0 ~ "Sunny",
  )) %>%
  group_by(weather) %>%
  summarise(weather_ridership = mean(ridership)) %>%
  ggplot(aes(x=weather, y=weather_ridership, fill=weather_ridership)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Inclement Weather Ridership for Chicago's Clark-Lake Station",
       x = "Weather", y = "Average Ridership",
       subtitle = "Weather is Not A Strong Predictor of Ridership",
       caption = "Data Source: Chicago Transity Authority") +
  theme(legend.position="none")

## Relationship Between Weather and Ridership
chicago_train %>%
  ggplot(aes(x=temp, y=ridership)) +
  geom_point(alpha=.3) +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Relationship Between Temperature and Ridership at Clark-Lake Station",
       x = "Temperature (F)", y = "Ridership", caption = "Data Source: Chicago Transity Authority")

```

My exploratory data analysis finds that the most important predictor of ridership is the day of the week with weekdays seeing a larger amount of ridership than weekends. Weather and temperature does not appear to have a strong effect, and neither does the occurrence of a professional sports game.

### c. Setting up v-fold cross validation

```{r folds, warning=FALSE, message=FALSE}
## Setting up folds
folds <- vfold_cv(data = chicago_train, v = 10, repeats = 1)
```

## 3. Testing Different Approaches

### 1. Creating Recipe

```{r recipebake, warning=FALSE, message=FALSE}
## Recipe
chicago_rec <-
  recipe(ridership ~ ., data = chicago_train) %>%
  step_dummy(all_nominal_predictors()) %>% ## Dummy encode categorical predictors
  step_normalize(all_numeric_predictors()) %>%   ## Center and scale all predictors
  step_nzv(all_predictors()) %>% ## Remove non-variance variables
  step_holiday(date) %>%
  step_rm(date)

## Baking Recipe
bake(prep(chicago_rec, training = chicago_train), new_data = chicago_train)

```


### 2. Defining Model Specifications

I specify a linear regression model, a KNN model with hyperparameter tuning, and a random forest model.

```{r models, warning=FALSE, message=FALSE}
## LM Model
lm_mod <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode(mode="regression")

## KNN Model
knn_mod <- nearest_neighbor(neighbors = tune()) %>%
  set_engine(engine = "kknn") %>%
  set_mode(mode = "regression")


## Random Forest Model
rf_mod <- rand_forest() %>%
  set_engine(engine = "randomForest") %>%
  set_mode(mode = "regression")
```

### 3. Workflows For Each Model

```{r workflows, warning=FALSE, message=FALSE}
## LM Workflow
lm_wf <- workflow() %>%
  add_recipe(chicago_rec) %>%
  add_model(lm_mod)

## KNN Tuning Grid
knn_grid <- grid_regular(neighbors(range = c(1, 15)), levels = 8)

## KNN Workflow
knn_wf <- workflow() %>%
  add_model(spec = knn_mod) %>%
  add_recipe(recipe = chicago_rec)

## Random Forest Workflow
rf_wf <- workflow() %>%
  add_model(spec = rf_mod) %>%
  add_recipe(recipe = chicago_rec)
```

### 4. Fitting Models with v-fold Cross Validation

```{r fittings, warning=FALSE, message=FALSE}
## Fitting LM Model
lm_fit <- lm_wf %>%
  fit_resamples(resamples = folds, metrics = metric_set(mae, rmse))

## Fitting KNN Model
knn_fit <- knn_wf %>%
  tune_grid(resamples = folds,
            grid = knn_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(rmse, mae))

## Fitting Random Forest Model
rf_fit <- rf_wf %>%
  fit_resamples(resamples = folds, metrics = metric_set(mae, rmse))
```

### 5. Calculating and Plotting RMSE and MAE for Each Model's Resamples

```{r plotting, warning=FALSE, message=FALSE}
# All LM Error Metrics
lm_fit %>%
  collect_metrics(summarize=FALSE)

## Averaged LM Error Metrics
lm_fit %>%
  collect_metrics(summarize=TRUE) %>%
  filter(.metric == "rmse")

# All KNN Error Metrics
knn_fit %>%
  collect_metrics(summarize = FALSE)

## Averaged KNN Error Metrics
knn_fit %>%
  collect_metrics(summarize=TRUE) %>%
  filter(.metric == "rmse") %>%
  summarize(avg_rmse = mean(mean))

## All Random Forest Error Metrics
rf_fit %>%
  collect_metrics(summarize=FALSE)

## Averaged Random Forest Error Metrics
rf_fit %>%
  collect_metrics(summarize=TRUE) %>%
  filter(.metric == "rmse")

## LM RMSE Plot
collect_metrics(lm_fit, summarize = FALSE) %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(id, .estimate, group = .estimator)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(limits=c(0,3)) +
  labs(title = "Calculated RMSE Across the 10 Folds for the Linear Regression Model",
       y = "RMSE_hat") +
  theme_minimal()

## KNN RMSE Plot
collect_metrics(knn_fit, summarize = FALSE) %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(id, .estimate, group = .estimator)) +
  geom_point() +
  geom_smooth(col="black", se=F) +
  scale_y_continuous(limits=c(0,3)) +
  labs(title = "Calculated RMSE Across the 10 Folds for the KNN Model",
       y = "RMSE_hat") +
  theme_minimal()

## Random Forest RMSE Plot
collect_metrics(rf_fit, summarize = FALSE) %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(id, .estimate, group = .estimator)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(limits=c(0,3)) +
  labs(title = "Calculated RMSE Across the 10 Folds for the Random Forest Model",
       y = "RMSE_hat") +
  theme_minimal()

```


## 4. Calculating Out-of-Sample Error

The random forest model has both the lowest averaged mean absolute error and the lowest average root mean square error.


```{r outofsampleerror, warning=FALSE, message=FALSE}
## Selecting Best
rf_best <- rf_fit %>%
  select_best("rmse")

## Finalizing Workflow for Random Forest Model
rf_final <- rf_wf %>%
  tune::finalize_workflow(rf_best) %>%
  parsnip::fit(data = chicago_train)

## Using the Estimated Model Actual Values in the Training Data
predictions <-
  bind_cols(
    chicago_test,
    predict(object = rf_final, new_data = chicago_test)
)

rmse(data = predictions, truth = ridership, estimate = .pred)


```

## 5. Implementing the Final Model

```{r implementingmodel, warning=FALSE, message=FALSE}
## Cleaning and Adding the weekday variables to the implementation data
Chicago_implementation <- Chicago_implementation %>%
  mutate(weekday = wday(date, label = TRUE),
         month = month(date, label = TRUE),
         yearday = yday(date)
  ) %>%
  janitor::clean_names()

## Predictions for Implementation Data
predict(object = rf_final, new_data = Chicago_implementation)

## Calculating RMSE for the Implementation Data
rmse(data = predictions, truth = ridership, estimate = .pred)
```

## 6.

The model overall is somewhat accurate, predicting higher ridership values when there is actually higher ridership and low ridership when there is lower ridership. The size of our error metric, root mean square error, however may be troublesome for Chicago city officials. As the ridership variable is measured in thousands of riders, the root mean square error is nearly 2 thousand riders. When some days such as Saturdays and Sundays only have 6 thousand riders, then a policy choice based on having $\frac{1}{3}$ fewer or greater riders may become an issue. It is better, however, then no model.

The model is both globally and locally interpretable, meaning we can gain both broad, or "global", insights about ridership, and we can also easily understand why specific points are predicted as they are. The variables included in the model are straightforward and the connection between outcome and predictors is easily made. For example, the day of the week is a strong predictor because we expect people to use the L train when going to work. Variables like the day of the week, especially Mondays, and the ridership levels at other stations are the most important predictors. This can be seen in the importance plot below.


```{r evaluatingmodel, warning=FALSE, message=FALSE}
## Saving Actual Values in Implementation Data
actual_values <- Chicago %>%
slice(5679:5698) %>%
select(ridership)

## Saving the Predicted Values
pred_values <- predict(object = rf_final, new_data = Chicago_implementation)

## Combining Columns
act_pred_values <- bind_cols(actual_values, pred_values)

## Visualizing Model Accuracy
act_pred_values %>%
  ggplot(aes(x=ridership, y=.pred)) +
  geom_point() +
  geom_smooth(method = "lm", se=F, col="black") +
  labs(title = "Actual Ridership vs Model Predicted Ridership",
       x = "Actual Ridership", y = "Random Forest Model Predicted Ridership",
       caption = "Data Source: Chicago Transit Authority")

## Showing the Important Variables
rf_final %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

```