-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathassignment08.qmd
More file actions
413 lines (318 loc) · 11.9 KB
/
assignment08.qmd
File metadata and controls
413 lines (318 loc) · 11.9 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
---
title: "Assignment 08"
author: "Benjamin Reese"
format: html
self-contained: true
---
### Packages
```{r packages, warning=FALSE,message=FALSE}
## Loading in Packages
library(readr)
library(tidyverse)
library(tidymodels)
library(patchwork)
library(factoextra)
library(broom)
library(tidytext)
library(lubridate)
library(SnowballC)
library(igraph)
library(ggraph)
library(stringr)
library(stopwords)
library(textrecipes)
library(tidyclust)
```
## Exercise 01
### (a)
```{r loading and cleaning votes data, message=FALSE, warning=FALSE}
## Loading in votes data
votes_time_series <- read_csv("data/votes_time_series.csv")
## Loading in states region data
states_regions <- read_csv("data/states_regions.csv")
## Replacing NAs with 0s
votes_103 <- votes_time_series %>%
mutate_all(funs(ifelse(is.na(.), 0, .))) %>%
filter(session==103) %>%
left_join(states_regions, by = c("state" = "State Code")) ## Joining to region for later use
```
### (b)
```{r votes recipe, message=FALSE, warning=FALSE}
## Creating the Recipe for PCA
votes_pca_rec <-
recipe(~., data = votes_103) %>%
step_pca(starts_with("v")) %>%
prep()
```
### (c)
```{r votes variance, message=FALSE, warning=FALSE}
## Checking the percent variance for pc1
tidy(votes_pca_rec, number=1, type = "variance") %>%
filter(terms == "percent variance", component == 1)
## Checking the cumulative percent variance for first five components
tidy(votes_pca_rec, number=1, type = "variance") %>%
filter(terms == "cumulative percent variance", component <= 5)
```
The first principal component explains $\approx 38%$ of the variance. The first five principal components cumulatively explain $\approx 68%$ of the variance. Most of the variance is being explained in the first two principal components.
### (d)
```{r votes plots, message=FALSE, warning=FALSE}
## Party Plot
p1 <- votes_pca_rec %>%
bake(new_data = votes_103) %>%
ggplot(aes(x=PC1, y=PC2, color=party)) +
geom_point() +
scale_color_manual(values = c("D" = "blue", "R"= "red")) +
theme_minimal() +
labs(x="PC1", y= "PC2", title = "PC1 and PC2 of the 103rd Senate",
subtitle = "Color by Party", color="Party",
caption="Data Source: Brad Robinson")
## Region Plot
p2 <- votes_pca_rec %>%
bake(new_data = NULL) %>%
ggplot(aes(x=PC1, y=PC2, color=Region)) +
geom_point() +
theme_minimal() +
labs(x="PC1", y= "PC2", title = "PC1 and PC2 of the 103rd Senate",
subtitle = "Color by Region", color="US Region",
caption="Data Source: Brad Robinson")
## Displaying Plots
p1 + p2
```
## Exercise 02
### (a)
```{r creating votes numeric, message=FALSE, warning=FALSE}
## Filtering votes_103 to only include votes
votes_numeric <- votes_103 %>%
select(starts_with("v"))
```
### (b)
```{r exercise2b, warning=FALSE, message=FALSE}
## Setting Seed
set.seed(20220412)
## Creating recipe
kmeans_rec <- recipe(formula = ~ .,data = votes_numeric) %>%
step_select(starts_with("v")) %>%
prep()
votes_baked <- kmeans_rec %>%
bake(new_data = votes_numeric)
## Silhouette Calculation
fviz_nbclust(votes_baked, FUN = kmeans, method = "silhouette")
## Gap Statistic Calculation
fviz_nbclust(votes_baked, FUN = kmeans, method = "gap_stat")
## WSS Calculation
fviz_nbclust(votes_baked, FUN = kmeans, method = "wss")
```
The average silhouette width indicates an optimal number of clusters is 2. The gap statistic indicates that the optimal number of clusters is 4. The within sum squares seems to indicate that the optimal number is 4 clusters.
### (c)
```{r creating k means function, warning=FALSE,message=FALSE}
## Creating the Function
#' k_clust
#'
#' @param k a numeric variable indicating the number of clusters to be used for k-means clustering
#' @param dataset a data frame to be input into the k-means cluster analysis model
#'
#' @return a ggplot object that plots the first and second principal components with color coded clusters
#'
#'
#' @examples k_clust(k=2, data=votes_103)
k_clust <- function(k, dataset) {
## Creating recipe
kmeans_rec <- recipe(formula = ~ .,data = dataset) %>%
step_select(starts_with("v"))
rec_pca <- recipe(formula = ~ ., data = dataset) %>%
step_select(starts_with("v")) %>%
step_pca(all_numeric(), id = "pca")
## Creating model
k_means_spec <- k_means(num_clusters = k) %>%
set_engine("stats", nstart = 100)
## Creating workflow
k_means_wflow <- workflow(
preprocessor = rec_pca,
spec = k_means_spec)
## Fitting Model
votes_k_means_fit <- k_means_wflow %>%
fit(data = dataset)
## Creating Dataframe to visualize
votes_data <- bind_cols(
select(dataset), votes_k_means_fit %>%
extract_recipe() %>%
bake(dataset),
cluster = votes_k_means_fit %>%
extract_cluster_assignment() %>%
pull(.cluster)) %>%
mutate(cluster = str_replace(cluster, "Cluster_",""))
## Plot
ggplot() +
geom_point(
data = votes_data,
mapping = aes(PC1, PC2, color = factor(cluster)),
alpha = 0.5) +
labs(title = paste0("K-Means with K=", k, " and PCA"),
x = "PC1",
y = "PC2", color="Cluster",
caption="Data Source: Brad Robinson") +
theme_minimal() +
guides(text = NULL)
}
```
### (d)
```{r testing optimal clusters}
## Running functions
## Silhouette
k_clust(k=2, data=votes_103)
## WSS
k_clust(k=4, data=votes_103)
## Gap Statistic
k_clust(k=4, data=votes_103)
```
## Exercise 03
### (a)
```{r loading and cleaning exec orders, warning=FALSE,message=FALSE}
## Loading executive orders data
executive_orders <- read_csv("data/executive-orders.csv")
## Filtering out text NAs
exec_orders <- executive_orders %>%
filter(!is.na(text))
## Creating bigrams
tidy_exec <- exec_orders %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)%>%
filter(!is.na(bigram))
```
### (b)
```{r exercise 3b, warning=FALSE,message=FALSE}
## Separating, filtering, and counting bigrams
bigram_150 <- tidy_exec %>%
separate(col = "bigram", into = c("word1", "word2")) %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
count(word1, word2, sort = TRUE) %>%
filter(n>150)
# plot the bigrams that exist more than 150 times
bigram_graph <- bigram_150 %>%
graph_from_data_frame()
# plot the relationships (you may want to make the plot window bigger)
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```
### (c)
```{r calculating tf_idf, warning=FALSE,message=FALSE}
## Calculating tf_idf for each bigram-president pair
tf_idf <- tidy_exec %>%
separate(col = "bigram", into = c("word1", "word2")) %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
mutate(bigrams = paste(word1, word2)) %>%
count(president, bigrams, sort = TRUE) %>%
bind_tf_idf(term = bigrams, document = president, n = n)
## Plotting bigram-president pair
tf_idf %>%
group_by(president) %>%
top_n(15, tf_idf) %>%
mutate(bigrams = reorder(bigrams, tf_idf)) %>%
ggplot(aes(tf_idf, bigrams, fill = president)) +
geom_col() +
facet_wrap(~president, scales = "free") +
theme_minimal() +
guides(fill = "none")
```
## Exercise 04
### (a)
```{r reading and cleaning senate votes, warning=FALSE, message=FALSE}
## Importing Senate bills data
bills <- read_csv("data/senate_bills_114.csv") %>%
mutate(passed = factor(passed, labels = c("1", "0"), levels = c("1", "0")))
## Counting How Many Bills Passed
bills %>%
count(passed)
```
108 bills passed and 3440 bills failed.
### (b)
```{r splitting bills data, warning=FALSE, message=FALSE}
## Seeting the Seed
set.seed(20220414)
## Splitting into training and testing
bills <- bills %>%
select(-bill_number)
## Splitting the Sample
split <- initial_split(bills, prop = 0.75, strata = "passed")
## Training and Testing
bills_train <- training(split)
bills_test <- testing(split)
```
### (c)
```{r creating bills recipe, warning=FALSE, message=FALSE}
## Creating recipe
bills_rec <- recipe(passed~., data = bills_train) %>%
step_tokenize(description) %>%
step_stopwords(description) %>%
step_stem(description) %>%
step_tokenfilter(description) %>%
step_tfidf(description)
```
### (d)
```{r prepping and baking bills recipe, warning=FALSE, message=FALSE}
## Prepping and Baking Recipe
bake(prep(bills_rec, training = bills_train), new_data = bills_train)
custom_stop_words <- c("1", "2", "3", "2015", "2016", "a", "an", "and",
"account", "act", "code", "doc")
## Re-creating recipe
bills_rec2 <- recipe(passed~., data = bills_train) %>%
step_tokenize(description) %>%
step_stopwords(description, custom_stopword_source = custom_stop_words) %>%
step_stem(description) %>%
step_tokenfilter(description) %>%
step_tfidf(description)
## Prepping and Baking Recipe
bake(prep(bills_rec2, training = bills_train), new_data = bills_train)
```
### (e)
```{r logistic regression model and workflow, warning=FALSE, message=FALSE}
## Logistic Regression Model
lr_mod <-
logistic_reg() %>%
set_engine("glm") %>%
set_mode(mode="classification")
## Logistic Regression Workflow
lr_wf <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(bills_rec2)
## Fitting the Logistic Regression Model
lr_fit <-
lr_wf%>%
fit(data = bills_train)
```
### (f)
```{r fitting model to testing data, warning=FALSE, message=FALSE}
## Selecting the Predictions
predictions <- bind_cols(
bills_test,
predict(object = lr_fit, new_data = bills_test),
predict(object = lr_fit, new_data = bills_test, type = "prob")
)
## The Predictions
select(predictions, passed, starts_with(".pred"))
## Confusion matrix
conf_mat(data = predictions, truth = passed, estimate = .pred_class)
## Accuracy
accuracy(data = predictions, truth = passed, estimate = .pred_class)
## Precision
precision(data = predictions, truth = passed, estimate = .pred_class)
## Recall/Sensitivity
recall(data = predictions, truth = passed, estimate = .pred_class)
## ROC Curve
roc_curve(data = predictions, truth = passed, estimate = .pred_1)
## ROC Curve Plot
predictions %>%
roc_curve(truth = passed, .pred_1) %>%
autoplot()
## AUC
roc_auc(data = predictions, truth = passed, estimate = .pred_1)
```
The accuracy, or how often the model is correct, is quite high, about .97. A high accuracy, all else equal, is a sign of a high quality model. The precision and recall are not high, though. Recall, or how often the model is correct when there is an event is only about .05 and precision, or how often the model is correct when it predicts events is about .11. Both of these values are quite low. The ROC and AUC, however, are relatively good. The ROC Curve, as shown in the plot above, is consistent with a good classifier. Further, the AUC is about .85, which is also good, though not excellent. Overall, the model has high accuracy, but its usefulness is limited by its low recall and precision.
The first improvement that I can suggest is to include the actual text of each bill beyond their short descriptions. Having more text and, more specifically, a model that predicts bill passage based on the language in the actual bill provisions should increase the quality of the model. In addition, having more data, even without bill text, would be helpful. Having many different years of bills may improve model fit, though with changing partisan dynamics, a longer time frame may not be helpful.
Second, it would be an improvement to include other non-text variables in the model such as the party of the sponsor, the party in control of the chamber, the monetary cost of the bill, and other factors that may influence the probability of a bill passing.
Third, utilizing v-fold cross-validation could produce a "better" model that could then be implemented on the testing data. Through the process of cross-validation, perhaps a better model could be found. Further, utilizing alternate classification algorithms like a CART model or a random forest model could result in a higher quality predictive model. This, combined with cross-validation, should be tried to find the best candidate model.