# Working with Factors {#sec-factors}

In this lesson we will discuss ways to organize and deal with categorical data, also known as factor data types.

:::{.callout-note title = "🎓 Learning Objectives" icon = false}
:::: {.columns}

::: {.column width="70%"}
After completing this lesson students will be able to:

-   Convert a numeric variable to a factor variable.
-   Apply and change the labels of a factor.
-   Understand and control the ordering of a factor's levels.
-   Combine multiple levels of a factor variable into one level.
-   Use the `forcats` package.

:::

::: {.column width="30%"}
![](img/forcats.png)
:::

::::


:::


:::{.callout-tip title = "👉 Prepare" icon=false}

1.  Open your Math 130 R Project.
2.  Right click and "save as" this lesson's [[Quarto notes file]](notes/factors_notes.qmd) and save it into your `Math130/notes` folder.
3.  In the *Files* pane, open this Quarto file and render it.

:::

[The `email` data set contains information on emails received by one of the OpenIntro authors for the first three months in 2012. See `?email` for more details.]{.aside}
[The `fastfood` data set from the `openintro` package describes nutrition amounts in 515 fast food items. See `?fastfood` for more details.]{.aside}
[The `mtcars` data set contains data from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. See `?mtcars` for more details.]{.aside}

```{r}
library(forcats)
email <- openintro::email
ff    <- openintro::fastfood
nc    <- openintro::ncbirths # note the name change
mtcars <- mtcars # comes with R
```

The goal of the `forcats` package is to provide a suite of useful tools that solve common problems with factors. Often in R there are multiple ways to accomplish the same task. Some examples in this lesson will show how to perform a certain task using base R functions, as well as functions from the `forcats` package.

## What is a factor?

The term factor refers to a data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable corresponds to a limited number of categories, while a continuous variable can correspond to an infinite number of values.

An example of a categorical variable is the `number` variable in the `email` data set. This variable contains data on whether there was no number, a small number (under 1 million), or a big number in the content of the email.
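
To make the idea concrete, here is a minimal sketch using a made-up vector (not the `email` data) that mimics the three categories of `number`. It shows that a factor stores each distinct category once as a level, and records an integer code for each observation. The object name `sizes` is only for illustration.

```{r}
# Toy data (hypothetical values, not taken from the email data set)
sizes <- factor(c("small", "big", "none", "small", "big"))

levels(sizes)      # the distinct categories, each stored once
nlevels(sizes)     # how many categories there are
as.integer(sizes)  # the integer code stored for each observation
```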

## Confirming the factor data type

First we should confirm that R sees `number` as a factor. We can use the `class` function we saw earlier, or `str`.

```{r}
class(email$number)
```

We can use the `levels()` function to get to know factor variables.

```{r}
levels(email$number)
```

There are three levels: `none`, `small`, and `big`.

:::{.callout-tip title = "Character data types"}
Let's look at the variable `restaurant` from the fast food (`ff`) data set.

```{r}
levels(ff$restaurant)
```
:::

Wait - NULL? But this is a categorical variable.

```{r}
class(ff$restaurant)
```

There is a subtle difference between `factor` and `character` data types in R. Both are categorical measures (not numbers), but factor variables have an assigned order and character variables do not. If we want to specifically control the ordering of the levels, the data type must be `factor`.
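
The difference can be seen side by side in a small sketch. The vector below is made up for illustration (it is not part of the `ff` data), and it assumes we convert with `factor()` using the default settings.

```{r}
# Toy vector (hypothetical values)
spice_chr <- c("mild", "hot", "medium", "mild")
spice_fct <- factor(spice_chr)

levels(spice_chr)   # NULL: a character vector has no levels
levels(spice_fct)   # a factor has a defined set of levels, in a set order

summary(spice_chr)  # only reports length, class and mode
summary(spice_fct)  # reports a count for each level
```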

## Convert a character variable to factor

We can use the `as.factor()` function.
```{r}
ff$restaurant <- as.factor(ff$restaurant)
levels(ff$restaurant)
```

The `forcats` package has a similar function, `as_factor()`, that could also be used here.
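
The two functions are not identical, though. As a small sketch on a made-up character vector (the name `meal` and its values are only for illustration), `as.factor()` sorts the levels alphabetically, while `as_factor()` keeps them in the order they first appear in the data.

```{r}
library(forcats)

# Toy character vector (hypothetical values)
meal <- c("lunch", "dinner", "breakfast", "lunch")

base_levels    <- levels(as.factor(meal))  # base R: sorted alphabetically
forcats_levels <- levels(as_factor(meal))  # forcats: order of first appearance

base_levels
forcats_levels
```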

## Convert a number variable to factor

Sometimes data are entered into the computer using numeric codes such as 0 and 1. These codes stand for categories, such as "no" and "yes". Sometimes we want to analyze these binary variables in two ways:

-   For many statistical analyses, the data must be numeric 0/1.
-   For many graphics, the data must be a factor, "no/yes".

:::{.callout-tip title = "Example: What type of transmission does that car have?" icon=false}
The `am` variable from the `mtcars` data set records whether the car has an automatic or a manual transmission. However, the values were recorded as 0 for automatic and 1 for manual.
:::

```{r}
table(mtcars$am) # view which values are present
class(mtcars$am) # confirm what R thinks the data type is
```

R thinks this variable is numeric, but we know that's not the case. We can use the function `factor()` to convert the numeric variable `am` to a factor, applying `labels` to convert 0 to "automatic" and 1 to "manual".

```{r}
mtcars$transmission_type <- factor(mtcars$am,
                                   labels=c("automatic", "manual"))
```

The ordering of the `labels` argument *must* be in the same order (left to right) as the factor levels themselves. Look back at the order of columns in the `table` - it goes 0 then 1. Thus our labels need to go "automatic" then "manual".
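
One way to make the pairing between codes and labels explicit is to supply the `levels` argument alongside `labels`. This is a sketch on a made-up 0/1 vector standing in for a variable like `am`; the names `am_codes` and `trans` are only for illustration.

```{r}
# Toy 0/1 codes (hypothetical values, not the mtcars data)
am_codes <- c(1, 0, 0, 1, 1)

# levels lists the codes as stored; labels gives the matching names, in the same order
trans <- factor(am_codes,
                levels = c(0, 1),
                labels = c("automatic", "manual"))

table(am_codes, trans, useNA = "always")
```

Writing out `levels` is optional here, but it documents which code goes with which label instead of relying on the sorted order of the values.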

:::{.callout-important title = "Trust but verify"}
We can confirm that the new variable was created correctly by creating a two-way contingency table of the old and new variables: `table(old variable, new variable)`.
:::

```{r}
table(mtcars$am, mtcars$transmission_type, useNA="always")
```

Here we see that all the 0's were recoded to automatic, and all the 1's recoded to manual, and there are no new missing values. Success!

## Factor (re)naming

What if the variable is already a factor, but its levels have names we don't like and we want them to say something else? We can accomplish this both in base R and with the `forcats` package.

::: {.panel-tabset}
## Base R

Re-factor the variable and apply new `labels`.

```{r}
email$my_new_number <- factor(email$number,
                              labels=c( "None", "<1M","1M+"))
table(email$number, email$my_new_number, useNA="always")
```

## `forcats`

Use the `fct_recode("NEW" = "old")` function here.

```{r}
email$my_forcats_number <- fct_recode(email$number,
                                      "1M+" = "big",
                                      "None" = "none",
                                      "<1M" = "small")

table(email$number, email$my_forcats_number, useNA="always")
```
:::


The `big` level is now labeled `1M+`, `none` is labeled `None`, and `small` is labeled `<1M`.

## Factor ordering

Let's look back at the variable `restaurant` from the fast food (`ff`) data set.

```{r}
levels(ff$restaurant)
```

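When a character variable is converted to a factor, R puts the levels in alphabetical order by default, so beware! Alphabetical order is often not the order you want, and you may need to correct the ordering in other data sets.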
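
A quick sketch shows why the default order can be a problem. The ratings below are made up for illustration; the point is that sorting the level names alphabetically does not match their natural low-to-high order.

```{r}
# Toy survey ratings (hypothetical values)
rating <- c("low", "high", "medium", "low", "high")

default_order <- levels(factor(rating))  # sorted alphabetically
default_order

# Specifying levels puts the categories in their natural order
rating_fct <- factor(rating, levels = c("low", "medium", "high"))
levels(rating_fct)
```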

We need to take control of these factors! We can do that by re-factoring the existing factor variable, but this time specifying the `levels` of the factor (since it already has labels). Say we decide to order the restaurants by putting all the places that sell burgers together.

> Original Order: Arbys, Burger King, Chick Fil-A, Dairy Queen, Mcdonalds, Sonic, Subway, Taco Bell
>
> Desired Order: Arbys, Burger King, Dairy Queen, Mcdonalds, Sonic, Chick Fil-A, Subway, Taco Bell

[The examples in this section do not use the assignment operator (`<-`), so the changes are displayed but not saved to the variable in the `ff` data set. Later examples demonstrate making an adjustment to a factor variable and saving that adjustment as a new variable in the data set.]{.aside}

::: {.panel-tabset}
## Base R

Use the `factor` function again, and write out **each** factor level in the desired order. Make sure you are spelling each level correctly.

```{r}
# results not saved
factor(ff$restaurant,
       levels=c("Arbys", "Burger King", "Dairy Queen", "Mcdonalds", "Sonic",
                "Chick Fil-A", "Subway", "Taco Bell")) |>
  table()
```

## `forcats`
Using the `fct_relevel` function, you only need to specify the levels that you want to move.
```{r}
# results not saved
ff$restaurant |>
  fct_relevel("Arbys", "Burger King", "Dairy Queen", "Mcdonalds", "Sonic") |>
  fct_count() # new function - acts like table
```


The `fct_relevel` function also has handy shortcuts, for example moving only one or two levels while keeping the rest of the ordering as it is. See `?fct_relevel` for other options.

```{r}
ff$restaurant |>
  fct_relevel("Subway", after = Inf) |> # move to end
  fct_relevel("Taco Bell") |> # move to front
  levels()
```
:::


## Decreasing number of levels

For analysis purposes, sometimes you want to work with a smaller number of factor levels. Let's look at the restaurants that are included in the `fastfood` data set.

```{r}
table(ff$restaurant)
```

### Combining multiple categories into one

Let's combine all the sandwich places together, and all the burger joints together. I am going to save this new variable as `restaurant_new`.

The syntax for the `fct_collapse` function is `new_level = "old level"`, where each old level is in quotes. To combine several old levels into one, wrap them in `c()`. As always, it is good practice to create a two-way table to make sure the code does what we expected it to do.

```{r}
ff$restaurant_new <- fct_collapse(ff$restaurant,
                                    BurgerJoint = c("Burger King", "Mcdonalds", "Sonic"),
                                    Sandwich = c("Arbys", "Subway"))

table(ff$restaurant, ff$restaurant_new, useNA="always")
```

### Keeping the most frequent categories

Sometimes we only want to keep the most frequent categories and lump the uncommon factor levels together into "Other". The `fct_lump_n` function keeps the `n` most frequent levels and lumps the rest.

```{r}
fct_lump_n(ff$restaurant, n=5) |>
  fct_count()
```
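
To see the lumping more clearly, here is a small sketch on a made-up factor where the counts are easy to check by eye. The name `letters_fct` and its values are only for illustration; the sketch relies on the default name of the lumped level being `Other`.

```{r}
library(forcats)

# Toy factor (hypothetical values): "a" appears 3 times, "b" twice, "c" and "d" once each
letters_fct <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep the 2 most frequent levels; the remaining levels are combined
lumped <- fct_lump_n(letters_fct, n = 2)
fct_count(lumped)
```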


### Removing categories entirely

Sometimes, you don't even want to consider certain levels. This often occurs in survey data where the respondent provides an answer of "Refuse to answer" or the data is coded as the word "missing". The word "missing" is fundamentally different from the `NA` code for a missing value.

For demonstration purposes, let's get rid of the data from DQ. Who eats something other than ice cream at that place anyhow?

```{r}
ff$restaurant[ff$restaurant == "Dairy Queen"] <- NA
table(ff$restaurant)
```

Even though there are no records with the level `Dairy Queen`, the level itself is still there. R does not assume that a level should be removed just because no records have it. We use the function `fct_drop` to drop the levels with no records.

```{r}
ff$restaurant <- fct_drop(ff$restaurant)
table(ff$restaurant)
```
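
The same two steps can be applied to the survey situation described at the start of this section. This is a minimal sketch on made-up answers (the vector `answer` is only for illustration), and it uses base R's `droplevels()` as an alternative to `fct_drop()`.

```{r}
# Toy survey answers (hypothetical values) where the word "missing" was typed in
answer <- factor(c("yes", "no", "missing", "yes", "missing"))

answer[answer == "missing"] <- NA  # recode the word to a true missing value
levels(answer)                     # the now-empty "missing" level is still listed

answer <- droplevels(answer)       # base R: drop levels with no records
levels(answer)
sum(is.na(answer))                 # number of true missing values
```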