uc-cfss.github.io/block006_tidy_data.Rmd at master · Alex-A14/uc-cfss.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
---
title: "Tidy data"
output:
  html_document:
    toc: true
    toc_float: true
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(cache = TRUE)
```

# Objectives

* Define tidy data and its characteristics
* Practice tidying data

```{r packages, cache = FALSE}
library(tidyverse)
```

# Tidy data

**Tidy data** is a specific way of organizing data into a consistent format which plugs into the `tidyverse` set of packages for R. It is not the only way to store data and there are reasons why you might not store data in this format, but eventually you will probably need to convert your data to a tidy format in order to efficiently analyze it.

There are three rules which make a dataset tidy:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

![Figure 12.1 from [*R for Data Science*](r4ds.had.co.nz)](http://r4ds.had.co.nz/images/tidy-1.png)

# Most common tasks for tidying data

Let's review the different tasks for tidying data using the R for Data Science `gapminder` subset. This is the data in a tidy format:

```{r}
table1
```

Note that in this data frame, each variable is in its own column (`country`, `year`, `cases`, and `population`), each observation is in its own row (i.e. each row is a different country-year pairing), and each value has its own cell.

## Gathering

**Gathering** entails bringing a variable spread across multiple columns into a single column. For example, this version of `table1` is not tidy because the `year` variable is spread across multiple columns:

```{r}
table4a
```

We can use the `gather()` function from the `tidyr` package to reshape the data frame and make this tidy. To do this we need three pieces of information:

1. The names of the columns that represent the values, not variables. Here, those are `1999` and `2000`.
1. The `key`, or the name of the variable whose values form the column names. Here that is `year`.
1. The `value`, or the name of the variable whose values are spread over the cells. Here that is `cases`.

> Notice that we create the names for `key` and `value` - they do not already exist in the data frame.

We implement this using the `gather()` function:

```{r}
table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
```

> In Stata and other statistics software, this operation would be called reshaping data wide to long.

## Spreading

**Spreading** brings an observation spread across multiple rows into a single row. It is the reverse of gathering. For instance, take `table2`:

```{r}
table2
```

It violates the tidy data principle because each observation (unit of analysis is a country-year pairing) is split across multiple rows. To tidy the data frame, we need to know:

1. The `key` column, or the column that contains variable names. Here, it is `type`.
1. The `value` column, or the column that contains values for multiple variables. Here it is `count`.

> Notice that unlike for gathering, when spreading the `key` and `value` columns are already defined in the data frame. We do not create the names ourselves, only identify them in the existing data frame.

```{r}
table2 %>%
  spread(key = type, value = count)
```

> In Stata and other statistics software, this operation would be called reshaping data long to wide.

## Separating

**Separating** splits multiple variables stored in a single column into multiple columns. For example in `table3`, the `rate` column contains both `cases` and `population`:

```{r}
table3
```

**This is a no-no**. Tidy data principles require each column to contain a single variable. We can use the `separate()` function to split the column into two new columns:

```{r}
table3 %>%
  separate(rate, into = c("cases", "population"))
```

## Uniting

**Uniting** is the inverse of separating - when a variable is stored in multiple columns, uniting brings the variable back into a single column. `table5` splits the year variable into two columns:

```{r}
table5
```

To bring them back together, use the `unite()` function:

```{r}
table5 %>%
  unite(new, century, year)

# remove underscore
table5 %>%
  unite(new, century, year, sep = "")
```

# Practice tidying data

> Before you begin the exercises, update your `rcfss` library by running `devtools::install_github("uc-cfss/rcfss")` in the console.

For each exercise, tidy the data frame. Before you write any code examine the structure of the data frame and mentally (or with pen-and-paper) sketch out what you think the tidy data structure should be.

## Race data

```{r}
library(rcfss)
race
```

Important info:

* `Name` - pretty obvious
* `50`:`350` - column names define different lengths of time
* Cell values are scores associated with each name and length of time

<details>
  <summary>Click for a hint</summary>
  <p>

**Tidy data structure**

```{r, echo = FALSE}
race %>%
  gather(Time, Score, -Name, convert = TRUE) %>%
  arrange(Name, Time)
```

  </p>
</details>

<details>
  <summary>Click for the solution</summary>
  <p>

This is essentially a gathering operation. Except for the `Name` column, the remaining columns are actually one variable spread across multiple columns. The column names are a distinct variable, and the columns' values are another variable. We want to gather these columns. The `key` will tell us the original column name, and the `value` will give us the values in the cells. Because the column names are actually numeric values, we set `convert = TRUE` to coerce the new `Time` column into a numeric column (or vector). (The last line isn't necessary, but sorts the data frame in a consistent manner.)

```{r}
race %>%
  gather(Time, Score, -Name, convert = TRUE) %>%
  arrange(Name, Time)
```

  </p>
</details>

## Clinical trials

```{r}
results
```

<details>
  <summary>Click for a hint</summary>
  <p>

**Tidy data structure**

```{r, echo = FALSE}
results %>%
  spread(Treatment, value)
```

  </p>
</details>

<details>
  <summary>Click for the solution</summary>
  <p>

This dataset is not tidy because observations are spread across multiple rows. There only needs to be one row for each individual. Then `Treat` and `Cont` can be stored in separate columns.

```{r}
results %>%
  spread(Treatment, value)
```

  </p>
</details>

## Grades

```{r}
grades
```

This one is a bit tougher. Important info:

* **The unit of analysis is ID-Year-Quarter.** That is, in the tidy formulation each observation represents one individual during one quarter in a given year.
* **Each test is unique.** As in they should be treated as two separate variables.

<details>
  <summary>Click for a hint</summary>
  <p>

**Tidy data structure**

```{r, echo = FALSE}
grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  mutate(Test = paste0("Test", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Quarter)
```

  </p>
</details>

<details>
  <summary>Click for the solution</summary>
  <p>

In this example, the basic unit of observation is the test. Each individual takes two separate tests (labeled `1` and `2`) at multiple points in time: during each quarter (`Fall`, `Winter`, `Spring`) as well as in multiple years (`2008` and `2009`). So our final data frame should contain five columns: `ID` (identifying the student), `Year` (year the test was taken), `Quarter` (quarter in which the test was taken), `Test1` (score on the first test), and `Test2` (score on the second test).

Let's start with the gathering operation: we want to gather `Fall`, `Winter`, and `Spring` into a single column (we can use the inclusive select function `:` to gather these three columns):

```{r}
grades %>%
  gather(key = Quarter, value = Score, Fall:Winter)
```

Good, but now we spread observations across multiple rows. Remember that we want each test to be a separate variable. To do that, we can spread those values across two columns.

```{r}
grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  spread(Test, Score)
```

This is an acceptable solution, but look what happened to the column names for each test: they're labeled as numbers (`1` and `2`). We'd rather give columns text labels, since those are easier to reference in our code. We can use a function called `str_c()` from the `stringr` library to add text to the `Test` column before we spread it. `str_c()` combines two or more values together to create a single value:[^unite]

```{r}
library(stringr)

grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  mutate(Test = str_c("Test", Test))

grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  mutate(Test = str_c("Test", Test)) %>%
  spread(Test, Score)
```

If we're cleaning up the data frame, let's also arrange it in a logical order:

```{r}
grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  mutate(Test = str_c("Test", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Quarter)
```

[^unite]: You might think to use the `unite()` function here, but it won't work unless you first create a column which only contains the value `Test`, like so:

    ```{r}
    grades %>%
      gather(Quarter, Score, Fall:Winter) %>%
      mutate(test_label = "Test") %>%
      unite(col = Test, test_label, Test)
    ```
    This works, but requires two lines of code instead of one.

  </p>
</details>

# Session Info {.toc-ignore}

```{r child='_sessioninfo.Rmd'}
```