uc-cfss.github.io/block010_vectors.Rmd at master · Alex-A14/uc-cfss.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
---
title: "Data storage types"
output:
  html_document:
    toc: true
    toc_float: true
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(cache = TRUE)
```

# Objectives

* Review the major types of vectors
* Demonstrate how to subset vectors
* Demonstrate vector recycling
* Define lists

```{r packages, cache = FALSE}
library(tidyverse)
set.seed(1234)
```

# Vectors in R

So far the only type of data object in R you have encountered is a `data.frame` (or the `tidyverse` variant `tibble`). At its core though, the primary method of data storage in R is the **vector**. [We discussed them briefly a couple weeks ago](block005_parsing_vectors.html) merely to note that a data frame is composed from vectors and that there are a few different types of vectors: logical, numeric, and character. But now we want to understand more precisely how these data objects are structured and related to one another.

## Types of vectors

![Figure 20.1 from [*R for Data Science*](http://r4ds.had.co.nz/vectors.html)](http://r4ds.had.co.nz/diagrams/data-structures-overview.png)

There are two categories of vectors:

1. **Atomic vectors** - these are the types previously covered, including logical, integer, double, and character.
1. **Lists** - there are new and we will cover them later in this block. Lists are distinct from atomic vectors because lists can contain other lists.

Atomic vectors are **homogenous** - that is, all elements of the vector must be the same type. Lists can be **hetergenous** and contain multiple types of elements. `NULL` is the counterpart to `NA`. While `NA` represents the absence of a value, `NULL` represents the absence of a vector.

## Atomic vectors

### Logical vectors

**Logical vectors** take on one of three possible values:

* `TRUE`
* `FALSE`
* `NA` (missing value)

```{r}
parse_logical(c(TRUE, TRUE, FALSE, TRUE, NA))
```

> Whenever you filter a data frame, R is (in the background) creating a vector of `TRUE` and `FALSE` - whenever the condition is `TRUE`, keep the row, otherwise exclude it.

### Numeric vectors

**Numeric vectors** contain numbers (duh!). They can be stored as **integers** (whole numbers) or **doubles** (numbers with decimal points). In practice, you rarely need to concern yourself with this difference, but just know that they are different but related things.

```{r}
parse_integer(c(1, 5, 3, 4, 12423))
parse_double(c(4.2, 4, 6, 53.2))
```

> Doubles can store both whole numbers and numbers with decimal points.

### Character vectors

**Character vectors** contain **strings**, which are typically text but could also be dates or any other combination of characters.

```{r}
parse_character(c("Goodnight Moon", "Runaway Bunny", "Big Red Barn"))
```

## Using atomic vectors

Be sure to read ["Using atomic vectors"](http://r4ds.had.co.nz/vectors.html#using-atomic-vectors) for more detail on how to use and interact with atomic vectors. I have no desire to rehash everything Hadley already wrote, but here are a couple things about atomic vectors I want to reemphasize.

### Scalars

**Scalars** are a single number; **vectors** are a set of multiple values. In R, scalars are merely a vector of length 1. So when you try to perform arithmetic or other types of functions on a vector, it will **recycle** the scalar value.

```{r}
(x <- sample(10))
x + c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100)
x + 100
```

This is why you don't need to write a `for` loop when performing these basic operations - R automatically converts it for you.

Sometimes this isn't so great, because R will also recycle vectors if the lengths are not equal:

```{r}
# create a sequence of numbers between 1 and 10
1:2
1:10

# add together two sequences of numbers
1:10 + 1:2
```

Did you really mean to recycle `1:2` five times, or was this actually an error? `tidyverse` functions will only allow you to implicitly recycle scalars, otherwise it will throw an error and you'll have to manually recycle shorter vectors.

### Subsetting

To filter a vector, we cannot use `filter()` because that only works for filtering rows in a `tibble`. `[` is the subsetting function for vectors. It is used like `x[a]`.

#### Subset with a numeric vector containing integers

```{r}
x <- c("one", "two", "three", "four", "five")
```

Subset with positive integers keeps the corresponding elements:

```{r}
x[c(3, 2, 5)]
```

Negative values drop the corresponding elements:

```{r}
x[c(-1, -3, -5)]
```

You cannot mix positive and negative values:

```{r, error = TRUE}
x[c(-1, 1)]
```

#### Subset with a logical vector

Subsetting with a logical vector keeps all values corresponding to a `TRUE` value.

```{r}
x <- c(10, 3, NA, 5, 8, 1, NA)

# All non-missing values of x
x[!is.na(x)]

# All even (or missing!) values of x
x[x %% 2 == 0]
```

### Exercise: subset the vector

```{r}
(x <- 1:10)
```

Create the sequence above in your R session. Write commands to subset the vector in the following ways:

1. Keep the first through fourth elements, plus the seventh element.

    <details>
      <summary>Click for the solution</summary>
      <p>

    ```{r}
    x[c(1, 2, 3, 4, 7)]

    # use a sequence shortcut
    x[c(1:4, 7)]
    ```

      </p>
    </details>

1. Keep the first through eighth elements, plus the tenth element.

    <details>
      <summary>Click for the solution</summary>
      <p>

    ```{r}
    # long way
    x[c(1, 2, 3, 4, 5, 6, 7, 8, 10)]

    # sequence shortcut
    x[c(1:8, 10)]

    # negative indexing
    x[c(-9)]
    ```

      </p>
    </details>

1. Keep all elements with values greater than five.

    <details>
      <summary>Click for the solution</summary>
      <p>

    ```{r}
    x[x > 5]
    ```

      </p>
    </details>

1. Keep all elements evenly divisible by three.

    <details>
      <summary>Click for the solution</summary>
      <p>

    ```{r}
    x[x %% 3 == 0]
    ```

      </p>
    </details>

# Lists

**Lists** are an entirely different type of vector.

```{r}
x <- list(1, 2, 3)
x
```

Use `str()` to view the **structure** of the list.

```{r}
str(x)

x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
```

Unlike the other atomic vectors, lists are recursive. This means they can:

1. Store a mix of objects.
    ```{r}
    y <- list("a", 1L, 1.5, TRUE)
    str(y)
    ```

1. Contain other lists.
    ```{r}
    z <- list(list(1, 2), list(3, 4))
    str(z)
    ```

    It isn't immediately apparent why you would want to do this, but in later units we will discover the value of lists as many packages for R store non-tidy data as lists.

You've already worked with lists without even knowing it. **Data frames and `tibble`s are a type of a list.** Notice that you can store a data frame with a mix of column types.

```{r}
str(diamonds)
```

## How to subset lists

Sometimes lists (especially deeply-nested lists) can be confusing to view and manipulate. Take the example from [R for Data Science](http://r4ds.had.co.nz/vectors.html#subsetting-1):

```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a)
```

* `[` extracts a sub-list. The result will always be a list.

    ```{r}
    str(a[1:2])
    str(a[4])
    ```
* `[[`` extracts a single component from a list and removes a level of hierarchy.

    ```{r}
    str(a[[1]])
    str(a[[4]])
    ```

* `$` can be used to extract named elements of a list.

    ```{r}
    a$a
    a[["a"]]
    ```

![Figure 20.2 from [R for Data Science](http://r4ds.had.co.nz/vectors.html#fig:lists-subsetting)](http://r4ds.had.co.nz/diagrams/lists-subsetting.png)

> Still confused about list subsetting? [Review the pepper shaker.](http://r4ds.had.co.nz/vectors.html#lists-of-condiments)

## Exercise: subset a list

```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a)
```

Create the list above in your R session. Write commands to subset the list in the following ways:

1. Subset `a`. The result should be an atomic vector.

    <details>
      <summary>Click for the solution</summary>
      <p>

    ```{r}
    # use the index value
    a[[1]]

    # use the element name
    a$a
    a[["a"]]
    ```

      </p>
    </details>

1. Subset `pi`. The results should still be a new list.

    <details>
      <summary>Click for the solution</summary>
      <p>

    ```{r}
    # correct method
    str(a["c"])

    # incorrect method to produce another list
    # the result is a scalar
    str(a$c)
    ```

      </p>
    </details>

1. Subset the first and third elements from `a`.

    <details>
      <summary>Click for the solution</summary>
      <p>

    ```{r}
    a[c(1, 3)]
    a[c("a", "c")]
    ```

      </p>
    </details>

# Session Info {.toc-ignore}

```{r child='_sessioninfo.Rmd'}
```