Math130book/import.qmd at main · csucdsi/Math130book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
# Getting data into R {#sec-import}

In this lesson we will introduce methods to get data from external files into R.


:::{.callout-note title = "🎓 Learning Objectives" icon=false}
After completing this lesson learners will be able to

* Distinguish between an `absolute` and `relative` path, and identify which one is used from within an R Project.
* Import data that is stored in an external Excel, CSV or tab-delimited text file into R.
* Compare and contrast the three file types discussed in this lesson.
* Use functions to examine data objects to ensure data was imported correctly.

:::

:::{.callout-tip title = "👉 Prepare" icon=false}

1.  Open your Math 130 R Project.
2.  Right click and "save as" this lessons [[Quarto notes file]](notes/import_notes.qmd) and save into your `Math130/notes` folder.
3. Right click each link and save the following data sets into your `Math130/data` folder.
4.  In the *Files* pane, open this Quarto file and Render this file.

* [County Complete](https://www.norcalbiostat.com/data/countyComplete.csv)
* [email](https://www.norcalbiostat.com/data/email.txt)
* [Police Shootings](https://www.norcalbiostat.com/data/fatal-police-shootings-data.xlsx)

:::

:::{.callout-important title = "DO NOT RENAME ANYTHING"}
If you download multiple copies of the data, do not keep the versions that have a (1) or some number in the name. Do not leave these files in your downloads folder.
:::

:::{.callout-important title = "CAUTION FOR MAC USERS"}
Be sure you download the CSV file as .csv and not as a .numbers file. If you create a file in Numbers, you can choose File > Export To > CSV to convert it.
:::


## File types

In this lesson we are only going to explore reading files that exist on your computer into R from three most commonly used data sources: A tab-delimited text file, A CSV file and an Excel file.

The three different files we will be using have different file types, or extensions. [.doc  (Microsoft word), .pdf (PDF) and .png (images) are examples of file extensions for other types of files]{.aside}

* `email.txt` is a `.txt` or "text" file.
* `NCBirths.csv` is a `.csv` or "comma separated values" file.
* `fatal-police-shootings-data.xlsx` is a `.xlsx` or Excel file.

Each of these file types differ in the type of _delimiter_ used. The _delimiter_ is a character or symbol that separates columns of data from each other when stored in an external file.  Recall back to the earlier lesson on data frames and matrices. Each column in the matrix represented data on a specific variable. Something had to tell R how to distinguish which values went with which variable.

There are two main types of delimiters we will consider in this class; comma and tab. That does not mean that data can't be stored in other ways, these are just the two most common.

::: {#fig-delim layout-ncol=2}

![Tab separated data. Variable names are `id`, `gender`, `race`, and `ses`.](img/tab_sep.png){#fig-tab_sep}

![Comma separated data. Variable names are `Month`, `RsnStop`, `OffRace`, etc ](img/comma_sep.png){#fig-csv}

Example of two common types of delimiters.

:::

Each delimiter type requires a different function or mechanism to import the data into R. If you use the wrong mechanism, the data may not be read in correctly if at all.


## Paths (where the data lives)

In addition to using the correct function for the delimiter type, we have to then tell the program explicitly where to find the files you want to access. To do that we need to find the file's _path_. The _path_ is a programmatic way to direct your computer to the location of a file.

You can think of it as a file's address, or where it lives on your computer. But to get to that address depends on where you start your journey. For example to get to your instructors office in Holt Hall, it will be different depending on if you start from your house, vs start from Meriam Library. If someone else starts from their house, their path would look different than if you both started at the Library.

* **Absolute paths** point to where a file is on a specific computer.
    * Example: `/Users/rdonatello/Library/CloudStorage/Box-Box/Teaching/MATH 130/data` This path to the `data` folder only exists on Dr. Donatello's macbook.
* **Relative paths** indicate where inside the project folder a file is located.
    * Example: `MATH 130/data`. This path to the `data` folder once you're already in the `MATH 130` folder. Anyone that has a `MATH 130` folder with a `data` subfolder has the same path.

When we use R Projects, we are setting the starting point (called the _working directory_) to your Math 130 folder. We will also reinforce that by using the `here` package. We want to specify that all our paths start `"here"`, at the top of our project working directory.
[learn more about the [here](https://here.r-lib.org/articles/here.html) package]{.aside}

![Cartoon showing two paths side-by-side. On the left is a scary spooky forest, with spiderwebs and gnarled trees, with file paths written on the branches like “~/mmm/nope.csv” and “setwd(“/haha/good/luck/”), with a scared looking cute fuzzy monster running out of it. On the right is a bright, colorful path with flowers, rainbow and sunshine, with signs saying “here!” and “it’s all right here!” A monster facing away from us in a backpack and walking stick is looking toward the right path. Stylized text reads “here: find your path.”](img/here.png){width=80%}

[Learn more about [here](https://here.r-lib.org/).]{.aside}

:::{.callout-tip title = "👉 Whats your path?" icon=false}

In the console, type `here::here()` to have R show you what your current working directory is at. This is the *absolute path* to your Math 130 folder on your computer.

```{r}
# Dr. D's path to the code to build this course website
here::here()
```
:::


Working with a combination of R projects and the `here` package this way makes it a lot easier to move your project around on your computer and share it with others without having to directly modify file paths in the individual scripts.

:::{.callout-important title = "Cannot open the Connection"}
The import code in this lesson assumes that you are working in your R project, and that you have your data downloaded and in the `data` folder. If you get an error message saying `Cannot open the connection` - double check the following common places for mistakes:

1. is the data file in the right place?
2. did you spell the name of the data set correctly?
3. are you in your R project and using the `here::here` function correctly?
:::

## Checking the import was successful

The first thing you should always do after importing a data file is look at the raw data and ask yourself the following things:

1. Were the variable names read in correctly?
2. Is there the expected number of rows and columns?
3. Are the data types for each variable as expected?

Next we will read in three different data sets, each with unique file types, and look at each one to make sure it was read in correctly.

## Text files

Text files are very simple files that have a `.txt` file extension. Common delimiters include a space, a comma (,) or a tab. Uncommon delimiters could include a `%` or even a semi-colon (`;`).

:::{.callout-tip title = "👉 Open the file directly" icon=false}
Navigate to your class folder, and open the `emamil.txt` file that should be in your `data` folder by double clicking. If your computer asks, you can open it with the `notepad` program. Identify the following:

1. what's in the first row - data or variable names?
2. what is the delimiter? - in this case it is a tab.
:::

We will use the `read.table()` function that is in base R to read in any type of delimited file.
```{r}
#| code-annotations: below
email <- read.table( # <1>
          here::here("data/email.txt"), # <2>
          header=TRUE, # <3>
          sep="\t" # <4>
          )
```

1. Use the `read.table()` function because it is a text file
2. Use `here::here` to ensure you start at the project directory, then go into the `data` folder, and find the `email.txt` data set.
3. Set `header=TRUE` to signify that the data in the first row contains our column names
4. Specify that the delimiter is a tab (`"\t"`).

:::{.callout-important title = "Trust but verify"}
Let's use the `head` function to look at the imported data, ensuring the variable names are intended, and that each column has data in it.
:::
```{r}
head(email)
```

:::{.callout-important title = "Errors!"}
What happens if you forget to include the arguments for `header` or `sep`? Try that now and discuss what happened.

<details>
  <summary> Solution - Forgetting both arguments </summary>
```{r, error=TRUE}
email.noboth <- read.table(here::here("data/email.txt"))
head(email.noboth)
```
The data set won't successfully be read in, and won't show up in your environment.
</details>

<details>
  <summary> Solution - Forgetting the header </summary>
```{r, error=TRUE}
email.nohead <- read.table(here::here("data/email.txt"), sep="\t")
head(email.nohead)
```
The data is read into the object `email.nohead`, but the variable names were read in as data in the first row. R assigned generic `V1`, and `V2` data names as your clue.
</details>

<details>
  <summary> Solution - Forgetting the separator </summary>
```{r, error=TRUE}
email.nosep <- read.table(here::here("data/email.txt"), header=TRUE)
```
We get the same error message and it will not read the data into your environment.
</details>

:::


## CSV: Comma Separated Values

CSV stands for "comma-separated values". This file type can also open in spreadsheet programs like MS Excel, Numbers or google sheets. The icons even look similar. Here we use the `read.csv()` function because it is optimized to read in this file type. [Not CVS the pharmacy]{.aside}

```{r}
cc <- read.csv(here::here("data/countyComplete.csv"), header=TRUE)
cc[1:5, 1:6] # peek at the first 5 rows and 6 columns of the data
```

The variable names are present, the data are all in columns with reasonable data types.

## Excel files

Files with the `.xlsx` or `.xls` extensions are Microsoft Excel files. These are also comma separated, but have more features than a `.csv` file. We will use the `read_excel()` function from the [`readxl` package](http://readxl.tidyverse.org/), which is contained within the `tidyverse` package that we installed earlier.

```{r}
library(readxl)
police  <- read_excel(here::here("data/fatal-police-shootings-data.xlsx"),
                      sheet=1, col_names=TRUE)
police[1:6,2:6] # peek at the first 5 rows and 6 columns of the data
```

* Notice this was read in as a `tibble`, not a `data.frame`. Ref Section @sec-tibble to recall the differences.
* The variable `date` is a `dttm` or _date-time_ variable. This means `R` recognizes it directly as a date, not some string of numbers. This is good.
* Categorical variables such as `manner_of_death` and `city` are read in as `character` instead of factor. We may or may not want to change that later.
* Numeric variables such as `age` are of type `dbl` (double). This is similar to `integer` or `numeric`, so we are fine.