# Correlation
::: {.callout-note}
### Functions introduced in this chapter
`cor`, `fct_recode`
:::
## Introduction
In this chapter, we will learn about the concept of correlation, which is a way of measuring a linear relationship between two numerical variables.
### Install new packages
Type the following at the Console:
```
install.packages("faraway")
install.packages("openintro")
```
### Download the Quarto file
Look at either the top (Posit Cloud) or the upper right corner of the RStudio screen to make sure you are in your `intro_stats` project.
Then click on the following link to download this chapter as a Quarto file (`.qmd`).
<a href = "https://vectorposse.github.io/intro_stats/chapter_downloads/06-correlation.qmd" download>https://vectorposse.github.io/intro_stats/chapter_downloads/06-correlation.qmd</a>
Once the file is downloaded, move it to your project folder in RStudio and open it there.
### Restart R and run all chunks
In RStudio, select "Restart R and Run All Chunks" from the "Run" menu.
### Load packages
We load the now-standard `tidyverse` package. We also include the `faraway` package to access data about Chicago in the 1970s, the `openintro` package to access the `bdims` data set on body dimensions, and the `palmerpenguins` package to access the familiar `penguins` data.
```{r}
library(tidyverse)
library(faraway)
library(openintro)
library(palmerpenguins)
```
## Redlining in Chicago
The data set we will use throughout this chapter comes from a 1970s study of the practice of "redlining" in Chicago.
##### Exercise 1 {-}
Do an internet search for "redlining".
Consult at least two or three sources. Then, in your own words (not copied and pasted from any of the websites you consulted), explain what "redlining" means.
::: {.answer}
Please write up your answer here.
:::
*****
The `chredlin` data set appears in the `faraway` package accompanying a book by Julian Faraway (*Practical Regression and Anova using R*, 2002). Faraway explains:
> "In a study of insurance availability in Chicago, the U.S. Commission on Civil Rights attempted to examine charges by several community organizations that insurance companies were redlining their neighborhoods, i.e. canceling policies or refusing to insure or renew. First the Illinois Department of Insurance provided the number of cancellations, non-renewals, new policies, and renewals of homeowners and residential fire insurance policies by ZIP code for the months of December 1977 through February 1978. The companies that provided this information account for more than 70% of the homeowners insurance policies written in the City of Chicago. The department also supplied the number of FAIR plan policies written and renewed in Chicago by zip code for the months of December 1977 through May 1978. Since most FAIR plan policyholders secure such coverage only after they have been rejected by the voluntary market, rather than as a result of a preference for that type of insurance, the distribution of FAIR plan policies is another measure of insurance availability in the voluntary market."
In other words, the degree to which residents obtained FAIR policies can be seen as an indirect measure of redlining. This participation in an "involuntary" market is thought to be largely driven by rejection of coverage under more traditional insurance plans.
### Exploratory data analysis
Before we learn about correlation, let's get to know our data a little better.
Type `?chredlin` at the Console to read the help file. While it's not very informative about how the data was collected, it does have crucial information about the way the data is structured.
Here is the data set:
```{r}
chredlin
```
##### Exercise 2 {-}
What do each of the rows of this data set represent? You'll need to refer to the help file. (They are *not* individual people.)
::: {.answer}
Please write up your answer here.
:::
##### Exercise 3 {-}
The `race` variable is numeric. Why? What do these numbers represent? (Again, refer to the help file.)
::: {.answer}
Please write up your answer here.
:::
*****
The `glimpse` command gives a concise overview of all the variables present.
```{r}
glimpse(chredlin)
```
##### Exercise 4(a) {-}
Which variable listed above represents participation in the FAIR plan? How is it measured? (Again, refer to the help file.)
::: {.answer}
Please write up your answer here.
:::
##### Exercise 4(b) {-}
Why is it important to analyze the number of plans *per 100 housing units* as opposed to the total number of plans across each ZIP code? (Hint: what happens if some ZIP codes are larger than others?)
::: {.answer}
Please write up your answer here.
:::
*****
We are interested in the association between `race` and `involact`. If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the rate of FAIR plan policies obtained in that ZIP code.
##### Exercise 5(a) {-}
Since `race` is a numerical variable, what type of graph or chart is appropriate for visualizing it? (You may need to refer back to the "Numerical data" chapter.)
::: {.answer}
Please write up your answer here.
:::
##### Exercise 5(b) {-}
Using `ggplot` code, create the type of graph you identified above. After creating the initial plot, be sure to go back and set the `binwidth` and `boundary` to sensible values. (Refer back to the "Numerical data" chapter for sample code if you've forgotten how to make such a graph. If you were unsure about part (a), the instructions about `binwidth` and `boundary` should be a pretty big hint.)
::: {.answer}
```{r}
# Add code here to create a plot of race
```
:::
##### Exercise 5(c) {-}
Describe the shape of the `race` variable using the three key shape descriptors (modes, symmetry, and outliers).
::: {.answer}
Please write up your answer here.
:::
##### Exercise 5(d) {-}
Create the same kind of graph as above, but for `involact`. (Again, go back and set the `binwidth` and `boundary` to sensible values.)
::: {.answer}
```{r}
# Add code here to create a plot of involact
```
:::
##### Exercise 5(e) {-}
Describe the shape of the `involact` variable using the three key shape descriptors (modes, symmetry, and outliers).
::: {.answer}
Please write up your answer here.
:::
##### Exercise 5(f) {-}
Since both `race` and `involact` are numerical variables, what type of graph or chart is appropriate for visualizing the relationship between them?
::: {.answer}
Please write up your answer here.
:::
##### Exercise 5(g) {-}
For our research question, is `race` functioning as a predictor variable or as the response variable? What about `involact`? Why? Explain why it makes more sense to think of one of them as the predictor and the other as the response.
::: {.answer}
Please write up your answer here.
:::
##### Exercise 5(h) {-}
Using `ggplot` code, create the type of graph you identified above. Be sure to put `involact` on the y-axis and `race` on the x-axis. (Again, that's a hint in case you were confused in part (g).)
::: {.answer}
```{r}
# Add code here to create a plot of involact against race
```
:::
## Correlation
The word *correlation* describes a linear relationship between two numerical variables. As long as certain conditions are met, we can calculate a statistic called the *correlation coefficient*, often denoted with a lowercase r.
There are several different ways to compute a statistic that measures correlation. The most common way, and the way we will learn in this chapter, is often attributed to an English mathematician named Karl Pearson. According to his [Wikipedia page](https://en.wikipedia.org/wiki/Karl_Pearson),
> "Pearson was also a proponent of social Darwinism, eugenics and scientific racism."
##### Exercise 6 {-}
Do an internet search for each of the following terms:
- Social Darwinism
- Eugenics
- Scientific racism
Consult at least two or three sources for each term. Then, in your own words (not copied and pasted from any of the websites you consulted), explain each of these terms.
::: {.answer}
Please write up your answer here.
:::
*****
While Pearson is often credited with its discovery, the so-called "Pearson correlation coefficient" was first developed by a French scientist, Auguste Bravais. Due to the misattribution of discovery, along with the desire to disassociate the useful tool of correlation from its problematic applications to racism and eugenics, we will just refer to it as the *correlation coefficient* (without a name attached).
The correlation coefficient, r, has some important properties.
* The correlation coefficient is a number between -1 and 1.
* A value close to 0 indicates little or no correlation.
* A value close to 1 indicates strong positive correlation.
* A value close to -1 indicates strong negative correlation.
In between 0 and 1 (or -1), we often use words like weak, moderately weak, moderate, and moderately strong. There are no exact cutoffs for when such words apply. You must learn from experience how to judge scatterplots and r values to make such determinations.
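One standard way to write the correlation coefficient is in terms of the standardized values (z-scores) of the two variables. For variables $x$ and $y$ with $n$ observations, means $\bar{x}$ and $\bar{y}$, and standard deviations $s_x$ and $s_y$:

$$
r = \frac{1}{n - 1} \sum_{i = 1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
$$

When above-average values of $x$ tend to pair with above-average values of $y$, the products in the sum are mostly positive, so $r$ is positive; when above-average values of $x$ tend to pair with below-average values of $y$, the products are mostly negative, so $r$ is negative.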
A correlation is positive when low values of one variable are associated with low values of the other variable, and high values of one variable are associated with high values of the other. For example, exercise is positively correlated with burning calories. Low exercise levels burn few calories; high exercise levels burn more calories, on average.
A correlation is negative when low values of one variable are associated with high values of the other variable, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer cavities, on average.
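As a quick illustration of these two cases (a sketch using simulated data, not one of this chapter's data sets), we can generate a variable with a positive relationship and one with a negative relationship and compute r for each with `cor`:

```{r}
# Simulated example: 100 points with a positive and a negative relationship
set.seed(42)
x <- rnorm(100)
y_pos <- x + rnorm(100, sd = 0.5)   # high x tends to go with high y
y_neg <- -x + rnorm(100, sd = 0.5)  # high x tends to go with low y
cor(x, y_pos)  # a strong positive correlation, close to 1
cor(x, y_neg)  # a strong negative correlation, close to -1
```

Try changing `sd` in the noise terms: more noise pushes both correlations closer to 0.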
## Conditions for correlation
Two variables are considered "associated" any time there is any type of relationship between them (i.e., they are not independent). However, in statistics, we reserve the word "correlation" for situations meeting more stringent conditions:
1. The two variables must be numerical.^[There are other ways of measuring association for variables that are not numerical, but these aren't covered in this course.]
2. There is a somewhat linear relationship between the variables, as shown in a scatterplot.
3. There are no serious outliers.
For condition (2) above, keep in mind that real data in scatterplots very rarely lines up in a perfect straight line. Instead, you will see a "cloud" of dots. All we want to know is whether that cloud of dots mostly moves from one corner of the scatterplot to the other. Violations of this condition will usually be for one of two reasons:
* The dots are scattered completely randomly with no discernible pattern.
* The dots have a pattern or shape to them, but that shape is curved and not linear.
##### Exercise 7 {-}
Check the three conditions for the relationship between `involact` and `race`. For conditions (2) and (3), you'll need to check the scatterplot you created above. (You did create a scatterplot for one of the exercises above, right?)
::: {.answer}
Please write up your answer here.
1.
2.
3.
:::
## Calculating correlation
Since the conditions are met, we calculate the correlation coefficient using the `cor` command.
```{r}
cor(chredlin$race, chredlin$involact)
```
The order of the variables doesn't matter; correlation is symmetric, so the r value is the same regardless of which variable is treated as the predictor and which as the response.
Since the correlation between `involact` and `race` is a positive number and slightly closer to 1 than 0, we might call this a "moderate" positive correlation. You can tell from the scatterplot above that the relationship is not a strong relationship. The words you choose should match the graphs you create and the statistics you calculate.
##### Exercise 8(a) {-}
Create a scatterplot of `income` against `race`. (Put `income` on the y-axis and `race` on the x-axis.)
::: {.answer}
```{r}
# Add code here to create a scatterplot of income against race
```
:::
##### Exercise 8(b) {-}
Check the three conditions for the relationship between `income` and `race`. Which condition(s) are seriously violated here?
::: {.answer}
1.
2.
3.
:::
##### Exercise 9(a) {-}
Create a scatterplot of `theft` against `fire`. (Put `theft` on the y-axis and `fire` on the x-axis.)
::: {.answer}
```{r}
# Add code here to create a scatterplot of theft against fire
```
:::
##### Exercise 9(b) {-}
Check the three conditions for the relationship between `theft` and `fire`. Which condition(s) are seriously violated here?
::: {.answer}
1.
2.
3.
:::
##### Exercise 9(c) {-}
Even though the conditions are not met, what if you calculated the correlation coefficient anyway? Try it.
::: {.answer}
```{r}
# Add code here to calculate the correlation coefficient between theft and fire
```
:::
##### Exercise 9(d) {-}
Suppose you hadn't looked at the scatterplot and you only saw the correlation coefficient you calculated in the previous part. What would your conclusion be about the relationship between `theft` and `fire`? Why would that conclusion be misleading?
::: {.answer}
Please write up your answer here.
:::
The lesson learned here is that you should never try to interpret a correlation coefficient without looking at a plot of the data to ensure that the conditions are met and that the result is a sensible thing to interpret.
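A classic illustration of this lesson comes from Anscombe's quartet, a set of four small data sets built into R (in the `anscombe` data frame, so this isn't part of the chapter's data). All four pairs have nearly identical correlation coefficients, yet their scatterplots look completely different:

```{r}
# Anscombe's quartet: four pairs with (nearly) the same r,
# but only the first has a sensible linear "cloud" of dots
cor(anscombe$x1, anscombe$y1)  # roughly linear
cor(anscombe$x2, anscombe$y2)  # actually a curved relationship
cor(anscombe$x3, anscombe$y3)  # a perfect line distorted by one outlier
cor(anscombe$x4, anscombe$y4)  # one outlier creates the entire "correlation"
```

Plotting each pair (for example, `plot(anscombe$x2, anscombe$y2)`) makes the differences obvious, even though every r value is about 0.82.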
## Correlation is not causation
When two variables are correlated---indeed, associated in any way, not just in a linear relationship---that means that there is a relationship between them. However, that does not mean that one variable *causes* the other variable.
For example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, through racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance.
In the Chicago example, there is still likely a causal connection between one variable (`race`) and the other (`involact`), but it is indirect. In other cases, there is no causal connection at all. Here are a few of my favorite examples.
##### Exercise 10 {-}
Ice cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Do drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?)
See if you can figure out the real reason why ice cream sales are positively correlated with drowning deaths.
::: {.answer}
Please write up your answer here.
:::
*****
In the Chicago example, the causal effect was indirect. In the example from the exercise above, there is no causation whatsoever between the two variables. Instead, the association was generated by a third factor that caused ice cream sales to go up and also happened to cause drowning deaths to go up. (Or, equivalently, it caused ice cream sales to be low during certain times of the year and also caused drowning deaths to be low.) Such a factor is called a *lurking variable*. When a correlation between two variables exists solely due to the intervention of a lurking variable, that correlation is called a *spurious correlation*. The correlation is real; a scatterplot of ice cream sales and drowning deaths would show a positive relationship. But the reasons for that correlation have nothing to do with any kind of direct causal link between the two.
Here's another one:
##### Exercise 11 {-}
Most studies involving children turn up a number of weird correlations. For example, the height of children is very strongly correlated with pretty much everything you can measure about scholastic aptitude. For instance, vocabulary count (the number of words children can use fluently in a sentence) is strongly correlated with height. Are tall people just smarter than short people?
The answer is, of course, no. The correlation is spurious. So what's the lurking variable?
::: {.answer}
Please write up your answer here.
:::
## Observational studies versus experiments
So when is a statistical finding (like correlation, for example) evidence of a causal relationship? Before we can answer that question, we need a few more definitions.
A lot of data comes from "observational studies" where we simply observe or measure things as they are "in the wild," so to speak. We don't interfere in any way. We just write down what we see. Polls are usually observational in that we ask people questions and record their responses. We do not try to manipulate their responses in any way. We just ask the questions and observe the answers. Field studies are often observational. We go out in nature and write stuff down as we observe it.
Another way to gather data is an *experiment*. In an experiment, we introduce a manipulation or treatment to try to ascertain its effect. For example, if we're testing a new drug, we will likely give the drug to one group of patients and a *placebo* to the other.
##### Exercise 12 {-}
Here's another internet rabbit hole for you. First, look up the definition of placebo. You do not need to write up your own version of that definition here; just familiarize yourself with the term if you're not already familiar with it. Next, find some websites about the *placebo effect* and read those.
Given what you have learned about the placebo effect, why is it important to have a placebo group in a drug trial? Why not just give one set of patients the drug and compare them to another group that takes no pill at all?
::: {.answer}
Please write up your answer here.
:::
*****
The goal of the experiment is to learn whether the *treatment* (in this example, the drug) is effective when compared to the *control* (in this example, the placebo).
Note that the word "effective" implies a causal claim. We want to know if the drug *causes* patients to get better.
Unlike an observational study, in which the relationship between variables can be caused by a lurking variable, in an experiment, we purposefully manipulate one of the variables and try to control all others. For example, we manipulate the drug variable (we purposefully give some people the drug and others the placebo). But we control the amount of the drug given and the schedule on which patients are required to take the pills.
There are lots of things we cannot control. For example, it would be very difficult to control the diet of every person in the experiment. Could diet play a role in whether a patient gets better? Sure, so how do we know diet is not a lurking variable? In the context of an experiment, lurking variables are often called *confounders* or *confounding variables*.
One way to mitigate the effect of confounders that we cannot directly control is to *randomize* the patients into the treatment and control groups. With random selection, there will likely be people who have relatively healthy diets in both the control and treatment groups. If the drugs work, in theory they should still work better for the treatment group than for those taking the placebo. And likewise, patients with less healthy diets will generally be mixed up in both groups, and the drug should also work better for them.
The mantra of experimental design is, "Control as much as you can. Randomize to take care of the rest."
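Random assignment itself is simple to carry out in R. Here is a sketch using hypothetical patient IDs (not data from this chapter) that splits 20 patients evenly between the two groups with `sample`:

```{r}
# Randomly assign 20 hypothetical patients to treatment or control
set.seed(123)
patients <- paste0("patient_", 1:20)
group <- sample(rep(c("treatment", "control"), each = 10))  # shuffle the labels
table(group)  # 10 patients in each group, assigned at random
data.frame(patients, group) |> head()
```

Because the labels are shuffled at random, any confounder we can't control (like diet) should end up mixed into both groups.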
There are lots of aspects of experimental design that we will not go into here (for example, blinding and blocking). But we will continue to mention the differences between observational studies and experiments in future chapters as we exercise caution in making causal claims.
## Prediction versus explanation
Even when claims are not causal, we can use associations (and correlations more specifically) for purposes of *prediction*.
##### Exercise 13 {-}
If I tell you that ice cream sales are high right now, can you make a reasonable prediction about the relative number of drowning deaths this month (high or low)? Why or why not?
::: {.answer}
Please write up your answer here.
:::
*****
So even when there is no direct causal link between two variables, if they are positively correlated, then large values of one variable are associated with large values of the other variable. So if I tell you one value is large, it is reasonable to predict that the other value will be large as well.
We use the language "predictor" variable and "response" variable to reinforce this idea.
In a properly designed and controlled experiment, we can use different language. In this case, we can *explain* the outcome using the treatment variable. If we've controlled for everything else, the only possible explanation for a difference between the treatment and control groups must be the treatment variable. If the patients get better on the drug (more so than those on the placebo) and we've controlled for every other possible confounding variable, the only possible explanation is that the drug works. The drug "explains" the difference in the response variable.
Be careful, as sometimes statisticians use the term "explanatory variable" to mean any kind of variable that predicts or explains. In this course, we will try to use the term "predictor variable" exclusively.
## Visualizing lurking variables
When we create a scatterplot, we can visualize associations between the two numerical variables. Is there a way to see lurking variables in the scatterplot as well?
One simple case is when the lurking variable is a categorical variable. We saw several examples of that in Chapters 3 and 4 in the `penguins` data. The association (or lack thereof) between variables was often misleading when we failed to take into account the fact that there were three different species of penguin.
Here are a few more interesting examples. The `bdims` data (hosted in the `openintro` package) consists of many body measurements taken from 507 physically active individuals. Type `?bdims` at the Console to read the help file.
Here is the data:
```{r}
bdims
```
```{r}
glimpse(bdims)
```
Most physical body measurements are known to be correlated; this makes sense because when one part of the body is larger, we expect lots of other body parts to be larger as well (and similarly for smaller individuals).
For example, it's no surprise that shoulder girth (`sho_gi`) and chest girth (`che_gi`) are strongly correlated:
```{r}
ggplot(bdims, aes(y = sho_gi, x = che_gi)) +
geom_point()
```
Is there a possible lurking variable here, though? You may wonder about `sex`. (In this data set, the `sex` variable is presumed to be biological sex assigned at birth.)
Before we go any further, go back to the help file and the `glimpse` output above and note that `sex` is coded as an integer (a whole number). We'll use the `mutate` and `as_factor` commands---illustrated in Chapters 3 and 5---to make a new factor variable.
```{r}
bdims <- bdims |>
mutate(sex_fct = as_factor(sex))
glimpse(bdims)
```
If you look at the `glimpse` output above, you see that we do have a new variable called `sex_fct` and it is properly coded as a factor variable. However, the labels 0 and 1 (for females and males, respectively) are not very helpful. Can we change them? Yes, the `forcats` package has a `fct_recode` function that does just that. Here is what it looks like:
```{r}
bdims <- bdims |>
mutate(sex_fct = fct_recode(sex_fct, "female" = "0", "male" = "1"))
glimpse(bdims)
```
This will be a lot more helpful!
Now, back to the scatterplots.
One way we learned (in Chapters 3 and 4) to incorporate a third variable into the analysis is through the use of color as an additional aesthetic element. We'll use our new `sex_fct` variable. Also, don't forget to use the Viridis color palette and the black-and-white theme.
```{r}
ggplot(bdims, aes(y = sho_gi, x = che_gi, color = sex_fct)) +
geom_point() +
scale_color_viridis_d() +
theme_bw()
```
In this example, there is a strong correlation between shoulder girth and chest girth, but females and males lie in completely different parts of the graph. Having said that, if you focus on the females separately, you can still see a strong positive correlation, and if you focus on males separately, there is also a strong positive correlation there. So the inclusion of sex didn't really change much about the nature of the correlation in this example. Even so, the correlation coefficients do change a little depending on whether we look at the whole data set versus females/males separately:
```{r}
cor(bdims$sho_gi, bdims$che_gi)
```
```{r}
bdims |>
group_by(sex_fct) |>
summarise(corr = cor(sho_gi, che_gi))
```
##### Exercise 14 {-}
Why would the correlation coefficient be stronger for the whole data set and slightly less strong for the sexes separately? (Hint: think about sample size.)
::: {.answer}
Please write up your answer here.
:::
*****
In the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. What about the examples in these next two sets of exercises?
##### Exercise 15(a) {-}
Create a scatterplot of thigh girth against weight (put `thi_gi` on the y-axis and `wgt` on the x-axis).
::: {.answer}
```{r}
# Add code here to create a scatterplot of thigh girth against weight.
```
:::
##### Exercise 15(b) {-}
Change the scatterplot above to include `sex_fct` as a `color` aesthetic. (Use the Viridis color palette and `theme_bw`.)
::: {.answer}
```{r}
# Add code here to add color for sex_fct.
```
:::
##### Exercise 15(c) {-}
Calculate the correlation coefficients for thigh girth and weight, once for the whole data set, and again for the data split by `sex_fct` (as above).
::: {.answer}
```{r}
# Add code here to calculate the correlation coefficient
# between thigh girth and weight.
```
```{r}
# Add code here to calculate the correlation coefficient
# between thigh girth and weight split by sex.
```
:::
##### Exercise 15(d) {-}
Explain how sex is a lurking variable here. In other words, how did ignoring/considering sex alter the way we perceived the correlation between thigh girth and weight? What changed about the nature of the correlation within each sex category?
::: {.answer}
Please write up your answer here.
:::
##### Exercise 16(a) {-}
The help file for the `bia_di` variable describes it as the "respondent's biacromial diameter in centimeters." What is "biacromial diameter"?
::: {.answer}
Please write up your answer here.
:::
##### Exercise 16(b) {-}
Create a scatterplot of biacromial diameter against weight (put `bia_di` on the y-axis and `wgt` on the x-axis).
::: {.answer}
```{r}
# Add code here to create a scatterplot of biacromial diameter against weight.
```
:::
##### Exercise 16(c) {-}
Change the scatterplot above to include `sex_fct` as a `color` aesthetic. (Use the Viridis color palette and `theme_bw`.)
::: {.answer}
```{r}
# Add code here to add color for sex_fct.
```
:::
##### Exercise 16(d) {-}
Calculate the correlation coefficients for biacromial diameter and weight, once for the whole data set, and again for the data split by `sex_fct` (as above).
::: {.answer}
```{r}
# Add code here to calculate the correlation coefficient
# between biacromial diameter and weight
```
```{r}
# Add code here to calculate the correlation coefficient
# between biacromial diameter and weight split by sex
```
:::
##### Exercise 16(e) {-}
Explain how sex is a lurking variable here. In other words, how did ignoring/considering sex alter the way we perceived the correlation between biacromial diameter and weight? What changed about the nature of the correlation within each sex category?
::: {.answer}
Please write up your answer here.
:::
*****
The take-home message here is that lurking variables can change the strength of the correlation between two variables, making it appear stronger or weaker. In more extreme cases, it's even possible to change the direction of the correlation altogether! There isn't an example of this phenomenon in the `bdims` data, but we do find one in the `penguins` data.
Here is a scatterplot of bill depth against bill length.
```{r}
ggplot(penguins, aes(y = bill_depth_mm, x = bill_length_mm)) +
geom_point()
```
There is not much correlation between bill depth and bill length, but if anything, it looks like there might be a slightly negative association. (In the following code chunk, the `cor` command uses a different method for dealing with missing data.)
```{r}
cor(penguins$bill_depth_mm, penguins$bill_length_mm,
use = "complete.obs")
```
Now split by species:
```{r}
ggplot(penguins, aes(y = bill_depth_mm, x = bill_length_mm,
color = species)) +
geom_point() +
scale_color_viridis_d() +
theme_bw()
```
```{r}
penguins |>
group_by(species) |>
summarise(corr = cor(bill_depth_mm, bill_length_mm,
use = "complete.obs"))
```
There was a very weak negative correlation in the full data set, but, behold, bill depth and bill length are positively correlated within each species!
The phenomenon of an association between two variables "reversing" direction when considering a third variable is often called "Simpson's Paradox".^[Just like for every other named concept in statistics, Simpson's Paradox wasn't first observed by Simpson.] We'll revisit Simpson's Paradox in a future chapter.
## Conclusion
If we have two numerical variables that have a linear association between them (also assuming there are no serious outliers), we can compute the correlation coefficient that measures the strength and direction of that linear association.
Keep in mind that in an observational study, the correlation coefficient is a measure of association but it does not signify that one variable causes the other. It's possible that one variable causes the other, but it's also possible that a third "lurking" variable is responsible for all or part of the association. Either way, the fact that a relationship exists means it is possible to use values of one variable to make reasonable predictions about the values of the other variable.
In a properly designed experiment, the manipulation of one variable while controlling for others (and randomizing to take care of other confounders) ensures that there is a causal link between the treatment variable and the response of interest. In this case, the treatment can "explain" the response, not just predict it.
Finally, when there is a lurking variable (confounding), we can visualize it by adding it to our scatterplots as a color aesthetic. There, we often find that associations change; they can get stronger, weaker, or even change direction altogether in the presence of that third variable.
### Preparing and submitting your assignment
1. From the "Run" menu, select "Restart R and Run All Chunks".
2. Deal with any code errors that crop up. Repeat steps 1--2 until there are no more code errors.
3. Spell check your document by clicking the icon with "ABC" and a check mark.
4. Hit the "Render" button one last time to generate the final draft of the HTML file. (If there are errors here, you may need to go back and fix broken inline code or other markdown issues.)
5. Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1--5 again.
If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.