# Regression Analysis And A Categorical Moderator {#moderationcat}
> Key concepts: regression equation, dummy variables, normally distributed residuals, linearity, homoscedasticity, independent observations, statistical diagram, interaction variable, covariate, common support, simple slope, conditional effect.
Watch this micro lecture on regression analysis with a categorical moderator for an overview of the chapter.
```{r, echo=FALSE, out.width="640px", fig.pos='H', fig.align='center', dev="png", screenshot.opts = list(delay = 5)}
knitr::include_url("https://www.youtube.com/embed/lDkGyTvPzOY", height = "360px")
```
### Summary {.unnumbered}
```{block2, type='rmdimportant'}
My dependent variable is numerical but at least one predictor is also numerical, so I cannot apply analysis of variance. How can I investigate moderation with regression analysis?
```
The linear regression model is a powerful and very popular tool for predicting a numerical dependent variable from one or more independent variables. In this chapter, we use regression analysis to evaluate the effects of an anti-smoking campaign. We predict attitude towards smoking from exposure to the anti-smoking campaign (numerical), time spent with smokers (numerical), and the respondent's smoking status (categorical).
Regression coefficients, that is, the slopes of regression lines, are the effects in a regression model. They show the predicted difference in the dependent variable for a one unit difference in the independent variable (exposure, time spent with smokers) or the predicted mean difference for two categories (smokers versus non-smokers).
But what if the predictive effect is not the same in all contexts? For example, exposure to an anti-smoking campaign may generally generate a more negative attitude towards smoking. The effect, however, is probably different for people who smoke than for people who do not smoke. In this case, the effect of campaign exposure on attitude towards smoking is moderated by context: Whether or not the person exposed to the campaign is a smoker.
Different effect sizes in different contexts mean different regression coefficients, hence different regression lines for different groups of people. We can add an interaction variable as an independent variable to the regression model to accommodate such moderation. An interaction variable is simply the product of the predictor variable and the moderator variable.
As an independent variable in the model, the regression coefficient of an interaction variable (interaction effect for short) has a confidence interval and a *p* value. The confidence interval tells us the plausible values for the size of the interaction effect in the population. The *p* value tests the null hypothesis that there is no interaction effect at all in the population.
To interpret the interaction effect, we must determine the size of the effect of the predictor on the dependent variable for each group of the moderator. For example, the effect of campaign exposure on smoking attitude for smokers and the effect for non-smokers.
An interaction effect in a regression model closely resembles an interaction effect in analysis of variance. The effect of a single predictor that is involved in an interaction effect in a regression model, however, is not a main effect as in analysis of variance. It is a conditional effect, namely the effect for one particular value of the moderator, that is, the effect within one particular context. To understand this, we must pay close attention to the regression equation.
#### Essential Analytics {.unnumbered}
In SPSS, we use the *Linear* option in the *Regression* submenu for regression analysis. For a moderation model, we first use the *Compute Variable* option in the *Transform* menu to calculate an interaction variable: we multiply (using `*`) the predictor variable by the moderator variable. The interaction variable is included in the regression model as an independent variable, just like the predictor, moderator, and any other independent variables (covariates).
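The same model is easy to express in R, the language used for this book's code chunks. Below is a minimal sketch, assuming a hypothetical data frame `smokers` with variables `attitude`, `exposure`, and a 0/1 moderator `smoker`.
```{r, eval=FALSE}
# Minimal sketch (hypothetical data frame and variable names).
# Step 1: compute the interaction variable by hand, as with
# Transform > Compute Variable in SPSS.
smokers$interaction <- smokers$exposure * smokers$smoker
# Step 2: include predictor, moderator, and interaction as
# independent variables in the regression model.
model <- lm(attitude ~ exposure + smoker + interaction, data = smokers)
summary(model)
```
Note that `lm(attitude ~ exposure * smoker, data = smokers)` would create the same interaction automatically; computing it by hand mirrors the SPSS workflow described above.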
A categorical predictor variable such as a participant's residential area (urban, suburban, rural) enters the regression model as dummy variables with values 0 or 1, for example, the dummy variables *suburban* and *rural*, each with values yes (1) and no (0). The category without a dummy variable is the reference group.
The regression coefficient of a dummy variable gives us the difference between the average score on the dependent variable of the group scoring 1 on the dummy variable and the reference group. In the example presented in Figure \@ref(fig:regressiontable), the attitude towards smoking for participants living in a suburban environment (red box) is on average 0.33 more positive than among participants in an urban environment (the reference group).
```{r regressiontable, echo=FALSE, out.width="100%", fig.pos='H', fig.align='center', fig.cap="SPSS table of regression effects for a model in which the effect of exposure is moderated by participant's smoking status (reference group: people who never smoked)."}
knitr::include_graphics("figures/S8_AE1.png")
```
The predictor variable *Exposure* is included in the interaction effect. As a consequence, the regression coefficient for this variable (green box in Figure \@ref(fig:regressiontable)) expresses the effect of exposure on attitude towards smoking for the reference group of the other variable in the interaction, namely, people who never smoked. A one unit increase in exposure predicts a 0.20 more negative (-0.197) attitude towards smoking **for people who never smoked**.
If we want to know the effect of exposure on attitude towards smoking for former smokers, we must add the regression coefficient for the interaction of exposure with former smokers (blue box) to the regression coefficient of exposure (green box): a one unit increase in exposure predicts a 0.47 more negative (-0.465 = -0.197 + -0.268) attitude towards smoking for former smokers.
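This addition of coefficients can also be scripted. A sketch, assuming a fitted model object `model` whose coefficients include the hypothetical names `exposure` and `former_exposure` (the interaction of the former-smoker dummy with exposure):
```{r, eval=FALSE}
# Sketch with hypothetical coefficient names: the conditional (simple)
# slope of exposure for former smokers is the exposure coefficient
# plus the interaction coefficient.
b <- coef(model)
b["exposure"]                         # -0.197: slope for people who never smoked
b["exposure"] + b["former_exposure"]  # -0.197 + -0.268 = -0.465: slope for former smokers
```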
An interaction effect such as -0.27 for *Former smoker \* exposure* tells us the difference between the exposure effect for former smokers and the exposure effect for people who never smoked (reference group). A plot of the regression lines shows the different exposure effects (Figure \@ref(fig:regressionplot)). The red line (effect of exposure for former smokers) has a stronger downward tendency than the blue line (exposure effect for people who never smoked).
```{r regressionplot, echo=FALSE, out.width="75%", fig.pos='H', fig.align='center', fig.cap="Simple regression lines showing the effect of exposure on attitude towards smoking for former smokers and people who never smoked (in an urban environment)."}
knitr::include_graphics("figures/S8_AE2.png")
```
```{r SPSS-PROCESS, eval=FALSE, echo=FALSE}
# TERMINOLOGY: predictor (X, cause), moderator (M, represents context), covariate (C, control), dependent variable (Y).
# SPSS versus PROCESS:
# + SPSS: visual checks on residuals: normal distribution and zpred by zresid (linearity, homoscedasticity)
# + SPSS: scatterplot (X, Y) with regression lines per group (Moderator) with original variable and value labels, showing common support ; add reference line for each group manually specifying the regression equation, setting covariates to their mean values (with categorical moderator and no covariates or covariates that are not correlated with the predictor, Regression Variable Plots can be used -not in SPSS V25? - or a regression line per subgroup with equation as label can be added in the Chart Editor)
# - SPSS: manual entering of regression equation with selected values for covariates (and a numerical moderator; lines can only be labeled with the equation text)
# - SPSS: interaction predictors have to be created by hand (also multiple interaction variables for a categorical predictor; Transform>Create Dummy Variables, taught in RMCS?)
# - SPSS: mean-centering must be done by hand
# - SPSS: statistical inference for non-zero moderator values requires separate regression models where the low category requires ADDING one SD instead of subtracting.
# + PROCESS: must be used anyway for mediation models
# - PROCESS: no visual checks on assumptions
# - PROCESS: no visual impression of common support of predictor for different values of the moderator (requires additional work with numerical moderator also in SPSS)
# - PROCESS: data list for visualization of results must be copied from output to syntax file, variable and value labels must be added, lines must be added (and this requires that the moderator has no decimal places in SPSS?) in chart editor
# - PROCESS: model number must be remembered
# - PROCESS: because the student need not create the interaction variables, mean-center or "re-center" for probing the interaction, PROCESS output is more mysterious (but the estimated slopes for different moderator values are directly linked to the graph)
# - PROCESS: dichotomies are automatically treated as indicator variables but categorical predictors/moderators are treated as numerical ; it is not possible to use more than one moderator variable, so PROCESS cannot handle a categorical moderator. (Fixed in V3?)
# DECISION: Use SPSS for results, interpretation, and assumption checks (interaction variables and, possibly, dummies must be created but no need for mean-centering).
```
## The Regression Equation {#regression-equation}
In the social sciences, we usually expect that a particular outcome has several causes. Investigating the effects of an anti-smoking campaign, for instance, we would not assume that a person's attitude towards smoking depends only on exposure to a particular anti-smoking campaign. It is easy to think of other and perhaps more influential causes such as personal smoking status, contact with people who do or do not smoke, susceptibility to addiction, and so on.
```{r concept-smoke, echo=FALSE, fig.asp=0.4, fig.pos='H', fig.align='center', fig.cap="A conceptual model with some hypothesized causes of attitude towards smoking."}
# Draw conceptual diagram: Attitude towards smoking predicted by Exposure, Smoking status, and Contact with smokers.
library(ggplot2)
# Create coordinates for the variable names.
variables <- data.frame(x = c(0.3, 0.3, 0.3, 0.7),
y = c(.1, .3, .5, .3),
label = c("Exposure", "Smoking Status", "Contact with Smokers", "Attitude"))
ggplot(variables, aes(x, y)) +
geom_segment(aes(x = x[1], y = y[1], xend = x[4] - 0.04, yend = y[4] - 0.02), arrow = arrow(length = unit(0.04, "npc"), type = "closed")) +
geom_segment(aes(x = x[2], y = y[2], xend = x[4] - 0.04, yend = y[4]), arrow = arrow(length = unit(0.04, "npc"), type = "closed")) +
geom_segment(aes(x = x[3], y = y[3], xend = x[4] - 0.04, yend = y[4] + 0.02), arrow = arrow(length = unit(0.04, "npc"), type = "closed")) +
geom_label(aes(label=label)) +
coord_cartesian(xlim = c(0.2, 0.8), ylim = c(0.05, 0.55)) +
theme_void()
# Cleanup.
rm(variables)
```
Figure \@ref(fig:concept-smoke) summarizes some hypothesized causes of the attitude towards smoking. Attitude towards smoking is measured as a scale, so it is a numerical variable. In linear regression, the dependent variable ($y$) must be numerical and in principle continuous. There are regression models for other types of dependent variables, for instance, logistic regression for a dichotomous (0/1) dependent variable and Poisson regression for a count dependent variable, but we will not discuss these models.
A regression model translates this conceptual diagram into a statistical model. The statistical regression model is a mathematical function with the dependent variable (also known as the outcome variable, usually referred to with the letter $y$) as the sum of a constant, the effects ($b$) of independent variables or predictors ($x$), which are *predictive effects*, and an error term ($e$), which is also called the *residuals*, see Equation \@ref(eq:regression).
```{=tex}
\begin{equation}
\small
y = constant + b_1*x_1 + b_2*x_2 + b_3*x_3 + e
(\#eq:regression)
\normalsize
\end{equation}
```
If we want to predict the dependent variable ($y$), we ignore the error term ($e$) in the equation. The equation without the error term [Eq. \@ref(eq:regressionpred)] represents the regression line that we visualize and interpret in the following subsections. We use the error term only when we discuss the assumptions for statistical inference on a regression model in Section \@ref(regr-inference).
```{=tex}
\begin{equation}
\small
y = constant + b_1*x_1 + b_2*x_2 + b_3*x_3
(\#eq:regressionpred)
\normalsize
\end{equation}
```
### A numerical predictor
Let us first have a close look at a *simple regression equation*, that is, a regression equation with just one predictor ($x$). Let us try to predict attitude towards smoking from exposure to an anti-smoking campaign.
```{r regression-continuous, fig.pos='H', fig.align='center', fig.cap="Predicting attitude towards smoking from exposure to an anti-smoking campaign. The orange dot represents the predicted attitude for the selected value of exposure.", echo=FALSE, out.width="775px", screenshot.opts = list(delay = 5), dev="png"}
# Goal: Understand the meaning of the constant and the regression coefficient
# with a single numerical predictor.
# Generate a data set with attitude towards smoking as Y and exposure as X with
# a negative (b = -0.6) more or less linear relation.
# Display the scatterplot.
# Add the line of a simple regression of attitude on exposure for the generated data.
# Allow the user to change the value of exposure.
# A change to the exposure value triggers the app to add/reposition the predicted value as a dot (on the regression line), and show the contribution of b*x to the predicted value as a 'triangle' anchored on (0, constant). In addition, the values between parentheses of x and y are updated.
knitr::include_app("http://82.196.4.233:3838/apps/regression-continuous/", height="408px")
```
<A name="question8.1.1"></A>
```{block2, type='rmdquestion'}
1. What is the predicted attitude for a person with zero exposure to the campaign? Explain why this value equals the constant of the regression equation. [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.1)
```
<A name="question8.1.2"></A>
```{block2, type='rmdquestion'}
2. What does the vertical orange line in Figure \@ref(fig:regression-continuous) mean if exposure is set to 1? [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.2)
```
<A name="question8.1.3"></A>
```{block2, type='rmdquestion'}
3. Use the equation to calculate the predicted attitude if exposure is 10. Check your answer using the exposure slider. What is troublesome about this predicted value? [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.3)
```
Good understanding of the regression equation is necessary for understanding moderation in regression models. So let us have a close look at an example equation [Eq. \@ref(eq:regrexample)]. In this example, the dependent variable attitude towards smoking is predicted from a constant and one independent variable, namely exposure to an anti-smoking campaign.
```{=tex}
\begin{equation}
\small
attitude = constant + b*exposure
(\#eq:regrexample)
\normalsize
\end{equation}
```
The constant is the predicted attitude if a person scores zero on all independent variables. To see this, plug in (replace) zero for the predictor in the equation (Eq. \@ref(eq:regsmokedummy)) and remember that zero times something yields zero. This reduces the equation to the constant.
```{=tex}
\begin{equation}
\small
\begin{split}
attitude &= constant + b*0 \\
attitude &= constant + 0 \\
attitude &= constant
\end{split}
(\#eq:regsmokedummy)
\normalsize
\end{equation}
```
For all persons scoring zero on exposure, the predicted attitude equals the value of the regression constant. This interpretation only makes sense if the predictor can be zero. If, for example, exposure had been measured on a scale ranging from one to seven, nobody could have zero exposure, so the constant would have no straightforward meaning.
The unstandardized regression coefficient $b$ represents the predicted difference in the dependent variable for a difference of one unit in the independent variable. For example, plug in the values 1 and 0 for the *exposure* variable in the equation. If we take the difference of the two equations, we are left with $b$. Other terms in the two equations cancel out.
```{=tex}
\begin{equation}
\small
\begin{split}
attitude = constant + b*1 \\
\underline{- \mspace{20mu} attitude = constant + b*0} \\
attitude \mspace{4mu} difference = b*1 - b*0 = b - 0 = b
\end{split}
(\#eq:regweight)
\normalsize
\end{equation}
```
```{block2, type='rmdimportant'}
The unstandardized regression coefficient $b$ represents the predicted difference in the dependent variable for a difference of one unit in the independent variable.
It is the slope of the regression line.
```
Whether this predicted difference is small or large depends on the practical context. Is the predicted decrease in attitude towards smoking worth the effort of the campaign? In the example shown in Figure \@ref(fig:regression-continuous), one additional unit of exposure decreases the predicted attitude by 0.6. This seems to be quite a substantial change on a scale from -5 to 5.
In the data, the smallest exposure score is (about) zero, predicting a positive attitude of 1.6. The largest observed exposure score is around eight, predicting a negative attitude of -3.2. If exposure causes the predicted differences in attitude, the campaign would have interesting effects. It may change a positive attitude into a rather strong negative attitude.
If we want to apply a rule of thumb for the strength of the effect, we usually look at the standardized regression coefficient ($b^*$ according to APA, *Beta* in SPSS output). See Section \@ref(assoc-size) for some rules of thumb for effect size interpretation.
Note that the regression coefficient is calculated for predictor values that occur within the data set. For example, if the observed exposure scores are within the range zero to eight, these values are used to predict attitude towards smoking.
We cannot see this in the regression equation, which allows us to plug in -10, 10, or 100 as exposure values. But the attitude values that we predict from these exposure values are probably nonsensical (if they are possible at all: what would -10 exposure mean?). Our data do not tell us anything about the relation between exposure and anti-smoking attitude for predictor values outside the observed zero to eight range. We should not pretend to know the effects of exposure levels outside this range. It is good practice to check the actual range of predictor values.
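Checking that range takes one line in R; a sketch with the hypothetical `smokers` data frame:
```{r, eval=FALSE}
# Sketch: inspect the observed range of the predictor, and only
# interpret predictions within this range.
range(smokers$exposure)
```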
### Dichotomous predictors {#dichpredictor}
Instead of a numerical independent variable, we can use a dichotomy as an independent variable in a regression model. The dichotomy is preferably coded as 1 versus 0, for example, 1 for smokers and 0 for non-smokers among our respondents.
```{r regression-dichotomy, eval=TRUE, echo=FALSE, fig.pos='H', fig.align='center', fig.cap="What is the difference in attitude between non-smokers and smokers?", screenshot.opts = list(delay = 5), dev="png", out.width="775px"}
# Goal: Refresh interpretation of unstandardized regression weight for a
# dichotomous independent variable by manipulating group averages.
# Generate attitude scores with average values -0.6 for non-smokers and 1.0 for
# smokers (N = 20 per group).
# Draw horizontal and vertical lines from the axes to the group means and add a
# regression line (line through the two group means).
# Display the current regression equation beneath or in the plot.
# Allow the user to change the average score per group. Update the scatterplot,
# regression line and equation, and the horizontal/vertical lines.
knitr::include_app("http://82.196.4.233:3838/apps/regression-dichotomy/", height="330px")
```
<A name="question8.1.4"></A>
```{block2, type='rmdquestion'}
4. What is the relation between the constant of the regression line in Figure \@ref(fig:regression-dichotomy) and group averages? Motivate your answer by changing the average attitude towards smoking for non-smokers in Figure \@ref(fig:regression-dichotomy). [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.4)
```
<A name="question8.1.5"></A>
```{block2, type='rmdquestion'}
5. Change the slider for smokers to detect the relation between group means and the unstandardized regression coefficient ($b$). How can we calculate the unstandardized regression coefficient ($b$) from the group averages? [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.5)
```
The interpretation of the effect of a dichotomous independent variable in a regression model is quite different from the interpretation of a numerical independent variable's effect.
It does not make sense to interpret the unstandardized regression coefficient of, for example, smoking status as the predicted difference in attitude for one unit 'more' smoking status. After all, the 0 and 1 scores do not mean that there is one unit 'more' smoking. Instead, the coefficient indicates that we are dealing with different groups: smokers versus non-smokers.
If smoking status is coded as smoker (1) versus non-smoker (0), we effectively have two versions of the regression equation. The first equation \@ref(eq:regdicho1) represents all smokers, so their smoking status score is 1. The smoking status of this group has a fixed contribution to the predicted average attitude, namely $b$.
```{=tex}
\begin{equation}
\small
\begin{split}
attitude &= constant + b*status \\
attitude_{smokers} &= constant + b*1 \\
attitude_{smokers} &= constant + b
\end{split}
(\#eq:regdicho1)
\normalsize
\end{equation}
```
Regression equation \@ref(eq:regdicho0) represents all non-smokers. Their smoking status score is 0, so the smoking status effect drops from the model.
```{=tex}
\begin{equation}
\small
\begin{split}
attitude &= constant + b*status \\
attitude_{non-smokers} &= constant + b*0 \\
attitude_{non-smokers} &= constant + 0
\end{split}
(\#eq:regdicho0)
\normalsize
\end{equation}
```
If you compare the final equations for smokers [Eq. \@ref(eq:regdicho1)] and non-smokers [Eq. \@ref(eq:regdicho0)], the only difference is $b$, which is present for smokers but absent for non-smokers. It is the difference between the average score on the dependent variable (attitude) for smokers and the average score for non-smokers. We are testing a mean difference. Actually, this is exactly the same as an independent-samples *t* test!
```{block2, type='rmdimportant'}
The unstandardized regression coefficient for a dummy (0/1) variable represents the difference between the average outcome score of the group coded as '1' and the average outcome score of the group coded as '0'.
```
Imagine that $b$ equals 1.6. This indicates that the average attitude towards smoking among smokers (coded '1') is 1.6 units above the average attitude among non-smokers (coded '0'). Is this a small or large effect? In the case of a dichotomous independent variable, we should **not** use the standardized regression coefficient to evaluate effect size. The standardized coefficient depends on the distribution of 1s and 0s, that is, on what proportion of the respondents are smokers. But this should be irrelevant to the size of the effect.
Therefore, it is recommended to interpret only the unstandardized regression coefficient for a dichotomous independent variable. Interpret it as the difference in average scores for two groups.
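The equivalence with the independent-samples *t* test noted above is easy to verify. A sketch, again with the hypothetical `smokers` data frame and its 0/1 variable `smoker`:
```{r, eval=FALSE}
# Sketch: a simple regression on a 0/1 dummy reproduces the
# independent-samples t test (assuming equal variances).
summary(lm(attitude ~ smoker, data = smokers))
t.test(attitude ~ smoker, data = smokers, var.equal = TRUE)
# b for the dummy equals the difference between the two group means,
# and its t and p values match those of the t test.
```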
### A categorical independent variable and dummy variables {#categorical-predictor}
How about a categorical variable containing three or more groups, for example, the distinction between respondents who smoke (smokers), stopped smoking (former smokers), and respondents who never smoked (non-smokers)? Can we include a categorical variable as an independent variable in a regression model? Yes, we can, but we need a trick.
```{r regression-categorical, fig.pos='H', fig.align='center', fig.cap="What are the predictive effects of smoking status?", echo = FALSE, screenshot.opts = list(delay = 5), dev="png", out.width="775px"}
# Goal: Understanding effects of dummy variables by manipulating the reference
# group.
# Generate numerical dependent variable data (attitude, range [-5, 5]) for a categorical
# independent variable with 3 categories: (1) non-smoker, (2) former smokers, (3) smoker,
# and a random choice of one of the following scenarios:
#* (1 M = -1.7, SD = 1) < (2) = (3 M = 0.75, SD = 1) {initial situation},
#* (1 M = -1.7, SD = 1) = (2) < (3 M = 0.75, SD = 1),
#* (2 M = -1.7, SD = 1) < (1) = (3 M = 0.75, SD = 1),
#* (1 M = -1.7, SD = 1) < (2 M = 0.75, SD = 1) < (3 M = 1.8, SD = 1).
# Display jittered scatterplot containing the three groups with group means
# indicated by a line segment and value, regression lines through the reference
# group mean and each of the other group means. Display b and p value of
# regression weights with the regression lines.
# Add input to select the reference group, initially set to group (1). Update
# regression lines and their p values on selection of a new reference group.
# Add button to generate a new plot (with a new scenario).
knitr::include_app("http://82.196.4.233:3838/apps/regression-categorical/", height="325px")
```
<A name="question8.1.6"></A>
```{block2, type='rmdquestion'}
6. Interpret the effects of smoking status (grey and black *b*'s) in Figure \@ref(fig:regression-categorical). [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.6)
```
<A name="question8.1.7"></A>
```{block2, type='rmdquestion'}
7. In the initial state of Figure \@ref(fig:regression-categorical), can you tell whether the attitude of smokers is significantly different from the attitude of former smokers? If not, how can you get the _p_ value that you need? [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.7)
```
<A name="question8.1.8"></A>
```{block2, type='rmdquestion'}
8. Select some new plots. For each plot, determine which reference group you think is most convenient for summarizing the results. [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.8)
```
In this example, smoking status is measured with three categories: (1) non-smokers, (2) former smokers, and (3) smokers. Let us use the term *categorical variable* only for variables containing three or more categories or groups. This makes it easy to distinguish them from dichotomous variables. This distinction is important because we can include a dichotomous variable straight away as a predictor in a regression model but we cannot do so for a variable with more than two categories. We can only include such a categorical independent variable if we change it into a set of dichotomies.
We can create a new dichotomous variable for each group, indicating whether (score 1) or not (score 0) the respondent belongs to this group. In the example, we could create the variables *neversmoked*, *smokesnomore*, and *smoking*. Every respondent would score 1 on one of the three variables and 0 on the other two variables (Table \@ref(tab:dummytable)). These variables are called *dummy variables* or *indicator variables*.
```{r dummytable, echo=FALSE, screenshot.opts=list(delay = 2)}
knitr::kable(rbind(c("1 - Non-smoker", "1", "0", "0"),
c("2 - Former smoker", "0", "1", "0"),
c("3 - Smoker", "0", "0", "1")),
col.names = c("Original categorical variable:", "neversmoked", "smokesnomore", "smoking"), caption = "Dummy variables for a categorical independent variable: One dummy variable is superfluous.", align = c("l", "c", "c", "c"), booktabs = TRUE) %>%
kable_styling(font_size = 12, full_width = F, position = "float_right",
latex_options = c("scale_down", "HOLD_position"))
```
If we want to include a categorical independent variable in a regression model, we must use all dummy variables as independent variables **except one**. In the example, we must include two out of the three dummy variables. Equation \@ref(eq:regcat) includes dummy variables for former smokers ($smokesnomore$) and smokers ($smoking$).
```{block2, type='rmdimportant'}
Include dummy variables as independent variables for *all except one* categories of a categorical variable.
The category without dummy variable is the *reference group*.
```
```{=tex}
\begin{equation}
\small
\begin{split}
attitude &= constant + b_1*smokesnomore + b_2*smoking
\end{split}
(\#eq:regcat)
\normalsize
\end{equation}
```
The two dummy variables give us three different regression equations: one for each smoking status category. Just plug in the correct 0 or 1 values for respondents with a particular smoking status.
Let us first create the equation for non-smokers. To this end, we replace both $smokesnomore$ and $smoking$ by 0. As a result, both dummy variables drop from the equation [Eq. \@ref(eq:regcat1)], so the constant is the predicted attitude for non-smokers. The non-smokers are our *reference group* because they are not represented by a dummy variable in the equation.
```{=tex}
\begin{equation}
\small
\begin{split}
attitude &= constant + b_1*smokesnomore + b_2*smoking \\
attitude_{non-smokers} &= constant + b_1*0 + b_2*0 \\
attitude_{non-smokers} &= constant
\end{split}
(\#eq:regcat1)
\normalsize
\end{equation}
```
For former smokers, we plug in 1 for $smokesnomore$ and 0 for $smoking$. The predicted attitude for former smokers equals the constant plus the unstandardized regression coefficient for the $smokesnomore$ dummy variable ($b_1$), see Equation \@ref(eq:regcat2). Remember that the constant represents the non-smokers (reference group), so the unstandardized regression coefficient $b_1$ for the $smokesnomore$ dummy variable shows us the difference between former smokers and non-smokers: How much more positive or more negative the average attitude towards smoking is among former smokers than among non-smokers.
```{=tex}
\begin{equation}
\small
\begin{split}
attitude &= constant + b_1*smokesnomore + b_2*smoking \\
attitude_{former smokers} &= constant + b_1*1 + b_2*0 \\
attitude_{former smokers} &= constant + b_1
\end{split}
(\#eq:regcat2)
\normalsize
\end{equation}
```
Finally, for smokers, we plug in 0 for $smokesnomore$ and 1 for $smoking$ [Eq. \@ref(eq:regcat3)]. The predicted attitude for smokers equals the constant plus the unstandardized regression coefficient for the $smoking$ dummy variable ($b_2$). This regression coefficient, then, represents the difference in average attitude between smokers and non-smokers (reference group).
```{=tex}
\begin{equation}
\small
\begin{split}
attitude &= constant + b_1*smokesnomore + b_2*smoking \\
attitude_{smokers} &= constant + b_1*0 + b_2*1 \\
attitude_{smokers} &= constant + b_2
\end{split}
(\#eq:regcat3)
\normalsize
\end{equation}
```
The interpretation of the effects (regression coefficients) for the included dummies is similar to the interpretation for a single dichotomous independent variable such as smoker versus non-smoker. It is the difference between the average score of the group coded 1 on the dummy variable and the average score of the reference group on the dependent variable. The reference group is the group scoring 0 on all dummy variables that represent the categorical independent variable.
If we exclude the dummy variable for the respondents who never smoked, as in the above example, the regression weight of the dummy variable $smokesnomore$ gives the average difference between former smokers and non-smokers. If the regression weight is negative, for instance -0.8, former smokers have on average a more negative attitude towards smoking than non-smokers. If the difference is positive, former smokers have on average a more positive attitude towards smoking.
Which group should we use as reference category, that is, which group should not be represented by a dummy variable in the regression model? This is hard to say in general. If one group is of greatest interest to us, we could use this as the reference group, so all dummy variable effects express differences with this group. Alternatively, if we expect a particular ranking of the average scores, we may pick the group at the highest, lowest, or middle rank as the reference group. If you can't decide, run the regression model several times, each time with a different reference group.
Finally, note that we should not include all three dummy variables in the regression model [Eq. \@ref(eq:regcat)]. We can already identify the non-smokers, because they score 0 on both the $smokesnomore$ and $smoking$ dummy variables. Adding the $neversmoked$ dummy variable to the regression model is like including the same independent variable twice. How can the estimation process decide which of the two identical independent variables is responsible for the effect? It can't, so the estimation process fails or it drops one of the dummy variables. If this happens, the independent variables are said to be perfectly *multicollinear*.
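In R, dummy coding happens automatically for factors, and the reference group is simply the first factor level. A sketch with the hypothetical variable names used earlier:
```{r, eval=FALSE}
# Sketch: a factor is expanded into dummy variables behind the scenes;
# the first level serves as the reference group.
smokers$status <- factor(smokers$status, levels = c(1, 2, 3),
                         labels = c("never", "former", "current"))
lm(attitude ~ status, data = smokers)  # dummies for former and current
# Re-estimate with another reference group to obtain the remaining
# comparisons, e.g., former smokers versus smokers.
smokers$status <- relevel(smokers$status, ref = "current")
```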
### Sampling distributions and assumptions {#regr-inference}
```{r regression-sampling, eval=FALSE, echo=FALSE, fig.pos='H', fig.align='center', fig.cap="What happens to regression lines from sample to sample?"}
# Goal: Understand that regression constant and coefficient(s) have sampling
# distributions.
# Generate a population with a weak negative effect (-0.6) of exposure on
# attitude and exposure, with a sizable error term (so a lot of variation in
# sample regression lines).
# Generate a sample (N = 10) and display it in a scatterplot with regression
# line, labelled with it's unstandardized regression coefficient value. Also
# plot the sampling distribution for the regression coefficient.
# Add a button to allow drawing a new sample; display the new sample and new
# regression line but retain the existing regression lines.
# Add button (or change sampling button) to draw 1,000 samples: don't display
# samples, just update sampling distribution with normal (or t) distribution as
# superimposed curve.
1. Which estimates can change from sample to sample: the regression constant, the regression coefficient, or both? Check your answer by drawing new samples.
2. What is the shape of the sampling distribution if you draw a lot of samples?
3. What happens if you draw samples of larger size? Think of what you learned in preceding chapters. Formulate your answer before you change sample size in Figure \@ref(fig:regression-sampling).
```
If we are working with a random sample or we have other reasons to believe that our data could have been different due to chance (Section \@ref(no-random-sample)), we should not just interpret the results for the data set that we collected. We should apply statistical inference---confidence intervals and significance tests---to our results. The confidence interval gives us bounds for plausible population values of the unstandardized regression coefficient. The *p* value is used to test the *null hypothesis that the unstandardized regression coefficient is zero in the population*.
Each regression coefficient as well as the constant may vary from sample to sample drawn from the same population, so we need a sampling distribution for each of them. These sampling distributions happen to have a *t* distribution under particular assumptions.
Chapters \@ref(param-estim) and \@ref(hypothesis) have extensively discussed how confidence intervals and *p* values are constructed and how they must be interpreted. So we focus now on the assumptions under which the *t* distribution is a good approximation of the sampling distribution of a regression coefficient.
#### Independent observations
The two most important assumptions require that the observations are *independent and identically distributed*. These requirements arise from probability theory. If they are violated, the statistical results should not be trusted.
Each observation, for instance, a measurement on a respondent, must be independent of all other observations. A respondent's dependent variable score is not allowed to depend on scores of other respondents.
It is hardly possible to check that our observations are independent. We usually have to assume that this is the case. But there are situations in which we should not make this assumption. In time series data, for example, the daily amount of political news, we usually have trends, cyclic movements, or issues that affect the amount of news over a period of time. As a consequence, the amount and contents of political news on one day may depend on the amount and contents of political news on the preceding days.
Clustered data should also not be considered as independent observations. Think, for instance, of student evaluations of statistics tutorials. Students in the same tutorial group are likely to give similar evaluations because they had the same tutor and because of group processes: Both enthusiasm and dissatisfaction can be contagious.
#### Identically distributed observations
To check the assumption of identically distributed observations, we inspect the residuals. Remember, the residuals are represented by the error term ($e$) in the regression equation. They are the difference between the scores that we observe for our respondents and the scores that we predict for them with our regression model.
```{r resid-normal, fig.pos='H', fig.align='center', fig.cap="What are the residuals and how are they distributed?", echo=FALSE, screenshot.opts = list(delay = 5), dev="png", out.width="775px"}
# Goal: Understand the meaning of residuals by linking residuals in a
# scatterplot to the x values in a histogram.
# Generate a sample (N = 20?) with a weak negative effect (-0.6) of exposure on
# attitude and exposure, with a sizable error term to have residuals that are
# clearly visible. Generate a sample with uniformly distributed residuals.
# Display the sample as a scatterplot with the regression line and (red) line
# segments linking the dots vertically to the regression line. Display the
# residuals also as a histogram with normal curve. Hovering over/clicking a line
# segment (residual) in the scatterplot should highlight the corresponding bar
# in the histogram. Add a button to draw a new sample.
knitr::include_app("http://82.196.4.233:3838/apps/resid-normal/", height="264px")
```
<A name="question8.1.9"></A>
```{block2, type='rmdquestion'}
9. What do the lines between the dots and the (blue) regression line represent in the scatter plot at the left of Figure \@ref(fig:resid-normal)? [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.9)
```
<A name="question8.1.10"></A>
```{block2, type='rmdquestion'}
10. What is the relation between the scatter plot and the histogram? Can you point out the dot in the scatter plot that belongs to the leftmost bar in the histogram? Tip: Drag your mouse around a dot while pressing the left mouse button to see its residual in the histogram. [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.10)
```
<A name="question8.1.11"></A>
```{block2, type='rmdquestion'}
11. Draw some new samples. Are the residuals always normally distributed: Does the top of the histogram coincide with the center of the normal distribution and are the left and right tails equally "fat"? [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.11)
```
If we sample from a population where attitude towards smoking depends on exposure, smoking status, and contact with smokers, we will be able to predict attitude from the independent variables in our sample. Our predictions will not be perfect, sometimes too high and sometimes too low. The differences between predicted and observed attitude scores are the residuals.
If our sample is truly a random sample with independent and identically distributed observations, the sizes of our errors (residuals) should be normally distributed for each value of the dependent variable, that is, attitude in our example. The residuals should result from chance (see Section \@ref(datageneratingprocess) for the relation between chance and a normal distribution).
So, for each possible value of the dependent variable, we must collect the residuals of the observations that have that score on the dependent variable. For example, we should select all respondents who score 4.5 on the attitude towards smoking scale. Then, we select the residuals for these respondents and see whether they are approximately normally distributed.
Usually, we do not have more than one observation (if any) for a single dependent variable score, so we cannot apply this check. Instead, we use a simple and coarse approach: Are all residuals normally distributed?
A histogram with an added normal curve (like the right-hand plot in Figure \@ref(fig:resid-normal)) helps us to evaluate the distribution of the residuals. If the curve more or less follows the histogram, we conclude that the assumption of identically distributed observations is plausible. If not, we conclude that the assumption is not plausible and we warn the reader that the results can be biased.
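Outside SPSS, this check takes only a few lines. A sketch, assuming a fitted model object `model`:
```{r, eval=FALSE}
# Sketch: histogram of the residuals with a normal curve superimposed.
res <- residuals(model)
hist(res, freq = FALSE, xlab = "Residual", main = "")
curve(dnorm(x, mean = mean(res), sd = sd(res)), add = TRUE)
```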
#### Linearity and prediction errors
The other two assumptions that we use tell us about problems in our model rather than problems in our statistical inferences. Our regression model assumes a linear effect of the independent variables on the dependent variable (*linearity*) and it assumes that we can predict the dependent variable equally well or equally badly for all levels of the dependent variable (*homoscedasticity*, next section).
The regression models that we estimate assume a linear model. This means that an additional unit of the independent variable always increases or decreases the predicted value by the same amount. If our regression coefficient for the effect of exposure on attitude is -0.25, an exposure score of one predicts a 0.25 more negative attitude towards smoking than zero exposure. Exposure score five predicts the same difference in comparison to score four as exposure score ten in comparison to exposure score nine, and so on. Because of the linearity assumption, we can draw a regression model as a straight line. Residuals of the regression model help us to see whether the assumption of a linear effect is plausible.
```{r pred-linearity, fig.pos='H', fig.align='center', fig.cap="How do residuals tell us whether the relation is linear?", echo=FALSE, screenshot.opts = list(delay = 5), dev="png", out.width="775px"}
# Goal: Understand the relation between linear model (scatterplot) and residuals plot by manipulating the shape of the association.
# Generate a sample (N = 20?) with a weak negative effect (-0.6) of exposure on attitude, with a sizable error term to have residuals that are clearly visible. Generate either a sample with (1) linear, (2) curved, (3) U-shaped association. Display the sample as a scatterplot with the regression line and (red) line segments linking the dots vertically to the regression line. Display the residuals also in a residuals (Y) by predicted values (X) plot. Hovering over/clicking a dot in the scatterplot should highlight the corresponding dot in the residuals plot. Add a button to select a different association shape. Upon selection of a shape, generate & display new sample data.
knitr::include_app("http://82.196.4.233:3838/apps/pred-linearity/", height="460px")
#ADD OPTION: LINEAR POSITIVE?
```
<A name="question8.1.12"></A>
```{block2, type='rmdquestion'}
12. Which dot in the plot of residuals (Figure \@ref(fig:pred-linearity) bottom) corresponds with the left-most observation (dot) in the scatter plot of attitude by exposure (Figure \@ref(fig:pred-linearity) top)? Drag your mouse around the left-most dot while pressing the left mouse button to check your choice. Repeat for more dots until you understand the relation between the two plots. [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.12)
```
<A name="question8.1.13"></A>
```{block2, type='rmdquestion'}
13. Select a U-shaped curve in Figure \@ref(fig:pred-linearity). Explain how the plot of residuals tells you that the association is not linear. Do the same for a curved association. [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.13)
```
The relation between an independent and dependent variable, for example, exposure and attitude towards smoking, does not have to be linear. It can be curved or have some other fancy shape. Then, the linearity assumption is not met. A straight regression line does not nicely fit such data.
We can see this in a graph showing the (standardized) residuals (vertical axis) against the (standardized) predicted values of the dependent variable (on the horizontal axis), as exemplified by the lower plot in Figure \@ref(fig:pred-linearity). Note that the residuals represent prediction errors. If our regression predictions are systematically too low at some levels of the dependent variable and too high at other levels, the residuals are not nicely distributed around zero for all predicted levels of the dependent variable. This is what you see if the association is curved or U-shaped.
This indicates that our linear model does not fit the data. If the model fits, the average prediction error is zero for all predicted levels of the dependent variable. Graphically speaking, our linear model matches the data if positive prediction errors (residuals) are more or less balanced by negative prediction errors everywhere along the regression line.
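The diagnostic plot itself is short to produce. A sketch, assuming a fitted model object `model`; the same plot serves the homoscedasticity check in the next subsection:
```{r, eval=FALSE}
# Sketch: standardized residuals against standardized predicted values.
# For a linear, homoscedastic model the dots form a patternless,
# roughly rectangular band around the zero line.
plot(scale(fitted(model)), scale(residuals(model)),
     xlab = "Standardized predicted value", ylab = "Standardized residual")
abline(h = 0)
```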
#### Homoscedasticity and prediction errors
The plot of residuals by predicted values of the dependent variable tells us more than whether a linear model fits the data.
```{r pred-homoscedasticity, fig.pos='H', fig.align='center', fig.cap="How do residuals tell us that we predict all values equally well?", echo=FALSE, screenshot.opts = list(delay = 5), dev="png", out.width="775px"}
# Goal: Understand the relation between linear model (scatterplot) and residuals plot by manipulating homoscedasticity.
# Generate a sample (N = 20) with a weak negative effect (-0.6) of exposure on attitude, with a sizable error term to have residuals that are clearly visible. Generate error terms with a dependency on the independent variable ranging from -1 to +1. Display the sample as a scatterplot with the regression line and (red) line segments linking the dots vertically to the regression line. Display the residuals also in a residuals (Y) by predicted values (X) plot. Hovering over/clicking a dot in the scatterplot should highlight the corresponding dot in the residuals plot. Add a slider (range [-1, 0], initial value 0) to set the level of heteroscedasticity. Upon slider change, generate & display new sample data.
knitr::include_app("http://82.196.4.233:3838/apps/pred-homoscedasticity/", height="460px")
```
<A name="question8.1.14"></A>
```{block2, type='rmdquestion'}
14. What strikes you about the residuals in Figure \@ref(fig:pred-homoscedasticity)? Remember that residuals represent prediction errors. [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.14)
```
<A name="question8.1.15"></A>
```{block2, type='rmdquestion'}
15. What happens if you move the slider to the far left? [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.15)
```
<A name="question8.1.16"></A>
```{block2, type='rmdquestion'}
16. At which slider position are all attitude levels predicted more or less equally well or equally badly? [<img src="icons/2answer.png" width=115px align="right">](#answer8.1.16)
```
The other assumption states that we can predict the dependent variable equally well at all dependent variable levels. In other words, the prediction errors (residuals) are more or less the same at all levels of the dependent variable. This is called *homoscedasticity*. If we have large prediction errors at some levels of the dependent variable, we should also have large prediction errors at other levels. As a result, the vertical width of the residuals by predictions scatter plot should be more or less the same from left to right. The dots representing residuals resemble a more or less rectangular band.
If the prediction errors are not more or less equal for all levels of the predicted scores, our model is better at predicting some values than other values. For example, low values can be predicted better than high values of the dependent variable. The dots representing residuals resemble a cone. This may signal, among other things, that we need to include moderation in the model.
```{r eval=FALSE, echo=FALSE}
#DROPPED
Why do we use the residuals and predicted values instead of a scatterplot for each dependent-independent variable pair to assess linearity and homoscedasticity? The reason is that some independent variables may predict low values and other independent variables may predict high values. This is perfectly OK if together they predict low and high values equally well.
```
### Answers {.unnumbered}
<A name="answer8.1.1"></A>
```{block2, type='rmdanswer'}
Answer to Question 1.
* The predicted attitude is 1.6.
* This equals the constant of a regression equation because we plug in zero (0) as the value of exposure, so the regression equation simplifies to:
> _attitude_ = 1.6 + _b_ \* 0 = 1.6 + 0 = 1.6.
* In other words, the constant is the predicted value of the dependent variable if all predictors are zero. [<img src="icons/2question.png" width=161px align="right">](#question8.1.1)
```
<A name="answer8.1.2"></A>
```{block2, type='rmdanswer'}
Answer to Question 2.
* The vertical orange line shows the difference between the predicted attitude if exposure is one and the predicted attitude if exposure is zero.

* This difference is captured by the unstandardized regression coefficient, denoted by the symbol $b$.
* More generally, this coefficient tells us the predicted difference in the dependent variable for a difference of one unit in the independent variable.
* According to the equation, the predicted attitude decreases by 0.6 for a one unit difference (0 to 1) in exposure.
* This is the decrease from 1.6 to 1.0 signalled by the vertical orange line. [<img src="icons/2question.png" width=161px align="right">](#question8.1.2)
```
<A name="answer8.1.3"></A>
```{block2, type='rmdanswer'}
Answer to Question 3.
* Just replace exposure by 10 in the equation and calculate the result:
> _attitude_ = 1.6 + (-0.6) \* 10 = 1.6 + -6.0 = -4.4
* It is troublesome that we do not have any respondents with exposure scores near 10. The highest exposure scores are below 8. We cannot check that the regression line still fits the observations. We should not trust the regression line outside the range of values that we have observed for the predictor. [<img src="icons/2question.png" width=161px align="right">](#question8.1.3)
```
<A name="answer8.1.4"></A>
```{block2, type='rmdanswer'}
Answer to Question 4.
* Here, the constant is equal to the average attitude of non-smokers.
* Use the slider to change the average attitude towards smoking for
non-smokers.
* This will change the value of the constant in the regression equation. Why?
* Non-smokers score 0 on the (smoking) status variable. The regression
equation for non-smokers, then, is:
> _attitude_ = _constant_ + _b_ * _status_ = _constant_ + _b_ * 0 = _constant_
* Thus, we see that the predicted attitude for non-smokers equals the
constant. In addition, we know that the predicted value for a group
equals the average score of the group. As a result, the constant equals
the average score of non-smokers in this example.
* Note that this is true only if the group is coded zero and if the regression
model contains only one independent variable (simple regression model). [<img src="icons/2question.png" width=161px align="right">](#question8.1.4)
```
<A name="answer8.1.5"></A>
```{block2, type='rmdanswer'}
Answer to Question 5.
* In this example, the unstandardized regression coefficient (_b_) is equal to
the average attitude score of smokers minus the average attitude score of
non-smokers.
* This is so because smokers are coded as ones and non-smokers are coded as
zeros. The difference between smokers and non-smokers on the smoking status
variable is one. The regression coefficient (_b_) tells us the difference in
predicted attitude scores for a difference of one unit on the independent
variable. As a result, the unstandardized regression coefficient (_b_) tells us
the difference between the average (= predicted value) for smokers and the
average (= predicted value) for non-smokers.
* With equations:
* Smokers average: _attitude_ = _constant_ + _b_ \* _status_ = _constant_ + _b_ \* 1 = _constant_ + _b_
* Non-smokers average: _attitude_ = _constant_ + _b_ \* _status_ = _constant_ + _b_ \* 0 = _constant_
* Smokers - Non-smokers: (_constant_ + _b_) - _constant_ = _b_ (verified in the code below). [<img src="icons/2question.png" width=161px align="right">](#question8.1.5)
```
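And the companion check, again on simulated data: the dummy's coefficient reproduces the difference between the two group means exactly.

```r
# The slope of a 0/1 dummy equals mean(group coded 1) - mean(group coded 0).
set.seed(3)
status   <- rep(c(0, 1), each = 50)
attitude <- ifelse(status == 1, 1.0, -2.0) + rnorm(100)
fit <- lm(attitude ~ status)
coef(fit)["status"]
mean(attitude[status == 1]) - mean(attitude[status == 0])  # the same value
```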
<A name="answer8.1.6"></A>
```{block2, type='rmdanswer'}
Answer to Question 6.
* The precise answer to this question obviously depends on the group means in your plot.
* In general, the interpretation focuses on differences between the mean score of the reference group and the mean scores of the other groups. In this example, we are talking about average attitude towards smoking for each group defined by their smoking status. The reference group is selected with the Select reference drop-down list.

* The unstandardized regression coefficient (_b_) tells us how much larger (positive coefficient) or smaller (negative coefficient) the mean score of a group is in comparison to the reference group. In the above figure, the regression coefficient for smokers versus non-smokers (_b_ = 4.39) is equal to the average of the non-reference group (smokers: mean = 1.40) minus the average of the reference group (non-smokers: mean = -2.99).
* The associated _p_ value tells us how uncertain we are that there truly is a mean difference in the population. It tests the null hypothesis that the two groups have the same mean score on the dependent variable in the population (see the R sketch after this answer).
[<img src="icons/2question.png" width=161px align="right">](#question8.1.6)
```
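With three (simulated) groups, the same logic holds for each coefficient. This sketch assumes invented group means, not the figure's values:

```r
# Each factor coefficient equals a group mean minus the reference group mean.
set.seed(4)
status <- factor(rep(c("Non-smoker", "Former smoker", "Smoker"), each = 40),
                 levels = c("Non-smoker", "Former smoker", "Smoker"))
true_means <- c("Non-smoker" = -3, "Former smoker" = -1, "Smoker" = 1.4)
attitude <- true_means[as.character(status)] + rnorm(120)
fit <- lm(attitude ~ status)
coef(fit)                       # intercept = reference (non-smoker) mean;
                                # other terms = differences from that mean
tapply(attitude, status, mean)  # compare with the raw group means
```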
<A name="answer8.1.7"></A>
```{block2, type='rmdanswer'}
Answer to Question 7.
* In the initial state of this figure, non-smokers are the reference group. The _p_ values, then, belong to the differences between non-smokers on the one hand and people who stopped smoking or still smoke on the other. The comparison between the latter two groups is not included.
* It is hazardous to derive the _p_ value of the difference between two non-reference groups from the _p_ values of their differences with the reference group. Remember that a _p_ value depends both on the effect size (the difference between group means) and on the standard error. The latter depends, among other things, on the sample size per group. All of these aspects may vary across groups, so we cannot guess the _p_ value of the mean difference between two groups.
* The solution is to re-estimate the regression model with one of the groups
that you want to compare as the reference group. If former smokers or current smokers are the reference group, you obtain a _p_ value for the difference between former smokers and smokers. Select one of these groups in the drop-down list and you are done (see the R sketch below). [<img src="icons/2question.png" width=161px align="right">](#question8.1.7)
```
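In R, re-estimating with a different reference group is one call to `relevel()`; a sketch with the simulated data from above:

```r
# Changing the reference group gives a direct test of former smokers vs. smokers.
set.seed(4)
status <- factor(rep(c("Non-smoker", "Former smoker", "Smoker"), each = 40),
                 levels = c("Non-smoker", "Former smoker", "Smoker"))
true_means <- c("Non-smoker" = -3, "Former smoker" = -1, "Smoker" = 1.4)
attitude <- true_means[as.character(status)] + rnorm(120)
status2 <- relevel(status, ref = "Former smoker")  # new reference group
fit2 <- lm(attitude ~ status2)
summary(fit2)$coefficients  # now includes the Smoker vs. Former smoker p value
```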
<A name="answer8.1.8"></A>
```{block2, type='rmdanswer'}
Answer to Question 8.
There is no single right way of choosing a reference group; weigh arguments
such as the following.
1. Substantive interest: Does your research focus on one particular group? If
so, use this group as the reference group, so it is included in all
comparisons. If, for example, the research is meant to support an anti-smoking
campaign, the group of current smokers is probably of central interest. Make
them your reference group.
2. If you expect a particular order in the group means, the group that you
expect to be in the middle is a good choice as reference group. If, for
example, you expect that attitude towards smoking is more positive for smokers
than for former smokers, and the latter are more positive than people who never
smoked, the former smokers are expected to be in the middle. If we use them as
reference group, we can test if they are more positive than people who never
smoked, and more negative than people who are currently smoking. If both differences are statistically significant, the difference between the highest scoring group and the lowest scoring group must also be statistically significant.
3. If two groups have relatively similar means in comparison to the third
group, you may be interested to know if the relatively small difference is
statistically significant. In this case, one of the two groups that have
similar means is a good reference group. [<img src="icons/2question.png" width=161px align="right">](#question8.1.8)
```
<A name="answer8.1.9"></A>
```{block2, type='rmdanswer'}
Answer to Question 9.
* A red line depicts the residual or prediction error: the difference between
a respondent's actual score on the dependent variable (attitude, the dot) and
the score predicted by the blue regression line for that respondent's score on
the independent variable (exposure). The sketch after this answer computes
residuals directly. [<img src="icons/2question.png" width=161px align="right">](#question8.1.9)
```
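The residual is simple arithmetic, as this simulated sketch shows; `residuals()` returns exactly the observed minus the fitted values.

```r
# Residual = observed attitude minus the attitude predicted by the line.
set.seed(6)
exposure <- runif(50, 0, 8)
attitude <- 1.6 - 0.6 * exposure + rnorm(50)
fit <- lm(attitude ~ exposure)
head(residuals(fit))          # the red lines in the plot
head(attitude - fitted(fit))  # the same values, by definition
```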
<A name="answer8.1.10"></A>
```{block2, type='rmdanswer'}
Answer to Question 10.
* The residuals represented by the red lines in the scatterplot are counted in
the histogram.
* Drag the mouse pointer around one or more dots in the scatterplot while
pressing the left mouse button to select one or more observations (dots). You
will see where they are featured in the histogram (blue). The dot that is
furthest below the regression line is counted in the left-most bar of the
histogram. [<img src="icons/2question.png" width=161px align="right">](#question8.1.10)
```
<A name="answer8.1.11"></A>
```{block2, type='rmdanswer'}
Answer to Question 11.
* No, sometimes the distribution is clearly skewed (asymmetrical): the top of the histogram lies to the left or the right of the center of the normal curve, and one tail is longer than the other. [<img src="icons/2question.png" width=161px align="right">](#question8.1.11)
```
<A name="answer8.1.12"></A>
```{block2, type='rmdanswer'}
Answer to Question 12.
* The blue regression line in the top graph represents the predicted values on
the attitude variable from exposure scores. The predicted attitude value for
the left-most observation is the value at which the blue line intersects with
the vertical red line dropping down from that observation.
* The left-most observation has the highest predicted value because the
regression line slopes down to the right. The highest predicted value is at
the far right of the residuals by predicted attitude graph (bottom graph)
because the predicted values are on the horizontal axis of this graph. The
left-most observation in the top graph, then, must correspond with the
right-most observation in the bottom graph.
* Note that this is not always true. If the regression slope is positive, that
is, the regression line goes up from left to right, the left-most observation
in the top graph has the lowest predicted value, so it corresponds to the
left-most observation in the residuals plot. [<img src="icons/2question.png" width=161px align="right">](#question8.1.12)
```
<A name="answer8.1.13"></A>
```{block2, type='rmdanswer'}
Answer to Question 13.

* If the association between two variables is U-shaped, the regression line underestimates the attitude for observations with low exposure, overestimates the attitude for medium values of exposure, and underestimates the attitude for high exposure values. As a result, the residuals show a marked pattern from left to right: a set of positive residuals, followed by a set of negative residuals, followed by a set of positive residuals.
* The same phenomenon occurs for other curved (curvilinear) associations.
* In contrast, a linear association yields residuals without a clear pattern. At all levels of exposure and, hence, at all predicted levels of attitude, we may encounter both positive and negative residuals. In the residuals by predicted values plot, we have positive and negative residuals everywhere: we may underestimate as well as overestimate the dependent variable at every predicted level (the simulation below shows the contrast). [<img src="icons/2question.png" width=161px align="right">](#question8.1.13)
```
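The contrast is easy to simulate. This sketch fits a straight line to a deliberately U-shaped (invented) association; the residual plot shows the positive-negative-positive bands described above.

```r
# Misfit demonstration: a straight line through a U-shaped association.
set.seed(7)
exposure <- runif(200, 0, 8)
attitude <- (exposure - 4)^2 + rnorm(200)  # U-shaped 'true' association
fit <- lm(attitude ~ exposure)             # linear (mis)fit
plot(fitted(fit), residuals(fit),
     xlab = "Predicted attitude", ylab = "Residual")
abline(h = 0, lty = 2)  # positive, then negative, then positive residuals
```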
<A name="answer8.1.14"></A>
```{block2, type='rmdanswer'}
Answer to Question 14.

* The residuals are larger for higher values of exposure. Residuals are prediction errors, so larger residuals mean worse predictions.
* In the top graph, the residuals tend to become larger from left to right: The residuals are larger for larger exposure scores (horizontal axis in the top graph).
* In this particular example, larger exposure scores predict more negative ('lower') attitudes towards smoking (vertical axis in the top graph). We can predict high attitude levels better (smaller residuals) than low attitude levels (larger residuals).
* Predicted attitude scores are on the horizontal axis in the bottom graph, so the residuals tend to become smaller as we move from left (low predicted attitude) to right (high predicted attitude).
* The regression model seems to predict attitude better for participants with low exposure scores than for participants with high exposure scores. [<img src="icons/2question.png" width=161px align="right">](#question8.1.14)
```
<A name="answer8.1.15"></A>
```{block2, type='rmdanswer'}
Answer to Question 15.
* The pattern reverses. Now, residuals are larger for low values of exposure
or, equivalently in this model with a negative slope, for higher predicted
values of attitude.
* In this situation, we are better at predicting low attitude levels than high
attitude levels. Note that we prefer to predict all attitude levels equally
well. [<img src="icons/2question.png" width=161px align="right">](#question8.1.15)
```
<A name="answer8.1.16"></A>
```{block2, type='rmdanswer'}
Answer to Question 16.

* If the slider is positioned at or around zero, the residuals are more or less equal for low, medium, or high predicted values of attitude.
Here, we can predict all levels of attitude equally well or equally badly.
* This is best seen in the bottom graph: The vertical spread of observations is more or less the same at the left, middle, and right of the graph. The vertical diameter of the dot cloud is more or less equal from left to right.
* This is how we like the plot to look (compare the simulated contrast below). [<img src="icons/2question.png" width=161px align="right">](#question8.1.16)
```
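For comparison, this sketch puts a heteroscedastic and a homoscedastic residual plot side by side; the data and error structure are invented.

```r
# Left: residual spread grows with exposure (bad). Right: constant spread (good).
set.seed(8)
exposure   <- runif(200, 0, 8)
att_hetero <- 1.6 - 0.6 * exposure + rnorm(200, sd = 0.3 + 0.4 * exposure)
att_homo   <- 1.6 - 0.6 * exposure + rnorm(200, sd = 1)
fit_het <- lm(att_hetero ~ exposure)
fit_hom <- lm(att_homo ~ exposure)
op <- par(mfrow = c(1, 2))
plot(fitted(fit_het), residuals(fit_het), main = "Heteroscedastic",
     xlab = "Predicted", ylab = "Residual"); abline(h = 0, lty = 2)
plot(fitted(fit_hom), residuals(fit_hom), main = "Homoscedastic",
     xlab = "Predicted", ylab = "Residual"); abline(h = 0, lty = 2)
par(op)
```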
## Regression Analysis in SPSS {#SPSS-regression}
### Instructions
```{r SPSSregsimple, echo=FALSE, out.width="640px", fig.pos='H', fig.align='center', fig.cap="(ref:regsimpleSPSS)", dev="png", screenshot.opts = list(delay = 5)}
knitr::include_url("https://www.youtube.com/embed/XrxlCOi6SgE", height = "360px")
# Goal: asymmetric association as prediction
# Example: consumers.sav, brand awareness by advertisement exposure
# Technique: regression with confidence intervals
# SPSS menu: regression>linear with CI under Statistics.
# Paste & Run.
# Interpret output: R2, F test, predictive effect strength (b*) and change (b) with 95% confidence interval.
# Check assumptions: in chapter on moderation with regression analysis?
```
------------------------------------------------------------------------
```{r SPSSregdummy2, echo=FALSE, out.width="640px", fig.pos='H', fig.align='center', fig.cap="(ref:regdummy2SPSS)", dev="png", screenshot.opts = list(delay = 5)}
knitr::include_url("https://www.youtube.com/embed/c2b4dtlPS54", height = "360px")
# Creating dummy variables in SPSS.
# Goal: Understand creating dummy variables.
# Example: smokers.sav, respondent's smoking status (3 categories).
# SPSS menu: Transform > Create Dummy Variables or Transform > Recode into Different Variables.
# Inspect results: new variables, coded 0/1.
```
------------------------------------------------------------------------
```{r SPSSregdummy, echo=FALSE, out.width="640px", fig.pos='H', fig.align='center', fig.cap="(ref:regdummySPSS)", dev="png", screenshot.opts = list(delay = 5)}
knitr::include_url("https://www.youtube.com/embed/Vs26zuwAZdk", height = "360px")
# Using dummy variables in a regression model in SPSS.
# Example: smokers.sav, predict the attitude towards smoking from exposure to an anti-smoking campaign and respondent's smoking status (dummies).
# SPSS menu: Transform > Create Dummy Variables
# Remember: Leave one dummy variable out.
# Interpret results: unstandardized regression coefficient as average difference with reference category. Don't interpret the standardized regression coefficient.
```
------------------------------------------------------------------------
```{r SPSSregassumpt, echo=FALSE, out.width="640px", fig.pos='H', fig.align='center', fig.cap="(ref:regassumptSPSS)", dev="png", screenshot.opts = list(delay = 5)}
knitr::include_url("https://www.youtube.com/embed/hx2qdaVhlaM", height = "360px")
# Goal: Inspecting residuals (see Chapter 4 on hypothesis testing for a video about regression basics).
# Example: smokers.sav, predict the attitude towards smoking from exposure to an anti-smoking campaign and respondent's smoking status.
# Technique: regression analysis
# SPSS menu: linear regression, add plots
# Interpret output = check assumptions: Chart Editor of zresid * zpred plot: add reference line at 0 and perhaps at +2 and -2 to inspect shape of residual distribution.
```
### Exercises
<A name="question8.2.1"></A>
```{block2, type='rmdquestion'}
1. Use the data in [allsmokers.sav](http://82.196.4.233:3838/data/allsmokers.sav) to predict the attitude towards smoking from exposure to an anti-smoking campaign. Check the assumptions and interpret the results. [<img src="icons/2answer.png" width=115px align="right">](#answer8.2.1)
```
<A name="question8.2.2"></A>
```{block2, type='rmdquestion'}
2. Add smoking status (variable _status3_) and contact with smokers as predictors to the regression model of Exercise 1. Check the assumptions and interpret the results. [<img src="icons/2answer.png" width=115px align="right">](#answer8.2.2)
```
<A name="question8.2.3"></A>
```{block2, type='rmdquestion'}
3. The data set [allchildren.sav](http://82.196.4.233:3838/data/allchildren.sav) contains information about media literacy of children and parental supervision of their media use. Are the two related? Check the assumptions and interpret the results. [<img src="icons/2answer.png" width=115px align="right">](#answer8.2.3)
```
<A name="question8.2.4"></A>
```{block2, type='rmdquestion'}
4. How well can we predict brand awareness with ad exposure? Use [allconsumers.sav](http://82.196.4.233:3838/data/allconsumers.sav) to answer this question. [<img src="icons/2answer.png" width=115px align="right">](#answer8.2.4)
```
### Answers {.unnumbered}
<A name="answer8.2.1"></A>
```{block2, type='rmdanswer'}
Answer to Exercise 1.
SPSS syntax:
\* Check data.
FREQUENCIES VARIABLES=exposure attitude
/ORDER=ANALYSIS.
\* Simple regression analysis with assumption checks.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT attitude
/METHOD=ENTER exposure
/SCATTERPLOT=(\*ZRESID ,\*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).
Check data:
There are no impossible values on the two variables.
Check assumptions:

* The distribution of the residuals is (left) skewed rather than normal. This is a problem, and we should report it.

* The residuals are nicely centered around zero at all predicted levels of the dependent variable, so the association seems to be linear.
* The residuals are evenly spread around zero at all predicted levels.
Perhaps the spread is slightly smaller at high predicted values. On the whole,
however, prediction accuracy seems to be more or less the same at all predicted levels (homoscedasticity).
Interpret the results:
* Campaign exposure predicts attitude towards smoking among adults reasonably
well, *R*^2^ = .32, *F* (1, 310) = 143.05, *p* < .001.
* One additional unit of exposure decreases the predicted attitude by 0.32 to
0.44 points, *t* = -11.96, *p* < .001, 95% CI [-0.44; -0.32]. This is a moderate to strong effect (*b\** = -.56). An R version of this analysis follows this answer. [<img src="icons/2question.png" width=161px align="right">](#question8.2.1)
```
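For readers who work in R rather than SPSS, a rough equivalent of this analysis looks as follows. This is a sketch, not part of the official answer: it assumes the `haven` package is installed and that a local copy of the data file sits at the path shown.

```r
# R counterpart of the SPSS syntax above (sketch; adjust the file path).
library(haven)                           # reads SPSS .sav files
smokers <- read_sav("allsmokers.sav")    # assumed local copy of the data
fit <- lm(attitude ~ exposure, data = smokers)
summary(fit)       # R-squared, F test, b, t, and p
confint(fit)       # 95% confidence intervals
hist(rstandard(fit), main = "Standardized residuals")  # normality check
plot(fitted(fit), rstandard(fit)); abline(h = 0)       # linearity/spread check
```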
<A name="answer8.2.2"></A>
```{block2, type='rmdanswer'}
Answer to Exercise 2.
SPSS syntax:
\* Check data.
FREQUENCIES VARIABLES=exposure status3 contact attitude
/ORDER=ANALYSIS.
\* Create dummy variables for status3.
\* With Transform > Create Dummy Variables.
\* ENSURE THAT MEASUREMENT LEVEL IS SET TO NOMINAL OR ORDINAL.
\* Define Variable Properties.
\*status3.
VARIABLE LEVEL status3(ORDINAL).
EXECUTE.
SPSSINC CREATE DUMMIES VARIABLE=status3
ROOTNAME1=status
/OPTIONS ORDER=A USEVALUELABELS=YES USEML=YES OMITFIRST=NO.
\* If your SPSS version does not have this command, use Recode.
RECODE status3 (1=1) (ELSE=0) INTO status_2.
VARIABLE LABELS status_2 'Former smoker'.
EXECUTE.
RECODE status3 (2=1) (ELSE=0) INTO status_3.
VARIABLE LABELS status_3 'Smoker'.
EXECUTE.
\* Multiple regression analysis with assumption checks.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT attitude
/METHOD=ENTER exposure contact status_2 status_3
/SCATTERPLOT=(\*ZRESID ,\*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).
Check data:
All values on the variables seem to be valid.
Check assumptions:

* The residuals look quite like a normal distribution. They are much more symmetrical than in Exercise 1. Note that a new regression model (we added predictors) may solve problems with the assumptions.

* The residuals by predicted values plot suggests that the effects are linear. At all predicted levels, the residuals lie both above and below zero, so the average residual is close to zero.
* However, the spread of the residuals is larger at high predicted
levels (right) than at low predicted levels (left): more negative attitudes are predicted better than more positive attitudes. The new regression model solves the normality problem noted in Exercise 1, but it creates a problem with homoscedasticity.
Interpret the results:
* Include a table with regression coefficients, so we do not have to report all t
test results in our interpretation.
<div style="font-size: 0.8em">
| | _b_ | _SE_ | _b\*_ | _t_ | _p_ | 95% CI |
|:------------------|----:|----:|----:|----:|----:|:--------:|
| Constant | 1.11 | 0.24 | | 4.52 | <.001 | [0.62, 1.59] |
| Exposure to anti-smoking campaign| -0.30 |0.02 | -0.45 | -13.31 | <.001 | [-0.35, -0.26] |
| Contact with smokers | 0.20 | 0.03 | 0.21 | 6.06 | <.001 | [0.14, 0.27] |
| Former smoker | -2.86 | 0.16 | -0.60 | -17.94 | <.001 | [-3.17, -2.54] |
| Smoker | -0.50 | 0.15 | -0.11 | -3.37 | .001 | [-0.79, -0.21] |
</div>
* The regression model predicts about seventy per cent of the variation in
attitude towards smoking, which is a great deal for a social-scientific model, *R*^2^ = .69, *F* (4, 307) = 171.98, *p* < .001.
* Exposure to the anti-smoking campaign predicts a more negative attitude
towards smoking. We are 95% confident that an additional unit of exposure (on