# Which Sample Size Do I Need? Power! {#power}
> Key concepts: minimum sample size, practical relevance, unstandardized effect size, standardized effect size, Cohen's *d* for means, Type I error, Type II error, test power.
Watch this micro lecture on test power for an overview of the chapter.
```{r, echo=FALSE, out.width="640px", fig.pos='H', fig.align='center', dev="png", screenshot.opts = list(delay = 5)}
knitr::include_url("https://www.youtube.com/embed/s93CScWswIE", height = "360px")
```
### Summary {.unnumbered}
```{block2, type='rmdimportant'}
* How large should my sample be?
* What does it mean if a statistical test is not significant?
```
At the start of a quantitative research project, we are confronted with a seemingly simple practical question: How large should our sample be? In some cases, the statistical test that we plan to use gives us rules of thumb for the minimum size that we need for this test.
This may tell us the minimum sample size but not necessarily the optimal sample size. Even if we can technically apply the statistical test, the sample may not be large enough for the test to signal the population differences or associations (in short, the effect sizes) that we are interested in.
If we want to know the minimum sample size that we need to signal important effects in our data, things become rather complicated. We have to decide on the size of effects that we deem important. We also have to decide on the minimum probability that the statistical test will actually reject the hypothesis of no effect (the nil) if the true effect in the population has the selected interesting size.
This probability is the power of a test: the probability of rejecting a null hypothesis of no effect if the effect in the population is of a size interesting to us. If we do not reject a false null hypothesis, we make a Type II error.
Thinking about sample size thus confronts us with a problem that we have hitherto neglected, namely the problem of not rejecting a false null hypothesis. This problem is very important if the null hypothesis represents our research hypothesis: in that case, our expectations are confirmed if we do *not* succeed in rejecting the null hypothesis.
However, if we do *not* reject the null hypothesis, we cannot make a Type I error, namely rejecting a true null hypothesis. As a consequence, the significance level of our test, which is the maximum probability of making a Type I error, is meaningless. We must know the probability of rejecting a false null hypothesis---the power of the test---to express our confidence that our research hypothesis is true.
This chapter reviews concepts that are central to understanding what statistical significance of a test means: sampling distribution, hypotheses, statistical significance, and Type I error. It adds concepts that we need to interpret a test that is *not* statistically significant: effect size, practical relevance, Type II error, and test power.
## Sample Size and Test Requirements {#size-test-req}
Table \@ref(tab:thumb) in Chapter \@ref(probmodels) (summarised in Table \@ref(tab:thumbsize) below) shows the conditions that must be satisfied if we want to use a theoretical probability distribution to approximate a sampling distribution. Only if these conditions are met does the theoretical probability distribution resemble the sampling distribution closely enough to use the former as an approximation of the latter.
```{r thumbsize, echo=FALSE, screenshot.opts=list(delay = 2)}
knitr::kable(rbind(c("Binomial distribution", "proportion", "-"), c("(Standard) normal distribution", "proportion", ">= 5 divided by test proportion (<= .5)"), c("(Standard) normal distribution", "one or two means", "> 100"), c("t distribution", "one or two means", "each group > 30"), c("t distribution", "(Spearman) rank correlation coefficient", "> 30"), c("t distribution", "regression coefficient", "20+ per independent variable"), c("F distribution", "3+ means", " all groups are more or less of equal size"), c("chi-squared distribution", "row or cell frequencies", "expected frequency >= 1 and 80% >= 5 (sample 5+ observations per category or cell)")), col.names = c("Distribution", "Sample statistic", "Minimum sample size"), caption = "Rules of thumb for minimum sample sizes.", booktabs = TRUE) %>%
kable_styling(font_size = 12, full_width = F,
latex_options = c("scale_down", "HOLD_position"))
```
Conditions often include sample size. Table \@ref(tab:thumbsize) reproduces the size requirements from Table \@ref(tab:thumb). If you plan to do a *t* test, each group should contain more than thirty cases. So if you intend to apply *t* tests, recruit more than thirty participants for each experimental group or more than thirty respondents for each group in your survey. If you expect non-response, that is, sampled participants or respondents unwilling to participate in your research, you should recruit more participants or respondents to have more than thirty observations in the end.
Chi-squared tests require a minimum of five expected frequencies per category in a frequency distribution or cell in a contingency table. Your sample size should be at least the number of categories or cells times five to come even near this requirement. Regression analysis requires at least 20 cases per independent variable in the regression model.
The variation of sample size across groups is important in analysis of variance (ANOVA), which uses the *F* distribution. If the number of cases is more or less the same across all groups, we do not have to worry about the variances of the dependent variable for the groups in the population. To be on the safe side, then, it is recommended to design your sampling strategy in such a way that you end up with more or less equal group sizes if you plan to use analysis of variance.
## Effect Size {#effectsize}
```{r sample-size-unstand, fig.pos='H', fig.align='center', fig.cap="Sampling distribution of average candy weight under the null hypothesis that average candy weight is 2.8 grams in the population.", echo=FALSE, out.width="775px", screenshot.opts = list(delay = 5), dev="png"}
# Average candy weight sampling distribution approximated by a t distribution with two-sided 5% significance tails. Weak, moderate, and strong effect sizes (under null hypothesis that average candy weight is 2.8 in the population) marked by vertical red lines.
knitr::include_app("http://82.196.4.233:3838/apps/sample-size-unstand/", height="268px")
```
<A name="question5.2.1"></A>
```{block2, type='rmdquestion'}
1. If we draw a sample of twenty candies, which average candy weight in our sample rejects the null hypothesis according to Figure \@ref(fig:sample-size-unstand): 2.40, 2.70, or 3.05 grams? [<img src="icons/2answer.png" width=115px align="right">](#answer5.2.1)
```
<A name="question5.2.2"></A>
```{block2, type='rmdquestion'}
2. Imagine that legal regulations allow average candy weight in a sample bag to differ by 0.10 grams but not by 0.25 grams from the weight reported on the sample bag. Our statistical test should pick up (be significant for) differences of at least 0.25 grams, but not differences of 0.10 grams or less. What is the minimum and maximum size of our sample? Use the slider to find the answer. [<img src="icons/2answer.png" width=115px align="right">](#answer5.2.2)
```
We have learned that larger samples have smaller standard errors (Section \@ref(sample-size)). Smaller standard errors yield larger test statistic values and larger test statistics have smaller *p* values. In other words, a test on a larger sample is more often statistically significant.
A larger sample offers more precision, so the difference between our sample outcome and the hypothesized value is more often sufficient to reject the null hypothesis. For example, we would reject the null hypothesis that average candy weight is 2.8 grams in the population if average weight in our sample bag is 2.70 grams and our sample is large. But we may not reject this null hypothesis if we have the same outcome in a small sample bag.
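A minimal sketch in R puts numbers on this comparison; it assumes, for illustration only, a standard deviation of 0.5 grams in both bags. The same sample mean of 2.70 grams is far from significant with 20 candies but clearly significant with 200.

```{r, eval=FALSE}
# Same observed mean (2.70) and assumed SD (0.5); only sample size differs.
t_small <- (2.70 - 2.80) / (0.5 / sqrt(20))   # t is about -0.89
t_large <- (2.70 - 2.80) / (0.5 / sqrt(200))  # t is about -2.83
2 * pt(-abs(t_small), df = 20 - 1)    # p is about .38: not significant
2 * pt(-abs(t_large), df = 200 - 1)   # p is about .005: significant
```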
::: {style="column-count: 2; -moz-column-count: 2"}
The larger our sample, the more sensitive our test will be, so we will get statistically significant results more often. If we think of our statistical test as a security metal detector, a more sensitive detector will go off more often.
Of course, the size of the difference between our sample outcome and the hypothesized population value matters as well. This difference is called *effect size*. If average candy weight in our sample bag deviates more from the average weight specified in the null hypothesis, we are more likely to reject the null hypothesis. In terms of a security metal detector: Our test will pick up large pieces of metal more easily than small pieces.
![](figures/metaldetector.jpg)
:::
The _p_ value and rejection of the null hypothesis based on the _p_ value, then, depend both on sample size and effect size.
```{block2, type='rmdimportant'}
* A larger sample size makes a statistical test more sensitive. The test will pick up (be statistically significant for) smaller effect sizes.
* A larger effect size is more easily picked up by a statistical test. Larger effect sizes yield statistically significant results more easily, so they require smaller samples.
```
Deciding on our sample size, we should ask ourselves this question: What effect size should produce a significant test result? In the security metal detector example, at what minimum quantity of metal should the alert sound? To answer this question, we should consider the practical aims and context of our research.
### Practical relevance
Investigating the effects of a new medicine on a person's health, we may require some minimum level of health improvement to make the new medicine worthwhile medically or economically. If a particular level of improvement is clinically important, it is *practically relevant* (sometimes called practically significant).
If we have decided on a minimum level of improvement that is relevant to us, we want our test to be statistically significant if the average true health improvement in the population is at least of this size. We want to reject the null hypothesis of no improvement in this situation.
For media interventions such as health, political, or advertisement campaigns, one could think of a minimum change of attitude effected by the campaign relative to campaign costs. A choice between different campaigns could be based on their efficiency in terms of attitudinal change per cost unit.
Note the important difference between practical relevance and statistical significance. Practical relevance is what we are interested in. If the new medicine is sufficiently effective, we want our statistical test to signal it. In the security metal detector example: If a person carries too much metal, we want the detector to pick it up.
Statistical significance is just a tool that we use to signal practically relevant effects. Statistical significance is not meaningful in itself. For example, we do not want to have a security detector responding to a minimal quantity of metal in a person's dental filling. Statistical significance is important only if it signals practical relevance. We will return to this topic in Chapter \@ref(crit-discus).
### Unstandardized effect size
The difference between our sample outcome and the hypothesized value is the *unstandardized effect size*. If we test a mean, the unstandardized effect size is just the difference between our sample mean and the hypothesized population mean. For example, if we hypothesize that average candy weight in the population is 2.8 grams and we find an average candy weight in our sample bag of 2.75 grams, the unstandardized effect size is -0.05 grams. If a difference of 0.05 grams is a great deal to us, the effect is practically relevant.
Unstandardized effect sizes depend on the scale on which we measure the sample outcome. The unstandardized effect size of average candy weight changes if we measure candy weight in grams, milligrams, kilograms, or ounces. Of course, changing the scale does not affect the meaning of the effect size, but the number that we are looking at is very different: 0.05 grams, 50 milligrams, 0.00005 kilograms, or 0.00176 ounces. For this reason, we do not have rules of thumb for interpreting unstandardized effect sizes in terms of small, medium, or large effects.
### Standardized effect size: Cohen's *d* for one or two means
In scientific research, we rarely have precise norms for raw differences (unstandardized effects) that are practically relevant or substantial. For example, what would be a practically relevant attitude change among people exposed to a health campaign?
To avoid answering this difficult question, we can take the variation in scores (standard deviation) into account. In the context of the candies example, we will not be impressed by a small difference between observed and expected (hypothesized) average candy weight if candy weights vary a lot. In contrast, if candy weight is quite constant, a small average difference can be important.
For this reason, standardized effect sizes for sample means divide the difference between the sample mean and the hypothesized population mean by the standard deviation in the sample. Thus, we take into account the variation in scores. This standardized effect size for tests on one or two means is known as Cohen's *d*.
------------------------------------------------------------------------
These are the formulas for Cohen's *d* for a one-sample *t* test, a paired-samples *t* test, and an independent-samples *t* test (they will be provided if needed):
::: {style="column-count: 3; -moz-column-count: 3"}
```{=tex}
\begin{equation}
d_{\text{one sample}} = \frac{M - \mu_0}{SD}
\end{equation}
```
```{=tex}
\begin{equation}
d_{\text{paired samples}} = \frac{M_{\text{diff}} - \mu_{0,\text{diff}}}{SD_{\text{diff}}}
\end{equation}
```
```{=tex}
\begin{equation}
d_{\text{independent samples}} = \frac{2t}{\sqrt{df}}
\end{equation}
```
:::
Where:
- $M$ is the sample mean, $\mu_0$ is the hypothesized population mean, and $SD$ is the standard deviation in the sample,
- $M_{\text{diff}}$ is the difference between the two means in the sample, $\mu_{0,\text{diff}}$ is the hypothesized difference between the two means in the population, which is zero in the case of a nil hypothesis, and $SD_{\text{diff}}$ is the standard deviation of the difference in the sample,
- $t$ is the test statistic value and $df$ is the number of degrees of freedom of the *t* test.
------------------------------------------------------------------------
The sample outcome can be a single mean, for instance the average weight of candies, but it can also be the difference between two means, for example, the difference in colourfulness of yellow candies at the beginning and end of a time period. In the latter case, the standard deviation that we need is the standard deviation of colourfulness difference across all candies (Section \@ref(dependentsamples)). In the case of independent samples, such as average weight of red versus yellow candies, we need a special combined (*pooled*) standard deviation for yellow and red candy weight that is not reported by SPSS. Here, we use the *t* value and degrees of freedom to calculate Cohen's *d*.
Using an inventory of published results of tests on one or two means, Cohen [-\@RefWorks:3933] proposed rules of thumb for standardized effect sizes (ignore a negative sign if it occurs):
- 0.2: weak (small) effect,
- 0.5: moderate (medium) effect,
- 0.8: strong (large) effect.
Note that Cohen's *d* can take values above one. These are not errors; they reflect very strong or huge effects [@sawilowskyNewEffectSize2009].
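For readers who work in R rather than SPSS, here is a minimal sketch of these calculations; the candy weights and object names below are made up for illustration.

```{r, eval=FALSE}
# Hypothetical candy weights in grams.
weight <- c(2.75, 2.81, 2.79, 2.83, 2.77)

# One-sample d: (sample mean - hypothesized mean) / sample SD.
d_one <- (mean(weight) - 2.8) / sd(weight)

# Independent-samples d, approximated as twice the t value divided by
# the square root of the degrees of freedom (see the formula above).
yellow <- c(2.70, 2.85, 2.78, 2.90, 2.74)
red    <- c(2.80, 2.95, 2.88, 2.99, 2.84)
test   <- t.test(yellow, red, var.equal = TRUE)
d_ind  <- 2 * unname(test$statistic) / sqrt(unname(test$parameter))
```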
### Obtaining Cohen's *d* with SPSS
#### Instructions
```{r cohend, echo=FALSE, out.width="640px", fig.pos='H', fig.align='center', fig.cap="(ref:cohendtitle)", dev="png", screenshot.opts = list(delay = 5)}
knitr::include_url("https://www.youtube.com/embed/HmyW7HRM64Q", height = "360px")
# Unfortunately, the t test commands in SPSS have no option to calculate Cohen's
# d. It is, however, relatively easy to calculate Cohen's _d_ by hand from SPSS
# output. Remember that we must divide the unstandardized effect by the standard
# deviation.
#
# For a t test on one mean, the unstandardized effect is the difference between
# the sample mean and the hypothesized mean. SPSS reports this value in the
# column __Mean Difference__ of the table with test results. Drop any negative
# signs! Divide it by the standard deviation of the variable as given in Table
# __One-Sample Statistics__.
#
# In the example, Cohen's _d_ is 0.036 / 0.169 = 0.21. This is a weak effect.
#
# For a paired-samples t test, the unstandardized effect size is reported in the
# column __Mean__ in the Table __Paired Samples Test__. The standard deviation
# of the difference can be found in column __Std. Deviation__ in the same table.
# Divide the first by the second, for instance, 1.880 / 1.033 = 1.82. This is a
# strong effect.
#
# For an independent-samples t test, the situation is less convenient because
# SPSS does not report the pooled sample standard deviation that we need. The
# pooled sample standard deviation takes a sort of average of the outcome
# variable's standard deviations in the two groups. As an approximation, we can
# calculate Cohen's _d_ as follows: Double the t value and divide it by the square
# root of the degrees of freedom.
#
# In the example, Cohen's _d_ equals $(2 * 0.651) / \surd(18) = 0.31$. This is a
# moderate effect size.
```
#### Exercises
<A name="question5.2.3"></A>
```{block2, type='rmdquestion'}
3. Open data set [voters.sav](http://82.196.4.233:3838/data/voters.sav) that contains information about the attitude towards immigration among a random sample of voters. What are the unstandardized and standardized effect sizes if the hypothesized average attitude towards immigrants in the population is 6.0? [<img src="icons/2answer.png" width=115px align="right">](#answer5.2.3)
```
<A name="question5.2.4"></A>
```{block2, type='rmdquestion'}
4. What are the effect sizes if the null hypothesis states that the average attitude towards immigrants in the population is at least 6.0? And what if it states that average attitude is at most 6.0? [<img src="icons/2answer.png" width=115px align="right">](#answer5.2.4)
```
<A name="question5.2.5"></A>
```{block2, type='rmdquestion'}
5. What are the unstandardized and standardized effect sizes of a test in which we compare the attitude towards immigrants of young voters to the attitude of old voters? Again, use data set [voters.sav](http://82.196.4.233:3838/data/voters.sav). [<img src="icons/2answer.png" width=115px align="right">](#answer5.2.5)
```
### Association as effect size {#assoc-size}
Measures of association such as Pearson's product-moment correlation coefficient or Spearman's rank correlation coefficient express effect size if the null hypothesis expects no correlation in the population. If zero correlation is expected, a correlation coefficient calculated for the sample expresses the difference between what is observed (sample correlation) and what is expected (zero correlation in the population).
Effect size is also zero according to the standard null hypotheses used for tests on the regression coefficient (*b*), *R*^2^ for the regression model, and eta^2^ for analysis of variance. As a result, we can use the standardized regression coefficient (Beta in SPSS and *b*\* according to APA), *R*^2^, and eta^2^ as standardized effect sizes.
Because they are standardized, we can interpret their effect sizes using rules of thumb. For a standardized regression coefficient (*b*\*) or a correlation coefficient, for example, the rule of thumb could be as follows (ignore the sign, plus or minus, of the effect when you interpret its size):

- 0 to .10: no or a very weak association,
- .10 to .30: weak,
- .30 to .50: moderate,
- .50 to .80: strong,
- .80 to 1.00: very strong,
- exactly 1.00: a perfect association.
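As a sketch, this rule of thumb could be wrapped in a small R helper; the function below is hypothetical and not part of any package.

```{r, eval=FALSE}
# Label the size of a correlation or standardized regression coefficient
# using the rule of thumb above; the sign is ignored.
interpret_r <- function(r) {
  cut(abs(r),
      breaks = c(0, .10, .30, .50, .80, 1),
      labels = c("none/very weak", "weak", "moderate", "strong", "very strong"),
      include.lowest = TRUE)
}
interpret_r(c(-0.25, 0.45, 0.85))  # weak, moderate, very strong
```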
### Standardized effect size and sample size
We can use standardized effect size to express the effects that we are interested in. We choose whether small, moderate, or large effects are of practical interest to us. Preferably, we know from previous research whether small, moderate, or large effects are common in our type of research. If moderate or large effects are rare, we should use a sample size that allows detecting small effects. In contrast, when large effects occur frequently, we can do with a smaller sample that may overlook small effects.
If we know the effect size in the sample for which we want statistically significant results, we can figure out the minimum sample size for which the test statistic is statistically significant.
```{r sample-size, fig.pos='H', fig.align='center', fig.cap="What is the minimum sample size required for a significant two-sided test result if the sample mean has a particular effect size? The _p_ values belong to two-sided tests.", echo=FALSE, out.width="775px", screenshot.opts = list(delay = 5), dev="png"}
# Average candy weight sampling distribution approximated by a t distribution with two-sided 5% significance tails. Weak, moderate, and strong effect sizes (under null hypothesis that average candy weight is 2.8 in the population) marked by vertical red lines with two-sided p values for a sample with this average.
knitr::include_app("http://82.196.4.233:3838/apps/sample-size/", height="268px")
```
<A name="question5.2.6"></A>
```{block2, type='rmdquestion'}
6. Use the slider in Figure \@ref(fig:sample-size) to find the minimum sample size that we need for a statistically significant test result for each of the three effect sizes represented by red lines. [<img src="icons/2answer.png" width=115px align="right">](#answer5.2.6)
```
<A name="question5.2.7"></A>
```{block2, type='rmdquestion'}
7. What is the meaning of the _p_ values and why do they decrease if we increase sample size? [<img src="icons/2answer.png" width=115px align="right">](#answer5.2.7)
```
Effect size as well as test statistics reflect the difference between what we expect according to the null hypothesis and what we observe in our sample. As a consequence, effect size indicators and test statistics are related. In some cases, such as Cohen's *d*, the relation between effect size and test statistic is very simple.
The test statistic *t* for a *t* test on one mean, for example, is equal to Cohen's *d* times the square root of sample size. Here, the only difference between the two is sample size! Sample size influences the test statistic---the larger the sample, the larger the test statistic---but it does not influence effect size. This is an important difference.
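A quick R sketch illustrates this identity for a one-sample *t* test; the data are made up.

```{r, eval=FALSE}
# For a one-sample t test, t = d * sqrt(N) holds exactly, because
# t = (M - mu0) / (SD / sqrt(N)) and d = (M - mu0) / SD.
set.seed(1)
weight <- rnorm(50, mean = 2.85, sd = 0.5)  # hypothetical sample
t_value <- unname(t.test(weight, mu = 2.8)$statistic)
d <- (mean(weight) - 2.8) / sd(weight)
all.equal(t_value, d * sqrt(50))  # TRUE
```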
### Answers {.unnumbered}
<A name="answer5.2.1"></A>
```{block2, type='rmdanswer'}
Answer to Question 1.
* With sample size _N_ = 20, the samples with average candy weights of 2.40 and 3.05 grams (the red lines at the left and at the right) are statistically significant. These red lines cross a blue tail, which tells us that they are in the rejection region of the test. [<img
src="icons/2question.png" width=161px align="right">](#question5.2.1)
```
<A name="answer5.2.2"></A>
```{block2, type='rmdanswer'}
Answer to Question 2.
* Average weight in the sample differs by 0.25 grams from the hypothesized value (2.80 grams) for a sample average of 0.25 + 2.80 = 3.05. This value is marked by the red line at the right.
* The value 3.05 starts falling within the rejection region (under the blue
tails) at sample size 18. This is the minimum sample size we need to get a
statistically significant result for a sample average that differs 0.25 grams
from the hypothesized value.
* Average sample candy weight of 2.70, which is 0.10 below the hypothesized
value (2.80), becomes statistically significant at sample size 91. Samples of
this size or larger pick up differences that are too small to be of interest to
us. Their statistical significance gives a false alarm, so to speak.
* All sample sizes from 18 up to and including 90 serve our purpose, namely always signalling a difference of 0.25 grams but never signalling a difference of 0.10 grams or less. [<img src="icons/2question.png" width=161px align="right">](#question5.2.2)
```
<A name="answer5.2.3"></A>
```{block2, type='rmdanswer'}
Answer to Question 3.
* Execute a one-sample _t_ test with 6.0 as test value.
* The reported mean difference is -0.5; this is the unstandardized effect
size.
* SPSS (version 27 and later) reports that Cohen's _d_ is -0.26. This is a weak effect.
* If you have to calculate Cohen's _d_ by hand, divide the
unstandardized effect size by the standard deviation of the variable (here:
attitude towards immigration), which is reported to be 1.939: Cohen's _d_ = -0.5 / 1.939 = -0.26. [<img src="icons/2question.png" width=161px align="right">](#question5.2.3)
```
<A name="answer5.2.4"></A>
```{block2, type='rmdanswer'}
Answer to Question 4.
* In a one-sided test, we take the boundary value (here: 6.0) as the value
against which we test. It makes sense to use this value also for calculating
effect size provided that our sample result is on the correct side of the boundary value.
* If we hypothesize, for example, that average attitude towards immigrants is at least 6.0 in the population and we find a sample average of 5.5, the unstandardized and standardized effect sizes, then, are the same as in Question 3.
* In contrast, if our sample average had been 6.5, it would be covered by the null hypothesis (population average is at least 6.0). In this situation, we could argue that there is no difference between the hypothesized average(s) and the average in our sample, so there is no effect. [<img src="icons/2question.png" width=161px align="right">](#question5.2.4)
```
<A name="answer5.2.5"></A>
```{block2, type='rmdanswer'}
Answer to Question 5.
* Execute an independent-samples _t_ test with groups defined by the variable age_group.
* The reported mean difference is (-)0.718; this is the unstandardized effect size. Note that the mean difference is positive if you select the old as the first group and the young as the second group. Otherwise, it is negative.
* SPSS (version 27 and later) reports that Cohen's _d_ is (-)0.37. This is a moderate (medium) effect.
* If you calculate the standardized effect size (Cohen's _d_) by hand, you should use the _t_ value and degrees of freedom from the top row. Divide twice the _t_ value by the square root of the degrees of freedom: Cohen's _d_ = 2 * 1.201 / √(64) = 2.402 / 8 = 0.30. This is a
weak effect.
* In this particular case, the value of Cohen's _d_ reported by SPSS and the manually calculated value are slightly different, probably because we have such a small group of young voters in our sample.
* Note that the sample of young voters is too small to conduct a _t_ test here. [<img src="icons/2question.png" width=161px align="right">](#question5.2.5)
```
<A name="answer5.2.6"></A>
```{block2, type='rmdanswer'}
Answer to Question 6.
* We need a sample of minimum size 9 to have a statistically significant result
for a sample with average weight 2.40 grams, which represents a strong effect.
With 9 observations in the sample, the red line of a strong effect is in the
2.5% tail of the sampling distribution. The _p_ value is below .05.
* For a moderate effect, we need a sample of at least 18 observations.
* For a weak effect, we need no less than 99 observations.
* By the way, some of these sample sizes are too small for using the t
distribution as an approximation of the sampling distribution. Here, the rule of
thumb (more than 30 cases) takes precedence. [<img src="icons/2question.png" width=161px align="right">](#question5.2.6)
```
<A name="answer5.2.7"></A>
```{block2, type='rmdanswer'}
Answer to Question 7.
* The _p_ values give the (two-sided) probability of drawing a sample with average
candy weight at least as different from the hypothesized value as the value on
the horizontal axis.
* For example, the leftmost red line represents a strong effect for a sample
with average weight 2.40 grams under the null hypothesis that average candy
weight is 2.80 in the population. The associated _p_ value informs us that we
have 14.8% probability of drawing a sample with average weight at least 0.40
(2.80 - 2.40) grams different from the hypothesized value (2.80). Differences
of at least 0.40 represent strong effects in this example, so the _p_ value
tells us the probability of finding a strong effect in our sample if the null
hypothesis is true.
* The larger the sample, the smaller the standard error, the narrower the
sampling distribution, the smaller the probability of drawing a sample with an
average candy weight that differs from the hypothesized value purely by
chance. [<img src="icons/2question.png" width=161px align="right">](#question5.2.7)
```
## Hypothetical World Versus Imaginary True World
In the preceding paragraphs, we determined sample size using the effect size that we expect to find in our sample. We should realize, however, that we are interested in the effect size in the population. The 'true' effect size, so to speak.
The effect of a new medicine or media campaign in our sample is not important but the effect in the population is. This complicates the calculation of the sample size that we need. Instead of using the effect size in our (future!) sample, we must use the effect size in the population.
### Imagining a population with a small effect
Our null hypothesis states that average candy weight in the population is 2.8 grams. Let us decide that a small effect size is practically relevant. We can think now of a population that could be the true population if the effect size is small. For example, a population in which average candy weight is 2.9 grams (and the standard deviation is 0.5).
We do not know whether average candy weight is 2.9 grams in the true population. So we may regard the statement that average candy weight is 2.9 grams in the population as another hypothesis. Let us call this the alternative hypothesis *H*~1~. Note that this is not an ordinary alternative hypothesis because it does not include all outcomes not covered by the null hypothesis (*H*~0~). Instead, it represents only one value, which is an important value to us because it represents a population with an interesting effect size.
```{block2, type='rmdpearson'}
Our habit of formulating a null hypothesis and an alternative hypothesis for all situations not covered by the null hypothesis is generally attributed to the statistician R.A. Fisher. This, however, is not entirely correct [see, e.g., @RefWorks:3931]. Fisher introduced the concept of a null hypothesis [@RefWorks:3932: 18] but not the concept of an alternative hypothesis.
The statisticians Jerzy Neyman and Egon Pearson introduced the idea of working with two or more hypotheses. Their hypotheses, however, do not cover all possible population values, and they were usually not called a null and an alternative hypothesis. They specify two or more different population values, and a statistical test is used to determine which of the hypotheses fits the sample best [@RefWorks:3906].
Egon Pearson. Photo by Grasso Luigi, Wikimedia Commons, CC BY-SA 4.0
```
Figure \@ref(fig:TypeI-II-errors) illustrates this situation. Before reading on, try to make sense of the steps in this figure. The questions accompanying the figure walk you through these steps.
```{r TypeI-II-errors, echo=FALSE,fig.pos='H', fig.align='center', fig.cap="The relation between significance tests, effect size, Type II error, and power.", screenshot.opts = list(delay = 5), dev="png", out.width="775px"}
#This Shiny app illustrates Type I and Type II Error. Interaction helps to understand how significance level and power are related.
#Source: Adapted from Tarik Gouhier, type1vs2-master, https://github.com/tgouhier/type1vs2
knitr::include_app("http://82.196.4.233:3838/apps/type1vs2/", height="310px")
```
<A name="question5.3.1"></A>
```{block2, type='rmdquestion'}
1. What is the rejection region in the graph provided in Step 1 of Figure \@ref(fig:TypeI-II-errors)? [<img src="icons/2answer.png" width=115px align="right">](#answer5.3.1)
```
<A name="question5.3.2"></A>
```{block2, type='rmdquestion'}
2. Is the null hypothesis true or false if the sample mean is not in the rejection region? [<img src="icons/2answer.png" width=115px align="right">](#answer5.3.2)
```
<A name="question5.3.3"></A>
```{block2, type='rmdquestion'}
3. Can we make a Type I error if the sample mean is not in the rejection region? Explain the meaning of a Type I error in your answer. [<img src="icons/2answer.png" width=115px align="right">](#answer5.3.3)
```
<A name="question5.3.4"></A>
```{block2, type='rmdquestion'}
4. What is the size of the unstandardized effect size that is deemed practically relevant in Step 2? [<img src="icons/2answer.png" width=115px align="right">](#answer5.3.4)
```
<A name="question5.3.5"></A>
```{block2, type='rmdquestion'}
5. If H~1~ is true, which area under the sampling distributions represents the probability that we draw a sample with average candy weight that is less than the average of our sample (2.84 grams)? [<img src="icons/2answer.png" width=115px align="right">](#answer5.3.5)
```
<A name="question5.3.6"></A>
```{block2, type='rmdquestion'}
6. In Step 3, the yellow area represents a set of samples. What will researchers do with the null hypothesis if they draw one of these samples? [<img src="icons/2answer.png" width=115px align="right">](#answer5.3.6)
```
<A name="question5.3.7"></A>
```{block2, type='rmdquestion'}
7. If H~1~ is true, the yellow area is a probability. A probability of what? [<img src="icons/2answer.png" width=115px align="right">](#answer5.3.7)
```
<A name="question5.3.8"></A>
```{block2, type='rmdquestion'}
8. The green area represents the power of a test in Step 4. Describe what power is in terms of rejecting the null hypothesis. [<img src="icons/2answer.png" width=115px align="right">](#answer5.3.8)
```
### The world of the researcher
We have two populations, a hypothetical population defined by our null hypothesis (H~0~) and an imaginary true population defined by the alternative hypothesis (H~1~). Once we have drawn our sample, we only deal with the hypothetical population as we have done in all preceding chapters.
Acting as if the null hypothesis is true, we determine how (un)likely the sample is that we have drawn. If it is very unlikely, we have a *p* value below the significance level and we reject the null hypothesis. We say: If the null hypothesis is true, our sample is too unlikely, so we reject the null hypothesis.
We can be wrong. Perhaps the null hypothesis is actually true and we were just very unfortunate to draw a sample that is very different from the population. If so, we make a Type I error (see Section \@ref(sig-typeI)). The probability that we will make this error is the significance level of the test, which is usually set to .05.
This is what we are doing once we have the sample. Let us call this the world of the researcher.
### The alternative world of a small effect
Now let us ask ourselves: What is going to happen to our statistical test if the true population from which we draw our sample has average candy weight that is a bit higher (small effect) than candy weight according to our null hypothesis?
If we actually sample from this imaginary true population, the sampling distribution centered around the alternative hypothesis (H~1~) in Figure \@ref(fig:TypeI-II-errors) represents our true sampling distribution. It shows us the true probabilities (areas under the curve) of drawing a sample with a particular minimum or maximum value. These are the probabilities of our sample if there is a small effect in the population.
Now that we know the true sampling distribution if there is a small effect in the population, we can foresee what is going to happen when we enter the world of the researcher. The researcher is going to use the rejection region to decide on the null hypothesis. If the sample mean is in the rejection region, the researcher is going to reject the null hypothesis. Otherwise, the null hypothesis is not rejected.
### Type II error {#typeIIerror}
If there is a (small) effect in the population, the null hypothesis is not true. For example, average candy weight is not 2.8 grams, it is 2.9 grams in the population. If our sample mean is close to 2.8 grams, we may not reject the null hypothesis even if it is not true. This is a *Type II error*: not rejecting a false null hypothesis.
The probability that we make a Type II error if there is a small effect is expressed by the yellow section in Figure \@ref(fig:TypeI-II-errors), Steps 3 and 4. It is usually denoted by the Greek letter beta ($\beta$). The yellow section represents the probability of drawing a sample from the population with a small effect size that is *not* in the rejection region, so the null hypothesis is *not* rejected.
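If we are willing to assume a sample size, we can calculate $\beta$ directly. The sketch below assumes a sample of 222 candies (the figure does not report its sample size; 222 approximately reproduces the rejection region of 2.73 to 2.87 grams used in this example) and uses the standard deviation of 0.5 from the example.

```{r, eval=FALSE}
# Type II error probability (beta) for the candy example, assuming
# H0: mu = 2.8, H1: mu = 2.9, SD = 0.5, n = 222, two-sided alpha = .05.
se   <- 0.5 / sqrt(222)                    # standard error of the mean
crit <- 2.8 + c(-1, 1) * qnorm(.975) * se  # rejection region boundaries
# Probability that a sample mean from the H1 population falls between
# the boundaries, so that we do not reject the false null hypothesis.
beta <- pnorm(crit[2], mean = 2.9, sd = se) -
        pnorm(crit[1], mean = 2.9, sd = se)
beta  # about 0.15 with this normal approximation; the t distribution
      # gives a slightly larger beta, so power of about 0.84
```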
Table \@ref(tab:errortable) summarizes the four possible situations that may arise if we test a null hypothesis. The null hypothesis may be true or false and we may or may not reject the null hypothesis.
```{r errortable, echo=FALSE, screenshot.opts=list(delay = 2)}
knitr::kable(rbind(c("Null is rejected", "Type I error, Significance level (alpha)", "No error, Power (1 - beta)"), c("Null is not rejected", "No error, (1 - alpha)", "Type II error, (beta)")), col.names = c("", "Null is true", "Null is false"), caption = "Error types and their probabilities.", booktabs = TRUE) %>%
kable_styling(font_size = 12, full_width = F,
latex_options = c("scale_down", "HOLD_position"))
```
### Power of the test
The probability of *not* making a Type II error is called the *power of the test*. It is equal to one minus the probability of making a Type II error, that is, 1 - $\beta$. The power of the test is represented by the green section in Figure \@ref(fig:TypeI-II-errors), Step 4. It represents the probability of getting a sample that makes the researcher reject the null hypothesis. So a false null hypothesis is rejected and we do not make an error.
In the example of Figure \@ref(fig:TypeI-II-errors) Step 4, the power of the test is 84 per cent (0.84). If average candy weight is 2.9 grams in the population, we have 84 per cent probability of rejecting the null hypothesis that it is 2.8 grams if we draw a sample (of the current size) from this population. This is usually considered an acceptable level of power (see Section \@ref(sizeeffectpower)).
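Power can also be approximated by simulation: draw many samples from the imaginary true population and count how often the test rejects the null hypothesis. This sketch uses the same assumed sample size (222) and the example's standard deviation (0.5).

```{r, eval=FALSE}
# Simulated power: proportion of samples from the H1 population
# (mean 2.9) in which a two-sided t test rejects H0: mu = 2.8.
set.seed(123)
rejected <- replicate(10000, {
  x <- rnorm(222, mean = 2.9, sd = 0.5)  # assumed population values
  t.test(x, mu = 2.8)$p.value < .05
})
mean(rejected)  # close to 0.84
```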
### Answers {.unnumbered}
<A name="answer5.3.1"></A>
```{block2, type='rmdanswer'}
Answer to Question 1.
* The rejection region contains all sample outcomes (average candy weight in
this example) for which we reject the null hypothesis. These are the sample
outcomes in the blue tails of the sampling distribution, where the horizontal axis is red: all averages below
2.73 and all averages over 2.87. [<img src="icons/2question.png" width=161px
align="right">](#question5.3.1)
```
<A name="answer5.3.2"></A>
```{block2, type='rmdanswer'}
Answer to Question 2.
* We don't know if the null hypothesis is true or false. Chapter \@ref(hypothesis) should have taught us that we reject the null hypothesis or not, but this does not prove that the null hypothesis is false or true. [<img
src="icons/2question.png" width=161px align="right">](#question5.3.2)
```
<A name="answer5.3.3"></A>
```{block2, type='rmdanswer'}
Answer to Question 3.
* No, if we do not reject the null hypothesis, we cannot reject a true null
hypothesis. Type I error is rejecting a null hypothesis that is true (Section
\@ref(sig-typeI)). [<img src="icons/2question.png" width=161px
align="right">](#question5.3.3)
```
<A name="answer5.3.4"></A>
```{block2, type='rmdanswer'}
Answer to Question 4.
* In general, an effect size is the difference (horizontal distance in the figure) between the population value according to the null hypothesis and either an (imaginary) true population value or the value found in a sample.
* H~1~ specifies an imaginary true population value that differs sufficiently from the null hypothesis to be practically relevant.
* The effect size is the difference between average candy weight according to H~1~ (2.9 grams) and according to H~0~ (2.8 grams), so the effect size is 0.1 (grams).
* The effect size is unstandardized because it is expressed in the original unit of measurement: weight in grams. We did not standardize the effect size using a standard deviation.
* Note that the unstandardized effect size of the sample is the difference between average candy weight in the sample (2.84 grams) and average weight according to H~0~ (2.8 grams), so this effect size is 0.04 (grams). [<img src="icons/2question.png" width=161px align="right">](#question5.3.4)
```
<A name="answer5.3.5"></A>
```{block2, type='rmdanswer'}
Answer to Question 5.
* The sampling distribution of H~1~ is the actual sampling distribution from which we draw our sample because the true population value is the center of the true sampling distribution. So we must look at this sampling distribution.
* The area under this curve to the left of the red line representing the sample mean (2.84) gives the probability of drawing a sample with average candy weight less than our sample. [<img src="icons/2question.png" width=161px align="right">](#question5.3.5)
```
<A name="answer5.3.6"></A>
```{block2, type='rmdanswer'}
Answer to Question 6.
* Researchers will not reject the null hypothesis because the sample is not in the rejection area. All samples with average candy weight between 2.73 and 2.87 grams are sufficiently close to the value of the null hypothesis: 2.8 grams.
[<img src="icons/2question.png" width=161px align="right">](#question5.3.6)
```
<A name="answer5.3.7"></A>
```{block2, type='rmdanswer'}
Answer to Question 7.
* If H~1~ is true, the right-hand curve is the true sampling distribution because it is centered around H~1~. The yellow area, then, is the true probability of drawing a sample with an average candy weight that does not reject the null hypothesis (see Question 6).
* In short, the yellow area is the probability that we do not reject the null hypothesis (H~0~) if it is false because H~1~ is true.
* Not rejecting a false null hypothesis is called a Type II error.
[<img src="icons/2question.png" width=161px align="right">](#question5.3.7)
```
<A name="answer5.3.8"></A>
```{block2, type='rmdanswer'}
Answer to Question 8.
* The green area is positioned above the horizontal blue lines, so samples in the green area are in the rejection region of the null hypothesis test. With these samples, we reject the null hypothesis.
* The green area is the probability---remember, areas under sampling distributions represent probabilities---of rejecting the null hypothesis if H~0~ is false and H~1~ is true. This probability is called the power of a test.
* Rejecting a false null hypothesis means that we do not make a Type II error. So power is the probability of not making a Type II error.
[<img src="icons/2question.png" width=161px align="right">](#question5.3.8)
```
## Sample Size, Effect Size, and Power {#sizeeffectpower}
Finally, after all these sections, we can answer the question raised in the beginning of this chapter: How large should my sample be? To answer this question, we must consider effect size, type of test, significance level, and test power.
```{r sample-size-power, fig.pos='H', fig.align='center', fig.cap="How does test power depend on effect size, type of test, significance level, and sample size? Sampling distributions of the sample mean under the null hypothesis (H~0~, left-hand curve) and under the assumed true value of the population mean (H~1~, right-hand curve) for a one-sample _t_ test.", echo=FALSE, out.width="775px", screenshot.opts = list(delay = 5), dev="png"}
# Shiny app to determine sample size for a specified (standardized) effect size,
# significance level, and test power for a (simple) one-sample _t_ test, using the
# pwr:: package. Illustrate power with hypothesized and true sampling
# distributions as _t_ distributions (as in app reshyp-althyp) with effect size on
# x axis. Sliders or inputs for standardized effect size (0.2 = small, 0.5 =
# medium, 0.8 = large), significance level (90% two-sided, 90% one-sided, 95%
# two-sided, 95% one-sided, 99% two-sided, 99% one-sided), and test power {50%,
# 80%, 90%, 95%, 99%}.
# simplify PS.shiny_master (doesn't work yet?) Use R code from
# http://powerandsamplesize.com/Calculators/ in our own app?
knitr::include_app("http://82.196.4.233:3838/apps/sample-size-power2/", height="385px")
```
<A name="question5.4.1"></A>
```{block2, type='rmdquestion'}
1. Figure \@ref(fig:sample-size-power) shows the sampling distributions of the sample mean under the null hypothesis (H~0~, left-hand curve) and under the assumed true value of the population mean (H~1~, right-hand curve). Explain the meaning of the yellow and green surfaces in the graph. [<img src="icons/2answer.png" width=115px align="right">](#answer5.4.1)
```
<A name="question5.4.2"></A>
```{block2, type='rmdquestion'}
2. How does test power change if we want our statistical test to detect a smaller effect, for example, a standardized effect of 0.2 instead of 0.5? Move the standardized effect size slider to check your answer. [<img src="icons/2answer.png" width=115px align="right">](#answer5.4.2)
```
<A name="question5.4.3"></A>
```{block2, type='rmdquestion'}
3. How does test power change if we change our two-sided test into a one-sided test? Change the test type to check your answer. [<img src="icons/2answer.png" width=115px align="right">](#answer5.4.3)
```
<A name="question5.4.4"></A>
```{block2, type='rmdquestion'}
4. How does test power change if we select a lower significance level (lower probability of making a Type I error), for example, $\alpha$ = 0.01? Move the significance level slider to check your answer. [<img src="icons/2answer.png" width=115px align="right">](#answer5.4.4)
```
<A name="question5.4.5"></A>
```{block2, type='rmdquestion'}
5. How does test power change if we increase sample size? Explain what happens. Move the sample size slider to check your answer. [<img src="icons/2answer.png" width=115px align="right">](#answer5.4.5)
```
<A name="question5.4.6"></A>
```{block2, type='rmdquestion'}
6. Which minimum sample size do we need to have 80% probability that a one-sided test is statistically significant at the 5% significance level if there is a moderately strong effect (standardized effect size 0.5) in the population? Use the sample size slider in Figure \@ref(fig:sample-size-power) to find the minimum sample size. [<img src="icons/2answer.png" width=115px align="right">](#answer5.4.6)
```
<A name="question5.4.7"></A>
```{block2, type='rmdquestion'}
7. Effects tend to be small (0.2) rather than moderate (0.5) in communication science. What is the test power for a small effect with the sample size that you obtained in Question 6 and how large must the sample be if we want 80% test power for a test with a small effect? [<img src="icons/2answer.png" width=115px align="right">](#answer5.4.7)
```
Sample size, statistical significance, effect size, and test power are related. To determine the size of your sample, you have three sliders that you should adjust simultaneously. Statistical significance is the easiest slider to decide on; we usually leave the significance level at .05. We do not select a smaller value because it will reduce the power of the test (with the same sample size and effect size) as you may have noticed while answering Question 4.
For effect size, we have to choose between a small, moderate, or large effect. Previous results of research similar to our research project can help us decide whether we should expect small effect sizes or not. If we have a concrete number for the (standardized) minimum effect size that is of practical relevance, we can use that number.
For power, the conventional rule of thumb is that we like to have at least 80 per cent probability of rejecting a false null hypothesis. You may note that the probability of *not* rejecting a true null hypothesis is higher: 95 per cent. Remember, the probability of rejecting a true null hypothesis, which is the significance level, is usually set to five per cent, so the probability of not rejecting a true null hypothesis is 95 per cent.
Power is set to a lower level because the null hypothesis is usually assumed to reflect our best knowledge about the world. From this perspective, we are keener on avoiding the error of falsely rejecting the null hypothesis (our current best knowledge) than falsely not rejecting (accepting) it. This approach, however, is not without criticisms as we will discuss in Chapter \@ref(crit-discus). Anyway, if you want to raise the power to the same level of .95, you can do so; it will require a larger sample.
Unfortunately, test power receives little attention in several software packages for statistical analysis. Calculating the required sample size from power and effect size is usually not possible within the package. To calculate sample size, we need dedicated software, for example [GPower](http://www.gpower.hhu.de/).
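In R, the `pwr` package (which is also used by the app in Figure \@ref(fig:sample-size-power)) provides such calculations. Here is a minimal sketch for the one-sample *t* test of this chapter: a moderate effect (*d* = 0.5), a one-sided test, 5% significance level, and 80% power.

```{r, eval=FALSE}
library(pwr)  # install.packages("pwr") if needed

# Minimum sample size for a one-sided one-sample t test that detects
# a moderate effect (d = 0.5) with 80% power at the 5% level.
pwr.t.test(d = 0.5, sig.level = .05, power = .80,
           type = "one.sample", alternative = "greater")
# n comes out at about 26; always round a fractional n upwards.
```

Leaving exactly one of the arguments `n`, `d`, `power`, or `sig.level` unspecified (`NULL`) makes `pwr.t.test()` solve for that quantity.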
### So how do we determine sample size?
All in all, using effect size and test power to determine the size of the sample requires several decisions on the part of the researcher. It can be difficult to specify the effect size that we should expect or that is practically relevant. If there is little prior research comparable to our new project, we cannot reasonably specify an effect size and calculate sample size.
Of course, it is important to ensure that our sample meets the requirements of the tests that we want to apply (Section \@ref(size-test-req)). In practice, researchers often go well beyond this minimum. They try to collect as large a sample as is feasible just to be on the safe side.
Does this mean that all we have learned about effect size and test power is useless? Certainly not. First of all, we should have learned that effect size is more important than statistical significance because effect size relates to practical relevance.
Second, test power and Type II errors are important in situations in which we do not reject the null hypothesis. Then, we should calculate test power to get an impression of our confidence in the result. Is our test of sufficient power to yield significant results if there is an effect in the population? This is the topic of the next section.
### Answers {.unnumbered}
<A name="answer5.4.1"></A>
```{block2, type='rmdanswer'}
Answer to Question 1.
* The yellow area gives the probability that we draw a sample from a population with true mean H~1~ which does not reject the null hypothesis. With these samples, we do not reject a false null hypothesis, so we make a Type II error. The yellow area is the probability of making a Type II error.
* The green area is the probability that we draw a sample from a population with true mean H~1~ that rejects the null hypothesis. Here, we reject a false null hypothesis, so we do not make a Type II error. This probability is the power of the test. [<img src="icons/2question.png" width=161px align="right">](#question5.4.1)
```
<A name="answer5.4.2"></A>
```{block2, type='rmdanswer'}
Answer to Question 2.
* It is more difficult to show that a small effect is statistically significant than a large effect. After all, the (assumed) true value (under H~1~) is closer to the hypothesized value (under H~0~) in case of a small effect. The two sampling distributions are closer together. Our sample outcome, for example average candy weight, will be closer to the hypothesized average candy weight if the effect is small.
* Therefore, it is more difficult to reject the null hypothesis. We have a lower probability of rejecting the null hypothesis if it is actually false. In other words, we have lower test power. [<img src="icons/2question.png" width=161px align="right">](#question5.4.2)
```
<A name="answer5.4.3"></A>
```{block2, type='rmdanswer'}
Answer to Question 3.
* With a one-sided test, the total significance level (the area cut off in the tails of the sampling distribution around H~0~) is situated in one tail. The cut point (the critical value) is closer to the hypothesized value if we change our two-sided test into a one-sided test.
* If this tail is at the right side, namely, at the side of the true population mean, we are more likely to reject the null hypothesis with a sample from the true population (the green area becomes larger). Test power increases. [<img src="icons/2question.png" width=161px align="right">](#question5.4.3)
```
<A name="answer5.4.4"></A>
```{block2, type='rmdanswer'}
Answer to Question 4.
* A lower significance level, for instance, 1% instead of 5%, decreases test
power. With a lower significance level, it is more difficult to reject the
null hypothesis (the cut-off tails become smaller), so we are less likely to
reject a false null hypothesis.
* By the way, this is the main reason why we usually do not use a significance
level below 5%: our test power would be smaller, so the probability of a Type
II error would be higher. [<img src="icons/2question.png" width=161px
align="right">](#question5.4.4)
```
<A name="answer5.4.5"></A>
```{block2, type='rmdanswer'}
Answer to Question 5.
* Increasing sample size increases test power. Increasing sample size is the researcher's main tool for increasing test power.
* A larger sample has a smaller standard error, so the sampling distributions become narrower, more peaked.
* As a consequence, the critical values that cut off the tails move towards the center of the sampling distribution of the null hypothesis. Sample outcomes that are closer to the null hypothesis are now statistically significant. It is easier to reject the null hypothesis.
* A larger part of the sampling distribution according to the (assumed) true population value is in the area where we reject the null hypothesis. We have higher test power. [<img src="icons/2question.png" width=161px
align="right">](#question5.4.5)
```
<A name="answer5.4.6"></A>
```{block2, type='rmdanswer'}
Answer to Question 6.
* First, note that the probability that a test is statistically significant if there is a particular effect in the population is a description of test power. So we have to find the sample size that yields 80% test power for the specified test.
* Set standardized effect size to 0.5, the type of test to one-sided, and the significance level to 5%.
* Now, change sample size by changing the sample size slider until the figure reports 80% test power. This happens with a sample containing 26 observations. By the way, this is below the minimum number of observations needed for a _t_ test (more than 30). If you want to make small adjustments to the sample size: click the sample size slider's handle and use the left and right arrow keys on your keyboard. [<img src="icons/2question.png" width=161px align="right">](#question5.4.6)
```
<A name="answer5.4.7"></A>
```{block2, type='rmdanswer'}
Answer to Question 7.
* Change the standardized effect size to 0.2 (small effect). The figure reports a test power of only 26%.
* At sample size 154, the power of a test with a small effect is 80%.
* Note that we need a sample of just 26 observations to obtain 80% test power in case of a moderate effect. In contrast, small effects quickly require large samples! [<img src="icons/2question.png" width=161px align="right">](#question5.4.7)
```
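If you want to check these numbers outside the app, a minimal sketch with the `pwr` package, assuming that the app implements a one-sample _t_ test:

```{r, eval=FALSE}
library(pwr)
# Question 6: medium effect (d = 0.5), one-sided test, 5% significance level.
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80,
           type = "one.sample", alternative = "greater")
# n comes out at about 26, matching the app.

# Question 7: small effect (d = 0.2), same test.
pwr.t.test(d = 0.2, sig.level = 0.05, power = 0.80,
           type = "one.sample", alternative = "greater")
# n comes out at about 156 here; the app reports 154. The small difference
# reflects implementation details, not a different conclusion: small effects
# require much larger samples.
```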
## Research Hypothesis as Null Hypothesis
As noted before (Section \@ref(null-alt)), the research hypothesis usually is the alternative hypothesis. We expect something to change, to be(come) different rather than be or stay the same. We expect an association to be present rather than absent.
In this situation, rejecting the null hypothesis, which is the nil hypothesis, supports our alternative hypothesis and hence our research hypothesis, so we are glad if we reject the null hypothesis. Of course, we know that we can be wrong: the null hypothesis may still be true even if the probability of drawing a sample like ours is so small that we have to reject the null hypothesis. Rejecting a true null hypothesis is a Type I error.
Fortunately, we know the probability of making this error because it is the significance level that we have chosen, five per cent usually. We can live with this probability of making an error if we reject the null hypothesis. So we are doubly glad: We found support for our research hypothesis *and* we know how confident we are about this support.
What if our research hypothesis is our null hypothesis? For example, we have a specific idea of average candy weight in the population from previous research or from specifications by the candy factory. Let us say that average candy weight is 2.8 grams according to the factory.
If we want to test whether the candies have the specified average weight, our research hypothesis would specify this average weight: Do candies weigh on average 2.8 grams in the population? Specifying a particular value, the research hypothesis must be the null hypothesis (Section \@ref(nullhypothesis)). In this example, H~0~: Average candy weight is 2.8 grams in the population.
```{r reshyp-althyp, eval=FALSE, echo=FALSE, fig.pos='H', fig.align='center', fig.cap="Sample size calculator."}
# Adapt app power-calculator-toy
# (https://github.com/alice-i-cecile/power-calculator-toy).
# Show sampling distribution (blue) for average candy weight according to
# research hypothesis (= null hypothesis): 2.8. Colour (green) the area for not
# rejecting the null and add 95% as label.
# Add sampling distribution (red) for true average candy weight with slider to
# change true population average (range [2, 6], initial value 3.1, different
# from 2.8). Colour (red) areas above the rejection region and add labels with
# percentages for areas (rounded).
# Adjusting the slider changes the location of true sampling distribution, the
# size of the areas under it representing test power, and associated
# percentages.
1. Figure \@ref(fig:reshyp-althyp) shows two sampling distributions for average candy weight. What does the green area represent?
2. What does the red area represent?
3. How large must the difference between true and hypothesized average candy weight be to obtain a power of .80?
```
If the research hypothesis is the null hypothesis because it contains a single (two-sided) or boundary (one-sided) value for the population parameter, we find support for our research hypothesis if we do *not* reject the null hypothesis. We can be wrong in not rejecting the null hypothesis. If we do not reject a null hypothesis that is actually false, we make a Type II error.
The significance level is irrelevant now, because the significance level is the probability of making a Type I error: if we do not reject the null hypothesis, we cannot possibly reject a true null hypothesis. Instead, the probability of making a Type II error is important, or rather, the probability of *not* making this error. This is the power of the test.
So if our research hypothesis represents the null hypothesis and our research hypothesis is supported (not rejected), we need test power to know how confident we can be about the support that we have found. Here, test power is relevant, not statistical significance.
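A sketch of what this means in numbers, assuming a one-sample _t_ test, the candy weight standard deviation of 0.5 grams used elsewhere in this chapter, and an illustrative sample of 100 candies:

```{r, eval=FALSE}
# How likely are we to detect the difference if true average candy weight
# is 2.9 grams rather than the hypothesized 2.8 grams?
library(pwr)
d <- (2.9 - 2.8) / 0.5  # standardized difference: 0.2, a small effect
pwr.t.test(n = 100, d = d, sig.level = 0.05,
           type = "one.sample", alternative = "two.sided")
# Power is only about 0.5: even with 100 candies, we would detect this
# deviation from 2.8 grams only about half the time. Not rejecting H0 is
# then weak support for the research hypothesis.
```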
```{r eval=FALSE, echo=FALSE}
#Removed section.
## Sample Size and Confidence Intervals
Is sample size relevant only for null hypothesis testing? No, of course not. Sample size determines the certainty and precision of our inferences: The larger the sample, the more certain and precise our inferences. A larger sample, then, yields narrower confidence intervals at a given confidence level or a higher confidence level for a confidence interval with a given width (precision).
If you are going to estimate a confidence interval for a parameter, you have to decide on the precision and confidence that you want to attain. Precision and confidence depend on the practical situation at hand. Imagine that you do an exit poll during elections. In an exit poll, you sample voters who have just voted and you ask them for which party or candidate they voted. It is your aim to predict the winner of the elections: Who is going to receive most votes?
If parties or candidates have very different vote shares, it may be satisfactory to have a relatively imprecise (wide) confidence interval. If one party is much larger than all other parties, a vote share confidence interval of ten percentage points may suffice to pinpoint the winner, for instance, Party A is going to have forty to fifty per cent of the votes with 95% confidence. In contrast, a very competitive election with little variation in vote shares among several leading parties or candidates requires a much more precise estimate and therefore a larger sample.
We need not go into the details of how to calculate the sample size to obtain a confidence interval with the required precision and confidence. Suffice it to say that we have to take into account two factors: the confidence level, which is just one minus the significance level, and the effect size, for example, the minimum difference in vote shares, that we deem relevant. Test power is not relevant here because we do not use a statistical test.
```
## Test Your Understanding
```{r test-power, echo=FALSE, fig.pos='H', fig.align='center', fig.cap="Effect size, power, Type I and Type II error.", screenshot.opts = list(delay = 5), dev="png", out.width="775px"}
# Use app type1vs2.
knitr::include_app("http://82.196.4.233:3838/apps/type1vs2fixed/", height="310px")
```
Figure \@ref(fig:test-power) shows sampling distributions for two worlds. To the left is the hypothetical world of the researcher. In this hypothetical world, the researcher's null hypothesis is true, namely that average candy weight is 2.8 grams in the population. To the right is the real world in which average candy weight is 2.9 grams. The standard deviation of candy weights in the sample is 0.5.
<A name="question5.6.1"></A>
```{block2, type='rmdquestion'}
1. What do the values on the horizontal axes (top and bottom) mean? [<img src="icons/2answer.png" width=115px align="right">](#answer5.6.1)
```
<A name="question5.6.2"></A>
```{block2, type='rmdquestion'}
2. What is the unstandardized and standardized effect size in the sample? Is the effect weak, moderate, or strong? [<img src="icons/2answer.png" width=115px align="right">](#answer5.6.2)
```
<A name="question5.6.3"></A>
```{block2, type='rmdquestion'}
3. What do the blue horizontal lines in Figure \@ref(fig:test-power) represent? [<img src="icons/2answer.png" width=115px align="right">](#answer5.6.3)
```
<A name="question5.6.4"></A>
```{block2, type='rmdquestion'}
4. What does the yellow section in the graph mean? [<img src="icons/2answer.png" width=115px align="right">](#answer5.6.4)
```
<A name="question5.6.5"></A>
```{block2, type='rmdquestion'}
5. What does the green section in the graph mean? [<img src="icons/2answer.png" width=115px align="right">](#answer5.6.5)
```
```{r sample-size-power2, fig.pos='H', fig.align='center', fig.cap="How does test power depend on effect size, type of test, significance level, and sample size? Sampling distributions of the sample mean under the null hypothesis (H~0~, left-hand curve) and under the assumed true value of the population mean (H~1~, right-hand curve) for a one-sample _t_ test.", echo=FALSE, out.width="775px", screenshot.opts = list(delay = 5), dev="png"}
# Shiny app to determine sample size for a specified (standardized) effect size,
# significance level, and test power for a (simple) one-sample t test, using the
# pwr:: package. Illustrate power with hypothesized and true sampling
# distributions as t distributions (as in app reshyp-althyp) with effect size on
# x axis. Sliders or inputs for standardized effect size (0.2 - small, 0.5 -
# medium, 0.8 - large), significance level (90% two-sided, 90% one-sided, 95%
# two-sided, 95% one-sided, 99% two-sided, 99% one-sided), and test power {50%,
# 80%, 90%, 95%, 99%}.
# simplify PS.shiny_master (doesn't work yet?) Use R code from
# http://powerandsamplesize.com/Calculators/ in our own app?
knitr::include_app("http://82.196.4.233:3838/apps/sample-size-power2/", height="385px")
```
<A name="question5.6.6"></A>
```{block2, type='rmdquestion'}
6. Which test has higher power: a one-sided test at 5% significance level or a two-sided test at 1% significance level? Explain how significance level and type of test (one-sided versus two-sided) affect test power. Use Figure \@ref(fig:sample-size-power2) to check your answer. [<img src="icons/2answer.png" width=115px align="right">](#answer5.6.6)
```
<A name="question5.6.7"></A>
```{block2, type='rmdquestion'}
7. Explain what happens to the two sampling distributions in Figure \@ref(fig:sample-size-power2) when you change the standardized effect size. [<img src="icons/2answer.png" width=115px align="right">](#answer5.6.7)
```
<A name="question5.6.8"></A>
```{block2, type='rmdquestion'}
8. Explain what happens to the two sampling distributions in Figure \@ref(fig:sample-size-power2) when you change the size of the sample. [<img src="icons/2answer.png" width=115px align="right">](#answer5.6.8)
```
### Answers {.unnumbered}
```{block2, type='rmdanswer', echo=!ch5}
Answers to the Test Your Understanding questions will be shown in the web book when the last tutor group has discussed this chapter.
```
<A name="answer5.6.1"></A>
```{block2, type='rmdanswer', echo=ch5}
Answer to Question 1.
* The values on the horizontal axes are all average candy weights. The bottom axis contains average candy weights in the sample: 2.73 grams is average candy weight corresponding to the lower critical value of a statistical test with 5% significance level; 2.84 grams is average candy weight in the sample that we have drawn; 2.87 grams is average candy weight corresponding to the upper critical value of a statistical test at 5% significance level. The top axis contains average candy weights in the population: 2.8 grams is the hypothesized average candy weight in the population; 2.9 grams is the (assumedly) true average candy weight in the population. [<img src="icons/2question.png" width=161px align="right">](#question5.6.1)
```
<A name="answer5.6.2"></A>
```{block2, type='rmdanswer', echo=ch5}
Answer to Question 2.
* The effect size in the sample is the difference between the hypothesized population value and the value found in the sample.
* The hypothesized population mean of candy weights is 2.8 grams and the sample mean is 2.84 grams, so the effect size equals 2.8 - 2.84 = (-)0.04 grams. This is the unstandardized effect size. Note that we do not care about the sign of the difference.
* For the standardized effect size, we divide the unstandardized effect size by the standard deviation of the variable. The standard deviation of candy weight (in the sample) is given as 0.5, so the standardized effect size is 0.04 / 0.5 = 0.08.
* This value is Cohen's _d_.
* According to the rules of thumb for interpreting Cohen's _d_, the effect is very weak because it is much less than 0.20.
* Note that this is the effect size *in the sample*. As an alternative, you can calculate the *true effect size* as the difference between the hypothesized value and the true value in the population. The true unstandardized effect size is 2.8 - 2.9 = (-)0.1. If we assume that the reported standard deviation of candy weights (0.5) applies to the population, the true standardized effect size is (2.8 - 2.9)/0.5 = (-)0.2. [<img src="icons/2question.png" width=161px align="right">](#question5.6.2)
```
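```{block2, type='rmdanswer', echo=ch5}
For those who want to check the arithmetic, the sketch below repeats these calculations in R; all values are taken from the figure and the question text.
```
```{r, eval=FALSE, echo=ch5}
# Effect sizes for the candy weight example; values from the figure and
# question text.
m_h0     <- 2.8   # hypothesized average candy weight (grams)
m_sample <- 2.84  # average candy weight in the sample (grams)
m_true   <- 2.9   # (assumedly) true average candy weight (grams)
s        <- 0.5   # standard deviation of candy weight (grams)

abs(m_h0 - m_sample)      # unstandardized effect size in the sample: 0.04
abs(m_h0 - m_sample) / s  # standardized effect size (Cohen's d): 0.08
abs(m_h0 - m_true) / s    # true standardized effect size: 0.2
```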
<A name="answer5.6.3"></A>
```{block2, type='rmdanswer', echo=ch5}
Answer to Question 3.
* The blue horizontal lines represent the values of the sample outcome (average candy weight) for which we reject the null hypothesis. This is called the rejection region of the test. [<img src="icons/2question.png" width=161px align="right">](#question5.6.3)
```
<A name="answer5.6.4"></A>
```{block2, type='rmdanswer', echo=ch5}
Answer to Question 4.
* The yellow section represents the probability of *not* rejecting a null hypothesis
if it is false. This is the probability of a Type II error. [<img src="icons/2question.png" width=161px align="right">](#question5.6.4)
```
<A name="answer5.6.5"></A>
```{block2, type='rmdanswer', echo=ch5}
Answer to Question 5.
* The green section represents the probability of rejecting a null hypothesis if it is false (because H~1~ is true). This is the power of the test. [<img src="icons/2question.png" width=161px align="right">](#question5.6.5)
```
<A name="answer5.6.6"></A>
```{block2, type='rmdanswer', echo=ch5}
Answer to Question 6.
* A lower significance level (1% instead of 5%) makes it more difficult to reject a null hypothesis. As a consequence, it is more difficult to reject a false null hypothesis: test power decreases.
* A two-sided test has only half of its significance level at the side of the true population value, whereas a one-sided test has all of it at the side of the true population value. This means that the relevant rejection region is smaller for a two-sided test than for a one-sided test. This reduces the probability of rejecting a false null hypothesis: lower test power.
* A one-sided test at 5% significance level, then, has two reasons for having more power than a two-sided test at 1% significance level. [<img src="icons/2question.png" width=161px align="right">](#question5.6.6)
```
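```{block2, type='rmdanswer', echo=ch5}
A numerical check of this answer with the `pwr` package, assuming a one-sample _t_ test with a medium effect (d = 0.5) and an illustrative sample of 30 observations:
```
```{r, eval=FALSE, echo=ch5}
library(pwr)
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05, type = "one.sample",
           alternative = "greater")$power    # one-sided at 5%: about 0.84
pwr.t.test(n = 30, d = 0.5, sig.level = 0.01, type = "one.sample",
           alternative = "two.sided")$power  # two-sided at 1%: about 0.49
# The one-sided test at 5% indeed has considerably higher power.
```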
<A name="answer5.6.7"></A>
```{block2, type='rmdanswer', echo=ch5}
Answer to Question 7.
* Note that effect size in general is the difference between what we expect according to the null hypothesis and what we (are going to) find in the sample. If the estimator is unbiased, the true population value is the expected value of the sample outcome, so it is what we expect to find in our sample before we have actually drawn one.
* The effect size is represented by the distance between H~0~ (the population value according to the null hypothesis) and H~1~ (the assumed true population value, that is, a population value for which the researcher hopes to reject the null hypothesis).
* With a larger effect size, then, the distance between H~0~ and H~1~ becomes larger: The distributions move away from each other. Actually, the assumed true sampling distribution moves away from the sampling distribution according to the null hypothesis. The latter distribution remains in the same place because it is fixed by the researcher's null hypothesis.
* If the distributions move apart, they overlap less and test power increases (the green area).
* A smaller effect size has the opposite result. [<img src="icons/2question.png" width=161px align="right">](#question5.6.7)
```
<A name="answer5.6.8"></A>
```{block2, type='rmdanswer', echo=ch5}
Answer to Question 8.
* A larger sample has a smaller standard error, so both sampling distributions become narrower, more peaked.
* As a consequence, the critical values move towards the center of the sampling distribution according to the null hypothesis. Sample outcomes that are closer to the null hypothesis are now statistically significant. It is easier to reject the null hypothesis.
* A larger part of the sampling distribution according to the (assumed) true population value is in the area where we reject the null hypothesis. We have higher test power.
* A smaller sample has the opposite effect. [<img src="icons/2question.png" width=161px align="right">](#question5.6.8)
```
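```{block2, type='rmdanswer', echo=ch5}
The sketch below traces this relationship, assuming a one-sample _t_ test with a medium effect (d = 0.5) tested one-sided at the 5% significance level; power is computed with the `pwr` package for a range of sample sizes.
```
```{r, eval=FALSE, echo=ch5}
# How test power grows with sample size for a one-sample t test
# (d = 0.5, one-sided, 5% significance level).
library(pwr)
n <- seq(5, 100, by = 5)
test_power <- sapply(n, function(k) {
  pwr.t.test(n = k, d = 0.5, sig.level = 0.05,
             type = "one.sample", alternative = "greater")$power
})
plot(n, test_power, type = "b", ylim = c(0, 1),
     xlab = "Sample size", ylab = "Test power")
abline(h = 0.80, lty = 2)  # conventional 80% power benchmark
```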
## Take-Home Points
- Effect size is the difference between the hypothesized value and the true (population) or observed (sample) value.
- Effect size is related to practical relevance. Effect sizes are expressed by (standardized) mean differences, regression coefficients, and measures of association such as the correlation coefficient, *R*^2^, and eta^2^.
- Statistical significance of a test depends on effect size and sample size. Because sample size affects statistical significance, it is wrong to use significance or a *p* value as an indication of effect size.
- If we do not reject a null hypothesis, this does *not* mean that the null hypothesis is true. We may make a Type II error: not rejecting a false null hypothesis. A researcher can make this error only if the null hypothesis is not rejected.
- The probability of making a Type II error is commonly denoted with the Greek letter beta ($\beta$).
- The probability of *not* making a Type II error is the power of the test.
- The power of a test tells us the probability that we reject the null hypothesis if there is an effect of a particular size in the population. The larger this probability, the more confident we are that we do not overlook an effect when we do not reject the null hypothesis.
- A practical way to increase test power: Draw a larger sample.