Statistical-Inference/04-hypothesis.qmd at master · ShKlinkenberg/Statistical-Inference · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
```{r setup, include=FALSE}
source("R_setup.R")
```

# Hypothesis testing {#sec-hypothesis}

> Key concepts: research hypothesis, statistical null and alternative hypothesis, nil hypothesis, test statistic, _p_ value, conditional probability, alpha, significance level (Type I error rate), Type I error, inflated type I error, type II error, Power, critical value, capitalization on chance, one-sided and two-sided tests and tests to which this distinction does not apply, rejection region, Bayesion statistics, confidence intervals.

### Summary {-}

In the preceding chapter, we have learned that a confidence interval contains the population values that are plausible, given the sample that we have drawn. In the current chapter, we apply our knowledge of sampling distributions, probability models and parameter estimation to hypothesis testing.

This chapter explores various methods for testing hypotheses. While we primarily focus on the widely used null hypothesis significance testing (NHST), we also discuss how confidence intervals and Bayesian statistics can aid in making decisions about hypotheses.

We will first extensively cover the framework of null hypothesis significance testing (NHST). @sec-null-hypothesis-significance-testing covers key concepts such as the null and alternative hypotheses, significance level (alpha), power of a test, p-values, and effect sizes. The section also discusses one-sided and two-sided tests and the importance of sample size in determining the power of a test.

In @sec-reporting we offer guidelines for reporting statistical test results. It emphasizes clarity and transparency in presenting findings to different audiences, including fellow scientists and general readers. The section covers the necessary components of a statistical report, such as test statistics, p-values, effect sizes, and confidence intervals.

@sec-test-selection on Statistical Test Selection guides the selection of appropriate statistical tests based on the data and research questions. It provides a framework for choosing tests by considering factors such as the type of data, the number of groups being compared, and the study design. The section includes flowcharts and examples to illustrate the decision-making process.

We continue with a discussion of confidence intervals as an alternative to hypothesis testing in @sec-null-ci. It explains how confidence intervals provide a range of plausible values for population parameters and how they can be used to make inferences about hypotheses. The section also discusses bootstrapped confidence intervals and their application.

We follow up with Bayesian hypothesis testing, contrasting it with frequentist methods. We explain the Bayesian approach of updating prior beliefs with data to obtain posterior probabilities. The section (@sec-bayesian-hypothesis-testing) covers the concepts of prior, likelihood, and posterior distributions, and how they are used to make decisions about hypotheses.

In the final section (@sec-critical-discussion) we critically examine the limitations and criticisms of null hypothesis significance testing. We discuss issues such as the misinterpretation of p-values, the overemphasis on statistical significance over practical significance, and the risks of data dredging and publication bias. The section advocates for a more nuanced understanding and reporting of statistical results.

## Hypothesis {#sec-binarydecision}

The assumption that a researcher wants to test is called a _research hypothesis_. It is a statement about the empirical world that can be tested against data. Communication scientists, for instance, may hypothesize that:

* a television station reaches half of all households in a country,
* media literacy is below a particular standard (for instance, 5.5 on a 10-point scale) among children,
* opinions about immigrants are not equally polarized among young and old voters,
* the celebrity endorsing a fundraising campaign makes a difference to adult's willingness to donate,
* more exposure to brand advertisements increases brand awareness among consumers,
* and so on.

These are statements about populations: all households in a country, children, voters, adults, and consumers. As these examples illustrate, research hypotheses seldom refer to statistics such as means, proportions, variances, or correlations. Still, we need a statistic to test a hypothesis. The researcher must translate the research hypothesis into a new hypothesis that refers to a statistic in the population, for example, the population mean. The new hypothesis is called a statistical hypothesis.

A statistical hypothesis is a statement about the empirical world that can be tested against data. It is a statement about the population, not the sample. For example, a hypothesis could be that the average age of a population is 30 years, that the average waight of candy bags is 500 grams, that the proportion of people that like a certain brand is .5, or that the correlation between two variables is .3. These are all statements about the population, not the sample.

Scientists test these hypotheses by following the _empirical cycle_ [@deGrootMethodologyFoundationsInference1969], which involves a systematic process of induction, deduction, testing, and evaluation. Based on the results, hypotheses can be either rejected or not rejected. If the hypothesis is based on theory and previous research, the scientist uses previous knowledge. As a next step, the researcher tests the hypothesis against data collected for this purpose. If the data contradict the hypothesis, the hypothesis is rejected and the researcher has to improve the theory. If the data does not contradict the hypothesis, it is not rejected and, for the time being, the researcher does not have to change the theory.

Statistical hypotheses usually come in pairs: a null hypothesis (_H_~0~) and an _alternative hypothesis_ (_H_~1~ / _H_~A~). We met the null hypothesis in the preceding sections. We use it to create a (hypothetical) sampling distribution. To this end, a null hypothesis must specify one value for the population statistic that we are interested in, for example, .5 as the proportion of yellow candies.

### Null hypothesis

The null hypothesis reflects the skeptical stance in research. It assumes that there is nothing going on. There is no difference between different conditions, the new intervention is not better than the previous, there is no correlation between variables, there is no predictive value to your regression model, a coin is fair, and so forth. @eq-Hnull shows some examples of null hypotheses expressed in test statistics.

$$
\begin{eqnarray*}
H_{0} & : \theta & = .5 \\
H_{0} & : \mu & = 100 \\
H_{0} & : t & = 0 \\
H_{0} & : \mu_1 & = \mu_2 \\
\end{eqnarray*}
$$ {#eq-Hnull}

The null hypothesis does not always assume that the population parameter is zero; it can take any specified value. For instance, a null hypothesis might state that there is no difference in intelligence between communication science students and the general population. In this case, we could compare the average intelligence score of a sample of communication science students to the known population average of 100. Although we are testing whether the difference is zero, in practical terms, we express the null hypothesis as the sample mean being equal to 100.

Though a null hypothesis can be expressed as a single value, that does not mean that we always get that specific value when we take a random sample. Given the null hypothesis that our candy factory machine produces bags with an average of 5 out of 10 yellow candies, there remains a probability that some bags will contain as few as one yellow candy or even none at all. This variability is illustrated in @fig-nulldistribution, which shows that while values near the expected 5 yellow candies are most likely, deviations can occur.

```{r}
#| label: fig-nulldistribution
#| fig-cap: "Discrete binomial distribution."
#| echo: false
#| warning: false

title = expression(paste("Null distributions ", theta, " = .5"))

colh0 = 'yellow'
colha = 'lightyellow'

barplot( dbinom(0:10, 10, .5),
         main=title,
         names.arg = 0:10,
         xlab      = "number of yellow candies",
         ylab      = "Probability",
         col       = colh0
         # beside    = TRUE
         # density   = c(0, 10)
         #legend.text = c("H0", "HT")
         )

# abline(v = c(6.5,27.5), lty=2, col='red')

# legend("topright", legend=c("H0"), fill=c(colh0))

```

### Alternative hypothesis {#sec-alternative-hypothesis}

The alternative hypothesis indicates what the researcher expects in terms of effects, differences, deviation from null. It is the operationalization of what you expect to find if your theory would be accurate. This would mean that our expected effect of difference would indeed reflect the true population value.

Lets assume for the case of our candy factory example, that the machines parameter is .2. We would expect the machine to produce bags with 2 out of 10 yellow candies, one in five yellow candies per bag. Assuming .2 as the machine's parameter does not ensure that every bag will contain exactly 2 yellow candies. Some bags will contain 0, 1, 3, 4, 5, 6, 7, 8, 9, or even 10 yellow candies. The probabilities for each can again be visualized using the exact discrete binomial probability distribution (@fig-altdistribution) as we did for the null hypothesis.

```{r}
#| label: fig-altdistribution
#| fig-cap: "Discrete alternative binomial distribution."
#| echo: false
#| warning: false

title = expression(paste("Alternative distribution ", theta, " = .2"))

colh0 = 'yellow'
colha = 'lightyellow'

barplot( dbinom(0:10, 10, .2),
         main=title,
         names.arg = 0:10,
         xlab      = "number of yellow candies",
         ylab      = "Probability",
         col       = colha
         # beside    = TRUE
         # density   = c(0, 10)
         #legend.text = c("H0", "HT")
         )

# abline(v = c(6.5,27.5), lty=2, col='red')

# legend("topright", legend=c("H0"), fill=c(colh0))

```

Note that the probability distribution for $H_0$ indicates the null assumption (for instance, that there is no effect), while the probability distribution for $H_A$ is based on the assumed population value. At this stage only the sample size, the number of candies in a bag (N=10), the null assumption and our knowledge about reality has been used to determine the distributions. No data has been gathered yet in determining these distributions.

When we do research, we do not know the true population value. If we did, no research would be needed of course. What we do have is theories, previous research, knowledge about what happened to a population (for instance, it was exposed to a public health campaign) and other empirical evidence. Based on this, we can make an educated guess about the true population value. This educated guess is expressed as the alternative hypothesis. The alternative hypothesis is the hypothesis that the researcher expects to find if the theory is accurate. It is the hypothesis that the researcher wants to test using data.

The alternative hypothesis is usually formulated as either not equal to the null hypothesis, or greater than or less than the null hypothesis. @eq-Halt shows some examples of alternative hypotheses expressed in test statistics.

$$
\begin{eqnarray*}
\text{Two-sided} \\
H_{A} & : \theta & \neq .5 \\
H_{A} & : t & \neq 0 \\
H_{A} & : \mu_1 & \neq \mu_2 \\
\text{One-sided} \\
H_{A} & : \theta & < .5 \\
H_{A} & : t & > 0 \\
H_{A} & : \mu_1 & < \mu_2 \\
\end{eqnarray*}
$$ {#eq-Halt}

The alternative hypothesis can be one-sided or two-sided. A two-sided alternative hypothesis states that the population value is not equal to the null hypothesis. A one-sided alternative hypothesis states that the population value is either greater than or less than the null hypothesis. The choice between a one-sided and two-sided test depends on the research question and the theory. We will cover one and two-sided testing more extensively in @sec-one-twosidedtests.

### Testing hypotheses

In the empirical cycle, the researcher tests a hypothesis against data collected for this purpose. The most widely used method for testing hypotheses is null hypothesis significance testing (NHST). As we will read in this chapter, there are other methods that can be used to test hypotheses, such as confidence intervals and Bayesian statistics.

All methods serve as decision frameworks that enable researchers to establish rules for evaluating their hypotheses. These rules are determined before data collection and are designed to minimize the risk of incorrect decisions. Null Hypothesis Significance Testing (NHST) manages this risk by defining it probabilistically. Confidence intervals provide a measure of accuracy through their width, while Bayesian statistics express this risk in terms of the credibility interval.

In the next chapters, we will cover the logic behind NHST, confidence intervals, and Bayesian statistics. We will also discuss how to select the appropriate statistical test for your research question, and how to report the results of your statistical tests.

## Null Hypothesis Significance Testing {#sec-null-hypothesis-significance-testing}

Null Hypothesis Significance Testing (NHST) is the most widely used method for statistical inference in the social sciences and beyond. The logic underlying NHST is called the Neyman Pearson approach [@RefWorks:3930]. Though these names are not widely known, the work of Jerzy Neyman (1894–1981) and Egon Pearson (1895–1980) still has a profound impact on the way current research is conducted, reviews are considered, and papers are published.

```{block2, type='rmdpearson', echo=FALSE}
Our habit of formulating a null hypothesis and an alternative hypothesis for all situations not covered by the null hypothesis is generally attributed to the statistician R.A. Fisher. This, however, is not entirely correct [see, e.g., @RefWorks:3931]. Fisher introduced the concept of a null hypothesis [@RefWorks:3932: 18] but not the concept of an alternative hypothesis.
The statisticians Jerzy Neyman and Egon Pearson introduced the idea of working with two or more hypotheses. But the two hypotheses do not cover all possible population values and they were usually not called a null and alternative hypothesis. They specify two or more different population values. A statistical test is used to determine which of the hypotheses fits the sample best. [@RefWorks:3906]

Egon Pearson. Photo by Grasso Luigi, Wikimedia Commons, CC BY-SA 4.0
```

The Neyman Pearson approach ensures tight control on the probability of making correct and incorrect decisions. It is a decision framework that gives you a clear criterion and also an indication of what the probability is that your decision is wrong. The decision in this regard, is either the acceptance or rejection of the $H_0$ hypothesis.

The Neyman Pearson approach is about choosing your desired probability of making correct and incorrect decisions, setting up the right conditions for this, and making a decision. It considers the following:

1. Alpha - Determine your desired risk of drawing the wrong conclusion.
2. Power - Determine your desired probability of drawing the correct conclusion.
3. The true effect size
4. The sample size needed to achieve desired power.
5. Conduct your research with this sample size.
6. Determine the test statistic.
7. Determine if $p$-value $\leq \alpha$. If so, reject $H_0$.

The two decisions can be visualized in a $2 \times 2$ table where in reality $H_0$ can be true or false, and the decision can either be to reject $H_0$ or not. @fig-decisiontable illustrates the correct and incorrect decisions that can be made. The green squares obviously indicate that it is a good decision to reject $H_0$ when it is in fact false, and not to reject $H_0$ if it is in reality true. And the red squares indicate that it is a wrong decision to reject $H_0$ when it is actually true (Type I error), or not reject $H_0$ if it is in reality false (Type II error).

![NHST decision table.](figures/decision_table.svg){width=500 #fig-decisiontable fig-align="left"}


Intuitively it is easy to understand that you would want the probability of an incorrect decision to be low, and the probability of a correct decision to be high. But how do we actually set these probabilities? Let's consider the amount of yellow candies from the candy factory again. In @sec-discreterandomvariable we learned that the factory produces candy bags where one fifth of the candies are supposed to be yellow. Now suppose we don't know this and our null hypothesis would be that half of the candies would be yellow. In @fig-expected-value you can set the parameter values to .5 and .2 and see what the discrete probability distributions look like.


As the candy factory produces bags with ten candies, we can look at both probability distributions. @fig-twobinom shows both distributions.

:::: {.columns}

::: {.column width="50%"}

* $H_0$ Distribution
    * Half of the candies in the bag are yellow
    * The parameter of the candy machine is .5
    * With expected value 5 out of 10

:::

::: {.column width="50%"}

* $H_T$ True distribution
    * One fifth of the candies in the bag are yellow
    * The parameter of the candy machine is .2
    * With expected value 2 out of 10

:::

::::

```{r}
#| label: fig-twobinom
#| fig-cap: "Discrete alternative binomial distributions."
#| echo: false
#| warning: false

title = "Both distributions"

colh0 = 'yellow'
colha = 'lightyellow'

barplot( rbind(dbinom(0:10, 10, .5),
               dbinom(0:10, 10, .2)),
         main=title,
         names.arg = 0:10,
         xlab      = "number of yellow candies",
         ylab      = "Probability",
         col       = rbind(colh0,colha),
         beside    = TRUE
         # density   = c(0, 10)
         #legend.text = c("H0", "HT")
         )

# abline(v = c(6.5,27.5), lty=2, col='red')

legend("topright", legend=c("H0","HT"), fill=c(colh0, colha))

```

We will use both distributions in @fig-twobinom to clarify the different components within the Neyman Pearson approach later in this chapter. For now, take a good look at both probability distributions, and consider a bag of candy containing 4 yellow candies. Are you able to determine if this bag is the result of a manufacturing process that produces bags with 20% or 50% yellow candies?

Doing research is essentially the same. You collect one sample, and have to determine if the effect of your study is non existent ($H_0 = \text{true}$) or that there is something going on ($H_0 \neq \text{true}$).

### Alpha {#sec-alpha}

The first step in the Neyman Pearson approach is to set the desired type I error rate, also known as the significance level, $\alpha$. This is the probability of rejecting the null hypothesis when it is assumed to be true. In the $2 \times 2$ decision table in @fig-decisiontable, this corresponds to the top left quadrant.

As a researcher, you decide how much risk you are willing to take to make a type I error. As the Neyman Pearson approach is a decision framework, you have to set this probability before you start collecting data. The most common value for $\alpha$ is .05, which means that you accept a 5% chance of making a type I error of rejecting the null hypothesis when it is in reality true.

In our yellow candy example, assuming the null hypothesis to be true, relates to the parameter value of .5 and the associated probability distribution shown in @fig-nulldistribution. We have already determined that if $H_0$ is true, it is still possible we could get a bag with 0 or 10 yellow candies. Deciding to reject the null hypothesis in any of these cases, would be wrong, because the null hypothesis is assumed to be true. The exact probabilities can be found on the y-axis of @fig-nulldistribution, and are also shown in the @tbl-nullprobtable below. Looking at the probability of getting 0 or 10 candies in @tbl-nullprobtable, we see that together this amounts to .002 or 0.2%. If we would decide to only reject the null hypothesis if we would get 0 or 10 candies, this would be a wrong decision, but we would also know that the chance of such a decision is pretty low. Our type I error, alpha, significance level, would be .002.

```{r}
#| label: tbl-nullprobtable
#| tbl-cap: "Probabilities of drawing a certain amount of yellow candies from a bag of 10 candies, assuming the null hypothesis to be true."
#| echo: false
#| warning: false


 library("tidyverse")
 library("kableExtra")
 options(kableExtra.auto_format = FALSE)

binomprob <- data.frame(Yellow = as.character(0:10),
                        Prob   = round(dbinom(0:10, 10, .5), 3))

names(binomprob) = c("#Y", "Pr H0")

kable(t( binomprob ),
              booktabs = TRUE ) %>% kable_styling(font_size = 10)
```

Choosing such an alpha level would result in a threshold between 0 and 1 and 9 and 10. We call this the **critical value** associated with the chosen alpha level. Where on the outside of the threshold we would reject the null hypothesis, and inside the threshold we would not reject the null hypothesis. So, if that is our decision criterion, we would reject the null hypothesis if we would draw a bag with 0 or 10 yellow candies, and not reject the null hypothesis if we would draw a bag with 1, 2, 3, 4, 5, 6, 7, 8, or 9 yellow candies. Amounting to a type I error rate of .002 or 0.2%. @fig-nulldistributionalpha shows the critical values for the null hypothesis distribution, and indicate what the decision would be for values on the outside and inside of the decision boundary.

```{r}
#| label: fig-nulldistributionalpha
#| fig-cap: "H0 binomial distribution with critical values."
#| echo: false
#| warning: false


title = expression(paste("Null distributions ", theta, " = .5"))

colh0 = 'yellow'
colha = 'lightyellow'

barplot( dbinom(0:10, 10, .5),
         main=title,
         names.arg = 0:10,
         xlab      = "number of yellow candies",
         ylab      = "Probability",
         col       = colh0
         # beside    = TRUE
         # density   = c(0, 10)
         #legend.text = c("H0", "HT")
         )

lines(c(1.5, 11.9), c(.22, .22), col='green', lwd=20)
text(6.7, .235, "Not reject H0", col='black', pos=1)

lines(c(0, 1.1), c(.22, .22), col='red', lwd=20)
text(.5, .235, "Reject", col='white', pos=1)

lines(c(12.3, 13.5), c(.22, .22), col='red', lwd=20)
text(12.9, .235, "Reject", col='white', pos=1)

abline(v = c(1.3,12.1), lwd = 2, col='darkgreen')
text(1.3, .1, "Critial value", col='darkgreen', pos=3, srt=90)
text(12.1, .1, "Critial value", col='darkgreen', pos=3, srt=-90)


# legend("topright", legend=c("H0"), fill=c(colh0))

```

In the social sciences, we allow ourselves to make a wrong decision more often. We usually set the alpha level to .05. For our discrete example setting the alpha level to .05 is not really possible. Looking at @tbl-nullprobtable, we could raise the significance level to .022 if we would reject the null hypothesis if we would draw 0, 1 or 9, 10 yellow candies. This would result in a type I error rate of 2.2%. Though if we would also reject the null hypothesis with 2 or 8 yellow candies, we would have a type I error rate of 11%. For a discrete probability distribution with a limited number of outcomes, it is not always possible to set the alpha level exactly to .05.

For continuous probability distributions, such as the normal distribution, it is possible to set the alpha level to exactly .05. For example the null hypothesis that average media literacy in the population of children equals 5.5 on a scale from one to ten.

For such continuous variables, we can estimate a sampling distribution around the hypothesized population value using a theoretical approach (@sec-theoretical-approx). Remember (@sec-expectedvalue) that the population value is the expected value of the sampling distribution, that is, its mean (if the estimator is unbiased). The sampling distribution, then, is centered around the population value specified in the null hypothesis. This sampling distribution tells us the probabilities of all possible sample outcomes _if the null hypothesis is true_. It allows us to identify the most unlikely samples. In **Step 2** in @fig-nullsampling, we set the alpha level to .05. This means that we cut off 2.5% of the area in each tail of the sampling distribution. The **critical values** are the values that separate the 2.5% of the area in each tail from the 95% of the area in the middle. If we assume the population parameter to be 5.5, rejecting the null hypothesis would again be a wrong decision. Thus setting the boundary by using an alpha level of .05, would yield a wrong decision in 5% of the samples we take. Just like the discrete candy color case, we decide to reject $H_0$ on the outside of the critical value and not reject $H_0$ on the inside of the critical value. In **Step 4** in @fig-nullsampling, we add the result of a sample. You can redraw multiple samples by clicking the button in the app.

Note that the reasoning for the discrete case and the continuous case is the same. The only difference is that for the continuous case we can set the alpha level exactly to .05.

::: {#fig-nullsampling .column-page-inset-right}

```{=html}
<iframe src="https://sharon-klinkenberg.shinyapps.io/nullsampling/" width="100%" height="410px" style="border:none;">
</iframe>
```
Sampling distribution of average media literacy according to the null hypothesis.
:::

### 1 - Alpha

The decision to not reject the null hypothesis when it is in reality true is indicated by $1 - \alpha$. It does not go by any other name, but in terms of probability, it is directly dependent on your desired type I error rate, your chosen alpha level. It therefore corresponds to the probabilities in @tbl-nullprobtable of 1, 2, 3, 4, 5, 6, 7, 8, or 9 yellow candies in the candy factory example. We have a 99.8% (1 - .002) chance of making the correct decision to not reject $H_0$ when we assume it to be true. The inside of the critical value in @fig-nulldistributionalpha is the area where we do not reject the null hypothesis. In the $2 \times 2$ decision table in @fig-decisiontable, this corresponds to the bottom left green quadrant.

Now that we have determined our critical value (for our particular sample size) based on our desired alpha, significance level, we can use this critical value to look at the power.

### Power (true) {#sec-power}

The true power is the probability of making the correct decision to reject the null hypothesis when it is in fact false. In the $2 \times 2$ decision table in @fig-decisiontable, this corresponds to the top right quadrant. As we have already set our decision criterion by choosing our alpha level in the previous step, we already know when we decide to reject the null hypothesis. In @fig-nulldistributionalpha, based on the null distribution, we determined our type I error to be 0.2%, if we would reject the null hypothesis if we would draw 0 or 10 yellow candies. The critical value would in that case be between 0 and 1 and 9 and 10. We use this same critical value to determine the power of the test, as it establishes our decision boundary.

As the right column of @fig-decisiontable only states that $H_0 = \text{FALSE}$, it does not state what this entails. Within the Neyman Pearson approach, this would be the true population value with its associated probability distribution. We already established that this would be the distribution with a parameter value of .2. In @fig-altdistributionpower, we see that our decision criterion is still the same. That we decide to reject the null when we sample 0 or 10 yellow candies. But the distribution has now changed.

```{r}
#| label: fig-altdistributionpower
#| fig-cap: "True binomial distributions with critical values."
#| echo: false
#| warning: false


title = expression(paste("True distributions ", theta, " = .2"))

colh0 = 'yellow'
colha = 'lightyellow'

barplot( dbinom(0:10, 10, .2),
         main=title,
         names.arg = 0:10,
         xlab      = "number of yellow candies",
         ylab      = "Probability",
         col       = colha
         # beside    = TRUE
         # density   = c(0, 10)
         #legend.text = c("H0", "HT")
         )

lines(c(1.5, 11.9), c(.22, .22), col='green', lwd=20)
text(6.7, .235, "Not reject H0", col='black', pos=1)

lines(c(0, 1.1), c(.22, .22), col='red', lwd=20)
text(.5, .235, "Reject", col='white', pos=1)

lines(c(12.3, 13.5), c(.22, .22), col='red', lwd=20)
text(12.9, .235, "Reject", col='white', pos=1)

abline(v = c(1.3,12.1), lwd = 2, col='darkgreen')
text(1.3, .1, "Critial value", col='darkgreen', pos=3, srt=90)
text(12.1, .1, "Critial value", col='darkgreen', pos=3, srt=-90)


# legend("topright", legend=c("H0"), fill=c(colh0))

```

With this distribution based on the actually true population parameter, deciding to reject the null would be a good decision. Though, we can also see that if this is true, if the parameter truly is .2, getting a bag with 0 or 10 yellow candies does not happen that often. The probabilities for 10 yellow candies is almost none, and the probability for getting 0 yellow candies is about 11%. This means that if this hypothesis is true, if our sample originates from this distribution, we would only make the decision to reject the null hypothesis in 11% of the samples we get out of it. So, the power of the test, correctly rejecting the null is only 11%.

> The only way to increase the power is to increase the sample size of the study, or increase the type I error.

As stated earlier we would rather have a higher probability of making the correct decision. In the social sciences we are striving for a power of .80. This means that we want to make the correct decision in 80% of the cases when the null hypothesis is false. In our candy factory example, this would mean that we would want to reject the null hypothesis in 80% of the replications. With our machine producing bags with 10 candies, this is just not possible. The only way to increase the power is to increase the sample size of the study. In the candy factory example, this would mean that we would have to increase the number of candies in the candy bags. We will come back to this in the @sec-sample-size on sample size.

One more thing to note, is that the true power of the test can only be determined if you know the true population value. In practice, we do not know if the null or the alternative hypothesis is true. We can only calculate the power of the test when we assume some alternative hypothesis. It is good practice to base your assumptions about the alternative hypothesis on previous research, theory, or other empirical evidence. This is mostly expressed as the expected effect size, the expected difference between the null and the alternative hypothesis.

In all statistical software, the power of the test is not calculated based on the true effect size, but on the found effect size in your sample. This is called the observed power and will be covered in @sec-observed-effect-size.

### Beta

The probability of making a type II error is indicated by $\beta$. It is the probability of not rejecting the null hypothesis when it is in reality false. In the $2 \times 2$ decision table in @fig-decisiontable, this corresponds to the bottom right quadrant. The power of the test is $1 - \beta$. In our candy factory example, the power of the test is .11, so the probability of making a type II error is .89. It is the sum of the probabilities of getting 1, 2, 3, 4, 5, 6, 7, 8, or 9 yellow candies, when the machine actually produces bags with 2 yellow candies with the corresponding probabilities as shown in @fig-altdistributionpower.

### Test statistic {#sec-test-statistic}

In @sec-samplestatistic we discussed the **sample statistic**, and defined it as any value describing a characteristic of the sample. This could be the mean, or the proportion, or the correlation, or the regression coefficient. It is a value that is calculated from the sample. Note that conversions of the sample statistic, such as the difference between two sample means, or the ratio of two sample variances, $t$-values, $F$-values, and $\chi^2$-values are also sample statistics.

The **test statistic** is a sample statistic that is used to test the null hypothesis. In our candy factory example, the test statistic would be the number of yellow candies in the bag we sample. If we would draw a bag with 4 yellow candies, the test statistic would be 4.

In the previous sections, we have determined our decision criterion, the critical value, based on our desired alpha level. We have also determined the power of the test, based on the alternative hypothesis. The test statistic is used to determine if we reject the null hypothesis or not. If the test statistic is equal to the critical value or more extreme, we reject the null hypothesis. If the test statistic is inside the critical value, we do not reject the null hypothesis.

Looking at @fig-nulldistributionalpha, we see that the critical value is between 0 and 1 and 9 and 10. If we would draw a bag with 4 yellow candies, we can check if the value 4 is inside or outside the critical value. As 4 is inside the critical value, we would not reject the null hypothesis.

> The test statistic is the value that is used to decide if we reject the null hypothesis or not.

For continuous variables, as described in @fig-nullsampling, the test statistic is the sample mean. If the sample mean is outside the critical value, we reject the null hypothesis. If the sample mean is inside the critical value, we do not reject the null hypothesis. If you select Step 4 in @fig-nullsampling, and draw a few samples, you can see if the test statistic, the sample mean, is inside or outside the critical value. Again, the reasoning for continuous variables is the same as for the discrete variables.

### P-value {#sec-pvalue}

We have learned that a test is statistically significant if the test statistic is in the rejection region. Statistical software, however, usually does not report the rejection region for the sample statistic. Instead, it reports the _p-value_ of the test, which is sometimes referred to as _significance_ or _Sig._ in SPSS.

> The **p-value** is the probability of obtaining a test statistic at least as extreme as the result actually observed, under the assumption that the null hypothesis is true.

In the previous section we considered a sample with 4 yellow candies. The p-value gives the probability of randomly drawing a sample that is as extreme or more extreme than our current sample assuming that the null hypothesis is true. “As extreme or more extreme” here means as far or further removed from the value specified by the null hypothesis. Concretely, in our case that means the probability of drawing a sample with 4 or fewer yellow candies. The p-value considers the probability of such a sample, but also ads the probability of getting a sample with less yellow candies. This is what is meant with “at least as extreme”. this is not really intuitive, but it refers to the less likely test statistics, iIn our case 0, 1, 2 and 3, are even less probable than 4 yellow candies. The assumption that the null hypothesis is true indicates that we need to look at the probabilities from the sampling distribution that is created based on the null distribution hypothesis. Looking at @tbl-nullprobtable, we see that the probability of drawing a random sample with 0, 1, 2, 3 or 4 yellow candies under the null distribution is 0.001 + 0.010 + 0.044 + 0.117 + 0.205 = 0.377 according to the sampling distribution belonging to the null hypothesis. This 0.377 is the p-value. The conditional (conditional on H0 being true) probability of getting a sample that is as or less likely than the test statistic that we have of our current sample.

> Rejecting the null hypothesis does not mean that this hypothesis is false or that the alternative hypothesis is true. Please, never forget this.

The reasoning applied when comparing our test statistic to the critical value is the same as when comparing the p-value to the alpha level. If the p-value is smaller or equal to than the alpha level, we reject the null hypothesis. If the p-value is larger than the alpha level, we do not reject the null hypothesis.

If the test statistic is within the critical values, the p-value is always larger than the alpha level. If the test statistic lies outside the critical value, the p-value is always smaller than the alpha level. In the case that the test statistic is exactly the same as the critical value, the p-value is exactly equal to the alpha level, we still decide to reject the null hypothesis.

::: {.callout-important appearance="simple"}

Reject $H_0$ when $p$-value $\leq \alpha$

:::

As both the p-value and the alpha level assume the null to be true, you can find both probabilities under the null distribution. For continuous variables, the p-value is the area under the curve of the probability distribution that is more extreme than the sample mean. The significance level is chosen by you as a researcher and is fixed.

> It is important to remember that a _p_-value is a probability _under the assumption that the null hypothesis is true_. Therefore, it is a _conditional probability_.

Compare it to the probability that we throw sixes with a dice. This probability is one out of six under the assumption that the dice is fair. Probabilities rest on assumptions. If the assumptions are violated, we cannot calculate probabilities.

If the dice is not fair, we don't know the probability of throwing sixes. In the same way, we have no clue whatsoever of the probability of drawing a sample like the one we have if the null hypothesis is not true in the population.

@fig-t-alpha-p shows a t-distribution, which represents the null distribution. A statistical test was set up with an alpha level of 5% (blue area). The and the p-value (red area) indicates the probability of drawing a random sample with a t-value of 2 or values that are even further removed from the null hypothesis (more extreem). The figure shows what this test would look like for a two sided test (left) and a one two sided hypothesis test (right). We will cover one and two sided testing in @sec-one-twosidedtests. For now, just notice that, looking at the left graph, the p-value is greater than 0.05, because the test statistic is not as or more extreme than the critical value. In other words, the test is not significant. In the one-sided test depicted on the right, the p-value lies in the rejection region and is, thus, significant.

```{r}
#| label: fig-t-alpha-p
#| fig-cap: "T-distributions with alpha level and p-value."
#| echo: false
#| warning: false

layout(matrix(1:2, 1, 2))

plot(function(x) dt(x, df = 10), -4, 4,
     main     = "T distribution (df=10) with two sided alpha",
     cex.main = .7,
     xlab     = "T-value",
     ylab     = "",
     bty      = "n",
     lwd      = 1.5)

t = 2

p = pt(1,10)

alpha_r = 1-p


# add p-value
pol_left = seq(-4,-t,length=20)
polygon(c(pol_left,rev(pol_left)),  c(rep(0,20), dt( rev(pol_left),  df = 10)), col="red", angle=45, density=15, lwd=2)
pol_right = seq(4,t,length=20)
polygon(c(pol_right,rev(pol_right)),c(rep(0,20), dt( rev(pol_right), df = 10)), col="red", angle=45, density=15, lwd=2)

# add alpha

tp95  = qt(.95,10)
tp975 = qt(.975,10)
tp25  = qt(.025,10)

pol_left = seq(-4,tp25,length=20)
polygon(c(pol_left,rev(pol_left)),  c(rep(0,20), dt( rev(pol_left),  df = 10)), col="blue", angle=-45, density=15, lwd=2)

pol_left = seq(tp975,4,length=20)
polygon(c(pol_left,rev(pol_left)),  c(rep(0,20), dt( rev(pol_left),  df = 10)), col="blue", angle=-45, density=15, lwd=2)

# label alpha
text(-3,dt(tp25 , df = 10),expression(paste("½",alpha       ," = 2.5%",sep="")),pos=3, col='blue')
text( 3,dt(tp975, df = 10),expression(paste("½",alpha       ," = 2.5%",sep="")),pos=3, col='blue')

xbracket = seq(-4,tp25,length=39)
ybracket = dt(tp25 , df = 10)*c(c(1,1.02),rep(1.03,16),c(1.04,1.05,1.04),rep(1.03,16),c(1.02,1))
lines(xbracket,ybracket,col="gray")

xbracket = seq(tp975, 4,length=39)
ybracket = dt(tp975 , df = 10)*c(c(1,1.02),rep(1.03,16),c(1.04,1.05,1.04),rep(1.03,16),c(1.02,1))
lines(xbracket,ybracket,col="gray")

#text(-4.2,.38,expression(paste("2 sided: P = ",P[left] + P[right]," = 7.34%")),pos=4)

plot(function(x) dt(x, df = 10), -4, 4,
     main="T distribution (df=10) with one sided alpha",
     cex.main = .7,
     xlab="T-value",
     ylab="",
     bty="n",
     lwd=1.5)

pol_left = seq(tp95,4,length=20)
polygon(c(pol_left,rev(pol_left)),  c(rep(0,20), dt( rev(pol_left),  df = 10)), col="blue", angle=-45, density=15, lwd=2)

# label alpha
text( 3,dt(tp95 , df = 10),expression(paste(    alpha[right]," = 5%"  ,sep="")),pos=3, col='blue')

# add p-value
# pol_left = seq(-4,-t,length=20)
# polygon(c(pol_left,rev(pol_left)),  c(rep(0,20), dt( rev(pol_left),  df = 10)), col="blue", angle=45, density=15, lwd=2)
pol_right = seq(4,t,length=20)
polygon(c(pol_right,rev(pol_right)),c(rep(0,20), dt( rev(pol_right), df = 10)), col="red", angle=45, density=15, lwd=2)

#text((-4+tp25)/2,dt(tp25  , df = 10),"{",srt=-90,col="grey")
#text( (4+tp975)/2,dt(tp975, df = 10),"{",srt=-90,col="grey")
#text( (4+tp95)/2,dt(tp95  , df = 10),"{",srt=-90,col="grey")

# label p-values
# text(-t,.01,expression(P[left] ),pos=2,col="blue")
text( t+.5,.03,expression(P[right]),pos=4,col="red")

# text(-t-.1,dt(-t, df = 10)+.003,"T")
text( t+.1,dt( t, df = 10)+.003,"T")


# text(-4.2,.38 ,"1 sided: P/2 = 3.67%",pos=4)

## 1%   T = 1.81246
## .95% T = 2.228

#polygon(c(0,1,1,0),c(0,0,.3,.3))

pol_left = seq(-4,-t,length=20)

xbracket = seq(tp95,4,length=39)
ybracket = dt(tp95 , df = 10)*c(c(1,1.02),rep(1.03,16),c(1.04,1.05,1.04),rep(1.03,16),c(1.02,1))
lines(xbracket,ybracket,col="gray")
```

In @fig-t-alpha-p, the blue vertical boundaries represent the critical value associated with a chosen alpha level of 5%, the blue area under the curve. The red vertical line represents the t-value from the sample, which in this example was 2. The red area under the curve represents the p-value, the probability of getting this t-value or more extreme.

@fig-twosided represents the sampling distribution of average media literacy. You can take a sample and play around with the population mean according to some null hypothesis. If the mean in the sample is outside the critical value, it falls in the alpha rejection region.

::: {#fig-twosided .column-page-inset-right}

```{=html}
<iframe src="https://sharon-klinkenberg.shinyapps.io/twosided/" width="100%" height="268px" style="border:none;">
</iframe>
```
Sampling distribution of average media literacy according to the null hypothesis.
:::

The reasoning is again the same as for discrete variables. If the p-value is smaller or equal to the alpha level, we reject the null hypothesis. If the p-value is larger than the alpha level, we do not reject the null hypothesis.

### True effect size {#sec-true-effect-size}

The true effect size is the difference between the null hypothesis and the true population value. This can also be expressed in terms of the test statistic. For example, if the IQ scores for communication science students are 120 in the population, the true affect size can be expressed as 20 IQ points, but also as a  _t_-value. The **true** effect size denotes the genuine effect within the population, representing the actual difference, correlation, or parameter value.

In the candy factory example, the true effect size is .5 - .2 = .3. This is the difference in the proportion of yellow candies in the bags. In @fig-twobinomeffect you can see the difference in the two distributions. The true effect size is the difference in the expected value of the two distributions. In absolute terms, it is 5 - 3 expected number of yellow candies in the bag. In terms of the parameter it is the proportion .5 - .2.

```{r}
#| label: fig-twobinomeffect
#| fig-cap: "Effect in discrete binomial distributions."
#| echo: false
#| warning: false

title = "Both distributions"

colh0 = 'yellow'
colha = 'lightyellow'

barplot( rbind(dbinom(0:10, 10, .5),
               dbinom(0:10, 10, .2)),
         main=title,
         names.arg = 0:10,
         xlab      = "number of yellow candies",
         ylab      = "Probability",
         col       = rbind(colh0,colha),
         beside    = TRUE
         # density   = c(0, 10)
         #legend.text = c("H0", "HT")
         )

# abline(v = c(6.5,27.5), lty=2, col='red')

legend("topright", legend=c("H0","HT"), fill=c(colh0, colha))

lines(c(8.5, 16.5), c(.23, .23), col='red', lwd=4)
text(12.5, .23, "True effect size", col='red', pos=3)
```

**True** refers to the actual difference in the population, which is unknown to us. In our candy factory example, we can only observe the sample from a candy bag and make assumptions based on the null and alternative hypotheses.

Depending on the true value in the population, a true effect size could be small, medium, or large. In order to detect small true effect sizes, we need a large sample size. A larger sample offers more precision, so the difference between our sample outcome and the hypothesized value is more often sufficient to reject the null hypothesis. For example, we would reject the null hypothesis that average candy weight is 2.8 grams in the population if average weight in our sample bag is 2.70 grams and our sample is large. But we may not reject this null hypothesis if we have the same outcome in a small sample bag.

The larger our sample, the more sensitive our test will be, so we will get statistically significant results more often. If we think of our statistical test as a security metal detector, a more sensitive detector will go off more often.

#### Practical relevance

Investigating the effects of a new medicine on a person’s health, we may require some minimum level of health improvement to make the new medicine worthwhile medically or economically. If a particular level of improvement is clinically important, it is practically relevant (sometimes called practically significant).

If we have decided on a minimum level of improvement that is relevant to us, we want our test to be statistically significant if the average true health improvement in the population is at least of this size. We want to reject the null hypothesis of no improvement in this situation.

::: {.callout-important appearance="simple"}

* A larger sample size makes a statistical test more sensitive. The test will pick up (be statistically significant for) smaller effect sizes.

* A larger effect size is more easily picked up by a statistical test. Larger effect sizes yield statistically significant results more easily, so they require smaller samples.

:::

For media interventions such as health, political, or advertisement campaigns, one could think of a minimum change of attitude affected by the campaign in relation to campaign costs. A choice between different campaigns could be based on their efficiency in terms of attitudinal change per cost unit.

Note the important difference between practical relevance and statistical significance. Practical relevance is what we are interested in. If the new medicine is sufficiently effective, we want our statistical test to signal it. In the security metal detector example: If a person carries too much metal, we want the detector to pick it up.

Statistical significance is just a tool that we use to signal practically relevant effects. Statistical significance is not meaningful in itself. For example, we do not want to have a security detector responding to a minimal quantity of metal in a person’s dental filling. Statistical significance is important only if it signals practical relevance. We will return to this topic in @sec-sample-size on sample size.

### Observed effect size {#sec-observed-effect-size}

In @sec-true-effect-size we discussed the true effect, the difference between the null hypothesis and the true alternative hypothesis. The problem is that we do not know the true effect, we do not know which of the two hypothesis is actually true.

We can only estimate the true effect using the sample statistic. The difference between the sample statistic and the null hypothesis is called the **observed effect size**. In the candy factory example, the observed effect size is the difference between the number of yellow candies in the sample and the number of yellow candies in the null hypothesis. If the null hypothesis is that the machine produces bags with 5 yellow candies, and the sample contains 4 yellow candies, the observed effect size is 1.

The same definition holds for the continuous case. If the null hypothesis is that the average media literacy in the population is 5.5, and the sample mean is 3.9, the observed effect size is 1.6. Or if we hypothesize that average candy weight in the population is 2.8 grams and we find an average candy weight in our sample bag of 2.75 grams, the effect size is -0.05 grams. If a difference of 0.05 grams is a great deal to us, the effect is practically relevant.

Note that the effect sizes depend on the scale on which we measure the sample outcome. The unstandardized effect size of average candy weight changes if we measure candy weight in grams, micro grams, kilograms, or ounces. Of course, changing the scale does not affect the meaning of the effect size but the number that we are looking at is very different: 0.05 grams, 50 milligrams, 0.00005 kilos, or 0.00176 ounces. For this reason, we do not have rules of thumb for interpreting these **unstandardized effect sizes** in terms of small, medium, or large effects. But we do have rules of thumb for **standardized effect sizes**. Unstandardized effect sizes are very useful for reporting the practical results of your study, but they are not very useful for comparing studies or for meta-analysis.

You can imagine that estimating the true effect size on just one sample is not very reliable. The observed effect size could be the result of our sample being the result of the null being true, or the alternative being true. The way researchers try to get a notion of the true effect size is by replicating the study. If the observed effect size is consistent over multiple replications, we can be more confident that the average observed effect size is the true effect size. This is what we will cover in @sec-meta-analysis about meta analysis.

#### Cohen's *d*

In scientific research, we rarely have precise norms for raw differences (unstandardized effects) that are practically relevant or substantial. For example, what would be a practically relevant attitude change among people exposed to a health campaign?

To avoid answering this difficult question, we can take the variation in scores (standard deviation) into account. In the context of the candies example, we will not be impressed by a small difference between observed and expected (hypothesized) average candy weight if candy weights vary a lot. In contrast, if candy weight is quite constant, a small average difference can be important.

For this reason, standardized effect sizes for sample means divide the difference between the sample mean and the hypothesized population mean by the standard deviation in the sample. Thus, we take into account the variation in scores. This standardized observed effect size for tests on one or two means is known as *Cohen's* d. @eq-CoD illustrates how the sample mean $\bar{x}$ is compared to the hypothesized population mean $\mu_{H_0}$, and how this difference is standardized by deviding through the standard deviation $s$. In appendix @sec-CohenCalculations we will cover the calculation of the paired and independent t-tests.

$$
d = \frac{\bar{x} - \mu_{H_0}}{s_x}
$$ {#eq-CoD}

Using an inventory of published results of tests on one or two means, @RefWorks:3933 proposed rules of thumb for standardized effect sizes (ignore a negative sign if it occurs):

-   0.2: weak (small) effect,
-   0.5: moderate (medium) effect,
-   0.8: strong (large) effect.

Note that Cohen's *d* can take values above one. These are not errors, they reflect very strong or huge effects [@sawilowskyNewEffectSize2009].

#### Association as effect size {#sec-assoc-size}

Measures of association such as Pearson's product-moment correlation coefficient or Spearman's rank correlation coefficient express effect size if the null hypothesis expects no correlation in the population. If zero correlation is expected, a correlation coefficient calculated for the sample expresses the difference between what is observed (sample correlation) and what is expected (zero correlation in the population).

Effect size is also zero according to the standard null hypotheses used for tests on the regression coefficient (*b*), *R*^2^ for the regression model, and eta^2^ for analysis of variance. As a result, we can use the standardized regression coefficient (Beta in SPSS and *b*\* according to APA), *R*^2^, and eta^2^ as standardized effect sizes.

Because they are standardized, we can interpret their effect sizes using rules of thumb. The rule of thumb for interpreting a standardized regression coefficient (*b*\*) or a correlation coefficient, for example, could be:

* Very weak: between 0 and .10
* Weak: between .10 and .30
* Moderate: between .30 and .50
* Strong: between .50 and .80
* Very strong: between .80 and 1.00
* Perfect association: 1.00

Note that we ignore the sign (plus or minus) of the effect when we interpret its size.

### Post hoc power

Just as the observed effect size is based on the test statistic acquired from your sample, so is the post hoc power. It is also known as: observed, retrospective, achieved power [@doi:10.1080/19312450701641375].

> The power of a test assuming a population effect size equal to the observed effect size in the current sample.
>
> --- [@doi:10.1080/19312450701641375]

The post hoc power refers to the probability of rejecting the null hypothesis assuming the alternative hypothesis has a population mean equal to the observed sample mean or more accurately the observed test statistic.

```{r}
#| label: fig-posthocpower
#| fig-cap: "Discrete binomial distributions showing post hoc power."
#| echo: false
#| warning: false

title = "Both distributions"

colh0 = 'yellow'
colha = 'lightyellow'

n = 10

barplot( rbind(dbinom(0:n, n, .5),
               dbinom(0:n, n, .4)),
         main=title,
         names.arg = 0:n,
         xlab      = "number of yellow candies",
         ylab      = "Probability",
         col       = rbind(colh0,colha),
         beside    = TRUE
         # density   = c(0, 10)
         #legend.text = c("H0", "HT")
         ) -> p


abline(v = p[1,1]+2, col='darkgreen', lwd=2)
abline(v = p[1,10]+2, col='darkgreen', lwd=2)


# abline(v = c(6.5,27.5), lty=2, col='red')


legend("topright", legend=c("H0","HA"), fill=c(colh0, colha))

```

@fig-posthocpower shows the post hoc power for a sample of 10 candies. The null hypothesis is that the machine produces bags with 5 yellow candies. The alternative hypothesis is that the machine produces bags with 2 yellow candies. But the post hoc power assumes the found test statistic of 4 candies to be the alternative population parameter of .4. Following the same decision criterion as defined in the previous sections, the post hoc power is almost zero. This is the probability of 0 or 10 yellow candies under the true distribution when rejecting the null hypothesis on the outside of the critical values.

You can imagine that if we look at a different candy bag and we would find 7 yellow candies, the post hoc power would not be the same. The post hoc power does not have much practical use, though SPSS produces this when you ask it, it is obvious that multiple replications of a research study will yield different results. As the true population mean is not a random variable, the actual power is fixed and should not vary.

<!-- [To-Do] create a new shiny app with null distribution, and random sample with mean and alternative distribtuion and post hoc power area under the HA distribution -->

### Meta analysis {#sec-meta-analysis}

As mentioned in @sec-observed-effect-size, the observed effect size is based on the sample statistic, and is likely to differ with every sample you take. If our research hypothesis is actually true, a random sample from a population described by the sampling distribution of the alternative hypothesis would be mot likely to result in us holding a bag with 2 yellow candies. But as we have seen in @fig-altdistribution, getting 4 yellow candies is reasonably probable as well.

Now imagine that we would take multiple samples, and calculate the observed effect size for each sample. If we would plot these observed effect sizes, we would get a distribution of observed effect sizes.

In research we can conduct replication studies to see if the observed effect size is consistent over multiple replications. If this is the case, we can be more confident that the average observed effect size is the true effect size and we can determine the true population mean. As we have seen in @sec-samp-dist, it is in practical to draw many number of samples to create a sampling distribution. But we can use the results from multiple studies to get an indication of the true population mean.

Imagine that we get a hundred bags of candy (100 replications) and we consistently find 7 to 9 yellow candies, this would give us an indication that the true population value is 8. It would also indicate that our initial alternative hypothesis is highly unlikely.
This is essentially what meta analysis is about. Collecting effect sizes from multiple studies and combining them to get an indication of the true effect size.

<!-- [to-do] add communication meta analysis example. -->

Meta-analysis is a good example of combining research efforts to increase our understanding. It is useful to obtain more precise estimates of population values or effects. Meta-analysis is strongly recommended as a research strategy by Geoff Cumming, who coined the concept _New Statistics_. See Cumming's book [-@RefWorks:3883], [website](http://www.latrobe.edu.au/psychology/research/research-areas/cognitive-and-developmental-psychology/esci), or [YouTube channel](https://www.youtube.com/user/geoffdcumming) if you are curious to learn more.

### Sample size {#sec-sample-size}

As stated in @sec-power, the only way to increase the power of a test is to increase the sample size. In the candy factory example, the sample size is the total number of candies in the bag. With only 10 candies in the bag, the power of the test is only 0.11. To reach our desired power of 80%, we clearly need to increase the sample size. In @fig-twobinomN20, we increased the number of candies in the bag to 20. We can see on the x-axis that the possible outcome space for the number of yellow candies in the bag is now 0 to 20. This still assumes our $H_0$ to be true, and the parameter of the machine is still  $\theta = .5$, half of the candies in the bag should be yellow. Though the parameter is still the same, the expected value when we have bags of 20 candies is now $.5 \times 20 = 10$, right in the middle of our distribution.

@fig-twobinomN20 still follows the reasoning scheme we have setup earlier. We decide to reject $H_0$ on the outside of our critical values (Red vertical line). We determined the position of the critical value based on our chosen alpha level. Because our outcome space is larger we can be more accurate in striving for an $\alpha = .05$. Our alpha is now 4.1%, we get this by adding the yellow bars 0, 1, 2, 3, 4, 5 and 15 up until 20, under the null distribution. This is not exactly 5 percent, but shifting the critical value inwards, would make the alpha level to high. So, this is close enough.

With this sample size, we can acquire our desired power of 80%. If we would assume our alternative hypothesis to be true, our decision to reject the null when you get 5 or less yellow candies, would be correct 80% of the time. The power of 80% is the sum of the light yellow bars under the assumption that $H_A$ is true on the outside of our critical value. So, the power is the probability of getting 0, 1, 2, 3 ,4 ,5 or 15, 16,17, 18, 19 ,20 yellow candies under the true distribution.

```{r}
#| label: fig-twobinomN20
#| fig-cap: "Discrete binomial distributions for higher sample size."
#| echo: false
#| warning: false

title = "Both distributions"

colh0 = 'yellow'
colha = 'lightyellow'

n = 20

barplot( rbind(dbinom(0:n, n, .5),
               dbinom(0:n, n, .2)),
         main=title,
         names.arg = 0:n,
         xlab      = "number of yellow candies",
         ylab      = "Probability",
         col       = rbind(colh0,colha),
         beside    = TRUE,
         cex.names  = .6
         # density   = c(0, 10)
         #legend.text = c("H0", "HT")
         ) -> p


abline(v = p[1,6]+2, col='darkgreen', lwd=2)
abline(v = p[1,15]+2, col='darkgreen', lwd=2)


# abline(v = c(6.5,27.5), lty=2, col='red')

legend("topright", legend=c("H0","HT"), fill=c(colh0, colha))

```

The same reasoning is applied when using continuous sample statistics. Let's revisit the candy weight example. We could have a null hypothesis that the average yellow candy weight is the same as the weight of all other candy colors. But if in reality the yellow candies would be heavier, let's say with an effect size of .3, we would need to determine what sample size we would need to get a power of 80% and a alpha of 5%.

@fig-sample-size-power shows the relation between sample size, power, alpha and effect size. You can play around with the sliders de determine what sample size you would need to obtain a power of 80% for an effect size of .3.

::: {#fig-sample-size-power .column-page-inset-right
}

```{=html}
<iframe src="https://sharon-klinkenberg.shinyapps.io/sample-size-power2/" width="100%" height="385px" style="border:none;">
</iframe>
```
How does test power depend on effect size, type of test, significance level, and sample size? Sampling distributions of the sample mean under the null hypothesis (H~0~, left-hand curve) and under the assumed true value of the population mean (H~1~, right-hand curve) for a one-sample _t_ test.
:::

For continuous sample statistics, we choose an alpha level, and we can see the critical value in the null distribution. The alpha level of 5% is the area under the curve of the null distribution on the outside of the critical values. The power is the area under the alternative distribution that is on the outside of the critical values.

The reasoning is again the same as in the discrete case, when we use categorical sample statistics. We first determine our desired alpha and power, make sure our sample size is large enough to get the desired power, for our effect size of interest. Then, when we collect our data, we can calculate our test statistic and determine if we can reject the null hypothesis or not, being confident that we will be wrong in our conclusion in 5% of the cases, and that we will be right in 80% of the cases when the alternative hypothesis is actually true.

#### How to determine sample size

As stated in @sec-power about the power of a test, we already considered that we do not know the parameter for the alternative distribution and that we therefore also don't know the true effect size. We stated that you can make an educated guess about the true effect size based on previous research, theory, or other empirical evidence.

In research you can take these assumptions into account by conducting a power analysis. A power analysis is a statistical method to determine the sample size you need to get a desired power for a given effect size.

It can be difficult to specify the effect size that we should expect or that is practically relevant. If there is little prior research comparable to our new project, we cannot reasonably specify an effect size and calculate sample size. Though, if there are meta analyses available for your research topic of interest or you have the effect sizes from a few previous studies, you can use programs such as G\*Power to calculate the sample size you need to get a desired power for a given effect size. G\*Power is a stand alone program that can be downloaded for free from the internet, and is specifically designed to calculate the required sample size for a wide range of statistical tests.

> Download [G\*Power here](http://www.gpower.hhu.de/)

In G\*Power you can specify the test you want to conduct, the effect size you expect, the alpha level you want to use, and the power you want to achieve. G\*Power will then calculate the sample size you need to get the desired power for the given effect size.

For our candy color example, we can use G\*Power to calculate the sample size we need to get a power of 80% for a given effect size of .3.

![Power analysis in G*Power for a binomial distribution](figures/g-power.png){#fig-g-power width=400 fig-align="left"}

In @fig-g-power you can see that for the binomial test we have set the proportion p1 to .5 ($H_0$) and the proportion p2 ($H_A$) to .2, indirectly setting the effect size to .3. We have set the alpha level to 5% and the power to 80%. By hitting the calculate button, G\*Power will calculate the sample size we need. In this case we need 20 candies in the bag to get a power of 80%. The plot shows exactly the same information as in @fig-twobinomN20, though with lines instead of bars.

As mentioned in @sec-true-effect-size about the true effect size, the sensitivity of a test is determined by the sample size. The larger the sample size, the more sensitive the test will be. This means that if we want to detect a small effect size, we need a large sample size. If we want to detect a large effect size, we can suffice with a smaller sample.

Try to determine what sample size you would need using @fig-sample-size-power, if you would want to detect an effect size of .2, .5 or .8 with a power of 80% and an alpha level of 5%. You can see that the sample size ranges from 197 to about 15 for these effect sizes.

Something to consider is that with extremely large sample sizes you will very easily find significant results. Even if these results are not practically relevant. This is why it is important to determine the sample size you need before you start collecting data.

### One-Sided and Two-Sided Tests {#sec-one-twosidedtests}

As was explained in @sec-alternative-hypothesis, the alternative hypothesis can be one-sided or two-sided. The choice between a one-sided or two-sided test is based on the research question. In our media literacy example, we could have a one-sided alternative hypothesis that the average media literacy is below 5.5. This would be the case if we hypothesize that children on average score very low on media literacy. We could also have a different hypothesis, that a media literacy intervention program will increase media literacy. Both would be a one-sided alternative hypothesis. We could also have no idea about the media literacy of children, and just want to know if children score below or above 5.5 on media literacy. This would be a two-sided alternative hypothesis. @eq-MediaAlt formalizes these different hypothesis.

$$
\begin{eqnarray*}
\text{Two-sided} \\
H_{A} & : \theta & \neq 5.5 \\
\text{One-sided} \\
H_{A} & : \theta & < 5.5 \\
H_{A} & : \theta & > 5.5 \\
\end{eqnarray*}
$$ {#eq-MediaAlt}


In null hypothesis significance testing, testing one or two-sided has some consequences for the critical values. In a two-sided test, the critical values are on both sides of the null hypothesis value. In a one-sided test, the critical value is only on one side of the null hypothesis value. If we are using an alpha significance level of 5%, the critical value for a two-sided test results in 2.5% on both sides, while for a one sided test the 5% would only be on one side of the null distribution.

::: {#fig-nonsig-1sided}

```{=html}
<iframe src="https://sharon-klinkenberg.shinyapps.io/nonsig-1sided/" width="100%" height="310px" style="border:none;">
</iframe>
```
One-sided and two-sided tests of a null hypothesis.
:::

In the right-sided test of the media literacy hypothesis, the researcher is not interested in demonstrating that average media literacy among children can be lower than 5.5. She only wants to test if it is above 5.5, because an average score above 5.5 indicates that the intervention worked.

If it is deemed important to note values well over 5.5 as well as values well below 5.5, the alternative hypotheses should be two-sided. Then, a sample average well below 5.5 would also have resulted in a rejection of the null hypothesis.

@fig-nonsig-1sided shows the $H_0$ distribution of the sample mean 5.5. The dark blue areas represent the 5% probability for a two-sided te  st, 2.5% on either side. The light blue areas represent the 5% probability for a one-sided (right-sided) test. The critical value for the one-sided test is 7.8, and the critical values for the two-sided test are 2.9 and 8.1. The critical value is the value that separates the rejection region from the non-rejection region. Where the rejection region are the values on the x-axis that are on the outside of the critical value.

You can take a sample and see the result of the sample in the figure. You can then determine if the sample mean is significant at a 5% significance level for a right-sided test, and a two-sided test.

#### From one-sided to two-sided _p_ values and back again

Statistical software like SPSS usually reports either one-sided or two-sided _p_ values. What if a one-sided _p_ value is reported but you need a two-sided _p_ value or the other way around?

In @fig-onetwosided, the sample mean is 3.9 and we have .015 probability of finding a sample mean of 3.9 or less if the null hypothesis is true that average media literacy is 5.5 in the population. This probability is the surface under the curve to the left of the solid red line representing the sample mean. It is the one-sided _p_ value that we obtain if we only take into account the possibility that the population mean can be smaller than the hypothesized value. We are only interested in the left tail of the sampling distribution.

::: {#fig-onetwosided .column-page-inset-right}

```{=html}
<iframe src="https://sharon-klinkenberg.shinyapps.io/onetwosided/" width="100%" height="268px" style="border:none;">
</iframe>
```
Halve a two-sided _p_ value to obtain a one-sided _p_ value, double a one-sided _p_ value to obtain a two-sided _p_ value.
:::

In a two-sided test, we have to take into account two different types of outcomes. Our sample outcome can be smaller or larger than the hypothesized population value. The p-value still represents the probability of drawing a random sample with a sample statistic (here the mean) that is as extreme or more extreme than the sample statistics in our current sample. In the one-sided test example described above, more extreme can only mean “even smaller”. In a two-sided test, more extreme means even more distant from the null hypothesis on either end of the sampling distribution.

In @fig-onetwosided, you can see tha sample meen as indicated by the solid red line. The dotted red line is the mirror image of the sample mean on the other side of the hypothesized population mean. When testing two-sided, we not only consider the sample mean, but also its mirror opposite. The two-sided _p_ value is the probability of finding a sample mean as extreme or more extreme than the sample mean in the sample, and also its mirror opposite. Hence, the two-sided _p_ value is the sum of the probabilities for both the left tail and the right tail of the sampling distribution. As these tails are symmetrical, the two-sided _p_ value is twice the one-sided _p_ value.

So, if our statistical software tells us the two-sided _p_ value and we want to have the one-sided _p_ value, we can simply halve the two-sided _p_ value. The two-sided _p_ value is divided equally between the left and right tails. If we are interested in just one tail, we can ignore the half of the _p_ value that is situated in the other tail.

Be careful if you divide a two-sided _p_ value to obtain a one-sided _p_ value. If your left-sided test hypothesizes that average media literacy is below 5.5 but your sample mean is well above 5.5, the two-sided _p_ value can be below .05. But your left-sided test can never be significant because a sample mean above 5.5 is fully in line with the null hypothesis. Check that the sample outcome is at the correct side of the hypothesized population value.

You might have already realized that if you use the same alpha criterion for rejecting the null hypothesis (e.g. 5%) as is usually done, it is easier to reject a one-sided null hypothesis, because the entire 5% of most extreme samples is located on one side of the distribution, whereas a two-sided null hypothesis would require us to highlight 2.5% of samples in the lower tail of the distribution and 2.5% in the upper tail. To avoid making too many unnecessary type 1 errors (@sec-cap-chance), we should always have a good theoretical justification for using one-sided null hypotheses and tests.

One final warning: Two-sided tests are only relevant if the probability distribution that you are using to test your hypothesis is symmetrical. If you are using a non-symmetrical distribution, such as the chi-square distribution, or the F-distribution you should always use a one-sided test. This is because such distributions do not have negative values, and the critical values are always on the right side of the distribution. As the F-value, for example, represents a signal to noise ratio, it can never be negative.

## Reporting test results {#sec-reporting}

### Reporting to fellow scientists

Fellow scientists need to be able to see the precise statistical test results. According to the APA guidelines, we should report the test statistic, the associated degrees of freedom (if any), the value of the test statistic, the _p_ value of the test statistic, and the confidence interval (if any). APA requires a particular format for presenting statistical results and it demands that the results are included at the end of a sentence.

The statistical results for a _t_ test on one mean, for example, would be:_t_ (67) = 2.73, _p_ = .004, 95% CI [4.13, 4.87].

* The degrees of freedom are between parentheses directly after the name of the test statistic. Chi-squared tests add sample size to the degrees of freedom, for instance: chi-squared (12, _N_ = 89) = 23.14, _p_ = .027.
* The value of the test statistic is 2.73 in this example.
* The _p_ value is .004. Note that we report all results with two decimal places except probabilities, which are reported with three decimals. We are usually interested in small probabilities---less than .05---so we need the third decimal here. If SPSS rounds the _p_ value to .000, report: _p_ < .001. Add (one-sided) after the _p_ value if the test is one-sided.
* The 95% confidence interval is 4.13 to 4.87, so we 95% confident that the population mean lies within the CI. Add (bootstrapped) after the confidence interval if the confidence interval is bootstrapped.

Not all tests produce all results reported in the example above. For example, a _z_ test does not have degrees of freedom and _F_ or chi-squared tests do not have confidence intervals. Exact tests or bootstrap tests usually do not have a test statistic. Just report the items that your statistical software produces, and give them in the correct format.

### Reporting to the general reader

For fellow scientists and especially for the general reader, it is important to read an interpretation of the results that clarifies both the subject of the test and the test results. Make sure that you tell your reader who or what the test is about:

* What is the population that you investigate?
* What are the variables?
* What are the values of the relevant sample statistics?
* Which comparison(s) do you make?
* Are the results statistically significant and, if so, what are the estimates for the population?
* If the results are statistically significant, how large are the differences or associations?

A test on one proportion, for example, the proportion of all households reached by a television station, could be reported as follows:

::: {.callout-important appearance="simple"}

"The television station reaches significantly and substantially (61%) more than half of all households in Greece in 2012, _z_ = 4.01, _p_ < .001."

:::

The interpretation of this test tells us the population ("all households in Greece"), the variable ("reaching a household") and the sample statistic of interest (61%, indicating a proportion). It tells us that the result is statistically significant, which a fellow scientist can check with the reported _p_ value.

Finally, the interpretation tells us that the difference from .5 is substantial. Sometimes, we can express the difference in a number, which is called the _effect size_, and give a more precise interpretation (see @sec-power for more information).

## Statistical test selection {#sec-test-selection}

Knowing what statistical test fits your research question is crucial for the success of your research. If you do not know what test to apply or even choose the wrong test, you may draw the wrong conclusions. This can lead to a waste of time and resources, and it can even lead to harm if the wrong conclusions are used to make decisions.

Statistics such as means, proportions, variances, and correlations are calculated on variables. For translating a research hypothesis into a statistical hypothesis, the researcher has to recognize the dependent and independent variables addressed by the research hypothesis and their variable types. The main  distinction is between dichotomies (two groups), (other) categorical variables (three or more groups), and numerical variables. Once you have identified the variables, the flow chart in @fig-flowchart helps you to identify the right statistical test.

```{r}
#| label: fig-flowchart
#| fig-cap: "Flow chart for selecting a test in SPSS."
#| echo: false
#| warning: false
#| fig-width: 12

#Create and draw flow chart for selecting tests in SPSS.
source("flowchart.R")
#Show plot.
p
#Cleanup.
rm(p, x, y)
```

You use the flow chart by identifying the type of your dependent variable (categorical or numerical) and the number and type of your independent variable (categorical or numerical). The flow chart then guides you to the right statistical test.

Consider the following example. You want to measure the difference in media literacy between man and women, and want to control for age. You measure media literacy on a scale from 1 to 7, and age in years. You have a numerical dependent variable (media literacy) and a categorical independent (biological sex), and a numerical independent variable (age). As your dependent variable is numerical, you follow the flow chart to the right. As you have two independent variables, you follow the flow chart to the right to indicate that you have both a categorical and numerical independent variable. The flow chart guides you to the _F test multiple regression model_.

## Confidence Intervals to test hypotheses {#sec-null-ci}

In @sec-param-estim, we learned how to calculate a confidence interval for the population mean. We also learned that the confidence interval is a range of values that is likely to contain the true population mean. We learned that the true population mean falls within the confidence in 95% of a hundred samples. We can use this knowledge to test hypotheses. If we, again, have the hypothesis that the average media literacy in the population is 5.5, we can use the confidence interval to test this hypothesis. If we draw a sample and calculate the confidence interval, we can see if the hypothesized population mean falls within the confidence interval. If it does, we can conclude that we are probably right. If it does not, we can conclude that we are probably wrong.

With the hypothesis that we can improve media literacy through some intervention program, we can also use the confidence interval to test this hypothesis. If the lower bound of the confidence interval is higher than 5.5, we can conclude that the intervention program works.

### Estimation in addidion to NHST

Following up on a report commissioned by the American Psychological Association APA [@RefWorks:3934], the 6^th^ edition of the _Publication Manual of the American Psychological Association_ recommends reporting and interpreting confidence intervals in addition to null hypothesis significance testing.

Estimation is becoming more important: Assessing the precision of our statements about the population rather than just rejecting or not rejecting our hypothesis about the population. This is an important step forward and it is easy to accomplish with your statistical software.

```{r}
#| label: fig-ci-nullhyp
#| fig-cap: "What is the most sensible interpretation of the results represented by the confidence interval for the regression coefficient, which estimates brand awareness from campaign exposure?"
#| echo: false
#| warning: false


#REPLACED BY STATIC IMAGE

# Display an x-axis labeled "Effect size" with values "none
# (H0)"/"tiny"/"small"/"moderate"/"large" with a vertical line (unlabeled) at
# "none (H0)". As in app sig-effect-power.
# Generate a confidence interval for a positive effect and represent it by a
# horizontal errorbar with the sample value (point estimate) as a fat dot.
# Successive confidence intervals should differ on one or two of the following characteristics:
# (1) includes/excludes H0,
# (2) small/wide,
# (3) tiny versus moderate-large effect size.
# Add a slider to adjust sample size (range [10, 250], initial setting 30), which changes the width of the confidence interval.
# Finally, add a button to generate a new confidence interval.
d <- data.frame(x = c(6, 5, 4, 3, 2, 1),
                 lb95 = c(-2, -0.1, 0.1, -0.1, 0.1, 2),
                 ub95 = c(2.6, 0.7, 0.5, 3.2, 3.0, 3.0),
                 lab = c("A", "B", "C", "D", "E", "F"))
d$point <- (d$ub95 + d$lb95)/2
ggplot2::ggplot(d) +
  geom_errorbar(aes(x = x, ymin = lb95, ymax = ub95), colour = brewercolors["Blue"]) +
  geom_point(aes(x = x, y = point), size = 3, colour = brewercolors["Blue"]) +
  geom_hline(yintercept = 0, colour = brewercolors["Red"]) +
  geom_text(aes(x = x, y = ub95, label = lab), nudge_y = 0.15) +
  scale_x_continuous(name = "", breaks = NULL) +
  scale_y_continuous(name = "Unstandardized effect size (b)",
                     breaks = c(-2, -1, 0, 0.3, 1.55, 2.5),
                     sec.axis = sec_axis(~./5.6, name = "Standardized effect size (b*)",
                     breaks = c(-0.3, -0.1, 0, 0.1, 0.3, 0.5),
                     labels = c("-0.3\nmoderate", "-0.1\nweak", "0\nH0", "0.1\nweak", "0.3\nmoderate", "0.5\nstrong"))) +
  coord_flip() +
  theme_general()