---
output:
  pdf_document: default
  html_document: default
---
# Exercise Solutions {-}

## Chapter 1 {-}

### Exercise 1.1 {-}

1. According to the information in the question, the polling was conducted among 500 registered voters. These 500 registered voters correspond to the sample.

2. The percentage, among all registered voters of the given party, of those that prefer a male candidate is a parameter. This quantity is a characteristic of the population.

3. It is given that 42% of the sample prefer a female candidate. This quantity is a numerical characteristic of the data, of the sample. Hence, it is a statistic.

4. The voters in the state who are registered to the given party form the target population.

### Exercise 1.2 {-}

One may read the data into `R` and create a table using the code:

```{r}
    n.cost <- c(4,2,1,1,0,2,1,2,4,2,5,3,1,5,1,5,1,2,1,1,3,4,2,4,3)
    table(n.cost)
```

For convenience, one may also create the bar plot of the data using the code:

```{r, eval=FALSE, echo=TRUE}
plot(table(n.cost))
```

1. The number of days in which 5 customers were waiting is 3, since the frequency of the value “5” in the data is 3. That can be seen from the table by noticing that the number below the value “5” is 3. It can also be seen from the bar plot by observing that the height of the bar above the value “5” is equal to 3.

2. The number of waiting customers that occurred the largest number of times is 1. The value “1” occurred 8 times, more than any other value. Notice that the bar above this value is the highest.

3. The value “0”, which occurred only once, occurred the least number of times.
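
The answers above can also be read off programmatically rather than by eye. The following is a minimal sketch; the object names `waiting` and `waiting.tab` are introduced here for illustration only, and the data is repeated under a new name so that the chunk runs on its own:

```{r}
# Same data as above, under a new name so the chunk is self-contained
waiting <- c(4,2,1,1,0,2,1,2,4,2,5,3,1,5,1,5,1,2,1,1,3,4,2,4,3)
waiting.tab <- table(waiting)
# Frequency of the value 5 (the table is indexed by the label of the value)
waiting.tab["5"]
# Value(s) with the largest and with the smallest frequency
names(waiting.tab)[as.vector(waiting.tab) == max(waiting.tab)]
names(waiting.tab)[as.vector(waiting.tab) == min(waiting.tab)]
```

Comparing against `max` and `min`, rather than using `which.max`, reports all tied values if several share the extreme frequency.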

## Chapter 2 {-}

### Exercise 2.1 {-}

1. The relative frequency of direct hits of category 1 is 0.3993. Notice that the cumulative relative frequency of category 1 and 2 hits, the sum of the relative frequency of both categories, is 0.6630. The relative frequency of category 2 hits is 0.2637. Consequently, the relative frequency of direct hits of category 1 is 0.6630 - 0.2637 = 0.3993.

2. The relative frequency of direct hits of category 4 or more is 0.0769. Observe that the cumulative relative frequency of the value “3” is 0.6630 + 0.2601 = 0.9231. This follows from the fact that the cumulative relative frequency of the value “2” is 0.6630 and the relative frequency of the value “3” is 0.2601. The total cumulative relative frequency is 1.0000. The relative frequency of direct hits of category 4 or more is the difference between the total cumulative relative frequency and the cumulative relative frequency of the value “3”: 1.0000 - 0.9231 = 0.0769.
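
Both parts rely on the same idea: relative frequencies are the successive differences of the cumulative relative frequencies. A short sketch of that computation follows; the names `cum.rel.hits` and `rel.hits` are chosen here for illustration, and the numbers are the ones quoted in the solution:

```{r}
# Cumulative relative frequencies quoted in the solution,
# for categories 1, 2, 3 and "4 or more"
cum.rel.hits <- c(0.3993, 0.6630, 0.9231, 1.0000)
# diff() undoes cumsum(); prepending 0 recovers the first category as well
rel.hits <- diff(c(0, cum.rel.hits))
names(rel.hits) <- c("1", "2", "3", "4+")
round(rel.hits, 4)
```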

### Exercise 2.2 {-}

1. The total number of cows that were involved in this study is 45. The object “`freq`” contains the table of frequency of the cows, divided according to the number of calves that they had. The cumulative frequency of all the cows that had 7 calves or less, which includes all cows in the study, is reported under the number “7” in the output of the expression “`cumsum(freq)`”. This number is 45.

2. The number of cows that gave birth to a total of 4 calves is 10. Indeed, the cumulative frequency of cows that gave birth to 4 calves or less is 28. The cumulative frequency of cows
that gave birth to 3 calves or less is 18. The frequency of cows that gave birth to exactly 4 calves is the difference between these two numbers: 28 - 18 = 10.

3. The relative frequency of cows that gave birth to at least 4 calves is $27/45 = 0.6$. Notice that the cumulative frequency of cows that gave birth to at most 3 calves is 18. The total number of cows is 45. Hence, the number of cows with 4 or more calves is the difference between these two numbers: 45 - 18 = 27. The relative frequency of such cows is the ratio between this number and the total number of cows: $27/45 = 0.6$.

## Chapter 3 {-}

### Exercise 3.1 {-}

1. Consider the data “`x1`”. From the summary we see that it is distributed in the range between 0 and slightly below 5. The central 50% of the distribution is located between 2.5 and 3.8. The mean and median are approximately equal to each other, which suggests an approximately symmetric distribution. Consider the histograms in Figure \@ref(fig:DStat9). Histograms 1 and 3 correspond to distributions in the appropriate range. However, the distribution in Histogram 3 is concentrated in lower values than suggested by the given first and third quartiles. Consequently, we match the summary of “`x1`” with Histogram 1.

    Consider the data “`x2`". Again, the distribution is in the range
    between 0 and slightly below 5. The central 50% of the distribution are
    located between 0.6 and 1.8. The mean is larger than the median, which
    suggests a distribution skewed to the right. Therefore, we match the
    summary of “`x2`" with Histogram 3.

    For the data in “`x3`” we may note that the distribution is in the range
    between 2 and 6. The histogram that fits this description is
    Histogram 2.

    The box plot is essentially a graphical representation of the
    information presented by the function “`summary`”. Following the
    rationale of matching the summary with the histograms we may obtain that
    Histogram 1 should be matched with Box-plot 2 in
    Figure \@ref(fig:DStat10), Histogram 2 matches Box-plot 3, and
    Histogram 3 matches Box-plot 1. Indeed, it is easier to match the box
    plots with the summaries. However, it is a good idea to practice the
    direct matching of histograms with box plots.

2. The data in “`x1`” fit
Box-plot 2 in Figure \@ref(fig:DStat10). The value 0.000 is the
smallest value in the data and it corresponds to the smallest point in
the box plot. Since this point is below the bottom whisker it follows
that it is an outlier. More directly, we may note that the
inter-quartile range is equal to $IQR = 3.840 - 2.498 = 1.342$. The
lower threshold is equal to $2.498  - 1.5 \times 1.342 = 0.485$, which
is larger than the given value. Consequently, the given value 0.000 is
an outlier.

3. Observe that the data in
“`x3`” fit Box-plot 3 in Figure \@ref(fig:DStat10). The value
6.414 is the largest value in the data and it corresponds to the
endpoint of the upper whisker in the box plot and is not an outlier.
Alternatively, we may note that the inter-quartile range is equal to
$IQR =  4.690 - 3.391 = 1.299$. The upper threshold is equal to
$4.690 + 1.5 \times 1.299 = 6.6385$, which is larger than the given
value. Consequently, the given value 6.414 is not an outlier.
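
The two threshold computations follow one rule, which can be wrapped in a small function. This is only a sketch: `outlier.fences` is a hypothetical helper, not a function from the book or from base `R`, and the quartiles passed to it are the ones quoted in the solution:

```{r}
# Hypothetical helper: lower and upper outlier thresholds
# Q1 - 1.5*IQR and Q3 + 1.5*IQR for given quartiles
outlier.fences <- function(q1, q3) {
  iqr <- q3 - q1
  c(lower = q1 - 1.5*iqr, upper = q3 + 1.5*iqr)
}
# Quartiles of "x1" and "x3" as quoted in the solution
fences.x1 <- outlier.fences(2.498, 3.840)
fences.x3 <- outlier.fences(3.391, 4.690)
fences.x1
fences.x3
# Is the minimum of "x1" below the lower threshold?
0.000 < fences.x1["lower"]
# Is the maximum of "x3" above the upper threshold?
6.414 > fences.x3["upper"]
```

The first comparison returns `TRUE` (an outlier) and the second returns `FALSE` (not an outlier), in agreement with the arithmetic above.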

### Exercise 3.2 {-}

1. In order to compute the mean
of the data we may write the following simple `R` code:

    ```{r}
    x.val <- c(2,4,6,8,10)
    freq <- c(10,6,10,2,2)
    rel.freq <- freq/sum(freq)
    x.bar <- sum(x.val*rel.freq)
    x.bar
    ```

    We created an object “`x.val`” that contains the unique values of the
    data and an object “`freq`” that contains the frequencies of the values.
    The object “`rel.freq`” contains the relative frequencies, the ratios
    between the frequencies and the number of observations. The average is
    computed as the sum of the products of the values with their relative
    frequencies. It is stored in the object “`x.bar`” and obtains the value
    4.666667.

    An alternative approach is to reconstruct the original data from the
    frequency table. A simple trick that will do the job is to use the
    function “`rep`". The first argument to this function is a sequence of
    values. If the second argument is a sequence of the same length that
    contains integers then the output will be composed of a sequence that
    contains the values of the first sequence, each repeated a number of
    times indicated by the second argument. Specifically, if we pass to
    this function the unique values “`x.val`” and the frequencies of the values
    “`freq`” then the output will be the sequence of values of the original
    sequence “`x`”:

    ```{r}
    x <- rep(x.val,freq)
    x
    mean(x)
    ```

    Observe that when we apply the function “`mean`" to “`x`" we get again
    the value 4.666667.

2. In order to compute the
sample standard deviation we may compute first the sample variance and
then take the square root of the result:

    ```{r}
    var.x <- sum((x.val-x.bar)^2*freq)/(sum(freq)-1)
    sqrt(var.x)
    ```

    Notice that the expression “`sum((x.val-x.bar)^2*freq)`” computes the sum
    of squared deviations. The expression “`(sum(freq)-1)`” produces the
    number of observations minus 1 ($n-1$). The ratio of the two gives the
    sample variance.

    Alternatively, had we produced the object “`x`" that contains the data,
    we may apply the function “`sd`" to get the sample standard deviation:

    ```{r}
    sd(x)
    ```

    Observe that in both forms of computation we obtain the same result:
    2.425914.

3. In order to compute the
median one may produce the table of cumulative relative frequencies of
“`x`":

    ```{r}
    data.frame(x.val,cumsum(rel.freq))
    ```

    Recall that the object “`x.val`" contains the unique values of the data.
    The expression “`cumsum(rel.freq)`" produces the cumulative relative
    frequencies. The function “`data.frame`" puts these two variables into a
    single data frame and provides a clearer representation of the results.

    Notice that more than 50% of the observations have value 4 or less.
    However, strictly less than 50% of the observations have value 2 or
    less. Consequently, the median is 4. (If the value of the cumulative
    relative frequency at 4 had been exactly 50% then the median
    would have been the average between 4 and the value larger than 4.)

    If we have produced the values of the data “`x`” then we may
    apply the function “`summary`” to it and obtain the median this way:

    ```{r}
    summary(x)
    ```

4. As for the inter-quartile
range (IQR) notice that the first quartile is 2 and the third quartile
is 6. Hence, the inter-quartile range is equal to 6 - 2 = 4. The
quartiles can be read directly from the output of the function
“`summary`” or can be obtained from the data frame of the cumulative
relative frequencies. For the latter observe that more than 25% of the
data are less than or equal to 2 and more than 75% of the data are less than or equal
to 6 (with strictly less than 75% less than or equal to 4).

5. In order to answer the last
question we conduct the computation:
$(10 - 4.666667)/2.425914 = 2.198484$. We conclude that the value 10 is
approximately 2.1985 standard deviations above the mean.
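
The reading of the median and the quartiles from the cumulative relative frequencies, and the standardized distance of the last part, can be sketched in a few lines. The names `vals`, `cnts`, `cum.rel`, `first.above` and `full.x` are introduced here for illustration and do not appear in the book:

```{r}
# Frequency table from the exercise, under new names
vals <- c(2,4,6,8,10)
cnts <- c(10,6,10,2,2)
cum.rel <- cumsum(cnts)/sum(cnts)
# Smallest value whose cumulative relative frequency exceeds p.
# This ignores the tie case in which the cumulative relative frequency
# equals p exactly (see the remark in part 3).
first.above <- function(p) vals[which(cum.rel > p)[1]]
c(Q1 = first.above(0.25), median = first.above(0.5), Q3 = first.above(0.75))
# Standardized distance of the value 10 from the mean (part 5)
full.x <- rep(vals, cnts)
(10 - mean(full.x))/sd(full.x)
```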


## Chapter 4 {-}

### Exercise 4.1 {-}

1. Consult
Table \@ref(tab:tab4). The probabilities of the different values
of $Y$ are $\{p, 2p, \ldots, 6p\}$. These probabilities sum to 1,
consequently

$$p + 2p + 3 p + 4 p + 5 p + 6p = (1+2+3+4+5+6)p = 21 p = 1 \Longrightarrow p = 1/21\;.$$

2. The event $\{Y < 3\}$ contains
the values 0, 1 and 2. Therefore,

$$\Prob(Y < 3) = \Prob(Y=0) + \Prob(Y=1) + \Prob(Y=2) = \frac{1}{21} + \frac{2}{21} + \frac{3}{21} = \frac{6}{21}= 0.2857\;.$$

3. The event $\{Y = \mbox{odd}\}$
contains the values 1, 3 and 5. Therefore,

$$\Prob(Y = \mbox{odd}) = \Prob(Y=1) + \Prob(Y=3) + \Prob(Y=5) = \frac{2}{21} + \frac{4}{21} + \frac{6}{21} = \frac{12}{21}= 0.5714\;.$$

4. The event $\{1 \leq Y < 4\}$
contains the values 1, 2 and 3. Therefore,

$$\Prob(1 \leq Y < 4) = \Prob(Y=1) + \Prob(Y=2) + \Prob(Y=3) = \frac{2}{21} + \frac{3}{21} + \frac{4}{21} = \frac{9}{21}= 0.4286\;.$$

5. The event $\{|Y -3| < 1.5\}$
contains the values 2, 3 and 4. Therefore,

$$\Prob(|Y -3| < 1.5) = \Prob(Y=2) + \Prob(Y=3) + \Prob(Y=4) = \frac{3}{21} + \frac{4}{21} + \frac{5}{21} = \frac{12}{21}= 0.5714\;.$$

6. The values that the random
variable $Y$ obtains are the numbers 0, 1, 2, …, 5, with probabilities
$\{1/21, 2/21, \ldots, 6/21\}$, respectively. The expectation is
obtained by the multiplication of the values by their respective
probabilities and the summation of the products. Let us carry out the
computation in `R`:

    ```{r}
    Y.val <- c(0,1,2,3,4,5)
    P.val <- c(1,2,3,4,5,6)/21
    E <- sum(Y.val*P.val)
    E
    ```

    We obtain an expectation $\Expec(Y) = 3.3333$.

7. The values that the random
variable $Y$ obtains are the numbers 0, 1, 2, …, 5, with probabilities
$\{1/21, 2/21, \ldots, 6/21\}$, respectively. The expectation is equal
to $\Expec(Y) = 3.333333$. The variance is obtained by the
multiplication of the squared deviation from the expectation of the
values by their respective probabilities and the summation of the
products. Let us carry out the computation in `R`:

    ```{r}
    Var <- sum((Y.val-E)^2*P.val)
    Var
    ```

    We obtain a variance $\Var(Y) = 2.2222$.

8. The standard deviation is the
square root of the variance: $\sqrt{\Var(Y)} = \sqrt{2.2222} = 1.4907$.
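
The event probabilities in parts 2 to 5 can be checked by summing the probabilities of the values that satisfy each condition. The sketch below uses logical indexing for that; the names `y.vals` and `y.probs` are chosen here so that the chunk is self-contained:

```{r}
# Values and probabilities of Y, under new names
y.vals <- 0:5
y.probs <- (1:6)/21
sum(y.probs[y.vals < 3])                 # part 2
sum(y.probs[y.vals %% 2 == 1])           # part 3
sum(y.probs[y.vals >= 1 & y.vals < 4])   # part 4
sum(y.probs[abs(y.vals - 3) < 1.5])      # part 5
# Variance via the identity Var(Y) = E(Y^2) - E(Y)^2
sum(y.vals^2*y.probs) - sum(y.vals*y.probs)^2
```

The last line is an alternative route to the variance of part 7 and should agree with the value obtained from the squared deviations.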

### Exercise 4.2 {-}

1. An outcome of the game of chance
may be represented by a sequence of length three composed of the letters
“H” and “T”. For example, the sequence “THH” corresponds to the case
where the first toss produced a “Tail”, the second a “Head” and the
third a “Head”.

    With this notation we obtain that the possible outcomes of the game are
    $\{\mbox{HHH}, \mbox{THH},\mbox{HTH}, \mbox{TTH},\mbox{HHT}, \mbox{THT},\mbox{HTT}, \mbox{TTT}\}$. All outcomes are equally likely. There are 8 possible outcomes and only
    one of them corresponds to winning. Consequently, the probability of winning is 1/8.

2. Consider the previous solution.
One loses if any other of the outcomes occurs. Hence, the probability
of losing is 7/8.

3. Denote the gain of the player by
$X$. The random variable $X$ may obtain two values: 10-2 = 8 if the
player wins and -2 if the player loses. The probabilities of these
values are {1/8, 7/8}, respectively. Therefore, the expected gain, the
expectation of $X$, is:

$$\Expec(X) = 8 \times \frac{1}{8} + (-2) \times \frac{7}{8} =-0.75\;.$$
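
The enumeration of the outcomes and the expected gain can be reproduced by listing all sequences explicitly. This is a sketch under one labeled assumption: the solution does not say which single outcome wins, so the code takes it to be “HHH”. The names `tosses`, `win`, `p.win` and `gain` are illustrative:

```{r}
# All 2^3 sequences of three tosses
tosses <- expand.grid(t1 = c("H","T"), t2 = c("H","T"), t3 = c("H","T"))
nrow(tosses)
# ASSUMPTION: the single winning outcome is taken to be "HHH"
win <- with(tosses, t1 == "H" & t2 == "H" & t3 == "H")
# All outcomes are equally likely, so probabilities are plain proportions
p.win <- mean(win)
# Net gain: 10 - 2 on a win, -2 otherwise
gain <- ifelse(win, 10 - 2, -2)
c(p.win = p.win, p.lose = 1 - p.win, expected.gain = mean(gain))
```

Any other choice of the single winning sequence would give the same three numbers, since only the count of winning outcomes enters the computation.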

## Chapter 5 {-}
### Exercise 5.1 {-}

1. The Binomial distribution is a
reasonable model for the number of people that develop high fever as
a result of the vaccination. Let $X$ be the number of people that do so in
a given day. Hence, $X \sim \mbox{Binomial}(500,0.09)$. According to the
formula for the expectation in the Binomial distribution, since $n=500$
and $p=0.09$, we get that:

$$\Expec(X) = n p = 500 \times 0.09 = 45\;.$$

2. Let
$X \sim \mbox{Binomial}(500,0.09)$. Using the formula for the variance
for the Binomial distribution we get that:

$$\Var(X) = n p(1-p) = 500 \times 0.09\times 0.91 = 40.95\;.$$ Hence,
since $\sqrt{\Var(X)} = \sqrt{40.95} = 6.3992$, the standard deviation is
6.3992.

3. Let
$X \sim \mbox{Binomial}(500,0.09)$. The probability that more than 40
people will develop a reaction may be computed as the difference between
1 and the probability that 40 people or less will develop a reaction:

$$\Prob(X > 40) = 1- \Prob(X \leq 40)\;.$$ The probability can be
computed with the aid of the function “`pbinom`” that produces the
cumulative probability of the Binomial distribution:

    ```{r}
    1 - pbinom(40,500,0.09)
    ```

4. The probability that the number of
people that will develop a reaction is between 45 and 50 (inclusive) is
the difference between $\Prob(X\leq 50)$ and
$\Prob(X < 45) = \Prob(X \leq 44)$. Apply the function “`pbinom`” to
get:

    ```{r}
    pbinom(50,500,0.09) - pbinom(44,500,0.09)
    ```
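
As a cross-check, the same probabilities can be obtained by summing point probabilities with `dbinom` instead of differencing cumulative probabilities. The sketch below does that; `n.vac` and `p.fev` are names introduced here for the parameters $n=500$ and $p=0.09$:

```{r}
n.vac <- 500
p.fev <- 0.09
# Expectation and standard deviation (parts 1 and 2)
c(mean = n.vac*p.fev, sd = sqrt(n.vac*p.fev*(1 - p.fev)))
# P(X > 40) as a sum of point probabilities over 41, ..., 500
sum(dbinom(41:n.vac, n.vac, p.fev))
# P(45 <= X <= 50) as a sum over 45, ..., 50
sum(dbinom(45:50, n.vac, p.fev))
```

Writing the event as an explicit range of values makes it harder to be off by one at the endpoints, which is the usual pitfall when subtracting two calls to `pbinom`.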

### Exercise 5.2 {-}

1. The plots can be produced with the
following code, which should be run one line at a time:

    ```{r, out.width = '60%', fig.align = "center"}
    x <- 0:15
    plot(x,dnbinom(x,2,0.5),type="h")
    plot(x,dnbinom(x,4,0.5),type="h")
    plot(x,dnbinom(x,8,0.8),type="h")
    ```

    The first plot, that corresponds to
    $X_1 \sim \mbox{Negative-Binomial}(2,0.5)$, fits Barplot 3. Notice that
    the distribution tends to obtain smaller values and that the probability
    of the value “0" is equal to the probability of the value “1".

    The second plot, the one that corresponds to
    $X_2 \sim \mbox{Negative-Binomial}(4,0.5)$, is associated with
    Barplot 1. Notice that the distribution tends to obtain larger values.
    For example, the probability of the value “10" is substantially larger
    than zero, where for the other two plots this is not the case.

    The third plot, the one that corresponds to
    $X_3 \sim \mbox{Negative-Binomial}(8,0.8)$, matches Barplot 2. Observe
    that this distribution tends to produce smaller probabilities for the
    small values as well as for the larger values. Overall, it is more
    concentrated than the other two.

2. Barplot 1 corresponds to a
distribution that tends to obtain larger values than the other two
distributions. Consequently, the expectation of this distribution should
be larger. The conclusion is that the pair $\Expec(X) = 4$,
$\Var(X) = 8$ should be associated with this distribution.

    Barplot 2 describes a distribution that produces smaller probabilities
    for the small values as well as for the larger values and is more
    concentrated than the other two. The expectations of the two remaining
    distributions are equal to each other and the variance of the pair
    $\Expec(X) = 2$, $\Var(X) =  2.5$ is smaller. Consequently, this is the
    pair that should be matched with this bar plot.

    This leaves only Barplot 3, that should be matched with the pair
    $\Expec(X) = 2$, $\Var(X) =  4$.
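
The matching can be confirmed from the parameters directly. In the parameterization used by `dnbinom` (the number of failures before the $r$-th success) the expectation is $r(1-p)/p$ and the variance is $r(1-p)/p^2$. The helper `nb.moments` below is a hypothetical name introduced for this sketch:

```{r}
# Moments of the Negative-Binomial distribution in R's parameterization
# (number of failures before the r-th success)
nb.moments <- function(r, p) c(mean = r*(1 - p)/p, var = r*(1 - p)/p^2)
rbind(X1 = nb.moments(2, 0.5),
      X2 = nb.moments(4, 0.5),
      X3 = nb.moments(8, 0.8))
# Numerical check of the expectation of X3, truncating the infinite sum at 200
k <- 0:200
sum(k*dnbinom(k, 8, 0.8))
```

The rows reproduce the three pairs of the exercise: $X_1$ has expectation 2 and variance 4, $X_2$ has expectation 4 and variance 8, and $X_3$ has expectation 2 and variance 2.5.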

## Chapter 6 {-}
### Exercise 6.1 {-}

1. Let $X$ be the total weight of 8
people. By the assumption, $X \sim \mbox{Normal}(560, 57^2)$. We are
interested in the probability $\Prob(X > 650)$. This probability is
equal to the difference between 1 and the probability
$\Prob(X \leq 650)$. We use the function “`pnorm`" in order to carry out
the computation:

    ```{r}
    1 - pnorm(650,560,57)
    ```

    We get that the probability that the total weight of 8 people exceeds
    650kg is equal to 0.05717406.

2. Let $Y$ be the total weight of 9
people. By the assumption, $Y \sim \mbox{Normal}(630, 61^2)$. We are
interested in the probability $\Prob(Y > 650)$. This probability is
equal to the difference between 1 and the probability
$\Prob(Y \leq 650)$. We use again the function “`pnorm`" in order to
carry out the computation:

    ```{r}
    1 - pnorm(650,630,61)
    ```

    We get that the probability that the total weight of 9 people exceeds
    650kg is much higher and is equal to 0.3715054.

3. Again,
$X \sim \mbox{Normal}(560, 57^2)$, where $X$ is the total weight of 8
people. In order to find the central region that contains 80% of the
distribution we need to identify the 10%-percentile and the
90%-percentile of $X$. We use the function “`qnorm`" in the code:

    ```{r}
    qnorm(0.1,560,57)
    qnorm(0.9,560,57)
    ```

    The requested region is the interval \[486.9516, 633.0484\].

4. As before,
$Y \sim \mbox{Normal}(630, 61^2)$, where $Y$ is the total weight of 9
people. In order to find the central region that contains 80% of the
distribution we need to identify the 10%-percentile and the
90%-percentile of $Y$. The computation this time produces:

    ```{r}
    qnorm(0.1,630,61)
    qnorm(0.9,630,61)
    ```

    and the region is \[551.8254, 708.1746\].
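
All four parts reduce to the standard Normal distribution after standardization: the central region is the mean plus or minus a multiple of the standard deviation, and a tail probability depends only on the distance from the mean in standard deviation units. A sketch of this view, with the illustrative names `z.80`, `region.8` and `region.9`:

```{r}
# Central 80% region as mean -/+ z * sd, with z the 90%-percentile
# of the standard Normal distribution
z.80 <- qnorm(0.9)
region.8 <- 560 + c(-1, 1)*z.80*57
region.9 <- 630 + c(-1, 1)*z.80*61
region.8
region.9
# The upper-tail probabilities of parts 1 and 2 via standardization
1 - pnorm((650 - 560)/57)
1 - pnorm((650 - 630)/61)
```

The same multiplier `z.80` serves both regions, so the region for 9 people is wider only because its standard deviation is larger.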

### Exercise 6.2 {-}

1. The probability $\Prob(X > 11)$ can be
  computed as the difference between 1 and the probability
  $\Prob(X \leq 11)$. The latter probability can be computed with the
  function “`pbinom`":

    ```{r}
    1 - pbinom(11,27,0.32)
    ```

    Therefore, $\Prob(X > 11) = 0.1203926$.

2. Refer again to the probability
$\Prob(X > 11)$. A formal application of the Normal approximation
replaces in the computation the Binomial distribution by the Normal
distribution with the same mean and variance. Here
$\Expec(X) = n \cdot p = 27 \cdot 0.32 = 8.64$ and
$\Var(X) = n \cdot p \cdot (1-p) = 27 \cdot 0.32 \cdot 0.68 = 5.8752$.
If we take $X \sim \mbox{Normal}(8.64,5.8752)$ and use the function
“`pnorm`” we get:

    ```{r}
    1 - pnorm(11,27*0.32,sqrt(27*0.32*0.68))
    ```

    Therefore, the current Normal approximation proposes
    $\Prob(X > 11) \approx 0.1651164$.

3. The continuity correction, which
considers an interval of radius 0.5 about each value, replaces
$\Prob(X > 11)$, which involves the values $\{12, 13, \ldots, 27\}$, by
the probability $\Prob(X > 11.5)$. The Normal approximation uses the Normal
distribution with the same mean and variance, namely $\Expec(X) =  8.64$
and $\Var(X) = 5.8752$. If we take $X \sim \mbox{Normal}(8.64,5.8752)$
and use the function “`pnorm`” we get:

    ```{r}
    1 - pnorm(11.5,27*0.32,sqrt(27*0.32*0.68))
    ```

    The Normal approximation with continuity correction proposes
    $\Prob(X > 11) \approx 0.1190149$.

4. The Poisson approximation replaces the
Binomial distribution by the Poisson distribution with the same
expectation. The expectation is
$\Expec(X) = n \cdot p = 27 \cdot 0.32 = 8.64$. If we take
$X \sim \mbox{Poisson}(8.64)$ and use the function “`ppois`" we get:

    ```{r}
    1 - ppois(11,27*0.32)
    ```

    Therefore, the Poisson approximation proposes
    $\Prob(X > 11) \approx 0.1635$.
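
To compare the approximations side by side, one may collect the exact probability and the three approximations in a single vector. The following is a minimal sketch; the names `n.tr`, `p.tr`, `mu.tr`, `sd.tr` and `approx.tab` are introduced here for illustration:

```{r}
n.tr <- 27
p.tr <- 0.32
mu.tr <- n.tr*p.tr
sd.tr <- sqrt(n.tr*p.tr*(1 - p.tr))
approx.tab <- c(
  exact     = 1 - pbinom(11, n.tr, p.tr),
  normal    = 1 - pnorm(11, mu.tr, sd.tr),
  normal.cc = 1 - pnorm(11.5, mu.tr, sd.tr),
  poisson   = 1 - ppois(11, mu.tr)
)
# Each value next to its absolute distance from the exact probability
cbind(value = approx.tab, abs.error = abs(approx.tab - approx.tab["exact"]))
```

In this example the Normal approximation with continuity correction is the closest to the exact Binomial probability. The Poisson approximation is not expected to do well here, since it is designed for a large $n$ and a small $p$, whereas $p = 0.32$ is not small.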

## Chapter 7 {-}
### Exercise 7.1 {-}

1. After placing the file “`pop2.csv`" in the working directory one may produce
a data frame with the content of the file and compute the average of the
variable “`bmi`" using the code:

    ```{r}
    pop.2 <- read.csv(file="_data/pop2.csv")
    mean(pop.2$bmi)
    ```

    We obtain that the population average of the variable is equal to 24.98446.

2. Applying the function “`sd`" to the
sequence of population values produces the population standard
deviation:

    ```{r}
    sd(pop.2$bmi)
    ```

    It turns out that the standard deviation of the measurement is 4.188511.

3. In order to compute the expectation
under the sampling distribution of the sample average we conduct a
simulation. The simulation produces an approximation of the sampling
distribution of the sample average. The sampling distribution is
represented by the content of the sequence “`X.bar`”:

    ```{r}
    X.bar <- rep(0,10^5)
    for(i in 1:10^5) {
      X.samp <- sample(pop.2$bmi,150)
      X.bar[i] <- mean(X.samp)
    }
    mean(X.bar)
    ```

    Initially, we produce a vector of zeros of the given length (100,000).
    In each iteration of the “`for`" loop a random sample of size 150 is
    selected from the population. The sample average is computed and stored
    in the sequence “`X.bar`". At the end of all the iterations all the
    zeros are replaced by evaluations of the sample average.

    The expectation of the sampling distribution of the sample average is
    computed by the application of the function “`mean`” to the sequence
    that represents the sampling distribution of the sample average. The
    result for the current simulation is 24.98681, which is very similar[^3] to the
    population average 24.98446.

4. The standard deviation of the sample
average under the sampling distribution is computed using the function
“`sd`":

    ```{r}
    sd(X.bar)
    ```

    The resulting standard deviation is 0.3422717. Recall that the standard
    deviation of a single measurement is equal to 4.188511 and that the
    sample size is $n=150$. The ratio between the standard deviation of the
    measurement and the square root of 150 is
    $4.188511/\sqrt{150} =0.3419905$, which is similar in value to the
    standard deviation of the sample average[^4].

5. The central region that contains 80%
of the sampling distribution of the sample average can be identified
with the aid of the function “`quantile`":

    ```{r}
    quantile(X.bar,c(0.1,0.9))
    ```

    The value 24.54972 is the 10%-percentile of the sampling distribution.
    To the left of this value are 10% of the distribution. The value
    25.42629 is the 90%-percentile of the sampling distribution. To the
    right of this value are 10% of the distribution. Between these two
    values are 80% of the sampling distribution.

6. The Normal approximation, which is
the conclusion of the Central Limit Theorem, substitutes the sampling
distribution of the sample average by the Normal distribution with the
same expectation and standard deviation. The percentiles are computed
with the function “`qnorm`”:

    ```{r}
    qnorm(c(0.1,0.9),mean(X.bar),sd(X.bar))
    ```

    Observe that we used the expectation and the standard deviation of the
    sample average in the function. The resulting interval is
    $[24.54817, 25.42545]$, which is similar to the interval
    $[24.54972, 25.42629]$ which was obtained via simulations.
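
The same simulation can be written more compactly with `replicate`, and it can be tried without the data file. The sketch below is not the book's computation: it replaces the file “`pop2.csv`” by a hypothetical Normal population `pop.sim`, whose mean and standard deviation are merely set close to the values reported above, and it uses fewer iterations so that it runs quickly. All object names are illustrative:

```{r}
set.seed(1)
# HYPOTHETICAL population standing in for the "bmi" column of pop2.csv;
# the mean and sd are merely set close to the values reported above
pop.sim <- rnorm(10^5, 25, 4.2)
n.samp <- 150
# replicate() repeats the sampling step and collects the sample averages
x.bar.sim <- replicate(2000, mean(sample(pop.sim, n.samp)))
# Simulated sd of the sample average vs. the sd/sqrt(n) formula
c(simulated = sd(x.bar.sim), formula = sd(pop.sim)/sqrt(n.samp))
# Central 80% region: simulated quantiles vs. the Normal approximation
quantile(x.bar.sim, c(0.1, 0.9))
qnorm(c(0.1, 0.9), mean(x.bar.sim), sd(x.bar.sim))
```

With the actual data one would replace `pop.sim` by `pop.2$bmi`. The numbers produced by this sketch depend on the simulated population and are not expected to match the ones quoted in the solution.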

### Exercise 7.2 {-}

1. Denote by $X$ the distance from the
specified endpoint of a random hit. Observe that
$X \sim \mbox{Uniform}(0,10)$. The 25 hits form a sample
$X_1, X_2, \ldots, X_{25}$ from this distribution and the sample average
$\bar X$ is the average of these random locations. The expectation of
the average is equal to the expectation of a single measurement. Since
$\Expec(X) = (a + b)/2 = (0 + 10)/2 = 5$ we get that
$\Expec(\bar X) = 5$.

2. The variance of the sample average
is equal to the variance of a single measurement, divided by the sample
size. The variance of the Uniform distribution is
$\Var(X) = (b - a)^2/12 = (10-0)^2/12 = 8.333333$. The standard
deviation of the sample average is equal to the standard deviation of a single
measurement, divided by the square root of the sample size. The sample
size is $n=25$. Consequently, the standard deviation of the average is
$\sqrt{8.333333/25}=0.5773503$.

3. The left-most third of the detector
is the interval to the left of 10/3. The distribution of the sample
average, according to the Central Limit Theorem, is Normal. The
probability of being less than 10/3 for the Normal distribution may be
computed with the function “`pnorm`":

    ```{r}
    mu <- 5
    sig <- sqrt(10^2/(12*25))
    pnorm(10/3,mu,sig)
    ```

    The expectation and the standard deviation of the sample average are
    used in computation of the probability. The probability is 0.001946209,
    about 0.2%.

4. The central region in the
$\mbox{Normal}(\mu,\sigma^2)$ distribution that contains 99% of the
distribution is of the form
$\mu \pm \mbox{\texttt{qnorm(0.995)}}\cdot \sigma$, where
“`qnorm(0.995)`" is the 99.5%-percentile of the Standard Normal
distribution. Therefore, $c =\mbox{\texttt{qnorm(0.995)}}\cdot \sigma$:

    ```{r}
    qnorm(0.995)*sig
    ```

    We get that $c=1.487156$.
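
The answers of this exercise rest on the Central Limit Theorem, so it may be reassuring to check them by simulation. The sketch below draws many samples of 25 hits from the $\mbox{Uniform}(0,10)$ distribution; the names `hits`, `hit.avg` and `c.99`, the seed, and the number of repetitions (20,000) are choices made here for illustration:

```{r}
set.seed(2)
# Each column of the matrix is one sample of 25 hits from Uniform(0,10)
hits <- matrix(runif(25*20000, 0, 10), nrow = 25)
hit.avg <- colMeans(hits)
# Standard deviation of the average: simulated vs. sqrt((b-a)^2/(12*n))
c(simulated = sd(hit.avg), formula = sqrt((10 - 0)^2/(12*25)))
# Share of simulated averages within c of the expectation 5
c.99 <- qnorm(0.995)*sqrt(10^2/(12*25))
mean(abs(hit.avg - 5) <= c.99)
```

The last proportion should be close to 0.99 if the Normal approximation is adequate for a sample of size 25.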

## Chapter 9 {-}
### Exercise 9.1 {-}

1. Let us read the data into a data
frame by the name “`magnets`" and apply the function “`summary`" to the
data frame:

    ```{r}
    magnets <- read.csv("_data/magnets.csv")
    summary(magnets)
    ```

    The variable “`change`" contains the difference between the patient’s
    rating before the application of the device and the rating after the
    application. The sample average of this variable is reported as the
    “`Mean`" for this variable and is equal to 3.5.

2. The variable “`active`" is a
factor. Observe that the summary of this variable lists the two levels
of the variable and the frequency of each level. Indeed, the levels are
coded with numbers but, nonetheless, the variable is a factor[^12].

3. Based on the hint we know that the
expressions “`change[1:29]`" and “`change[30:50]`" produce the values of
the variable “`change`" for the patients that were treated with active
magnets and by inactive placebo, respectively. We apply the function
“`mean`" to these sub-sequences:

    ```{r}
    mean(magnets$change[1:29])
    mean(magnets$change[30:50])
    ```

    The sample average for the patients that were treated with active
    magnets is 5.241379 and the sample average for the patients that were
    treated with inactive placebo is 1.095238.

4. We apply the function “`sd`" to
these sub-sequences:

    ```{r}
    sd(magnets$change[1:29])
    sd(magnets$change[30:50])
    ```

    The sample standard deviation for the patients that were treated with
    active magnets is 3.236568 and the sample standard deviation for the
    patients that were treated with inactive placebo is 1.578124.

5. We apply the function “`boxplot`"
to each sub-sequence:

    ```{r, out.width = '60%', fig.align = "center"}
    boxplot(magnets$change[1:29])
    boxplot(magnets$change[30:50])
    ```

    The first box-plot
    corresponds to the sub-sequence of the patients that received
    an active magnet. There are no outliers in this plot. The second box-plot
    corresponds to the sub-sequence of the patients that received
    an inactive placebo. Three values, “3", “4", and “5", are
    marked as outliers. Let us find the total number of
    observations that obtain these values:

    ```{r}
    table(magnets$change[30:50])
    ```

    One may see that a single observation obtained the value “3", another
    one obtained the value “5" and 2 observations obtained the value “4", a
    total of 4 outliers[^13]. Notice that the single point that is
    associated with the value “4" actually represents 2 observations and not
    one.
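
    Overlapping points can also be counted without reading the plot. The
    function “`boxplot.stats`" returns, in its component “`out`", every
    observation that the box plot draws as a separate point, with repeated
    values listed separately. The sequence “`y.toy`" below is a made-up
    example of 21 values and is only an assumption for illustration:

    ```{r}
    # Made-up sequence of 21 ratings (an assumption, not read from the file)
    y.toy <- c(rep(0,11), rep(1,5), 2, 3, 4, 4, 5)
    # Observations lying more than 1.5 box-lengths beyond the box
    y.out <- boxplot.stats(y.toy)$out
    y.out
    length(y.out)
    ```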

### Exercise 9.2 {-}

1. Let us run the following
simulation:

    ```{r}
    mu1 <- 3.5
    sig1 <- 3
    mu2 <- 3.5
    sig2 <- 1.5
    test.stat <- rep(0,10^5)
    for(i in 1:10^5) {
      X1 <- rnorm(29,mu1,sig1)
      X2 <- rnorm(21,mu2,sig2)
      X1.bar <- mean(X1)
      X2.bar <- mean(X2)
      X1.var <- var(X1)
      X2.var <- var(X2)
      test.stat[i] <- (X1.bar-X2.bar)/sqrt(X1.var/29 + X2.var/21)
    }
    quantile(test.stat,c(0.025,0.975))
    ```

    Observe that each iteration of the simulation involves the generation of
    two samples. One sample is of size 29 and it is generated from the
    $\mathrm{Normal}(3.5,3^2)$ distribution and the other sample is of size
    21 and it is generated from the $\mathrm{Normal}(3.5,1.5^2)$
    distribution. The sample average and the sample variance are computed
    for each sample. The test statistic is computed based on these averages
    and variances and it is stored in the appropriate position of the
    sequence “`test.stat`".

    The values of the sequence “`test.stat`" at the end of all the
    iterations represent the sampling distribution of the statistic. The
    application of the function “`quantile`" to the sequence gives the
    0.025-percentile and the 0.975-percentile of the sampling
    distribution, which are -2.014838 and 2.018435. It follows that the
    interval $[-2.014838, 2.018435]$ contains about 95% of the sampling
    distribution of the statistic.
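
    The claim regarding the 95% can be checked directly. The following is
    a sketch of the same simulation written with the function
    “`replicate`", using only $10^4$ iterations to keep it quick. The object
    names “`sim.stat`", “`sim.q`" and “`sim.cover`" are our own choice:

    ```{r}
    sim.stat <- replicate(10^4, {
      Y1 <- rnorm(29, 3.5, 3)
      Y2 <- rnorm(21, 3.5, 1.5)
      (mean(Y1) - mean(Y2))/sqrt(var(Y1)/29 + var(Y2)/21)
    })
    sim.q <- quantile(sim.stat, c(0.025, 0.975))
    # Proportion of the simulated statistics that fall inside the interval
    sim.cover <- mean(sim.stat >= sim.q[1] & sim.stat <= sim.q[2])
    sim.cover
    ```

    The resulting proportion should be about 0.95, whereas the end points of
    the interval vary slightly from one run of the simulation to the next.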

2. In order to evaluate the statistic
for the given data set we apply the same steps that were used in the
simulation for the computation of the statistic:

    ```{r}
    x1.bar <- mean(magnets$change[1:29])
    x2.bar <- mean(magnets$change[30:50])
    x1.var <- var(magnets$change[1:29])
    x2.var <- var(magnets$change[30:50])
    (x1.bar-x2.bar)/sqrt(x1.var/29 + x2.var/21)
    ```

    In the first line we compute the sample average for the first 29
    patients and in the second line we compute it for the last 21 patients.
    In the third and fourth lines we do the same for the sample variances of
    the two types of patients. Finally, in the fifth line we evaluate the
    statistic. The computed value of the statistic turns out to be 5.985601,
    a value that does not belong to the interval $[-2.014838, 2.018435]$.
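
    Since the same formula is used both in the simulation and for the data,
    it may be convenient to wrap it in a function. The helper “`welch.stat`"
    and the samples “`g1.toy`" and “`g2.toy`" below are hypothetical and
    serve only as an illustration. The built-in function “`t.test`", in its
    default form that does not assume equal variances, reports a statistic
    computed with the same formula, which gives a way to cross-check the
    computation:

    ```{r}
    # Hypothetical helper: difference of the averages divided by the
    # estimate of its standard deviation
    welch.stat <- function(x1, x2) {
      (mean(x1) - mean(x2))/sqrt(var(x1)/length(x1) + var(x2)/length(x2))
    }
    # Made-up samples (not the magnets data)
    g1.toy <- c(5, 7, 2, 9, 4, 6)
    g2.toy <- c(1, 0, 2, 1, 3)
    welch.stat(g1.toy, g2.toy)
    # Cross-check against the statistic reported by "t.test"
    unname(t.test(g1.toy, g2.toy)$statistic)
    ```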

## Chapter 10 {-}
### Exercise 10.1 {-}

1. We simulate the sampling
distribution of the average and the median in a sample generated from
the Normal distribution. In order to do so we copy the code that was
used in Subsection \@ref(ComparingEstimators), replacing the object
“`mid.range`" by the object “`X.med`" and using the function “`median`"
in order to compute the sample median instead of the computation of the
mid-range statistic:

    ```{r}
    mu <- 3
    sig <- sqrt(2)
    X.bar <- rep(0,10^5)
    X.med <- rep(0,10^5)
    for(i in 1:10^5) {
      X <- rnorm(100,mu,sig)
      X.bar[i] <- mean(X)
      X.med[i] <- median(X)
    }
    ```

    The sequence “`X.bar`" represents the sampling distribution of the
    sample average and the sequence “`X.med`" represents the sampling
    distribution of the sample median. We apply the function “`mean`" to
    these sequences in order to obtain the expectations of the estimators:

    ```{r}
    mean(X.bar)
    mean(X.med)
    ```

    The expectation of the measurement, the parameter of interest, is equal
    to 3. Observe that the expectations of the estimators are essentially equal
    to this expectation[^13]. Consequently, both estimators are unbiased
    estimators of the expectation of the measurement.

    In order to obtain the variances of the estimators we apply the function
    “`var`" to the sequences that represent their sampling distributions:

    ```{r}
    var(X.bar)
    var(X.med)
    ```

    Observe that the variance of the sample average is essentially equal to
    $0.020$ and the variance of the sample median is essentially equal to
    $0.0312$. The mean square error of an unbiased estimator is equal to its
    variance. Hence, these numbers represent the mean square errors of the
    estimators. It follows that the mean square error of the sample average
    is less than the mean square error of the sample median in the
    estimation of the expectation of a Normal measurement.
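
    The simulated variances can be compared with theoretical values. The
    variance of the sample average is $\sigma^2/n$. For the sample median
    of Normal measurements a standard large-sample approximation, which is
    not derived in this solution, is $\pi\sigma^2/(2n)$. The object names
    below are our own:

    ```{r}
    n.obs <- 100
    sig.sq <- 2
    # Variance of the sample average: sigma^2/n
    var.bar.theory <- sig.sq/n.obs
    var.bar.theory
    # Large-sample approximation for the variance of the sample median
    var.med.theory <- pi*sig.sq/(2*n.obs)
    var.med.theory
    ```

    The two values, 0.02 and approximately 0.0314, are close to the
    variances that were obtained in the simulation.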

2. We repeat the same steps as before
for the Uniform distribution. Notice that we use the parameters $a=0.5$
and $b=5.5$ the same way we did in
Subsection \@ref(ComparingEstimators). These parameters produce an
expectation $\Expec(X) = 3$ and a variance $\Var(X) = 2.083333$:

    ```{r}
    a <- 0.5
    b <- 5.5
    X.bar <- rep(0,10^5)
    X.med <- rep(0,10^5)
    for(i in 1:10^5) {
      X <- runif(100,a,b)
      X.bar[i] <- mean(X)
      X.med[i] <- median(X)
    }
    ```

    Applying the function “`mean`" to the sequences that represent the
    sampling distribution of the estimators we obtain that both estimators
    are essentially unbiased[^14]:

    ```{r}
    mean(X.bar)
    mean(X.med)
    ```

    Compute the variances:

    ```{r}
    var(X.bar)
    var(X.med)
    ```

    Observe that $0.021$ is, essentially, the value of the variance of the sample
    average[^15]. The variance of the sample median is essentially equal to
    0.061. The variance of each of the estimators is equal to its mean
    square error. This is the case since the estimators are unbiased.
    Consequently, we again obtain that the mean square error of the sample
    average is less than that of the sample median.
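
    The relation between the mean square error, the variance and the bias
    can also be examined numerically. The following sketch, which uses
    $10^4$ iterations and object names of our own choice, computes the mean
    square error of the sample median directly from its definition and
    compares it with the sum of the variance and the squared bias:

    ```{r}
    a.chk <- 0.5
    b.chk <- 5.5
    target <- (a.chk + b.chk)/2     # expectation of the Uniform(a,b), equal to 3
    med.sim <- replicate(10^4, median(runif(100, a.chk, b.chk)))
    # Mean square error computed directly from the definition
    mse.direct <- mean((med.sim - target)^2)
    # Variance (with divisor equal to the number of iterations) plus squared bias
    bias.sim <- mean(med.sim) - target
    mse.decomp <- mean((med.sim - mean(med.sim))^2) + bias.sim^2
    c(mse.direct, mse.decomp)
    ```

    The two numbers coincide, and since the bias is essentially zero both
    are close to the simulated variance of the sample median.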

### Exercise 10.2 {-}

1. Assuming that the file “`ex2.csv`"
is saved in the working directory, one may read the content of the file
into a data frame and produce a summary of the content of the data frame
using the code:

    ```{r}
    ex2 <- read.csv("_data/ex2.csv")
    summary(ex2)
    ```

    Examine the variable “`group`". Observe that the sample contains 37
    subjects with high levels of blood pressure. Dividing 37 by the sample
    size we get:

    ```{r}
    37/150
    ```

    Consequently, the sample proportion is $0.2466667$.

    Alternatively, we compute the sample proportion using the code:

    ```{r}
    mean(ex2$group == "HIGH")
    ```

    Notice that the expression “`ex2$group == "HIGH"`" produces a sequence of
    length 150 with logical entries. The entry is equal to “`TRUE`" if the
    equality holds and it is equal to “`FALSE`" if it does not[^17]. When the
    function “`mean`" is applied to a sequence with logical entries it
    produces the relative frequency of the `TRUE`s in the sequence. This
    corresponds, in the current context, to the sample proportion of the
    level “`HIGH`" in the variable “`ex2$group`".
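
    The mechanism is easy to see on a short sequence. The sequence
    “`grp.toy`" below is made up for the illustration and is not part of the
    data:

    ```{r}
    # Made-up sequence of levels (an assumption for illustration)
    grp.toy <- c("HIGH", "LOW", "LOW", "HIGH", "LOW")
    # Logical sequence: TRUE where the entry equals "HIGH"
    grp.toy == "HIGH"
    # TRUE counts as 1 and FALSE as 0, so the average is the relative frequency
    prop.toy <- mean(grp.toy == "HIGH")
    prop.toy
    # The same relative frequencies via a frequency table
    prop.table(table(grp.toy))
    ```

    Two of the five entries are equal to “`HIGH`", hence the relative
    frequency is 0.4.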

2. Make sure that the file
“`pop2.csv`" is saved in the working directory. In order to compute the
proportion in the population we read the content of the file into a data
frame and compute the relative frequency of the level “`HIGH`" in the
variable “`group`":

    ```{r}
    pop2 <- read.csv("_data/pop2.csv")
    mean(pop2$group == "HIGH")
    ```

    We get that the proportion in the population is $p = 0.28126$.

3. The simulation of the sampling
distribution involves a selection of a random sample of size 150 from
the population and the computation of the proportion of the level
“`HIGH`" in that sample. This procedure is iterated 100,000 times in
order to produce an approximation of the distribution:

    ```{r}
    P.hat <- rep(0,10^5)
    for(i in 1:10^5) {
      X <- sample(pop2$group,150)
      P.hat[i] <- mean(X == "HIGH")
    }
    mean(P.hat)
    ```

    Observe that the sampling distribution is stored in the object
    “`P.hat`". The function “`sample`" is used in order to sample 150
    observations from the sequence “`pop2$group`". The sample is stored in
    the object “`X`". The expression “`mean(X == "HIGH")`" computes the
    relative frequency of the level “`HIGH`" in the sequence “`X`".

    At the last line, after the production of the sequence “`P.hat`" is
    completed, the function “`mean`" is applied to the sequence. The result
    is the expected value of the estimator $\hat P$, which is equal to
    $0.2812307$. This expectation is essentially equal to the probability of
    the event, $p = 0.28126$.[^18]

4. The application of the function
“`var`" to the sequence “`P.hat`" produces:

    ```{r}
    var(P.hat)
    ```

    Hence, the variance of the estimator is (approximately) equal to
    $0.00135$.

5. Compute the variance according to
the proposed formula, $p(1-p)/n$ with $n = 150$:

    ```{r}
    p <- mean(pop2$group == "HIGH")
    p*(1-p)/150
    ```

    We get that the proposed variance in Section \[sec::Estimation\_5\] is
    $0.0013476850$, which is in good agreement with the value $0.00135$ that
    was obtained in the simulation[^19].
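
    The comparison between the simulation and the formula can be tried
    without the data file. The sketch below uses a hypothetical population
    of 10,000 subjects, 28% of which are labeled “`HIGH`". The population,
    the number of iterations and the object names are all assumptions made
    for the illustration:

    ```{r}
    # Hypothetical population (made-up numbers, not the content of "pop2.csv")
    pop.toy <- rep(c("HIGH", "LOW"), c(2800, 7200))
    p.toy <- mean(pop.toy == "HIGH")
    P.toy <- replicate(5000, mean(sample(pop.toy, 150) == "HIGH"))
    # Simulated variance of the estimator
    var.sim <- var(P.toy)
    var.sim
    # Variance according to the formula p(1-p)/n
    var.formula <- p.toy*(1-p.toy)/150
    var.formula
    ```

    The function “`sample`" selects without replacement from a finite
    population, so the simulated variance may come out slightly below the
    value of the formula.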

## Chapter 11 {-}
### Exercise 11.1 {-}

1. We read the content of the file
“`teacher.csv`" into a data frame by the name “`teacher`" and produce a
summary of the content of the data frame:

    ```{r}
    teacher <- read.csv("_data/teacher.csv")
    summary(teacher)
    ```

    There are two variables: The variable “`condition`" is a factor with two
    levels, “`C`" that codes the Charismatic condition and “`P`" that codes
    the Punitive condition. The second variable is “`rating`", which is a
    numeric variable.
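
    In R version 4.0.0 and later the function “`read.csv`" keeps text
    columns as character strings by default, in which case the summary does
    not list the levels. One possible way to obtain a factor is the
    argument “`stringsAsFactors = TRUE`". The sketch below reads made-up
    content from a character string rather than from the file
    “`teacher.csv`", and the object names are our own:

    ```{r}
    # Made-up CSV content (an assumption, not the teacher data)
    csv.toy <- "condition,rating\nC,2.5\nP,1.5\nC,3.0\nP,2.0"
    teach.toy <- read.csv(text = csv.toy, stringsAsFactors = TRUE)
    # The text column is now a factor with the levels "C" and "P"
    is.factor(teach.toy$condition)
    levels(teach.toy$condition)
    summary(teach.toy)
    ```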

2. The sample average for the
variable “`rating`" can be obtained from the summary or from the
application of the function “`mean`" to the variable. The standard
deviation is obtained from the application of the function “`sd`" to the
variable:

    ```{r}
    mean(teacher$rating)
    sd(teacher$rating)
    ```

    Observe that the sample average is equal to $2.428567$ and the sample