search.json
[
{
"objectID": "Probability Theory.html",
"href": "Probability Theory.html",
"title": "Probability Theory",
"section": "",
"text": "The foundation of statistics is probability, which is analyzing the chances that an event will or will not occur. Here is the basic formula for probability:\n\\[ p(A) = \\frac{Number\\ of\\ events\\ classifiable\\ as\\ A}{Total\\ number\\ of\\ possible\\ events} \\]\nWe can start with a basic example using a deck of 52 playing cards. What’s the probability of drawing an Ace?\n\nWhat are the number of events classifiable as Ace?\n\nA deck of playing cards has 4 aces of the four suits, hearts, diamonds, clubs, and spades. So our numerator is 4.\nThe total number of possible events or total number in a deck of cards is 52. So our denominator is 52. So we can write the equation like this: \\[ p (Ace)=\\frac{4}{52}\\]\nWe can use R to find this probability\n\n4/52\n\n[1] 0.07692308\n\n\nProbabilities are always given in proportions or numbers between 0 and 1. If a given probability is 0 it is not possible for a particular event to occur while if a given probability is 1 it is certain that a particular event will occur. For example the probability of getting a 7 on one roll of a 6 sided die is 0 because the only possible outcomes on one roll of a six sided die is between 1 and 6. Whereas when you add together the probabilities of rolling a 1 through 6 together you get 1, because that is the total number of possible outcomes for a six sided die. To demonstrate, the probability of rolling a “4” on a six-sided die is one out of six or: \\[p(4) = \\frac{1}{6}\\] Or in R\n\n1/6\n\n[1] 0.1666667\n\n\nSo if we add together all the probabilities for each side of the die we’ll get 1. \\[p(1) = \\frac{1}{6} + p(2) = \\frac{1}{6} + p(3) = \\frac{1}{6} + p(4) = \\frac{1}{6} + p(5) = \\frac{1}{6} + p(6) = \\frac{1}{6} = 1\\] Or in R\n\n1/6+1/6+1/6+1/6+1/6+1/6\n\n[1] 1\n\n\n\n\nThere are two basic rules to probability, the addition rule and the multiplication rule. The addition rule applies to the single occurrence of two or more events. For example, what is the probability of drawing an Ace or a King on single draw from a deck of cards. In this case we add the probability of drawing an Ace and a King together to find the correct probability. Recall, that the probability for drawing an Ace looks like this:\n\\[ p (Ace)=\\frac{4}{52}\\]\nThe probability of drawing a king would be the same as an Ace because there are 4 Kings of each suit in the deck of cards, so the formula looks like this:\n\\[ p (Ace\\ or\\ a\\ King)=\\frac{4}{52} + \\frac{4}{52}\\]\nSo the probability of drawing an Ace or a King would be \\(\\frac{8}{52}\\).\nUsing R we would find\n\n4/52+4/52\n\n[1] 0.1538462\n\n\nThe multiplication rule applies for more than one draw or successive events. For example, what would be the probability of drawing an Ace on the first draw and a King on the second. In that case, you would mutlitply rather than add.\n\\[ p (Ace\\ and\\ a\\ King)=\\frac{4}{52} \\times \\frac{4}{51}\\]\nNotice, how in the second fraction the denominator is 51 rather than 52 to account for the fact that after you’ve drawn the Ace there are only 51 cards left. In this case, the second card is drawn without replacement.\n\n4/52*4/51\n\n[1] 0.006033183\n\n\n\n\n\nMathematics creates the ability to go beyond probabilities for single events and look at probability distributions or a set of probability values based on a certain number of events. Statistics is based on probability distributions that assign a particular probability to an observed outcome. 
In psychological science, probability distributions are used to analyze the probability of a particular outcome observed in an experiment. Different probability distributions are used for various statistical tests such as the t test, analysis of variance or ANOVA, and the chi-square test.\nA good place to begin to understand probability distributions is the binomial distribution. For example, imagine you are flipping an evenly weighted coin. What’s the probability of getting heads? Let’s go back to our original formula.\n\nNumber of events classifiable as “heads” = 1\nTotal number of possible events = 2\n\nHere’s the formula:\n\\[p(heads) = \\frac{1}{2}\\]\nWe can use R to get the proportion.\n\n1/2\n\n[1] 0.5\n\n\nWhat about the probability of getting 2 heads in a row on 2 flips of the coin? To analyze the probability of successive events or outcomes you need to multiply the probability for each event. This is called the multiplication rule for probability. So the formula looks like:\n\\[p(2heads) = \\frac{1}{2}\\times\\frac{1}{2} = \\frac{1}{4}\\]\nUsing r we find:\n\n1/2*1/2\n\n[1] 0.25\n\n\nSo we could go on and figure out the probability for 3 heads and 4 heads and so on, but instead the binomial distribution can provide a distribution of different probability values based on a given number of events. The binomial distribution is a just an extension of the mathematics we would need to do by hand to find the probability for different events and luckily, as usual, R is able to do this for us.\nThe function we’ll use is called dbinom(). The main arguments for the function are:\nx The number of outcomes for the given probability currently being calculated\nsize The number of the overall size of the experiment\nprob The given probability value\nSo let’s calculate the original probability problem of two heads in two flips of a fair-sided coin. Notice that the answer is the same to the formula we used earlier.\n\ndbinom(x=2, size = 2, prob = 1/2)\n\n[1] 0.25\n\n\nProbabilities can be listed as fractions or proportions\n\ndbinom(x=2, size = 2, prob = 0.5)\n\n[1] 0.25\n\n\nWe can also use it for other probability values such as rolling a six sided die, which has the probability value of \\(\\frac{1}{6}\\)\n\ndbinom(x=1, size = 1, prob = 1/6)\n\n[1] 0.1666667\n\n\nOf course, it’s most helpful for probabilities for larger numbers of events. Like what’s the probability of rolling 4 sixes over 20 trials.\n\ndbinom(x =4, size = 20, prob = 1/6)\n\n[1] 0.2022036\n\n\nWe can also represent this as distribution graph.\n\nsuccess <- 0:20\nplot(success, dbinom(success, size=20, prob=1/6),type='h',\n col = \"blue\")\n\n\n\n\n\n\n\n\nEach of the lines represents the probability for a given outcome. Notice how the outcome for rolling 4 sixes in 20 trials is around 0.20, which was found in the original calculation using the dbinom() function. The highest probability is to roll a “6” three times out of 20 rolls of the dice. Notice how the graph of the binomial distribution shows us the same answer as using the dbinom() function.\n\ndbinom(x=3, size = 20, prob = 1/6)\n\n[1] 0.2378866\n\n\nAfter around 10 trials or rolls of the dice, the probability does a steep decline and stays very low, which would make sense. We can see the actual number by using R.\n\ndbinom(x = 10, size = 20, prob = 1/6)\n\n[1] 0.0004934846\n\n\nRolling a “6” ten times out of 20 would not be a very probable outcome. 
You are much more likely to roll one of the other numbers (1 to 5) rather than 6 so many times.\n\n\n\nMost football games and soccer matches start with a coin flip to see who gets the ball first. A coin flip is a great way to think about probabilities. Let’s say it’s the first game of the season and you are the captain of the soccer team. You are asked to call the coin flip in the air and you call heads. What is the probability of your team starting with the ball?\n\ndbinom(x = 1, size = 1, prob = 0.5)\n\n[1] 0.5\n\n\nThis probability was found earlier and it’s simply the original formula from a simple coin flip.\n\\[\np(heads)=\\frac{1}{2}\n\\]\nLet’s imagine that you are the captain of the soccer team and you decide to go with heads every time you call the coin flip. If you got heads, 2, 3, or 4 times in a row, you would probably think you were pretty lucky, but what if you got it 25 times in a row? It would probably be on the nightly news and other teams would start to assume you were cheating. Intuitively, random outcomes seem normal and when someone gets lucky at something, persons begin to take notice. So if the captain of a soccer team picked heads 25 times in a row and got it right every time, that would be a very low probability. Let’s take a look:\n\ndbinom(x = 25, size = 25, prob = 0.5)\n\n[1] 2.980232e-08\n\n\nThat would be a very, very low probability. It’s also easy to test. Take a quarter and flip it 25 times. How many times did you get heads? Someone may get it now and again, but not often.\n\n\n\nThe binomial distribution is helpful here. Let’s graph the distribution of getting heads over 25 trials.\n\nsuccess <- 0:25\nplot(success, dbinom(success, size=25, prob=.5),type='h', \n col = \"blue\")\n\n\n\n\n\n\n\n\nDoes this graph remind you of anything? It should because it takes on the shape of a normal curve. Notice that the lowest probabilities are for getting 4 or less heads and 21 or more heads. The highest probabilities are somewhere in the middle, half heads and half tails, which makes sense because the overall probability is 0.5 or \\(\\frac{1}{2}\\). If we increase the number of trials the shape of the distribution is more pronounced and it looks more like a normal curve.\n\nsuccess <- 0:100\nplot(success, dbinom(success, size=100, prob=.5),type='h', col = \"blue\")\n\n\n\n\n\n\n\n\nSo statistics uses distributions like the binomial distribution to look at the probability of different outcomes, similar to flipping a coin. Flipping a coin and getting heads about half the time is more probable than flipping a coin and getting a larger number of heads or a very small number of heads.",
"crumbs": [
"Intro to Statistics",
"Probability Theory"
]
},
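The addition and multiplication rules above are easy to check directly at the R console. Below is a minimal sketch that reproduces the chapter's card and dbinom() calculations; the helper `p_card()` is hypothetical, introduced here only for illustration and not part of base R.

```r
# Hypothetical helper for card probabilities (illustration only):
# number of matching cards divided by the number of cards available.
p_card <- function(matching, total = 52) matching / total

# Addition rule: Ace OR King on a single draw
p_card(4) + p_card(4)                      # 0.1538462

# Multiplication rule, without replacement: Ace THEN King
p_card(4) * p_card(4, total = 51)          # 0.006033183

# dbinom() check from the chapter: 2 heads in 2 flips of a fair coin
dbinom(x = 2, size = 2, prob = 0.5)        # 0.25

# The binomial probabilities over all possible outcomes sum to 1
sum(dbinom(0:20, size = 20, prob = 1/6))   # 1
```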
{
"objectID": "Pres Trial.html#slide-1",
"href": "Pres Trial.html#slide-1",
"title": "Presentation Trial",
"section": "Slide 1",
"text": "Slide 1\n\nTurn off alarm\nGet out of bed\nEat spaghetti"
},
{
"objectID": "Pres Trial.html#slide-2",
"href": "Pres Trial.html#slide-2",
"title": "Presentation Trial",
"section": "Slide 2",
"text": "Slide 2\n\nGet in bed\nCount sheep"
},
{
"objectID": "Why R.html",
"href": "Why R.html",
"title": "Why R?",
"section": "",
"text": "Why R & R Studio?\nMany persons are curious about the shift to using R rather than SPSS or one of the other basic computer programs. A great overview is offered here:\nSPSS is dying\n\n\nReasons to Move to R Studio\nHere are the reasons I think it’s important to move to R Studio.\n\n\n1. It’s free.\nYou can put it on your home computer for free. You can use it on Windows and Macs and even inexpensive educational computers like Raspberry pi\nRaspberry pi\nIf you learn SPSS, most likely your company won’t have it and your educational organization may or may not have it. If you learn R and R Studio you can take it into any work setting and use it. This is especially helpful for nonprofits who often can’t afford statistical software.\n\n\n2. There’s this new thing called a ‘com-pu-tor’\nComputers are ubiquitous, you probably carried one with you into class in your pocket. A contemporary iPhone is 120 million times more powerful than the computers that sent humans to the moon. However, not everyone is as familiar with the various types of code that makes modern computers, iPhones, apps, Instagram, etc., run. Learning R gives you a very rudimentary understanding of how to code. More importantly, it teaches you how to manipulate code. No one ever writes code on their own from scratch, but parts are taken and then transformed to work for your specific needs. The introduction to r and code in this class allows you to do basic statistics, make professional looking graphs, and even make websites (This website was created using R Studio and Quarto).\n\n\n3. Simplifying replication\nA big story in the last couple of years is the problem with replication in the social sciences, some calling it a replication crisis. Replication is also requiring access to the code you used, not just the output shared in the research article. Mistakes have been found with scientific research findings because of the computations performed and how the data was processed, which haven’t been a normal part of research reports in the past. R Studio makes this easier because as you learn to code, you can also share the code you used to give a complete picture of the data analysis process from start to finish. Many journals are beginning to require that authors include their data as well as their code in the submission of research manuscripts.\n\n\n4. Collaboration\nWebsites like GitHub are making it more and more easier to collaborate on various projects together. Github enables users to monitor how data is manipulated and changed over time. R Studio helps to facilitate this collaboration. More and more websites and companies are making their data free and open source (just like R is) so that persons can learn from each other or look at the data for themselves and check the sources. For example, here’s a site that actively pulls data from The NY Times data repository on Covid-19 to be used in R.\n\n\n5. Self-Teaching\nMost of the resources you need to learn R and R Studio are free. I taught myself R Studio through using free open source books, websites, and YouTube videos. Any time I had a question, I just googled it! I could usually find the answer with a little work. R also includes different packages that allow you to do different types of tasks (we use tidyverse for our class and ggplot) These are free as well and have tons of online help for free training. If you start with a little bit of R Studio, you can literally go anywhere in the world of data science and analysis.\n\n\n6. 
Social Justice\nR Studio has committed to donate time and resources to various social justice groups such as Black Lives Matter and the ACLU\nAs a company, R Studio is committed to providing open-source free software for data science (they also have an enterprise wing that provides data services for a fee). They are also a Designated Public Benefit Corporation, which means that their open source mission is written into their charter and stakeholders must uphold that mission in their decision-making\nRStudio was recently renamed Posit and you can see their annual report here",
"crumbs": [
"R Basics",
"Why R?"
]
},
{
"objectID": "Intro to R.html",
"href": "Intro to R.html",
"title": "Intro to R & R Studio",
"section": "",
"text": "R is the base program for R Studio, it does all the calculations, while R studio is the addition of several windows around R that helps with your analysis\nIn R Studio, “R” performs calculations in the console. The console is in the lower left window. You can think of the console as a big calculator. You can do all the basics with R.\nThe Editor is where you write the code and notes to yourself in scripts. The console is where you code gets entered and run and the output is in the lower right hand corner. The upper left hand window is the environment and keeps track of the data and other things used in your analysis.\nHere is a picture of what the program looks like and the different sections.",
"crumbs": [
"R Basics",
"Intro to R & R Studio"
]
},
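Since the console works like a big calculator, a few lines are enough to try it out. A minimal sketch of console arithmetic, consistent with the description above:

```r
# Basic arithmetic typed directly into the console
2 + 3        # addition: 5
10 / 4       # division: 2.5
2^5          # exponentiation: 32

# Storing a result in an object; it then appears in the Environment pane
result <- (4 + 6) * 3
result       # 30
```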
{
"objectID": "Intro to R.html#vectors",
"href": "Intro to R.html#vectors",
"title": "Intro to R & R Studio",
"section": "Vectors",
"text": "Vectors\nBesides individual numbers your can also create vectors or arrays of numbers, which are a set of number that are saved in an object\nHere is a basic vector of numbers\n\nVector <- c(23, 26, 45, 22, 43, 91, 82, 12, 57, 2)\n\n\nSpecial Vectors\nYou can create a sequence of numbers without needing to write them all down. Just list the first and last number of the sequence in your code\n\nSequence <- seq(1,10)\n\nYou can also repeat numbers\n\nRepeat <- rep(10, times = 10)\n\nAnd of course you can always store it in an object\n\nRepeat <- rep(10, times = 25)\nRepeat\n\n [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10\n\n\nThen you can use that to make a new database\nFirst we’ll add a sequence of numbers to match our repeat variable\n\nSequence <- seq(1,25)\nSequence\n\n [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25\n\n\nThen create a dataset using the data.frame command and include the two objects that were just recently created.\n\nDataset <- data.frame(Repeat, Sequence)\nDataset\n\n Repeat Sequence\n1 10 1\n2 10 2\n3 10 3\n4 10 4\n5 10 5\n6 10 6\n7 10 7\n8 10 8\n9 10 9\n10 10 10\n11 10 11\n12 10 12\n13 10 13\n14 10 14\n15 10 15\n16 10 16\n17 10 17\n18 10 18\n19 10 19\n20 10 20\n21 10 21\n22 10 22\n23 10 23\n24 10 24\n25 10 25",
"crumbs": [
"R Basics",
"Intro to R & R Studio"
]
},
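Once numbers are stored in a vector, they can be indexed and summarized. A minimal sketch using the `Vector` object defined above; the specific summaries shown are illustrative additions, not from the original page:

```r
Vector <- c(23, 26, 45, 22, 43, 91, 82, 12, 57, 2)

Vector[3]        # third element: 45
Vector[1:4]      # first four elements: 23 26 45 22
length(Vector)   # number of elements: 10
mean(Vector)     # average of the values: 40.3
sum(Vector)      # total: 403

# seq() can also step by a fixed amount
seq(0, 100, by = 10)
```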
{
"objectID": "ANCOVA.html",
"href": "ANCOVA.html",
"title": "ANCOVA",
"section": "",
"text": "library(tidyverse)\n\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.2 ✔ tibble 3.2.1\n✔ lubridate 1.9.4 ✔ tidyr 1.3.1\n✔ purrr 1.0.4 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nlibrary(car)\n\nLoading required package: carData\n\nAttaching package: 'car'\n\nThe following object is masked from 'package:dplyr':\n\n recode\n\nThe following object is masked from 'package:purrr':\n\n some",
"crumbs": [
"Statistical Tests",
"ANCOVA"
]
},
{
"objectID": "ANCOVA.html#graphing-the-results",
"href": "ANCOVA.html#graphing-the-results",
"title": "ANCOVA",
"section": "Graphing the results",
"text": "Graphing the results\nFor graphing the results, a bar graph is the best way to look at the means. We can use the same basic formula we used for the one way anova to find our descriptive statistics.\n\nANCOVA_Descriptives <- Puppy_love |>\n group_by(Dose) |>\n summarize(n = n(),\n mean = mean(Happiness),\n sd = sd(Happiness),\n se = sd / sqrt(n),\n ci = qt(0.975, df = n - 1) * sd / sqrt(n))\n\nAnd then create the graph\n\nggplot(ANCOVA_Descriptives, \n aes(x = Dose, \n y = mean)) +\n geom_bar(stat = \"identity\") +\n geom_errorbar(aes(ymin=mean-ci,\n ymax=mean+ci))\n\n\n\n\n\n\n\n\nOf course, it’s better to improve the graph a bit.\n\nggplot(ANCOVA_Descriptives, \n aes(x = Dose,\n y = mean)) +\n theme_minimal() +\n geom_bar(stat = \"identity\", fill=\"steelblue\") +\n geom_errorbar(aes(ymin=mean-ci,\n ymax=mean+ci), width=.3, size=1) +\n labs(title = \"Does Puppy Therapy Effect Happiness?\", \n y=\"Mean Level of Happiness\", x=\"Time Spent in Puppy Therapy\") \n\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.",
"crumbs": [
"Statistical Tests",
"ANCOVA"
]
},
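The warning in the last chunk says the `size` aesthetic for lines was deprecated in ggplot2 3.4.0. A sketch of the same final graph with `linewidth` swapped in, which should run without the warning, assuming the `ANCOVA_Descriptives` table created above and ggplot2 ≥ 3.4:

```r
ggplot(ANCOVA_Descriptives,
       aes(x = Dose, y = mean)) +
  theme_minimal() +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(ymin = mean - ci,
                    ymax = mean + ci),
                width = .3, linewidth = 1) +   # linewidth replaces size
  labs(title = "Does Puppy Therapy Affect Happiness?",
       y = "Mean Level of Happiness",
       x = "Time Spent in Puppy Therapy")
```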
{
"objectID": "Chi Square.html",
"href": "Chi Square.html",
"title": "Chi Square",
"section": "",
"text": "Introduction to Chi-Square\nNot all data that is collected meets the standards for being parametric or based on populations. Some data that is collected is based on frequencies. Polling data is often of this type. If a person is interested in who might win the next election, they may run a poll like many did in the most recent presidential election between Biden and Trump.\nWhat makes data based on frequencies different is that the mean cannot be used for analysis. Frequency data is often based on categories or nominal data, thus it doesn’t make sense to compare these variables based on the types of statistical analyses we’ve done so far. Instead, we’ll use what is known as a Chi-Square test. \\[\nChi\\;Square\\;Test = \\chi^2\n\\] Let’s start with a very simple example. Perhaps, you are trying to decide whether a certain proposition will pass in the county and you want to decide if there is a preference. You collect data from 200 participants about whether they are in favor of the proposition or against the proposition on a simple yes vs. no question.\nIf there is no preference in the county, what would you expect the outcome of your survey to be? Well, if about the same amount of people were for the proposition as were against it, the outcome would be 50% yes and 50% no or half the participants would be against it, half for it. This is called the expected frequency or the expected frequency if the null hypothesis is correct, which in this case would be no particular preference in the sample for the proposition. Here’s the formula. \\[\nf_e = \\frac {total\\;in\\;survey}{number\\;of\\;categories}= \\frac {200}{2}\n\\] So in this case the expected frequency would be 100. So if there was no preference for a particular county proposition we would expect 100 persons to be for it and 100 persons to be against it. So Let’s say here is what the actual data looked like. 150 were for it and 50 were against it. This is called the observed frequency. \\[\nobserved\\;frequency = f_o\n\\] The chi square test is a combination of these two numbers. Basically the larger the difference between the observed and expected frequencies, the more likely there is a real preference either for or against the proposition. Here is the formula. \\[\nX^2 = \\Sigma \\frac{(f_o-f_e)^2}{f_e}\n\\] So the chi square test is the sum of the observed minus expected frequencies squared divided by the expected frequencies for each cell. In this case we have 2 cells one for those who answered yes in favor of the propsosition (150) and those who answered no against the proposition (50). So the computation would look like this: \\[\n\\frac{(150-100)^2}{100}+\\frac{(50-100)^2}{100}\n\\]\nSo in this case our answer turns out to be a Chi Square value of 50, which definitely reaches statistical significance (p < .001). Thus there does seem to be a preference in favor for the proposition in the county.\nR allows for the same types of statistical analysis without having to calculate the entire formula.\nThe chi square test just calculated is called a single variable chi square because we are just looking at a single variable. It can also be calculated using R.\nFirst we need to create a contingency table, which is simply a table of contingent frequencies based on the data we’ve collected. Here I’ll use a tibble to create the dataset.\n\nProposition_Data <- tribble(\n ~Yes, ~No,\n 150, 50)\n\nThen we just simply run the chi square test. Note the code correct = FALSE. 
We add this because we don’t want to use the continuity correction, which is necessary when expected frequencies are below 5 (Remember in this instance our expected frequency would be 100).\n\nchisq.test(Proposition_Data, correct = FALSE)\n\n\n Chi-squared test for given probabilities\n\ndata: Proposition_Data\nX-squared = 50, df = 1, p-value = 1.537e-12\n\n\n\n\nTwo Variable Chi-Square\nAnother way that the chi square test is used is the testing of relationships between variables. Thus, are the variables related to each other or are they independent of each other?\nFor example, Look at dataset ch19ds2.\n\nhead(ch19ds2)\n\n Sex Vote\n1 Male Yes\n2 Male Yes\n3 Male Yes\n4 Male Yes\n5 Male Yes\n6 Male Yes\n\n\nThere are two variables, gender (labeled here sex) and Vote, which was whether they voted yes or no on a recent ballot measure. First let’s look at a contingency table to get an overview of the data.\n\nVote_Gender_table <- table(ch19ds2)\nVote_Gender_table\n\n Vote\nSex No Yes\n Female 31 32\n Male 20 37\n\n\nSo we want to see whether there was a relationship between gender and how persons voted on a particular measure. Here again, the Null hypothesis would assume that these variables are independent of each other. The frequency of No and Yes would be roughly proportional for both males and females. The alternative hypothesis assumes these frequencies are different based on whether you are a male or female.\nTo run a chi-square test and construct a graph we need to create a different way of representing the data called a matrix. A matrix is a display of different variables using dimensions. A matrix counts how many cases there are for various categories.\nFor example, based on our Chi_table above there are 31 females who voted No, 20 males who votes no, 32 females who voted yes and 37 males who voted yes. In order to run the Chi Square test, the data must be entered as a matrix.\nHere is how we create a matrix\n\n# Create a vector with frequency values\n# Write the values by filling in the rows of the first\n# column, then the second, and so on. \nMatrix1 <- c(31, 20, 32, 37)\n\n#Change the vector into a matrix by assigning the dimensions. \n#In this case we want a 2 x 2 matrix, so 2 rows and 2 columns\ndim(Matrix1) <- c(2, 2)\n\n# change column names\ncolnames(Matrix1) <- c(\"No\",\"Yes\")\n# change row names\nrownames(Matrix1) <- c(\"Female\",\"Male\")\n\n#Check out the finished product\nMatrix1\n\n No Yes\nFemale 31 32\nMale 20 37\n\n\nLook a little bit closer at the code.\n\nStart with creating an object as a vector of numbers, in this case we called it Matrix1. The numbers will fill in starting at row 1 column 1 and fill in column 1 followed by column 2.\nNext, turn the vector into a matrix using the dim() argument. Use the “cbind” argument c(), to specify how many rows and columns you need. In this case we want 2 rows and 2 columns. The first number is the number of rows and the second number is the number of columns.\nFinally, use the colnames and rownames arguments to name what you want your rows and columns to be.\nThen you’re ready for the Chi Square test! \\(X^2\\)\n\nPerforming the test is fairly straight forward. The matrix is the dataset and the continuity correction is unnecessary because our expected frequencies for each cell is above 5. 
So we add the code, correct = FALSE, after the name of the matrix.\n\nchisq.test(Matrix1, correct = FALSE)\n\n\n Pearson's Chi-squared test\n\ndata: Matrix1\nX-squared = 2.441, df = 1, p-value = 0.1182\n\n\nThe chi square value is very low and the p value is well above .05, so there is not a relationship between these two variables.\n\n\nEffect size for chi square\nFor an effect size, we use the odds ratio, which looks at the odds of the outcome we obtained. Obviously the higher the odds ratio the stronger the relationship. To find the odds ratio, first let’s look at the ratio for yes vs. no based on gender. For no it was 31 females to 20 males.\n\n31/20\n\n[1] 1.55\n\n\nSo this was 1.55 and we’ll also look at yes.\n\n32/37\n\n[1] 0.8648649\n\n\nSo the odds of their being a difference between males and females on yes vs. no. would be dividing these two numbers.\n\n(31/20)/(32/37)\n\n[1] 1.792188\n\n\nSo females were 1.79 times more likely to answer no then yes on the survey, which isn’t very large and since the overall test was not significant the effect size is not evaluated.\nNext, a bar graph helps to show the differences in frequencies and the direction of the differences. In this case, the base R package can be used to create a simple bar graph based on the frequencies or counts in the cells using the barplot argument.\n\nbarplot(Matrix1, beside = TRUE,\n col = c(\"red\", \"green\"), legend.text = TRUE, \n xlab = \"Voting based on Gender\", \n ylab = \"Frequency of Yes and No Votes\", \n main = \"Gender and Voting Frequency on Recent Bill\")\n\n\n\n\n\n\n\n\nFinally, here is how you report the results:\nThere was a not a significant association between gender and their voting preference (yes vs. no), \\(\\chi^2(1) = 2.441; p = 0.12\\). Based on the odds ratio, females were 1.8 times more likely to vote no than males.",
"crumbs": [
"Statistical Tests",
"Chi Square"
]
},
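Both chi-square computations in this chapter can be verified by hand at the console. A minimal sketch that mirrors the formula \(X^2 = \Sigma (f_o-f_e)^2/f_e\) and the odds-ratio arithmetic, using the same numbers as above (the matrix is rebuilt here so the snippet is self-contained):

```r
# Single-variable chi-square by hand: observed vs. expected frequencies
fo <- c(150, 50)                      # observed: yes, no
fe <- rep(sum(fo) / length(fo), 2)    # expected under the null: 100 each
sum((fo - fe)^2 / fe)                 # 50, matching chisq.test()

# Odds ratio from the two-variable matrix built above
Matrix1 <- matrix(c(31, 20, 32, 37), nrow = 2,
                  dimnames = list(c("Female", "Male"), c("No", "Yes")))
(Matrix1["Female", "No"] / Matrix1["Male", "No"]) /
  (Matrix1["Female", "Yes"] / Matrix1["Male", "Yes"])   # 1.792188
```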
{
"objectID": "Two-Way ANOVA.html",
"href": "Two-Way ANOVA.html",
"title": "Two-Way ANOVA",
"section": "",
"text": "Before starting this section please review the section One-Way ANOVA first\n\n\nOne-way ANOVA is based on the general idea that the total variability \\(SS_T\\) is partitioned (divided or separated) into two types of variability. The variability between the groups \\(SS_{between}\\) and the variability within the groups \\(SS_{within}\\). Remember that the variability between the groups needed to be greater than the variability within the groups because that would indicate that the difference between the groups was greater than the measurement error that existed within the groups.\n\n\nTwo-way ANOVA goes beyond one-way ANOVA by analyzing the effects of two independent variables in the same experiment. In a two-way ANOVA the two independent variables are called main effects or factors and each main effect has its own individual hypothesis.\n\n\n\nFinally, the two-way ANOVA analyzes if there is an interaction effect between the two main effects or factors. An interaction effect occurs when the effect of one of the independent variables is not the same at all levels of the second independent variable. So the set up for a two-way ANOVA is:\n\nIndependent variable 1 = Main effect 1 = Factor 1\nIndependent variable 2 = Main effect 2 = Factor 2\nInteraction effect = Interaction between Factors 1 & 2\n\n\n\n\nThe example dataset tests what’s called the “beer googles effect”. Sometimes alcohol can have an effect on perceptions of attraction for potential dates, especially later in the evening at bars. The dataset tests whether perceptions of attractiveness change after drinking alcohol and whether males and females are effected by this phenomenon differently.\n\n\nTwo Independent Variables (Main Effects or Factors)\n\nMain Effect 1 = Alcohol 3 Levels (None, 2 Pints, 4 Pints)\nMain Effect 2 = Gender 2 Levels (Male and Female)\nInteraction Effect = Gender x Alcohol\nDV = Attractiveness of the partner selected at the end of the evening\n\nAlternative Hypotheses\n\n\\(H_1\\) Alcohol has an effect on the attractiveness level of the selected partner\n\\(H_2\\) Gender has an effect on the attractiveness level of the partner.\n\\(H_3\\) There is an interaction effect between Alcohol and Gender\n\nSum of Squares variation estimates\n\n\n\n\n\nGet the dataset and import it\n\nlibrary(haven)\ngoggles <- read_sav(\"goggles.sav\")\n\nCheck out Gender variable\n\ngoggles$Gender\n\n<labelled<double>[48]>: Gender\n [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n[39] 0 0 0 0 0 0 0 0 0 0\n\nLabels:\n value label\n 0 Male\n 1 Female\n\n\nCheck out Alcohol variable\n\ngoggles$Alcohol\n\n<labelled<double>[48]>: Alcohol Consumption\n [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 2 2 2 2 2 2\n[39] 2 2 3 3 3 3 3 3 3 3\n\nLabels:\n value label\n 1 None\n 2 2 Pints\n 3 4 Pints\n\n\nCheck out Attractiveness variable\n\ngoggles$Attractiveness\n\n [1] 65 70 60 60 60 55 60 55 70 65 60 70 65 60 60 50 55 65 70 55 55 60 50 50 50\n[26] 55 80 65 70 75 75 65 45 60 85 65 70 70 80 60 30 30 30 55 35 20 45 40\nattr(,\"label\")\n[1] \"Attractiveness of Date\"\nattr(,\"format.spss\")\n[1] \"F8.0\"\nattr(,\"display_width\")\n[1] 13\n\n\nAlcohol and Gender variables are factors. 
When they get imported from SPSS they don’t function as well because the focus is on the numbers, not the labels or words.We can use tidyverse and the mutate function to fix this.\n\ngoggles <- goggles %>% \n mutate(Gender = factor(Gender, levels = c(0,1), \n labels = c(\"Male\", \"Female\")))\n\nWe can do the same thing with the Alcohol variable. Make sure you know the levels or numbering of the variable\n\ngoggles$Alcohol\n\n<labelled<double>[48]>: Alcohol Consumption\n [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 2 2 2 2 2 2\n[39] 2 2 3 3 3 3 3 3 3 3\n\nLabels:\n value label\n 1 None\n 2 2 Pints\n 3 4 Pints\n\n\nThen go ahead and mutate the variable as you did with Gender\n\ngoggles <- goggles %>% \n mutate(Alcohol = factor(Alcohol, levels = c(1,2,3), \n labels = c(\"None\", \"2 Pints\", \"4 Pints\")))\n\n\n\nNow we can move to the Two-Way ANOVA analysis. “m3” is the new object to save results so the formula will have this structure.\n\n# m3 <- aov (this is the computation you are using, like t.test) Then the rest of your formula should look like this (Dependent variable ~ Varible 1 + Variable 2 + Variable 1*Variable 2, data = [your dataset])\n\n\nm3 <- aov(Attractiveness ~ Gender + Alcohol + \n Gender*Alcohol, data = goggles)\n\nCheck the results\n\nsummary(m3)\n\n Df Sum Sq Mean Sq F value Pr(>F) \nGender 1 169 168.8 2.032 0.161 \nAlcohol 2 3332 1666.1 20.065 7.65e-07 ***\nGender:Alcohol 2 1978 989.1 11.911 7.99e-05 ***\nResiduals 42 3487 83.0 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\nKey Findings from our output\nMain effect of Alcohol on attractiveness\nInteraction effect of Alcohol and Gender\nWhat does this mean?\n\n\n\n\nLet’s use a graph to understand this better\nFirst let’s look at each variable individually\n\nGogglesAlcohol <- goggles %>%\n group_by(Alcohol) %>%\n summarize(n = n(),\n mean = mean(Attractiveness),\n sd = sd(Attractiveness),\n se = sd/sqrt(n),\n ci = qt(0.975, df = n - 1) * sd / sqrt(n))\n\nGraph the Alcohol variable individually\n\nggplot(GogglesAlcohol, aes(x = Alcohol,\n y = mean)) +\n geom_point(size = 3) +\n geom_line(size = 1) +\n geom_errorbar(aes(ymin =mean - ci, \n ymax = mean + ci), \n width = .1)\n\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n\n\n`geom_line()`: Each group consists of only one observation.\nℹ Do you need to adjust the group aesthetic?\n\n\n\n\n\n\n\n\n\nGraph the Gender variable individually\n\nGogglesGender <- goggles %>%\n group_by(Gender) %>%\n summarize(n = n(),\n mean = mean(Attractiveness),\n sd = sd(Attractiveness),\n se = sd/sqrt(n),\n ci = qt(0.975, df = n - 1) * sd / sqrt(n))\n\nGraph it\n\nggplot(GogglesGender, aes(x = Gender,\n y = mean)) +\n geom_point(size = 3) +\n geom_line(size = 1) +\n geom_errorbar(aes(ymin =mean - ci, \n ymax = mean + ci), \n width = .1) +\n ylim(0,70)\n\n`geom_line()`: Each group consists of only one observation.\nℹ Do you need to adjust the group aesthetic?\n\n\n\n\n\n\n\n\n\nGraph the Interaction Effect\nFinally, we can graph relationships for both variables.\nFirst find your descriptive statistics, but this time based on two independent variables\n\nGogglesDescriptives <- goggles %>%\n group_by(Alcohol, Gender) %>%\n summarize(n = n(),\n mean = mean(Attractiveness),\n sd = sd(Attractiveness),\n se = sd/sqrt(n),\n ci = qt(0.975, df = n - 1) * sd / sqrt(n))\n\n`summarise()` has grouped output by 'Alcohol'. 
You can override using the\n`.groups` argument.\n\n\nCheck it\n\nGogglesDescriptives\n\n# A tibble: 6 × 7\n# Groups: Alcohol [3]\n Alcohol Gender n mean sd se ci\n <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>\n1 None Male 8 66.9 10.3 3.65 8.64\n2 None Female 8 60.6 4.96 1.75 4.14\n3 2 Pints Male 8 66.9 12.5 4.43 10.5 \n4 2 Pints Female 8 62.5 6.55 2.31 5.47\n5 4 Pints Male 8 35.6 10.8 3.83 9.06\n6 4 Pints Female 8 57.5 7.07 2.5 5.91\n\n\nUse a line graph to graph the relationship\n\nggplot(GogglesDescriptives, aes(x = Alcohol,\n y = mean, \n group=Gender, \n color=Gender)) +\n geom_point(size = 3) +\n geom_line(size = 1) +\n geom_errorbar(aes(ymin =mean - ci, \n ymax = mean + ci), \n width = .1)\n\n\n\n\n\n\n\n\nUse dodge functions to make graph clearer\n\npd <- position_dodge(0.2)\nggplot(GogglesDescriptives, \n aes(x = Alcohol, \n y = mean, \n group=Gender, \n color=Gender)) +\n geom_point(position = pd, \n size = 3) +\n geom_line(position = pd,\n size = 1) +\n geom_errorbar(aes(ymin = mean - ci, \n ymax = mean + ci), \n width = .1, \n position= pd)\n\n\n\n\n\n\n\n\nAll the bells and whistles\n\npd <- position_dodge(0.2)\nggplot(GogglesDescriptives, \n aes(x = Alcohol, \n y = mean, \n group=Gender, \n color=Gender)) +\n geom_point(position=pd, \n size = 3) +\n geom_line(position = pd, \n size = 1) +\n geom_errorbar(aes(ymin = mean - ci, \n ymax = mean + ci), \n width = .1, \n position = pd, \n size = 1) +\n scale_color_brewer(palette=\"Set1\") +\n theme_minimal() +\n labs(title = \"Beer Goggels Effect\",\n x = \"Amount of Alcohol Consumed\", \n y = \"Mean Level of Attractiveness\",\n color = \"Gender\")\n\n\n\n\n\n\n\n\nUse Tukey to look at specific differences in the groups you are interested in\n\nTukeyHSD(m3)\n\n Tukey multiple comparisons of means\n 95% family-wise confidence level\n\nFit: aov(formula = Attractiveness ~ Gender + Alcohol + Gender * Alcohol, data = goggles)\n\n$Gender\n diff lwr upr p adj\nFemale-Male 3.75 -1.558607 9.058607 0.1613818\n\n$Alcohol\n diff lwr upr p adj\n2 Pints-None 0.9375 -6.889643 8.764643 0.9544456\n4 Pints-None -17.1875 -25.014643 -9.360357 0.0000105\n4 Pints-2 Pints -18.1250 -25.952143 -10.297857 0.0000040\n\n$`Gender:Alcohol`\n diff lwr upr p adj\nFemale:None-Male:None -6.250 -19.851381 7.351381 0.7432243\nMale:2 Pints-Male:None 0.000 -13.601381 13.601381 1.0000000\nFemale:2 Pints-Male:None -4.375 -17.976381 9.226381 0.9277939\nMale:4 Pints-Male:None -31.250 -44.851381 -17.648619 0.0000003\nFemale:4 Pints-Male:None -9.375 -22.976381 4.226381 0.3286654\nMale:2 Pints-Female:None 6.250 -7.351381 19.851381 0.7432243\nFemale:2 Pints-Female:None 1.875 -11.726381 15.476381 0.9983764\nMale:4 Pints-Female:None -25.000 -38.601381 -11.398619 0.0000306\nFemale:4 Pints-Female:None -3.125 -16.726381 10.476381 0.9825753\nFemale:2 Pints-Male:2 Pints -4.375 -17.976381 9.226381 0.9277939\nMale:4 Pints-Male:2 Pints -31.250 -44.851381 -17.648619 0.0000003\nFemale:4 Pints-Male:2 Pints -9.375 -22.976381 4.226381 0.3286654\nMale:4 Pints-Female:2 Pints -26.875 -40.476381 -13.273619 0.0000080\nFemale:4 Pints-Female:2 Pints -5.000 -18.601381 8.601381 0.8796489\nFemale:4 Pints-Male:4 Pints 21.875 8.273619 35.476381 0.0002776\n\n\nFor us, the most important difference is that males are more likely to choose a less attractive person at 4 pints of alcohol than females\n\n\n\n\n\nThere was a significant main effect of the amount of alcohol consumed on the attractiveness of the date that was selected, F(2, 42) = 20.07, p < .001.\n\n\nThere was not a significant main effect of gender 
on the attractiveness of the date that was selected, F(1, 42) = 2.03, p = .161.\n\n\n\n\n\nThere was a significant interaction effect between amount of alcohol consumed and gender on the attractiveness of the date that was selected F(2, 42) = 11.91, p < .001.\n\n\n\n\n\nTukeyHSD post hoc tests revealed that at the largest amount of alcohol consumption (4 pints) Males were significantly more likely to choose a less attractive date (M=35.6, SE=3.83) in comparison to females (M=57.5, SE=57.5). This difference, 21.88, 95% CI[8.27, 35.48] was significant with an adjusted p = .0003.",
"crumbs": [
"Statistical Tests",
"Two-Way ANOVA"
]
}
]
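The summary(m3) table above reports each sum of squares, which also supports a simple effect-size calculation. Below is a sketch computing eta squared (SS_effect / SS_total) from those printed values; the numbers are copied from the output above, not newly estimated:

```r
# Sums of squares as printed by summary(m3)
ss <- c(Gender = 169, Alcohol = 3332, Interaction = 1978, Residuals = 3487)
ss_total <- sum(ss)            # 8966

round(ss[1:3] / ss_total, 3)
# Gender ~0.019, Alcohol ~0.372, Interaction ~0.221:
# alcohol accounts for the largest share of the variability
```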
{
"objectID": "Two-Way ANOVA.html#two-way-anova",
"href": "Two-Way ANOVA.html#two-way-anova",
"title": "Two-Way ANOVA",
"section": "",
"text": "Before starting this section please review the section One-Way ANOVA first\n\n\nOne-way ANOVA is based on the general idea that the total variability \\(SS_T\\) is partitioned (divided or separated) into two types of variability. The variability between the groups \\(SS_{between}\\) and the variability within the groups \\(SS_{within}\\). Remember that the variability between the groups needed to be greater than the variability within the groups because that would indicate that the difference between the groups was greater than the measurement error that existed within the groups.\n\n\nTwo-way ANOVA goes beyond one-way ANOVA by analyzing the effects of two independent variables in the same experiment. In a two-way ANOVA the two independent variables are called main effects or factors and each main effect has its own individual hypothesis.\n\n\n\nFinally, the two-way ANOVA analyzes if there is an interaction effect between the two main effects or factors. An interaction effect occurs when the effect of one of the independent variables is not the same at all levels of the second independent variable. So the set up for a two-way ANOVA is:\n\nIndependent variable 1 = Main effect 1 = Factor 1\nIndependent variable 2 = Main effect 2 = Factor 2\nInteraction effect = Interaction between Factors 1 & 2\n\n\n\n\nThe example dataset tests what’s called the “beer googles effect”. Sometimes alcohol can have an effect on perceptions of attraction for potential dates, especially later in the evening at bars. The dataset tests whether perceptions of attractiveness change after drinking alcohol and whether males and females are effected by this phenomenon differently.\n\n\nTwo Independent Variables (Main Effects or Factors)\n\nMain Effect 1 = Alcohol 3 Levels (None, 2 Pints, 4 Pints)\nMain Effect 2 = Gender 2 Levels (Male and Female)\nInteraction Effect = Gender x Alcohol\nDV = Attractiveness of the partner selected at the end of the evening\n\nAlternative Hypotheses\n\n\\(H_1\\) Alcohol has an effect on the attractiveness level of the selected partner\n\\(H_2\\) Gender has an effect on the attractiveness level of the partner.\n\\(H_3\\) There is an interaction effect between Alcohol and Gender\n\nSum of Squares variation estimates\n\n\n\n\n\nGet the dataset and import it\n\nlibrary(haven)\ngoggles <- read_sav(\"goggles.sav\")\n\nCheck out Gender variable\n\ngoggles$Gender\n\n<labelled<double>[48]>: Gender\n [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n[39] 0 0 0 0 0 0 0 0 0 0\n\nLabels:\n value label\n 0 Male\n 1 Female\n\n\nCheck out Alcohol variable\n\ngoggles$Alcohol\n\n<labelled<double>[48]>: Alcohol Consumption\n [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 2 2 2 2 2 2\n[39] 2 2 3 3 3 3 3 3 3 3\n\nLabels:\n value label\n 1 None\n 2 2 Pints\n 3 4 Pints\n\n\nCheck out Attractiveness variable\n\ngoggles$Attractiveness\n\n [1] 65 70 60 60 60 55 60 55 70 65 60 70 65 60 60 50 55 65 70 55 55 60 50 50 50\n[26] 55 80 65 70 75 75 65 45 60 85 65 70 70 80 60 30 30 30 55 35 20 45 40\nattr(,\"label\")\n[1] \"Attractiveness of Date\"\nattr(,\"format.spss\")\n[1] \"F8.0\"\nattr(,\"display_width\")\n[1] 13\n\n\nAlcohol and Gender variables are factors. 
When they get imported from SPSS they don’t function as well because the focus is on the numbers, not the labels or words.We can use tidyverse and the mutate function to fix this.\n\ngoggles <- goggles %>% \n mutate(Gender = factor(Gender, levels = c(0,1), \n labels = c(\"Male\", \"Female\")))\n\nWe can do the same thing with the Alcohol variable. Make sure you know the levels or numbering of the variable\n\ngoggles$Alcohol\n\n<labelled<double>[48]>: Alcohol Consumption\n [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 2 2 2 2 2 2\n[39] 2 2 3 3 3 3 3 3 3 3\n\nLabels:\n value label\n 1 None\n 2 2 Pints\n 3 4 Pints\n\n\nThen go ahead and mutate the variable as you did with Gender\n\ngoggles <- goggles %>% \n mutate(Alcohol = factor(Alcohol, levels = c(1,2,3), \n labels = c(\"None\", \"2 Pints\", \"4 Pints\")))\n\n\n\nNow we can move to the Two-Way ANOVA analysis. “m3” is the new object to save results so the formula will have this structure.\n\n# m3 <- aov (this is the computation you are using, like t.test) Then the rest of your formula should look like this (Dependent variable ~ Varible 1 + Variable 2 + Variable 1*Variable 2, data = [your dataset])\n\n\nm3 <- aov(Attractiveness ~ Gender + Alcohol + \n Gender*Alcohol, data = goggles)\n\nCheck the results\n\nsummary(m3)\n\n Df Sum Sq Mean Sq F value Pr(>F) \nGender 1 169 168.8 2.032 0.161 \nAlcohol 2 3332 1666.1 20.065 7.65e-07 ***\nGender:Alcohol 2 1978 989.1 11.911 7.99e-05 ***\nResiduals 42 3487 83.0 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\nKey Findings from our output\nMain effect of Alcohol on attractiveness\nInteraction effect of Alcohol and Gender\nWhat does this mean?\n\n\n\n\nLet’s use a graph to understand this better\nFirst let’s look at each variable individually\n\nGogglesAlcohol <- goggles %>%\n group_by(Alcohol) %>%\n summarize(n = n(),\n mean = mean(Attractiveness),\n sd = sd(Attractiveness),\n se = sd/sqrt(n),\n ci = qt(0.975, df = n - 1) * sd / sqrt(n))\n\nGraph the Alcohol variable individually\n\nggplot(GogglesAlcohol, aes(x = Alcohol,\n y = mean)) +\n geom_point(size = 3) +\n geom_line(size = 1) +\n geom_errorbar(aes(ymin =mean - ci, \n ymax = mean + ci), \n width = .1)\n\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n\n\n`geom_line()`: Each group consists of only one observation.\nℹ Do you need to adjust the group aesthetic?\n\n\n\n\n\n\n\n\n\nGraph the Gender variable individually\n\nGogglesGender <- goggles %>%\n group_by(Gender) %>%\n summarize(n = n(),\n mean = mean(Attractiveness),\n sd = sd(Attractiveness),\n se = sd/sqrt(n),\n ci = qt(0.975, df = n - 1) * sd / sqrt(n))\n\nGraph it\n\nggplot(GogglesGender, aes(x = Gender,\n y = mean)) +\n geom_point(size = 3) +\n geom_line(size = 1) +\n geom_errorbar(aes(ymin =mean - ci, \n ymax = mean + ci), \n width = .1) +\n ylim(0,70)\n\n`geom_line()`: Each group consists of only one observation.\nℹ Do you need to adjust the group aesthetic?\n\n\n\n\n\n\n\n\n\nGraph the Interaction Effect\nFinally, we can graph relationships for both variables.\nFirst find your descriptive statistics, but this time based on two independent variables\n\nGogglesDescriptives <- goggles %>%\n group_by(Alcohol, Gender) %>%\n summarize(n = n(),\n mean = mean(Attractiveness),\n sd = sd(Attractiveness),\n se = sd/sqrt(n),\n ci = qt(0.975, df = n - 1) * sd / sqrt(n))\n\n`summarise()` has grouped output by 'Alcohol'. 
You can override using the\n`.groups` argument.\n\n\nCheck it\n\nGogglesDescriptives\n\n# A tibble: 6 × 7\n# Groups: Alcohol [3]\n Alcohol Gender n mean sd se ci\n <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>\n1 None Male 8 66.9 10.3 3.65 8.64\n2 None Female 8 60.6 4.96 1.75 4.14\n3 2 Pints Male 8 66.9 12.5 4.43 10.5 \n4 2 Pints Female 8 62.5 6.55 2.31 5.47\n5 4 Pints Male 8 35.6 10.8 3.83 9.06\n6 4 Pints Female 8 57.5 7.07 2.5 5.91\n\n\nUse a line graph to graph the relationship\n\nggplot(GogglesDescriptives, aes(x = Alcohol,\n y = mean, \n group=Gender, \n color=Gender)) +\n geom_point(size = 3) +\n geom_line(size = 1) +\n geom_errorbar(aes(ymin = mean - ci, \n ymax = mean + ci), \n width = .1)\n\n\n\n\n\n\n\n\nUse dodge functions to make the graph clearer\n\npd <- position_dodge(0.2)\nggplot(GogglesDescriptives, \n aes(x = Alcohol, \n y = mean, \n group=Gender, \n color=Gender)) +\n geom_point(position = pd, \n size = 3) +\n geom_line(position = pd,\n size = 1) +\n geom_errorbar(aes(ymin = mean - ci, \n ymax = mean + ci), \n width = .1, \n position= pd)\n\n\n\n\n\n\n\n\nAll the bells and whistles\n\npd <- position_dodge(0.2)\nggplot(GogglesDescriptives, \n aes(x = Alcohol, \n y = mean, \n group=Gender, \n color=Gender)) +\n geom_point(position=pd, \n size = 3) +\n geom_line(position = pd, \n size = 1) +\n geom_errorbar(aes(ymin = mean - ci, \n ymax = mean + ci), \n width = .1, \n position = pd, \n size = 1) +\n scale_color_brewer(palette=\"Set1\") +\n theme_minimal() +\n labs(title = \"Beer Goggles Effect\",\n x = \"Amount of Alcohol Consumed\", \n y = \"Mean Level of Attractiveness\",\n color = \"Gender\")\n\n\n\n\n\n\n\n\nUse Tukey to look at specific differences in the groups you are interested in\n\nTukeyHSD(m3)\n\n Tukey multiple comparisons of means\n 95% family-wise confidence level\n\nFit: aov(formula = Attractiveness ~ Gender + Alcohol + Gender * Alcohol, data = goggles)\n\n$Gender\n diff lwr upr p adj\nFemale-Male 3.75 -1.558607 9.058607 0.1613818\n\n$Alcohol\n diff lwr upr p adj\n2 Pints-None 0.9375 -6.889643 8.764643 0.9544456\n4 Pints-None -17.1875 -25.014643 -9.360357 0.0000105\n4 Pints-2 Pints -18.1250 -25.952143 -10.297857 0.0000040\n\n$`Gender:Alcohol`\n diff lwr upr p adj\nFemale:None-Male:None -6.250 -19.851381 7.351381 0.7432243\nMale:2 Pints-Male:None 0.000 -13.601381 13.601381 1.0000000\nFemale:2 Pints-Male:None -4.375 -17.976381 9.226381 0.9277939\nMale:4 Pints-Male:None -31.250 -44.851381 -17.648619 0.0000003\nFemale:4 Pints-Male:None -9.375 -22.976381 4.226381 0.3286654\nMale:2 Pints-Female:None 6.250 -7.351381 19.851381 0.7432243\nFemale:2 Pints-Female:None 1.875 -11.726381 15.476381 0.9983764\nMale:4 Pints-Female:None -25.000 -38.601381 -11.398619 0.0000306\nFemale:4 Pints-Female:None -3.125 -16.726381 10.476381 0.9825753\nFemale:2 Pints-Male:2 Pints -4.375 -17.976381 9.226381 0.9277939\nMale:4 Pints-Male:2 Pints -31.250 -44.851381 -17.648619 0.0000003\nFemale:4 Pints-Male:2 Pints -9.375 -22.976381 4.226381 0.3286654\nMale:4 Pints-Female:2 Pints -26.875 -40.476381 -13.273619 0.0000080\nFemale:4 Pints-Female:2 Pints -5.000 -18.601381 8.601381 0.8796489\nFemale:4 Pints-Male:4 Pints 21.875 8.273619 35.476381 0.0002776\n\n\nFor us, the most important difference is that males are more likely to choose a less attractive person at 4 pints of alcohol than females.\n\n\n\n\n\nThere was a significant main effect of the amount of alcohol consumed on the attractiveness of the date that was selected, F(2, 42) = 20.07, p < .001.\n\n\nThere was not a significant main effect of gender 
on the attractiveness of the date that was selected, F(1, 42) = 2.03, p = .161.\n\n\n\n\n\nThere was a significant interaction effect between amount of alcohol consumed and gender on the attractiveness of the date that was selected, F(2, 42) = 11.91, p < .001.\n\n\n\n\n\nTukeyHSD post hoc tests revealed that at the largest amount of alcohol consumption (4 pints) males were significantly more likely to choose a less attractive date (M=35.6, SE=3.83) in comparison to females (M=57.5, SE=2.5). This difference, 21.88, 95% CI[8.27, 35.48], was significant with an adjusted p = .0003.",
"crumbs": [
"Statistical Tests",
"Two-Way ANOVA"
]
},
{
"objectID": "Regression.html",
"href": "Regression.html",
"title": "Regression",
"section": "",
"text": "Regression\nUnderstanding Linear regression.\nAs we saw with correlation, one way we can understand a relationship is through using a line as a model. Lines require two points and when using a scatterplot, based on the slope, an estimate of the strength of the relationship between two variables can be assessed. Linear regression adds more tools to understand the relationship between two variables.\nLet’s start with the basic equation for a line. \\[\nY=b_0 + b_1X\n\\] \\(b_1\\) tells us the shape of the model, basically whether it’s positive or negative, just like we saw with correlational coefficients earlier. \\(b_0\\) stands for our Y intercept or where the line crosses the Y axis. So these two values essentially stand in for our two points on a line. One indicates the point on the Y axis where the line crosses the Y axis and the other one describes the slope of the line (either positive or negative) and the degree of the slope of the line.\nOne thing that regression lines enable is the prediction of one variable (the outcome variable or dependent variables) based on another (predictor variable or independent variable). This is a form of simple regression. If you are using more than one variable or several predictors to predict an outcome variable, this is known as multiple regression.\nFor example, going back to our Album Sales dataset. If we wanted to predict album sales (Y)(outcome variable) based on our advertising budget (X)(predictor variable) and the regression model had 50 as the constant and 100 as the gradient, the formula would look like this: \\[\nalbum\\;sales = b_0 + b_1(advertising\\;budget)\n\\] The we could supply the constant and the gradient \\[\nalbum\\;sales = 50 + (100 \\times advertising\\;budget)\n\\] Then we could solve for the number of albums we could sell if we spent 5 dollars on the advertising budget \\[\nalbum\\;sales = 50 + (100 \\times 5)\n\\] Thus in R\n\n50+(100*5)\n\n[1] 550\n\n\nThe values of 50 and 100 were just made up numbers. How do we find the actual intercepts and gradients using R?\nLet’s take a look at the code for running a linear regression in R\n\nregression <- lm(formula = Sales ~ Adverts, data = Album_Sales)\n\nFor regression models, the output needs to be saved in an object, so there’s a second step to display the results.\n\nregression\n\n\nCall:\nlm(formula = Sales ~ Adverts, data = Album_Sales)\n\nCoefficients:\n(Intercept) Adverts \n 134.13994 0.09612 \n\n\nNotice that when the object is called up it provides the intercept and gradient discussed earlier. Thus we can plug these numbers into our formula using R and find the predicted album sales based on spending 500 dollars on the advertising budget.\n\n134.14 + (.096*500)\n\n[1] 182.14\n\n\nThe two coefficents have been identified, but we still don’t know if advertising is a good predictor of album sales. However, this information is also in the results that were saved in the object “regression”. To look at these results, the summary command is used.\n\nsummary(regression)\n\n\nCall:\nlm(formula = Sales ~ Adverts, data = Album_Sales)\n\nResiduals:\n Min 1Q Median 3Q Max \n-152.949 -43.796 -0.393 37.040 211.866 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 1.341e+02 7.537e+00 17.799 <2e-16 ***\nAdverts 9.612e-02 9.632e-03 9.979 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 65.99 on 198 degrees of freedom\nMultiple R-squared: 0.3346, Adjusted R-squared: 0.3313 \nF-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16\n\n\nRemember that the variable of interest, the predictor variable, was advertising. If we worked at this record company we would be interested if money spent on advertising would help in increasing record sales. The first thing to notice is the F statistic, which analyzes whether the regression model is significant or not. Notice that this number is 99.59 with a p value well below .05, so that tells us our regression model is a good one.\nWhere did this statistical analysis come from though? To answer this question we need to look a little bit closer at how to analyze the fit of a line.\nFrom the discussion of correlation, you may remember that a regression line is a line that is used to fit the data as well as possible, but as with all statistics there is a certain amount of error. The regression line does not cross all the points on a scatterplot, but tries to be as close to as many points as possible. The difference between the line and a particular point is known as a residual. A residual is similar to a deviation, but it is the distance or error between the line and a particular point. The smaller the residual the better the line is a predictor for that particular point or score. The larger the residual the less predictive the regression line is for that score.\nTo evaluate the regression model, the average amount of residual error must be calculated. However, the same problem observed with trying to find the average deviation for a dataset remains here. If you add up all the residuals and try to divide by N, the residuals will add up to zero, which doesn’t help very much. We solve this problem the same way it was solved with deviations, we square them! This number becomes the sum of squared residuals and is symbolized below: \[\nSS_R\n\] To decide whether the regression line is a good predictor, something must be used as a comparator. In this case, the mean is compared to the line to see which serves as a better predictor for the data. If the line has a low amount of error (low \(SS_R\)) and shows considerable improvement in comparison to the mean (high \(SS_M\)), the regression line is a good predictor. Sum of squares model (\(SS_M\)) is a measurement of the average distance between the line and the mean. This provides a measurement of the slope or gradient of the line. Remember that the steeper the slope or gradient of the line the stronger the relationship between two variables. Here the mean (which is drawn as a horizontal line at the mean of Y) acts as an estimate of zero relationship or correlation. Thus, the greater the difference between the line and the mean, the greater the relationship or correlation between the variables and the better the predictor variable is at predicting the relationship.\nThere are a few more steps to get to the F value.\n1. Each sum of squares must be divided by its degrees of freedom. Think back to finding the standard deviation, where we divided the sum of squares by N - 1. The sum of squares model and the sum of squares residual each have slightly different degrees of freedom (where k is the number of parameters in the model, including the intercept):\n- Sum of squares model: k - 1\n- Sum of squares residual: N - k\n2. Once we’ve divided by the degrees of freedom, we get two numbers, the mean squared model and the mean squared residual. This is how we get to the F value. \[\nF = \frac {MS_M}{MS_R}\n\] F is a ratio of these two values, so it can never be negative; if the model is no better than the error, F will be close to 1. 
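If you want to see the actual sums of squares, degrees of freedom, and mean squares behind an F value, one option is to call the base R anova() function on the regression object saved above (a quick sketch; output not shown here):\n\nanova(regression)\n\n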
The higher the F value the higher the amount of improvement demonstrated by the regression model in comparison to the residuals (error in the model).\nOf course, we also want to include a scatterplot to show the relationship.\n\nggplot(Album_Sales, mapping = aes(x = Adverts, y = Sales)) +\n geom_point() +\n geom_smooth(method = 'lm')\n\n`geom_smooth()` using formula = 'y ~ x'\n\n\n\n\n\n\n\n\n\nFinally, the results section should be properly formatted. Here is a good example:\n\nThe regression model does appear to be significant. Advertising budget is a good predictor of sales based on the ANOVA test, F(1, 198) = 99.59, p < .001. Based on the \(R^2\) value of .33, 33% of the variance in sales can be explained by the advertising budget.",
"crumbs": [
"Statistical Tests",
"Regression"
]
},
{
"objectID": "Independent t test.html",
"href": "Independent t test.html",
"title": "Independent t-test",
"section": "",
"text": "Resources to consult:\n\nChapter 10 from Field, A. (2017). Discovering Statistics Using IBM SPSS Statistics (5th Edition). SAGE Publications, Ltd. (UK).\nSection from Learning Statistics with R",
"crumbs": [
"Intro to Statistics",
"Independent t-test"
]
},
{
"objectID": "Independent t test.html#resources",
"href": "Independent t test.html#resources",
"title": "Independent t-test",
"section": "",
"text": "Resources to consult:\n\nChapter 10 from Field, A. (2017). Discovering Statistics Using IBM SPSS Statistics (5th Edition). SAGE Publications, Ltd. (UK).\nSection from Learning Statistics with R",
"crumbs": [
"Intro to Statistics",
"Independent t-test"
]
},
{
"objectID": "Independent t test.html#review",
"href": "Independent t test.html#review",
"title": "Independent t-test",
"section": "Review",
"text": "Review\nIn the section on probability we learned about the binomial distribution, which is a type of probability distribution. Probability distributions allow us to look at the probability of different types of outcomes. For example, the binomial distribution enables the calculation of the probability associated with getting say 6 heads out of 25 flips of a fair-sided coin.\n\ndbinom(x = 6, size = 25, prob = 1/2)\n\n[1] 0.005277991\n\n\nFor the t-test we use the t distribution to analyze the probability of the outcome of a particular experiment. More specifically, an experiment with two groups. The t distribution is similar to the binomial distribution, but much closer to the z distribution that we learned about in the section on the normal curve and z scores.",
"crumbs": [
"Intro to Statistics",
"Independent t-test"
]
},
{
"objectID": "Independent t test.html#the-t-distribution",
"href": "Independent t test.html#the-t-distribution",
"title": "Independent t-test",
"section": "The t Distribution",
"text": "The t Distribution\nPagano (2013) provides a good definition of the t distribution.\n\nThe sampling distribution of t is a probability distribution of the t values that would occur if all possible different samples of a fixed size N were drawn from the null-hypothesis population. It gives (1) all the possible different t values for samples of size N and (2) the probability of getting each value if sampling is random from the null-hypothesis population p. 329\n\nLet’s try to unpack this a bit. First off, it’s a probability distribution constructed of t values or t scores based on all possible samples drawn from a population. Remember that samples are a smaller group taken from a population. The “fixed size N” refers to the size of the sample that was taken in the experiment, so the distribution will change it’s shape slightly based on N.\nAlthough the t distribution is based on the sample size, it’s actually constructed by using a somewhat complicated mathematical concept called degrees of freedom or df. Degrees of freedom refers to how many scores are free to vary for a given statistic. Rather, than go too far into the weeds with this concept, for now just know that the degrees of freedom for the independent t test is \\(N - 2\\).\nSo if we had a sample size of 30, we would use a t distribution of \\(df=30 - 2\\) or \\(df=28\\) to construct the distribution.\n\n\n\n\n\n\n\n\n\nThe t distribution changes shape based on the degrees of freedom (df) that is associated with it, which is related to the sample size.\n\n\n\n\n\n\n\n\n\nNotice that the t distribution has the same shape as a normal curve, plus it looks like the z distribution with a mean of zero and positive values above the mean and negative values below the mean. Just like with the z distribution as the absolute value of the t score increases (remember that absolute value looks at the score and ignores the +/- sign), it is associated with a lower probability. Each t score has a particular probability associated with it.",
"crumbs": [
"Intro to Statistics",
"Independent t-test"
]
},
{
"objectID": "Independent t test.html#using-the-independent-t-test",
"href": "Independent t test.html#using-the-independent-t-test",
"title": "Independent t-test",
"section": "Using the Independent t test",
"text": "Using the Independent t test\nThe independent t test is used when analyzing the difference in means between two separate groups. Typically, this is used to analyze the difference between the control group (group who did not receive the independent variable) and the experimental group (group who did receive the independent variable).\n\nNull Hypothesis - Experimental Group = Control Group\n\n\nAlternative Hypothesis - Experimental Group \\(\\neq\\) Control Group\n\nThe null hypothesis assumes that the mean difference between the two groups is equal to zero. If the samples came from the same population, their means would be roughly the same and if they came from the same population their characteristics would be the same as well. Thus, the null hypothesis assumes no real differences based on the presence of the independent variable, while the alternative hypothesis assumes that the two means are different. \\[\nNull = \\bar X_{1} = \\bar X_{2}\n\\]\n\\[\nAlternative = \\bar X_{1} \\neq \\bar X_{2}\n\\]\nThis type of design is sometimes referred as a between-groups or independent design. It requires a categorical (nominal) predictor or independent variable that will specify the two groups (experimental and control) and a continuous dependent variable that specifies the outcome.\nThere are two possible reasons why two samples have different means.\n\nThe two sample means come from the same population, but the existence of measurement error and variability are the reason they are different. (Null hypothesis)\nThe two sample means come from different populations, which have different characteristics, and reflect a genuine difference between the sample means. (Alternative hypothesis)\n\n\nIndependent t test formula\nRemember the basic formula used in most statistical situations\n\\[\nStatistic = \\frac{Signal}{Noise}\n\\]\nThe signal refers to systematic variation or variation caused by the experimental manipulation and the work of the independent variable. If the independent variable has an effect we’ll expect there to be a difference between the means. Noise refers to background noise or unsystematic variation. Variation we don’t manipulate or don’t have control over, like measurement error.\nThus, the formula for the t test can be represented as:\n\\[\nt = \\frac{difference\\;between\\; sample\\;means}{\\;measurement\\; error}\n\\]\nBasically, the difference between the means is what we would expect if the independent variable has an effect. There should be some difference in the dependent variable that is a measure the effect of the independent variable. This is then compared to the measurement error. Remember that all statistical calculation contains some amount of error, but the difference between the means has to be greater that than measurement error in order to be considered statistically significant. So the numerator in the formula is simply the difference between the means of the two groups or \\(\\bar X_{1}-\\bar X _{2}\\).\nAs an estimate of the measurement error, the t test uses the standard deviation from the sample divided by the square root of n to calculate the measurement error. This is a form of the standard deviation, which was shown earlier as a descriptive statistic of measurement error. This statistic is known as the estimate of the standard error. 
So here is what our formula looks like now.\n\[\nt = \frac{\bar X_{1}-\bar X _{2}}{estimate\; of\; the\; standard\; error}\n\]\nHowever, we need estimates from both sample 1 and sample 2, so the denominator of the equation looks like this: \[\n\sqrt{\frac{s^{2}_{1}}{n_{1}}+\frac{s^{2}_{2}}{n_{2}} }\n\]\nFinally, we can put everything together for the primary formula. \[\nt = \frac{\bar X_{1}- \bar X_{2}}{\sqrt{\frac{s^{2}_{1}}{n_{1}}+\frac{s^{2}_{2}}{n_{2}} }}\n\]\nOverall, for the independent t-test, if the mean difference between the two samples is significantly greater than the measurement error in the two samples (which is found by using the estimated standard error), there is a real difference between the two samples.",
"crumbs": [
"Intro to Statistics",
"Independent t-test"
]
},
{
"objectID": "Independent t test.html#independent-t-test-example",
"href": "Independent t test.html#independent-t-test-example",
"title": "Independent t-test",
"section": "Independent t-test example",
"text": "Independent t-test example\nFor the t test example, we’ll use a dataset based on a fictional experiment involving Harry Potter and the cloak of invisibility. The cloak of invisibility if the one of the deathly hallows and a gift that Harry received in year one at Hogwarts. When a person puts on the cloak they are invisible to everyone and can run around the castle doing whatever kinds of mischief then can imagine.\n\nHypotheses\nSo the experiment involves the use of the cloak and if persons act more mischievous when they wear it. So let’s lay out the particulars of the experiment.\n\nIndependent variable (IV) = The Cloak\nDependent variable (DV) = Acts of mischief\n\nSince this is an independent t test there will be 2 groups.\n\nGroup 1 - Experimental Group - Wears the cloak of invisibility\nGroup 2 - Control Group - Does not wear a cloak of invisibility\n\nThis would call for two hypotheses.\n\nNull Hypothesis - Wearing the invisibility cloak does not increase mischievous acts\nAlternative Hypothesis - Wearing the invisibility cloak increases mischievous acts\n\nThe experimental assumptions would be:\n\nNull hypothesis -> Group 1 = Group 2 or \\(\\bar X_{1}=\\bar X _{2}\\)\nAlternative Hypothesis -> Group 1 \\(\\neq\\) Group 2 or \\(\\bar X_{1}\\neq\\bar X _{2}\\)\n\n\n\nLoad the Data into R Studio\n\nFirst Step, upload dataset from SPSS\nGet data set named “Invisibility” from SPSS datasets\nUse import dataset tool under the environment tab\nFind file called invisibility.sav\n\n\nlibrary(haven)\nInvisibility <- read_sav(\"https://github.com/jvanster/Psy_300_Datasets/raw/refs/heads/main/SPSS_Datasets/Invisibility.sav\")\n\n\n\nInspect variables\n\nInvisibility$Cloak\n\n<labelled<double>[24]>: Cloak of invisibility\n [1] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1\n\nLabels:\n value label\n 0 No Cloak\n 1 Cloak\n\nInvisibility$Mischief\n\n [1] 3 1 5 4 6 4 6 2 0 5 4 5 4 3 6 6 8 5 5 4 2 5 7 5\nattr(,\"label\")\n[1] \"Mischievous Acts\"\nattr(,\"format.spss\")\n[1] \"F8.0\"",
"crumbs": [
"Intro to Statistics",
"Independent t-test"
]
},
{
"objectID": "Independent t test.html#bar-graph",
"href": "Independent t test.html#bar-graph",
"title": "Independent t-test",
"section": "Bar Graph",
"text": "Bar Graph\nTo begin, let’s analyze some descriptive statistics based on the “Invisibility” dataset and create a bar graph to have a first estimate as to whether there is a difference between the groups.\nThe first step is to create a second dataset with your descriptive variables. We’ll use the dplyr package to do this, which is part of tidyverse.\n\nFind Descriptive Statistics\n\nlibrary(dplyr)\nInvis_Descriptives <- Invisibility |>\n group_by(Cloak) |>\n summarize(n = n(),\n mean = mean(Mischief),\n sd = sd(Mischief),\n se = sd / sqrt(n),\n ci = qt(0.975, df = n - 1) * sd / sqrt(n))\nInvis_Descriptives\n\n# A tibble: 2 × 6\n Cloak n mean sd se ci\n <dbl+lbl> <int> <dbl> <dbl> <dbl> <dbl>\n1 0 [No Cloak] 12 3.75 1.91 0.552 1.22\n2 1 [Cloak] 12 5 1.65 0.477 1.05\n\n\nNotice the basic structure of the code. We use the “filter” or “pipe” function to filter the dataset based on the two groups and then find all the important descriptive statistics we need. When we take a closer look at this dataset a few things should be of interest.\n\nInvis_Descriptives\n\n# A tibble: 2 × 6\n Cloak n mean sd se ci\n <dbl+lbl> <int> <dbl> <dbl> <dbl> <dbl>\n1 0 [No Cloak] 12 3.75 1.91 0.552 1.22\n2 1 [Cloak] 12 5 1.65 0.477 1.05\n\n\nThe first thing we should notice is that the mean number of mischievous acts without the cloak is 3.75, while the mean number of mischievous acts with the cloak is 5.00. Remember the assumption of the alternative hypothesis.\n\nAlternative Hypothesis - Group 1 \\(\\neq\\) Group 2 or \\(\\bar X_{1}\\neq\\bar X _{2}\\)\n\nAnd this is what we find when we compare the two means\n\\[\n3.75\\neq 5.00\n\\]\nThis is also demonstrated with a basic bar graph as well.\n\nggplot(Invis_Descriptives, \n aes(x = Cloak, \n y = mean)) +\n geom_bar(stat = \"identity\")\n\n\n\n\n\n\n\n\n\n\nStatistically Significant?\nSo it does look like there is a difference between these two groups. The question remains, is the difference statistically significant. Thinking back to the t-test equation, is the mean difference significantly more than the measurement error to produce a t score that is associated with a low probability?\n\\[\nt = \\frac{5.00-3.75}{measurement\\; error}\n\\]\nRemember in this case, the measurement error statistic we are using is the estimated standard error, so the formula looks like this:\n\\[\nt = \\frac{5.00-3.75}{estimated\\; standard\\; error}\n\\]\nTo find the estimated standard error, the formula is:\n\\[\n\\sqrt{\\frac{s^{2}_{1}}{n_{1}}+\\frac{s^{2}_{2}}{n_{2}} }\n\\]\nIf we were going to run the t score calculation by hand it would look like this:\n\\[\nt = \\frac{5.00-3.75}{\\sqrt{\\frac{1.91^2}{12}+\\frac{1.65^2}{12 }}}\n\\]So we could do the t test by hand using r\n\n(3.75-5.00)/(sqrt(1.91^2/12 + 1.65^2/12))\n\n[1] -1.715578\n\n\nRather than needing to run the calculation by hand, we can use r to the the calculation for us using the t.test() function.\n\nt.test(Mischief ~ Cloak, data = Invisibility)\n\n\n Welch Two Sample t-test\n\ndata: Mischief by Cloak\nt = -1.7135, df = 21.541, p-value = 0.101\nalternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0\n95 percent confidence interval:\n -2.764798 0.264798\nsample estimates:\nmean in group 0 mean in group 1 \n 3.75 5.00 \n\n\nSo we can see that the t score is -1.71 and the corresponding p value is .1007. The p value is greater than alpha or greater than .05. 
Thus, we fail to reject the null and our experiment does not support the idea that a cloak of invisibility increases acts of mischief.\n\n\nT Distribution for the Example\nThe degrees of freedom (df) are 22, so the t distribution would take on a shape that looks like this:\n\n\n\n\n\n\n\n\n\nHere’s a graph that shows the area for rejection of the Null (p value < .05) or retaining the null (p value > .05).\n\ndist_t(deg.f = 22, p = 1)\n\n\n\n\n\n\n\n\nNotice that the t score that starts the rejection area for the Null is 1.72, which is just slightly bigger than 1.71, the t score associated with the outcome in the invisibility cloak experiment; this makes sense since the p value of .101 was not far above the alpha level of .05.\n\n\nEffect Size\nThe other statistic that will be a part of the conclusion is the effect size, which is a measurement of the magnitude of the results obtained in the experiment. Effect sizes can be used to compare the magnitude of this experiment to others and it has become standard practice to include an effect size with any statistical results.\nIn this case, we’ll use Cohen’s d as a measure of the effect size. The formula for Cohen’s d is to subtract the means from each other and then divide by the standard deviation of the control group.\nWe can bring up our descriptive statistics if we need to remember the means and standard deviations.\n\nInvis_Descriptives\n\n# A tibble: 2 × 6\n Cloak n mean sd se ci\n <dbl+lbl> <int> <dbl> <dbl> <dbl> <dbl>\n1 0 [No Cloak] 12 3.75 1.91 0.552 1.22\n2 1 [Cloak] 12 5 1.65 0.477 1.05\n\n\nSo for Cohen’s d\n\nCohens_d <- (5.00-3.75)/1.91\nCohens_d\n\n[1] 0.6544503\n\n\n\n\nBar Graph for t Test\nEarlier, we looked at a basic bar graph that showed a difference between the means. However, a key piece of information was missing from this graph: confidence intervals. Let’s go ahead and add those in.\n\nggplot(Invis_Descriptives, \n aes(x = Cloak, \n y = mean)) +\n geom_bar(stat = \"identity\") +\n geom_errorbar(aes(ymin=mean-ci,\n ymax=mean+ci))\n\n\n\n\n\n\n\n\nRemember that confidence intervals show a range or interval for values of the population mean based on a sample. For an Independent t test, if the independent variable has a real effect, we would expect that the two samples come from different populations, rather than the same population. When the confidence intervals overlap as they do in the chart above that means that there is a substantial chance that these samples came from the same population. If the confidence intervals do not overlap there is a much smaller chance that they came from the same population, thus the independent variable is more likely to have had an effect.\nWe use confidence intervals, p values, and effect sizes to analyze an experiment. A p value may be below .05, but the confidence intervals may still be touching, so all parts of the analysis are helpful.\nWe can add in labels to improve the look of our chart. 
We’ve added labels to our factor variable and labels for the title and x and y variables.\n\nggplot(Invis_Descriptives, \n aes(x = factor(Cloak, labels=c(\"No Cloak\", \"Cloak\")),\n y = mean)) +\n geom_bar(stat = \"identity\") +\n geom_errorbar(aes(ymin=mean-ci,\n ymax=mean+ci)) +\n labs(title = \"Mean Number of Mischievous Acts with or without Cloak\", \n y=\"Mean Number of Mischievous Acts\", x=\"Were They Wearing a Cloak?\")\n\n\n\n\n\n\n\n\nNext let’s add some color to our chart.\n\nggplot(Invis_Descriptives, \n aes(x = factor(Cloak, labels=c(\"No Cloak\", \"Cloak\")),\n y = mean)) +\n theme_minimal() +\n geom_bar(stat = \"identity\", fill=\"cornflowerblue\") +\n geom_errorbar(aes(ymin=mean-ci,\n ymax=mean+ci), width=.3, size=1) +\n labs(title = \"Mean Number of Mischievous Acts with or without Cloak\", \n y=\"Mean Number of Mischievous Acts\", x=\"Were They Wearing a Cloak?\")\n\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n\n\n\n\n\n\n\n\n\n\n\nWriting up the Results\nHere is how to write up the results for this experiment including all the relevant information. Of course, it’s also important to include the bar graph as well.\n\n\nResults\n\nOn average, participants given a cloak of invisibility engaged in more acts of mischief (M = 5, SE = 0.48) than those not given a cloak (M = 3.75, SE = 0.55). This difference, -1.25, 95% CI[-2.76, 0.26], was not significant, t(21.54) = −1.71, p = 0.101. However, it did represent a medium-sized effect, d = 0.65.\n\nSteps to writing results\n\nWrite out the means and include standard error\nWrite out the difference between the means (subtract the sample means) and the confidence intervals for the difference, which is a part of the R output.\nt(df) = t score, p value, Cohen's d",
"crumbs": [
"Intro to Statistics",
"Independent t-test"
]
},
{
"objectID": "Intro to ggplot.html",
"href": "Intro to ggplot.html",
"title": "Intro to ggplot",
"section": "",
"text": "ggplot is a graphics language that we use to make graphs. The text we will use for our first go at making graphs is called R for Data Science\nLike a lot of other attractive things about R and R Studio the book is free!! and probably one of the best resources for understanding R and Data.\nYou can find the book here\nI’ll be using several modified examples from Chapter 3 of that book.",
"crumbs": [
"Graphs",
"Intro to ggplot"
]
},
{
"objectID": "Intro to ggplot.html#using-ggplot",
"href": "Intro to ggplot.html#using-ggplot",
"title": "Intro to ggplot",
"section": "",
"text": "ggplot is a graphics language that we use to make graphs. The text we will use for our first go at making graphs is called R for Data Science\nLike a lot of other attractive things about R and R Studio the book is free!! and probably one of the best resources for understanding R and Data.\nYou can find the book here\nI’ll be using several modified examples from Chapter 3 of that book.",
"crumbs": [
"Graphs",
"Intro to ggplot"
]
},
{
"objectID": "Intro to ggplot.html#the-tidyverse",
"href": "Intro to ggplot.html#the-tidyverse",
"title": "Intro to ggplot",
"section": "The Tidyverse",
"text": "The Tidyverse\nTidyverse is a package we’ll be using throughout our exploration of statistics. There are lots of great helps for doing data science.\nFirst you’ll want to load “tidyverse” as a package. Here’s how we find and load packages in R Studio.\n\nSearch under “Packages” in the bottom right window\nClick on install and search for tidyverse in the CRAN repository\nOnce you’ve found tidyverse, click on install - It may take some time for the package to download and install - wait until it is all done.\nOnce the package is downloaded just make sure to check next to the package on your list of packages OR simply use this code.\n\nlibrary(tidyverse)\n\n\n\nmpg dataset\nWe’ll start off by using a dataset called “mpg” It should have loaded with the tidyverse package. Just type it in to find it\n\nmpg\n\n# A tibble: 234 × 11\n manufacturer model displ year cyl trans drv cty hwy fl class\n <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>\n 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…\n 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…\n 3 audi a4 2 2008 4 manu… f 20 31 p comp…\n 4 audi a4 2 2008 4 auto… f 21 30 p comp…\n 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…\n 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…\n 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…\n 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…\n 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…\n10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…\n# ℹ 224 more rows\n\n\nIt shows the dataset as a “tibble”, which just means table in the tidyverse\nBecause it’s a preloaded dataset we can get information about it by using a question mark. A question mark in front of any command in R will automatically generate the help file\n\n?mpg",
"crumbs": [
"Graphs",
"Intro to ggplot"
]
},
{
"objectID": "Intro to ggplot.html#levels-of-measurement",
"href": "Intro to ggplot.html#levels-of-measurement",
"title": "Intro to ggplot",
"section": "Levels of Measurement",
"text": "Levels of Measurement\n\nCategorical\n\nBinary variable\nNominal variable\nOrdinal variable\n\nContinuous\n\nInterval variable\nRatio variable\n\n\nDatasets contain several different variables. Notice in the mpg dataset, there are several different variables such as model, year, cty, and hwy. The str command shows all the different variables.\n\nstr(mpg)\n\ntibble [234 × 11] (S3: tbl_df/tbl/data.frame)\n $ manufacturer: chr [1:234] \"audi\" \"audi\" \"audi\" \"audi\" ...\n $ model : chr [1:234] \"a4\" \"a4\" \"a4\" \"a4\" ...\n $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...\n $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...\n $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...\n $ trans : chr [1:234] \"auto(l5)\" \"manual(m5)\" \"manual(m6)\" \"auto(av)\" ...\n $ drv : chr [1:234] \"f\" \"f\" \"f\" \"f\" ...\n $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...\n $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...\n $ fl : chr [1:234] \"p\" \"p\" \"p\" \"p\" ...\n $ class : chr [1:234] \"compact\" \"compact\" \"compact\" \"compact\" ...\n\n\nThese are several different types of variables because they represent different kinds of things. In statistics we refer to this as levels of measurement. There are two broad types of variables, categorical and continuous (Field, 2009, pages 8-9). Categorical variables are based on particular categories, such a type of shoe, religious affiliation, or political affiliation. A binary variable is a type of categorical variable that only takes two categories. So things like being pregnant or not, voting yes or no on a certain bill, or being alive or dead are binary variables, there are only two possible things in the category. Categorical variables that take on more than 2 possibilities are called nominal variables (nominal means names)\n\nCategorical\nThere’s no mathematics involved in determining categorical variables, it’s simply based on which category has the most accurate fit. Either you are a republican, democrat, or independent, there is no mathematical quantity that would determine the appropriate category. There is one type of categorical variable that is ordered based on the absence or presence of a particular property, which is an ordinal variable. Ordinal variables have a particular rank order, but the rankings are not equal or uniform. So for example, college basketball team rankings would be an ordinal variable. The teams are ranked from best to worst, but team #2 may be twice is good as team #3, while team #5 maybe four times as good as team #6. So ordinal data tells us more about the variable than nominal data, but it still doesn’t provide a standardized scale of measurement.\n\n\nContinuous\nContinuous variables are the second broad category and involves some type of numerical measurement to define the property in the variable. Continuous variables can take on any value in the measurement scale defined by the variable. The first example of this type variable is an interval variable. These types of variables are based on a measurement scale with equal distances between the ranks based on the property measured; equal intervals in the scale are able to represent equal differences in the variable being measured. The Fahrenheit scale would be an interval scale because the difference in degree between 64 and 65 is the same as 74 and 75.\n\n\nRatio\nRatio variables add one more dimension to the properties of an interval variable. 
In addition to having equal distances between ranks, ratio variables have an absolute zero point. This allows for multiplication of the intervals or the use of ratios. So on a ratio scale of 0 to 5, a score of 4 would be twice as good as a score of 2. Time is a good example of a ratio scale. There is an absolute zero point (there is no such thing as negative time), and the distance between 40 and 60 seconds is the same as the distance between 10 and 30 seconds. And 20 seconds is twice as long as 10 seconds. As another example, Fahrenheit is an interval scale because you can have negative degrees (e.g., 2 degrees below zero), while the Kelvin scale is a ratio scale because there is an absolute zero point and no negative numbers.\nOne other property of continuous variables is that they allow for different degrees of measurement precision. So for example, the continuous ratio variable of time can be measured in hours, minutes, seconds, milliseconds, etc. In contrast, a discrete variable can only take on a fixed measurement scale, like a rank scale of 1 to 10. The scale requires that you choose a value between 1 and 10; 2.5 is not possible.
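\n\nTo connect these levels of measurement back to R, the class() function shows how R stores each variable in the mpg dataset (a quick sketch; categorical variables show up as characters or factors, while continuous ones show up as numeric or integer):\n\nclass(mpg$manufacturer)\nclass(mpg$displ)\nclass(mpg$cyl)",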
"crumbs": [
"Graphs",
"Intro to ggplot"
]
},
{
"objectID": "Intro to ggplot.html#scatterplot",
"href": "Intro to ggplot.html#scatterplot",
"title": "Intro to ggplot",
"section": "Scatterplot",
"text": "Scatterplot\nLet’s start by creating a simple scatterplot graph\n\nggplot(data = mpg) +\n geom_point(mapping = aes(x= displ, y=hwy))\n\n\n\n\n\n\n\n\nThere’s a few things to learn here in the code\n\nggplot is the basic command and within the parentheses is the data we’ll be running our graph on\ngeom_point is the type of graph will be using, geom stands for geometrical shape. So in this case we are making a graph of “points”\nThe “mapping” argument lays out the variables we are graphing and is always paired with “aes”. Finally x and y lay out which variables are going to be on our x and y axes.\n\nSo our basic formula looks lke this:\nggplot(data = ) + (mapping = aes())\n\nAbout the variables\nIf you look at the variables, you’ll notice that hwy and displ are numbers and R views them as integers or whole numbers. Scatterplots usually require data that are numeric like this, numbers that go up in scale. So in this case scatterplots typically require interval or ratio data.\nOne peculiar thing about this graph is the group of numbers just above the numbers in the right corner. The general trend we see in these variables is that as displacement goes up (i.e. the engine uses more gasoline) the highway mileage goes down. This is sometimes referred to as a negative relationship. The cars with the best highway mileage displace the least amount of gasoline. However, there’s a group of dots that range between 20 and 30 on the hwy variable, but seem to displace more gasoline than most. How can this be?\n\n\nAdding additional variables\nOne way we can answer this question is by adding in an additional variable. In this case the class or type of car. Let’s add that in now as an additional variable to our graph.\n\nggplot(data = mpg) + \n geom_point(mapping = aes(x = displ, y = hwy, color = class))\n\n\n\n\n\n\n\n\nNotice how most of those dots are 2 seater cars. Thus, the reason they get better gas mileage is because they are smaller cars!!\nClass is a particular kind of variable, in this case a nominal or categorical variable. r refers to them as characters. These are basically categories, so we can’t do any math on them. You can’t make a formula out of compact x midsize or minivan/pickup.\n\n\nAdd Color\nLet’s do one more graph, but add a little color\n\nggplot(data = mpg) + \n geom_point(mapping = aes(x = displ, y = hwy), color = \"blue\")\n\n\n\n\n\n\n\n\nNotice how when the word “color” is outside the x and y mappings it changes the color of our points. You can try several different colors for the dots on your scatterplot.",
"crumbs": [
"Graphs",
"Intro to ggplot"
]
},
{
"objectID": "Intro to ggplot.html#bar-graphs",
"href": "Intro to ggplot.html#bar-graphs",
"title": "Intro to ggplot",
"section": "Bar Graphs",
"text": "Bar Graphs\nLet’s learn one more graph, which is especially important for categorical variables, the bar graph.\nFirst, check out the diamonds data set\n\ndiamonds\n\n# A tibble: 53,940 × 10\n carat cut color clarity depth table price x y z\n <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43\n 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31\n 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31\n 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63\n 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75\n 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48\n 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47\n 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53\n 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49\n10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39\n# ℹ 53,930 more rows\n\n\nThe basics of the formula are going to be the same, but this time we are using geom_bar as our geometrical shape to create a bar graph.\n\nggplot(data = diamonds) + \n geom_bar(mapping = aes(x = cut))\n\n\n\n\n\n\n\n\nIf you look in section 3.7 in R for Data Science it explains what R is doing with the data\n\nSo R looks at the data set and automatically uses the count statistic to simply count the number of occurrences for that variable. So this only works for categorical variables.\nLet’s inspect the cut variable to see if this is true. Remember that the str command allows us to investigate the types of variables in the dataset.\n\nstr(diamonds)\n\ntibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)\n $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...\n $ cut : Ord.factor w/ 5 levels \"Fair\"<\"Good\"<..: 5 4 2 4 2 3 3 3 1 3 ...\n $ color : Ord.factor w/ 7 levels \"D\"<\"E\"<\"F\"<\"G\"<..: 2 2 2 6 7 7 6 5 2 5 ...\n $ clarity: Ord.factor w/ 8 levels \"I1\"<\"SI2\"<\"SI1\"<..: 2 3 5 4 2 6 7 3 4 5 ...\n $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...\n $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...\n $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...\n $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...\n $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...\n $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...\n\n\nEach of the variables are preceded by the $ sign. So we can see that cut is the second variable. Notice that it says cut is an ord.factor with 5 levels. This tells us that this variable is an ordinal variable because it is ordered based on the type of cut from fair (lowest type of cut) to ideal (the best type of cut).\n\nAnother example\nHere’s another example from scratch. First I’ll create a quick dataset using the tribble command.\n\nExample <- tribble(\n ~group, ~number, \n \"Group 1\", 30, \n \"Group 2\", 50\n)\n\nNotice that our variable group is a categorical variable, but on it’s own the stat count won’t really tell us much (Group 1 = 1, Group 2 =1). So notice what happens when we try to make a bar graph like our last example.\n\nggplot(data = Example) + \n geom_bar(mapping = aes(x = group))\n\n\n\n\n\n\n\n\nDoesn’t really tell us much does it? So in this case the count has to be supplied by a second variable, y. 
When we use a y variable for our bar chart we have to use a different stat for the bar chart; in this case, the stat is “identity”, so the code looks like this.\n\nggplot(data = Example) +\n geom_bar(mapping = aes(x = group, y = number), stat = \"identity\")\n\n\n\n\n\n\n\n\nNow it’s easy to see the difference in count between Group 1 and Group 2.
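\n\nAs a side note, ggplot2 also provides geom_col(), a shorthand for a bar graph that uses stat = \"identity\" by default, so this sketch produces the same graph:\n\nggplot(data = Example) + \n geom_col(mapping = aes(x = group, y = number))",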
"crumbs": [
"Graphs",
"Intro to ggplot"
]
},
{
"objectID": "index.html",
"href": "index.html",
"title": "R Studio for Statistics",
"section": "",
"text": "R Studio and Statistics\nThis is a website for students and others to learn about statistics using R Studio.\nExamples and content are for anyone interested in learning how to use statistics, but they are especially for Psychology 300 Statistics and Psy 320 Experimental Psychology taught by Dr. James A. Van Slyke\nThere are two books used to help teach this content. Both are free and online.\n\nR for Data Science\nLearning statistics with R: A tutorial for psychology students and other beginners\n\n\n\nBook Resources\nThis site is also based on examples from books I’ve used in the past, which continue to be excellent resources for learning statistics.\n\nFields, A. (2017). Discovering statistics using IBM SPSS Statistics (5th ed.). Sage.\nSalkind, N. J. & Shaw, L. A. (2019). Statistics for people who (think they) hate statistics using R. Sage.\nPagano, R. R. (2012). Understanding Statistics in the behavioral sciences (10th ed.). Cengage.\n\nThis site contains several examples of the types of things you need to know to perform calculations using R Studio. R Studio uses “R” to perform statistical calculations (and all kinds of others!) with a few extras helps for working with various data sets.\n\n\nInstallation\nThere are a few different ways to install and use R Studio and learn Statistics.\n\nR is the programming language that builds databases and analyzes statistics\nR Studio is the program we use to get the most out of R and adds several different tools that make R easier to use.\n\n\nYour computer\nInstall both on your laptop or desktop computer\nR - the most current version - https://cran.r-project.org/\n\nClick the link and then select your operating system (Mac or Windows)\nDownload the verions that corresponds to your computer and operating system\n\nR Studio - a program that supports using R - https://posit.co/download/rstudio-desktop/\n\nClick on the Link and then scroll down\nChoose R Studio Desktop, which will say FREE under it.\nThe Website should detect your operating system and then you just press the download button.\nOr you can look through the the installers by scrolling down and select your operating system.\n\n\n\nThe Cloud\nUse R Studio from the Cloud\nThe easiest way to use R Studio is to use the cloud version that you can run in your internet browser (Safari, Chrome, etc.) If you don’t want to download the program or have difficulty installing it go to this website and sign up for the “Cloud Free” plan.\nhttps://posit.cloud/plans/free\nThis is a great way to get started and gives you everything you need for the course.\nIf you have any problems installing the program, please use the cloud version, it’s easy to use and a great way to get started.\n\n\n\nHere’s some more videos that may be useful\n\nHere’s a brief video to introduce you to R Studio - https://youtu.be/haYxa3vWA28\nHere is a video on how to install the programs - https://edge.sagepub.com/salkindshaw/student-resources-0/r-tutorial-videos\nHere’s a video for Mac users - https://youtu.be/LanBozXJjOk"
},
{
"objectID": "In Progress/Point Estimates and Proportions.html",
"href": "In Progress/Point Estimates and Proportions.html",
"title": "Point Estimates",
"section": "",
"text": "Reference -\nDiez, D., OpenIntro Statistics\nChapter 5 - Foundations for Inference"
},
{
"objectID": "In Progress/Point Estimates and Proportions.html#polling-and-point-estimates",
"href": "In Progress/Point Estimates and Proportions.html#polling-and-point-estimates",
"title": "Point Estimates",
"section": "Polling and Point estimates",
"text": "Polling and Point estimates\nOne poll that is tracked often is the presidential approval ratings. Different polling agencies will conduct polls throughout the year to measure the country’s assessment of how well the current president is doing running the country. Polling companies use samples to estimate presidential approval ratings, it would be next to impossible to poll the entire country every couple of weeks to determine the current approval ratings.\nSince we need to use samples to estimate the true current approval rating in the United States, the approval rating obtained from the sample is known as the point estimate. It’s an estimate of the US population referred to as the parameter of interest. A parameter is a numeric value related to a population. Statistics attempts to quantify how well the parameter estimate estimates the parameter of interest. This is possible because there are certain properties that are usually present between a population and samples drawn from that population."
},
{
"objectID": "In Progress/Point Estimates and Proportions.html#a-stimulated-approval-rating",
"href": "In Progress/Point Estimates and Proportions.html#a-stimulated-approval-rating",
"title": "Point Estimates",
"section": "A stimulated approval rating",
"text": "A stimulated approval rating\nOne of the great things that using R allows us to do is different types of simulations. We can simulate the behaviors and properties of a fictional population. For example, let’s say that the current approval rating for a president is 60% approve and 40% disapprove. Thinking about this in terms of proportions .60 approve and .40 disapprove. So the true approval rating of the president in the population of US citizens would be 60%. Let’s simulate this:\n\n#Create a population of 340 million\n\npop_size <- 340000000\n#Create the approval/disapproval ratings\npossible_entries <- c(rep(\"approve\", 0.60 * pop_size),\n rep(\"disapprove\", 0.40 * pop_size))\n\nThen we can take a random sample from this population\n\n#Keep randomized sample the same\nset.seed(123)\n\n#Randomly Sample 1000 from the population\nsampled_entries <- sample(possible_entries, size = 1000)\n\nThe random sample can be used to estimate the true population proportion. This is referred to as p-hat and the symbol looks like this \\(\\hat{p}\\). This p-hat is our point estimate of the true approval rating in the population, even though in this case we know the real parameter.\n\nsum(sampled_entries == \"approve\")/1000\n\n[1] 0.604\n\n\nNotice that this sample gave us an approval proportion of 0.611 or \\(\\hat{p}=0.611\\) or 61%. The difference between the true parameter 60% and the p-hat is 1%. This is called the error of the estimate. As discussed throughout the course, all measurement and sampling involves a certain amount of error, but that error can be quantified to determine how confident we can be in our numbers.\n\nMultiple Samples\nLet’s look to see how our p-hat changes over several samples\nSample 2\n\n#Sample\nsampled_entries <- sample(possible_entries, size = 1000)\n\n#p-hat\nsum(sampled_entries == \"approve\")/1000\n\n[1] 0.603\n\n\nSample 3\n\n#Sample\nsampled_entries <- sample(possible_entries, size = 1000)\n\n#p-hat\nsum(sampled_entries == \"approve\")/1000\n\n[1] 0.615\n\n\nSample 4\n\n#Sample\nsampled_entries <- sample(possible_entries, size = 1000)\n\n#p-hat\nsum(sampled_entries == \"approve\")/1000\n\n[1] 0.563\n\n\nNotice how \\(\\hat{p}\\) varies slightly with every sample taken. This demonstrates the sampling error that exists between a sample and the population it is trying to estimate.\n\n\nLarger numbers of samples\nWe can actually do better than just a few random samples from a population. Let’s investigate what happens when we take a large number of samples and calculate the p-hat for each of the samples.\nTo do this, we can use the binomial distribution we used in an earlier lesson. Remember that a binomial distribution is used when there are two possibilities for a particular outcome. In this case the two outcomes are approve or disapprove. 
In this case we can use rbinom to simulate the samples and then create a histogram based on sampling from a population with 60% approval and 40% disapproval.\n\n#Set the Parameters\nn_samples <- 10000 # Number of samples\nsample_size <- 1000 # Size of each sample\ntrue_p <- 0.6 # True proportion of \"approve\"\n\n#Use a randomly generated binomial distribution to simulate taking 10,000 samples with an n = 1000 from the population of the US\nsample_counts <- rbinom(n = n_samples, size = sample_size, prob = true_p)\n\n# Calculate the p-hat from the samples\np_hats <- sample_counts / sample_size\n\n# Create a dataframe of the p-hats\np_hat_df <- data.frame(p_hat = p_hats)\n\n# Create histogram\nphat_hist <- ggplot(p_hat_df, aes(x = p_hat)) +\n geom_histogram(binwidth = 0.005, fill = \"skyblue\", color = \"black\") +\n labs(title = \"Sampling Distribution of p̂\",\n x = \"Sample Proportion (p̂)\",\n y = \"Frequency\") +\n theme_minimal()\nphat_hist\n\n\n\n\n\n\n\n\nWhat does this graph remind you of? Hopefully you’ve noticed that it looks like a normal curve. In terms of probability, remember that the most likely outcome for a given distribution is the center or middle of the distribution, in this case the mean. The center of the distribution of sampled approval ratings matches the population parameter at 60%. This is no accident. It is part of one of the basic aspects of inferential statistics, the central limit theorem. The first part of the central limit theorem looks like this:\n\nThe proportion found in the population will be equal to the mean of the proportions found in the sampling distribution, given that the samples are sufficiently large\n\nOr\n\[\n\mu_{\hat{p}} = p\n\]\nThe second part of the central limit theorem has to do with the standard deviation of the sampling distribution of p, which in this case we call the standard error. Since we sampled from the population and created a distribution of p-hat estimates of the population proportion, the standard error will indicate the spread of that p-hat point estimate. Here is the formula:\n\[\nSE_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\n\]
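\nFor the simulation above, this standard error can be computed directly in R using the true proportion of .60 and the sample size of 1,000 (a quick sketch):\n\nsqrt(0.6 * (1 - 0.6) / 1000)\n\n[1] 0.01549193\n\nSo the typical distance between a sample’s p-hat and the true proportion is only about 1.5 percentage points, which matches the narrow spread of the histogram above."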
},
{
"objectID": "Hypothesis_Testing.html",
"href": "Hypothesis_Testing.html",
"title": "Hypothesis Testing",
"section": "",
"text": "Populations and Samples\nSo far, we’ve primarily dealt with descriptive statistics, which are used to describe the basics of a dataset (think central tendency and variability). However, most of psychological science is running experiments and testing hypotheses. When psychologists run an experiment, they use a sample, which is a smaller subgroup of the population (the larger group) they are trying to explain. Most experiments are run on samples, which serve as an estimation of some property of the population that a scientist is attempting to explain.\nFor example, if a psychologist is interested in gender differences, the populations would be either all females or all males (That’s a large group of people!). So to run an experiment, psychologists instead use a sample (smaller subgroup) of males and a sample of females to compare them based on some particular variable.\nThe process of using a sample to infer characteristics about a population is called inferential statistics. Many of the statistics used to analyze experiments are inferential statistics of one kind or another.\n\n\nHypotheses\nA good experiment begins with a good hypothesis, which clearly specifies the variables in an experiment. There are two types of variables in most experiments:\n\nIndependent Variable (IV)\nDependent Variable (DV)\n\n\nThe independent variable is the variable that is manipulated in some way and is thought to be the cause in the experiment, while the dependent variable is measured to assess whether the independent variable produces any changes in it, so it is thought to be the effect in an experiment.\n\nThe hypothesis specifies the relationship between the IV and the DV. In statistics, there are two types of hypotheses, the null hypothesis and the alternative hypothesis. The null hypothesis negates or contradicts the expected relationship between the IV and DV if the IV has a real effect. The alternative hypothesis specifies the relationship between the IV and DV as the experimenter expects it to be, the IV causing a change in the DV.\n\n\nNull and Alternative Hypotheses\nIn experimental methodology, the null hypothesis is directly tested and if it can be shown to be unsupported (based on the statistical analysis), it then lends support to its counterpart, the alternative hypothesis.\nFor example, suppose you work for a gasoline company and you’ve developed a new additive that you believe increases gas mileage. Your supervisor would like to add it to their gasoline, but feels that it’s important to run a test first to see if the additive really increases gas mileage. So you collect a sample of 75 cars that are all using the gasoline additive. You discover that the mean gas mileage for your sample is 26.5 miles per gallon.\nHere are the two hypotheses for the gasoline additive example.\n\nNull hypothesis - “The gasoline additive will not increase the mileage of the gasoline”\n\n\nAlternative Hypothesis - “The gasoline additive will increase the mileage of the gasoline”\n\nNotice that the hypotheses are basically exactly the same, except that the null hypothesis negates the relationship between the IV and DV. A good hypothesis will include both the IV and DV and something to specify the expected outcome of the experiment. In this case the word “increases” shows the relationship between the two variables.\n\n\nDirectional or Nondirectional?\nThis is an example of a directional hypothesis because it’s specifying the direction of the expected effect of the IV. 
\n\n\nDirectional or Nondirectional?\nThis is an example of a directional hypothesis because it’s specifying the direction of the expected effect of the IV. A nondirectional hypothesis does not specify the direction; it just states that the IV will have an effect on the DV. Here’s how the gasoline mileage hypotheses would look as nondirectional hypotheses.\n\nNull hypothesis - “The gasoline additive will not affect the mileage of the gasoline”\n\n\nAlternative Hypothesis - “The gasoline additive will affect the mileage of the gasoline”\n\nNotice that the only real difference is that the word “increase” was swapped with the word “affect”. Whether the hypothesis is directional or nondirectional is usually determined by the experimenter or the context of the experiment.\n\n\n\nNondirectional vs. Directional tests\n\n\nImage from https://towardsdatascience.com/hypothesis-testing-z-scores-337fb06e26ab\nDirectional tests are one-tailed (evaluate only one tail of the distribution) because they are investigating a specific direction of effect for the IV (e.g. increasing), while nondirectional tests are two-tailed (evaluate both tails of the distribution) because they are not specifying the direction of effect for the IV.\n\n\nExperimental and Control Groups\nBasic scientific methodology requires an experimental and a control group. The experimental group receives the IV while the control group does not. If a difference between the groups can be detected, then that difference may be attributable to the IV. If there is no difference between the groups, the IV does not have an effect.\nAnother way to state this is that the null hypothesis assumes that there is no difference between the control and experimental groups (the groups are equal), while the alternative hypothesis assumes that there will be a difference between the groups.\n\nNull hypothesis - experimental group = control group\n\n\nAlternative Hypothesis - experimental group \(\neq\) control group\n\nThe null hypothesis is tested to answer the question, “Are these results due to chance?” Different statistical tests use different probability distributions to analyze the results of an experiment. The probability distributions allow us to see the probability associated with the results we obtained in our experiment.\n\n\nAlpha\nThe probability that is shown through the probability distribution is compared to alpha. Alpha is an accepted scientific standard for determining the likelihood of rejecting the null hypothesis, and by convention it is set at .05 or 5%. So the standard that is accepted by the scientific community is that there is only a 5% chance or less that the findings in the experiment conducted are the result of chance alone. \[\n\alpha = .05\n\] The probability associated with the outcome of our experiment based on a probability distribution is known as the p value (p standing for probability).
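\nAs a minimal sketch of the decision rule in R (the p value here is a made-up number standing in for the output of some statistical test):\n\nalpha <- .05\np_value <- .003 # hypothetical p value from a statistical test\np_value < alpha # TRUE means the result is significant and we reject the null hypothesis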
\n\n\nBasic Statistical Formula\nA really easy way to understand how most statistical outputs are generated is to understand this simple formula. \[\n\frac{signal}{noise}\n\] Signal refers to what is known as systematic variation or variation that is introduced by an experiment. In the gasoline example, the additive that is supposed to increase gas mileage is systematic variation. The gasoline is being manipulated to try to increase gas mileage through the use of the additive. The more increase in gas mileage associated with the additive, the stronger the signal and the more likely the independent variable (in this case the additive) has a real effect.\nNoise is like background noise: variation that exists, but wasn’t introduced through the experimental manipulation. This is often referred to as unsystematic variation, and it exists for a variety of reasons. Going back to our gasoline mileage example, unsystematic variation could be differences in the cars used, differences in the road conditions or differences in the weather. Experimenters try to minimize this type of variation as much as possible, but there is always a certain amount of unsystematic variation.\n\n\nMeasurement Error\nOne potential form of unsystematic variation is measurement error. For example, imagine that you weigh yourself every day for a whole week. Most likely there will be fluctuations in your weight during the week. It will be slightly higher one day and slightly lower another day with an average or mean that stays more consistent. Measurement error exists in any variable that we are attempting to measure. It is unavoidable to a certain extent because, whatever we are attempting to measure, there will be fluctuations in the precision of measurement. Going back to our original formula, measurement error is our estimation of noise, and although experimenters always try to minimize error, a certain amount will always be present. So another way to understand our primary formula is \[\nstatistics = \frac{experimental\;manipulation}{measurement\;error}\n\]\n\n\nConfidence Intervals\nOne statistic to help analyze measurement error is called a confidence interval. A 95% confidence interval is a range of numbers, calculated from a sample, within which we are 95% confident that the population mean is contained. According to Morling (2021) a confidence interval is composed of 3 components.\n\nVariability component - this is most often the standard deviation sd (remember that the sd is one of our basic measures of dispersion or variability)\nSample size component - Usually this will be a calculation involving the number in the sample, most often symbolized by n.\nConstant associated with 95% - Here we’ll use the z score associated with the middle 95% of the normal distribution\n\nHere is the basic formula for a confidence interval using z scores.\nlower boundary of CI - \(\bar X - 1.96 \times \frac{s}{\sqrt{n}}\)\nupper boundary of CI - \(\bar X + 1.96 \times \frac{s}{\sqrt{n}}\)\n\(\frac{s}{\sqrt{n}}\) is known as the standard error of the mean and is the standard deviation divided by the square root of the number in the sample. 1.96 is the z score that bounds the middle 95% of the normal distribution, thus, the 95% confidence interval.\n\nConfidence Interval Example\nLet’s imagine we have a sample of persons aged 40 to 50 who tried to run on a treadmill at its fastest speed. This is a sample of the times in seconds they were able to maintain a sprint at the treadmill’s fastest speed.\n\nsprint <- c(18, 16, 18, 24, 23, 22, 22, 23, 26, 29, 32, 34, 34, 36, 36, 43, 42, 49, 46, 46, 57)\n\nThen we can find the mean and standard deviation for this group of scores\n\nsprint_mean <- mean(sprint)\nsprint_sd <- sd(sprint)\n\nThen we can use these scores to find the 95% confidence interval\n\nlower <- sprint_mean - 1.96*sprint_sd/sqrt(21)\nupper <- sprint_mean + 1.96*sprint_sd/sqrt(21)\nlower\n\n[1] 27.23457\n\nupper\n\n[1] 37.14638\n\n\nBased on this sample, we are 95% confident that the mean time for the population of 40 to 50-year-olds to sprint on the treadmill at its top speed is between 27.23 and 37.15 seconds.",
"crumbs": [
"Intro to Statistics",
"Hypothesis Testing"
]
},
{
"objectID": "More with Databases.html",
"href": "More with Databases.html",
"title": "More with Databases",
"section": "",
"text": "On the Moodle website download the pdf file Anxiety and Learning example\nStep 1 - Identify variables\n\nVariable 1 = Anxiety\nVariable 2 = Learn new material\nVariable 3 = Difficulty of new material\n\nStep 2 - What kind of variables?\n\nAnxiety = categorical -> ordinal\n\nAnxiety is a categorical variable because it’s different groups of persons based on the presence of anxiety, but there is an order based on the amount of anxiety (Low, Medium, High), thus it is ordinal.\n\nDifficulty of new material = categorical -> ordinal\n\nJust like anxiety, difficulty of new material is a categorical variable because it’s different groups of persons based on the difficulty of the material, but there is an order to the level of difficulty (Low, Medium, High), this it is ordinal\n\nAbility to learn new material = numeric -> ratio\n\nAbility to learn new material is numeric because it is the variable being measured (based on how much material is learned out of a possible 20). Since it has an absolute zero point (there are no negative numbers) and it’s possible to talk about a score being twice as good as another it is a ratio scale.\n\n\nStep 3 - How do I get it into R Studio?\nThe first example we’ll look at as a Tibble. A tibble is helpful because you can set it up like a normal spreadsheet. Use a ~ to tell R which words are the names for your variables. For each variable make sure you don’t have any spaces. All variables and objects should not contain a space between words. Choose a single word for your variables or use an _. Remember that each row should be an observation and each column should be a variable. Here’s how you would set it up (It’s incomplete, but you get the general idea).\n\nLearning_TR <- tribble(~Difficulty, ~Anxiety, ~Score, \n \"Low\", \"Low\", 18,\n \"Low\", \"Low\", 17,\n \"Low\", \"Low\", 20,\n \"Low\", \"Low\", 16,\n \"Low\", \"Low\", 17,\n \"Low\", \"Medium\", 18,\n \"Low\", \"Medium\", 18,\n \"Low\", \"Medium\", 19,\n \"Low\", \"Medium\", 15,\n \"Low\", \"Medium\", 17, \n \"Low\", \"High\", 18,\n \"Low\", \"High\", 17,\n \"Low\", \"High\", 16,\n \"Low\", \"High\", 18,\n \"Low\", \"High\", 19,\n \"Medium\", \"Low\", 18)\n\nAnother way to create this kind of dataset is to use the repeat function rep. 
\n\nLearning <- data.frame(\n  Difficulty = \n    c(rep(c(\"Low\", \"Medium\", \"High\"), each = 15)), \n  Anxiety = \n    c(rep(c(\"Low\", \"Medium\", \"High\"), times = 5)), \n  Score = \n    c(18, 18, 18, 17, 18, 17, 20, 19, 16,\n      16, 15, 18, 17, 17, 19, 18, 18, 14, \n      14, 17, 15, 17, 18, 17, 16, 15, 12,\n      14, 14, 16, 11, 15, 9, 6, 12, 8, 10,\n      13, 7, 10, 11, 8, 8, 12, 5))\nLearning\n\n   Difficulty Anxiety Score\n1         Low     Low    18\n2         Low  Medium    18\n3         Low    High    18\n4         Low     Low    17\n5         Low  Medium    18\n6         Low    High    17\n7         Low     Low    20\n8         Low  Medium    19\n9         Low    High    16\n10        Low     Low    16\n11        Low  Medium    15\n12        Low    High    18\n13        Low     Low    17\n14        Low  Medium    17\n15        Low    High    19\n16     Medium     Low    18\n17     Medium  Medium    18\n18     Medium    High    14\n19     Medium     Low    14\n20     Medium  Medium    17\n21     Medium    High    15\n22     Medium     Low    17\n23     Medium  Medium    18\n24     Medium    High    17\n25     Medium     Low    16\n26     Medium  Medium    15\n27     Medium    High    12\n28     Medium     Low    14\n29     Medium  Medium    14\n30     Medium    High    16\n31       High     Low    11\n32       High  Medium    15\n33       High    High     9\n34       High     Low     6\n35       High  Medium    12\n36       High    High     8\n37       High     Low    10\n38       High  Medium    13\n39       High    High     7\n40       High     Low    10\n41       High  Medium    11\n42       High    High     8\n43       High     Low     8\n44       High  Medium    12\n45       High    High     5\n\n\nLet’s check out our variables using the str command\n\nstr(Learning)\n\n'data.frame': 45 obs. of 3 variables:\n $ Difficulty: chr \"Low\" \"Low\" \"Low\" \"Low\" ...\n $ Anxiety : chr \"Low\" \"Medium\" \"High\" \"Low\" ...\n $ Score : num 18 18 18 17 18 17 20 19 16 16 ...\n\n\nNotice that our two ordinal variables are listed as chr, which stands for character, and our ratio variable is listed as num, which stands for numeric or number.\nThe only thing missing is making our two ordinal variables factors. A factor allows us to treat a character variable as having a specific order, so it can actually be ordinal and not just words or strings. There are two ways to do this.\nFirst define the levels of your factor variable.\n\nLevels <- c(\"Low\", \"Medium\", \"High\")\n\nThen change the variable itself using the object Levels to define the order of your factor variable.\n\nLearning$Difficulty <- factor(Learning$Difficulty, \n                              levels = Levels)\n\nYou can also define the levels in the code to create a factor in one step, as in this example using the Anxiety variable.\n\nLearning$Anxiety <- factor(Learning$Anxiety, \n                           levels = c(\"Low\", \"Medium\", \"High\"))\n\nNow check out the structure with the adjustments you’ve just made to your dataset. It should now identify Anxiety and Difficulty as factors.\n\nstr(Learning)\n\n'data.frame': 45 obs. of 3 variables:\n $ Difficulty: Factor w/ 3 levels \"Low\",\"Medium\",..: 1 1 1 1 1 1 1 1 1 1 ...\n $ Anxiety : Factor w/ 3 levels \"Low\",\"Medium\",..: 1 2 3 1 2 3 1 2 3 1 ...\n $ Score : num 18 18 18 17 18 17 20 19 16 16 ...",
"crumbs": [
"R Basics",
"More with Databases"
]
},
{
"objectID": "More with Databases.html#more-on-creating-databases",
"href": "More with Databases.html#more-on-creating-databases",
"title": "More with Databases",
"section": "",
"text": "On the Moodle website download the pdf file Anxiety and Learning example\nStep 1 - Identify variables\n\nVariable 1 = Anxiety\nVariable 2 = Learn new material\nVariable 3 = Difficulty of new material\n\nStep 2 - What kind of variables?\n\nAnxiety = categorical -> ordinal\n\nAnxiety is a categorical variable because it’s different groups of persons based on the presence of anxiety, but there is an order based on the amount of anxiety (Low, Medium, High), thus it is ordinal.\n\nDifficulty of new material = categorical -> ordinal\n\nJust like anxiety, difficulty of new material is a categorical variable because it’s different groups of persons based on the difficulty of the material, but there is an order to the level of difficulty (Low, Medium, High), this it is ordinal\n\nAbility to learn new material = numeric -> ratio\n\nAbility to learn new material is numeric because it is the variable being measured (based on how much material is learned out of a possible 20). Since it has an absolute zero point (there are no negative numbers) and it’s possible to talk about a score being twice as good as another it is a ratio scale.\n\n\nStep 3 - How do I get it into R Studio?\nThe first example we’ll look at as a Tibble. A tibble is helpful because you can set it up like a normal spreadsheet. Use a ~ to tell R which words are the names for your variables. For each variable make sure you don’t have any spaces. All variables and objects should not contain a space between words. Choose a single word for your variables or use an _. Remember that each row should be an observation and each column should be a variable. Here’s how you would set it up (It’s incomplete, but you get the general idea).\n\nLearning_TR <- tribble(~Difficulty, ~Anxiety, ~Score, \n \"Low\", \"Low\", 18,\n \"Low\", \"Low\", 17,\n \"Low\", \"Low\", 20,\n \"Low\", \"Low\", 16,\n \"Low\", \"Low\", 17,\n \"Low\", \"Medium\", 18,\n \"Low\", \"Medium\", 18,\n \"Low\", \"Medium\", 19,\n \"Low\", \"Medium\", 15,\n \"Low\", \"Medium\", 17, \n \"Low\", \"High\", 18,\n \"Low\", \"High\", 17,\n \"Low\", \"High\", 16,\n \"Low\", \"High\", 18,\n \"Low\", \"High\", 19,\n \"Medium\", \"Low\", 18)\n\nAnother way to create this kind of dataset is to use the repeat function rep. 
\n\nLearning <- data.frame(\n  Difficulty = \n    c(rep(c(\"Low\", \"Medium\", \"High\"), each = 15)), \n  Anxiety = \n    c(rep(c(\"Low\", \"Medium\", \"High\"), times = 5)), \n  Score = \n    c(18, 18, 18, 17, 18, 17, 20, 19, 16,\n      16, 15, 18, 17, 17, 19, 18, 18, 14, \n      14, 17, 15, 17, 18, 17, 16, 15, 12,\n      14, 14, 16, 11, 15, 9, 6, 12, 8, 10,\n      13, 7, 10, 11, 8, 8, 12, 5))\nLearning\n\n   Difficulty Anxiety Score\n1         Low     Low    18\n2         Low  Medium    18\n3         Low    High    18\n4         Low     Low    17\n5         Low  Medium    18\n6         Low    High    17\n7         Low     Low    20\n8         Low  Medium    19\n9         Low    High    16\n10        Low     Low    16\n11        Low  Medium    15\n12        Low    High    18\n13        Low     Low    17\n14        Low  Medium    17\n15        Low    High    19\n16     Medium     Low    18\n17     Medium  Medium    18\n18     Medium    High    14\n19     Medium     Low    14\n20     Medium  Medium    17\n21     Medium    High    15\n22     Medium     Low    17\n23     Medium  Medium    18\n24     Medium    High    17\n25     Medium     Low    16\n26     Medium  Medium    15\n27     Medium    High    12\n28     Medium     Low    14\n29     Medium  Medium    14\n30     Medium    High    16\n31       High     Low    11\n32       High  Medium    15\n33       High    High     9\n34       High     Low     6\n35       High  Medium    12\n36       High    High     8\n37       High     Low    10\n38       High  Medium    13\n39       High    High     7\n40       High     Low    10\n41       High  Medium    11\n42       High    High     8\n43       High     Low     8\n44       High  Medium    12\n45       High    High     5\n\n\nLet’s check out our variables using the str command\n\nstr(Learning)\n\n'data.frame': 45 obs. of 3 variables:\n $ Difficulty: chr \"Low\" \"Low\" \"Low\" \"Low\" ...\n $ Anxiety : chr \"Low\" \"Medium\" \"High\" \"Low\" ...\n $ Score : num 18 18 18 17 18 17 20 19 16 16 ...\n\n\nNotice that our two ordinal variables are listed as chr, which stands for character, and our ratio variable is listed as num, which stands for numeric or number.\nThe only thing missing is making our two ordinal variables factors. A factor allows us to treat a character variable as having a specific order, so it can actually be ordinal and not just words or strings. There are two ways to do this.\nFirst define the levels of your factor variable.\n\nLevels <- c(\"Low\", \"Medium\", \"High\")\n\nThen change the variable itself using the object Levels to define the order of your factor variable.\n\nLearning$Difficulty <- factor(Learning$Difficulty, \n                              levels = Levels)\n\nYou can also define the levels in the code to create a factor in one step, as in this example using the Anxiety variable.\n\nLearning$Anxiety <- factor(Learning$Anxiety, \n                           levels = c(\"Low\", \"Medium\", \"High\"))\n\nNow check out the structure with the adjustments you’ve just made to your dataset. It should now identify Anxiety and Difficulty as factors.\n\nstr(Learning)\n\n'data.frame': 45 obs. of 3 variables:\n $ Difficulty: Factor w/ 3 levels \"Low\",\"Medium\",..: 1 1 1 1 1 1 1 1 1 1 ...\n $ Anxiety : Factor w/ 3 levels \"Low\",\"Medium\",..: 1 2 3 1 2 3 1 2 3 1 ...\n $ Score : num 18 18 18 17 18 17 20 19 16 16 ...",
"crumbs": [
"R Basics",
"More with Databases"
]
},
{
"objectID": "More with Databases.html#bar-graph",
"href": "More with Databases.html#bar-graph",
"title": "More with Databases",
"section": "Bar graph",
"text": "Bar graph\nTo create a bar graph we need to do a little more work since this will be based on one of our measures of central tendency, the mean, because count or number won’t tell us much about the variables.\nSo we need to use the pipe again |> and something new, the summarize function summarise. Summarise allows us to calculate various measurements of central tendency, variation, and others. Then we can use those numbers in a graph.\n\nLearning_Mean <- Learning |>\n group_by(Anxiety) |>\n summarise(n = n(),\n mean = mean(Score))\nLearning_Mean\n\n# A tibble: 3 × 3\n Anxiety n mean\n <fct> <int> <dbl>\n1 Low 15 14.1\n2 Medium 15 15.5\n3 High 15 13.3\n\n\nNext Let’s create the bar graph using the Anxiety variable and mean score from the new dataset.\n\nggplot(data = Learning_Mean) +\n geom_bar(mapping = aes(x = Anxiety, \n y = mean), stat = \"identity\" )",
"crumbs": [
"R Basics",
"More with Databases"
]
},
{
"objectID": "One-Way ANOVA.html",
"href": "One-Way ANOVA.html",
"title": "One-Way ANOVA",
"section": "",
"text": "ANOVA stands for analysis of variance, which is based on the F statistic, which is named after the statistician who invented it, R. A. Fisher. The statistic is fundamentally the ratio of two independent variance estimates of the same population variance (Pagano, 2013). Thus the basic formula is:\n\\[\nF = \\frac{variance\\;estimate\\; 1\\; of\\; \\sigma^{2}}{variance\\; estimate\\; 2\\; of\\; \\sigma^{2}}\n\\]",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#one-way-anova",
"href": "One-Way ANOVA.html#one-way-anova",
"title": "One-Way ANOVA",
"section": "",
"text": "ANOVA stands for analysis of variance, which is based on the F statistic, which is named after the statistician who invented it, R. A. Fisher. The statistic is fundamentally the ratio of two independent variance estimates of the same population variance (Pagano, 2013). Thus the basic formula is:\n\\[\nF = \\frac{variance\\;estimate\\; 1\\; of\\; \\sigma^{2}}{variance\\; estimate\\; 2\\; of\\; \\sigma^{2}}\n\\]",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#how-many-groups",
"href": "One-Way ANOVA.html#how-many-groups",
"title": "One-Way ANOVA",
"section": "How Many Groups?",
"text": "How Many Groups?\nUltimately the F test or ANOVA is used to analyze differences in the means of more than two groups. Remember we use the t test to analyze differences in the means of two groups, but when we have more than two groups we need a different test. In this case, we use the ANOVA.",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#null-and-alternative-hypotheses",
"href": "One-Way ANOVA.html#null-and-alternative-hypotheses",
"title": "One-Way ANOVA",
"section": "Null and Alternative Hypotheses",
"text": "Null and Alternative Hypotheses\nHere are the basic assumptions of the two hypotheses used in an ANOVA.\nNull \\[\nH_0 : \\mu_1=\\mu_2=\\mu_3\n\\] Alternative \\[\nH_1 : \\mu_1\\neq\\mu_2\\neq\\mu_3\n\\]",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#section",
"href": "One-Way ANOVA.html#section",
"title": "One-Way ANOVA",
"section": "",
"text": "So the Null hypothesis assumes there will be no differences between the means or that the groups come from the same population, while the alternative assumes there is a difference between the means or the groups come from different populations. One thing to remember, the ANOVA test can only identify if some of the group means are different. It cannot identify which means are different. So the means for group 1 and 2 may be statistically the same, while the means for group 2 and 3 may be different, but the ANOVA test will still be significant.",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#signal-vs.-noise",
"href": "One-Way ANOVA.html#signal-vs.-noise",
"title": "One-Way ANOVA",
"section": "Signal vs. Noise",
"text": "Signal vs. Noise\nRemember our basic formula for statistics.\n\\[\nStatistics = \\frac{Signal}{Noise}\n\\]\nSignal refers to systematic variation or variation based on the causal work of the independent variable. Whereas noise refers to unsystematic variation or variation which is not the result of the independent variable. Unsystematic variation is the result of measurement error and we’ve seen that measurement error occurs in all types of measurement and statistics.",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#section-1",
"href": "One-Way ANOVA.html#section-1",
"title": "One-Way ANOVA",
"section": "",
"text": "So the changes observed in our dataset that are the result of systematic variation have to be larger then the differences we observe as the result of unsystematic variation for our statistical finding to be considered significant.",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#section-2",
"href": "One-Way ANOVA.html#section-2",
"title": "One-Way ANOVA",
"section": "",
"text": "Remember that the null hypothesis assumes no differences between the groups, whereas the alternative hypothesis assumes the groups will be different because of the manipulation of the independent variable.",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#levels-of-the-independent-variable",
"href": "One-Way ANOVA.html#levels-of-the-independent-variable",
"title": "One-Way ANOVA",
"section": "Levels of the Independent variable",
"text": "Levels of the Independent variable\nANOVA enables the comparison of more than two groups, which is often structured to analyze different levels of an independent variable. By levels we are referring to different amounts or quantities of the independent variable. For example, one group may be the control group, but then subsequent groups may have different quantities of the independent variable.\nHere is the dataset\n\nch15ds1\n\n Group Language.Score\n1 5 Hours 87\n2 5 Hours 86\n3 5 Hours 76\n4 5 Hours 56\n5 5 Hours 78\n6 5 Hours 98\n7 5 Hours 77\n8 5 Hours 66\n9 5 Hours 75\n10 5 Hours 67\n11 10 Hours 87\n12 10 Hours 85\n13 10 Hours 99\n14 10 Hours 85\n15 10 Hours 79\n16 10 Hours 81\n17 10 Hours 82\n18 10 Hours 78\n19 10 Hours 85\n20 10 Hours 91\n21 20 Hours 89\n22 20 Hours 91\n23 20 Hours 96\n24 20 Hours 87\n25 20 Hours 89\n26 20 Hours 90\n27 20 Hours 89\n28 20 Hours 96\n29 20 Hours 96\n30 20 Hours 93\n\n\nFirst let’s look at the independent variable, which is attendance at preschool.\nIV = Preschool\nHere it is in R studio labeled as “Group”\n\nch15ds1$Group\n\n [1] \"5 Hours\" \"5 Hours\" \"5 Hours\" \"5 Hours\" \"5 Hours\" \"5 Hours\" \n [7] \"5 Hours\" \"5 Hours\" \"5 Hours\" \"5 Hours\" \"10 Hours\" \"10 Hours\"\n[13] \"10 Hours\" \"10 Hours\" \"10 Hours\" \"10 Hours\" \"10 Hours\" \"10 Hours\"\n[19] \"10 Hours\" \"10 Hours\" \"20 Hours\" \"20 Hours\" \"20 Hours\" \"20 Hours\"\n[25] \"20 Hours\" \"20 Hours\" \"20 Hours\" \"20 Hours\" \"20 Hours\" \"20 Hours\"\n\n\nNotice that there is 3 levels to this independent variable based on the amount of time spent in preschool per week: 5 hours, 10 hours, and 20 hours. Notice that there is no control group. Everyone is in preschool. So the hypothesis is really whether more time in preschool increases language development.\n\nCreating Factors\nAn important first step for data analysis is to make sure the variables are in the correct format. We can use the str command to figure out the types of variables in the dataset.\n\nstr(ch15ds1)\n\n'data.frame': 30 obs. of 2 variables:\n $ Group : chr \"5 Hours\" \"5 Hours\" \"5 Hours\" \"5 Hours\" ...\n $ Language.Score: int 87 86 76 56 78 98 77 66 75 67 ...\n\n\nLanguage.Score is an integer int or whole number, which makes sense for a measurement scale that is looking at language development.\nGroup is a character chr, which is fine for defining groups, but we would prefer it to be a factor so it could more easily distinguish the levels of the independent variable.\nSo let’s change the variable type for group to a factor.\n\nch15ds1$Group <- factor(ch15ds1$Group, \n levels = c(\"5 Hours\", \"10 Hours\", \"20 Hours\"))\n\nAnd then check it\n\nstr(ch15ds1$Group)\n\n Factor w/ 3 levels \"5 Hours\",\"10 Hours\",..: 1 1 1 1 1 1 1 1 1 1 ...\n\n\n\n\nAre the means different?\nThe first question for the dataset is whether the means are different. More specifically, the mean level of language development should increase with an increase in hours spent at preschool.\n\nMean difference between the groups\n\nch15ds1 |> \n group_by(Group) |> \n summarise(n = n(),\n mean = mean(Language.Score))\n\n# A tibble: 3 × 3\n Group n mean\n <fct> <int> <dbl>\n1 5 Hours 10 76.6\n2 10 Hours 10 85.2\n3 20 Hours 10 91.6\n\n\nSo the means are different and the language score increases with hours spent in preschool. Now it needs to be determined if the differences between the means are statistically significant. This is where we need the ANOVA or F Test.",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#review",
"href": "One-Way ANOVA.html#review",
"title": "One-Way ANOVA",
"section": "Review",
"text": "Review\nRemember that the F test or ANOVA is based on comparing variation between the groups (signal) to variation within the groups (noise). So the equation is:\n\\[\nF = \\frac{MS_{between}}{MS_{within}}\n\\]\nHowever, there are some different steps we need to take to get to the two types of Mean Squares \\(MS\\) for this formula.\nFor this usage of the ANOVA, the total variability \\(SS_T\\) is partitioned (divided or separated) into 2 groups or sources. The variability between the groups \\(SS_{between}\\) and the variability within the groups \\(SS_{within}\\). Remember that variability between groups gives us evidence that the groups are different and if the variability is greater than the variability within the groups than our F value will be significant.\nHowever, the two sum of squares values (\\(SS_{within}\\) & \\(SS_{between}\\)) need to be averaged based on the number of scores from which they were calculated in order to eliminate bias. In this case we’ll use the degrees of freedom to accomplish this task \\(df\\). Here is the formulas:\n\\[df_{within} = N - 1\\]\n\\[df_{between}=k-1\\]\n\\(N\\) stands for the number of observations or participants we have in all the groups because it deals with individual variation. \\(k\\) stands for the number of groups we have because it deals with group variation. Here is an overview of the entire formula.\n\\[\n\\frac{SS_{between}/df_{between}}{SS_{within}/df_{within}} = \\frac{MS_{between}}{MS_{within}}= F\n\\]\n\nF Distribution\nHere’s a look at the F distribution. Notice both the similaries and differences to the binomial and t and z distributions.\n\ndist_f(deg.f1 = 4, deg.f2 = 20, p = .05)\n\n\n\n\n\n\n\n\nThe F distribution is also a family of curves based on the degrees of freedom. Notice that the distribution has a positive skew (more scores at the lower end of the distribution) and it also only had one tail rather than two tails. This is another indication that the F test can’t determine the direction of difference between the groups, only if there is a difference between the groups.",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "One-Way ANOVA.html#using-r-studio-to-calculate-the-anova",
"href": "One-Way ANOVA.html#using-r-studio-to-calculate-the-anova",
"title": "One-Way ANOVA",
"section": "Using R Studio to calculate the ANOVA",
"text": "Using R Studio to calculate the ANOVA\nIn another section, ANOVA was used to test a linear regression model. For that model a comparison was made between the improvement of the linear model in comparison to the grand mean \\(SS_{M}\\) and the measurement error based on the residuals \\(SS_{R}\\). There is a strong statistical relationship between regression and ANOVA and in fact ANOVA for groups can be understood as a part of the general linear model (GLM). The differences between the groups \\(SS_{between}\\) can be understood as a line composed of the group means and compared to the grand mean \\(SS_M\\). The greater the difference between this line and the grand mean, the greater the difference between the group means. \\(SS_{within}\\) can be understood as the residuals for each observation and the grand mean \\(SS_R\\). Everything else remains the same (\\(df\\), \\(MS\\), and \\(F\\)) when running the ANOVA in R Studio.\n\nRunning the Code\nHere’s the basic set up for running ANOVA\nObject <- aov(Dependent Variable ~ Independent Variable, data = your dataset)\nSo in this case\n\nANOVA_1 <- aov(Language.Score~Group, data = ch15ds1)\n\nFor ANOVA, the results are saved in an object, so we need to use summary to get the results.\n\nsummary(ANOVA_1)\n\n Df Sum Sq Mean Sq F value Pr(>F) \nGroup 2 1133 566.5 8.799 0.00114 **\nResiduals 27 1738 64.4 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\n“Group” is the independent variable so Sum Sq stands for the \\(SS_{between}\\) or \\(SS_M\\), whereas “Residuals stands for the \\(SS_{within}\\) or \\(SS_R\\).\nThe F value is 8.799 and the significance or p value is 0.00114, so the results are significant.\n\n\nBar Graph of the Data\nStep 1 - create table of Descriptive Statistics\n\nlibrary(dplyr)\nPreschool_Descriptives <- ch15ds1 %>%\n group_by(Group) %>%\n summarize(n = n(),\n mean = mean(Language.Score),\n sd = sd(Language.Score),\n se = sd / sqrt(n),\n ci = qt(0.975, df = n - 1) * sd / sqrt(n))\n\nCheck it out\n\nPreschool_Descriptives\n\n# A tibble: 3 × 6\n Group n mean sd se ci\n <fct> <int> <dbl> <dbl> <dbl> <dbl>\n1 5 Hours 10 76.6 12.0 3.78 8.56\n2 10 Hours 10 85.2 6.20 1.96 4.43\n3 20 Hours 10 91.6 3.41 1.08 2.44\n\n\nNow graph it based on the descriptive statistics\n\nggplot(Preschool_Descriptives, \n aes(x = Group, \n y = mean)) +\n geom_bar(stat = \"identity\") +\n geom_errorbar(aes(ymin=mean-ci,\n ymax=mean+ci))\n\n\n\n\n\n\n\n\nMake the graph look better.\n\nggplot(Preschool_Descriptives, \n aes(x = Group,\n y = mean)) +\n theme_minimal() +\n geom_bar(stat = \"identity\", fill=\"steelblue\") +\n geom_errorbar(aes(ymin=mean-ci,\n ymax=mean+ci), width=.3, size=1) +\n labs(title = \"Does Preschool Effect Language Development?\", \n y=\"Mean Score on Language Test\", x=\"Number of Hours Spent in Preschool\") \n\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n\n\n\n\n\n\n\n\n\n\n\nEffect Size\nFigure out the effect size - Eta squared The formula is SSbetween/SSTotal or SSbetween/SSbetween+SSResidual\n\n1133/(1133+1738)\n\n[1] 0.394636\n\n\n\n\nConclusion\nWrite out conclusion\n\nNumber of hours in Preschool had a significant effect on language development, F(2, 27) = 8.799, p = 0.00114, 𝜂2 = 0.39.\n\nWhere is the difference? 
\n\n\nConclusion\nWrite out the conclusion\n\nNumber of hours in Preschool had a significant effect on language development, F(2, 27) = 8.799, p = 0.00114, 𝜂2 = 0.39.\n\nWhere is the difference? We need to use post hoc tests.\n\n\nTukeyHSD\nTukeyHSD will tell us where the differences are between the individual groups.\nRun TukeyHSD on the saved ANOVA results\n\nTukeyHSD(ANOVA_1)\n\n  Tukey multiple comparisons of means\n    95% family-wise confidence level\n\nFit: aov(formula = Language.Score ~ Group, data = ch15ds1)\n\n$Group\n                 diff        lwr      upr     p adj\n10 Hours-5 Hours  8.6 -0.2972884 17.49729 0.0596448\n20 Hours-5 Hours 15.0  6.1027116 23.89729 0.0007780\n20 Hours-10 Hours 6.4 -2.4972884 15.29729 0.1941234\n\n\nFinally, write out the whole conclusion.\n\nTukeyHSD post hoc tests revealed that 20 hours a week of preschool (M=91.6, SE=1.08) resulted in significantly higher levels of language development in comparison to 5 hours (M=76.6, SE=3.78). This difference, 15.0, 95% CI[6.10, 23.90], was significant with an adjusted p = .0008.",
"crumbs": [
"Statistical Tests",
"One-Way ANOVA"
]
},
{
"objectID": "Normal Curve.html",
"href": "Normal Curve.html",
"title": "Normal Curve",
"section": "",
"text": "Properties of the Normal Curve\nThe normal curve is a special kind of histogram that has a particular shape and certain properties. Remember that a histogram usually represents a continuous variable, so it is most often numeric. There’s other important attributes of the normal curve that are essential to hypothesis testing and statistical analysis.\nHere’s an example normal curve representing a hypothetical reading proficiency test administered in a local school district with mean reading proficiency score of 75 and a standard deviation of 16.\n\n\n\n\n\n\n\n\n\n\n\nImportant Properties\nThere are several important properties of a normal curve.\n\nThe distribution of scores is symmetrical about the mean, which indicates that if you were to fold the distribution in half both sides or tails of the distribution would match.\nThe mean, median and mode are all equal\nThe distribution will have two inflection points indicating one standard deviation above and below the mean.\n\nRemember that in a normal curve most of the data is close to the mean or scores with the highest frequency are closest to the mean. As you move to the tails of the normal distribution there is a lower frequency of scores, less of the variable is contained in the tails of the distribution. Another way to think of this is that more extreme scores (scores that are either much smaller or much larger than the mean) are contained in the tails of the distribution and are less probable.\n\n\nThe proportion or percentage of a particular variable in a normal curve is dispersed differently throughout the curve\n\n\n\nDistribution of scores in a normal curve\n\n\nImage credit - https://towardsdatascience.com/understanding-the-68-95-99-7-rule-for-a-normal-distribution-b7b7cbf760c2\n\n34.13% of the data are between the mean and one standard deviation above it or below it.\n13.59% of the data is between one and two standard deviations\n2.15% of the data is between two and three standard deviations\n\nSo when a data point or score is further on either of the tails of the distribution, the data point is less probable or more extreme.\n\n\nZ Scores\nOne of the ways we are able to understand the probability or distribution of scores in a normal curve is through the use of z scores. A z score is score that designates how many standard deviations a particular score is above or below the mean.\nThe formula is the score minus the mean divided by the standard deviation. Here is the formula: \\[\nz = \\frac{X - \\bar X}{s}\n\\] z scores can be positive or negative depending on whether the score in question is above or below the mean. Thus, the higher the magnitude of the score (meaning independent of whether it’s positive or negative) the further the score is away from the mean.\n Image from https://www.scribbr.com/statistics/standard-normal-distribution/\n###PNorm Function Each z score also has a particular percentage associated with it. 
\n\n\nThe pnorm Function\nEach z score also has a particular percentage associated with it. For example, we can see what percentage is associated with a z score of 1\n\npnorm(1)\n\n[1] 0.8413447\n\n\nThis tells us that a z score of 1 is at the 84th percentile, or that 84% of the scores in a normal curve are below a z score of 1.\nRemember also that half of the scores (50%) fall below the mean and half above it.\nIf we take this score (0.8413447) minus 50% or .50, we’ll get the proportion or percentage between the mean and 1 standard deviation above it.\n\npnorm(1) - .50\n\n[1] 0.3413447\n\n\nNotice that this is the same percentage we saw earlier between the mean and one standard deviation above it, 34.13%.\n\n\nExample Problem\nImagine that a nationwide mathematics aptitude test is normally distributed with a mean of 80 and a standard deviation of 12. Here’s a look at the graph of the distribution of mathematics aptitude test scores.\n\n\n\n\n\n\n\n\n\nNow imagine our score on the aptitude test was 90 and we are curious what percentage of scores fall below it. So we are interested in the area below a score of 90, as shown in the shaded area below.\n\n\n\n\n\n\n\n\n\nTo perform this calculation, we can find the z score associated with the score of 90 and then use pnorm to find what percentage is associated with that z score.\nFind the z score\n\n(90-80)/12\n\n[1] 0.8333333\n\n\nThen find the percentage associated with that z score\n\npnorm(.83)\n\n[1] 0.7967306\n\n\nSo a score of 90 is at the 80th percentile, or approximately 80% of the scores on the aptitude test are less than a score of 90.\nOr we can go the opposite direction. Let’s say we want to know the percentage of scores above a score of 65, or the area described below.\n\n\n\n\n\n\n\n\n\nFirst, we need to find the z score associated with a score of 65.\n\n(65-80)/12\n\n[1] -1.25\n\n\nThen we take the pnorm of that number but subtract it from 1.00 because we are looking for the scores above it.\n\n1.00 - pnorm(-1.25)\n\n[1] 0.8943502\n\n\nSo approximately 89% of the scores are above a score of 65.\nSo if you want to know the percentage of scores above a particular score, you subtract the pnorm of its z score from 1.00, but if you want to know the percentage below a certain score (its percentile rank), you simply take the pnorm of the z score.\nYou can also find the percentage of scores between certain scores. Just find the z score associated with each score and subtract the percentages associated with those z scores.\nFor example, what percentage of scores are between a score of 72 and 106?\n\n\n\n\n\n\n\n\n\nFind the z score associated with each and then use pnorm to find the percentage associated with each z score. Make sure to subtract the smaller number from the larger.\n\nz1 <- (72-80)/12\nz2 <- (106-80)/12\npnorm(z2) - pnorm(z1)\n\n[1] 0.7323773",
"crumbs": [
"Graphs",
"Normal Curve"
]
},
{
"objectID": "Paired Samples t test.html",
"href": "Paired Samples t test.html",
"title": "Paired Samples t test",
"section": "",
"text": "Make sure to read the description of the independent t-test before moving to the paired t-test\n\nBooks for Reference\nHere’s two books that would be good to reference for this discussion.\n\nChapter 10 from Field, A. (2017). Discovering Statistics Using IBM SPSS Statistics (5th Edition). SAGE Publications, Ltd. (UK).\nChapter 13 & 14 from Pagano, R. (2010). Understanding Statistics in the Behavioral Sciences (9th Edition). Wadsworth.\n\n\n\nHypotheses in the t-test\nThe paired samples t-test is very similar to the independent t-test and the basic assumptions regarding the the two groups are the same.\n\nNull Hypothesis - Experimental Group = Control Group\n\n\nAlternative Hypothesis - Experimental Group \\(\\neq\\) Control Group\n\nThe difference in the paired samples t-test is that a single group goes through both the experimental and control conditions. The control condition is sometime referred to as the baseline, it’s the measurement of a dependent variable without the presence of the independent variable.\n\n\nExample: Treating Depression\nFor example, assume you are testing the effectiveness of a new drug on treating depression. You measure the current level of a sample of persons who are currently experiencing depression (baseline) then allow them to take the new drug for a month and then measure their depression a second time. So the assumptions of our two hypotheses would look something like this:\n\nNull Hypothesis - Before taking the drug = After taking the drug\n\n\nAlternative Hypothesis - Before taking the drug \\(\\neq\\) After taking the drug\n\n\n\nFormula: Paired t-test\nLet’s look at the formula and then we can break down the elements.\n\\[\nt = \\frac{\\bar D - \\mu_{D}}{\\frac{s_{D}}{\\sqrt{n}}}\n\\]\n\n\nMean of the Difference Scores\nThe first element is the mean of the difference scores \\(\\bar D\\). The t-test calculates the difference between each subject before and after the introduction of the independent variable. For our example, it would be the difference in their level of depression before and after taking the new drug. After all the differences are calculated the mean difference is found \\(\\bar D\\). Obviously, if the independent variable has an effect, this number should have some amount of magnitude.\n\n\nMean of the Null Population\nThe second element to look at is the mean of the population for difference scores assuming the null hypothesis is true \\(\\mu_{D}\\). Remember that the Null hypothesis assumes no difference between the sample means, so it assumes that this sample of differences scores comes from a population with zero difference or has a mean value of 0.\n\\[\n\\mu_{D}=0\n\\]\nSince this value is zero, it basically falls out of the equation, but it does remind us of the basis for this kind of t-test. The comparison in the formula is between a sample and a population. However, in this cases it’s a sample of difference scores compared to a population of difference scores with an assumed mean of zero.\n\n\nHypothesis Assumptions\n\nNull Hypothesis - The sample can reasonably be said to have come from a popultion of differnce scores with a mean of zero. 
\n\n\nMean of the Null Population\nThe second element to look at is the mean of the population of difference scores assuming the null hypothesis is true \(\mu_{D}\). Remember that the null hypothesis assumes no difference between the sample means, so it assumes that this sample of difference scores comes from a population with zero difference, or a mean value of 0.\n\[\n\mu_{D}=0\n\]\nSince this value is zero, it basically falls out of the equation, but it does remind us of the basis for this kind of t-test. The comparison in the formula is between a sample and a population. However, in this case it’s a sample of difference scores compared to a population of difference scores with an assumed mean of zero.\n\n\nHypothesis Assumptions\n\nNull Hypothesis - The sample can reasonably be said to have come from a population of difference scores with a mean of zero. Any difference between the means is the result of measurement error.\n\n\nAlternative Hypothesis - The sample cannot reasonably be said to have come from a population of difference scores with a mean of zero because the difference is too great and not the result of measurement error.\n\n\n\nStandard error of the Differences\nWe need some kind of value to compare to the mean difference. In this case the comparison is to the error, more specifically to the measurement error that could reasonably be expected based on random variation. This value is known as the standard error of the differences. This standard error is an indicator of the size of potential differences in sample means based on sampling variation.\nHere’s the formula/symbols\n\[\n\frac{s_{D}}{\sqrt{n}}\n\]\nIf you look closely, this formula looks remarkably similar to the standard error of the mean (because it is!), but notice how the standard deviation from the population \(\sigma\) has been replaced with the standard deviation from the sample \(s\). What is unique about the standard error of the differences is that it uses the standard deviation from the sample as an estimate of the standard deviation from the population.\n\n\nt Distribution\nOnce the t score is found using the formula, it is compared to the t distribution, which is very similar to the z distribution.\n\nThe sampling distribution of t is a probability distribution of t values that would occur if all possible different samples of a fixed size N were drawn from the null hypothesis population (Pagano, p. 320)\n\nThe t distribution varies based on the degrees of freedom, which is directly related to N. Just like the z distribution, the t distribution has a mean of zero, and as t values get larger, they are situated on the tails of the distribution as either positive or negative numbers. Just like with z scores, the larger the t score, the less likely the sample of obtained difference scores can reasonably be assumed to have come from a population of difference scores with a mean of zero.\n\n\nExample Using R Studio\nFor the example, import the following dataset.\n\nch14ds1 <- read.csv(\"ch14ds1.csv\")\n\nView the dataset in the Console\n\nch14ds1\n\n   Pretest Posttest\n1        3        7\n2        5        8\n3        4        6\n4        6        7\n5        5        8\n6        5        9\n7        4        6\n8        5        6\n9        3        7\n10       6        8\n11       7        8\n12       8        7\n13       7        9\n14       6       10\n15       7        9\n16       8        9\n17       8        8\n18       9        8\n19       9        4\n20       8        4\n21       7        5\n22       7        6\n23       6        9\n24       7        8\n25       8       12\n\n\nNotice that this is a fairly straightforward pre vs. posttest dataset.\nUse the summary command to look at the descriptive statistics\n\nsummary(ch14ds1)\n\n    Pretest       Posttest    \n Min.   :3.00   Min.   : 4.00 \n 1st Qu.:5.00   1st Qu.: 6.00 \n Median :7.00   Median : 8.00 \n Mean   :6.32   Mean   : 7.52 \n 3rd Qu.:8.00   3rd Qu.: 9.00 \n Max.   :9.00   Max.   :12.00 \n\n\nThe means between the two groups are different and we can calculate the difference.\n\n7.52-6.32\n\n[1] 1.2\n\n\nOf course, we need to determine whether this difference is significant or not.
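\nBefore using the built-in test, here is a sketch of the paired t formula computed by hand on this dataset; it should match the t value that t.test reports below.\n\nD <- ch14ds1$Posttest - ch14ds1$Pretest # difference scores\nmean(D) / (sd(D) / sqrt(length(D))) # t = mean difference over the standard error of the differences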
 So let’s use R to run the paired samples t test. The code is fairly straightforward: start with the t.test command, then list the two variables that are being analyzed, followed by letting R know you want to perform a paired test paired = TRUE.\n\nt.test(ch14ds1$Posttest, ch14ds1$Pretest, paired = TRUE)\n\n\n Paired t-test\n\ndata: ch14ds1$Posttest and ch14ds1$Pretest\nt = 2.4495, df = 24, p-value = 0.02198\nalternative hypothesis: true mean difference is not equal to 0\n95 percent confidence interval:\n 0.1889003 2.2110997\nsample estimates:\nmean difference \n            1.2 \n\n\nBased on the results, there does appear to be a significant difference between the pre and post tests.\nList the effect size as well using Cohen’s d: subtract one mean from the other and divide by the standard deviation of the control group (here, the pretest).\n\nCohens_d <- (7.52-6.32)/1.73\nCohens_d\n\n[1] 0.6936416\n\n\n\nBar Graph\nThis time let’s use ggplot to make a better looking bar graph. Check out the format of the data\n\nch14ds1\n\n   Pretest Posttest\n1        3        7\n2        5        8\n3        4        6\n4        6        7\n5        5        8\n6        5        9\n7        4        6\n8        5        6\n9        3        7\n10       6        8\n11       7        8\n12       8        7\n13       7        9\n14       6       10\n15       7        9\n16       8        9\n17       8        8\n18       9        8\n19       9        4\n20       8        4\n21       7        5\n22       7        6\n23       6        9\n24       7        8\n25       8       12\n\n\nThe data is set up for a paired t test, so it includes 2 numeric variables. This is a problem for graphing because we are missing a single factor variable that would describe our two bars (i.e. pre and post test).\nThus, we need to create a second dataset or dataframe that has a factor variable (“group”) and a numeric variable (“Test”). A new object is created called “TestData”. The factor variable is created from scratch using the rep repeat function, and the Pre and Post Test scores are combined into a single numeric variable.\n\nTestData <- data.frame(\n  group= rep(c(\"Pretest\", \"Posttest\"), each=25),\n  Test = c(ch14ds1$Pretest, ch14ds1$Posttest)\n  )\n\nNext, find the descriptive statistics for this new dataset\n\nlibrary(dplyr)\nTest_Descriptives <- TestData %>%\n  group_by(group) %>%\n  summarize(n = n(),\n            mean = mean(Test),\n            sd = sd(Test),\n            se = sd / sqrt(n),\n            ci = qt(0.975, df = n - 1) * sd / sqrt(n))\n\nCheck the descriptives and make sure they turned out ok.\n\nTest_Descriptives\n\n# A tibble: 2 × 6\n  group        n  mean    sd    se    ci\n  <chr>    <int> <dbl> <dbl> <dbl> <dbl>\n1 Posttest    25  7.52  1.83 0.366 0.755\n2 Pretest     25  6.32  1.73 0.345 0.712\n\n\nFinally, create a basic bar graph\n\nggplot(Test_Descriptives, \n       aes(x = group, \n           y = mean)) +\n  geom_bar(stat = \"identity\")\n\n\n\n\n\n\n\n\nThen let’s add in some error bars with the geom_errorbar function.\n\nggplot(Test_Descriptives, \n       aes(x = group, \n           y = mean)) +\n  geom_bar(stat = \"identity\") +\n  geom_errorbar(aes(ymin=mean-ci,\n                    ymax=mean+ci))\n\n\n\n\n\n\n\n\nFinally, let’s make it look really nice\n\nggplot(Test_Descriptives, \n       aes(x = group,\n           y = mean)) +\n  theme_minimal() +\n  geom_bar(stat = \"identity\", fill=\"cornflowerblue\") +\n  geom_errorbar(aes(ymin=mean-ci,\n                    ymax=mean+ci), width=.3, size=1) +\n  labs(title = \"Change in Performance from Pre to Post Tests\", \n       y=\"Mean Score\", x=\"Tests\")\n\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n\n\n\n\n\n\n
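\nThe warning is ggplot2 noting that the size aesthetic for lines was deprecated in version 3.4.0. Here is a sketch of the same graph using the newer linewidth argument, which avoids the warning:\n\nggplot(Test_Descriptives, \n       aes(x = group,\n           y = mean)) +\n  theme_minimal() +\n  geom_bar(stat = \"identity\", fill=\"cornflowerblue\") +\n  geom_errorbar(aes(ymin=mean-ci,\n                    ymax=mean+ci), width=.3, linewidth=1) + # linewidth replaces size for lines\n  labs(title = \"Change in Performance from Pre to Post Tests\", \n       y=\"Mean Score\", x=\"Tests\")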
\nYou can also reorder the columns so the pretest comes first\n\nggplot(Test_Descriptives, \n       aes(x = group,\n           y = mean)) +\n  theme_minimal() +\n  geom_bar(stat = \"identity\", fill=\"cornflowerblue\") +\n  geom_errorbar(aes(ymin=mean-ci,\n                    ymax=mean+ci), width=.3, size=1) +\n  labs(title = \"Change in Performance from Pre to Post Tests\", \n       y=\"Mean Score\", x=\"Tests\") +\n  scale_x_discrete(limits=c(\"Pretest\", \"Posttest\"))\n\n\n\n\n\n\n\n\nFinally, we can write out the results, which use essentially the same format as the independent t test.\n\nOn average, scores on the posttest were higher (M = 7.52, SE = .37) than the scores on the pretest (M = 6.32, SE = .35). This difference, 1.2, 95% CI[.19, 2.21], was significant t(24) = 2.45, p = 0.02. The results represented an effect size of d = 0.69.",
"crumbs": [
"Intro to Statistics",
"Paired Samples t test"
]
},
{
"objectID": "Data Import.html",
"href": "Data Import.html",
"title": "Data Import",
"section": "",
"text": "Although it’s possible to create datasets in R Studio, it is often the case that datasets will be imported into R Studio and then manipulated and analyzed using code. Code is often more helpful for cleaning datasets and even importing a variety of different formats.\nHere’s a list of several types of datasets that can be imported:\n\nData Types\n\n\nData Types\nFile Extension\n\n\n\n\nComma Separated Values\n.csv\n\n\nExcel\n.xlsx\n\n\nTab-delimited Files\n.tsv\n\n\nSPSS\n.sav\n\n\nGoogle Sheet.\nno extension\n\n\nText File\n.txt"
},
{
"objectID": "Data Import.html#data-importing",
"href": "Data Import.html#data-importing",
"title": "Data Import",
"section": "",
"text": "Although it’s possible to create datasets in R Studio, it is often the case that datasets will be imported into R Studio and then manipulated and analyzed using code. Code is often more helpful for cleaning datasets and even importing a variety of different formats.\nHere’s a list of several types of datasets that can be imported:\n\nData Types\n\n\nData Types\nFile Extension\n\n\n\n\nComma Separated Values\n.csv\n\n\nExcel\n.xlsx\n\n\nTab-delimited Files\n.tsv\n\n\nSPSS\n.sav\n\n\nGoogle Sheet.\nno extension\n\n\nText File\n.txt"
},
{
"objectID": "Data Import.html#importing-a-.csv-file",
"href": "Data Import.html#importing-a-.csv-file",
"title": "Data Import",
"section": "Importing a .csv file",
"text": "Importing a .csv file\nComma Separated Values files (.csv) are some of the more commonly found dataset files that can be imported. From the tidyverse the readr package can be used to import these types of datasets along with several others.\n\nlibrary(readr)\n\nYou can add options to executable code like this\n\n\n[1] 4\n\n\nThe echo: false option disables the printing of code (only output is displayed)."
},
{
"objectID": "Histograms.html",
"href": "Histograms.html",
"title": "Histograms",
"section": "",
"text": "Note - this section is based on section 7.3 from R for Data Science\n\n\nRemember that a continuous variable is a numerical measure of something that can take on a variety of different values at different levels of precision. One of the best tools for investigating a continuous variable like interval or ratio is to use a histogram. A histogram shows the variability that exists within a particular variable.\nLet’s examine the “carat” variable from the diamonds dataset. Carat refers to the weight of a diamond. Here is how to make a histogram.\n\nggplot(data = diamonds) +\n geom_histogram(mapping = aes(x = carat), binwidth = 0.5)\n\n\n\n\n\n\n\n\n\n\n\nAs a refresher, let’s talk through the code. ggplot gives us the primary command, followed by the data set we want to use in parentheses (). Then we use a new geom called geom_histogram with a mapping of the variable we are investigating, carat (mapping = aes(x=carat)). Finally we use binwidth to define how many intervals to divide the x axis.\n\n\n\nEach of the bars represents the count or number of occurrences for a particular bin of numbers on the carat scale. So at first glance, it looks like most of the diamonds have a carat of about .25 to maybe .75.\nWe can adjust the binwidth to give us more information in the graph. Let’s make the binwidth smaller (0.5 to 0.1) to see a little bit more detail.\n\nggplot(data = diamonds) +\n geom_histogram(mapping = aes(x = carat), binwidth = 0.1)\n\n\n\n\n\n\n\n\nNotice that now we can see that actually .25 is the most frequently occurring carat with over 10,000 entries. Our range on the x axis is fairly large (0 to 5 carats), but notice that most of the carats are much lower on the scale. Carats above 2.5 carats seem to be outliers, which means that they don’t fit as well in the dataset or have more extreme properties compared to the reset of the dataset. This shouldn’t be surprising, because the diamonds with more carats weigh more, tend to be larger, cost more, and thus would be more rare.\n\n\n\nAnother adjustment we can make to see more detail is to adjust the dataset to just focus on a particular range of carats. Tidyverse allows us to create a different dataset based on a rule using what’s call the pipe |>\n\n|> \n\nThis is a special symbol the tidyverse uses to “pipe” or “filter” the dataset based on rules that are specified in the code. Another form of the pipe that is sometimes used looks like this %>%. Either way works, but the latest edition of R for Data Science uses |>, so we’ll stick with that.\nLet’s take a look at the code to make a smaller version of the diamonds dataset.\n\nsmaller <- diamonds |> \n filter(carat < 3)\n\nWe’re creating a new object call “smaller” by taking the diamonds dataset and piping it (|>) through a filter function, which takes the variable “carat” and only chooses those observations that are less than 3 carats (carat < 3). So now we can just look at the diamonds with a carat of less than 3 and create a histogram to show us that variable.\n\nggplot(data = smaller, mapping = aes(x = carat)) +\n geom_histogram(binwidth = 0.1)\n\n\n\n\n\n\n\n\nNow we can see the variability that exists in this smaller range of carat values. Notice that .25 is still the most frequently occurring carat, followed by 1.00 and .75.\n\n\n\nWe can also take a variable like carat and overlay it with a second categorical variable. But rather than using a histogram we can use a simple line graph. In this case geom_freqpoloy creates a line graph based on a 2nd categorical variable. 
\n\n\nWe can also take a variable like carat and overlay it with a second categorical variable. But rather than using a histogram, we can use a simple line graph. In this case geom_freqpoly creates a line graph based on a 2nd categorical variable; here we’ll overlay the carat variable with cut.\n\nggplot(data = smaller, mapping = aes(x = carat, color = cut)) +\n  geom_freqpoly(binwidth = 0.1)\n\n\n\n\n\n\n\n\nNotice that most of the ideal diamonds are about .25 carats, with premium having the second largest count at about .25 and 1 carats.",
"crumbs": [
"Graphs",
"Histograms"
]
},
{
"objectID": "Histograms.html#histograms",
"href": "Histograms.html#histograms",
"title": "Histograms",
"section": "",
"text": "Note - this section is based on section 7.3 from R for Data Science\n\n\nRemember that a continuous variable is a numerical measure of something that can take on a variety of different values at different levels of precision. One of the best tools for investigating a continuous variable like interval or ratio is to use a histogram. A histogram shows the variability that exists within a particular variable.\nLet’s examine the “carat” variable from the diamonds dataset. Carat refers to the weight of a diamond. Here is how to make a histogram.\n\nggplot(data = diamonds) +\n geom_histogram(mapping = aes(x = carat), binwidth = 0.5)\n\n\n\n\n\n\n\n\n\n\n\nAs a refresher, let’s talk through the code. ggplot gives us the primary command, followed by the data set we want to use in parentheses (). Then we use a new geom called geom_histogram with a mapping of the variable we are investigating, carat (mapping = aes(x=carat)). Finally we use binwidth to define how many intervals to divide the x axis.\n\n\n\nEach of the bars represents the count or number of occurrences for a particular bin of numbers on the carat scale. So at first glance, it looks like most of the diamonds have a carat of about .25 to maybe .75.\nWe can adjust the binwidth to give us more information in the graph. Let’s make the binwidth smaller (0.5 to 0.1) to see a little bit more detail.\n\nggplot(data = diamonds) +\n geom_histogram(mapping = aes(x = carat), binwidth = 0.1)\n\n\n\n\n\n\n\n\nNotice that now we can see that actually .25 is the most frequently occurring carat with over 10,000 entries. Our range on the x axis is fairly large (0 to 5 carats), but notice that most of the carats are much lower on the scale. Carats above 2.5 carats seem to be outliers, which means that they don’t fit as well in the dataset or have more extreme properties compared to the reset of the dataset. This shouldn’t be surprising, because the diamonds with more carats weigh more, tend to be larger, cost more, and thus would be more rare.\n\n\n\nAnother adjustment we can make to see more detail is to adjust the dataset to just focus on a particular range of carats. Tidyverse allows us to create a different dataset based on a rule using what’s call the pipe |>\n\n|> \n\nThis is a special symbol the tidyverse uses to “pipe” or “filter” the dataset based on rules that are specified in the code. Another form of the pipe that is sometimes used looks like this %>%. Either way works, but the latest edition of R for Data Science uses |>, so we’ll stick with that.\nLet’s take a look at the code to make a smaller version of the diamonds dataset.\n\nsmaller <- diamonds |> \n filter(carat < 3)\n\nWe’re creating a new object call “smaller” by taking the diamonds dataset and piping it (|>) through a filter function, which takes the variable “carat” and only chooses those observations that are less than 3 carats (carat < 3). So now we can just look at the diamonds with a carat of less than 3 and create a histogram to show us that variable.\n\nggplot(data = smaller, mapping = aes(x = carat)) +\n geom_histogram(binwidth = 0.1)\n\n\n\n\n\n\n\n\nNow we can see the variability that exists in this smaller range of carat values. Notice that .25 is still the most frequently occurring carat, followed by 1.00 and .75.\n\n\n\nWe can also take a variable like carat and overlay it with a second categorical variable. But rather than using a histogram we can use a simple line graph. In this case geom_freqpoloy creates a line graph based on a 2nd categorical variable. 
Here we’ll overlay the carat variable with cut.\n\nggplot(data = smaller, mapping = aes(x = carat, color = cut)) +\n geom_freqpoly(binwidth = 0.1)\n\n\n\n\n\n\n\n\nNotice that most of the Ideal diamonds are about .25 carats, with Premium having the second largest counts at about .25 and 1 carat.
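\nWe can also use the pipe to double-check the earlier claim that diamonds above 2.5 carats are outliers. Here is a minimal sketch that counts them (the exact count isn’t the point; it should be a tiny fraction of the roughly 54,000 diamonds in the dataset):\n\ndiamonds |>\n filter(carat > 2.5) |>\n count()",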
"crumbs": [
"Graphs",
"Histograms"
]
},
{
"objectID": "Histograms.html#normal-curve",
"href": "Histograms.html#normal-curve",
"title": "Histograms",
"section": "Normal Curve",
"text": "Normal Curve\nOne of the more important histograms is called the normal curve. The normal curve has a particular shape in that it is symmetrical in relationship to the mean. A distribution (collection of scores) of a particular variable that is normally shaped will tend to have most of the scores in the middle of the distribution or fairly close to the mean.\nTo demonstrate this, let’s create a variable that is normally distributed and part of a dataset that has a second categorical variable. To do this, we can use the rnorm function to create this type of variable and include it with a categorical variable using the rep function.\n\nPractice <- data.frame(Condition = \n c(rep(c(\"A\", \"B\"), each=250)), \n Dist = rnorm(500, mean = 0, sd = 1))\n\nTo get a quick look at the shape we can use the hist function, which is a simplified version of the histogram.\n\nhist(Practice$Dist)\n\n\n\n\n\n\n\n\nNotice that the bulk of the frequency of the scores are in the middle of the graph. This means they are closest to the mean, which for this data set is 0. Histograms also often include a curved line to show the shape of the distribution. For histograms this is based on what’s called density, which is based on the percentage or proportion of scores rather than the count. So we’ll change our histogram slightly and include the line.\n\nhist(Practice$Dist, probability = TRUE)\nlines(density(Practice$Dist))\n\n\n\n\n\n\n\n\nSo the greatest density or percentage of scores lies close to the mean and the lowest density is on the tails of the distribution. The tails are where the distribution tapers off to the left and the right.\n\nSkew\nDatasets may also be skewed, which means that the bulk of the scores is either more toward the higher end or lower end of the graph.\n\nSkew is basically a measure of asymmetry in comparisong to the symmetry present in a normal curve. Negatively skewed histograms have a greater frequency of scores that tend to be larger and more extreme smaller scores at the left tail of the distribution. Positive skew is the exact opposite. Greater frequency of scores that tend to be smaller and more extreme scores at the right tail of the distribution.\n\n\nKurtosis\nKurtosis is a measure of the “pointiness” of a data set. The more kurtosis the closer the scores are to the mean and less kurtosis the more spread out the data set.\n\nThere are a few different types of kurtosis.\n\nPlatykurtic - Flat, spread out\nMesokurtic - Even distribution, kurtosis close to 0\nLeptokurtic - Too “pointy”, bunched around the mean\n\n\n\nDensity Curve\nWe can also add density curves to our ggplots as well as analyze a categorical variable. First we can outline the bars in our histogram to show more detail.\n\nggplot(Practice, aes(x=Dist)) +\n geom_histogram(binwidth=.4, colour=\"black\", fill=\"white\")\n\n\n\n\n\n\n\n\nThen we can also add in the normal curve by using density rather than count for the histogram.\n\nggplot(Practice, aes(x=Dist)) + \n geom_histogram(aes(y= after_stat(density)),\n binwidth=.4,\n colour=\"black\", fill=\"white\") +\n geom_density()\n\n\n\n\n\n\n\n\n\n\nTwo Histograms\nIt’s also possible to overlay two histograms based on a categorical variable, which divides the variable based on category.\n\nggplot(Practice, aes(x=Dist, fill=Condition)) +\n geom_histogram(binwidth=.4, alpha=.5, position=\"identity\")\n\n\n\n\n\n\n\n\nOr you can use two curves to show the shape.\n\nggplot(Practice, aes(x=Dist, color=Condition)) + geom_density()",
"crumbs": [
"Graphs",
"Histograms"
]
},
{
"objectID": "Bar Graphs and CIs.html",
"href": "Bar Graphs and CIs.html",
"title": "Bar Graphs and Confidence Intervals",
"section": "",
"text": "To create this bar graph we’ll start off with the dataset Salaries, which is from the package carData\n\ndata(Salaries, package = \"carData\")\n\nTo begin, make sure you have downloaded, installed, and loaded the package carData and make sure you’ve loaded the tidyverse.\n\nlibrary(tidyverse)\n\n\n\n\nhead(Salaries)\n\n rank discipline yrs.since.phd yrs.service sex salary\n1 Prof B 19 18 Male 139750\n2 Prof B 20 16 Male 173200\n3 AsstProf B 4 3 Male 79750\n4 Prof B 45 39 Male 115000\n5 Prof B 40 41 Male 141500\n6 AssocProf B 6 6 Male 97000\n\n\nThe dataset is about Professors and the salaries associated with different ranks. If you want to learn more about the dataset, remember you can type a question mark in front of it and R Studio will give you more information.\n\n?Salaries\n\n\n\n\nTo create a bar graph, first you need to find summary statistics for the dataset. In this case you’re going to find several descriptive statistics that will be used to make the graph.\nYou’ll find mean, standard deviation, standard error, and the 95% Confidence interval.\nHere’s the code to use:\n\nDesData <- Salaries |> \n group_by(rank) |> \n summarize(n=n(),\n mean=mean(salary),\n sd= sd(salary),\n se=sd/sqrt(n),\n ci=qt(0.975, df=n-1)*sd/sqrt(n))\n\nCheck out the code here. First notice this symbol |>, which is the pipe. You can think of this symbol like a pipe or funnel. It funnels or pipes the dataset through various functions.\n\n\n\nSo we start with our dataset, Salaries and then we pipe it or funnel it by a particular variable. In this case we’ll use the function group_by and the variable rank.\nThis variable will almost always be a factor or categorical variable.\nThen we use the summarize command to let R know what kinds of descriptive statistics we want.\n\nn = number of cases\nmean = mean\nsd = standard deviation\nse = standard error\nci = confidence interval",
"crumbs": [
"Graphs",
"Bar Graphs and Confidence Intervals"
]
},
{
"objectID": "Bar Graphs and CIs.html#an-example-to-begin",
"href": "Bar Graphs and CIs.html#an-example-to-begin",
"title": "Bar Graphs and Confidence Intervals",
"section": "",
"text": "To create this bar graph we’ll start off with the dataset Salaries, which is from the package carData\n\ndata(Salaries, package = \"carData\")\n\nTo begin, make sure you have downloaded, installed, and loaded the package carData and make sure you’ve loaded the tidyverse.\n\nlibrary(tidyverse)\n\n\n\n\nhead(Salaries)\n\n rank discipline yrs.since.phd yrs.service sex salary\n1 Prof B 19 18 Male 139750\n2 Prof B 20 16 Male 173200\n3 AsstProf B 4 3 Male 79750\n4 Prof B 45 39 Male 115000\n5 Prof B 40 41 Male 141500\n6 AssocProf B 6 6 Male 97000\n\n\nThe dataset is about Professors and the salaries associated with different ranks. If you want to learn more about the dataset, remember you can type a question mark in front of it and R Studio will give you more information.\n\n?Salaries\n\n\n\n\nTo create a bar graph, first you need to find summary statistics for the dataset. In this case you’re going to find several descriptive statistics that will be used to make the graph.\nYou’ll find mean, standard deviation, standard error, and the 95% Confidence interval.\nHere’s the code to use:\n\nDesData <- Salaries |> \n group_by(rank) |> \n summarize(n=n(),\n mean=mean(salary),\n sd= sd(salary),\n se=sd/sqrt(n),\n ci=qt(0.975, df=n-1)*sd/sqrt(n))\n\nCheck out the code here. First notice this symbol |>, which is the pipe. You can think of this symbol like a pipe or funnel. It funnels or pipes the dataset through various functions.\n\n\n\nSo we start with our dataset, Salaries and then we pipe it or funnel it by a particular variable. In this case we’ll use the function group_by and the variable rank.\nThis variable will almost always be a factor or categorical variable.\nThen we use the summarize command to let R know what kinds of descriptive statistics we want.\n\nn = number of cases\nmean = mean\nsd = standard deviation\nse = standard error\nci = confidence interval",
"crumbs": [
"Graphs",
"Bar Graphs and Confidence Intervals"
]
},
{
"objectID": "Bar Graphs and CIs.html#basic-bar-graph",
"href": "Bar Graphs and CIs.html#basic-bar-graph",
"title": "Bar Graphs and Confidence Intervals",
"section": "Basic Bar Graph",
"text": "Basic Bar Graph\nLet’s start with a simple bar chart of the means\n\nggplot(DesData, \n aes(x = rank, \n y = mean)) +\n geom_bar(stat = \"identity\")\n\n\n\n\n\n\n\n\nNotice geom_bar is the geom we need for the bar chart\nWe have to use a special stat as well, identity. This means that bar height will be determined by the number provided by the variable assigned to the y axis. In this case the mean.",
"crumbs": [
"Graphs",
"Bar Graphs and Confidence Intervals"
]
},
{
"objectID": "Bar Graphs and CIs.html#error-bars",
"href": "Bar Graphs and CIs.html#error-bars",
"title": "Bar Graphs and Confidence Intervals",
"section": "Error Bars",
"text": "Error Bars\nNext, we’ll add in error bars to bar chart, which is the geom_errorbar\n\nggplot(DesData, \n aes(x = rank, \n y = mean)) +\n geom_bar(stat = \"identity\") +\n geom_errorbar(aes(ymin=mean-se,\n ymax=mean+se))\n\n\n\n\n\n\n\n\nWe can also change length of the error bars to make it look better.\n\nggplot(DesData, \n aes(x = rank, \n y = mean)) +\n geom_bar(stat = \"identity\") +\n geom_errorbar(aes(ymin=mean-se,\n ymax=mean+se), width=.8)",
"crumbs": [
"Graphs",
"Bar Graphs and Confidence Intervals"
]
},
{
"objectID": "Bar Graphs and CIs.html#bells-and-whistles",
"href": "Bar Graphs and CIs.html#bells-and-whistles",
"title": "Bar Graphs and Confidence Intervals",
"section": "Bells and whistles",
"text": "Bells and whistles\nThen let’s add in all the bells and whistles.\n\nFirst let’s add correct column labels for the rank variable\nSecond let’s add more detail to the scale of the y variable\nFinally let’s add some better labels\nFor this one we’ll use the standard error\n\nYou’ll also need to add the library scales to get the proper symbol for dollars.\n\nlibrary(scales)\n\n\nAttaching package: 'scales'\n\n\nThe following object is masked from 'package:purrr':\n\n discard\n\n\nThe following object is masked from 'package:readr':\n\n col_factor\n\n\nThen you can do the graph, we’ll use the standard error for the confidence interval this time.\n\nggplot(DesData, \n aes(x = factor(rank,\n labels = c(\"Assistant\\nProfessor\",\n \"Associate\\nProfessor\",\n \"Full\\nProfessor\")), \n y = mean)) +\n geom_bar(stat = \"identity\", \n fill = \"cornflowerblue\") +\n geom_errorbar(aes(ymin=mean-se,\n ymax=mean+se), width=.8) +\n scale_y_continuous(breaks = seq(0, 130000, 20000), \n labels = dollar) +\n labs(title = \"Mean Salary by Rank\", \n subtitle = \"9-month academic salary for 2008-2009\",\n x = \"\",\n y = \"\")\n\n\n\n\n\n\n\n\nThis time use the 95% Confidence Intervals\n\nggplot(DesData, \n aes(x = factor(rank,\n labels = c(\"Assistant\\nProfessor\",\n \"Associate\\nProfessor\",\n \"Full\\nProfessor\")), \n y = mean)) +\n geom_bar(stat = \"identity\", \n fill = \"cornflowerblue\") +\n geom_errorbar(aes(ymin=mean-ci,\n ymax=mean+ci), width=.5, size=1) +\n scale_y_continuous(breaks = seq(0, 130000, 20000), \n labels = dollar) +\n labs(title = \"Mean Salary by Rank\", \n subtitle = \"9-month academic salary for 2008-2009\",\n x = \"\",\n y = \"\")\n\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.",
"crumbs": [
"Graphs",
"Bar Graphs and Confidence Intervals"
]
},
{
"objectID": "about.html",
"href": "about.html",
"title": "About",
"section": "",
"text": "About this site\n\n1 + 1\n\n[1] 2"
},
{
"objectID": "Working with Quarto.html",
"href": "Working with Quarto.html",
"title": "Working with Quarto",
"section": "",
"text": "Quarto is a document format in R Studio that allows you to print off your notes, homework, and code as a word doc or other types of files. It is based on R markdown, which is a prior version of this software. You can also make websites, slide decks, and other formats to present your findings from your statistical research. For our class we’ll be using it for homework, quizzes, and tests when we’re working on problems that involve data analysis.\n\n\n\n\n\n\n\nFor example, the penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.\n\n\n\nYou can use the Palmer Penguins dataset to make a graph or plot of the data.\nThe graph below shows the relationship between flipper and bill lengths of these penguins.\n\n\n\n\n\n\n\n\n\nNotice how Quarto runs the code to create your graph and then includes it in the output.\nHere is a tutorial for Quarto using R Studio\n\n\nWhen you want to preview your content you hit the render button and it will show you how you are doing so far.\n\n\n\nIf you go up to the insert feature in the tool bar, you can insert a cell like the one below where you can include R code. Or you can use the hot keys Option+Command+I for the Mac or in Windows Control+Alt+I\nThen you can put code into the cell to run it in the document\n\nsummary(cars)\n\n speed dist \n Min. : 4.0 Min. : 2.00 \n 1st Qu.:12.0 1st Qu.: 26.00 \n Median :15.0 Median : 36.00 \n Mean :15.4 Mean : 42.98 \n 3rd Qu.:19.0 3rd Qu.: 56.00 \n Max. :25.0 Max. :120.00 \n\n\nIf you click the green arrow on the right of the cell, it will run the code chunk for you and you can check it to make sure it’s right.\n\n\n\n\nYou can download the assignment template to use when you are working on your assignment. Here is how to set it up.\n\n\n\nThis is just the title of the assignment and your name, date, and the output, which will be a word document. This is always at the top of any Quarto document.\n\n\n\nThe r setup file lets you add any libraries or datasets you may need for the file. For example, you’ll need to usually add the tidyverse library for most of the assignments.\nYou’ll also need to add any datasets you are working with that you need to import. For example, to add SPSS files like the Album Sales Dataset you’ll need to add the haven library and then include the code to import the file.\nOr if you import a .csv file, you’ll need to include the code for that as well.\nIf you’re not sure of where the file is, you can always import the file first and the copy the code from your console. Typically this is not included (include = FALSE), but sometimes we need to see how you set things up.\n\n\n\nFor each question you’ll want to copy and paste the question from your homework into R Markdown.\n\n\n\nQuarto allows you to answer the question using both the code and output. 
So I’ll start with a descriptive sentence or annotation about what I’m doing and then show the code and the output.\nFilter dataset to mass below 250\n\nSmaller_StarWars <- starwars |>\n filter(mass < 250)\nSmaller_StarWars\n\n# A tibble: 58 × 14\n name height mass hair_color skin_color eye_color birth_year sex gender\n <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> \n 1 Luke Sk… 172 77 blond fair blue 19 male mascu…\n 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…\n 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…\n 4 Darth V… 202 136 none white yellow 41.9 male mascu…\n 5 Leia Or… 150 49 brown light brown 19 fema… femin…\n 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…\n 7 Beru Wh… 165 75 brown light blue 47 fema… femin…\n 8 R5-D4 97 32 <NA> white, red red NA none mascu…\n 9 Biggs D… 183 84 black light brown 24 male mascu…\n10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…\n# ℹ 48 more rows\n# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,\n# vehicles <list>, starships <list>\n\n\nCreate a scatterplot of mass and height\n\nggplot(data = Smaller_StarWars) +\n geom_point(mapping = aes(x = mass, \n y = height))\n\n\n\n\n\n\n\n\n\n\n\nAfter you’re done with your document and want to turn it in, you’ll click the render button and it will open as a Word document.\nThe document will be saved in your working directory. The working directory is the folder where all your current files are being saved. It should be listed in your console or you can type the command getwd() to find it as well.",
"crumbs": [
"R Basics",
"Working with Quarto"
]
},
{
"objectID": "Working with Quarto.html#palmer-penguins",
"href": "Working with Quarto.html#palmer-penguins",
"title": "Working with Quarto",
"section": "",
"text": "For example, the penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.",
"crumbs": [
"R Basics",
"Working with Quarto"
]
},
{
"objectID": "Working with Quarto.html#graphs",
"href": "Working with Quarto.html#graphs",
"title": "Working with Quarto",
"section": "",
"text": "You can use the Palmer Penguins dataset to make a graph or plot of the data.\nThe graph below shows the relationship between flipper and bill lengths of these penguins.\n\n\n\n\n\n\n\n\n\nNotice how Quarto runs the code to create your graph and then includes it in the output.\nHere is a tutorial for Quarto using R Studio\n\n\nWhen you want to preview your content you hit the render button and it will show you how you are doing so far.\n\n\n\nIf you go up to the insert feature in the tool bar, you can insert a cell like the one below where you can include R code. Or you can use the hot keys Option+Command+I for the Mac or in Windows Control+Alt+I\nThen you can put code into the cell to run it in the document\n\nsummary(cars)\n\n speed dist \n Min. : 4.0 Min. : 2.00 \n 1st Qu.:12.0 1st Qu.: 26.00 \n Median :15.0 Median : 36.00 \n Mean :15.4 Mean : 42.98 \n 3rd Qu.:19.0 3rd Qu.: 56.00 \n Max. :25.0 Max. :120.00 \n\n\nIf you click the green arrow on the right of the cell, it will run the code chunk for you and you can check it to make sure it’s right.",
"crumbs": [
"R Basics",
"Working with Quarto"
]
},
{
"objectID": "Working with Quarto.html#how-to-set-up-an-assignment",
"href": "Working with Quarto.html#how-to-set-up-an-assignment",
"title": "Working with Quarto",
"section": "",
"text": "You can download the assignment template to use when you are working on your assignment. Here is how to set it up.",
"crumbs": [
"R Basics",
"Working with Quarto"
]
},
{
"objectID": "Working with Quarto.html#write-your-yaml-file",
"href": "Working with Quarto.html#write-your-yaml-file",
"title": "Working with Quarto",
"section": "",
"text": "This is just the title of the assignment and your name, date, and the output, which will be a word document. This is always at the top of any Quarto document.",
"crumbs": [
"R Basics",
"Working with Quarto"
]
},
{
"objectID": "Working with Quarto.html#add-the-r-setup-file",
"href": "Working with Quarto.html#add-the-r-setup-file",
"title": "Working with Quarto",
"section": "",
"text": "The r setup file lets you add any libraries or datasets you may need for the file. For example, you’ll need to usually add the tidyverse library for most of the assignments.\nYou’ll also need to add any datasets you are working with that you need to import. For example, to add SPSS files like the Album Sales Dataset you’ll need to add the haven library and then include the code to import the file.\nOr if you import a .csv file, you’ll need to include the code for that as well.\nIf you’re not sure of where the file is, you can always import the file first and the copy the code from your console. Typically this is not included (include = FALSE), but sometimes we need to see how you set things up.",
"crumbs": [
"R Basics",
"Working with Quarto"
]
},
{
"objectID": "Working with Quarto.html#add-questions-from-the-assignment.",
"href": "Working with Quarto.html#add-questions-from-the-assignment.",
"title": "Working with Quarto",
"section": "",
"text": "For each question you’ll want to copy and paste the question from your homework into R Markdown.",
"crumbs": [
"R Basics",
"Working with Quarto"
]
},
{
"objectID": "Working with Quarto.html#answer-the-question",
"href": "Working with Quarto.html#answer-the-question",
"title": "Working with Quarto",
"section": "",
"text": "Quarto allows you to answer the question using both the code and output. So I’ll start with a descriptive sentence or annotation about what I’m doing and then show the code and the output.\nFilter dataset to mass below 250\n\nSmaller_StarWars <- starwars |>\n filter(starwars$mass < 250)\nSmaller_StarWars\n\n# A tibble: 58 × 14\n name height mass hair_color skin_color eye_color birth_year sex gender\n <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> \n 1 Luke Sk… 172 77 blond fair blue 19 male mascu…\n 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…\n 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…\n 4 Darth V… 202 136 none white yellow 41.9 male mascu…\n 5 Leia Or… 150 49 brown light brown 19 fema… femin…\n 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…\n 7 Beru Wh… 165 75 brown light blue 47 fema… femin…\n 8 R5-D4 97 32 <NA> white, red red NA none mascu…\n 9 Biggs D… 183 84 black light brown 24 male mascu…\n10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…\n# ℹ 48 more rows\n# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,\n# vehicles <list>, starships <list>\n\n\nCreate a scatterplot of mass and height\n\nggplot(data = Smaller_StarWars) +\n geom_point(mapping = aes(x = mass, \n y = height))",
"crumbs": [
"R Basics",
"Working with Quarto"
]
},
{
"objectID": "Working with Quarto.html#render-your-document",
"href": "Working with Quarto.html#render-your-document",
"title": "Working with Quarto",
"section": "",
"text": "After your done with your document and want to turn it in, you’ll click the render button and it will open as a word document.\nThe document will be saved in your working directory. The working directory is the file where all your current files are being saved. It should be listed in your console or you can type the command getwd() to find it as well.",
"crumbs": [
"R Basics",
"Working with Quarto"
]
},
{
"objectID": "Assignment Template (1).html",
"href": "Assignment Template (1).html",
"title": "Setup",
"section": "",
"text": "Setup\n\nAdd packages\nAdd External Datasets\n\n\nknitr::opts_chunk$set(echo = TRUE)\nlibrary(tidyverse)\n\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.2 ✔ tibble 3.2.1\n✔ lubridate 1.9.4 ✔ tidyr 1.3.1\n✔ purrr 1.0.4 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\n\n\n\nAdd Questions from homework\n\nCreate a data set of the total number of Covid cases in 5 counties in California.\n\n\nTotal_Cases <- tribble(~County, ~Cases, \n \"Los Angeles\", 2730000, \n \"Fresno\", 233000, \n \"Stanislaus\", 129000,\n \"Tulare\", 124000, \n \"San Diego\", 750000)\nTotal_Cases\n\n# A tibble: 5 × 2\n County Cases\n <chr> <dbl>\n1 Los Angeles 2730000\n2 Fresno 233000\n3 Stanislaus 129000\n4 Tulare 124000\n5 San Diego 750000\n\n\n(Write a summary statement or add any other details if necessary)\n\nCreate a bar graph based on the Covid cases by county dataset you just created.\n\n\nggplot(data = Total_Cases) + \n geom_bar(mapping = aes(x = County, y = Cases), \n stat = \"identity\") +\n labs(title = \"Total Cases by County in California\")"
},
{
"objectID": "Descriptive Statistics.html",
"href": "Descriptive Statistics.html",
"title": "Descriptive Statistics",
"section": "",
"text": "Sometimes pre-existing datasets are imported into R Studio. There are several different types of datasets that can be imported. The ones used most frequently in this course are from Excel and SPSS. Types of computer files are indicated by what follows the period (“.”) in the file name. For example file.docx is a type of word file.\nHere’s a list of common file types we’ll use.\n\nExcel = .xlsx\nSPSS = .sav\nComma Seperated Values = .csv\n\nOne files we’ll be using today is the Album Sales file. On the moodle page for the course at the very top underneath the link for zoom sessions, you’ll see a folder with SPSS Data Sets. When you click on the folder you’ll see a list of datasets. Click on the one that says, “Album Sales.sav” and download it to your computer.\nOnce the file is downloaded go to the upper right pane in R Studio. Under the environment tab you’ll see a button that says “Import Dataset”. Click on the button and you’ll get several different options. Go to the one that says “From SPSS” and click on it. At the top of the window you’ll need to find the file you are importing to open it. Click on the Browse buttom to find the file. Once you find it click open and it will give you a preview of the file in the window. Then all you need to do is click import.\nHere is the code for importing the file\n\nlibrary(haven)\nAlbum_Sales <- read_sav(\"Album Sales.sav\")",
"crumbs": [
"Intro to Statistics",
"Descriptive Statistics"
]
},
{
"objectID": "Descriptive Statistics.html#opening-datasets",
"href": "Descriptive Statistics.html#opening-datasets",
"title": "Descriptive Statistics",
"section": "",
"text": "Sometimes pre-existing datasets are imported into R Studio. There are several different types of datasets that can be imported. The ones used most frequently in this course are from Excel and SPSS. Types of computer files are indicated by what follows the period (“.”) in the file name. For example file.docx is a type of word file.\nHere’s a list of common file types we’ll use.\n\nExcel = .xlsx\nSPSS = .sav\nComma Seperated Values = .csv\n\nOne files we’ll be using today is the Album Sales file. On the moodle page for the course at the very top underneath the link for zoom sessions, you’ll see a folder with SPSS Data Sets. When you click on the folder you’ll see a list of datasets. Click on the one that says, “Album Sales.sav” and download it to your computer.\nOnce the file is downloaded go to the upper right pane in R Studio. Under the environment tab you’ll see a button that says “Import Dataset”. Click on the button and you’ll get several different options. Go to the one that says “From SPSS” and click on it. At the top of the window you’ll need to find the file you are importing to open it. Click on the Browse buttom to find the file. Once you find it click open and it will give you a preview of the file in the window. Then all you need to do is click import.\nHere is the code for importing the file\n\nlibrary(haven)\nAlbum_Sales <- read_sav(\"Album Sales.sav\")",
"crumbs": [
"Intro to Statistics",
"Descriptive Statistics"
]
},
{
"objectID": "Descriptive Statistics.html#descriptive-statistics",
"href": "Descriptive Statistics.html#descriptive-statistics",
"title": "Descriptive Statistics",
"section": "Descriptive Statistics",
"text": "Descriptive Statistics\nDescriptive statisitics are just what the name implies, they are used to describe a dataset. This is different from inferential statistics, which are used to infer population paramenters based on a sample.\nOne of the most common sets of descriptive statistics is known as central tendency, which is simply finding a way to get a basic overview of a dataset as a whole.\n\nMean\nThe easiest way to measure central tendency is to find the average or the mean. The formula for the mean is very straightforward. It’s simply the sum of all the scores divided by the number of scores. The symbol for the sample mean is an x with a bar over it. \\[\n\\bar X=\\frac{\\Sigma X}{N}\n\\] The capital Greek letter Sigma in the numerator stands for sum and X stands in for all the values in a dataset with the N standing for the number of values in the dataset. In R Studio, we simply use the mean command to find the mean for a particular variable\n\nmean(diamonds$carat)\n\n[1] 0.7979397\n\n\n\n\nMedian and Mode\nThere are two other measures of central tendency that are often used in statistics, median and mode. The median is simply the center score, so to find it, you simply rank order your values from lowest to highest and the middle score in that rank order. For even scores you take the two middle scores add them and divide them by two. For odd scores there is only one middle score. Here again, R gives us an easy way to find the median.\n\nmedian(diamonds$carat)\n\n[1] 0.7\n\n\nThe mode is the score or value with the highest frequency in your dataset. So whatever value occurs the most times is the mode. For some reason, mode is not part of the basic R package, so you need to install the “modeest” package if you need to find this descriptive statistic.\n\nlibrary(modeest)\n\nOnce the library is installed use the mfv command\n\nmfv(diamonds$carat)\n\n[1] 0.3\n\n\nWe tend to use the mean most often for central tendency, but it’s more effected by extreme scores. So if you have several low or high outliers (extreme scores that are higher or lower than the mean), the median may be a more accurate representation of the data. More often then not, we’ll use the mean. The mode is a more straightforward measure. So if you were interested in the most popular song on an album based on downloads, you would find the mode. The mode is often very helpful with categorical data.",
"crumbs": [
"Intro to Statistics",
"Descriptive Statistics"
]
},
{
"objectID": "Descriptive Statistics.html#variability",
"href": "Descriptive Statistics.html#variability",
"title": "Descriptive Statistics",
"section": "Variability",
"text": "Variability\nThe second most important aspect of descriptive statistics is variability or dispersion in a dataset. This typically represents how the rest of the dataset relates to the mean or some other measure of central tendency. For example: are the scores close to or widely dispersed from the mean?\n\nStandard Deviation\nOften times were interested in the average spread or deviation from the mean. Deviance is the distance from any particular score from the mean. So to find the deviance you simply take the score and subtract it from the mean. \\[\ndeviance = X - \\bar X\n\\] If we wanted to find out the total amount of deviance, we could simply add together the total deviance for each number in our dataset. So the equation would look like this: \\[\ntotal\\;deviance = \\Sigma(X-\\bar X)\n\\] Unfortunately this equation causes some problems. If you remember, the mean is the average score, so it’s the score at around 50%. But that means that for any dataset about half the scores will be above the mean and half the scores will be below the mean. Or another way to think about it, half the deviation scores will be negative and half the deviation scores will be positive. So if we add up all these scores, the total deviation will be equal to 0, but 0 doesn’t tell us much about the spread of the scores.\nThe way to overcome this problem is by squaring each deviation score, which makes all the deviation values positive and thus produces a positive number. This number is called the sum of squared errors of the sum of squares (SS) with the following formula. \\[\nsum\\;of\\;squares (SS) = \\Sigma(X-\\bar X)^2\n\\] This number is still somewhat inflated, since its constructed based on squared values. One way to fix this issue to to find the average dispersion, which will be based on the number in our sample. Since the sample is an estimate of the population, we actually don’t use N, but N -1. This number is called variance and has the following symbol and equation. \\[\nvariance(s^2) = \\frac {SS}{N-1} = \\frac {\\Sigma(X-\\bar X)^2}{N-1}\n\\]\nThis number is closer to the original units of measurement, but to make it more accurate, the original squared values need to be taken back out of the measure. To do this, the square root of the variance is calculated.\n\\[\ns = \\sqrt {\\frac {\\Sigma (X- \\bar X)^2}{N-1}}\n\\]\nTo find the standard deviation using R Studio we use the sd command to find it.\n\nsd(Album_Sales$Adverts)\n\n[1] 485.6552\n\n\n\n\nRange\nAnother measure of dispersion is range. This is simply subtracting the highest score from the lowest score for a particular variable or vector of scores. So to find the range simply use the range function and then subtract the two numbers provided, which are the highest and the lowest.\n\nrange(Album_Sales$Adverts)\n\n[1] 9.104 2271.860\n\n\nThen simply subtract the scores\n\n2271.860-9.104\n\n[1] 2262.756\n\n\n\n\nInterquartile Range\nAnother helpful type of range is the interquartile range, which is the range of numbers from the 25th and 75th percentile. Percentiles are just dividing up a dataset based on where certain scores are based on percentages. For example, we can look at what score is at the 50th percentile. To do that we use the quantile function (a quantile is the same thing as a percentile). x is the variable we are analyzing, and .5 is the percentile (.5 = 50%; .35 = 35%)\n\nquantile(x = diamonds$carat, probs = .5)\n\n50% \n0.7 \n\n\nSo the number at the 50th percentile is 0.7. 
Notice that this is the same as the median, described earlier. Remember that the median is the number from the dataset that is in the middle or center of the dataset. The mean, on the other hand, is calculated, so while it describes the average of the scores, it may or may not be a number contained in the dataset.\nSo to find the interquartile range, we want our two quartiles or numbers at “quarters” of the dataset, so 25% and 75%. Think of a dollar, which is made up of 4 quarters. We already know that half a dollar is 50 cents, which would be the median, but the other two quartiles (quarters) would be 25 cents and 75 cents.\n\nquantile(x = diamonds$carat, probs = c(.25, .75))\n\n 25% 75% \n0.40 1.04 \n\n\nFinally, to get the interquartile range we simply subtract these two numbers.\n\n1.04-.40\n\n[1] 0.64\n\n\nSo the interquartile range is 0.64.
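\nBase R can also do both of these subtractions for us: diff() subtracts the first number range() returns from the second, and IQR() computes the interquartile range in one step. A quick sketch:\n\ndiff(range(Album_Sales$Adverts))\nIQR(diamonds$carat)\n\nThese should match the 2262.756 and 0.64 we just calculated by hand.",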
"crumbs": [
"Intro to Statistics",
"Descriptive Statistics"
]
},
{
"objectID": "Correlation.html",
"href": "Correlation.html",
"title": "Correlation",
"section": "",
"text": "One of the first tests we can use to look at the relationship between two variables is correlation. Correlation usually requires two continuous numerical variables of some kind.\n\nVariance\nThe basic method for comparison between the two variables is variance, which we learned when looking at descriptive statisitics. Remember that variance is a measurement of data dispersion or spread. More specifically it refers to the average amount that data varies from the mean. So what correlation ultimately analyses is whether 2 variables vary in a similar way from their respective means. Remember here’s the formula for variance.\n\\[\nvariance(s^2)=\\frac{\\Sigma(x-\\bar x)^2}{N-1}\n\\]\n\n\nCovariance\nSince correlation is based on how two variables vary, naturally we are interested in covariance. Covariance is simply variance for two variables. Notice here in the formula that we’ve modified the orinigal variance formula and included our second variable y. So we’ve modified the exponent in the formula for x - the mean and incorporated y. \\[\ncovariance(x,y)=\\frac{\\Sigma(x-\\bar x)(y-\\bar y)}{N-1}\n\\]\nx stands for our first variable and y stands for our second variable, so we are analyzing variance over two variables rather than just one. How does one variable deviate from the mean in comparison to how another variable deviates from the mean? If they deviate (vary) from the mean in a similar way they will be expected to be highly correlated. When there are two variables we can multiply the deviation for one variable by the devation from the second variable. If both deviations are either positive or negative this will give us a positive value, which tells us that the deviations are in the same direction (positive correlation). If the deviations go in opposite directions (one positive and one negative) this will give us a negative value (negative corrleation). Multiplying deviations of one variable by a second variable gets us the cross-product deviation.\nAt this point the covariance is dependent upon the types of units used to calculate the number. However, we want to standardize this number, which basically means that the number will be in units that are the same across different experiments and the tools they use to measure their variables.In this case to standardize our covariance we use a calculation discussed in descriptive statistics, the standard deviation.\n\n\nPearson’s r\nThis gets us what’s known as Pearson’s r, which was named after the person who developed the calculation, Karl Pearson with Florenece Nightingale David doing a lot of the harder mathematical calculations. So hear is the formula for Pearson’s r \\[\nr = \\frac {cov_{xy}}{s_{x}s_{y}}=\\frac{\\Sigma(x-\\bar x)(y-\\bar y)}{(N-1)s_{x}s_{y}}\n\\]\nNotice that the formula is the covariance of x and y divided by the standard deviation of x and y, then this is elaborated in the longer version following the second equal sign where the standard deviation for x and y is added to our original formula for covariance.\n\n\nCorrelational Coefficent\nPearon’s r or the correlational coefficient always varies between +1 and -1. As your r value moves closer to +1 that means that both variables are varying in the same direction either both increasing or both descreasing. 
When your r value is moving closer to -1 that indicates that the two variables are varying in opposite directions: as one variable increases the other decreases.\nCorrelation only describes relationships between variables, not causal relationships between variables. It’s not possible to decipher which variable is doing the causal work or which variable is the independent variable and which is the dependent variable. However, correlation can still provide helpful information about variables and we can still use it to test certain types of hypotheses.\n\n\nExploring Correlation in a Dataset\nTo explore correlation further, let’s look at a particular dataset. Open the dataset labeled album_sales.csv. Use the Import Dataset command in the upper right-hand window and let’s look at the data set.\n\nhead(Album_Sales)\n\n# A tibble: 6 × 4\n Adverts Sales Airplay Image\n <dbl> <dbl> <dbl> <dbl>\n1 10.3 330 43 10\n2 986. 120 28 7\n3 1446. 360 35 7\n4 1188. 270 33 7\n5 575. 220 44 5\n6 569. 170 19 5\n\n\n\n\nAlbum Sales Dataset\nThis is a dataset for a record company looking at various variables related to album sales.\n\nAdverts - Money spent on advertising\nSales - Sales for various albums\nAirplay - how much airplay each album received\nImage - How attractive each band’s image was, rated on a 0-10 scale\n\nNext let’s look at the types of variables we have\n\nstr(Album_Sales)\n\ntibble [200 × 4] (S3: tbl_df/tbl/data.frame)\n $ Adverts: num [1:200] 10.3 985.7 1445.6 1188.2 574.5 ...\n ..- attr(*, \"label\")= chr \"Advertsing budget (thousands)\"\n ..- attr(*, \"format.spss\")= chr \"F8.2\"\n $ Sales : num [1:200] 330 120 360 270 220 170 70 210 200 300 ...\n ..- attr(*, \"label\")= chr \"Album sales (thousands)\"\n ..- attr(*, \"format.spss\")= chr \"F8.0\"\n $ Airplay: num [1:200] 43 28 35 33 44 19 20 22 21 40 ...\n ..- attr(*, \"label\")= chr \"No. of plays on radio\"\n ..- attr(*, \"format.spss\")= chr \"F8.0\"\n $ Image : num [1:200] 10 7 7 7 5 5 1 9 7 7 ...\n ..- attr(*, \"label\")= chr \"Band image rating (0-10)\"\n ..- attr(*, \"format.spss\")= chr \"F8.0\"\n\n\nNotice that each of the variables is continuous, either numeric (num) or integer (int), so this is a great dataset for doing correlations.\nHow would you expect attractiveness of the band to be related to album sales? Unless you’ve been living under a rock or don’t know anything about rock `n roll, attractiveness is a huge part of album sales. Whether you’re Dua Lipa or BTS, looks matter if you’re in the entertainment industry. So we should expect that as an artist’s looks improve, their album sales should increase as well.\n\n\nScatterplot - Band Attractiveness & Album Sales\nLet’s use a graph to first look at this data using a ggplot we learned earlier called a scatterplot. We’ll do a basic one first.\n\nggplot(Album_Sales, mapping = aes\n (x = Image, y = Sales)) + geom_point()\n\n\n\n\n\n\n\n\nSo notice that dots lower on the attractiveness scale tend to be associated with lower sales, whereas when you look at the dots higher on the attractiveness scale there are more dots that are higher on the sales scale.\n\n\nRegression Line\nAnother tool to help us analyze this relationship is a regression line. This is a line drawn on the graph that is closest to as many dots or points as possible. 
You add it like this.\n\nggplot(Album_Sales, mapping = aes\n (x = Image, y = Sales)) + geom_point() +\n geom_smooth(method = 'lm')\n\n`geom_smooth()` using formula = 'y ~ x'\n\n\n\n\n\n\n\n\n\nA regression line is simply using a line to describe the relationship between the two variables. As the line increases in slope or steepness, that indicates a stronger relationship. If the line moves from the lower left to the upper right that indicates a positive correlation whereas if the line moves from the upper left hand corner to the bottom right that indicates a negative correlation. The more the line “flattens” or decreases in slope the less correlation there is between the two variables.\nLet’s look at a second variable using the graph to judge which one is more correlated. Use advertising rather than attractiveness this time.\n\nggplot(Album_Sales, mapping = aes(x = Adverts, \n y = Sales)) + geom_point()\n\n\n\n\n\n\n\n\nNotice how most of the dots are gathered in the middle of the graph and moving from lower left to upper right. Notice also that no points are in the bottom right hand corner. So low sales do not appear to occur alongside higher amounts of advertising. Let’s see the location of the regression line.\n\nggplot(Album_Sales, mapping = aes(x = Adverts, \n y = Sales)) + geom_point() +\n geom_smooth(method = 'lm')\n\n`geom_smooth()` using formula = 'y ~ x'\n\n\n\n\n\n\n\n\n\nNotice that the regression line seems slightly steeper on the adverts graph vs. the Image graph. But we can’t be totally sure. So let’s go ahead and use our Pearson’s r correlation coefficient to help us. The function is really simple: it is just “cor” followed by the two variables you are analyzing.\n\ncor(Album_Sales$Sales, Album_Sales$Image)\n\n[1] 0.3261111\n\n\nOk, so the correlation between Sales and Image (attractiveness) is .33, now let’s look at Sales and Advertising.\n\ncor(Album_Sales$Sales, Album_Sales$Adverts)\n\n[1] 0.5784877\n\n\nNotice how the correlation has increased (meaning moved closer to + 1) from .33 to .58. This would seem to justify our hunch that there was a stronger correlation between Sales and Advertising than between Sales and Image (Looks aren’t everything apparently), based on the regression line for each graph.
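\nAs a quick check on the math from earlier, Pearson’s r really is just the covariance standardized by the two standard deviations. This minimal sketch uses base R’s cov() and sd() and should reproduce the value cor() gave us for Sales and Image:\n\ncov(Album_Sales$Sales, Album_Sales$Image) /\n (sd(Album_Sales$Sales) * sd(Album_Sales$Image))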
\n\n\nHow to use correlations to test hypotheses\nTo more formally use correlation to test a hypothesis, let’s look at another variable in the Album Sales dataset, airplay. This is how often a song gets played on the radio (FYI, the radio is an ancient device that old people used to listen to random music played using mysterious spooky radio waves). So your hypothesis would look something like:\n\nThere is a correlation between the airplay an album receives and the sales for the album.\n\nThe null hypothesis would simply negate the primary hypothesis:\n\nThere is no correlation between the airplay an album receives and the sales for the album.\n\nTo test the hypothesis we use the “cor.test” command and then specify each of the variables we are analyzing, so it looks like this:\n\ncor.test(Album_Sales$Sales, Album_Sales$Airplay)\n\n\n Pearson's product-moment correlation\n\ndata: Album_Sales$Sales and Album_Sales$Airplay\nt = 10.524, df = 198, p-value < 2.2e-16\nalternative hypothesis: true correlation is not equal to 0\n95 percent confidence interval:\n 0.5018814 0.6810668\nsample estimates:\n cor \n0.5989188 \n\n\nThen write the results using the output including the correlation coefficient, 95% confidence interval, and the p value.\nResults There was a significant correlation between the airplay an album received and the sales for the album, r = .60, 95% CI [.50, .68], p < .001.\nLet’s unpack these results a bit. The correlation is pretty high at .60. Remember that the closer a correlation coefficient is to either +/- 1 the stronger the correlation. We’ve also listed the confidence interval, which means that we are 95% confident that the correlation coefficient is between .50 and .68. The big problem that may show up here is if the CI range includes 0, which would mean that it’s possible that the correlation is zero, thus nullifying our primary hypothesis. The narrower this range is, the more precise our correlation estimate; and the closer the coefficient is to 1, the stronger the correlation. Finally, our p value is very low, so this shows us that the finding is significant.\nFinally, we want to include a scatterplot graph with the regression line to show the relationship we found.\n\nggplot(Album_Sales, mapping = aes(x = Airplay, y = Sales)) +\n geom_point() +\n geom_smooth(method = 'lm')\n\n`geom_smooth()` using formula = 'y ~ x'\n\n\n\n\n\n\n\n\n\nHere’s a second version with a few more bells and whistles. I made a different color for the scatter dots by inserting a color command and changed the size to make them a bit bigger. I did something similar with the line to make it thicker by adjusting the linewidth and removed the CIs with the “se = FALSE” command. I also added labels through the “labs” command, make sure to use quotation marks around your titles and x and y descriptions.\n\nggplot(Album_Sales, mapping = aes(x = Airplay, y = Sales)) +\n geom_point(size = 1.5, color = \"cornflower blue\") +\n geom_smooth(method = 'lm', se=FALSE, linewidth = 1.5) + \n labs(title = \"Scatterplot of the Relationship between Airplay\n and Sales\", y = \"Album Sales (thousands)\", \n x = \"Number of Plays on the Radio\")\n\n`geom_smooth()` using formula = 'y ~ x'",
"crumbs": [
"Statistical Tests",
"Correlation"
]
},
{
"objectID": "Working with R Markdown.html",
"href": "Working with R Markdown.html",
"title": "Working with R Markdown",
"section": "",
"text": "Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com or a nice slide presentation of what R Markdown can do is here.\nWhen you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.\nHere is the code to create a code chunk. Three left quotes or accents followed by braces with an r and then closing 3 accents.\nIf you click the green arrow on the right, it will run the code chunk for you and you can check it to make sure it’s right.\n\n\n\nCode Chunk\n\n\nSo the output will look like this:\n\nsummary(cars)\n\n speed dist \n Min. : 4.0 Min. : 2.00 \n 1st Qu.:12.0 1st Qu.: 26.00 \n Median :15.0 Median : 36.00 \n Mean :15.4 Mean : 42.98 \n 3rd Qu.:19.0 3rd Qu.: 56.00 \n Max. :25.0 Max. :120.00"
},
{
"objectID": "Working with R Markdown.html#r-markdown",
"href": "Working with R Markdown.html#r-markdown",
"title": "Working with R Markdown",
"section": "",
"text": "Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com or a nice slide presentation of what R Markdown can do is here.\nWhen you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.\nHere is the code to create a code chunk. Three left quotes or accents followed by braces with an r and then closing 3 accents.\nIf you click the green arrow on the right, it will run the code chunk for you and you can check it to make sure it’s right.\n\n\n\nCode Chunk\n\n\nSo the output will look like this:\n\nsummary(cars)\n\n speed dist \n Min. : 4.0 Min. : 2.00 \n 1st Qu.:12.0 1st Qu.: 26.00 \n Median :15.0 Median : 36.00 \n Mean :15.4 Mean : 42.98 \n 3rd Qu.:19.0 3rd Qu.: 56.00 \n Max. :25.0 Max. :120.00"
},
{
"objectID": "Working with R Markdown.html#including-plots",
"href": "Working with R Markdown.html#including-plots",
"title": "Working with R Markdown",
"section": "Including Plots",
"text": "Including Plots\nYou can also embed plots, for example:\n\n\n\n\n\n\n\n\n\nNote that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. Most of the time we’ll have this set to echo=TRUE so we can see your code."
},
{
"objectID": "Working with R Markdown.html#how-to-set-up-an-assignment",
"href": "Working with R Markdown.html#how-to-set-up-an-assignment",
"title": "Working with R Markdown",
"section": "How to set up an assignment",
"text": "How to set up an assignment"
},
{
"objectID": "Working with R Markdown.html#write-your-yaml-file",
"href": "Working with R Markdown.html#write-your-yaml-file",
"title": "Working with R Markdown",
"section": "1. Write your YAML file",
"text": "1. Write your YAML file\nThis is just the title of the assignment and your name, date, and the output, which will be a word document. This is always at the top of any R Markdown document.\n\n\n\nYAML front matter"
},
{
"objectID": "Working with R Markdown.html#add-the-r-setup-file",
"href": "Working with R Markdown.html#add-the-r-setup-file",
"title": "Working with R Markdown",
"section": "2. Add the r setup file",
"text": "2. Add the r setup file\nThe r setup file lets you add any libraries or datasets you may need for the file. For example, you’ll need to usually add the tidyverse library for most of the assignments.\nYou’ll also need to add any datasets you are working with that you need to import. For example, to add SPSS files like the “Album Sales” Dataset you’ll need to add the “haven library” and then include the code to import the file.\nOr if you import a .csv file, you’ll need to include the code for that as well.\nIf you’re not sure of where the file is, you can always import the file first and the copy the code from your console. Typically this is not included (include = FALSE), but sometimes we need to see how you set things up.\n\n\n\nr setup file"
},
{
"objectID": "Working with R Markdown.html#add-questions-from-the-assignment.",
"href": "Working with R Markdown.html#add-questions-from-the-assignment.",
"title": "Working with R Markdown",
"section": "3. Add questions from the assignment.",
"text": "3. Add questions from the assignment.\nFor each question you’ll want to copy and paste the question from your homework into R Markdown. For example, from your quiz.\n\nUse the “Star Wars” data from the tidy verse, use the “pipe” feature to filter the dataset to mass below 250. With the new dataset, create a scatterplot of mass and height."
},
{
"objectID": "Working with R Markdown.html#answer-the-question",
"href": "Working with R Markdown.html#answer-the-question",
"title": "Working with R Markdown",
"section": "4. Answer the question",
"text": "4. Answer the question\nR Markdown allows you to answer the question using both the code and output. So I’ll start with a descriptive sentence or annotation about what I’m doing and then show the code and the output.\nFilter dataset to mass below 250\n\nSmaller_StarWars <- starwars %>%\n filter(starwars$mass < 250)\nSmaller_StarWars\n\n# A tibble: 58 × 14\n name height mass hair_color skin_color eye_color birth_year sex gender\n <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> \n 1 Luke Sk… 172 77 blond fair blue 19 male mascu…\n 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…\n 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…\n 4 Darth V… 202 136 none white yellow 41.9 male mascu…\n 5 Leia Or… 150 49 brown light brown 19 fema… femin…\n 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…\n 7 Beru Wh… 165 75 brown light blue 47 fema… femin…\n 8 R5-D4 97 32 <NA> white, red red NA none mascu…\n 9 Biggs D… 183 84 black light brown 24 male mascu…\n10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…\n# ℹ 48 more rows\n# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,\n# vehicles <list>, starships <list>\n\n\nCreate a scatterplot of mass and height\n\nggplot(data = Smaller_StarWars) +\n geom_point(mapping = aes(x = mass, \n y = height))"
},
{
"objectID": "Working with R Markdown.html#knit-your-document",
"href": "Working with R Markdown.html#knit-your-document",
"title": "Working with R Markdown",
"section": "5 Knit your document",
"text": "5 Knit your document\nAfter your done with your document and want to turn it in, you’ll click the knit button and knit it as a word document.\nThe document will be saved in your working directory. The working directory is the file where all your current files are being saved. It should be listed in your console or you can type the command getwd() to find it as well.\nHere is a R Markdown template file to get you started:\nAssignment Template"
},
{
"objectID": "Building Databases.html",
"href": "Building Databases.html",
"title": "Building Databases",
"section": "",
"text": "Scientific Notation\nWhen you first get R and R Studio set up, it may be using scientific notation to express larger numbers. So you’ll see numbers like this \\[\n5.234e+10\n\\] This is a type of exponent, which is in a scientific notation format. Here’s a simpler example to understand what this means. Let’s start with a number like 28.\nIn scientific notation this would look like \\[\n2.8e+01\n\\] Or in an exponential form more familiar \\[\n2.8x10^1\n\\] So it’s 2.8 times 10 to the first power. 280 Would look like this \\[\n2.8e+02/\nor/\n2.8x10^2\n\\] It’s basically an easier way to represent larger numbers like 280 million (280,000,000) \\[\n2.8e+08/\nor/\n2.8x10^8\n\\] To turn this off this setting do the following\n\noptions(scipen = 999)\n\nIf you want to turn it back on, do this\n\noptions(scipen = 0)\n\n\n\nMore work on databases\nLet’s make up a database based on covid figures from the New York Times.\n\nCreate your objects\nMake sure to use quotations for objects that are names or titles (Remember these are categorical variables)\n\n\nCountries <- c(\"United States\", \"India\", \"Brazil\", \"Russia\", \"UK\")\n\n\nTotal_Cases <- c(24249722, 10581837,8511770,3574330, 3466849)\n\n\nTotal_Deaths <- c(400810, 152556, 210299,65632,91470)\n\nThen you can use the data.frame command to put them all together\n\nCovid <- data.frame(Countries, Total_Cases, Total_Deaths)\n\nYou could actually do all these steps at the same time\n\nCovid_Again <- data.frame(Countries = c(\"United States\", \"India\", \n \"Brazil\", \"Russia\", \"UK\"), \n Total_Cases = c(24249722, 10581837,8511770,\n 3574330, 3466849), \n Total_Deaths = c(400810, 152556, 210299,65632,91470))\n\nAnother nice way to make a dataset is by using a tibble\nThis is part of the tidyverse package and simplifies the code somewhat. Notice that the command to make a tibble is actually tribble.\n\nCovid_TR <- tribble(\n ~Countries, ~Total_Cases, ~Total_Deaths, \n \"United States\", 24249722, 400810, \n \"India\", 10581837, 152556, \n \"Brazil\", 8511770, 210299, \n \"Russia\", 3574330, 65632, \n \"UK\", 3466849, 91470\n)\n\nA tibble is nice because it sets it up more like a spreadsheet.\nNotice that the ~ specifies the columns or variables and then the rest are like rows.\n\n\nManipulate Data\nMortality rate is total deaths divided by the total number of cases. You can use R to calculate this for you and then create the object.\n\nMortality_Rate <- c(Total_Deaths/Total_Cases)\n\nThen we can add all four variables together to remake our covid data.frame\n\nCovid <- data.frame(Countries,Total_Cases,Total_Deaths,Mortality_Rate)\n\nTidyverse supplies some other helps here if we are using tibbles.\nWe can use mutate to add in the other variable based on a computation.\n\nCovid_TR <- mutate(Covid_TR, Mortality_Rate = Total_Deaths/Total_Cases)\n\nWe can use rename to change the name of our variable\n\nCovid_TR <- rename(Covid_TR, Mortality = Mortality_Rate)\n\nYou can practice this on your own.",
"crumbs": [
"R Basics",
"Building Databases"
]
},
{
"objectID": "hello.html",
"href": "hello.html",
"title": "Hello, Quarto",
"section": "",
"text": "Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org."
},
{
"objectID": "hello.html#meet-quarto",
"href": "hello.html#meet-quarto",
"title": "Hello, Quarto",
"section": "",
"text": "Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org."
},
{
"objectID": "hello.html#meet-the-penguins",
"href": "hello.html#meet-the-penguins",
"title": "Hello, Quarto",
"section": "Meet the penguins",
"text": "Meet the penguins\n\nThe penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.\nThe plot below shows the relationship between flipper and bill lengths of these penguins."
}
]