Multidimensional-Analysis/stress_analysis.Rmd at main · lckh24/Multidimensional-Analysis · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
---
title: "Stress Analysis"
author: "Khanh Le"
date: "2024-06-17"
output:
  html_document:
    keep_md: true
---

# [Import dữ liệu và các thư viện cần thiết]()
```{r setup, include=FALSE}
knitr::opts_knit$set(root.dir = normalizePath("D:/R/project/"))
R.version
getwd()
```

```{r}
pacman::p_load(
  ggplot2,
  mvtnorm,
  GGally,
  corrplot,
  readxl,
  tidyverse,
  gridExtra,
  grid,
  plotly,
  ggcorrplot,
  FactoMineR,
  factoextra,
  rgl,
  scatterplot3d,
  inspectdf,
  mvnormtest,
  psych
)
```

```{r}
data = read.csv("data/StressLevelDataset.csv")
head(data)
```
# [Thông tin về dữ liệu và phương pháp nghiên cứu]()
* **Bộ dữ liệu gồm 1100 quan trắc và 21 biến như sau:**
   1. anxiety-level: Mức độ lo lắng từ 0 (lo lắng thấp) đến 21 (lo lắng cao).
   2. self-esteem: Lòng tự trọng từ 0 (lòng tự trọng thấp) đến 30 (lòng tự trọng cao).
   3. mental-health-history: Tiền sử sức khỏe tâm thần, thể hiện liệu học sinh có tiền sử mắc các vấn đề về sức khỏe tâm thần hay không.
   4. depression: Trầm cảm dao động từ 0 (trầm cảm thấp) đến 27 (trầm cảm cao).
   5. headache: Tần suất đau đầu mà học sinh gặp phải, dao động từ 0 (không đau đầu) đến 5 (đau đầu thường xuyên).
   6. blood-pressure: Huyết áp với các giá trị từ 1 (thấp) đến 3 (cao).
   7. sleep-quality: Chất lượng giấc ngủ thang điểm từ 0 (chất lượng kém) đến 5 (chất lượng tuyệt vời).
   8. breathing-problem: Cho biết học sinh có gặp vấn đề về hô hấp hay không
   9. noise-level: Nhận thức về mức độ tiếng ồn trong môi trường của học sinh, từ 0 (tiếng ồn thấp) đến 5 (tiếng ồn cao).
   10. living-conditions: Điều kiện sống, các giá trị từ 0 (điều kiện kém) đến 5 (điều kiện tuyệt vời).
   11. safety: Sự an toàn, từ 0 (không an toàn) đến 5 (rất an toàn).
   12. basic-needs: Nhu cầu cơ bản, từ 0 (không hài lòng) đến 5 (hoàn toàn hài lòng).
   13. academic-performance: Thành tích học tập, giá trị từ 0 (kém) đến 5 (xuất sắc).
   14. study-load: Nhận thức của học sinh về khối lượng học tập của mình, từ 0 (nhẹ) đến 5 (nặng).
   15. teacher-student-relationship: Chất lượng mối quan hệ với giáo viên, với các giá trị từ 0 (kém) đến 5 (xuất sắc).
   16. future-career-concerns: Mối quan tâm về triển vọng nghề nghiệp trong tương lai, từ 0 (mối quan tâm thấp) đến 5 (mối quan tâm cao).
   17. social-support: Mức độ hỗ trợ xã hội mà học sinh nhận được, từ 0 (hỗ trợ thấp) đến 3 (hỗ trợ cao).
   18. peer-pressure: Ảnh hưởng của áp lực từ bạn bè, với giá trị từ 0 (áp lực thấp) đến 5 (áp lực cao).
   19. extracurricular-activities: Các hoạt động ngoại khóa, từ 0 (không tham gia) đến 5 (tham gia tích cực).
   20. bullying: Bắt nạt, từ 0 (không bị bắt nạt) đến 5 (thường xuyên bị bắt nạt).
   21. stress-level: Mức độ căng thẳng, giá trị từ 0 (căng thẳng thấp) đến 2 (căng thẳng cao).

```{r}
str(data)
summary(data)
```


```{r}
# Kiểm tra dòng trùng lặp
duplicate_check <- anyDuplicated(data)

# In kết quả
colSums(is.na(data))

cat("Có dòng trùng lặp:", duplicate_check > 0, "\n")
```
# [Tìm hiểu các thông tin hữu ích từ dữ liệu]()


* **Phân Phối Của Dữ Liệu**

```{r, fig.width=25, fig.height=15}
plot_all_histograms <- function(data, fill_color = "skyblue", alpha_value = 0.5, base_size = 15, binwidth = 1) {
  plot_list <- list()
  for (var in names(data)) {
    p = ggplot(data, aes_string(x = var)) +
      geom_histogram(fill = fill_color, color = "blue", alpha = alpha_value, binwidth = binwidth) +
      theme_minimal(base_size = base_size) +
      labs(title = var,
           x = "",
           y = "Frequency") +
      geom_rug(sides = "b")
     plot_list[[var]] <- p
  }
  title_grob <- textGrob("Distribution Of Numerical Columns", gp = gpar(fontsize = 25, fontface = "bold"))
  grid.arrange(title_grob, grobs=plot_list, ncol = 7, nrow=3)
}

plot_all_histograms(data)
```

* **Nhận xét:**

1. Anxiety Level
    - Mức độ lo lắng có phân bố khá đều với sự gia tăng nhẹ về tần suất ở mức cao (20-21).
    - Phần lớn học sinh có mức độ lo lắng nằm trong khoảng từ 5 đến 20.
2. Self-Esteem
    - Lòng tự trọng có phân bố đa dạng với đỉnh ở khoảng 20-30.
    - Học sinh có lòng tự trọng ở khắp dải (0-30), nhưng sự gia tăng đáng chú ý ở mức 20 cho thấy nhiều học sinh đánh giá lòng tự trọng của họ ở mức cao hơn trung bình.
3. Mental Health History
    - Số lượng học sinh có tiền sử về sức khỏe tâm thần nhiều hơn những học sinh bình thường nhưng chênh lệch không nhiều.
4. Depression
    - Điểm trầm cảm dao động rộng với đỉnh ở khoảng giữa (10-15).
5. Headache
    - Tần suất đau đầu khác nhau, tập trung nhiều ở 1 và 3
    - Điều này cho thấy đau đầu là một vấn đề phổ biến trong số các học sinh được khảo sát.
6. Blood Pressure
    - Huyết áp được phân loại thành ba mức (1: thấp, 2: bình thường, 3: cao).
    - Phần lớn học sinh có huyết áp cao (mức 3), với ít học sinh ở mức thấp (mức 1) và bình thường (mức 2).
7. Sleep Quality
    - Phân bố cho thấy nhiều học sinh báo cáo chất lượng giấc ngủ kém (0) trong khi một nhóm nhỏ báo cáo chất lượng giấc ngủ tuyệt vời (5).
8. Breathing Problem
    - Các học sinh có vấn đề về hô hấp tập trung nhiều ở mức 2 và mức 4.
    - Chỉ một nhóm nhỏ học sinh không có vấn đề về hô hấp.
9. Noise Level
    - Hầu hết học sinh đều chịu tác động tiếng ồn ở mức từ khá đến cao.
10. Living Conditions
    - Dàn trải khá chênh lệch, tập trung nhiều ở mức khá.
11. Safety
    -  Phần lớn học sinh đều có sự an toàn nhất định, tập trung nhiều ở mức 2 trở đi
12. Basic Needs
    - Phân bố cho thấy phần lớn học sinh hài lòng với nhu cầu cơ bản của mình ở mức 2 và 3, với một số ít học sinh hài lòng ở mức cao hơn hoặc thấp hơn
13. Academic Performance
    - Phần lớn học sinh có thành tích học tập ở mức 2, với một số ít học sinh có thành tích xuất sắc hoặc kém hơn. Điều này cho thấy phần lớn học sinh có thành tích học tập trung bình.
14. Study Load
    - Phân bố cho thấy phần lớn học sinh đánh giá khối lượng học tập của mình ở mức 2 và 3, với một số ít học sinh đánh giá khối lượng học tập ở mức cao hơn hoặc thấp hơn.
15. Teacher-Student Relationship
    - Phần lớn học sinh đánh giá mối quan hệ giáo viên-học sinh ở mức trung bình (2 và 3). Họ có mối quan hệ tương đối tốt với giáo viên.
16. Future Career Concerns
    - Phân bố cho thấy phần lớn học sinh có mối quan tâm về triển vọng nghề nghiệp ở mức trung bình (2 và 3).
17. Social Support
    - Phần lớn học sinh cảm thấy họ nhận được sự hỗ trợ xã hội đáng kể. (Mức 2 và 3).
18. Peer Pressure
    - Phần lớn học sinh cảm thấy áp lực từ bạn bè ở mức trung bình (2 và 3).
19. Extracurricular Activities
    - Phần lớn học sinh có tham gia hoạt động ngoại khóa nhưng không quá tích cực.
20. Bullying
    - Vấn đề bắt nạt không quá phổ biến trong học sinh. (Mức 0 và 1)
21. Stress Level
    - Phần lớn học sinh cảm thấy mức độ căng thẳng ở mức trung bình đến cao (1 và 2).
* **Boxplot của dữ liệu**


```{r, fig.width=20, fid.height=10}
# Plot the boxplot of all the columns and outliers
boxplot_all <- function(data, fill_color = "skyblue", alpha_value = 0.5, base_size = 15) {
  plot_list <- list()
  for (var in names(data)) {
    p = ggplot(data, aes_string(x = var)) +
      geom_boxplot(fill = fill_color, color = "blue", alpha = alpha_value) +
      theme_minimal(base_size = base_size) +
      labs(title = var,
           x = "",
          ) +
      geom_rug(sides = "l")
    plot_list[[var]] <- p
  }
  title_grob <- textGrob("Boxplot Of Numerical Columns", gp = gpar(fontsize = 25, fontface = "bold"))
  grid.arrange(title_grob, grobs=plot_list, ncol = 7, nrow=3)
}
boxplot_all(data)
```


* 3D scatter plot

```{r}
# The 3D plot allows you to visualize the relationship between three variables: anxiety_level, self_esteem, and stress_level. using plotly with clusters
plot_ly(data, x = ~anxiety_level, y = ~self_esteem, z = ~stress_level, color = ~stress_level, colors = c("cornsilk3", "aquamarine4", "azure4"), marker = list(size = 5)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = "Anxiety Level"),
                      yaxis = list(title = "Self Esteem"),
                      zaxis = list(title = "Stress Level")))


```


* **Correlation of the data**


``` {r, fig.width=15 , fid.height=10}
corrplot(cor(data),
         method = "number",
         type = "upper",
         order = "hclust",
         tl.col = "black",
         tl.srt = 45)
```

Hệ số tương quan giữa mức độ căng thẳng và các chỉ số sức khỏe là dương và có ý nghĩa đối với hầu hết các chỉ số sức khỏe. Điều này có nghĩa là khi mức độ căng thẳng tăng lên, các chỉ số sức khỏe cũng tăng theo. Điều này cho thấy rằng căng thẳng có liên quan đến một số hậu quả tiêu cực đối với sức khỏe.

Mối tương quan mạnh nhất là giữa mức độ căng thẳng và kiệt sức, với hệ số tương quan là 0,675. Điều này cho thấy rằng kiệt sức là hậu quả rất phổ biến của căng thẳng.

Các chỉ số sức khỏe khác có mối tương quan mạnh với mức độ căng thẳng bao gồm:

- Trầm cảm (hệ số tương quan = 0,73)
- Lo lắng (hệ số tương quan = 0,74)
- Đau đầu (hệ số tương quan = 0,71)
- Chất lượng giấc ngủ (hệ số tương quan = -0,75)

Mối tương quan âm giữa mức độ căng thẳng và chất lượng giấc ngủ cho thấy rằng căng thẳng có thể dẫn đến chất lượng giấc ngủ kém. Đổi lại, chất lượng giấc ngủ kém có thể dẫn đến mức độ căng thẳng tăng cao. Điều này có thể tạo ra một vòng luẩn quẩn.

# [EDA]()

* **Phân loại dữ liệu**
Trong tập dữ liệu này, có ba cột sẽ được chuyển đổi thành các danh mục dựa trên các tỷ lệ nhất định. Ba cột là ```anxiety_level```, ```self_esteem``` và ```depression```.

* Với ```anxiety_level```:
    - 0-4: Minimal
    - 5-9: Mild
    - 10-14: Moderate
    - > 14: Severe
* Với ```depression```:
    - 0-4: None-minimal
    - 5-9: Mild
    - 10-14: Moderate
    - 15-19: Moderately Severe
    - 20-27: Severe
* Với ```self_esteem```:
    - 0-14: Low
    - 15-25: Normal
    - 26-30: High

```{r}
eda = data
```


```{r}
eda$anxiety_level <- ifelse(eda$anxiety_level < 5, "Minimal",
                       ifelse(eda$anxiety_level >=5 & eda$anxiety_level < 10, "Mild",
                       ifelse(eda$anxiety_level >=10 & eda$anxiety_level < 15, "Moderate", "Severe")))

eda$depression <- ifelse(eda$depression < 5, "None-Minimal",
                    ifelse(eda$depression >=5 & eda$depression < 10, "Mild",
                    ifelse(eda$depression >=10 & eda$depression < 15, "Moderate",
                    ifelse(eda$depression >=15 & eda$depression < 20, "Moderately Severe", "Severe"))))

eda$self_esteem <- ifelse(eda$self_esteem < 15, "Low",
                     ifelse(eda$self_esteem >=15 & eda$self_esteem <= 25, "Normal", "High"))
```


* Kiểu dữ liệu cho mỗi cột có thể được thay đổi thành loại yếu tố vì tập dữ liệu có số lượng lớp tương đối nhỏ, cụ thể là từ 2 đến 6 lớp. Sau khi thay đổi kiểu dữ liệu, các cấp độ lớp trong các cột anxiety_level, self_esteem và depression được đặt từ lớp nhỏ nhất đến lớp cao nhất.
```{r}
eda <- eda %>%
  mutate_all(as.factor)
```

```{r}
eda$anxiety_level <- factor(eda$anxiety_level, levels=c("Minimal", "Mild", "Moderate", "Severe"))
eda$self_esteem <- factor(eda$self_esteem, levels=c("Low", "Normal", "High"))
eda$depression <- factor(eda$depression, levels=c("None-Minimal", "Mild", "Moderate", "Moderately Severe", "Severe"))
```

```{r}
glimpse(eda)

```

Tiếp theo, chúng ta có thể tìm ra tỷ lệ các lớp trong mỗi biến phân loại bằng cách sử dụng `inspect_cat` từ package `inspectdf` như sau:

```{r, fig.width=20, fig.height=8}
# plot inspect_cat
show_plot(inspect_cat(eda))
```


```{r, fig.width=20, fig.height=10}
grid.arrange(
  eda %>%
    ggplot(aes(x = stress_level, fill = stress_level)) +
    geom_bar() +
    scale_fill_manual(values = c("cornsilk3", "aquamarine4", "azure4")) +
    geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 5) + # Tăng cỡ chữ cho số lượng
    labs(title = "Number of students by Severity of stress_level",
         x = "Stress Level",
         y = "Number of students") +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 20),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        legend.title = element_text(size = 20),
        legend.text = element_text(size = 20)
    ),


  eda %>%
    ggplot(aes(x = depression, fill = depression)) +
    geom_bar() +
    theme_minimal(base_size = 20) +
    scale_fill_manual(values = c("cornsilk3", "aquamarine4", "azure4", "burlywood4", "coral4")) +
    geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
    labs(title = "Number of students by Severity of depression",
         x = "depression",
         y = "Number of students") +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 14),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 14)
    ),


  eda %>%
    ggplot(aes(x = anxiety_level, fill = anxiety_level)) +
    geom_bar() +
    theme_minimal(base_size = 20) +
    scale_fill_manual(values = c("cornsilk3", "aquamarine4", "azure4", "burlywood4", "coral4")) +
    geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
    labs(title = "Number of students by Severity of Anxiety level",
         x = "Anxiety Level",
         y = "Number of students") +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 20),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        legend.title = element_text(size = 20),
        legend.text = element_text(size = 20)
    ),


  eda %>%
    ggplot(aes(x = self_esteem, fill = self_esteem)) +
    geom_bar() +
    theme_minimal(base_size = 20) +
    scale_fill_manual(values = c("cornsilk3", "aquamarine4", "azure4")) +
    geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
    labs(title = "Number of students by Severity of self_esteem",
         x = "Self Esteem",
         y = "Number of students") +
    theme_minimal() +
     theme(
        plot.title = element_text(size = 20),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        legend.title = element_text(size = 20),
        legend.text = element_text(size = 20)
    ),

  ncol = 2
)
```


```{r}
# Using plotly to create interactive plots anxiety_level towards stress_level with text on bar shape
plot_ly(eda, x = ~anxiety_level, color = ~stress_level, colors = c("cornsilk3", "coral4", "black"), type = "histogram") %>%
  layout(title = "Anxiety Level vs Stress Level",
         xaxis = list(title = "Anxiety Level"),
         yaxis = list(title = "Count"),
         barmode = "group")
```

* Từ biểu đồ này, mức độ lo lắng có xu hướng tỷ lệ thuận với mức độ căng thẳng của học sinh. Khi mức độ lo lắng tăng lên thì mức độ căng thẳng cũng có xu hướng tăng lên


```{r}
# Next, we will try to see the correlation between the anxiety_level variable and the target variable stress_level.
# We will use the chi-square test to check the relationship between the two variables.
# The chi-square test is used to determine whether there is a significant association between two categorical variables.
# The null hypothesis is that there is no association between the two variables, while the alternative hypothesis is that there is an association between the two variables.
# If the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a significant association between the two variables.
# If the p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that there is no significant association between the two variables.

# Chi-square test between anxiety_level and stress_level

chi_square_test_anxiety_stress <- chisq.test(data$anxiety_level, data$stress_level)
chi_square_test_anxiety_stress
```


```{r}
# Next, we will also see the correlation between mental health history and stress levels using the same function as the previous plot.
eda_clone = eda
# change 0 to Doesn't have mental health history and 1 to Have mental health history
eda_clone$mental_health_history <- ifelse(eda_clone$mental_health_history == 0,
                                          "Doesn't have mental health history",
                                          "Have mental health history")

plot_ly(eda_clone,
        x = ~mental_health_history,
        color = ~stress_level,
        colors = c("cornsilk3", "coral4", "black"),
        type = "histogram") %>%
        layout(title = "Mental Health History vs Stress Level",
               xaxis = list(title = "Mental Health History"),
               yaxis = list(title = "Count"),
               barmode = "stack")

```


* Từ biểu đồ trên, ta thấy rằng học sinh có tiền sử về sức khỏe tâm thần có xu hướng có mức độ căng thẳng cao hơn so với học sinh không có tiền sử về sức khỏe tâm thần. Để chắc chắn hơn, ta có thể sử dụng kiểm định chi bình phương để kiểm tra, như sau:

```{r}
# Chi-square test between mental_health_history and stress_level
chi_square_test_mental_health_stress <- chisq.test(data$mental_health_history, data$stress_level)
chi_square_test_mental_health_stress
```

* Thực hiện kiểm định MANOVA trên biến độc lập là: mental_health_history.
H0: Không có sự khác biệt về `depression`, `stress_level`, `anxiety_level` theo tiền sử sức khỏe tâm thần.
H1: Có sự khác biệt về `depression`, `stress_level`, `anxiety_level` theo tiền sử sức khỏe tâm thần.

```{r}
# MANOVA test
manova_test <- manova(cbind(depression, stress_level, anxiety_level) ~ mental_health_history, data = data)

summary(manova_test)

```

* **Nhận xét:**
    - P-value của kiểm định MANOVA < 0.05, chúng ta bác bỏ giả thuyết H0 và chấp nhận giả thuyết H1. Điều này cho thấy rằng có sự khác biệt về `depression`, `stress_level`, `anxiety_level` theo tiền sử sức khỏe tâm thần.

Vì có sự khác biệt đáng kể trong thử nghiệm MANOVA, nên thử nghiệm ANOVA với mức ý nghĩa 0.05 sẽ được thực hiện để xác định xem các biến có giá trị trung bình khác nhau như thế nào theo tiền sử sức khỏe tâm thần.

```{r}
# alpha = 0.05
print("ANOVA test for depression")
anova_test <- aov(depression ~ mental_health_history, data = data)
summary(anova_test)

```

```{r}
print("ANOVA test for stress_level")
anova_test <- aov(stress_level ~ mental_health_history, data = data)
summary(anova_test)
```

```{r}
print("ANOVA test for anxiety_level")
anova_test <- aov(anxiety_level ~ mental_health_history, data = data)
summary(anova_test)

```


* **Nhận xét:**
 - Trong cả ba biến số (trầm cảm, căng thẳng, và lo âu), kết quả ANOVA đều cho thấy F value rất lớn và p-value rất nhỏ (nhỏ hơn 0.001). Điều này cho phép kết luận rằng lịch sử sức khỏe tâm thần có ảnh hưởng đáng kể và có ý nghĩa thống kê đối với mức độ trầm cảm, căng thẳng, và lo âu.


* Physiological Factors:
    - Blood Pressure
    - Headache
    - Breathing Problem
    - Sleep Quality
```{r}
# How many students experience headaches frequently?
headache_freq <- data %>%
  filter(headache >= 3) %>%
  nrow()
print(paste("Number of students experiencing frequent headaches:", headache_freq))
# What is the average blood pressure reading among the students?
avg_blood_pressure <- mean(data$blood_pressure)
print(paste("Average blood pressure reading among students:", round(avg_blood_pressure)))
# How many students rate their sleep quality as poor?
poor_sleep_quality <- data %>%
  filter(sleep_quality <= 2) %>%
  nrow()
print(paste("Number of students rating their sleep quality as poor:", poor_sleep_quality))
```

* Environmental Factors:
    - Academic Performance
    - Study Load
    - Teacher-Student Relationship
    - Future Career Concerns

```{r}
# How many students live in conditions with high noise levels?
high_noise_levels <- data %>%
  filter(noise_level >= 4) %>%
  nrow()
print(paste("Number of students living in conditions with high noise levels:", high_noise_levels))
# What percentage of students feel unsafe in their living conditions?
unsafe_living_conditions <- data %>%
  filter(safety == 1) %>%
  nrow()
print(paste("Number of students who feel unsafe in their living conditions:", unsafe_living_conditions))
# How many students have reported not having their basic needs met?
basic_needs_not_met <- data %>%
  filter(basic_needs <= 2) %>%
  nrow()
print(paste("Number of students who have reported not having their basic needs met:", basic_needs_not_met))
```

* Academic Factors
    - Academic Performance
    - Study Load
    - Teacher-Student Relationship
    - Future Career Concerns

```{r}
# How many students rate their academic performance as below average? using ggplot
below_average_academic_performance <- data %>%
  filter(academic_performance <= 2) %>%
  nrow()
print(paste("Number of students rating their academic performance as below average:", below_average_academic_performance))
# What is the average study load reported by students?
avg_study_load <- mean(data$study_load)
print(paste("Average study load reported by students:", round(avg_study_load,2)))
# How many students have concerns about their future careers?
future_career_concerns <- data %>%
  filter(future_career_concerns >= 4) %>%
  nrow()
print(paste("Number of students who have concerns about their future careers:", future_career_concerns))
```

* Social Factors
```{r}
# How many students feel they have strong social support?
strong_social_support <- data %>%
  filter(social_support >= 3) %>%
  nrow()
print(paste("Number of students who feel they have strong social support:", strong_social_support))
# What percentage of students have experienced bullying?

bullying <- data %>%
  filter(bullying >= 1) %>%
  nrow()
print(paste("Number of students who have experienced bullying:", round(bullying/1100,4)*100, "%"))
# How many students participate in extracurricular activities?
extracurricular_activities <- data %>%
  filter(extracurricular_activities >= 1) %>%
  nrow()
print(paste("Number of students who participate in extracurricular activities:", extracurricular_activities))
#
```

# PCA
Với các phương pháp phân tích nhiều chiều đã học ở học phần, hãy chọn phương
pháp phân tích phù hợp cho vấn đề nghiên cứu do nhóm đặt ra với bộ dữ liệu
này. Nếu có thể thực hiện phân tích dựa trên ít nhất hai phương pháp khác nhau
và so sánh kết quả, rút ra kết luận, ... (nếu cần có thể tham khảo thêm tài liệu)

=> Mục đích: Xác định các thành phần chính ảnh hưởng lớn nhất đến tình trạng sức khỏe tâm thần của học sinh, từ đó đưa ra những đề xuất cải thiện môi trường học tập và sức khỏe tinh thần.

```{r}
data.pca <- PCA(data, graph=F)
data.pca
eig.val <- get_eigenvalue(data.pca)
eig.val
```


* Scree plot
```{r}
fviz_eig(data.pca, addlabels = TRUE, ylim = c(0, 100))
```
* Nhận xét:
   - Để quyết định số lượng thành phần chính cần chọn dựa trên scree plot, ta có thể dựa vào điểm gãy của biểu đồ tức là nơi mà độ giảm của phương sai giải thích trở nên nhỏ hơn đáng kể. Từ biểu đồ trên,  thành phần chính đầu tiên giải thích 60.5% phương sai,thành phần chính thứ hai giải thích 5.7%, và từ thành phần chính thứ ba trở đi, phương sai giải thích đều giảm dần nhưng không giảm đột ngột. Ta sẽ cân nhắc chọn 2 hoặc 4 thành phần chính.
   - Nếu chọn 2 thành phần chính: ta sẽ có 66.2% phương sai được giải thích. Mô hình sẽ trở nên đơn giản, tập trung vào các yếu tố chính nhất. Đồng thời dễ dàng trực quan hóa và diễn giải dữ liệu.
   - Nếu chọn 4 thành phần chính: ta sẽ có 72.3% phương sai được giải thích. Ta phải sẵn sàng chấp nhận mức độ phức tạp cao hơn trong mô hình, nhưng sẽ giải thích được nhiều thông tin hơn về dữ liệu.
   => Theo mục đích nghiên cứu ban đầu thì ta sẽ chọn 4 thành phần chính để giải thích dữ liệu.

* Thông tin về loadings

```{r, fig.width=20, fig.height=8}
loadings <- as.data.frame(data.pca$var$coord)
loadings$Variable <- rownames(loadings)
loadings

p1 <- ggplot(loadings, aes(x = reorder(Variable, Dim.1), y = Dim.1)) +
  geom_bar(stat = "identity", fill = "cornsilk4") +
  coord_flip() +
  labs(title = "Loadings on PC1", x = "Variables", y = "Loadings") +
  theme_minimal() +
  theme(
        plot.title = element_text(size = 14),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 14)
    )

p2 <- ggplot(loadings, aes(x = reorder(Variable, Dim.2), y = Dim.2)) +
  geom_bar(stat = "identity", fill = "coral3") +
  coord_flip() +
  labs(title = "Loadings on PC2", x = "Variables", y = "Loadings") +
  theme_minimal() +
  theme(
        plot.title = element_text(size = 14),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 14)
    )

p3 <- ggplot(loadings, aes(x = reorder(Variable, Dim.3), y = Dim.3)) +
  geom_bar(stat = "identity", fill = "aquamarine4") +
  coord_flip() +
  labs(title = "Loadings on PC3", x = "Variables", y = "Loadings") +
  theme_minimal() +
  theme(
        plot.title = element_text(size = 14),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 14)
    )

grid.arrange(p1, p2, p3, nrow = 2, ncol = 2)
```

```{r, fig.width=40, fig.height=8}
contrib_PC1 <- as.data.frame(data.pca$var$contrib[,1])
colnames(contrib_PC1) <- c("Contribution")
contrib_PC1$Variable <- rownames(contrib_PC1)

contrib_PC2 <- as.data.frame(data.pca$var$contrib[,2])
colnames(contrib_PC2) <- c("Contribution")
contrib_PC2$Variable <- rownames(contrib_PC2)

contrib_PC3 <- as.data.frame(data.pca$var$contrib[,3])
colnames(contrib_PC3) <- c("Contribution")
contrib_PC3$Variable <- rownames(contrib_PC3)

p1 = ggplot(contrib_PC1, aes(x = reorder(Variable, Contribution), y = Contribution)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Contribution of Variables to Dim 1", x = "", y = "Contribution (%)") +
  theme_minimal()

p2 = ggplot(contrib_PC2, aes(x = reorder(Variable, Contribution), y = Contribution)) +
  geom_bar(stat = "identity", fill = "salmon") +
  coord_flip() +
  labs(title = "Contribution of Variables to Dim 2", x = "", y = "Contribution (%)") +
  theme_minimal()

p3 = ggplot(contrib_PC3, aes(x = reorder(Variable, Contribution), y = Contribution)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(title = "Contribution of Variables to Dim 3", x = "", y = "Contribution (%)") +
  theme_minimal()

grid.arrange(p1,
             p2,
             p3,
             nrow=1)
```

```{r}
```


```{r, fig.width=15, fig.height=8}
# Biểu đồ biểu diễn các thành phần chính đầu tiên
p1 <-  fviz_pca_var(data.pca,
               col.var = "contrib",
               gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
               repel = TRUE)

p2 <- fviz_pca_var(data.pca, axes = c(2, 3),
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)
grid.arrange(p1, p2, nrow = 1)
```

```{r,  fig.width=15, fig.height=8}
fviz_pca_biplot(data.pca,
                axes = c(1, 2),
                geom.ind = "point",
                col.ind = eda$mental_health_history,
                palette = c("#00AFBB", "#FC4E07"),
                addEllipses = TRUE,
                legend.title = "Groups",
                label = "var")
```


# Phân tích nhân tố
## Kiểm định KMO
```{r}
kmo_fa <- KMO(data)
kmo_fa
```

## Kiểm định Bartlett
```{r}
cortest.bartlett(data)
```

## Xác định số lượng các nhân tố chính rút ra
Trong phân tích EFA, căn cứ để xác định các nhân tố chính được rút ra là sử dụng giá trị của Eigenvalue. Theo tiêu chuẩn của Kaiser thì nhân tố chính được rút ra phải có Eigenvalue > 1. Một tiêu chuẩn khác ít nghiêm ngặt hơn đó là tiêu chuẩn của Jolliffe, theo Jolliffe thì các nhân tố có Eigenvalue > 0.7 có thể được chọn. Trong bộ dữ liệu này, số lượng nhân tố chính được rút ra dựa vào tiêu chuẩn của Kaiser.
```{r}
fviz_eig(data.pca, addlabels = TRUE, ylim = c(0, 100), n=10) +
  labs(title = "Scree Plot", x = "Principal Component", y = "Percentage of Variance") +
  theme_minimal()
```

Ở trên, sử dụng Eigenvalue theo tiêu chuẩn Kaiser ta đã trích ra được 3 nhân tố chính, để biết các nhân tố này được cấu thành từ những biến nào ta sử dụng phép xoay Varimax.
```{r}
pc2 = principal(data, nfactors = 3, rotate = "varimax")
print.psych(pc2, cut = 0.4, sort = TRUE)
```
## Kiểm định Cronbach Alpha cho thang đo

```{r}
psych::alpha(data[, c('safety', 'teacher_student_relationship', 'academic_performance', 'basic_needs', 'sleep_quality', 'living_conditions', 'headache', 'self_esteem', 'stress_level')])
```


```{r}
psych::alpha(data[, c('self_esteem', 'blood_pressure')])
```


```{r}
psych::alpha(data[, c('peer_pressure', 'study_load', 'future_career_concerns', 'extracurricular_activities', 'noise_level', 'stress_level', 'bullying', 'anxiety_level', 'future_career_concerns', 'depression', 'breathing_problem', 'mental_health_history')])


```