Course: Humanistic AI & Data Science (4th Semester)
Institution: PUC-SP
✨ Professor: Erick Bacconi
✨ Professor: Rooney Ribeiro Albuquerque Coelho
This project analyzes an anonymized dataset of student grades to uncover insights into academic performance, identify patterns, and explore the viability of predictive models for educational analytics.
It was developed as part of the Social Media Marketing course at PUC-SP, applying principles of data storytelling, analytical thinking, and structured reporting.
Note
- Projects and deliverables may be made publicly available whenever possible.
- The course prioritizes hands-on practice with real data in consulting scenarios.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Confidential information from this repository is kept in private repositories.
Important
• This repository: 8- Social Buss: Extension Project: Social Pulse - A Machine Learning Approach to Academic Performance Modeling and Analytics — is part of the overall structure defined in the main hub and follows the same standards, organization, and documentation patterns.
• The Main Repository: 1-social-buzz-ai-main serves as the central hub for this discipline, consolidating all project files, documentation, and links to the related sub-repositories.
- Overview
- Objectives
- Dataset
- Methodology
- Exploratory Data Analysis
- Feature Engineering
- Predictive Modeling
- Results
- Conclusion
- How to Run
- English Summary
- Hashtags
A complete end-to-end Machine Learning pipeline designed to explore, model, and predict student academic performance using anonymized institutional records. This project demonstrates a full data science workflow — from raw data cleaning to predictive modeling and interpretable insights.
This repository contains a structured analysis of student academic performance, developed as part of an educational project. Although initially created within a Social Media Marketing course, the project leverages the dataset to practice core Data Science and Machine Learning skills, including:
* Data cleaning and preparation
* Exploratory Data Analysis (EDA)
* Feature engineering
* Predictive modeling with Random Forest
* Model explainability using SHAP
* Insight communication and structured storytelling
Tip
- The result is a complete case study in Educational Analytics, focusing on detecting and understanding student dropout risk.
* Metadata mixed with data rows
* Multiple header rows
* Duplicate columns
* Irregular formatting and inconsistent values
Tip
- All issues were systematically corrected, producing a clean and reliable dataset for analysis and modeling.
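The cleaning steps listed above could be sketched with pandas roughly as follows. This is only an illustration: the inline CSV, the metadata line, and the column names (`student_id`, `status`, `failures`) are invented, since the layout of the real raw file is not shown here.

```python
import io

import pandas as pd

# Hypothetical raw export: one metadata line, a header row that repeats
# mid-file, a duplicated "status" column, and inconsistent text values.
RAW = """Report generated by the academic system,,,
student_id,status,failures,status
1, Enrolled ,0, Enrolled
student_id,status,failures,status
2,dropped,2,dropped
3,ENROLLED,1,ENROLLED
"""


def clean(raw_text: str) -> pd.DataFrame:
    # Read everything as text with no header so nothing is silently renamed.
    rows = pd.read_csv(io.StringIO(raw_text), header=None, dtype=str)

    # The first row whose first cell is the known key column is the real header.
    is_header = rows[0] == "student_id"
    first_header = int(is_header.idxmax())
    header = rows.iloc[first_header].tolist()

    # Keep rows after the header: this drops the metadata above it,
    # then the repeated header rows further down.
    body = rows.iloc[first_header + 1:]
    body = body[body[0] != "student_id"]
    body.columns = header

    # Drop duplicate columns, keeping the first occurrence.
    body = body.loc[:, ~body.columns.duplicated()].copy()

    # Normalize inconsistent values and restore numeric types.
    body["status"] = body["status"].str.strip().str.lower()
    body["student_id"] = body["student_id"].astype(int)
    body["failures"] = body["failures"].astype(int)
    return body.reset_index(drop=True)


cleaned = clean(RAW)
print(cleaned)
```

Reading with `header=None` is a deliberate choice in this sketch: it keeps pandas from renaming the duplicated column to `status.1`, so duplicates can be detected by name afterwards.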
* High dropout and absenteeism rates
* "No response from the student" was the most frequent reason for non-attendance
* Strong patterns linked to academic history and semester status
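Findings like the ones above typically come from simple frequency and group-rate tables. The sketch below shows one way to compute them with pandas; the five sample rows and the column names `non_attendance_reason` and `dropped_out` are made up for illustration and do not come from the project dataset.

```python
import pandas as pd

# Hypothetical sample; column names and values are illustrative only.
df = pd.DataFrame({
    "non_attendance_reason": [
        "No response from the student",
        "No response from the student",
        "Health issue",
        "Work schedule",
        "No response from the student",
    ],
    "dropped_out": [1, 1, 0, 0, 1],
})

# Share of each non-attendance reason, most frequent first.
reason_share = df["non_attendance_reason"].value_counts(normalize=True)

# Dropout rate within each reason group.
dropout_by_reason = df.groupby("non_attendance_reason")["dropped_out"].mean()

print(reason_share)
print(dropout_by_reason)
```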
* Encoding categorical academic attributes
* Handling class imbalance
* Creating derived features based on academic progression
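A minimal sketch of these three steps is given below, under stated assumptions: all column names and values are hypothetical, and class imbalance is handled with "balanced" class weights. The project lists imbalanced-learn among its dependencies, so it may well have used resampling instead; class weights are used here only because they need no extra library.

```python
import pandas as pd

# Hypothetical records; every column name here is an assumption.
df = pd.DataFrame({
    "prev_semester_status": ["enrolled", "enrolled", "leave", "enrolled"],
    "curr_semester_status": ["enrolled", "leave", "leave", "enrolled"],
    "courses_taken": [10, 8, 6, 12],
    "failures": [0, 4, 3, 1],
    "at_risk": [0, 1, 0, 0],
})

# Derived features based on academic progression.
df["failure_rate"] = df["failures"] / df["courses_taken"]
df["status_changed"] = (
    df["prev_semester_status"] != df["curr_semester_status"]
).astype(int)

# One-hot encode the categorical status columns.
X = pd.get_dummies(
    df.drop(columns="at_risk"),
    columns=["prev_semester_status", "curr_semester_status"],
    dtype=int,
)
y = df["at_risk"]

# "Balanced" class weights: n_samples / (n_classes * class_count),
# so the rarer class receives the larger weight.
counts = y.value_counts()
class_weight = (len(y) / (len(counts) * counts)).to_dict()

print(X.columns.tolist())
print(class_weight)
```

The resulting dictionary can be passed as the `class_weight` argument of scikit-learn's `RandomForestClassifier`.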
| Metric | Score |
|---|---|
| Accuracy | 94.50% |
| Precision | 87.00% |
| Recall | 94.50% |
| F1-Score | 90.60% |
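As a quick consistency check on the table above, F1 can be recomputed as the harmonic mean of precision and recall. The sketch assumes the reported precision and recall are single aggregate values; the averaging mode behind them is not stated in this document.

```python
# Reported scores from the table above, expressed as fractions.
precision = 0.8700
recall = 0.9450

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.2%}")  # F1 = 90.60%
```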
Tip
- A total of 186 students were predicted to be at risk of dropout or academic disengagement.
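The evaluation behind figures like these could look roughly like the sketch below. It runs on synthetic data from `make_classification`, not the student records, and the choices of `class_weight="balanced"`, an 80/20 stratified split, and support-weighted averaging are assumptions rather than the project's documented settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the cleaned student dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Support-weighted averages across both classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted", zero_division=0
)

# Number of test records the model flags as the positive ("at risk") class.
n_flagged = int((y_pred == 1).sum())

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} flagged={n_flagged}")
```

With support-weighted averaging, recall is mathematically equal to accuracy, which would be consistent with the identical 94.50% values in the table, although the document does not say which averaging was used.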
* Reason for non-attendance
* Number of course failures
* Previous-semester academic status
* Current-semester status
Tip
- These insights help build data-driven strategies for intervention.
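The project ranks these factors with SHAP. As a lighter stand-in that needs only scikit-learn, the sketch below ranks features by permutation importance instead; it is a different technique from SHAP and is shown only to illustrate how a ranking of risk factors can be produced. The data are synthetic and the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in features (names are hypothetical).
failures = rng.integers(0, 6, size=n)
noise = rng.normal(size=n)
X = np.column_stack([failures, noise])
feature_names = ["failures", "noise"]

# The label depends only on the number of failures.
y = (failures >= 3).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle one feature at a time and measure the drop in accuracy.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```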
* The initial dataset contained metadata issues, redundant headers, and duplicate columns, all resolved during cleaning.
* Exploratory analysis revealed high dropout risk, with “No response from the student” as the leading cause of absence.
* The Random Forest model reached 94.50% accuracy, with 87.00% precision and 94.50% recall.
* Critical dropout risk factors were identified, enabling actionable interpretation.
* A total of 186 students were flagged as at risk.
* Risk factors identified by the model can guide targeted support programs and early interventions.
* The predictive model can be integrated into institutional systems such as early-alert dashboards or student-support chatbots.
* Future improvements may include:
  * Hyperparameter optimization
  * Comparison with other ML models
  * Time series modeling of academic progression
  * Real-time prediction pipelines
| 🇧🇷 Português | 🇬🇧 English |
|---|---|
| O conjunto de dados inicial apresentava problemas de metadados, múltiplos cabeçalhos e colunas duplicadas, corrigidos na limpeza. | The initial dataset contained metadata issues, multiple headers, and duplicate columns, all resolved during cleaning. |
| A análise exploratória revelou alta taxa de evasão, sendo “Sem retorno do estudante” o motivo mais comum. | Exploratory analysis showed a high dropout rate, with “No response from the student” as the most common reason. |
| O modelo Random Forest obteve: 94,50% de acurácia; 87,00% de precisão; 94,50% de recall; 90,60% de F1-score. | The Random Forest model achieved: 94.50% accuracy; 87.00% precision; 94.50% recall; 90.60% F1-score. |
| Fatores-chave incluíram: motivo da ausência, número de reprovações, status no semestre anterior e no atual. | Key factors included: reason for non-attendance, number of failures, previous-semester status, and current-semester status. |
| 186 estudantes foram previstos como estando em risco de evasão. | 186 students were predicted to be at risk of dropout. |
| Os fatores de risco ajudam a orientar intervenções e programas de apoio. | The risk factors help guide targeted interventions and support programs. |
| O modelo pode ser integrado a sistemas institucionais, garantindo conformidade com a LGPD. | The model can be integrated into institutional systems while ensuring LGPD compliance. |
* The initial dataset contained metadata issues, multiple headers, and duplicate columns, all resolved during the cleaning process.
* Exploratory analysis showed a high dropout risk, with “No response from the student” as the most common reason for non-attendance.
* The Random Forest model achieved strong performance:
  * Accuracy: 94.50%
  * Precision: 87.00%
  * Recall: 94.50%
  * F1-Score: 90.60%
* Key risk factors included: reason for non-attendance, number of failures, previous-semester status, and current-semester status.
* A total of 186 students were predicted to be at risk of dropout.
* The identified risk factors support the development of targeted interventions and support programs for at-risk students.
* The model can be integrated into institutional systems (early-alert platforms, chatbots, dashboards) while ensuring LGPD compliance.
├── data/
│   ├── raw_dataset.csv
│   └── cleaned_generated_dataset.csv
├── notebooks/
│   └── academic_performance_pipeline_AI-ML.ipynb
├── src/
│   ├── data_cleaning.py
│   ├── modeling.py
│   ├── eda.py
│   └── utils.py
├── outputs/
│   ├── figures/
│   ├── shap_analysis/
│   └── model_metrics.json
└── README.md

* Clone the repository
* Install dependencies (requirements.txt)
* Open the main notebook academic_performance_pipeline_AI-ML.ipynb
* Explore data preparation, modeling, and insights step by step
* Python 3.10+
* Jupyter Notebook
* pandas
* numpy
* scipy
* matplotlib
* seaborn
* scikit-learn
* imbalanced-learn
* SHAP (model explainability)
* Modular Python scripts
* Jupyter-based experimentation
* Reproducible data science pipeline
# Core
numpy==1.26.4
pandas==2.2.2
scipy==1.13.1
# Visualization
matplotlib==3.8.4
seaborn==0.13.2
# Machine Learning
scikit-learn==1.5.0
imbalanced-learn==0.12.2
# Explainability
shap==0.45.1
# Notebook Environment
jupyter==1.0.0
ipykernel==6.29.4
git clone https://github.com/your-username/social-pulse-academic-performance.git
cd social-pulse-academic-performance
python -m venv venv
source venv/bin/activate # macOS/Linux
venv\Scripts\activate # Windows
pip install -r requirements.txt
jupyter notebook notebooks/academic_performance_pipeline_AI-ML.ipynb
Tip
- Execute the notebook step by step to reproduce data cleaning, EDA, modeling, and SHAP analysis.
Below is the planned evolution of the project, combining academic rigor with practical ML expansion.
* Data cleaning and dataset restructuring
* Exploratory Data Analysis
* Feature engineering
* Random Forest modeling
* Model evaluation
* SHAP explainability
* Key insights and outcome report
* Refined notebook documentation
* README optimization
* Improvements in visualization design
* Enhanced sectioning for portfolio presentation
* Hyperparameter tuning with GridSearchCV or Optuna
* Benchmarking alternative models (XGBoost, LightGBM, Logistic Regression)
* Cross-validation and stability assessment
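For the tuning and cross-validation items above, a minimal sketch with `GridSearchCV` and stratified folds is shown here. It uses synthetic data, and the parameter grid, the F1 scoring choice, and the five-fold setup are illustrative assumptions, not settings taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the student dataset.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid=param_grid,
    scoring="f1",
    cv=cv,
)
search.fit(X, y)

best_std = search.cv_results_["std_test_score"][search.best_index_]

print(search.best_params_)
print(f"Mean cross-validated F1: {search.best_score_:.3f}")
# The spread across folds gives a rough signal of stability.
print(f"Std across folds: {best_std:.3f}")
```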
* Streamlit dashboard or FastAPI endpoint for real-time predictions
* Pipeline orchestration with Prefect or Airflow
* CI/CD integration
* Improved explainability (partial dependence plots, feature interactions)
Tip
👌🏻 Contributions are welcome.
- Please follow conventional commit practices, open issues, or submit pull requests with improvements or enhancements.
- 👩🏻🚀 Fabiana ⚡️ Campanari
- 👨🏽🚀 Pedro Barrenco
- 🧑🏼🚀 Pedro Vyctor
🛸๋ My Contacts Hub
Copyright 2026 Mindful-AI-Assistants. Code released under the MIT license.