[🇧🇷 Português] [🇺🇸 English]


Social Pulse: A Machine Learning Approach to Academic Performance Modeling and Analytics



Course: Humanistic AI & Data Science (4th Semester)
Institution: PUC-SP

Professor: Erick Bacconi
Professor: Rooney Ribeiro Albuquerque Coelho




This project analyzes an anonymized dataset of student grades to uncover insights into academic performance, identify patterns, and explore the viability of predictive models for educational analytics.

It was developed as part of the Social Media Marketing course at PUC-SP, applying principles of data storytelling, analytical thinking, and structured reporting.















Important

• This repository (8 - Social Buzz, Extension Project: Social Pulse - A Machine Learning Approach to Academic Performance Modeling and Analytics) is part of the overall structure defined in the main hub and follows the same standards, organization, and documentation patterns.

The main repository, 1-social-buzz-ai-main, serves as the central hub for this discipline, consolidating all project files, documentation, and links to the related sub-repositories.









A complete end-to-end Machine Learning pipeline designed to explore, model, and predict student academic performance using anonymized institutional records. This project demonstrates a full data science workflow — from raw data cleaning to predictive modeling and interpretable insights.




This repository contains a structured analysis of student academic performance, developed as part of an educational project. Although initially created within a Social Media Marketing course, the project leverages the dataset to practice core Data Science and Machine Learning skills, including:


* Data cleaning and preparation

* Exploratory Data Analysis (EDA)

* Feature engineering

* Predictive modeling with Random Forest

* Model explainability using SHAP

* Insight communication and structured storytelling



Tip

  • The result is a complete case study in Educational Analytics, focusing on detecting and understanding student dropout risk.





The original dataset presented several structural issues:


* Metadata mixed with data rows

* Multiple header rows

* Duplicate columns

* Irregular formatting and inconsistent values



Tip

  • All issues were systematically corrected, producing a clean and reliable dataset for analysis and modeling.
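The structural fixes listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual `data_cleaning.py`: the sample text, the column names (`student_id`, `status`, `failures`), and the helper `clean_grades` are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical raw export: two metadata lines, a header row, the header
# repeated mid-file, a duplicated column, and inconsistent value formatting.
RAW = """Report generated by the academic system
Semester: anonymized
student_id,status,failures,status
1,Active,0,Active
student_id,status,failures,status
2,Dropped,3,Dropped
3, active ,1, active
"""


def clean_grades(raw_text: str, metadata_rows: int = 2) -> pd.DataFrame:
    """Return a tidy frame from a messy CSV export (illustrative only)."""
    # Skip the metadata lines and read everything as text, without a header.
    frame = pd.read_csv(
        io.StringIO(raw_text), skiprows=metadata_rows, header=None, dtype=str
    )
    # The first remaining row is the real header.
    header = frame.iloc[0].str.strip().tolist()
    frame = frame.iloc[1:]
    # Drop any later rows that merely repeat the header.
    frame = frame[frame.iloc[:, 0].str.strip() != header[0]]
    frame.columns = header
    # Keep only the first occurrence of each duplicated column name.
    frame = frame.loc[:, ~frame.columns.duplicated()]
    # Normalize whitespace and casing, then restore numeric types.
    frame = frame.apply(lambda col: col.str.strip())
    frame["status"] = frame["status"].str.title()
    frame = frame.astype({"student_id": int, "failures": int})
    return frame.reset_index(drop=True)


cleaned = clean_grades(RAW)
print(cleaned)
```

Reading with `header=None` avoids pandas renaming the duplicated column on load, so the duplicate can be dropped by name afterwards.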




The analysis surfaced several important findings:


* High dropout and absenteeism rates

* “No response from the student” was the most frequent reason for non-attendance

* Strong patterns linked to academic history and semester status



Key transformations included:


* Encoding categorical academic attributes

* Handling class imbalance

* Creating derived features based on academic progression
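The three transformations above might look like the sketch below. The column names and values are invented for illustration. The README does not say which imbalance technique was used (the stack lists imbalanced-learn), so plain "balanced" class weights stand in here as an assumption.

```python
import pandas as pd

# Hypothetical cleaned records; not the project's real schema.
records = pd.DataFrame(
    {
        "prev_semester_status": [
            "Approved", "Failed", "Approved", "Locked", "Approved", "Approved",
        ],
        "failures": [0, 3, 1, 4, 0, 2],
        "courses_taken": [10, 6, 8, 5, 12, 8],
        "dropout": [0, 1, 0, 1, 0, 0],
    }
)

# Derived feature: share of attempted courses that ended in failure.
records["failure_rate"] = records["failures"] / records["courses_taken"]

# One-hot encode the categorical academic status.
features = pd.get_dummies(
    records.drop(columns="dropout"),
    columns=["prev_semester_status"],
    dtype=int,
)

# "Balanced" class weights: n_samples / (n_classes * class_count).
counts = records["dropout"].value_counts()
class_weights = (len(records) / (len(counts) * counts)).to_dict()

print(features.columns.tolist())
print(class_weights)
```

The resulting weights give the minority class proportionally more influence during training and can be passed to a classifier that accepts a `class_weight` mapping.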



A Random Forest Classifier was trained and evaluated, achieving strong performance:


| Metric | Score |
| --- | --- |
| Accuracy | 94.50% |
| Precision | 87.00% |
| Recall | 94.50% |
| F1-Score | 90.60% |
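As a quick consistency check on the reported scores: if the F1-score is taken as the harmonic mean of the reported precision and recall (an assumption, since the README does not state how the scores were averaged), the figures agree.

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \cdot 0.8700 \cdot 0.9450}{0.8700 + 0.9450}
    = \frac{1.6443}{1.8150}
    \approx 0.9060
```

A recall identical to the accuracy is what class-weighted averaging always yields, so the scores may have been computed that way. The source does not confirm it.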



Tip

  • A total of 186 students were predicted to be at risk of dropout or academic disengagement.
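The modeling step can be sketched with scikit-learn as follows. Everything here is an assumption made for illustration: the data is synthetic (`make_classification`), and the hyperparameters, the 80/20 split, and the weighted averaging are not taken from the project, so the numbers it prints will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the student records (class 1 = at risk).
X, y = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=4,
    weights=[0.85, 0.15],
    random_state=42,
)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted", zero_division=0
)

# Count how many held-out samples the model flags as at risk.
at_risk = int((y_pred == 1).sum())

print(f"accuracy={accuracy:.4f} precision={precision:.4f}")
print(f"recall={recall:.4f} f1={f1:.4f} flagged={at_risk}")
```

With an imbalanced target, per-class scores (`average=None`) are worth inspecting alongside the weighted figures, since a high weighted score can hide weak recall on the minority class.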




SHAP values were used to identify the most influential predictors:


* Reason for non-attendance

* Number of course failures

* Previous-semester academic status

* Current-semester status



Tip

  • These insights help build data-driven strategies for intervention.
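SHAP attributions are based on Shapley values from cooperative game theory. The sketch below computes them exactly by enumerating feature subsets for a toy scoring function, to show what the attributions mean. It is not how the `shap` library works internally (it relies on much faster algorithms), and the feature names and weights in `toy_risk` are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values for a set-valued function (brute force)."""
    n = len(features)
    result = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of adding the feature to the coalition.
                total += weight * (value(coalition | {feature}) - value(coalition))
        result[feature] = total
    return result


def toy_risk(present):
    """Hypothetical risk score with one interaction term."""
    score = 0.0
    if "failures" in present:
        score += 0.30
    if "no_response" in present:
        score += 0.40
    if "prev_status" in present:
        score += 0.10
    if "failures" in present and "no_response" in present:
        score += 0.10  # interaction, split evenly between the two features
    return score


names = ["failures", "no_response", "prev_status"]
phi = shapley_values(names, toy_risk)
print(phi)
```

The attributions sum to the difference between the score with all features present and the score with none, which is the property that makes per-student explanations add up to the model output.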




* The initial dataset contained metadata issues, redundant headers, and duplicate columns, all resolved during cleaning.

* Exploratory analysis revealed high dropout risk, with “No response from the student” as the leading cause of absence.

* The Random Forest model reached 94.50% accuracy, with 87.00% precision, 94.50% recall, and a 90.60% F1-score.

* Critical dropout risk factors were identified, enabling actionable interpretation.

* A total of 186 students were flagged as at risk.




* Risk factors identified by the model can guide targeted support programs and early interventions.

* The predictive model can be integrated into institutional systems such as early-alert dashboards or student-support chatbots.

* Future improvements may include:

  * Hyperparameter optimization

  * Comparison with other ML models

  * Time series modeling of academic progression

  * Real-time prediction pipelines




| 🇧🇷 Português | 🇬🇧 English |
| --- | --- |
| O conjunto de dados inicial apresentava problemas de metadados, múltiplos cabeçalhos e colunas duplicadas, corrigidos na limpeza. | The initial dataset contained metadata issues, multiple headers, and duplicate columns, all resolved during cleaning. |
| A análise exploratória revelou alta taxa de evasão, sendo “Sem retorno do estudante” o motivo mais comum. | Exploratory analysis showed a high dropout rate, with “No response from the student” as the most common reason. |
| O modelo Random Forest obteve: 94,50% de acurácia; 87,00% de precisão; 94,50% de recall; 90,60% de F1-score. | The Random Forest model achieved: 94.50% accuracy; 87.00% precision; 94.50% recall; 90.60% F1-score. |
| Fatores-chave incluíram: motivo da ausência, número de reprovações, status no semestre anterior e no atual. | Key factors included: reason for non-attendance, number of failures, previous-semester status, and current-semester status. |
| 186 estudantes foram previstos como estando em risco de evasão. | 186 students were predicted to be at risk of dropout. |
| Os fatores de risco ajudam a orientar intervenções e programas de apoio. | The risk factors help guide targeted interventions and support programs. |
| O modelo pode ser integrado a sistemas institucionais, garantindo conformidade com a LGPD. | The model can be integrated into institutional systems while ensuring LGPD compliance. |




* The initial dataset contained metadata issues, multiple headers, and duplicate columns, all resolved during the cleaning process.

* Exploratory analysis showed a high dropout risk, with “No response from the student” as the most common reason for non-attendance.

* The Random Forest model achieved strong performance:

  * Accuracy: 94.50%

  * Precision: 87.00%

  * Recall: 94.50%

  * F1-Score: 90.60%

* Key risk factors included: reason for non-attendance, number of failures, previous-semester status, and current-semester status.

* A total of 186 students were predicted to be at risk of dropout.




* The identified risk factors support the development of targeted interventions and support programs for at-risk students.

* The model can be integrated into institutional systems (early-alert platforms, chatbots, dashboards) while ensuring LGPD compliance.




├── data/
│   ├── raw_dataset.csv
│   ├── cleaned_generated_dataset.csv
├── notebooks/
│   ├── academic_performance_pipeline_AI-ML.ipynb
├── src/
│   ├── data_cleaning.py
│   ├── modeling.py
│   ├── eda.py
│   ├── utils.py
├── outputs/
│   ├── figures/
│   ├── shap_analysis/
│   ├── model_metrics.json
├── README.md



* Clone the repository

* Install dependencies (requirements.txt)

* Open the main notebook academic_performance_pipeline_AI-ML.ipynb

* Explore data preparation, modeling, and insights step by step



This project was developed using a modern and reliable Data Science stack:


* Python 3.10+

* Jupyter Notebook


* pandas

* numpy

* scipy


* matplotlib

* seaborn


* scikit-learn

* imbalanced-learn

* SHAP (model explainability)


* Modular Python scripts

* Jupyter-based experimentation

* Reproducible data science pipeline




To reproduce the full data pipeline, install the dependencies below:


# Core
numpy==1.26.4
pandas==2.2.2
scipy==1.13.1

# Visualization
matplotlib==3.8.4
seaborn==0.13.2

# Machine Learning
scikit-learn==1.5.0
imbalanced-learn==0.12.2

# Explainability
shap==0.45.1

# Notebook Environment
jupyter==1.0.0
ipykernel==6.29.4




Follow the steps below to set up the environment and reproduce the analysis.



git clone https://github.com/your-username/social-pulse-academic-performance.git
cd social-pulse-academic-performance




python -m venv venv
source venv/bin/activate  # macOS/Linux
venv\Scripts\activate     # Windows




pip install -r requirements.txt




jupyter notebook




Navigate to:


notebooks/academic_performance_pipeline_AI-ML.ipynb



Tip

  • Execute the notebook step by step to reproduce data cleaning, EDA, modeling, and SHAP analysis.



Below is the planned evolution of the project, combining academic rigor with practical ML expansion.


* Data cleaning and dataset restructuring

* Exploratory Data Analysis

* Feature engineering

* Random Forest modeling

* Model evaluation

* SHAP explainability

* Key insights and outcome report


* Refined notebook documentation

* README optimization

* Improvements in visualization design

* Enhanced sectioning for portfolio presentation


* Hyperparameter tuning with GridSearchCV or Optuna

* Benchmarking alternative models (XGBoost, LightGBM, Logistic Regression)

* Cross-validation and stability assessment


* Streamlit dashboard or FastAPI endpoint for real-time predictions


* Prefect or Airflow for pipeline orchestration

* CI/CD integration

* Improved explainability (partial dependence plots, feature interactions)
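Two of the planned items, hyperparameter tuning with GridSearchCV and cross-validated stability assessment, could be prototyped along these lines. This is a sketch under assumptions: the data is synthetic, and the parameter grid, the three stratified folds, and the `f1_weighted` scoring are illustrative choices rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the student records.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=4,
    weights=[0.8, 0.2],
    random_state=0,
)

# Deliberately small grid so the search finishes quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

# Stratified folds keep the class ratio stable across splits.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1_weighted",
    cv=cv,
)
search.fit(X, y)

# Spread of the best configuration's score across folds: a rough stability check.
fold_std = search.cv_results_["std_test_score"][search.best_index_]

print(search.best_params_)
print(f"mean={search.best_score_:.4f} std={fold_std:.4f}")
```

A large standard deviation across folds would suggest the headline score depends heavily on the particular split, which is worth knowing before comparing against other models.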




Tip

👌🏻 Contributions are welcome.

  • Please open issues or submit pull requests with improvements or enhancements, and follow conventional commit practices.









Copyright 2026 Mindful-AI-Assistants. Code released under the MIT license.
