8- Social Buzz AI - Extension Project

Social Pulse: A Machine Learning Approach to Academic Performance Modeling and Analytics

Course: Humanistic AI & Data Science (4th Semester)
Institution: PUC-SP

✨ Professor: Erick Bacconi
✨ Professor: Rooney Ribeiro Albuquerque Coelho

This project analyzes an anonymized dataset of student grades to uncover insights into academic performance, identify patterns, and explore the viability of predictive models for educational analytics.

It was developed as part of the Social Media Marketing course at PUC-SP, applying principles of data storytelling, analytical thinking, and structured reporting.

Slides Presentation

Note

⚠️ Heads Up

Projects and deliverables may be made publicly available whenever possible.
The course prioritizes hands-on practice with real data in consulting scenarios.
All activities comply with the academic and ethical guidelines of PUC-SP.
Confidential information related to this repository is kept in private repositories.

Important

• This repository (8- Social Buzz AI - Extension Project, Social Pulse: A Machine Learning Approach to Academic Performance Modeling and Analytics) is part of the overall structure defined in the main hub and follows the same standards, organization, and documentation patterns.

• The Main Repository, 1-social-buzz-ai-main, serves as the central hub for this discipline, consolidating all project files, documentation, and links to the related sub-repositories.

Social Pulse — Academic Performance Modeling & Analytics

A complete end-to-end Machine Learning pipeline designed to explore, model, and predict student academic performance using anonymized institutional records. This project demonstrates a full data science workflow — from raw data cleaning to predictive modeling and interpretable insights.

Overview

This repository contains a structured analysis of student academic performance, developed as part of an educational project. Although initially created within a Social Media Marketing course, the project leverages the dataset to practice core Data Science and Machine Learning skills, including:

* Data cleaning and preparation

* Exploratory Data Analysis (EDA)

* Feature engineering

* Predictive modeling with Random Forest

* Model explainability using SHAP

* Insight communication and structured storytelling

Tip

The result is a complete case study in Educational Analytics, focusing on detecting and understanding student dropout risk.

Project Pipeline

1. Data Cleaning & Preparation

The original dataset presented several structural issues:

* Metadata mixed with data rows

* Multiple header rows

* Duplicate columns

* Irregular formatting and inconsistent values

Tip

All issues were systematically corrected, producing a clean and reliable dataset for analysis and modeling.
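
The structural fixes listed above could be sketched with pandas roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual `data_cleaning.py`: the sample text, the column names, and the `clean_export` helper are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical raw export: a metadata line, a header, a repeated header row
# in the middle of the data, and a duplicated column.
RAW = """Report generated by the academic system
student_id,status,failures,status
1,active,0,active
student_id,status,failures,status
2,dropout,3,dropout
3,active,1,active
"""


def clean_export(text: str, header_key: str = "student_id") -> pd.DataFrame:
    lines = text.splitlines()
    # 1. Skip metadata: keep everything from the first real header line onward.
    start = next(i for i, line in enumerate(lines) if line.startswith(header_key))
    df = pd.read_csv(io.StringIO("\n".join(lines[start:])), dtype=str)
    # 2. Drop rows that merely repeat the header.
    df = df[df[header_key] != header_key]
    # 3. Normalize formatting: strip stray whitespace from every value.
    df = df.apply(lambda col: col.str.strip())
    # 4. Drop columns whose values duplicate an earlier column
    #    (pandas renames the repeated "status" column to "status.1" on load).
    df = df.loc[:, ~df.T.duplicated()]
    # 5. Restore numeric types and a clean row index.
    return df.assign(failures=pd.to_numeric(df["failures"])).reset_index(drop=True)


clean = clean_export(RAW)
print(clean)
```

Comparing column contents (step 4) rather than column names is a design choice for this sketch; it only makes sense when two columns are known to be true copies of each other.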

2. Exploratory Data Analysis (EDA)

The analysis surfaced several important findings:

* High dropout and absenteeism rates

* “No response from the student” was the most frequent reason for non-attendance

* Strong patterns linked to academic history and semester status
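
Findings of this kind typically come from simple frequency tables. The sketch below shows one way to compute them with pandas; the records, column names, and category values are invented for illustration and do not come from the project's dataset.

```python
import pandas as pd

# Hypothetical cleaned records; columns and values are illustrative only.
records = pd.DataFrame({
    "status": ["dropout", "active", "dropout", "absent", "active", "dropout"],
    "previous_status": ["failed", "approved", "failed", "failed", "approved", "failed"],
    "non_attendance_reason": [
        "No response from the student", None, "No response from the student",
        "Health", None, "Financial",
    ],
})

# Share of each enrollment status (dropout rate, absenteeism rate, ...).
status_share = records["status"].value_counts(normalize=True)

# Most frequent reason for non-attendance (missing reasons are ignored).
reason_counts = records["non_attendance_reason"].value_counts()
top_reason = reason_counts.idxmax()

# Current status broken down by previous-semester status, as row percentages.
by_history = pd.crosstab(records["previous_status"], records["status"], normalize="index")

print(status_share, top_reason, by_history, sep="\n\n")
```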

3. Feature Engineering

Key transformations included:

* Encoding categorical academic attributes

* Handling class imbalance

* Creating derived features based on academic progression
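
As a rough illustration of these three transformations, the sketch below one-hot encodes a categorical attribute, derives a failure rate, and computes inverse-frequency class weights. The data and column names are hypothetical. The README does not say how class imbalance was handled (the stack includes imbalanced-learn, so resampling may have been used); class weighting is shown here only as one common option.

```python
import pandas as pd

# Hypothetical student records; names and values are illustrative only.
df = pd.DataFrame({
    "previous_status": ["approved", "failed", "approved", "failed", "approved", "approved"],
    "failures": [0, 3, 1, 4, 0, 0],
    "courses_taken": [10, 8, 10, 8, 5, 6],
    "at_risk": [0, 1, 0, 1, 0, 0],
})

# Derived feature based on academic progression: share of courses failed.
df["failure_rate"] = df["failures"] / df["courses_taken"]

# Encode the categorical attribute as 0/1 indicator columns.
X = pd.get_dummies(df.drop(columns="at_risk"), columns=["previous_status"], dtype=int)
y = df["at_risk"]

# Inverse-frequency class weights: n_samples / (n_classes * class_count).
# This is the same formula scikit-learn applies for class_weight="balanced".
counts = y.value_counts()
class_weight = {label: len(y) / (len(counts) * n) for label, n in counts.items()}

print(X.columns.tolist())
print(class_weight)
```

The resulting dictionary can be passed to a classifier's `class_weight` parameter so that errors on the minority class cost more during training.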

4. Predictive Modeling

A Random Forest Classifier was trained and evaluated, achieving strong performance:

Metric	Score
Accuracy	94.50%
Precision	87.00%
Recall	94.50%
F1-Score	90.60%

Tip

A total of 186 students were predicted to be at risk of dropout or academic disengagement.
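
The reported accuracy and recall are identical (94.50%). One possible reading, which the README does not confirm, is that precision, recall, and F1 are support-weighted averages over the classes, because weighted recall always equals accuracy. The sketch below demonstrates that property on made-up labels; `weighted_metrics` is a hypothetical helper, not project code.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n  # each class counts in proportion to its support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 8 students not at risk, 2 at risk.
y_true = ["ok"] * 8 + ["risk"] * 2
y_pred = ["ok"] * 7 + ["risk"] + ["risk", "ok"]

metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

With an imbalanced target, per-class recall for the at-risk group is usually the more informative number, since a weighted average is dominated by the majority class.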

5. Explainability & Insights

SHAP values were used to identify the most influential predictors:

* Reason for non-attendance

* Number of course failures

* Previous-semester academic status

* Current-semester status

Tip

These insights help build data-driven strategies for intervention.
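
SHAP attributes a prediction to individual features using Shapley values. To show what such a value means, the sketch below computes exact Shapley values for a tiny hypothetical risk score by averaging each feature's marginal contribution over all feature orderings. It does not use the `shap` library or the project's Random Forest; the scoring function, feature names, and baseline are assumptions made for illustration.

```python
from itertools import permutations


def toy_risk_score(x):
    """Hypothetical risk score with one interaction term (not the project's model)."""
    score = 0.2 * x["failures"]
    if x["no_response"]:
        score += 0.3
    if x["no_response"] and x["prev_failed"]:
        score += 0.1  # interaction between two features
    return score


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    features = list(instance)
    totals = dict.fromkeys(features, 0.0)
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)  # start from the reference student
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch one feature to its real value
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


student = {"failures": 2, "no_response": True, "prev_failed": True}
baseline = {"failures": 0, "no_response": False, "prev_failed": False}

phi = shapley_values(toy_risk_score, student, baseline)
ranking = sorted(phi, key=lambda name: abs(phi[name]), reverse=True)
print(phi, ranking)
```

The contributions add up to the difference between the student's score and the baseline score, which is the property that makes a ranking of predictors like the one above interpretable. Enumerating every ordering is feasible only for a handful of features; tree-based explainers avoid that cost.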

Key Findings

* The initial dataset contained metadata issues, redundant headers, and duplicate columns, all resolved during cleaning.

* Exploratory analysis revealed high dropout risk, with “No response from the student” as the leading cause of absence.

* The Random Forest model reached 94.50% accuracy, with 87.00% precision, 94.50% recall, and a 90.60% F1-score.

* Critical dropout risk factors were identified, enabling actionable interpretation.

* A total of 186 students were flagged as at risk.

Insights & Next Steps

* Risk factors identified by the model can guide targeted support programs and early interventions.

* The predictive model can be integrated into institutional systems such as early-alert dashboards or student-support chatbots.

* Future improvements may include:

  * Hyperparameter optimization

  * Comparison with other ML models

  * Time series modeling of academic progression

  * Real-time prediction pipelines

Bilingual Summary Table (PT → EN)

🇧🇷 Português	🇬🇧 English
O conjunto de dados inicial apresentava problemas de metadados, múltiplos cabeçalhos e colunas duplicadas, corrigidos na limpeza.	The initial dataset contained metadata issues, multiple headers, and duplicate columns, all resolved during cleaning.
A análise exploratória revelou alta taxa de evasão, sendo “Sem retorno do estudante” o motivo mais comum.	Exploratory analysis showed a high dropout rate, with “No response from the student” as the most common reason.
O modelo Random Forest obteve: 94,50% de acurácia; 87,00% de precisão; 94,50% de recall; 90,60% de F1-score.	The Random Forest model achieved: 94.50% accuracy; 87.00% precision; 94.50% recall; 90.60% F1-score.
Fatores-chave incluíram: motivo da ausência, número de reprovações, status no semestre anterior e no atual.	Key factors included: reason for non-attendance, number of failures, previous-semester status, and current-semester status.
186 estudantes foram previstos como estando em risco de evasão.	186 students were predicted to be at risk of dropout.
Os fatores de risco ajudam a orientar intervenções e programas de apoio.	The risk factors help guide targeted interventions and support programs.
O modelo pode ser integrado a sistemas institucionais, garantindo conformidade com a LGPD.	The model can be integrated into institutional systems while ensuring LGPD compliance.

Summary

Data Analysis — Key Findings

* The initial dataset contained metadata issues, multiple headers, and duplicate columns, all resolved during the cleaning process.

* Exploratory analysis showed a high dropout risk, with “No response from the student” as the most common reason for non-attendance.

* The Random Forest model achieved strong performance:

  * Accuracy: 94.50%

  * Precision: 87.00%

  * Recall: 94.50%

  * F1-Score: 90.60%

* Key risk factors included: reason for non-attendance, number of failures, previous-semester status, and current-semester status.

* A total of 186 students were predicted to be at risk of dropout.

Insights & Next Steps

* The identified risk factors support the development of targeted interventions and support programs for at-risk students.

* The model can be integrated into institutional systems (early-alert platforms, chatbots, dashboards) while ensuring LGPD compliance.

Repository Structure

├── data/
│   ├── raw_dataset.csv
│   ├── cleaned_generated_dataset.csv
├── notebooks/
│   ├── academic_performance_pipeline_AI-ML.ipynb
├── src/
│   ├── data_cleaning.py
│   ├── modeling.py
│   ├── eda.py
│   ├── utils.py
├── outputs/
│   ├── figures/
│   ├── shap_analysis/
│   ├── model_metrics.json
├── README.md

How to Use This Repository

* Clone the repository

* Install dependencies (requirements.txt)

* Open the main notebook academic_performance_pipeline_AI-ML.ipynb

* Explore data preparation, modeling, and insights step by step

Tech Stack

This project was developed using a modern and reliable Data Science stack:

Languages & Core Tools

* Python 3.10+

* Jupyter Notebook

Data Processing & Analysis

* pandas

* numpy

* scipy

Visualization

* matplotlib

* seaborn

Machine Learning

* scikit-learn

* imbalanced-learn

* SHAP (model explainability)

Project Structure & Workflow

* Modular Python scripts

* Jupyter-based experimentation

* Reproducible data science pipeline

Requirements

To reproduce the full data pipeline, install the dependencies below:

# Core
numpy==1.26.4
pandas==2.2.2
scipy==1.13.1

# Visualization
matplotlib==3.8.4
seaborn==0.13.2

# Machine Learning
scikit-learn==1.5.0
imbalanced-learn==0.12.2

# Explainability
shap==0.45.1

# Notebook Environment
jupyter==1.0.0
ipykernel==6.29.4

Installation

Follow the steps below to set up the environment and reproduce the analysis.

1. Clone the repository

git clone https://github.com/your-username/social-pulse-academic-performance.git
cd social-pulse-academic-performance

2. Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows

3. Install dependencies

pip install -r requirements.txt

4. Launch Jupyter

jupyter notebook

5. Open the main notebook

Navigate to:

notebooks/academic_performance_pipeline_AI-ML.ipynb

Tip

Execute the notebook step by step to reproduce data cleaning, EDA, modeling, and SHAP analysis.

Roadmap

Below is the planned evolution of the project, combining academic rigor with practical ML expansion.

Phase 1 — Completed

* Data cleaning and dataset restructuring

* Exploratory Data Analysis

* Feature engineering

* Random Forest modeling

* Model evaluation

* SHAP explainability

* Key insights and outcome report

Phase 2 — In Progress

* Refined notebook documentation

* README optimization

* Improvements in visualization design

* Enhanced sectioning for portfolio presentation

Phase 3 — Planned

* Hyperparameter tuning with GridSearchCV or Optuna

* Benchmarking alternative models (XGBoost, LightGBM, Logistic Regression)

* Cross-validation and stability assessment

* Deployment prototype:

  * Streamlit dashboard or

  * FastAPI endpoint for real-time predictions

* Automated pipeline using Prefect or Airflow

* CI/CD integration

* Improved explainability (partial dependence plots, feature interactions)
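
For the planned tuning and cross-validation work, a starting point with scikit-learn's `GridSearchCV` might look like the sketch below. It runs on synthetic data from `make_classification`, and the parameter grid, scoring metric, and fold count are placeholder assumptions rather than choices made by the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the student dataset (about 20% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Placeholder grid; real ranges would depend on the actual data.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}

# Stratified folds keep the class ratio stable across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # F1 of the positive (at-risk) class
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The spread of fold scores in `search.cv_results_` gives a first view of model stability, which is the point of the cross-validation item above.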

Contributing

Tip

👌🏻 Contributions are welcome.

Please open an issue or submit a pull request with your improvements, and follow conventional commit practices for commit messages.

💚 Our Crew:

👩🏻‍🚀 Fabiana ⚡️ Campanari

👨🏽‍🚀 Pedro Barrenco

🧑🏼‍🚀 Pedro Vyctor

💌 Let the data flow... Ping Me!

🛸๋ My Contacts Hub


Copyright 2026 Mindful-AI-Assistants. Code released under the MIT license.
