Hammad-1105/student-performance-ml-dl

PHASE 1

| Step | Task | Purpose |
|------|------|---------|
| 1 | Load & preview data | Sanity check + get quick structure |
| 2 | Basic info & nulls | Look for missing data, dtypes |
| 3 | Column-by-column deep dive | Categorical vs numerical vs IDs |
| 4 | Correlation matrix | Spot potential relationships |
| 5 | Visual exploration | Univariate, bivariate, outliers |
| 6 | Target distribution | Check regression suitability |
| 7 | Feature transformation plan | Scaling, encoding, outliers, etc. |
| 8 | Final preprocessing design | Pipeline or manual steps |
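Steps 1, 2, and 4 above can be sketched in a few lines of pandas. The column names here (`hours_studied`, `exam_score`, etc.) are hypothetical stand-ins for the real dataset's columns, and the inline DataFrame stands in for the CSV in data/raw/:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset in data/raw/
df = pd.DataFrame({
    "hours_studied": [5, 8, 2, 7, 4, 9],
    "attendance": [90, 95, 60, 85, 70, 98],
    "extracurricular": ["Yes", "No", "No", "Yes", "Yes", "No"],
    "exam_score": [75, 88, 50, 80, 62, 93],
})

# Step 1: load & preview — sanity check the structure
print(df.head())

# Step 2: dtypes and missing values
print(df.dtypes)
print(df.isna().sum())

# Step 4: correlation matrix on numeric columns only
corr = df.select_dtypes(include=np.number).corr()
print(corr["exam_score"].sort_values(ascending=False))
```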

PHASE 2

| Step | Task | Purpose |
|------|------|---------|
| 0 | Train–Test Split | Prevent data leakage by splitting before preprocessing |
| 1 | Encode Categorical Variables | Convert categorical data into numerical format for models |
| 2 | Drop/Engineer Features | Remove irrelevant features, create new meaningful ones |
| 3 | Scale Numerical Features | Standardize/normalize numeric data for better model performance |
| 4 | Handle Outliers (Optional) | Treat extreme values that may distort model training |
| 5 | Dimensionality Reduction (Optional) | Reduce feature space using PCA for noise removal or visualization |
| 6 | Finalize Processed Dataset | Ensure X_train/X_test/y_train/y_test are clean and ready |
| 7 | Build Preprocessing Pipeline | Use Pipeline + ColumnTransformer for reproducible preprocessing |
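A minimal sketch of steps 0, 1, 3, and 7, assuming hypothetical column names. The key point is the order: split first, then fit the transformer on the training split only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature names; the real dataset's columns may differ
df = pd.DataFrame({
    "hours_studied": [5, 8, 2, 7, 4, 9, 3, 6],
    "extracurricular": ["Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No"],
    "exam_score": [75, 88, 50, 80, 62, 93, 55, 78],
})
X = df.drop(columns="exam_score")
y = df["exam_score"]

# Step 0: split BEFORE any fitting to prevent leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Steps 1 + 3 + 7: encode + scale inside one reusable transformer
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["hours_studied"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["extracurricular"]),
])

# Fit on train only; apply the same fitted transform to test
X_train_proc = preprocessor.fit_transform(X_train)
X_test_proc = preprocessor.transform(X_test)
```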

PHASE 3

| Step | Task | Purpose |
|------|------|---------|
| 0 | Wire up y_train | Define target; keep X_train raw so the pipeline handles all preprocessing per fold |
| 1 | Define CV + Metrics | Use 5-Fold CV with RMSE/MAE/R² for apples-to-apples comparisons |
| 2 | Baselines (Dummy, Linear) | Establish performance floor and a simple linear benchmark |
| 3 | Regularized Linear (Ridge/Lasso/EN) | Handle noise/collinearity; check if linear models are enough |
| 4 | Trees & Boosting | Try RandomForest / GradientBoosting (optionally XGBoost/LightGBM later) |
| 5 | Cross-validate consistently | Evaluate each model in a Pipeline(preprocessor → model) to avoid leakage |
| 6 | Hyperparameter Tuning (Optional) | RandomizedSearchCV on top performers with tight, sensible ranges |
| 7 | Refit Best on Full Train | Fit the winning pipeline on all training data |
| 8 | Diagnostics (Optional) | Residuals, feature importance, learning curves |
| 9 | Save Best Pipeline | Persist the single preprocessor+model pipeline for later test eval & deployment |
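Steps 1, 2, 3, and 5 can be sketched on synthetic data. The model choices and RMSE scorer follow the table; the feature names and data-generating formula are hypothetical. Because each model sits inside a Pipeline, the preprocessor is re-fit inside every CV fold, which is what prevents leakage:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with a strong linear signal
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "hours_studied": rng.uniform(0, 10, n),
    "extracurricular": rng.choice(["Yes", "No"], n),
})
y = 5 * X["hours_studied"] + rng.normal(0, 2, n) + 40

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["hours_studied"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["extracurricular"]),
])

# Step 1: one CV splitter and one metric for all models
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Steps 2-3 + 5: baselines and a regularized linear model,
# each evaluated inside a Pipeline(preprocessor -> model)
results = {}
for name, model in [("dummy", DummyRegressor()),
                    ("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0))]:
    pipe = Pipeline([("prep", preprocessor), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    results[name] = -scores.mean()  # mean RMSE (lower is better)

print(results)
```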

✅ Phase 0 — Project Setup & GitHub Init

Create project directory and environment (conda, Jupyter)

Initialize Git repository and push to GitHub

Set up basic folder structure and README

Save raw dataset to data/raw/

📦 Outcome: Clean setup, project versioning started, README and folders ready

🔎 Phase 1 — EDA + Initial Understanding

Load dataset, explore columns, understand datatypes

Visualize key features (distribution, correlation, outliers)

Understand missing values, class imbalance, etc.

Label encoding or quick mapping (e.g. 'Yes' → 1, 'No' → 0)

📦 Outcome: Deep understanding of the data, hypotheses ready, bad columns marked
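The quick Yes/No mapping mentioned above is a one-liner with pandas (the column is a hypothetical example):

```python
import pandas as pd

# Quick binary mapping during EDA (hypothetical Yes/No column)
s = pd.Series(["Yes", "No", "Yes", "No"])
mapped = s.map({"Yes": 1, "No": 0})
print(mapped.tolist())  # [1, 0, 1, 0]
```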

🧹 Phase 2 — Data Preprocessing

Handle missing values (drop, impute, custom methods)

Encode categorical features (Label/One-hot/Ordinal)

Scale/normalize numerical data

Drop irrelevant/noisy features manually

Optional: PCA or dimensionality reduction for pruning

Save cleaned dataset to data/processed/

🔧 We'll design the preprocessing pipeline here and keep it reusable via src/preprocessing.py.

📦 Outcome: Cleaned, encoded, scaled dataset → ready for ML/DL

🤖 Phase 3 — ML Modeling

Train baseline models (LogReg, Decision Trees, Random Forest)

Train advanced models (SVM, XGBoost, LightGBM)

Hyperparameter tuning (GridSearchCV, RandomizedSearchCV)

Metric comparison: Accuracy, Precision, Recall, F1, AUC-ROC

Save best model (joblib, pickle)

Visualize confusion matrix, ROC curves, etc.

📦 Outcome: Best ML model with evaluation results and saved weights
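The save step can be sketched with joblib on a synthetic classification task; the Pipeline contents and file name are placeholders, not the project's actual model. Saving the whole pipeline (scaler + model) means the reloaded object accepts raw features directly:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the trained best pipeline
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])
pipe.fit(X, y)

# Persist and reload the full preprocessor+model pipeline
path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(pipe, path)
reloaded = joblib.load(path)

# The reloaded pipeline reproduces the original predictions
print((reloaded.predict(X) == pipe.predict(X)).all())
```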

🧠 Phase 4 — Deep Learning (MLP)

Train a deep neural net (MLP) using TensorFlow/Keras

Architecture tuning (layers, dropout, activations)

Compare with ML results

Optional: Add batch normalization, weight initialization, callbacks

Save model (.h5 or SavedModel format)

📦 Outcome: Working DNN for tabular data with metric comparison
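A minimal Keras MLP for tabular regression, sketched on random data; the layer sizes, dropout rate, and epoch count are illustrative starting points, not the project's tuned architecture:

```python
import numpy as np
from tensorflow import keras

# Random stand-in for the preprocessed tabular features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)).astype("float32")
y = (X.sum(axis=1) + rng.normal(0, 0.5, 200)).astype("float32")

# Small MLP: two hidden layers with dropout, linear output for regression
model = keras.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

history = model.fit(X, y, epochs=5, batch_size=32,
                    validation_split=0.2, verbose=0)
```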

πŸ” Phase 5 β€” Interpretability & Explainability Feature importance (for tree-based models)

SHAP, LIME or ELI5 for DL model explainability

Compare which features affect prediction most

πŸ“¦ Outcome: Interpretability plots + explanations for final report
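Tree-based feature importance can be sketched with a RandomForest on synthetic data; the feature names are placeholders. Impurity-based importances sum to 1, so they can be read as relative contributions:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with only 2 of 4 features carrying signal
X, y = make_regression(n_samples=200, n_features=4,
                       n_informative=2, random_state=42)
cols = [f"feature_{i}" for i in range(4)]

rf = RandomForestRegressor(random_state=42).fit(X, y)

# Rank features by impurity-based importance
importances = pd.Series(rf.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```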

☁️ Phase 6 — Deployment (Optional/Bonus)

Convert notebook code to Python scripts/modules

Build simple CLI or Flask web app

Push to HuggingFace Spaces or Render (free tier)

Add usage guide in README

📦 Outcome: Minimal deployable project or API, usable by others
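A minimal Flask prediction endpoint, sketched with a hard-coded stand-in formula; a real app would load the saved pipeline with joblib instead. The route name and payload field are hypothetical:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Stand-in formula; a real app would call pipeline.predict() here
    score = 2.0 * float(payload["hours_studied"]) + 50.0
    return jsonify({"predicted_score": score})

# Exercise the endpoint without running a server
client = app.test_client()
resp = client.post("/predict", json={"hours_studied": 6})
print(resp.get_json())
```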

📘 Phase 7 — Final Docs + README Polish

Clean and complete README.md

Add model performance charts

Final cleanup: remove junk notebooks, export PDFs

📦 Outcome: Shareable GitHub project with clarity and polish

🛠 Tools & Tech Stack

Python, Pandas, NumPy, Matplotlib, Seaborn

Scikit-learn, XGBoost, LightGBM

TensorFlow/Keras

SHAP, ELI5, LIME

Jupyter Notebook, VS Code

Git, GitHub
