Supervised Learning on the Trojan Horse Dataset

This project implements a full supervised machine learning pipeline to classify instances in the Trojan Horse dataset.
The notebook walks through data preparation, model training, evaluation, and visualization using multiple ML algorithms.
The goal is to compare classifier performance and understand which input features drive accurate predictions.

Project Overview

This repository demonstrates a complete end-to-end ML workflow that includes:

Data cleaning and preprocessing
Handling categorical & numerical features
Training and evaluating several classification algorithms
Generating confusion matrices and performance comparisons
Visualizing model behavior
Interpreting feature importance (tree-based models)

This project is designed to show practical ML engineering skills using real datasets and reproducible workflows.

Pipeline Overview

1. Load & Clean Data

Import Trojan Horse dataset
Handle missing values
Encode categorical features (One-Hot / Label Encoding)
Normalize numerical input features (StandardScaler / MinMaxScaler)
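
The cleaning steps above can be sketched with a scikit-learn ColumnTransformer. This is a minimal illustration, not the notebook's actual code: the toy frame and its column names (flow_duration, packet_count, protocol) are hypothetical stand-ins for the real dataset, and median imputation is an assumed choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the real dataset's columns will differ.
df = pd.DataFrame({
    "flow_duration": [1.0, 2.5, np.nan, 4.0],
    "packet_count": [10.0, 20.0, 30.0, np.nan],
    "protocol": ["tcp", "udp", "udp", "tcp"],
})
numeric_cols = ["flow_duration", "packet_count"]
categorical_cols = ["protocol"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        # Categorical columns: one-hot encode, ignoring unseen categories at predict time.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_clean = preprocess.fit_transform(df)
print(X_clean.shape)
```

Fitting the transformer on the training split only, then reusing it on the test split, avoids leaking test statistics into the scaler.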

2. Train/Test Split

Hold-out split for unbiased model evaluation
Fixed random_state so the split is reproducible across runs
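
A hold-out split along these lines might look like the sketch below. The synthetic data from make_classification, the 25% test size, and the seed value 42 are all assumptions for illustration; stratify=y is an optional extra that keeps class proportions similar in both splits.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Repeating the call with the same random_state yields the same split.
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape, (X_train == X_train_again).all())
```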

3. Model Training (Classifiers Used)

The following models are implemented and benchmarked:

Logistic Regression
Decision Tree
Random Forest
k-Nearest Neighbors (k-NN)
Support Vector Machine (SVM)
Naïve Bayes
Gradient Boosting / XGBoost
Simple Ensemble / Voting Classifier

This allows comparison between linear, tree-based, distance-based, probabilistic, and boosting-based methods.
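
One way to benchmark these model families side by side is a dictionary of estimators fitted in a loop. The sketch below runs on synthetic data with default-ish hyperparameters, so its scores say nothing about the Trojan Horse dataset. It also assumes scikit-learn's GradientBoostingClassifier in place of XGBoost, and the three members of the voting ensemble are an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the real dataset).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale-sensitive models get a StandardScaler in front of them.
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
forest = RandomForestClassifier(n_estimators=100, random_state=0)

models = {
    "Logistic Regression": log_reg,
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": forest,
    "k-NN": knn,
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    # Hard voting: each member casts one vote, majority wins.
    "Voting Ensemble": VotingClassifier(
        estimators=[("lr", log_reg), ("rf", forest), ("knn", knn)],
        voting="hard",
    ),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)

for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {acc:.3f}")
```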

4. Evaluation

Accuracy, Precision, Recall, F1-Score
Confusion matrices for every classifier
ROC curves (optional)
Feature importance for tree-based models
Misclassification analysis
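
To make the metrics concrete, here is a small worked example on hand-written labels (made up for illustration, not taken from the notebook). It treats class 1 as the positive class and lists the indices of misclassified samples as a starting point for error analysis.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Made-up ground truth and predictions for ten samples.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

# Positions where the prediction disagrees with the label.
misclassified = np.flatnonzero(y_true != y_pred)

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print("misclassified indices:", misclassified.tolist())
```

Here TP = 4, FP = 1, FN = 2, and TN = 3, so accuracy is 7/10, precision is 4/5, recall is 4/6, and F1 is 8/11.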

5. Visualization

Model performance bar charts
Confusion matrix heatmaps
Decision boundary visualizations (for 2-feature slices)
Feature importance plots
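
As an illustration of the feature importance plot, the sketch below fits a Random Forest on synthetic data and draws its impurity-based importances with Matplotlib. The feature names are placeholders, and the Agg backend line is only there so the code runs without a display; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_  # impurity-based, sums to 1
order = np.argsort(importances)  # ascending, so the largest bar ends up on top

fig, ax = plt.subplots(figsize=(6, 4))
ax.barh([feature_names[i] for i in order], importances[order])
ax.set_xlabel("Mean decrease in impurity")
ax.set_title("Random Forest feature importance")
fig.tight_layout()
```

Impurity-based importances can overstate high-cardinality features, so they are best read as a rough ranking rather than an exact measure.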

Example Results Summary

Model	Accuracy	Notes
Logistic Regression	~53.87%	Baseline
Random Forest	~76.69%	Best overall, stable performance
Decision Tree	~64.90%	Good performance
k-NN	~67.17%	Sensitive to scaling
Naïve Bayes	~51.24%	Fast but inaccurate due to correlated features
Ensemble	~68.26%	Competitive and stable
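
The accuracies reported in the table can be ranked and plotted as a performance bar chart. The values below are copied from the table; the plotting choices (sort order, axis range, label rotation) are assumptions, and the Agg backend line is only needed when running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside Jupyter
import matplotlib.pyplot as plt

# Approximate accuracies (%) as reported in the results table.
reported_accuracy = {
    "Logistic Regression": 53.87,
    "Random Forest": 76.69,
    "Decision Tree": 64.90,
    "k-NN": 67.17,
    "Naive Bayes": 51.24,
    "Ensemble": 68.26,
}

# Sort from best to worst so the chart reads left to right.
ranked = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
names = [name for name, _ in ranked]
values = [value for _, value in ranked]

fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(names, values)
ax.set_ylabel("Accuracy (%)")
ax.set_ylim(0, 100)
ax.tick_params(axis="x", rotation=30)
fig.tight_layout()
```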


Tech Stack

Python
scikit-learn
Pandas
NumPy
Matplotlib / Seaborn
Jupyter Notebook

Skills Demonstrated

End-to-end supervised ML workflow
Data preprocessing and feature engineering
Training & comparing multiple classifiers
Model evaluation and interpretation
Visualization of ML outputs
Reproducible Jupyter Notebook analysis

These skills directly map to real-world ML engineering and data science roles.

