A from-scratch implementation of Logistic Regression using NumPy to classify whether a student passes or fails a DMV written test based on two exam scores.
This project demonstrates the mathematical foundations of logistic regression, including:
- Logistic (sigmoid) function
- Cost function
- Gradient computation
- Gradient descent optimization
- Decision boundary visualization
The entire model is implemented without using machine learning libraries like Scikit-Learn, focusing purely on NumPy-based numerical computation.
This project builds a binary classification model that learns the relationship between two exam scores and the probability of passing the DMV written test.
The workflow includes:
- Loading and exploring the dataset
- Visualizing the dataset
- Implementing the sigmoid function
- Defining the logistic regression cost function
- Computing gradients
- Implementing gradient descent from scratch
- Plotting convergence of the cost function
- Visualizing the decision boundary
- Making predictions using trained parameters
File: DMV_Written_Tests.csv
The dataset contains 100 training examples with two input features.
| Feature | Description |
|---|---|
| DMV_Test_1 | Score in DMV Written Test 1 |
| DMV_Test_2 | Score in DMV Written Test 2 |
| Results | Binary label (1 = Pass, 0 = Fail) |
The goal is to predict the probability of passing based on the two test scores.
Essential libraries used in the project:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
Visualization styles and plotting settings are also configured.
The dataset is loaded using Pandas.
```python
data = pd.read_csv("DMV_Written_Tests.csv")
```
Initial inspection is done using:
```python
data.head()
data.info()
```
The input features and labels are separated into:
```python
scores = data[['DMV_Test_1', 'DMV_Test_2']].values
results = data['Results'].values
```
Before training the model, the dataset is visualized using Seaborn scatter plots.
- Green triangles represent students who passed
- Red crosses represent students who failed
This helps visualize whether the dataset is linearly separable.
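The scatter plot described above can be sketched as follows. The data here is a hypothetical stand-in for `DMV_Written_Tests.csv`, generated only so the snippet is self-contained:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real dataset
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "DMV_Test_1": rng.uniform(30, 100, 100),
    "DMV_Test_2": rng.uniform(30, 100, 100),
})
data["Results"] = (data["DMV_Test_1"] + data["DMV_Test_2"] > 130).astype(int)

# Green triangles for pass (1), red crosses for fail (0)
ax = sns.scatterplot(
    x="DMV_Test_1", y="DMV_Test_2",
    hue="Results", style="Results",
    markers={1: "^", 0: "X"},
    palette={1: "green", 0: "red"},
    data=data,
)
```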
Logistic regression uses the sigmoid function to map predictions to probabilities.
```python
def logistic_function(x):
    return 1 / (1 + np.exp(-x))
```
- Output range: 0 to 1
- Interpreted as probability
- Threshold of 0.5 used for classification
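A quick check of these properties with a few sample inputs:

```python
import numpy as np

def logistic_function(x):
    return 1 / (1 + np.exp(-x))

print(logistic_function(0))    # exactly 0.5 — the classification threshold
print(logistic_function(10))   # close to 1 (high probability of passing)
print(logistic_function(-10))  # close to 0 (high probability of failing)
```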
The cost function for logistic regression is defined as:
```python
def compute_cost(theta, x, y):
    m = len(y)
    y_pred = logistic_function(np.dot(x, theta))
    # Cross-entropy error for each training example
    error = y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)
    cost = (-1 / m) * np.sum(error)
    # Vectorized gradient of the cost with respect to theta
    gradient = (1 / m) * np.dot(x.T, y_pred - y)
    return cost, gradient
```
This function returns:
- Current cost value
- Gradient vector
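A useful sanity check: when θ is initialized to all zeros, every prediction is 0.5 and the cost equals ln 2 ≈ 0.693 regardless of the data. The snippet below verifies this on hypothetical synthetic data standing in for the real feature matrix:

```python
import numpy as np

def logistic_function(x):
    return 1 / (1 + np.exp(-x))

def compute_cost(theta, x, y):
    m = len(y)
    y_pred = logistic_function(np.dot(x, theta))
    error = y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)
    cost = (-1 / m) * np.sum(error)
    gradient = (1 / m) * np.dot(x.T, y_pred - y)
    return cost, gradient

# Hypothetical data: 4 examples, bias column + 2 features
rng = np.random.default_rng(0)
X = np.append(np.ones((4, 1)), rng.normal(size=(4, 2)), axis=1)
y = np.array([[0], [1], [1], [0]])
theta = np.zeros((3, 1))

cost, grad = compute_cost(theta, X, y)
print(round(cost, 4))  # 0.6931, i.e. ln(2)
```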
Feature scaling is applied to normalize the dataset.
```python
mean_scores = np.mean(scores, axis=0)
std_scores = np.std(scores, axis=0)
scores = (scores - mean_scores) / std_scores
```
This improves gradient descent convergence speed.
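The effect of the scaling step can be checked directly: after standardization, each feature column has zero mean and unit standard deviation. The scores below are hypothetical values standing in for the two DMV test columns:

```python
import numpy as np

# Hypothetical raw scores (two features per row)
scores = np.array([[34.6, 78.0],
                   [30.3, 43.9],
                   [60.2, 86.3],
                   [79.0, 75.3]])

mean_scores = np.mean(scores, axis=0)
std_scores = np.std(scores, axis=0)
scaled = (scores - mean_scores) / std_scores

print(np.allclose(scaled.mean(axis=0), 0))  # True: zero mean per feature
print(np.allclose(scaled.std(axis=0), 1))   # True: unit std per feature
```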
A bias term is then added:
```python
rows = scores.shape[0]
X = np.append(np.ones((rows, 1)), scores, axis=1)
```
Gradient descent is implemented from scratch to optimize parameters.
```python
def gradient_descent(x, y, theta, alpha, iterations):
    costs = []
    for i in range(iterations):
        cost, gradient = compute_cost(theta, x, y)
        theta -= alpha * gradient
        costs.append(cost)
    return theta, costs
```
Parameters used:
- Learning rate (α) = 1
- Iterations = 500
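Putting the pieces together, the full training loop can be sketched end to end. The two-cluster data below is hypothetical, standing in for the scaled exam scores, but the parameters (α = 1, 500 iterations) match those listed above:

```python
import numpy as np

def logistic_function(x):
    return 1 / (1 + np.exp(-x))

def compute_cost(theta, x, y):
    m = len(y)
    y_pred = logistic_function(np.dot(x, theta))
    error = y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)
    cost = (-1 / m) * np.sum(error)
    gradient = (1 / m) * np.dot(x.T, y_pred - y)
    return cost, gradient

def gradient_descent(x, y, theta, alpha, iterations):
    costs = []
    for i in range(iterations):
        cost, gradient = compute_cost(theta, x, y)
        theta -= alpha * gradient
        costs.append(cost)
    return theta, costs

# Hypothetical standardized data: two overlapping Gaussian clusters
rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, size=(50, 2))
neg = rng.normal(-1.0, 1.0, size=(50, 2))
scores = np.vstack([pos, neg])
y = np.vstack([np.ones((50, 1)), np.zeros((50, 1))])
X = np.append(np.ones((100, 1)), scores, axis=1)

theta, costs = gradient_descent(X, y, np.zeros((3, 1)), alpha=1.0, iterations=500)
print(costs[0] > costs[-1])  # True: the cost decreases over training
```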
The change in cost over iterations is visualized to confirm optimization.
```python
plt.plot(costs)
plt.xlabel("Iterations")
plt.ylabel("J(Θ)")
```
A monotonically decreasing cost curve indicates proper convergence.
The trained logistic regression model produces a linear decision boundary.
The boundary equation is derived from:
```python
y_boundary = -(theta[0] + theta[1] * x_boundary) / theta[2]
```
This boundary is plotted on top of the scatter plot to visualize classification separation.
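The boundary equation follows from setting θ₀ + θ₁x₁ + θ₂x₂ = 0, the point where the sigmoid crosses 0.5. A small check with hypothetical parameter values confirms that every point on the line sits exactly on that threshold:

```python
import numpy as np

# Hypothetical learned parameters: (bias, weight_1, weight_2)
theta = np.array([0.5, 2.0, 1.0])

x_boundary = np.array([-2.0, 2.0])  # two x-values span the line
y_boundary = -(theta[0] + theta[1] * x_boundary) / theta[2]

# Any (x, y) on this line makes the linear score exactly zero,
# i.e. sigmoid(z) = 0.5
z = theta[0] + theta[1] * x_boundary + theta[2] * y_boundary
print(np.allclose(z, 0))  # True
```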
Predictions are made using the optimized parameters.
```python
def predict(theta, x):
    results = x.dot(theta)
    return results > 0
```
The model predicts:
- 1 → Pass
- 0 → Fail
Training accuracy is computed as:
```python
p = predict(theta, X)
```
The number of correct predictions is compared with the actual labels to evaluate the training accuracy of the model.
Training Accuracy ≈ 89%
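The accuracy comparison can be sketched as below; the labels and predictions here are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

def predict(theta, x):
    return x.dot(theta) > 0

# Hypothetical labels and predictions for five examples
y = np.array([1, 0, 1, 1, 0])
p = np.array([True, False, False, True, False])

# Fraction of predictions matching the true labels, as a percentage
accuracy = np.mean(p == y) * 100
print(accuracy)  # 80.0
```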
Example: Predict the probability of passing for a student with scores 50 and 79.
```python
test = np.array([50, 79])
# New scores must be scaled with the training mean/std,
# and a bias term prepended, before applying theta
test = (test - mean_scores) / std_scores
test = np.append(np.ones(1), test)
probability = logistic_function(test.dot(theta))
```
Output:
Predicted Probability of Passing ≈ 0.74
Technologies used:
- Python
- NumPy
- Pandas
- Matplotlib
- Seaborn
This project demonstrates:
- Logistic regression without ML libraries.
- Numerical optimization using gradient descent.
- Implementation of cost functions and gradients.
- Feature scaling.
- Decision boundary visualization.
- Binary classification from scratch.