A from-scratch implementation of Logistic Regression using NumPy to classify whether a student passes or fails a DMV written test based on two exam scores.
This project demonstrates the mathematical foundations of logistic regression, including:
- Logistic (sigmoid) function
- Cost function
- Gradient computation
- Gradient descent optimization
- Decision boundary visualization
The entire model is implemented without using machine learning libraries like Scikit-Learn, focusing purely on NumPy-based numerical computation.
This project builds a binary classification model that learns the relationship between two exam scores and the probability of passing the DMV written test.
The workflow includes:
- Loading and exploring the dataset
- Visualizing the dataset
- Implementing the sigmoid function
- Defining the logistic regression cost function
- Computing gradients
- Implementing gradient descent from scratch
- Plotting convergence of the cost function
- Visualizing the decision boundary
- Making predictions using trained parameters
File: DMV_Written_Tests.csv
The dataset contains 100 training examples with two input features.
| Feature | Description |
|---|---|
| DMV_Test_1 | Score in DMV Written Test 1 |
| DMV_Test_2 | Score in DMV Written Test 2 |
| Results | Binary label (1 = Pass, 0 = Fail) |
The goal is to predict the probability of passing based on the two test scores.
Essential libraries used in the project:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
Visualization styles and plotting settings are also configured.
The dataset is loaded using Pandas.
```python
data = pd.read_csv("DMV_Written_Tests.csv")
```
Initial inspection is done using:
```python
data.head()
data.info()
```
The input features and labels are separated into:
```python
scores = data[['DMV_Test_1', 'DMV_Test_2']].values
results = data['Results'].values
```
Before training the model, the dataset is visualized using Seaborn scatter plots.
- Green triangles represent students who passed
- Red crosses represent students who failed
This helps visualize whether the dataset is linearly separable.
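The scatter plot described above can be sketched as follows. The data here is a hypothetical stand-in for `DMV_Written_Tests.csv`, generated only so the snippet is self-contained:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real dataset
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "DMV_Test_1": rng.uniform(30, 100, 100),
    "DMV_Test_2": rng.uniform(30, 100, 100),
})
data["Results"] = (data["DMV_Test_1"] + data["DMV_Test_2"] > 130).astype(int)

# Green triangles for pass (1), red crosses for fail (0)
ax = sns.scatterplot(
    x="DMV_Test_1", y="DMV_Test_2",
    hue="Results", style="Results",
    markers={1: "^", 0: "X"},
    palette={1: "green", 0: "red"},
    data=data,
)
```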
Logistic regression uses the sigmoid function to map predictions to probabilities.
```python
def logistic_function(x):
    return 1 / (1 + np.exp(-x))
```
- Output range: 0 to 1
- Interpreted as probability
- Threshold of 0.5 used for classification
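A quick check of these properties with a few sample inputs:

```python
import numpy as np

def logistic_function(x):
    return 1 / (1 + np.exp(-x))

print(logistic_function(0))    # exactly 0.5 — the classification threshold
print(logistic_function(10))   # close to 1 (high probability of passing)
print(logistic_function(-10))  # close to 0 (high probability of failing)
```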
The cost function for logistic regression is defined as:
```python
def compute_cost(theta, x, y):
    m = len(y)
    y_pred = logistic_function(np.dot(x, theta))
    # Cross-entropy error for each training example
    error = y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)
    cost = (-1 / m) * np.sum(error)
    # Vectorized gradient of the cost with respect to theta
    gradient = (1 / m) * np.dot(x.T, y_pred - y)
    return cost, gradient
```
This function returns:
- Current cost value
- Gradient vector
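A useful sanity check: when θ is initialized to all zeros, every prediction is 0.5 and the cost equals ln 2 ≈ 0.693 regardless of the data. The snippet below verifies this on hypothetical synthetic data standing in for the real feature matrix:

```python
import numpy as np

def logistic_function(x):
    return 1 / (1 + np.exp(-x))

def compute_cost(theta, x, y):
    m = len(y)
    y_pred = logistic_function(np.dot(x, theta))
    error = y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)
    cost = (-1 / m) * np.sum(error)
    gradient = (1 / m) * np.dot(x.T, y_pred - y)
    return cost, gradient

# Hypothetical data: 4 examples, bias column + 2 features
rng = np.random.default_rng(0)
X = np.append(np.ones((4, 1)), rng.normal(size=(4, 2)), axis=1)
y = np.array([[0], [1], [1], [0]])
theta = np.zeros((3, 1))

cost, grad = compute_cost(theta, X, y)
print(round(cost, 4))  # 0.6931, i.e. ln(2)
```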
Feature scaling is applied to normalize the dataset.
```python
mean_scores = np.mean(scores, axis=0)
std_scores = np.std(scores, axis=0)
scores = (scores - mean_scores) / std_scores
```
This improves gradient descent convergence speed.
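The effect of the scaling step can be checked directly: after standardization, each feature column has zero mean and unit standard deviation. The scores below are hypothetical values standing in for the two DMV test columns:

```python
import numpy as np

# Hypothetical raw scores (two features per row)
scores = np.array([[34.6, 78.0],
                   [30.3, 43.9],
                   [60.2, 86.3],
                   [79.0, 75.3]])

mean_scores = np.mean(scores, axis=0)
std_scores = np.std(scores, axis=0)
scaled = (scores - mean_scores) / std_scores

print(np.allclose(scaled.mean(axis=0), 0))  # True: zero mean per feature
print(np.allclose(scaled.std(axis=0), 1))   # True: unit std per feature
```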
A bias term is then added:
```python
rows = scores.shape[0]
X = np.append(np.ones((rows, 1)), scores, axis=1)
```
Gradient descent is implemented from scratch to optimize parameters.
```python
def gradient_descent(x, y, theta, alpha, iterations):
    costs = []
    for i in range(iterations):
        cost, gradient = compute_cost(theta, x, y)
        theta -= alpha * gradient
        costs.append(cost)
    return theta, costs
```
Parameters used:
- Learning rate (α) = 1
- Iterations = 500
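Putting the pieces together, the full training loop can be sketched end to end. The two-cluster data below is hypothetical, standing in for the scaled exam scores, but the parameters (α = 1, 500 iterations) match those listed above:

```python
import numpy as np

def logistic_function(x):
    return 1 / (1 + np.exp(-x))

def compute_cost(theta, x, y):
    m = len(y)
    y_pred = logistic_function(np.dot(x, theta))
    error = y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)
    cost = (-1 / m) * np.sum(error)
    gradient = (1 / m) * np.dot(x.T, y_pred - y)
    return cost, gradient

def gradient_descent(x, y, theta, alpha, iterations):
    costs = []
    for i in range(iterations):
        cost, gradient = compute_cost(theta, x, y)
        theta -= alpha * gradient
        costs.append(cost)
    return theta, costs

# Hypothetical standardized data: two overlapping Gaussian clusters
rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, size=(50, 2))
neg = rng.normal(-1.0, 1.0, size=(50, 2))
scores = np.vstack([pos, neg])
y = np.vstack([np.ones((50, 1)), np.zeros((50, 1))])
X = np.append(np.ones((100, 1)), scores, axis=1)

theta, costs = gradient_descent(X, y, np.zeros((3, 1)), alpha=1.0, iterations=500)
print(costs[0] > costs[-1])  # True: the cost decreases over training
```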
The change in cost over iterations is visualized to confirm optimization.
```python
plt.plot(costs)
plt.xlabel("Iterations")
plt.ylabel("J(Θ)")
```
A monotonically decreasing cost curve indicates proper convergence.
The trained logistic regression model produces a linear decision boundary.
The boundary equation is derived from:
```python
y_boundary = -(theta[0] + theta[1] * x_boundary) / theta[2]
```
This boundary is plotted on top of the scatter plot to visualize classification separation.
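The boundary equation follows from setting θ₀ + θ₁x₁ + θ₂x₂ = 0, the point where the sigmoid crosses 0.5. A small check with hypothetical parameter values confirms that every point on the line sits exactly on that threshold:

```python
import numpy as np

# Hypothetical learned parameters: (bias, weight_1, weight_2)
theta = np.array([0.5, 2.0, 1.0])

x_boundary = np.array([-2.0, 2.0])  # two x-values span the line
y_boundary = -(theta[0] + theta[1] * x_boundary) / theta[2]

# Any (x, y) on this line makes the linear score exactly zero,
# i.e. sigmoid(z) = 0.5
z = theta[0] + theta[1] * x_boundary + theta[2] * y_boundary
print(np.allclose(z, 0))  # True
```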
Predictions are made using the optimized parameters.
```python
def predict(theta, x):
    results = x.dot(theta)
    return results > 0
```
The model predicts:
- 1 → Pass
- 0 → Fail
Training accuracy is computed as:
```python
p = predict(theta, X)
```
The number of correct predictions is compared with the actual labels to evaluate the training accuracy of the model.
Training Accuracy ≈ 89%
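The accuracy comparison can be sketched as below; the labels and predictions here are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

def predict(theta, x):
    return x.dot(theta) > 0

# Hypothetical labels and predictions for five examples
y = np.array([1, 0, 1, 1, 0])
p = np.array([True, False, False, True, False])

# Fraction of predictions matching the true labels, as a percentage
accuracy = np.mean(p == y) * 100
print(accuracy)  # 80.0
```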
Example: Predict the probability of passing for a student with scores 50 and 79.
```python
test = np.array([50, 79])
# New scores must be scaled with the training mean/std,
# and a bias term prepended, before applying theta
test = (test - mean_scores) / std_scores
test = np.append(np.ones(1), test)
probability = logistic_function(test.dot(theta))
```
Output:
Predicted Probability of Passing ≈ 0.74
Technologies used:
- Python
- NumPy
- Pandas
- Matplotlib
- Seaborn
This project demonstrates:
- Logistic regression without ML libraries.
- Numerical optimization using gradient descent.
- Implementation of cost functions and gradients.
- Feature scaling.
- Decision boundary visualization.
- Binary classification from scratch.