🛡️ Phishing Website Detector

📌 Introduction

This project focuses on detecting phishing websites by analyzing URLs using machine learning.
The workflow includes dataset cleaning, feature engineering, training multiple classification models, selecting the best-performing one, and allowing real-time predictions for any user-entered URL.

🎯 Objectives

Clean and prepare the provided phishing dataset.
Extract important numerical features from the URL dataset.
Train three ML models:
- Logistic Regression
- Random Forest
- XGBoost
Evaluate each model using accuracy, precision, recall, and F1-score.
Automatically select and save the best model.
Build a hybrid URL feature extractor to analyze new URLs.
Predict whether a given URL is Phishing or Legitimate.

📂 Dataset

File Used: provided_dataset.csv
Contains:
- URL-based features
- Host/lexical properties
- Labels indicating phishing or legitimate
The column status is converted into a binary label:
- 1 → Phishing
- 0 → Legitimate
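
A minimal sketch of the label conversion. The source does not say how the raw values in status are spelled, so the strings "phishing" and "legitimate" below are assumptions, as are the sample rows; adjust the mapping to the actual data.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from provided_dataset.csv.
df = pd.DataFrame({
    "url": ["http://example.com", "http://login-verify.example.net"],
    "status": ["legitimate", "phishing"],
})

# Assumed spellings of the raw labels; values missing from the map become NaN.
label_map = {"phishing": 1, "legitimate": 0}
df["label"] = df["status"].str.strip().str.lower().map(label_map)

# Fail loudly if any status value was not covered by the mapping.
if df["label"].isna().any():
    raise ValueError("unexpected value in status column")
df["label"] = df["label"].astype(int)
```

Checking for unmapped values before casting avoids silently training on rows with a missing label.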

🛠️ Tools & Libraries

Python
Pandas, NumPy – data handling
Matplotlib, Seaborn – basic analysis/visualization
Scikit-learn – ML models, preprocessing, evaluation
XGBoost – gradient boosting classifier
WHOIS, socket, requests – real-time URL feature extraction
Joblib – saving the trained model and feature columns

🔎 Data Preprocessing

The dataset goes through the following cleaning steps:

Removing constant columns and columns with a high share of missing values
Converting object-type numeric values to actual numbers
Dropping columns that cannot be converted
Replacing NaN and inf values with the column median
Preparing feature matrix X and label vector y
Splitting into 70% train and 30% test
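
The cleaning steps above could be sketched roughly as follows. This is not the notebook's code: the 50% missing-value threshold, the stratified split, the random seed, and the demo column names (length_url, ratio_digits) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def clean_features(df, label_col="label", max_missing=0.5):
    """Return (X, y) after the cleaning steps listed above.

    max_missing is an assumed threshold; the source does not state one.
    """
    y = df[label_col]
    X = df.drop(columns=[label_col])

    # Convert object columns that hold numbers; non-numeric text becomes NaN.
    X = X.apply(pd.to_numeric, errors="coerce")

    # Treat +/-inf like missing values so the median fill covers them.
    X = X.replace([np.inf, -np.inf], np.nan)

    # Drop columns that are mostly missing. Columns that could not be
    # converted are entirely NaN at this point, so they are dropped too.
    X = X.loc[:, X.isna().mean() <= max_missing]

    # Drop constant columns: they carry no information.
    X = X.loc[:, X.nunique(dropna=True) > 1]

    # Fill the remaining gaps with each column's median.
    X = X.fillna(X.median())
    return X, y


# Hypothetical demo data standing in for provided_dataset.csv.
demo = pd.DataFrame({
    "url": [f"site{i}.com" for i in range(10)],            # text, gets dropped
    "length_url": ["12", "40", "33", "18", "51",
                   "27", "64", "22", "45", "30"],          # numbers stored as text
    "ratio_digits": [0.0, np.inf, 0.1, np.nan, 0.3,
                     0.2, 0.4, 0.0, 0.5, 0.1],
    "constant": [1] * 10,                                  # constant, gets dropped
    "label": [0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = clean_features(demo)

# 70% train / 30% test; stratify keeps the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```

Replacing inf with NaN first means a single median fill handles both cases.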

🤖 Machine Learning Models

The project trains and evaluates the following classifiers:

Logistic Regression
Random Forest Classifier
XGBoost Classifier

Each model is evaluated using:

Accuracy
Precision
Recall
F1-score
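
For reference, the four metrics can be computed by hand from the confusion counts. The sketch below treats Phishing (1) as the positive class, which is an assumption about how the project scores its models; the toy label lists are made up for illustration. Scikit-learn offers the same calculations as accuracy_score, precision_score, recall_score, and f1_score.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with phishing (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (not results from the project).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

With these toy labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four metrics come out to 0.75.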

The model with the highest F1-score is chosen as the final model.
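The best model is saved automatically as BestModelName.joblib (the file is named after the winning model), and the training feature columns are saved alongside it as feature_columns.joblib.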
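
A rough sketch of the select-and-save step, under several assumptions: synthetic data stands in for the real dataset, the hyperparameters and feature names are placeholders, files are written to a temporary directory, and XGBoost is treated as optional so the example still runs where it is not installed.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the project trains on provided_dataset.csv.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
feature_columns = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}
try:
    # Optional import so the sketch also runs without xgboost installed.
    from xgboost import XGBClassifier
    candidates["XGBoost"] = XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    pass

# Fit every candidate and score it on the held-out split.
f1_by_model = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    f1_by_model[name] = f1_score(y_test, model.predict(X_test))

# Pick the model with the highest F1-score.
best_name = max(f1_by_model, key=f1_by_model.get)
best_model = candidates[best_name]

# Save the winner under its own name, plus the column order it was trained on.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, f"{best_name}.joblib")
columns_path = os.path.join(out_dir, "feature_columns.joblib")
joblib.dump(best_model, model_path)
joblib.dump(feature_columns, columns_path)
```

Saving the column list next to the model lets the prediction step rebuild a feature vector in exactly the order the model expects.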

🧪 Hybrid URL Feature Extraction

For real-time URL prediction, the project extracts features such as:

URL length
Hostname length
Count of dots, hyphens, slashes, special characters
Digit count & digit-to-length ratio
WHOIS information (domain age, registration length)
DNS record validation
Simple SSL status
Redirect count
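
The string-based features in this list can be sketched with the standard library alone. The feature names and the set of characters counted as "special" are assumptions, since the source does not define them. WHOIS, DNS, SSL, and redirect checks are left out here because they require network access.

```python
from urllib.parse import urlparse


def lexical_features(url):
    """Compute the purely string-based features from the list above."""
    hostname = urlparse(url).hostname or ""
    digits = sum(ch.isdigit() for ch in url)
    # Assumed definition of "special characters"; the source does not list them.
    specials = sum(url.count(ch) for ch in "@?&=%_~")
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "dot_count": url.count("."),
        "hyphen_count": url.count("-"),
        "slash_count": url.count("/"),
        "special_char_count": specials,
        "digit_count": digits,
        "digit_ratio": digits / len(url) if url else 0.0,
    }


features = lexical_features("http://secure-login.example.com/verify?id=123")
```

Note that urlparse only recognizes the hostname when the URL includes a scheme such as http://; for a bare "example.com" the hostname length would be 0 in this sketch.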

Unavailable or unsupported features are assigned a default value (-1) to maintain column alignment.
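
One way to implement this alignment is sketched below. The helper name align_features and the column names are hypothetical; in the project the column list would be loaded from feature_columns.joblib.

```python
DEFAULT_VALUE = -1  # placeholder for features that could not be computed


def align_features(extracted, feature_columns):
    """Order extracted values to match the training columns, filling gaps with -1."""
    row = []
    for column in feature_columns:
        value = extracted.get(column)
        row.append(DEFAULT_VALUE if value is None else value)
    return row


# Hypothetical column names; the real list comes from feature_columns.joblib.
feature_columns = ["url_length", "dot_count", "domain_age", "redirect_count"]
extracted = {"dot_count": 2, "url_length": 45, "domain_age": None}

row = align_features(extracted, feature_columns)
```

Here domain_age is present but unavailable (None) and redirect_count is missing entirely; both end up as -1, and the values are emitted in training order regardless of the dictionary's order.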

🚀 URL Prediction

The user can enter any URL at the console prompt:

Enter the url:

The system outputs:

Extracted feature vector
Prediction → Phishing or Legitimate
Probability scores (if supported by the model)

Example:

URL: http://example.com
Prediction: Legitimate
Probabilities: [0.93, 0.07]
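
The prediction step, including the "if supported" check for probabilities, might look like the sketch below. The predict_url helper and the StubModel class are hypothetical; in the project the model would be the saved classifier loaded with joblib.

```python
LABELS = {1: "Phishing", 0: "Legitimate"}


def predict_url(model, feature_row):
    """Return (label, probabilities); probabilities is None if unsupported."""
    prediction = int(model.predict([feature_row])[0])
    probabilities = None
    if hasattr(model, "predict_proba"):
        probabilities = [float(p) for p in model.predict_proba([feature_row])[0]]
    return LABELS[prediction], probabilities


class StubModel:
    """Hypothetical stand-in for the saved classifier, for illustration only."""

    def predict(self, rows):
        # Flag a row as phishing when its first feature (URL length) is large.
        return [1 if row[0] > 75 else 0 for row in rows]


label, probabilities = predict_url(StubModel(), [18, 1, -1, -1])
```

The stub has no predict_proba method, so probabilities stays None. For scikit-learn style classifiers the probability columns follow model.classes_, so with labels 0 and 1 the first value is the probability of Legitimate and the second the probability of Phishing.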

Author

Ritesh Kumar Pandit

B.Tech CSE — IILM University

