ritesh-begin/Phishing_website_prediction

🛡️ Phishing Website Detector

📌 Introduction

This project focuses on detecting phishing websites by analyzing URLs using machine learning.
The workflow includes dataset cleaning, feature engineering, training multiple classification models, selecting the best-performing one, and allowing real-time predictions for any user-entered URL.


🎯 Objectives

  • Clean and prepare the provided phishing dataset.
  • Extract important numerical features from the URL dataset.
  • Train three ML models:
    • Logistic Regression
    • Random Forest
    • XGBoost
  • Evaluate each model using accuracy, precision, recall, and F1-score.
  • Automatically select and save the best model.
  • Build a hybrid URL feature extractor to analyze new URLs.
  • Predict whether a given URL is Phishing or Legitimate.

📂 Dataset

  • File Used: provided_dataset.csv
  • Contains:
    • URL-based features
    • Host/lexical properties
    • Labels indicating phishing or legitimate
  • The status column is converted into a binary label:
    • 1 → Phishing
    • 0 → Legitimate
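
The label conversion can be sketched as follows. This is a minimal illustration, assuming the status column holds the strings "phishing" and "legitimate"; the sample rows and the new column name label are hypothetical stand-ins for provided_dataset.csv.

```python
import pandas as pd

# Hypothetical sample standing in for provided_dataset.csv.
df = pd.DataFrame({
    "url": ["http://a.example", "http://b.example", "http://c.example"],
    "status": ["phishing", "legitimate", "Phishing "],
})

# Assumed string values; normalise case and whitespace before mapping.
label_map = {"phishing": 1, "legitimate": 0}
df["label"] = df["status"].str.strip().str.lower().map(label_map)

# Rows with an unrecognised status map to NaN and are dropped.
df = df.dropna(subset=["label"]).astype({"label": int})
```

Dropping unmapped rows is one possible choice; raising an error on unexpected status values would be equally reasonable.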

🛠️ Tools & Libraries

  • Python
  • Pandas, NumPy – data handling
  • Matplotlib, Seaborn – basic analysis/visualization
  • Scikit-learn – ML models, preprocessing, evaluation
  • XGBoost – gradient boosting classifier
  • WHOIS, socket, requests – real-time URL feature extraction
  • Joblib – saving the trained model and feature columns

🔎 Data Preprocessing

The dataset goes through the following cleaning steps:

  • Removing constant and high-missing columns
  • Converting object-type numeric values to actual numbers
  • Dropping columns that cannot be converted
  • Replacing NaN and inf values with the column median
  • Preparing feature matrix X and label vector y
  • Splitting into 70% train and 30% test
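
The cleaning steps above could look roughly like this. It is a sketch, not the project's actual code: the 50% missing-value threshold, the column names, and the helper names clean_features and split_70_30 are assumptions, and the step order is slightly rearranged so that unconvertible columns fall out through the missing-value filter. The split uses pandas only, as a stand-in for scikit-learn's train_test_split with test_size=0.3.

```python
import numpy as np
import pandas as pd


def clean_features(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Return a numeric, gap-free copy of df (max_missing is an assumed threshold)."""
    out = df.copy()
    # Convert non-numeric columns; values that cannot be parsed become NaN.
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], errors="coerce")
    # Treat infinities as missing values.
    out = out.replace([np.inf, -np.inf], np.nan)
    # Drop high-missing columns (this also removes unconvertible ones).
    out = out.loc[:, out.isna().mean() <= max_missing]
    # Drop constant columns.
    out = out.loc[:, out.nunique(dropna=True) > 1]
    # Fill the remaining gaps with each column's median.
    return out.fillna(out.median())


def split_70_30(X: pd.DataFrame, y: pd.Series, seed: int = 42):
    """Random 70/30 split; assumes a unique index."""
    train_idx = X.sample(frac=0.7, random_state=seed).index
    test_idx = X.index.difference(train_idx)
    return X.loc[train_idx], X.loc[test_idx], y.loc[train_idx], y.loc[test_idx]


# Hypothetical toy data with one column per kind of problem.
raw = pd.DataFrame({
    "length_url": ["37", "77", "126", "18", "55", "40", "62", "91", "23", "70"],
    "nb_dots": [3, 1, 4, np.inf, 2, 2, 3, np.nan, 1, 2],
    "constant": [0] * 10,
    "notes": list("abcdefghij"),
    "label": [0, 1, 1, 0, 1, 0, 1, 1, 0, 0],
})
y = raw["label"]
X = clean_features(raw.drop(columns="label"))
X_train, X_test, y_train, y_test = split_70_30(X, y)
```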

🤖 Machine Learning Models

The project trains and evaluates the following classifiers:

  1. Logistic Regression
  2. Random Forest Classifier
  3. XGBoost Classifier

Each model is evaluated using:

  • Accuracy
  • Precision
  • Recall
  • F1-score

The model with the highest F1-score is chosen as the final model.
This best model is saved automatically as:

  • BestModelName.joblib
  • feature_columns.joblib
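
To make the selection rule concrete, here is a small pure-Python sketch of the four metrics and the "highest F1 wins" choice. The labels and predictions are made-up toy values, not results from this project, and the names model_a, model_b, and model_c are placeholders for the three classifiers.

```python
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class (1 = Phishing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy test labels and per-model predictions (illustrative only).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 1, 0, 1, 0, 0],
    "model_b": [1, 0, 1, 1, 0, 0, 0, 0],
    "model_c": [1, 0, 1, 1, 0, 1, 1, 0],
}

scores = {name: binary_metrics(y_test, pred) for name, pred in predictions.items()}
# The model with the highest F1-score is selected.
best_name = max(scores, key=lambda name: scores[name]["f1"])
```

In a real run the predictions would come from the fitted estimators, and the winner could then be persisted with joblib.dump(model, f"{best_name}.joblib"); building the file name this way is an assumption about how BestModelName is derived.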


🧪 Hybrid URL Feature Extraction

For real-time URL prediction, the project extracts features such as:

  • URL length
  • Hostname length
  • Count of dots, hyphens, slashes, special characters
  • Digit count & digit-to-length ratio
  • WHOIS information (domain age, registration length)
  • DNS record validation
  • Simple SSL status
  • Redirect count

Unavailable or unsupported features are assigned a default value (-1) to maintain column alignment.
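
The lexical part of the extractor and the -1 fallback can be sketched with the standard library alone. The feature names and the helper names lexical_features and align are hypothetical; in the project the column order would come from feature_columns.joblib. The network-based features (WHOIS, DNS, SSL, redirects) are left out here and therefore fall back to the default.

```python
from urllib.parse import urlparse

DEFAULT = -1  # value assigned to features that cannot be computed


def lexical_features(url: str) -> dict:
    """Compute features that need only the URL string (names are assumed)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    digits = sum(ch.isdigit() for ch in url)
    return {
        "length_url": len(url),
        "length_hostname": len(host),
        "nb_dots": url.count("."),
        "nb_hyphens": url.count("-"),
        "nb_slash": url.count("/"),
        "nb_digits": digits,
        "ratio_digits_url": digits / len(url) if url else 0.0,
    }


def align(features: dict, feature_columns: list) -> list:
    """Order features like the training columns; missing ones become DEFAULT."""
    return [features.get(col, DEFAULT) for col in feature_columns]


# Hypothetical subset of the training columns.
feature_columns = [
    "length_url", "length_hostname", "nb_dots", "nb_hyphens",
    "ratio_digits_url", "domain_age", "nb_redirection",
]
vector = align(lexical_features("http://example.com/login-2024"), feature_columns)
```

Here domain_age and nb_redirection are not computed, so both positions in vector hold -1 while the column order still matches the training data.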


🚀 URL Prediction

The user can enter any URL at the console prompt:

Enter the url:

The system outputs:

  • Extracted feature vector
  • Prediction → Phishing or Legitimate
  • Probability scores (if supported by the model)

Example:

URL: http://example.com
Prediction: Legitimate
Probabilities: [0.93, 0.07]
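
A rough sketch of the output step, assuming the saved model follows the scikit-learn estimator convention (predict, and optionally predict_proba). StubModel is a hypothetical stand-in that always returns the values from the example above, so the snippet runs without a trained model.

```python
LABELS = {0: "Legitimate", 1: "Phishing"}


def describe_prediction(model, vector):
    """Return the label and, if the model supports it, the class probabilities."""
    pred = int(model.predict([vector])[0])
    proba = None
    if hasattr(model, "predict_proba"):
        proba = [round(float(p), 2) for p in model.predict_proba([vector])[0]]
    return LABELS[pred], proba


class StubModel:
    """Hypothetical stand-in for the loaded model; always predicts class 0."""

    def predict(self, rows):
        return [0 for _ in rows]

    def predict_proba(self, rows):
        return [[0.93, 0.07] for _ in rows]


label, proba = describe_prediction(StubModel(), [29, 11, 1, 1])
print(f"Prediction: {label}")
if proba is not None:
    print(f"Probabilities: {proba}")
```

With a scikit-learn style classifier the probabilities follow the order of model.classes_, so with labels 0 and 1 the first value is the Legitimate probability.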

Author

Ritesh Kumar Pandit

B.Tech CSE — IILM University

About

This is an AI Phishing Website Detector. It classifies an input URL as phishing or legitimate. Although the model is trained on 80+ features, it can also make predictions from the limited set of features that can be extracted from the URL itself.
