This project focuses on the automatic detection and classification of spam, smishing (phishing via SMS), and normal messages in the Bengali language.
We used the Banglabarta dataset and fine-tuned three state-of-the-art transformer-based models:
- BERT
- RoBERTa
- XLM-RoBERTa
Our research evaluates their performance on this task and identifies the most effective solution.
Key Finding:
XLM-RoBERTa achieved the best performance, making it the most effective model for Bengali SMS spam detection.
You will need a Python environment with the following libraries installed:
- torch
- transformers
- scikit-learn
- pandas
- numpy
Install dependencies using pip:
```bash
pip install torch transformers scikit-learn pandas numpy
```

```
data/                 # Small sample of the Banglabarta dataset
notebooks/            # Jupyter notebooks for preprocessing, fine-tuning, evaluation
images/               # Plots and tables for visualization
README.md             # Project overview
dataset_resource.md   # Dataset details and citation
```
- **Clone the Repository**

  ```bash
  git clone https://github.com/your-username/your-repo-name.git
  cd your-repo-name
  ```

- **Explore the Notebooks**

  Open the notebooks in the `notebooks/` folder to see the full workflow.

- **Download the Full Dataset (Optional)**

  The `data/` folder contains only a small sample. For full-scale experiments, download the complete dataset from the link in `dataset_resource.md`.
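Once cloned, the sample data can be inspected with pandas. This is a minimal sketch, not the repository's actual loading code: the column names (`text`, `label`) and label values are assumptions, and an inline CSV stands in for the real files under `data/`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a file in data/; real column names and
# labels may differ from this sketch.
sample_csv = io.StringIO(
    "text,label\n"
    "আপনি ১০ লক্ষ টাকা জিতেছেন! এখনই লিঙ্কে ক্লিক করুন,smishing\n"
    "আগামীকাল মিটিং সকাল ১০টায়,normal\n"
    "৫০% ছাড়! আজই কিনুন,promotional\n"
)

df = pd.read_csv(sample_csv)
print(df["label"].value_counts())
```

Swapping `sample_csv` for a path such as `data/<file>.csv` follows the same pattern.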
- **Dataset Preparation**
  - Used the Banglabarta dataset of Bengali SMS messages
  - Created a balanced subset to prevent bias across classes (Smishing, Promotional, Normal)

- **Preprocessing**
  - Text normalization
  - Label encoding

- **Model Fine-Tuning**
  - Fine-tuned three pre-trained transformer models: BERT, RoBERTa, and XLM-RoBERTa

- **Performance Evaluation**
  - Metrics used: Accuracy, Precision, Recall, F1-score
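The preprocessing steps above can be sketched in a few lines. The exact normalization rules and encoder used in the notebooks may differ; this shows the general idea with `scikit-learn`'s `LabelEncoder` and a couple of illustrative regex rules.

```python
import re

from sklearn.preprocessing import LabelEncoder


def normalize(text: str) -> str:
    """Minimal text normalization: strip URLs, collapse whitespace.

    A sketch only; the actual rules in the notebooks may differ.
    """
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    return text.strip()


texts = ["আপনি   পুরস্কার জিতেছেন http://spam.example", "আগামীকাল দেখা হবে"]
labels = ["smishing", "normal"]

clean = [normalize(t) for t in texts]
encoder = LabelEncoder()
y = encoder.fit_transform(labels)  # classes sorted: normal -> 0, smishing -> 1
```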
| Model | Accuracy |
|---|---|
| XLM-RoBERTa | 98.92% |
| BERT | 98.20% |
| RoBERTa | 94.41% |
XLM-RoBERTa outperformed the others.
Its multilingual pre-training on 100+ languages gave it an edge in Bengali text classification.
Meanwhile, RoBERTa lagged behind because it was pre-trained on English-only corpora.
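The evaluation metrics above can be computed with `scikit-learn`. A small sketch on toy labels (not the actual test-set predictions) using macro averaging, which weights the three classes equally:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground truth and predictions over the three classes; the real
# evaluation uses the held-out split of the Banglabarta subset.
y_true = ["smishing", "normal", "promotional", "normal", "smishing"]
y_pred = ["smishing", "normal", "promotional", "smishing", "smishing"]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```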
If you use this work, please cite the original research paper.
Full citation details are provided in `dataset_resource.md`.