
🛡️ TruthGuard AI — Explainable Misinformation Detection

Decoding the Linguistic DNA of Deception in the Information Ecosystem


Overview

TruthGuard AI is a production-ready, explainable misinformation detection system trained on 72,134 labeled news articles from the WELFake dataset. It goes beyond black-box accuracy by providing:

  • LIME-powered explanations — word-level justification for every prediction
  • Bias auditing — per-class performance parity analysis (EU AI Act aligned)
  • Interpretable features — 8 engineered linguistic signals capturing "deceptive DNA"
  • Soft Voting Ensemble — Logistic Regression + Random Forest + Linear SVM
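LIME fits a local linear model over many random word maskings; the core intuition can be sketched with a one-word occlusion pass, dropping each word and watching the fake-probability move. The `fake_probability` function below is a toy stand-in for the ensemble's `predict_proba`, not the project's actual model:

```python
# Simplified word-attribution sketch (occlusion): drop each word and
# measure how much the fake-probability changes. LIME proper samples
# many random maskings and fits a weighted linear model; this is the
# one-word special case, probed against a toy probability function.
def fake_probability(text: str) -> float:
    # Toy stand-in for ensemble.predict_proba, flagging a few trigger words.
    triggers = {"shocking": 0.3, "unbelievable": 0.25, "miracle": 0.2}
    score = 0.1 + sum(w for t, w in triggers.items() if t in text.lower())
    return min(score, 1.0)

def explain(text: str):
    words = text.split()
    base = fake_probability(text)
    contribs = []
    for i, word in enumerate(words):
        masked = " ".join(words[:i] + words[i + 1:])
        contribs.append((word, base - fake_probability(masked)))
    # Largest absolute contribution first.
    return sorted(contribs, key=lambda c: -abs(c[1]))

top = explain("Shocking miracle cure stuns doctors")
print(top[:2])  # highest-impact words first
```

In the real app, the classifier function handed to LIME's `LimeTextExplainer` would be the TF-IDF + ensemble pipeline rather than this toy scorer.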

✨ Key Features

| Feature | Description |
| --- | --- |
| 🔍 Real-time Detection | Paste any article and get an instant fake/real classification |
| 📊 Confidence Scores | Probabilistic output with per-class breakdown |
| 🚨 Trigger Word Flagging | Identifies specific suspicious lexical patterns |
| 📈 Linguistic Fingerprint | 8 human-interpretable features visualized per article |
| ⚖️ Bias Audit Dashboard | Per-class recall/precision parity analysis |
| 🏗️ Model Insights | Architecture explanation with feature weights |

🚀 Quick Start

1. Clone & Install

```bash
git clone https://github.com/your-username/truthguard-ai.git
cd truthguard-ai
pip install -r requirements.txt
```

2. Add Model Files

Place the trained model artifacts (generated from the notebook) in a models/ folder:

```
truthguard-ai/
├── app.py
├── models/
│   ├── truthguard_ensemble.pkl
│   ├── truthguard_tfidf.pkl
│   ├── feature_stats.pkl
│   └── metadata.pkl
├── requirements.txt
└── README.md
```

No models? The app runs in heuristic fallback mode using rule-based linguistic analysis — still functional for demonstration.
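A minimal sketch of this load-or-fallback pattern, assuming joblib-pickled artifacts in the layout above; the heuristic weights here are illustrative, not the app's actual rules:

```python
# Load the pickled ensemble when present; otherwise fall back to a
# rule-based heuristic. File names follow the models/ layout above.
from pathlib import Path

import joblib

MODEL_DIR = Path("models")

def heuristic_score(text: str):
    # Crude illustrative rules standing in for the trained model:
    # heavy capitalisation and exclamation marks push toward "fake".
    caps = sum(c.isupper() for c in text) / max(len(text), 1)
    bangs = text.count("!") / max(len(text.split()), 1)
    fake = min(0.95, 0.2 + 2.0 * caps + 1.5 * bangs)
    return [fake, 1.0 - fake]  # [P(fake), P(real)]

def load_pipeline():
    try:
        ensemble = joblib.load(MODEL_DIR / "truthguard_ensemble.pkl")
        tfidf = joblib.load(MODEL_DIR / "truthguard_tfidf.pkl")
        return lambda text: ensemble.predict_proba(tfidf.transform([text]))[0]
    except FileNotFoundError:
        return heuristic_score  # rule-based fallback mode

score = load_pipeline()("BREAKING!!! You WON'T believe this!")
```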

3. Run the App

```bash
streamlit run app.py
```

Open http://localhost:8501 in your browser.


📦 Requirements

```
streamlit>=1.35.0
scikit-learn>=1.3.0
numpy>=1.24.0
textblob>=0.17.1
joblib>=1.3.0
nltk>=3.8.0
```

Install everything:

```bash
pip install -r requirements.txt

# Download TextBlob corpora
python -m textblob.download_corpora
```

🔬 Methodology

Dataset

  • WELFake — 72,134 news articles (Kaggle)
  • Labels: 0 = Fake, 1 = Real
  • Training subset: 20,000 articles (80/20 train/test split)
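The 80/20 split can be reproduced with scikit-learn's `train_test_split`; placeholder texts stand in for the WELFake CSV here, and the stratification and `random_state` are assumptions:

```python
# 80/20 train/test split as described above, shown on placeholder data
# (the real pipeline reads the WELFake CSV from Kaggle).
from sklearn.model_selection import train_test_split

texts = [f"article {i}" for i in range(1000)]
labels = [i % 2 for i in range(1000)]  # 0 = Fake, 1 = Real

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42
)
print(len(X_train), len(X_test))  # 800 200
```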

Feature Engineering (8 signals)

| Feature | Signal Type | Insight |
| --- | --- | --- |
| Sentiment Score | Emotional | Fake news skews to extremes (±0.3+) |
| Caps Ratio | Visual | Fake articles average 8%+ uppercase |
| Exclamation Density | Punctuation | Urgency fabrication via `!!!` |
| Avg Sentence Length | Structural | Fake news uses shorter, punchier sentences |
| Lexical Diversity | Vocabulary | Low diversity → repetitive rhetoric |
| Readability Score | Complexity | Fake content is deliberately simpler |
| Title Length | Structural | Sensational headlines tend to be longer |
| Text Length | Volume | Extremes (very short/long) are suspicious |
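Several of these signals reduce to a few lines of plain Python. A sketch (sentiment and readability are omitted, since they rely on TextBlob and a readability formula; the exact definitions in the notebook may differ):

```python
# Sketch of some engineered linguistic signals from the table above.
import re

def linguistic_features(title: str, text: str) -> dict:
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Share of uppercase characters in the body text.
        "caps_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
        # Exclamation marks per word.
        "exclaim_density": text.count("!") / max(len(words), 1),
        # Mean words per sentence.
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        # Unique-word ratio (type/token).
        "lexical_diversity": len({w.lower() for w in words}) / max(len(words), 1),
        "title_length": len(title),
        "text_length": len(text),
    }

feats = linguistic_features("Example Headline",
                            "Short. Very SHORT!!! Repeat repeat repeat.")
```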

Ensemble Architecture

```
Input Text
    │
    ├── TF-IDF Vectorizer (5000 features, 1-2 ngrams)
    │       │
    │       ├── Logistic Regression  (weight: 30%) ← interpretable
    │       ├── Random Forest        (weight: 40%) ← non-linear
    │       └── Linear SVM           (weight: 30%) ← high-dim text
    │
    └── Soft Voting → Final Probability → Verdict
```
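The diagram maps almost directly onto scikit-learn's `VotingClassifier`. One caveat: `LinearSVC` exposes no `predict_proba`, so soft voting needs a probability-capable substitute; `SVC(kernel="linear", probability=True)` is used in this sketch, and the tiny toy corpus exists only to make the snippet runnable:

```python
# Soft-voting ensemble over TF-IDF features, mirroring the diagram above.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

ensemble = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100)),
            # Linear SVM with Platt scaling so it can vote with probabilities.
            ("svm", SVC(kernel="linear", probability=True)),
        ],
        voting="soft",
        weights=[0.30, 0.40, 0.30],
    ),
)

# Toy corpus just to exercise the pipeline end to end.
texts = ["real news report", "shocking miracle cure!!!"] * 10
labels = [1, 0] * 10
ensemble.fit(texts, labels)
proba = ensemble.predict_proba(["shocking miracle!!!"])[0]  # [P(fake), P(real)]
```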

Model Performance

| Metric | Score |
| --- | --- |
| Accuracy | 93.4% |
| Precision | 94.1% |
| Recall | 92.8% |
| F1-Score | 93.4% |
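These metrics come straight from scikit-learn, shown here on placeholder predictions (the table's numbers are from the notebook's held-out test set):

```python
# Computing the four reported metrics with scikit-learn on toy labels.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # placeholder ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # placeholder model output

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
```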

🖥️ Dashboard Pages

🔍 Analyze Article

Paste article text → get verdict, confidence bars, linguistic fingerprint, and flagged trigger words.

📊 Model Insights

  • Performance metrics dashboard
  • Ensemble architecture breakdown
  • Feature importance visualization
  • Bias audit (per-class recall/precision parity)

🔬 About & Methods

  • Research questions answered
  • Dataset description
  • Full methodology pipeline (Phases 0–8)

📁 Project Structure

```
truthguard-ai/
├── app.py                     # Main Streamlit dashboard
├── models/                    # Saved model artifacts (from notebook)
│   ├── truthguard_ensemble.pkl
│   ├── truthguard_tfidf.pkl
│   ├── feature_stats.pkl
│   └── metadata.pkl
├── truthguard_notebook.ipynb  # Full training notebook
├── requirements.txt
└── README.md
```

⚖️ Ethical Considerations

  • Bias Audit: Per-class recall parity gap < 1.5 percentage points (the project's internal fairness threshold, chosen for EU AI Act alignment)
  • Explainability: LIME word-level explanations available for every prediction
  • Transparency: All feature engineering is interpretable and documented
  • Limitations: Model trained on English-language articles only; performance may degrade on highly domain-specific content
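The recall-parity check in the bias audit reduces to per-class recall and an absolute gap, sketched here on placeholder predictions:

```python
# Per-class recall parity check, shown on toy labels.
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]  # placeholder: 0 = Fake, 1 = Real
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]  # placeholder model output

# average=None returns recall per class, ordered by sorted label value.
recall_fake, recall_real = recall_score(y_true, y_pred, average=None)
gap = abs(recall_fake - recall_real)
passes = gap < 0.015  # 1.5-percentage-point parity threshold
```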

🛠️ Built With

Python · Streamlit · scikit-learn · NLTK · TextBlob · joblib

👩‍💻 Author

Srishti Rajput
TruthGuard AI — Explainable Misinformation Detection
Defending Truth in the Digital Age


📄 License

This project is licensed under the MIT License.