Decoding the Linguistic DNA of Deception in the Information Ecosystem
TruthGuard AI is a production-ready, explainable misinformation detection system built on the WELFake dataset of 72,134 labeled news articles. It goes beyond black-box accuracy by providing:
- LIME-powered explanations — word-level justification for every prediction
- Bias auditing — per-class performance parity analysis (EU AI Act aligned)
- Interpretable features — 8 engineered linguistic signals capturing "deceptive DNA"
- Soft Voting Ensemble — Logistic Regression + Random Forest + Linear SVM
| Feature | Description |
|---|---|
| 🔍 Real-time Detection | Paste any article and get instant fake/real classification |
| 📊 Confidence Scores | Probabilistic output with per-class breakdown |
| 🚨 Trigger Word Flagging | Identifies specific suspicious lexical patterns |
| 📈 Linguistic Fingerprint | 8 human-interpretable features visualized per article |
| ⚖️ Bias Audit Dashboard | Per-class recall/precision parity analysis |
| 🏗️ Model Insights | Architecture explanation with feature weights |
```bash
git clone https://github.com/your-username/truthguard-ai.git
cd truthguard-ai
pip install -r requirements.txt
```

Place the trained model artifacts (generated from the notebook) in a `models/` folder:
```
truthguard-ai/
├── app.py
├── models/
│   ├── truthguard_ensemble.pkl
│   ├── truthguard_tfidf.pkl
│   ├── feature_stats.pkl
│   └── metadata.pkl
├── requirements.txt
└── README.md
```
No models? The app runs in heuristic fallback mode using rule-based linguistic analysis — still functional for demonstration.
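In fallback mode, the rule-based analysis might look something like the minimal sketch below. The `heuristic_verdict` helper, its trigger phrases, and its weights are illustrative assumptions, not the app's actual code:

```python
def heuristic_verdict(text: str):
    """Hypothetical rule-based fallback: (label, confidence) from lexical red flags."""
    words = text.split()
    n = max(len(words), 1)
    # Share of all-caps words (SHOUTING) and exclamation marks per word
    caps_ratio = sum(1 for w in words if w.isupper() and len(w) > 1) / n
    exclaim_density = text.count("!") / n
    # Illustrative trigger phrases; the real app uses its own list
    trigger_hits = sum(t in text.lower() for t in ("shocking", "you won't believe", "miracle"))
    # Weighted suspicion score clamped to [0, 1]; weights are guesses for demonstration
    score = min(1.0, 3.0 * caps_ratio + 5.0 * exclaim_density + 0.2 * trigger_hits)
    return ("FAKE", score) if score >= 0.5 else ("REAL", 1.0 - score)
```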
```bash
streamlit run app.py
```

Open http://localhost:8501 in your browser.
```
streamlit>=1.35.0
scikit-learn>=1.3.0
numpy>=1.24.0
textblob>=0.17.1
joblib>=1.3.0
nltk>=3.8.0
```

Install everything:

```bash
pip install -r requirements.txt
# Download TextBlob corpora
python -m textblob.download_corpora
```

- WELFake — 72,134 news articles (Kaggle)
- Labels: `0` = Fake, `1` = Real
- Training subset: 20,000 articles (80/20 train/test split)
| Feature | Signal Type | Insight |
|---|---|---|
| Sentiment Score | Emotional | Fake news skews to polarity extremes (beyond ±0.3) |
| Caps Ratio | Visual | Fake articles average 8%+ uppercase |
| Exclamation Density | Punctuation | Manufactured urgency via repeated `!!!` |
| Avg Sentence Length | Structural | Fake news uses shorter, punchier sentences |
| Lexical Diversity | Vocabulary | Low diversity → repetitive rhetoric |
| Readability Score | Complexity | Fake content is deliberately simpler |
| Title Length | Structural | Sensational headlines tend to be longer |
| Text Length | Volume | Extremes (very short/long) are suspicious |
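Several of these features need nothing beyond the standard library. The `linguistic_fingerprint` helper below is a hypothetical sketch of how a few of them could be computed; the notebook's actual implementation (e.g. its sentiment and readability scoring) may differ:

```python
import re

def linguistic_fingerprint(title: str, text: str) -> dict:
    """Illustrative versions of several of the 8 engineered features."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return {
        # Share of uppercase characters in the raw text
        "caps_ratio": sum(1 for c in text if c.isupper()) / max(len(text), 1),
        # Exclamation marks per word
        "exclamation_density": text.count("!") / n_words,
        # Mean words per sentence
        "avg_sentence_length": n_words / max(len(sentences), 1),
        # Unique words over total words (type–token ratio)
        "lexical_diversity": len({w.lower() for w in words}) / n_words,
        "title_length": len(title.split()),
        "text_length": len(words),
    }
```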
```
Input Text
│
├── TF-IDF Vectorizer (5,000 features, 1–2 n-grams)
│   │
│   ├── Logistic Regression (weight: 30%) ← interpretable
│   ├── Random Forest       (weight: 40%) ← non-linear
│   └── Linear SVM          (weight: 30%) ← high-dim text
│
└── Soft Voting → Final Probability → Verdict
```
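A minimal scikit-learn sketch of this architecture might look as follows. One assumption worth flagging: `LinearSVC` has no `predict_proba`, so it must be wrapped (here with `CalibratedClassifierCV`) before it can take part in soft voting; the training notebook may handle this differently.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# LinearSVC lacks predict_proba; calibration makes it soft-voting compatible
# (an assumption about how the notebook wires this up).
svm = CalibratedClassifierCV(LinearSVC())

ensemble = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("vote", VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100)),
            ("svm", svm),
        ],
        voting="soft",
        weights=[0.3, 0.4, 0.3],  # LR / RF / SVM weights from the diagram
    )),
])
```

After `ensemble.fit(texts, labels)`, `ensemble.predict_proba([article])` returns the per-class probabilities that drive the dashboard's confidence bars.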
| Metric | Score |
|---|---|
| Accuracy | 93.4% |
| Precision | 94.1% |
| Recall | 92.8% |
| F1-Score | 93.4% |
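These headline metrics come straight from scikit-learn; a small sketch of the evaluation step, assuming label `1` (Real) as the positive class:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred) -> dict:
    """Headline metrics on held-out predictions (pos_label defaults to 1 = Real)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```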
Paste article text → get verdict, confidence bars, linguistic fingerprint, and flagged trigger words.
- Performance metrics dashboard
- Ensemble architecture breakdown
- Feature importance visualization
- Bias audit (per-class recall/precision parity)
- Research questions answered
- Dataset description
- Full methodology pipeline (Phases 0–8)
```
truthguard-ai/
├── app.py                      # Main Streamlit dashboard
├── models/                     # Saved model artifacts (from notebook)
│   ├── truthguard_ensemble.pkl
│   ├── truthguard_tfidf.pkl
│   ├── feature_stats.pkl
│   └── metadata.pkl
├── truthguard_notebook.ipynb   # Full training notebook
├── requirements.txt
└── README.md
```
- Bias Audit: Per-class recall parity gap < 1.5% (within the project's EU AI Act–aligned fairness threshold)
- Explainability: LIME word-level explanations available for every prediction
- Transparency: All feature engineering is interpretable and documented
- Limitations: Model trained on English-language articles only; performance may degrade on highly domain-specific content
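The recall-parity check behind the bias audit fits in a few lines. The `parity_gap` helper below is illustrative; its default threshold mirrors the 1.5% figure above, and it assumes the `0` = Fake, `1` = Real labeling:

```python
from sklearn.metrics import recall_score

def parity_gap(y_true, y_pred, threshold=0.015):
    """Absolute gap between per-class recalls; passes if below threshold (1.5%)."""
    recall_fake = recall_score(y_true, y_pred, pos_label=0)
    recall_real = recall_score(y_true, y_pred, pos_label=1)
    gap = abs(recall_fake - recall_real)
    return gap, gap < threshold
```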
- Streamlit — Dashboard framework
- Scikit-learn — ML ensemble
- LIME — Explainability
- TextBlob — Sentiment analysis
- WELFake Dataset — Training data
Srishti Rajput
TruthGuard AI — Explainable Misinformation Detection
Defending Truth in the Digital Age
This project is licensed under the MIT License.