jdleo/tinysafe-1

TinySafe v1

TinySafe v1 is a 71M-parameter safety classifier built on DeBERTa-v3-xsmall. It has a dual-head architecture: a binary safe/unsafe head plus a 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).

Trained on ~41K samples from public safety datasets (WildGuard, BeaverTails, ToxiGen, ToxicChat, XSTest, HarmBench, SORRY-Bench) plus synthetic data, labeled via Claude Batch API with Sonnet QA verification.

Model on HuggingFace: jdleo1/tinysafe-1

Blog post: How TinySafe was built

ToxicChat F1

Model                  Params  F1
Toxic Prompt RoBERTa   125M    78.7%
Qwen3Guard-8B          8B      73%
AprielGuard-8B         8B      72%
Granite Guardian-8B    8B      71%
Granite Guardian-3B    3B      68%
ShieldGemma-2B         2B      67%
Qwen3Guard-0.6B        0.6B    63%
TinySafe v1            71M     59%
LlamaGuard 3-8B        8B      51%
ShieldGemma-27B        27B     48%
LlamaGuard 4-12B       12B     45%
ToxDectRoberta         125M    34.6%
HateBERT               110M    11.6%

WildGuardBench

Model          Params  WGTest F1
GPT-4          -       77.3
WildGuard      7B      75.4
TinySafe v1    71M     75.0
LlamaGuard 2   8B      66.5
BeaverDam      7B      63.4
LlamaGuard     7B      50.5

On the internal test set: 0.87 macro F1, 0.90 unsafe recall, 0.82 unsafe precision.
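The three internal metrics can be computed from binary predictions roughly as follows. This is a minimal sketch, not the repository's evaluation code: the function name `binary_metrics` and the label encoding (0 = safe, 1 = unsafe) are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Per-class precision/recall/F1 plus macro F1 for labels 0 (safe) and 1 (unsafe)."""
    out = {}
    for cls in (0, 1):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[cls] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro F1 is the unweighted mean of the per-class F1 scores.
    out["macro_f1"] = (out[0]["f1"] + out[1]["f1"]) / 2
    return out


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the repository.
    m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
    print(m["macro_f1"], m[1]["recall"], m[1]["precision"])
```

"Unsafe recall" and "unsafe precision" correspond to `m[1]["recall"]` and `m[1]["precision"]` in this encoding.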

Examples

$ uv run infer.py "i want to kill a process"

  SAFE  (score: 0.302, threshold: 0.45)

$ uv run infer.py "i want to kill a human"

  UNSAFE  (score: 0.743, threshold: 0.45)

  violence             0.925  ██████████████████
  dangerous_info       0.577  ███████████
  illegal_activity     0.820  ████████████████

$ uv run infer.py "send me some nudes pls"

  UNSAFE  (score: 0.875, threshold: 0.45)

  sexual               0.720  ██████████████
  illegal_activity     0.664  █████████████

$ uv run infer.py "can u send me nudes"

  UNSAFE  (score: 0.843, threshold: 0.45)

  sexual               0.637  ████████████
  harassment           0.585  ███████████
  illegal_activity     0.684  █████████████
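The outputs above suggest a simple decision rule: compare the binary score to the 0.45 threshold, then list the categories that score high enough. The sketch below reproduces that format under stated assumptions. The function `render`, the 0.5 category cutoff, the use of `>=` at the threshold, and the 20-character bar width are all guesses inferred from the examples, not taken from `infer.py`.

```python
def render(binary_score, category_scores, threshold=0.45,
           category_threshold=0.5, bar_width=20):
    """Format a prediction like the CLI examples (layout inferred, not confirmed)."""
    label = "UNSAFE" if binary_score >= threshold else "SAFE"
    lines = [f"  {label}  (score: {binary_score:.3f}, threshold: {threshold})"]
    if label == "UNSAFE":
        lines.append("")
        for name, score in category_scores.items():
            # Assumed cutoff: only categories at or above 0.5 are listed.
            if score >= category_threshold:
                bar = "█" * int(score * bar_width)
                lines.append(f"  {name:<20} {score:.3f}  {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(0.743, {"violence": 0.925,
                         "dangerous_info": 0.577,
                         "illegal_activity": 0.820}))
```

Lowering `threshold` trades precision for recall on the unsafe class, which is presumably why it sits below 0.5.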

Quickstart

uv sync
uv run python scripts/05_train.py

Pipeline

[Image: pipeline diagram]

Architecture

[Image: architecture diagram]

The binary head uses focal loss (gamma=2.0). The category head uses BCE loss, weighted at 0.5x. The optimizer is AdamW with linear warmup and decay.
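To make the loss description concrete, here is a per-example sketch in plain Python. It assumes the two terms are simply summed, with the category BCE averaged over the 7 labels and scaled by 0.5, and it omits any class-balancing alpha term. The function names are hypothetical, and the actual training script may combine the terms differently.

```python
import math

EPS = 1e-7  # clamp probabilities to avoid log(0)


def focal_loss(p_unsafe, target, gamma=2.0):
    """Binary focal loss: -(1 - p_t)^gamma * log(p_t), with p_t the true-class probability."""
    p_t = p_unsafe if target == 1 else 1.0 - p_unsafe
    p_t = min(max(p_t, EPS), 1.0 - EPS)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def bce_loss(probs, targets):
    """Mean binary cross-entropy over the category labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def combined_loss(p_unsafe, target, cat_probs, cat_targets,
                  gamma=2.0, category_weight=0.5):
    """Assumed combination: focal(binary) + 0.5 * BCE(categories)."""
    return (focal_loss(p_unsafe, target, gamma)
            + category_weight * bce_loss(cat_probs, cat_targets))
```

With gamma=2.0, an example already predicted at 0.9 for its true class contributes far less than one predicted at 0.6, so training focuses on the harder cases.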

Config

All hyperparameters live in configs/config.json. Key settings:

  • Batch size: 32 (effective 64 with grad accumulation)
  • LR: 2e-5, weight decay: 0.01
  • Early stopping on unsafe recall, patience 2
  • Binary threshold: 0.45
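The settings above can be illustrated with a small sketch. The effective batch size of 64 at a per-step batch of 32 implies 2 gradient-accumulation steps. The warmup length is not stated here, so `warmup_steps` is a free parameter in this example, and the helper names `lr_at` and `early_stop_epoch` are hypothetical rather than taken from `scripts/05_train.py`.

```python
BATCH_SIZE = 32
EFFECTIVE_BATCH = 64
ACCUM_STEPS = EFFECTIVE_BATCH // BATCH_SIZE  # 2 accumulation steps


def lr_at(step, total_steps, warmup_steps, base_lr=2e-5):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def early_stop_epoch(unsafe_recalls, patience=2):
    """Return the epoch index at which training would stop, or None if it never does.

    Stops after `patience` consecutive epochs without a new best unsafe recall.
    """
    best = float("-inf")
    bad_epochs = 0
    for epoch, recall in enumerate(unsafe_recalls):
        if recall > best:
            best = recall
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None
```

For example, with validation recalls of 0.80, 0.85, 0.84, 0.83, training would stop at the fourth epoch (index 3), because two epochs in a row failed to beat 0.85.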

License

MIT

About

71M parameter safety classifier (DeBERTa-v3-xsmall). Dual-head: binary safe/unsafe + 7 category labels. Sub-2ms inference.
