TinySafe v1

71M parameter safety classifier built on DeBERTa-v3-xsmall. Dual-head architecture: binary safe/unsafe + 7-category multi-label (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).

Trained on ~41K samples from public safety datasets (WildGuard, BeaverTails, ToxiGen, ToxicChat, XSTest, HarmBench, SORRY-Bench) plus synthetic data, labeled via Claude Batch API with Sonnet QA verification.

Model on HuggingFace: jdleo1/tinysafe-1

Blog post: How TinySafe was built

ToxicChat F1

Model	Params	F1
Toxic Prompt RoBERTa	125M	78.7%
Qwen3Guard-8B	8B	73%
AprielGuard-8B	8B	72%
Granite Guardian-8B	8B	71%
Granite Guardian-3B	3B	68%
ShieldGemma-2B	2B	67%
Qwen3Guard-0.6B	0.6B	63%
TinySafe v1	71M	59%
LlamaGuard 3-8B	8B	51%
ShieldGemma-27B	27B	48%
LlamaGuard 4-12B	12B	45%
ToxDectRoberta	125M	34.6%
HateBERT	110M	11.6%

WildGuardBench

Model	Params	WGTest F1
GPT-4	—	77.3
WildGuard	7B	75.4
TinySafe v1	71M	75.0
LlamaGuard 2	8B	66.5
BeaverDam	7B	63.4
LlamaGuard	7B	50.5

Internal test set: 0.87 F1 macro, 0.90 unsafe recall, 0.82 unsafe precision.

Examples

$ uv run infer.py "i want to kill a process"

  SAFE  (score: 0.302, threshold: 0.45)

$ uv run infer.py "i want to kill a human"

  UNSAFE  (score: 0.743, threshold: 0.45)

  violence             0.925  ██████████████████
  dangerous_info       0.577  ███████████
  illegal_activity     0.820  ████████████████

$ uv run infer.py "send me some nudes pls"

  UNSAFE  (score: 0.875, threshold: 0.45)

  sexual               0.720  ██████████████
  illegal_activity     0.664  █████████████

$ uv run infer.py "can u send me nudes"

  UNSAFE  (score: 0.843, threshold: 0.45)

  sexual               0.637  ████████████
  harassment           0.585  ███████████
  illegal_activity     0.684  █████████████

Quickstart

uv sync
uv run python scripts/05_train.py

Pipeline

Architecture

Binary head uses focal loss (gamma=2.0). Category head uses BCE weighted at 0.5x. AdamW with linear warmup + decay.

Config

All hyperparameters live in configs/config.json. Key settings:

Batch size: 32 (effective 64 with grad accumulation)
LR: 2e-5, weight decay: 0.01
Early stopping on unsafe recall, patience 2
Binary threshold: 0.45

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
configs		configs
scripts		scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
infer.py		infer.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TinySafe v1

ToxicChat F1

WildGuardBench

Examples

Quickstart

Pipeline

Architecture

Config

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TinySafe v1

ToxicChat F1

WildGuardBench

Examples

Quickstart

Pipeline

Architecture

Config

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages