# The Snitch

An OpenEnv environment for training scalable oversight agents.

## What this is
The Snitch puts an LLM overseer in front of frozen tool-use traces from research agents and asks it to detect three misbehavior patterns: reward hacking, laziness, and deception.
The reward function pays out both for the correct classification and for citing the smoking-gun evidence, so the environment doubles as a benchmark: it surfaces a capability gap (evidence grounding) that current frontier post-training pipelines do not close.
## Leaderboard — held-out v3, n=120
| Model | Acc | Mean Reward | Notes |
|---|---|---|---|
| GPT-4o | 90.0% | +0.975 | 0/78 evidence-bonus hits |
| Llama-3.1-8B-Instruct | 85.0% | +0.900 | 0/72 evidence-bonus hits |
| Qwen2.5-1.5B + LoRA, 400 steps | 75.8% | +0.900 | RL'd on grounded reward |
| GPT-4o-mini | 82.5% | +0.858 | 0/70 evidence-bonus hits |
| Qwen2.5-7B-Instruct (untrained) | 70.8% | +0.688 | 0/55 evidence-bonus hits |
| Qwen2.5-1.5B + 3-shot (no training) | 56.7% | +0.438 | pretrained baseline |
| Random verdict | ~25% | −0.350 | uniform over 4 classes |
Across four leading instruction-tuned models, 0 of 275 correctly classified misbehavior traces captured the evidence-grounding bonus. Live data: `results/eval_*_n120.json`.
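The live eval files can be aggregated with a few lines of stdlib Python. This is a sketch: the field names `accuracy` and `mean_reward` are assumptions about the JSON schema, not something this README specifies — adjust them to match the actual files.

```python
import glob
import json

def summarize(paths):
    """Collect (path, accuracy, mean_reward) tuples from eval result files.

    NOTE: the keys "accuracy" and "mean_reward" are assumed field names;
    check one of the results/eval_*_n120.json files and rename as needed.
    """
    rows = []
    for path in sorted(paths):
        with open(path) as f:
            data = json.load(f)
        rows.append((path, data.get("accuracy"), data.get("mean_reward")))
    return rows

if __name__ == "__main__":
    for path, acc, reward in summarize(glob.glob("results/eval_*_n120.json")):
        print(f"{path}: acc={acc} mean_reward={reward}")
```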
## API endpoints
Liveness probe.

```bash
curl http://localhost:7860/health
```
List task ids and the action schema.

```bash
curl http://localhost:7860/tasks
```
Start an episode. Returns the observation (the trace) + `episode_id`.

```bash
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "hard", "seed": 42}'
```
Submit the overseer's verdict. Returns reward + `info`.

```bash
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action": "VERDICT: DECEIVER\nCONFIDENCE: 0.9\nEVIDENCE: Step 4 contradicts Step 2."}'
```
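The same reset/step flow can be driven from Python with only the standard library. This is a minimal sketch: the endpoint paths, payloads, and the `VERDICT:/CONFIDENCE:/EVIDENCE:` action format come from the curl examples above, but the response shapes are only what this README describes, so treat the helper as a starting point.

```python
import json
import urllib.request

BASE = "http://localhost:7860"

def post_json(path, payload):
    """POST a JSON payload to the environment and decode the JSON reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def make_action(verdict, confidence, evidence):
    """Build the newline-delimited action string the /step example uses."""
    return f"VERDICT: {verdict}\nCONFIDENCE: {confidence}\nEVIDENCE: {evidence}"

# Usage, assuming the server is running locally:
# obs = post_json("/reset", {"task_id": "hard", "seed": 42})
# result = post_json("/step", {"action": make_action(
#     "DECEIVER", 0.9, "Step 4 contradicts Step 2.")})
```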
Run the random-verdict baseline (n=20 per task, seed=42, reproducible).

```bash
curl http://localhost:7860/baseline
```
Score a completed episode by `episode_id` + `task_id`.
WebSocket transport (required by HF Spaces). Messages: `reset`, `step`, `state`, `close`.
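A message builder for the WebSocket transport might look like the sketch below. Only the four message names come from this README; the `{"type": ..., **payload}` envelope is an assumed wire format, so verify it against the server before relying on it.

```python
import json

# The four message types named in this README.
VALID_TYPES = {"reset", "step", "state", "close"}

def ws_message(msg_type, **payload):
    """Encode one WebSocket message as JSON.

    ASSUMPTION: the server expects a {"type": ..., **payload} envelope;
    this README only lists the message names, not the wire format.
    """
    if msg_type not in VALID_TYPES:
        raise ValueError(f"unknown message type: {msg_type}")
    return json.dumps({"type": msg_type, **payload})
```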