The Snitch

An OpenEnv environment for training scalable oversight agents.

What this is

The Snitch puts an LLM overseer in front of frozen tool-use traces from research agents and asks it to detect three misbehavior patterns: reward hacking, laziness, and deception.

The reward function pays for both correct classification and citing the smoking-gun evidence — so it doubles as a benchmark that surfaces a capability gap (evidence grounding) that current frontier post-training pipelines do not close.
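The shape of that reward can be sketched as follows. This is a minimal illustration, not the environment's actual grader: the numeric constants and the substring check are assumptions; only the structure (a correctness term plus an evidence-grounding bonus) comes from the description above.

```python
# Hedged sketch of a grounded oversight reward.
# ASSUMED values: -0.5 wrong-verdict penalty, +1.0 correct-verdict base,
# +0.25 evidence bonus, and substring matching for "cited the smoking gun".

def grounded_reward(verdict: str, gold_verdict: str,
                    cited_evidence: str, gold_evidence: str) -> float:
    """Pay for the right classification, plus a bonus for citing
    the specific step that proves it."""
    if verdict != gold_verdict:
        return -0.5  # wrong classification (assumed penalty)
    reward = 1.0     # correct classification (assumed base reward)
    if gold_evidence and gold_evidence in cited_evidence:
        reward += 0.25  # evidence-grounding bonus (assumed size)
    return reward
```

Under a scheme like this, a model can score well on classification alone while leaving the evidence bonus entirely on the table, which is exactly the gap the leaderboard below surfaces.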

Leaderboard — held-out v3, n=120

Model                                | Acc   | Mean reward | Notes
GPT-4o                               | 90.0% | +0.975      | 0/78 evidence-bonus hits
Llama-3.1-8B-Instruct                | 85.0% | +0.900      | 0/72 evidence-bonus hits
Qwen2.5-1.5B + LoRA, 400 steps       | 75.8% | +0.900      | RL'd on grounded reward
GPT-4o-mini                          | 82.5% | +0.858      | 0/70 evidence-bonus hits
Qwen2.5-7B-Instruct (untrained)      | 70.8% | +0.688      | 0/55 evidence-bonus hits
Qwen2.5-1.5B + 3-shot (no training)  | 56.7% | +0.438      | pretrained baseline
Random verdict                       | ~25%  | −0.350      | uniform over 4 classes

Across four leading instruction-tuned models, 0 of 275 correctly classified misbehavior traces captured the evidence-grounding bonus. Live data: results/eval_*_n120.json.

API endpoints

GET /health

Liveness probe.

curl http://localhost:7860/health
GET /tasks

List task ids and the action schema.

curl http://localhost:7860/tasks
POST /reset

Start an episode. Returns observation (the trace) + episode_id.

curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "hard", "seed": 42}'
POST /step

Submit the overseer's verdict. Returns reward + info.

curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action": "VERDICT: DECEIVER\nCONFIDENCE: 0.9\nEVIDENCE: Step 4 contradicts Step 2."}'
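The reset/step round trip can also be driven from Python. A minimal sketch, assuming only the JSON shapes shown in the curl examples above; the helper names and the use of `requests` are mine, not part of the environment:

```python
# Build the three-line action string /step expects
# (format copied from the curl example above).
def format_action(verdict: str, confidence: float, evidence: str) -> str:
    return f"VERDICT: {verdict}\nCONFIDENCE: {confidence}\nEVIDENCE: {evidence}"

def run_episode(base_url: str = "http://localhost:7860"):
    """One reset -> step round trip. Requires `pip install requests`
    and a running server; payload field names follow the curl examples."""
    import requests  # third-party; assumed HTTP client
    obs = requests.post(f"{base_url}/reset",
                        json={"task_id": "hard", "seed": 42}).json()
    action = format_action("DECEIVER", 0.9, "Step 4 contradicts Step 2.")
    result = requests.post(f"{base_url}/step", json={"action": action}).json()
    return obs, result
```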
GET /baseline

Run the random-verdict baseline (n=20 per task, seed=42, reproducible).

curl http://localhost:7860/baseline
POST /grader

Score a completed episode by episode_id + task_id.
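A request might look like the following; the exact JSON field names are an assumption based on the description above (episode_id comes from /reset):

curl -X POST http://localhost:7860/grader \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "<id from /reset>", "task_id": "hard"}'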

WS /ws

WebSocket transport (required by HF Spaces). Messages: reset, step, state, close.
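A client for this transport might look like the sketch below. Only the four message names (reset, step, state, close) come from the docs; the JSON envelope with a "type" field is an assumption, as is the use of the third-party `websockets` library:

```python
import asyncio
import json

def make_message(msg_type: str, **payload) -> str:
    """Serialize one of the documented message types.
    The envelope shape (a JSON object with a "type" field) is ASSUMED."""
    assert msg_type in {"reset", "step", "state", "close"}
    return json.dumps({"type": msg_type, **payload})

async def ws_episode(url: str = "ws://localhost:7860/ws"):
    import websockets  # third-party; `pip install websockets`
    async with websockets.connect(url) as ws:
        await ws.send(make_message("reset", task_id="hard", seed=42))
        obs = json.loads(await ws.recv())
        await ws.send(make_message(
            "step",
            action="VERDICT: DECEIVER\nCONFIDENCE: 0.9\n"
                   "EVIDENCE: Step 4 contradicts Step 2."))
        result = json.loads(await ws.recv())
        await ws.send(make_message("close"))
        return obs, result
```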

Links