The Snitch

An OpenEnv environment for training scalable oversight agents.

What this is

The Snitch puts an LLM overseer in front of frozen tool-use traces from research agents and asks it to detect three misbehavior patterns: reward hacking, laziness, and deception.

The reward function pays for both correct classification and citing the smoking-gun evidence — so it doubles as a benchmark that surfaces a capability gap (evidence grounding) that current frontier post-training pipelines do not close.
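The shape of that reward can be sketched as follows. This is a minimal illustration, not the environment's actual grader: the numeric constants and the substring check are assumptions; only the structure (a correctness term plus an evidence-grounding bonus) comes from the description above.

```python
# Hedged sketch of a grounded oversight reward.
# ASSUMED values: -0.5 wrong-verdict penalty, +1.0 correct-verdict base,
# +0.25 evidence bonus, and substring matching for "cited the smoking gun".

def grounded_reward(verdict: str, gold_verdict: str,
                    cited_evidence: str, gold_evidence: str) -> float:
    """Pay for the right classification, plus a bonus for citing
    the specific step that proves it."""
    if verdict != gold_verdict:
        return -0.5  # wrong classification (assumed penalty)
    reward = 1.0     # correct classification (assumed base reward)
    if gold_evidence and gold_evidence in cited_evidence:
        reward += 0.25  # evidence-grounding bonus (assumed size)
    return reward
```

Under a scheme like this, a model can score well on classification alone while leaving the evidence bonus entirely on the table, which is exactly the gap the leaderboard below surfaces.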

Leaderboard — held-out v3, n=120

Model                                | Acc   | Mean reward | Notes
GPT-4o                               | 90.0% | +0.975      | 0/78 evidence-bonus hits
Llama-3.1-8B-Instruct                | 85.0% | +0.900      | 0/72 evidence-bonus hits
Qwen2.5-1.5B + LoRA, 400 steps       | 75.8% | +0.900      | RL'd on grounded reward
GPT-4o-mini                          | 82.5% | +0.858      | 0/70 evidence-bonus hits
Qwen2.5-7B-Instruct (untrained)      | 70.8% | +0.688      | 0/55 evidence-bonus hits
Qwen2.5-1.5B + 3-shot (no training)  | 56.7% | +0.438      | pretrained baseline
Random verdict                       | ~25%  | −0.350      | uniform over 4 classes

Across four leading instruction-tuned models, 0 of 275 correctly classified misbehavior traces captured the evidence-grounding bonus. Live data: results/eval_*_n120.json.

API endpoints

GET /health

Liveness probe.

curl http://localhost:7860/health
GET /tasks

List task ids and the action schema.

curl http://localhost:7860/tasks
POST /reset

Start an episode. Returns observation (the trace) + episode_id.

curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "hard", "seed": 42}'
POST /step

Submit the overseer's verdict. Returns reward + info.

curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action": "VERDICT: DECEIVER\nCONFIDENCE: 0.9\nEVIDENCE: Step 4 contradicts Step 2."}'
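The reset/step round trip can also be driven from Python. A minimal sketch, assuming only the JSON shapes shown in the curl examples above; the helper names and the use of `requests` are mine, not part of the environment:

```python
# Build the three-line action string /step expects
# (format copied from the curl example above).
def format_action(verdict: str, confidence: float, evidence: str) -> str:
    return f"VERDICT: {verdict}\nCONFIDENCE: {confidence}\nEVIDENCE: {evidence}"

def run_episode(base_url: str = "http://localhost:7860"):
    """One reset -> step round trip. Requires `pip install requests`
    and a running server; payload field names follow the curl examples."""
    import requests  # third-party; assumed HTTP client
    obs = requests.post(f"{base_url}/reset",
                        json={"task_id": "hard", "seed": 42}).json()
    action = format_action("DECEIVER", 0.9, "Step 4 contradicts Step 2.")
    result = requests.post(f"{base_url}/step", json={"action": action}).json()
    return obs, result
```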
GET /baseline

Run the random-verdict baseline (n=20 per task, seed=42, reproducible).

curl http://localhost:7860/baseline
POST /grader

Score a completed episode by episode_id + task_id.
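A request might look like the following; the exact JSON field names are an assumption based on the description above (episode_id comes from /reset):

curl -X POST http://localhost:7860/grader \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "<id from /reset>", "task_id": "hard"}'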

WS /ws

WebSocket transport (required by HF Spaces). Messages: reset, step, state, close.
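A client for this transport might look like the sketch below. Only the four message names (reset, step, state, close) come from the docs; the JSON envelope with a "type" field is an assumption, as is the use of the third-party `websockets` library:

```python
import asyncio
import json

def make_message(msg_type: str, **payload) -> str:
    """Serialize one of the documented message types.
    The envelope shape (a JSON object with a "type" field) is ASSUMED."""
    assert msg_type in {"reset", "step", "state", "close"}
    return json.dumps({"type": msg_type, **payload})

async def ws_episode(url: str = "ws://localhost:7860/ws"):
    import websockets  # third-party; `pip install websockets`
    async with websockets.connect(url) as ws:
        await ws.send(make_message("reset", task_id="hard", seed=42))
        obs = json.loads(await ws.recv())
        await ws.send(make_message(
            "step",
            action="VERDICT: DECEIVER\nCONFIDENCE: 0.9\n"
                   "EVIDENCE: Step 4 contradicts Step 2."))
        result = json.loads(await ws.recv())
        await ws.send(make_message("close"))
        return obs, result
```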

Links