AI Engineer YouTube · June 6, 2026

Evals Are Broken, Use Them Anyway — Ara Khan, Cline

Why it matters

Cline started at 43% on Terminal Bench. The improvements came from container CPU and memory settings, raised timeouts, and prompt engineering techniques specific to Anthropic model families that do not transfer to Codex or Gemini, not from switching to a better model. Ara Khan's argument is that benchmark numbers are n

My takeaway: this talk is an agent-security signal. In practice, autonomy, memory, tool permissions, and third-party integrations form the control surface that needs threat modeling and monitoring.