Building Long-Running AI Agents: Harness Design and the Claude Evolution | Tom Karels

This video features Ash Prabaker and Andrew Wilson from Anthropic's Applied AI team discussing the design and evolution of long-running autonomous agents. They address three core challenges that cause agents to lose coherence over time: finite context windows, planning difficulties, and the inability of a single model to judge its own output objectively. The presentation traces the history of the Claude model series, showing how improvements in model intelligence, paired with increasingly sophisticated harness designs, extended autonomous task completion from minutes to over 12 hours.

The speakers introduce a multi-agent architectural pattern built around a Planner, a Generator, and an Evaluator. The pattern relies on adversarial pressure to maintain quality: an independent Evaluator agent critiques the Generator's work against a specific rubric. Practical examples include a fully functional retro game maker and a music production app, each built from a simple natural-language prompt. The talk concludes with a call to simplify harness scaffolding as base models improve, and urges developers to read model traces to find where the model's judgment diverges from human intent.

AI Agents Anthropic Claude

Visual Summary

Infographic visualizing Building Long-Running AI Agents: Harness Design and the Claude Evolution

This video provides an in-depth exploration of how to build AI agents capable of running for several hours without losing focus or coherence. Presented by engineers from Anthropic, the talk covers the transition from basic model prompting to sophisticated multi-agent harness designs that utilize a Planner-Generator-Evaluator loop. Viewers will learn about the history of the Claude model series and the specific engineering techniques used to overcome context rot, planning failures, and self-evaluation traps.

Key Takeaways

Autonomous agents often fail due to finite context windows, poor planning, and the inability to objectively judge their own work.
Anthropic has seen agent runtimes increase from 20 minutes to over 12 hours by co-evolving models and harnesses.
Separating the builder (Generator) from the judge (Evaluator) creates adversarial pressure that significantly improves output quality.
A dedicated Planner should break down vague human prompts into actionable, high-level specifications without over-specifying technical details.
The primary debugging loop for these systems involves reading model traces to find where the AI's judgment diverges from human judgment.
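The Planner, Generator, and Evaluator roles described above can be sketched as a small control loop. This is a minimal illustration, not the harness shown in the talk: `run_loop`, `Verdict`, and the three stub agents are hypothetical names, and in a real system each callable would wrap a separate model call with its own context.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str


def run_loop(
    prompt: str,
    planner: Callable[[str], str],
    generator: Callable[[str, List[str]], str],
    evaluator: Callable[[str, str], Verdict],
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Plan once, then alternate generate/evaluate until the work passes."""
    spec = planner(prompt)        # turn the vague prompt into a high-level spec
    feedback: List[str] = []      # critiques accumulated across rounds
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generator(spec, feedback)
        verdict = evaluator(spec, draft)  # judged by a separate agent, not the builder
        if verdict.passed:
            return draft, round_no
        feedback.append(verdict.feedback)
    return draft, max_rounds      # budget exhausted; return the last draft


# Stub agents standing in for model calls (assumptions for illustration only).
def stub_planner(prompt: str) -> str:
    return f"SPEC: {prompt}; must include a title and a save button"


def stub_generator(spec: str, feedback: List[str]) -> str:
    parts = ["title"]
    if any("save button" in note for note in feedback):
        parts.append("save button")   # only fixed after the evaluator complains
    return ", ".join(parts)


def stub_evaluator(spec: str, draft: str) -> Verdict:
    if "save button" not in draft:
        return Verdict(False, "missing save button")
    return Verdict(True, "meets spec")


result, rounds = run_loop(
    "build a retro game maker", stub_planner, stub_generator, stub_evaluator
)
# result is "title, save button" and rounds is 2
```

The evaluator never sees the generator's reasoning, only the spec and the draft, which is one simple way to keep the judge independent of the builder.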

Timestamps

00:14

Introduction: Ash and Andrew introduce the topic of long-running agents.

02:34

Three Reasons Agents Lose the Plot: Challenges including context, planning, and verification.

04:18

Fix 1: Training the Behavior. Improving the base models' innate capabilities.

04:50

Fix 2: Harness Design. Building scaffolding around the model.

06:03

History of Claude Evolution: A timeline of Claude model releases and accompanying harness features.

18:27

The Generator/Evaluator Loop: Separating building from judging using adversarial patterns.

21:13

Grading Subjective Quality: Using rubrics to evaluate design taste and craft.

22:30

Frontend Loop in Action: Demonstration of a website built through multiple feedback rounds.

23:44

Target Audience

AI engineers, software developers, and researchers interested in building autonomous agents and understanding LLM harness design.

Use Cases

- Developing autonomous coding assistants for complex, multi-hour projects
- Building quality assurance pipelines for generative AI outputs
- Architecting multi-agent systems for end-to-end product development
- Optimizing context window usage for long-running AI sessions
- Tuning LLM evaluators to apply subjective quality rubrics
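For the rubric-tuning use case, one possible way to make a subjective rubric explicit is to give each criterion a weight and combine the evaluator's per-criterion scores into a single number. The sketch below is an assumption for illustration: the criterion names, weights, the 0 to 10 scale, and the `weighted_score` helper are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    description: str  # the text an evaluator agent would be shown for this criterion


def weighted_score(rubric: List[Criterion], scores: Dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-10) into one weighted average."""
    missing = [c.name for c in rubric if c.name not in scores]
    if missing:
        # Fail loudly rather than silently ignoring an unscored criterion.
        raise ValueError(f"evaluator did not score: {missing}")
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * scores[c.name] for c in rubric) / total_weight


# Hypothetical rubric; weighting design more heavily is an illustrative choice.
RUBRIC = [
    Criterion("design_taste", 2.0, "Is the layout cohesive and intentional?"),
    Criterion("craft", 1.0, "Are spacing, typography, and details polished?"),
    Criterion("functionality", 1.0, "Does every control do what it claims?"),
]

# Scores as an evaluator agent might report them for one draft.
scores = {"design_taste": 6.0, "craft": 8.0, "functionality": 10.0}
overall = weighted_score(RUBRIC, scores)  # (2*6 + 8 + 10) / 4 = 7.5
```

Tuning then becomes a matter of adjusting the descriptions and weights until the combined score tracks human judgment on a sample of outputs.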

Key Topics

Agent Harness Design