Building Long-Running AI Agents: Harness Design and the Claude Evolution
YouTube
This video features Ash Prabaker and Andrew Wilson from Anthropic's Applied AI team, discussing the evolution and design of long-running autonomous agents. They address the core challenges that cause agents to lose coherence over time, including finite context windows, planning difficulties, and the inability of a single model to objectively judge its own output. The presentation charts the historical progress of the Claude model series, demonstrating how improvements in model intelligence have been paired with sophisticated harness designs to extend autonomous task completion from mere minutes to over 12 hours.
The speakers introduce a multi-agent architectural pattern involving a Planner, a Generator, and an Evaluator. This system uses adversarial pressure to ensure quality, where an independent evaluator agent critiques the generator's work against a specific rubric. They showcase practical examples, such as building a fully functional retro game maker and a music production app from simple natural language prompts. The talk concludes with a call to simplify harness scaffolding as base models improve, emphasizing that developers should focus on reading model traces to identify where judgment diverges from human intent.
This video provides an in-depth exploration of how to build AI agents capable of running for several hours without losing focus or coherence. Presented by engineers from Anthropic, the talk covers the transition from basic model prompting to sophisticated multi-agent harness designs that utilize a Planner-Generator-Evaluator loop. Viewers will learn about the history of the Claude model series and the specific engineering techniques used to overcome context rot, planning failures, and self-evaluation traps.
Key Takeaways
Autonomous agents often fail due to finite context windows, poor planning, and the inability to objectively judge their own work.
Anthropic has seen agent runtimes increase from 20 minutes to over 12 hours by co-evolving models and harnesses.
Separating the builder (Generator) from the judge (Evaluator) creates adversarial pressure that significantly improves output quality.
A dedicated Planner should break down vague human prompts into actionable, high-level specifications without over-specifying technical details.
The primary debugging loop for these systems involves reading model traces to find where the AI's judgment diverges from human judgment.
Timestamps
00:14
Introduction: Ash and Andrew introduce the topic of long-running agents.
02:34
Three Reasons Agents Lose the Plot: Challenges including context, planning, and verification.
04:18
Fix 1: Training the Behavior. Improving the base models' innate capabilities.
04:50
Fix 2: Harness Design. Building scaffolding around the model.
06:03
History of Claude Evolution: A timeline of Claude model releases and accompanying harness features.
18:27
The Generator/Evaluator Loop: Separating building from judging using adversarial patterns.
21:13
Grading Subjective Quality: Using rubrics to evaluate design taste and craft.
22:30
Frontend Loop in Action: Demonstration of a website built through multiple feedback rounds.
23:44
Target Audience
AI engineers, software developers, and researchers interested in building autonomous agents and understanding LLM harness design.
Use Cases
- Developing autonomous coding assistants for complex, multi-hour projects
- Building quality assurance pipelines for generative AI outputs
- Architecting multi-agent systems for end-to-end product development
- Optimizing context window usage for long-running AI sessions
- Tuning LLM evaluators to apply subjective quality rubrics
As base models become more intelligent, developers should actively delete unnecessary harness scaffolding to reduce cost and complexity.
Three Reasons Agents Lose the Plot
Building agents that stay on track is difficult because of three recurring failure modes in current LLMs. First is the context problem: models have finite context windows, and their coherence degrades as those windows fill up, a phenomenon known as context rot. Second is the planning problem: models often attempt to one-shot entire projects, running out of context mid-feature or stopping prematurely after seeing partial progress. Third is the verification problem: models are notoriously poor at judging their own output, often rating mediocre work as good because they cannot separate their bias toward what they generated from objective quality requirements.
The Evolution of Claude and its Harness
The history of Claude's development shows a parallel evolution between the model's innate abilities and the external scaffolding provided by engineers. Early versions struggled with basic bash commands, but the introduction of Sonnet 3.5 marked the beginning of serious agentic capabilities. The harness has evolved from simple artifact generation to complex systems involving the Model Context Protocol (MCP) and Computer Use APIs. By 2026, the combination of Opus 4.6 and refined multi-agent teams allowed for autonomous runs exceeding 30 hours, moving beyond simple code generation to full application development.
The Planner-Generator-Evaluator Loop
To achieve high-quality results, Anthropic utilizes a three-agent system. The process begins with a Planner that converts a single line of human intent into a comprehensive feature specification. The Generator then works on one sprint at a time, building out features. Crucially, an independent Evaluator, often using tools like Playwright for web apps, navigates the live application and scores it against a detailed rubric. This adversarial setup is effective because tuning a standalone evaluator to be skeptical is far more tractable than making a generator self-critical. The agents negotiate a contract for each sprint, ensuring they agree on what constitutes a finished feature before code is even written.
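The control flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: every name here (SprintContract, Verdict, SprintResult, run_harness, and the toy_* functions) is hypothetical, the three agents are plain callables standing in for model calls, and the toy evaluator uses simple string checks where a real one might drive a browser with a tool such as Playwright.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SprintContract:
    """What the generator and evaluator agree counts as done for one sprint."""
    feature: str
    acceptance_criteria: List[str]


@dataclass
class Verdict:
    passed: bool
    feedback: str


@dataclass
class SprintResult:
    contract: SprintContract
    artifact: str
    rounds: int
    passed: bool


def run_harness(
    prompt: str,
    plan: Callable[[str], List[SprintContract]],
    generate: Callable[[SprintContract, str], str],
    evaluate: Callable[[SprintContract, str], Verdict],
    max_rounds: int = 3,
) -> List[SprintResult]:
    """Plan once, then loop generate -> evaluate for each sprint."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    results = []
    for contract in plan(prompt):
        feedback = ""  # empty on the first attempt
        for round_number in range(1, max_rounds + 1):
            artifact = generate(contract, feedback)
            verdict = evaluate(contract, artifact)
            if verdict.passed:
                break
            feedback = verdict.feedback  # the critique drives the next attempt
        results.append(
            SprintResult(contract, artifact, round_number, verdict.passed)
        )
    return results


# Toy stand-ins for the three agents (assumptions, for illustration only).
def toy_plan(prompt: str) -> List[SprintContract]:
    return [SprintContract("level editor", ["save button"])]


def toy_generate(contract: SprintContract, feedback: str) -> str:
    # The first draft misses the criterion; the revision addresses the feedback.
    return "level editor with save button" if feedback else "level editor"


def toy_evaluate(contract: SprintContract, artifact: str) -> Verdict:
    missing = [c for c in contract.acceptance_criteria if c not in artifact]
    feedback = "missing: " + ", ".join(missing) if missing else ""
    return Verdict(passed=not missing, feedback=feedback)


results = run_harness(
    "make a retro game maker", toy_plan, toy_generate, toy_evaluate
)
```

The evaluator only ever sees the contract and the artifact, never the generator's reasoning, which is one simple way to keep the judge independent of the builder. The max_rounds cap is an assumed safeguard so that a sprint the generator cannot satisfy does not loop forever.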
Making Subjective Quality Gradable
A common misconception is that subjective taste cannot be graded by AI. Anthropic engineers have found success by creating rubrics that focus on design quality, originality, craft, and functionality. By weighting these criteria and providing few-shot examples of good versus bad design, an evaluator agent can effectively drive the generator toward professional aesthetics. This prevents the common pitfall of AI slop and ensures that the final product feels coherent and polished.
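As a rough illustration of weighting rubric criteria, the sketch below combines per-criterion scores into a single grade. The four criterion names come from the talk summary; the specific weights, the 1 to 5 scale, and the function names are assumptions made for this example.

```python
from typing import Dict

# Hypothetical weights (they sum to 100); the talk does not give actual values.
RUBRIC_WEIGHTS: Dict[str, int] = {
    "design_quality": 35,
    "originality": 25,
    "craft": 20,
    "functionality": 20,
}


def weighted_score(scores: Dict[str, float],
                   weights: Dict[str, int] = RUBRIC_WEIGHTS) -> float:
    """Return the weighted average of per-criterion scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def weakest_criterion(scores: Dict[str, float],
                      weights: Dict[str, int] = RUBRIC_WEIGHTS) -> str:
    """Name the lowest-scoring criterion, a natural focus for feedback."""
    return min(weights, key=lambda name: scores[name])


# Example evaluator output on an assumed 1 to 5 scale.
example_scores = {
    "design_quality": 4,
    "originality": 2,
    "craft": 3,
    "functionality": 5,
}
overall = weighted_score(example_scores)
focus = weakest_criterion(example_scores)
```

In this example the overall grade is 3.5 and the weakest criterion is originality, so the evaluator's next round of feedback would push the generator on originality rather than on features that already work.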
Practical Applications
Developers can apply these patterns today using the primitives already shipping in tools like Claude Code. By implementing an auto-mode for safe unattended handling and using sub-agents for QA roles, teams can begin automating the long tail of software development. The most important practice is reading the traces. Finding the point where a model's judgment fails is the key to updating prompts and refining the system. As the frontier of model capability moves, engineers must remain flexible, deleting the scaffolding that models have outgrown to keep systems efficient and cost-effective.
Frequently Asked Questions
Why shouldn't a model evaluate its own work?
Models have a natural bias toward the text they generate. They often rationalize errors or overlook missing features because the logic used to create the error is the same logic used during self-review. Using a separate evaluator agent with a harsh system prompt breaks this cycle of sycophancy.
How does the Planner prevent scope creep?
The Planner is designed to work at a high level of abstraction. It breaks down a project into sprints and high-level specifications without getting bogged down in implementation details. This leaves the technical execution to the Generator and Evaluator, who negotiate the specific technical contracts on the fly.
What is the most effective way to debug long-running agents?
The primary debugging tool is the model trace. Engineers must read through the logs to identify the specific turn where the model's logic deviated from human expectations. Updating the system prompt based on these specific failure points is more effective than broad architectural changes.
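One way to make this trace-reading step concrete is to annotate a trace with human judgments and locate the first turn where they disagree with the model's. The sketch below assumes a hypothetical trace format (the TraceTurn fields and the pass/fail labels are invented for illustration); real traces will look different and still need to be read by a person.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceTurn:
    index: int
    summary: str
    model_judgment: str  # what the model concluded, e.g. "pass" or "fail"
    human_judgment: str  # what a reviewer concluded for the same turn


def first_divergence(trace: List[TraceTurn]) -> Optional[TraceTurn]:
    """Return the earliest turn where model and human judgments differ."""
    for turn in trace:
        if turn.model_judgment != turn.human_judgment:
            return turn
    return None


# A tiny annotated trace (made-up data for illustration).
trace = [
    TraceTurn(0, "planned sprint 1", "pass", "pass"),
    TraceTurn(1, "built level editor", "pass", "pass"),
    TraceTurn(2, "approved UI without testing save", "pass", "fail"),
    TraceTurn(3, "moved on to sprint 2", "pass", "fail"),
]
culprit = first_divergence(trace)
```

Here the first disagreement is at turn 2, which is the turn to inspect when revising the system prompt; later disagreements are often downstream of that first one.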
Scaling to Full-Stack: Introducing the Planner into the three-agent system.
32:59
The Debugging Loop: The importance of reading traces to align AI judgment.
38:55
Key Takeaways: Summary of best practices for agent development.