Anthropic Engineering —《长跑应用开发的 Harness 设计》(全文)
原文: https://www.anthropic.com/engineering/harness-design-long-running-apps 作者: Prithvi Rajasekaran(Anthropic Labs 团队) 发布: 2026-03-24 本译版定位: 完整逐段翻译 + 译注
译者前言
这篇是 Anthropic 工程团队 2026-03-24 发布的官方 harness 专文。作者 Prithvi Rajasekaran 来自 Anthropic Labs 团队,把过去几个月在「让 Claude 做出更高质量的前端设计」与「让 Claude 在无人干预下完成完整应用」两条战线上的工作合并起来,系统性地讲清了 GAN 启发的 generator–evaluator 多 agent 结构、planner–generator–evaluator 三体架构,以及随着模型从 Sonnet 4.5 → Opus 4.5 → Opus 4.6 演进时,harness 应当如何被精简、被重新评估。
它和 Addy Osmani 在民间技术圈写的 Harness Engineering 系列(详见配套精读)互为补充——Addy 给的是社区视角的工程模式总结,Anthropic 这篇则带着第一手实验数据(retro 游戏制作器、Web Audio DAW 两组完整 build,小时数与 token 成本逐项列出),把「为什么需要 evaluator」「为什么需要 sprint contract」「为什么 4.6 之后又可以删掉 sprint」讲得很硬核。
为什么 2026 前端工程师必读?因为「Frontend design: making subjective quality gradable」这一节直接给出了一份可复用的设计评分 rubric(design quality / originality / craft / functionality)和 Playwright MCP 主导的可视评估 loop——这是把「美感」这种主观维度装进自动化反馈环路的最早一批工程化范式之一,前端从业者无论是不是用 Anthropic 的栈,都能直接拿来对自家 AI 设计产线做框架对照。
🟢 译者注:harness 在英文里直译为「马具/挽具」,在 agent 工程语境里指「围绕模型的脚手架与控制层」——包括 prompt 模板、agent 编排、上下文管理、工具接入、循环控制等所有不属于模型本体、但决定模型在真实任务里能跑多远的工程基础设施。本文中 harness 全部保留英文不译,以保证语境一致。
长跑应用开发的 Harness 设计
原文:Published Mar 24, 2026
发布于 2026 年 3 月 24 日
原文:Harness design is key to performance at the frontier of agentic coding. Here’s how we pushed Claude further in frontend design and long-running autonomous software engineering.
Harness 设计是 agent 编程前沿性能的关键。本文讲的是我们如何在前端设计与长跑自主软件工程这两条战线上,把 Claude 又向前推了一步。
原文:Written by Prithvi Rajasekaran, a member of our Labs team.
作者:Prithvi Rajasekaran,Anthropic Labs 团队成员。
🟢 译者注:Anthropic Labs 是 Anthropic 内部专门做实验性产品与早期工程探索的团队,Cowork 等概念性产品多出自这里。
原文:Over the past several months I’ve been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude’s performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.
过去几个月里,我一直在解决两个相互关联的问题:让 Claude 产出高质量的前端设计,以及让它在无人干预下构建完整的应用。这项工作起源于我们更早的两个项目——frontend design skill(前端设计技能)和 long-running coding agent harness(长跑编程 agent 的 harness)。在那两个项目里,我和同事通过 prompt engineering 和 harness 设计,把 Claude 的表现拉到了远高于 baseline 的水平,但最终都撞到了天花板。
原文:To break through, I sought out novel AI engineering approaches that held across two quite different domains, one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent. Building an evaluator that graded outputs reliably—and with taste—meant first developing a set of criteria that could turn subjective judgments like “is this design good?” into concrete, gradable terms.
为了突破天花板,我开始寻找一种 AI 工程方法,要求它在两个差别极大的领域都能成立——一个领域由主观品味定义(前端设计),另一个由可验证的正确性与可用性定义(完整应用)。我从 Generative Adversarial Networks(GAN,生成对抗网络)里得到灵感,设计了一个由 generator(生成 agent)与 evaluator(评估 agent)组成的多 agent 结构。要让 evaluator 既稳定打分、又有审美,首先必须开发出一组评分准则,把「这个设计好不好?」这种主观判断,转译成具体、可分级的指标。
原文:I then applied these techniques to long-running autonomous coding, carrying over two lessons from our earlier harness work: decomposing the build into tractable chunks, and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.
之后我把这些技术应用到长跑自主编程上,沿用了我们早期 harness 工作的两条经验:把 build 分解成可处理的片段;用结构化产物在 session 之间交接 context。最终成型的是一套三 agent 架构——planner、generator、evaluator——它能在持续数小时的自主编程 session 中产出功能丰富的全栈应用。
🟢 译者注:session 在 agent 语境里指一次 Claude 进程的连续对话上下文(到 context window 满 / 进程结束就算一个 session 结束)。artifact 在这里是「结构化交接产物」,通常是文件(spec、handoff log、contract 等),用于跨 session 把状态从一个 agent 传给下一个。
为什么朴素实现行不通
原文:We’ve previously shown that harness design has a substantial impact on the effectiveness of long running agentic coding. In an earlier experiment, we used an initializer agent to decompose a product spec into a task list, and a coding agent that implemented the tasks one feature at a time before handing off artifacts to carry context across sessions. The broader developer community has converged on similar insights, with approaches like the “Ralph Wiggum” method using hooks or scripts to keep agents in continuous iteration cycles.
我们之前已经展示过:harness 设计对长跑 agent 编程效果有显著影响。在更早的一次实验里,我们用一个 initializer agent 把产品 spec 拆成任务清单,再用一个 coding agent 一次实现一个 feature,完成后通过 artifacts 把 context 交接到下一个 session。更广义的开发者社区也收敛到了类似的见解,比如「Ralph Wiggum」方法,用 hooks 或脚本让 agents 持续运行在迭代循环里。
🟢 译者注:Ralph Wiggum 方法是 ghuntley 提出的极简 agent 编排范式——名字来自《辛普森一家》里那个总是傻乎乎重复 “I’m helping” 的小孩;核心做法就是 while true; do claude -p "fix the next thing"; done 这种循环式硬调度。
原文:But some problems remained persistent. For more complex tasks, the agent still tends to go off the rails over time. While decomposing this issue, we observed two common failure modes with agents executing these sorts of tasks.
但有一些问题始终没解决。在更复杂的任务上,agent 还是会随着时间推移逐渐脱轨。在拆解这个问题的过程中,我们观察到 agents 执行这类任务时有两种常见的失败模式。
原文:First is that models tend to lose coherence on lengthy tasks as the context window fills (see our post on context engineering). Some models also exhibit “context anxiety,” in which they begin wrapping up work prematurely as they approach what they believe is their context limit. Context resets—clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent’s state and the next steps—addresses both these issues.
第一种是:在长任务中,随着 context window 被逐渐填满,模型往往会失去 coherence(连贯性)(参见我们关于 context engineering 的文章)。一些模型还会表现出「context anxiety」(context 焦虑)——当它们觉得自己快撞到 context 上限时,会过早地结束工作。Context resets(上下文重置)——彻底清空 context window、启动一个全新的 agent,再配合一份结构化 handoff(交接文件,内含上一个 agent 的状态与下一步动作)——能够同时解决这两个问题。
🟢 译者注:context anxiety 是 Anthropic 内部对模型在 context 接近上限时表现出的「焦虑早收尾」行为的命名,Sonnet 4.5 表现尤甚,这是后文 4.5/4.6 对 harness 改动的核心动因。
原文:This differs from compaction, where earlier parts of the conversation are summarized in place so the same agent can keep going on a shortened history. While compaction preserves continuity, it doesn’t give the agent a clean slate, which means context anxiety can still persist. A reset provides a clean slate, at the cost of the handoff artifact having enough state for the next agent to pick up the work cleanly. In our earlier testing, we found Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn’t sufficient to enable strong long task performance, so context resets became essential to the harness design. This solves the core issue, but adds orchestration complexity, token overhead, and latency to each harness run.
这与 compaction(压缩)不同。compaction 是把对话的早期部分就地摘要,让同一个 agent 在更短的历史上继续工作。compaction 保留了连续性,但没有给 agent 一张白纸,所以 context anxiety 还是会继续存在。reset 提供了一张白纸,代价是 handoff artifact 必须装得下足够多的状态,让下一个 agent 能干净地接过工作。在我们更早的测试里,Claude Sonnet 4.5 的 context anxiety 强到 compaction 单独不足以撑起长任务性能,所以 context resets 成了 harness 设计中不可或缺的一部分。它解决了核心问题,但也给每一次 harness 运行增加了编排复杂度、token 开销与延迟。
🟢 译者注:compaction 是 Anthropic 官方对「自动摘要历史以腾出 context 空间」机制的命名,Claude Agent SDK 里有自动 compaction;它和 reset 是两种处理 context 上限的不同策略。
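🟢 译者注(示意代码):「context reset + 结构化 handoff」的编排骨架,可以用下面这段 Python 伪实现勾勒。其中 `run_fresh_agent` 是译者假设的调用封装(真实实现是启动一个全新的 Claude session),HANDOFF.md 的字段划分也只是示意,并非原文给出的格式:

```python
from pathlib import Path

HANDOFF = Path("HANDOFF.md")

def write_handoff(state: dict) -> None:
    """把本 session 的状态与下一步写成结构化 artifact,供下一个全新 agent 读取。"""
    HANDOFF.write_text(
        "# Handoff\n"
        f"## Completed\n{state['completed']}\n"
        f"## Current state\n{state['current']}\n"
        f"## Next steps\n{state['next']}\n"
    )

def read_handoff() -> str:
    """新 session 的唯一上下文来源:上一个 agent 留下的 handoff 文件。"""
    return HANDOFF.read_text() if HANDOFF.exists() else "(first session: start from spec)"

def run_fresh_agent(context: str) -> dict:
    """假设的封装:以全空 context window 启动一个新 agent(此处用 stub 代替真实调用)。"""
    return {"completed": "feature A", "current": "tests passing", "next": "feature B"}

# 编排循环:每轮都是 reset(全新 agent + 白纸 context),而非 compaction(同一 agent 缩短历史)
for session in range(3):
    state = run_fresh_agent(read_handoff())
    write_handoff(state)
```

与 compaction 相比,这种写法的代价正如原文所说:handoff 文件必须装下足够多的状态,否则下一个 agent 无法干净地接手。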
原文:A second issue, which we haven’t previously addressed, is self-evaluation. When asked to evaluate work they’ve produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre. This problem is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Whether a layout feels polished or generic is a judgment call, and agents reliably skew positive when grading their own work.
第二个问题是 self-evaluation(自我评估),这个我们之前没有处理过。当 agents 被要求评估自己产出的工作时,它们往往会自信地夸奖自己——哪怕在人类看来这份工作质量明显平庸。这个问题在像设计这种主观任务上尤其严重,因为这里没有「二元检查」可以做(像可验证的软件测试那样)。一个布局究竟「有质感」还是「平庸」是个判断题,而 agents 给自己打分时一致地偏向乐观。
原文:However, even on tasks that do have verifiable outcomes, agents still sometimes exhibit poor judgment that impedes their performance while completing the task. Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue. The separation doesn’t immediately eliminate that leniency on its own; the evaluator is still an LLM that is inclined to be generous towards LLM-generated outputs. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work, and once that external feedback exists, the generator has something concrete to iterate against.
不过,即便是在那些有可验证结果的任务上,agents 有时也会表现出糟糕的判断力,妨碍它们完成任务。把「干活的 agent」和「打分的 agent」分开,被证明是处理这个问题的强力杠杆。这种分离并不能立刻消除偏松的倾向——evaluator 自己也是个 LLM,天然倾向于宽容地对待 LLM 生成的产出。但事实证明:把一个独立的 evaluator 调成「持怀疑态度」,比让 generator 自己批评自己要可行得多。而一旦外部反馈存在,generator 就有了具体的东西可以迭代。
前端设计:让主观质量变得可分级
原文:I started by experimenting on frontend design, where the self-evaluation issue was most visible. Absent any intervention, Claude normally gravitates toward safe, predictable layouts that are technically functional but visually unremarkable.
我从前端设计开始做实验,因为 self-evaluation 问题在这里最明显。在没有任何干预的情况下,Claude 倾向于做出那种安全、可预测的布局——技术上能跑,但视觉上毫无亮点。
原文:Two insights shaped the harness I built for frontend design. First, while aesthetics can’t be fully reduced to a score—and individual tastes will always vary—they can be improved with grading criteria that encode design principles and preferences. “Is this design beautiful?” is hard to answer consistently, but “does this follow our principles for good design?” gives Claude something concrete to grade against. Second, by separating frontend generation from frontend grading, we can create a feedback loop that drives the generator toward stronger outputs.
有两条洞察决定了我为前端设计搭的这套 harness 的形状。第一,审美确实没法完全还原成一个分数——个体口味也总是会有差异——但可以通过一组承载了设计原则与偏好的评分准则来改进它。「这个设计美吗?」很难一致地回答,但「这个设计是否遵循我们认可的好设计原则?」就给了 Claude 一个具体的评分对象。第二,把「前端生成」和「前端打分」分开,我们就能造出一个反馈回路,驱动 generator 产出更强的输出。
原文:With this in mind, I wrote four grading criteria that I gave to both the generator and evaluator agents in their prompts:
带着这两条洞察,我写了四条评分准则,同时塞进 generator 和 evaluator 两个 agent 的 prompt 里:
- Design quality(设计质量): 这个设计感觉上是一个连贯的整体,还是一堆零件的拼接?在这一项上做得好,意味着颜色、字体、布局、图像与其他细节,共同组合出一种鲜明的气质与身份认同。
- Originality(原创性): 是否有定制化决定的痕迹?还是只是模板布局、库的默认值,和 AI 生成的套路?一个人类设计师应该能识别出刻意为之的创作选择。未经修改的现成组件——或者那种「白卡片配紫色渐变」之类一看就是 AI 生成的标志——在这里都不及格。
🟢 译者注:作者举的「白卡片配紫色渐变」(purple gradients over white cards)是 2024–2025 年 AI 生成 UI 高度同质化的标志特征,在 Anthropic 内部和设计圈里都被当成「AI slop」(AI 平庸输出)的典型反面教材。
- Craft(工艺): 技术执行层面:字体层级、间距一致性、配色和谐度、对比度。这一项考的是基本功,不是创意。大部分合理实现默认就能过关;不及格意味着基本功有破洞。
- Functionality(功能性): 与美学无关的可用性。用户能否搞清楚这个界面是做什么的、找到主操作、并在不靠猜的情况下完成任务?
原文:I emphasized design quality and originality over craft and functionality. Claude already scored well on craft and functionality by default, as the required technical competence tended to come naturally to the model. But on design and originality, Claude often produced outputs that were bland at best. The criteria explicitly penalized highly generic “AI slop” patterns, and by weighting design and originality more heavily it pushed the model toward more aesthetic risk-taking.
我有意把 design quality 和 originality 的权重压过 craft 和 functionality。Claude 在 craft 和 functionality 上默认就能拿高分,因为要求的技术能力对它来说很自然。但在 design 和 originality 上,Claude 经常给出最多也只能叫「平淡」的输出。准则明确惩罚了那些高度通用的「AI slop」(AI 平庸套路)模式,而且通过给 design 和 originality 更大的权重,把模型往更敢承担美学风险的方向推。
🟢 译者注:AI slop 是 2024–2025 年逐渐流行的英语网络词,泛指「AI 大量产出的平庸、千篇一律、可识别 AI 味的内容」,Anthropic 在这里把它正式纳入了 prompt 评分语汇。
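🟢 译者注(示意代码):把这四条准则连同「design quality / originality 权重压过 craft / functionality」的取舍写成可计算的加权 rubric,大致是下面这个样子。权重数值与示例分数都是译者假设的示意,原文只说明了相对侧重,没有给出具体数字:

```python
# 四条评分准则;design_quality 与 originality 的权重刻意高于 craft 与 functionality。
# 数值为示意:原文仅说明相对侧重,未给出具体权重。
RUBRIC = {
    "design_quality": 0.35,   # 整体是否连贯、有鲜明气质
    "originality":    0.35,   # 是否有定制化决定,惩罚「AI slop」套路
    "craft":          0.15,   # 字体层级、间距、配色等基本功
    "functionality":  0.15,   # 不靠猜也能完成任务的可用性
}

def weighted_score(scores: dict[str, float]) -> float:
    """按权重合成 0-10 总分;scores 的 key 必须覆盖全部准则。"""
    return sum(RUBRIC[k] * scores[k] for k in RUBRIC)

# 示例:基本功与可用性满分、但设计平庸的「安全」输出,会被权重设置拉低总分
safe_layout = {"design_quality": 4, "originality": 3, "craft": 9, "functionality": 9}
bold_layout = {"design_quality": 8, "originality": 9, "craft": 7, "functionality": 8}
```

这样的权重设置正是原文所说「把模型往更敢承担美学风险的方向推」的机制:靠 craft 和 functionality 拿满分的保守方案,总分仍然上不去。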
原文:I calibrated the evaluator using few-shot examples with detailed score breakdowns. This ensured the evaluator’s judgment aligned with my preferences, and reduced score drift across iterations.
我用 few-shot examples(带详细分项打分的示例)来校准 evaluator。这保证 evaluator 的判断与我的偏好对齐,也减少了跨迭代的打分漂移。
原文:I built the loop on the Claude Agent SDK, which kept the orchestration straightforward. A generator agent first created an HTML/CSS/JS frontend based on a user prompt. I gave the evaluator the Playwright MCP, which let it interact with the live page directly before scoring each criterion and writing a detailed critique. In practice, the evaluator would navigate the page on its own, screenshotting and carefully studying the implementation before producing its assessment. That feedback flowed back to the generator as input for the next iteration. I ran 5 to 15 iterations per generation, with each iteration typically pushing the generator in a more distinctive direction as it responded to the evaluator’s critique. Because the evaluator was actively navigating the page rather than scoring a static screenshot, each cycle took real wall-clock time. Full runs stretched up to four hours. I also instructed the generator to make a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn’t working.
我把这个 loop 搭在 Claude Agent SDK 上,编排逻辑因此保持得很简单。generator agent 先根据用户 prompt 生成一份 HTML/CSS/JS 前端。我把 Playwright MCP 给到 evaluator,让它能直接与运行中的页面交互,再针对每条准则打分并写出详细批评。实践中,evaluator 会自己浏览页面、截图、仔细研究实现细节,然后才出评估意见。这份反馈再流回 generator,作为下一轮迭代的输入。每次 generation 我跑 5 到 15 轮迭代,每一轮通常会把 generator 推向一个更鲜明的方向,作为对 evaluator 批评的回应。因为 evaluator 是真的在「浏览」页面而不是给一张静态截图打分,每轮都要花真实的 wall-clock time。完整跑一次最长能到四个小时。我也指示 generator 在每轮评估之后做一个战略决定:如果分数趋势在变好,就在当前方向上精修;如果路子不对,就整体转向一种完全不同的美学。
🟢 译者注:Playwright MCP 是把 Playwright(浏览器自动化框架)封装成一个 Model Context Protocol(MCP)服务器,让 agent 可以像人一样点击、截图、检查 DOM。这是 Anthropic 这套设计 evaluator 区别于「截图打分」朴素方案的关键工程动作。
原文:Across runs, the evaluator’s assessments improved over iterations before plateauing, with headroom still remaining. Some generations refined incrementally. Others took sharp aesthetic turns between iterations.
跨多次运行来看,evaluator 的评估随迭代次数提升,在到达 plateau(平台期)之前一直在改进——而且仍有提升空间。有些 generation 是渐进式精修;另一些则在迭代之间出现剧烈的美学转向。
原文:The wording of the criteria steered the generator in ways I didn’t fully anticipate. Including phrases like “the best designs are museum quality” pushed designs toward a particular visual convergence, suggesting that the prompting associated with the criteria directly shaped the character of the output.
准则的措辞会以我没完全预料到的方式引导 generator。比如加入「the best designs are museum quality(最好的设计是博物馆级的)」这种短语,会把设计推向某种特定的视觉收敛——这说明,与准则相关的 prompting 直接塑造了输出的「性格」。
原文:While scores generally improved over iterations, the pattern was not always cleanly linear. Later implementations tended to be better as a whole, but I regularly saw cases where I preferred a middle iteration over the last one. Implementation complexity also tended to increase across rounds, with the generator reaching for more ambitious solutions in response to the evaluator’s feedback. Even on the first iteration, outputs were noticeably better than a baseline with no prompting at all, suggesting the criteria and associated language themselves steered the model away from generic defaults before any evaluator feedback led to further refinement.
虽然分数总体上随迭代变好,但并不总是干净的线性曲线。后期实现整体上往往更好,但我经常遇到这种情况:某一中间迭代比最后一轮更得我心。实现复杂度也倾向于随轮数上升——generator 会在 evaluator 反馈的驱动下,去够更有野心的方案。哪怕在第一轮迭代上,输出就明显好于完全不加 prompting 的 baseline——这说明,即便没有任何 evaluator 反馈做后续精修,这些准则与相关措辞本身就把模型从「通用默认值」上推开了。
原文:In one notable example, I prompted the model to create a website for a Dutch art museum. By the ninth iteration, it had produced a clean, dark-themed landing page for a fictional museum. The page was visually polished but largely in line with my expectations. Then, on the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls in free-form positions, and doorway-based navigation between gallery rooms instead of scroll or click. It was the kind of creative leap that I hadn’t seen before from a single-pass generation.
有一个值得讲的例子:我让模型为一家荷兰艺术博物馆做一个网站。到第 9 轮迭代时,它已经做出了一份干净的暗色主题落地页,视觉上打磨得不错,但基本符合我的预期。然后,在第 10 轮,它把整个思路推翻,把这个网站重新想象成一种空间体验:一个用 CSS perspective 渲染出来的 3D 房间,地面是棋盘格,墙上以自由位置挂着艺术品,展厅之间的导航不是滚动或点击,而是穿过门洞。这种创造性跳跃,是我以前从单次生成里没见过的。
扩展到全栈编程
原文:With these findings in hand, I applied this GAN-inspired pattern to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle, where code review and QA serve the same structural role as the design evaluator.
带着这些发现,我把这套 GAN 启发的模式应用到全栈开发上。generator–evaluator 循环天然地映射到软件开发生命周期上——code review 和 QA 在结构上扮演的角色,正好和设计 evaluator 是一回事。
架构
原文:In our earlier long-running harness, we had solved for coherent multi-session coding with an initializer agent, a coding agent that worked one feature at a time, and context resets between sessions. Context resets were a key unlock: the harness used Sonnet 4.5, which exhibited the “context anxiety” tendency mentioned earlier. Creating a harness that worked well across context resets was key to keeping the model on task. Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK’s automatic compaction handling context growth along the way.
在我们更早的 long-running harness 里,我们用 initializer agent + 单 feature 推进的 coding agent + session 之间 context resets 解决了多 session 编程的连贯性问题。Context resets 是关键解锁:那时 harness 用的是 Sonnet 4.5,会表现出前面提到的「context anxiety」倾向。能跨 context resets 良好工作的 harness,才是把模型保持在任务上的关键。Opus 4.5 在很大程度上自己消除了这个行为,所以这次我可以把 context resets 从 harness 里完全拿掉。整个 build 期间,agents 作为一个连续 session 运行,过程中由 Claude Agent SDK 的 automatic compaction 处理 context 增长。
原文:For this work I built on the foundation from the original harness with a three-agent system, with each agent addressing a specific gap I’d observed in prior runs. The system contained the following agent personas:
在这次工作里,我在原始 harness 的基础上,搭了一个三 agent 系统,每个 agent 都对应我在之前运行中观察到的一个具体短板。系统包含以下三种 agent 人格:
原文:Planner: Our previous long-running harness required the user to provide a detailed spec upfront. I wanted to automate that step, so I created a planner agent that took a simple 1-4 sentence prompt and expanded it into a full product spec. I prompted it to be ambitious about scope and to stay focused on product context and high level technical design rather than detailed technical implementation. This emphasis was due to the concern that if the planner tried to specify granular technical details upfront and got something wrong, the errors in the spec would cascade into the downstream implementation. It seemed smarter to constrain the agents on the deliverables to be produced and let them figure out the path as they worked. I also asked the planner to find opportunities to weave AI features into the product specs.
Planner: 我们之前的 long-running harness 要求用户先提供一份详细 spec。我想把这一步自动化,所以做了一个 planner agent——它接收一段 1–4 句话的简单 prompt,然后扩写成一份完整的产品 spec。我在 prompt 里要求它在 scope 上要有野心,但聚焦在产品上下文与高层技术设计上,而不是细节技术实现。这样强调,是因为我担心如果 planner 在前期就尝试敲定颗粒度过细的技术细节、又恰好搞错了,这些 spec 里的错误会级联进下游实现。更聪明的做法似乎是:在「要交付什么」这件事上约束 agents,让它们在干活时自己摸出实现路径。我还要求 planner 主动寻找把 AI features 织进产品 spec 的机会。
原文:Generator: The one-feature-at-a-time approach from the earlier harness worked well for scope management. I applied a similar model here, instructing the generator to work in sprints, picking up one feature at a time from the spec. Each sprint implemented the app with a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack, and the generator was instructed to self-evaluate its work at the end of each sprint before handing off to QA. It also had git for version control.
Generator: 之前 harness 里「一次一个 feature」的做法对 scope 管理很有效。我在这里沿用了类似模型——指示 generator 按 sprint 工作,每个 sprint 从 spec 里挑一个 feature 来实现。每个 sprint 都用 React、Vite、FastAPI 和 SQLite(后来换成 PostgreSQL)这一套技术栈实现应用,generator 在每个 sprint 结尾要先做一次 self-evaluate,然后再交给 QA。它还配了 git 用于版本控制。
原文:Evaluator: Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states. It then graded each sprint against both the bugs it had found and a set of criteria modeled on the frontend experiment, adapted here to cover product depth, functionality, visual design, and code quality. Each criterion had a hard threshold, and if any one fell below it, the sprint failed and the generator got detailed feedback on what went wrong.
Evaluator: 来自早期 harness 的应用经常看上去很唬人,但一旦你真去用,还是会撞到实打实的 bug。为了抓住这些 bug,evaluator 用 Playwright MCP 像真实用户一样点击运行中的应用,测试 UI 功能、API 端点和数据库状态。然后它针对两件事给每个 sprint 打分:它自己发现的 bug,以及一组比照前端实验设计的准则——这里改写过,覆盖 product depth(产品深度)、functionality(功能性)、visual design(视觉设计)和 code quality(代码质量)。每条准则都有一个硬阈值,只要任何一条没过线,这个 sprint 就 fail,generator 会拿到关于问题出在哪儿的详细反馈。
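🟢 译者注(示意代码):「每条准则一个硬阈值,任何一条不过线整个 sprint 就 fail」的门控逻辑,可以写成下面这样的小函数。准则名取自原文(product depth / functionality / visual design / code quality),阈值数值是译者假设的示意:

```python
# sprint 验收门控:逐条对照硬阈值,任一条不达标即 FAIL,并把汇总反馈回传 generator。
THRESHOLDS = {            # 阈值数值为示意,原文未给出具体取值
    "product_depth": 6.0,
    "functionality": 7.0,
    "visual_design": 6.0,
    "code_quality":  6.0,
}

def gate_sprint(scores: dict[str, float], bugs: list[str]) -> tuple[bool, list[str]]:
    """返回 (是否通过, 详细反馈);evaluator 实测发现的 bug 一并计入反馈。"""
    feedback = [f"{name}: {scores[name]} < threshold {t}"
                for name, t in THRESHOLDS.items() if scores[name] < t]
    feedback += [f"bug: {b}" for b in bugs]
    return (not feedback), feedback

ok, fb = gate_sprint(
    {"product_depth": 8, "functionality": 5.5, "visual_design": 7, "code_quality": 7},
    bugs=["delete key does not remove selected entity"],
)
# functionality 低于阈值且存在实测 bug → 本 sprint fail,反馈回传 generator
```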
原文:Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what “done” looked like for that chunk of work before any code was written. This existed because the product spec was intentionally high-level, and I wanted a step to bridge the gap between user stories and testable implementation. The generator proposed what it would build and how success would be verified, and the evaluator reviewed that proposal to make sure the generator was building the right thing. The two iterated until they agreed.
每个 sprint 开工之前,generator 和 evaluator 会先谈出一份 sprint contract(sprint 合约):在写任何代码之前,先就「这一块工作的『完成』长什么样」达成一致。设这一步是因为产品 spec 故意写得很 high-level,我需要一个步骤把 user stories 与可测试实现之间的鸿沟桥接起来。generator 提议要建什么、如何验证成功;evaluator 审查这个提议,确保 generator 在做正确的事。两边来回迭代,直到达成共识。
🟢 译者注:sprint contract 是 Anthropic 这套 harness 里的一个原创术语——它不是 Scrum 里的 sprint backlog,而更像「validator-side acceptance criteria + generator-side implementation plan 的双向锁定」。这是把模糊 spec 转成可测试目标的关键编排环节。
原文:Communication was handled via files: one agent would write a file, another agent would read it and respond either within that file or with a new file that the previous agent would read in turn. The generator then built against the agreed-upon contract before handing the work off to QA. This kept the work faithful to the spec without over-specifying implementation too early.
agent 之间的通讯通过文件完成:一个 agent 写一个文件,另一个 agent 读它,然后要么在同一份文件里回复,要么写一份新文件让前者再去读。然后 generator 按照达成共识的 contract 实现工作,完成后再交给 QA。这样既能让工作忠于 spec,又不会过早把实现细节写死。
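🟢 译者注(示意代码):「通过文件协商 sprint contract」这一环节的骨架大致如下。文件名、字段与「达成一致」的判定方式均为译者假设的示意——真实 harness 里读写双方都是 agent,且会来回多轮:

```python
import tempfile
from pathlib import Path

def propose(sprint_dir: Path, feature: str) -> Path:
    """generator:把本 sprint 要建什么、「完成」如何验证写成 proposal 文件。"""
    p = sprint_dir / "contract_proposal.md"
    p.write_text(f"# Sprint contract\nFeature: {feature}\n"
                 "Done means: rectangle fill tool fills the dragged region\n")
    return p

def review(proposal: Path) -> Path:
    """evaluator:读 proposal,在新文件里回复批准或修改意见(stub 示意)。"""
    reply = proposal.with_name("contract_review.md")
    verdict = ("APPROVED" if "Done means" in proposal.read_text()
               else "REVISE: add testable criteria")
    reply.write_text(verdict)
    return reply

sprint_dir = Path(tempfile.mkdtemp())
reply = review(propose(sprint_dir, "level editor: rectangle fill tool"))
# 双方在文件上来回迭代,直到 review 结果为 APPROVED,generator 才开始写代码
```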
跑起来
原文:For the first version of this harness, I used Claude Opus 4.5, running user prompts against both the full harness and a single-agent system for comparison. I used Opus 4.5 since this was our best coding model when I began these experiments.
第一版 harness 用的是 Claude Opus 4.5。我把同一个用户 prompt 同时跑在 full harness 和一个 single-agent 系统上做对比。用 Opus 4.5 是因为我开始这组实验时,这是我们最强的编程模型。
原文:I wrote the following prompt to generate a retro video game maker:
Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode.
我写了下面这条 prompt 来生成一个复古风游戏制作器:
Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode. (创建一个 2D 复古游戏制作器,功能包括关卡编辑器、精灵编辑器、实体行为系统,以及一个可玩的测试模式。)
原文:The table below shows the harness type, length it ran for, and the total cost.
下表展示了 harness 类型、运行时长与总成本。
| Harness | Duration | Cost |
|---|---|---|
| Solo | 20 min | $9 |
| Full harness | 6 hr | $200 |
原文:The harness was over 20x more expensive, but the difference in output quality was immediately apparent.
full harness 的成本是 solo 的 20 倍以上,但输出质量的差距立刻可见。
原文:I was expecting an interface where I could construct a level and its component parts (sprites, entities, tile layout) then hit play to actually play the level. I started by opening the solo run’s output, and the initial application seemed in line with those expectations.
我期待的是一个界面:我可以构造一个关卡及其组成部分(精灵、实体、瓦片布局),然后按 play 真正玩这个关卡。我先打开了 solo run 的产物,初始应用看起来基本符合预期。
原文:As I clicked through, however, issues started to emerge. The layout wasted space, with fixed-height panels leaving most of the viewport empty. The workflow was rigid. Trying to populate a level prompted me to create sprites and entities first, but nothing in the UI guided me toward that sequence. More to the point, the actual game was broken. My entities appeared on screen but nothing responded to input. Digging into the code revealed that the wiring between entity definitions and the game runtime was broken, with no surface indication of where.
但当我点进去之后,问题就开始浮现。布局浪费空间,固定高度面板让大部分视口空着;工作流僵硬——我尝试往关卡里填东西,系统提示我要先创建精灵和实体,但 UI 里没有任何东西引导我按这个顺序操作。更要命的是,实际的游戏是坏的——我的实体出现在屏幕上,但对任何输入都没反应。深入代码发现,entity 定义与游戏运行时之间的连线断了,而表面上没有任何提示告诉你断在哪儿。
原文:After evaluating the solo run, I turned my attention to the harness run. This run started from the same one-sentence prompt, but the planner step expanded that prompt into a 16-feature spec spread across ten sprints. It went well beyond what the solo run attempted. In addition to the core editors and play mode, the spec called for a sprite animation system, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and game export with shareable links. I gave the planner access to our frontend design skill, which it read and used to create a visual design language for the app as part of the spec. For each sprint, the generator and evaluator negotiated a contract defining the specific implementation details for the sprint, and the testable behaviors that would be tested to verify completion.
评完 solo run,我把注意力转到 harness run 上。这一次从同一句话 prompt 出发,但 planner 那一步把它扩写成了一份覆盖 16 个 feature、分布在 10 个 sprint 的 spec,远远超出了 solo run 尝试的范围。除了核心编辑器和 play mode,spec 还包括精灵动画系统、行为模板、音效与音乐、AI 辅助精灵生成器与关卡设计器,以及带可分享链接的游戏导出。我把我们的 frontend design skill 给了 planner,它读了之后用来在 spec 里为这个应用创造一套视觉设计语言。对每个 sprint,generator 与 evaluator 谈出一份 contract,定义这个 sprint 的具体实现细节,以及验证完成所需的可测试行为。
原文:The app immediately showed more polish and smoothness than the solo run. The canvas used the full viewport, the panels were sized sensibly, and the interface had a consistent visual identity that tracked the design direction from the spec. Some of the clunkiness I’d seen in the solo run did remain—the workflow still didn’t make it clear that you should build sprites and entities before trying to populate a level, and I had to figure that out by poking around. This read as a gap in the base model’s product intuition rather than something the harness was designed to address, though it did suggest a place where targeted iteration inside the harness could help to further improve output quality.
应用立刻就比 solo run 更精致、更流畅。画布占满整个视口,面板尺寸合理,界面有一种紧扣 spec 设计方向的一致视觉身份。我在 solo run 里看到的一些笨拙之处仍然存在——工作流还是没说清楚要先做精灵与实体再去填关卡,我得自己摸出来。这看起来是 base model 在产品直觉上的短板,而不是 harness 本来要解决的问题——不过它也提示了:在 harness 内部做有针对性的迭代,能进一步改善输出质量。
原文:Working through the editors, the new run’s advantages over solo became more apparent. The sprite editor was richer and more fully featured, with cleaner tool palettes, a better color picker, and more usable zoom controls.
把各个编辑器逐一过一遍之后,新一轮相对 solo 的优势就更明显了。精灵编辑器更丰富、功能更完整——工具面板更干净、调色器更好、缩放控制更可用。
原文:Because I’d asked the planner to weave AI features into its specs, the app also came with a built-in Claude integration that let me generate different parts of the game through prompting. This significantly sped up the workflow.
因为我让 planner 把 AI features 织进 spec,这个应用还自带了一个 Claude 集成——让我可以通过 prompting 生成游戏的不同部分。这大大加快了工作流。
原文:The biggest difference was in play mode. I was actually able to move my entity and play the game. The physics had some rough edges—my character jumped onto a platform but ended up overlapping with it, which felt intuitively wrong—but the core thing worked, which the solo run did not manage. After moving around a bit, I did hit some limitations with the AI’s game level construction. There was a large wall that I wasn’t able to jump past, so I was stuck. This suggested there were some common sense improvements and edge cases that the harness could handle to further refine the app.
最大的差别在 play mode。我真的能让我的实体动起来、玩这个游戏。物理上有些粗糙边缘——我的角色跳到一个平台上,但最后是和平台重叠了,直觉上感觉不对——但核心能跑,这是 solo run 没做到的。走动几下之后,我撞到了 AI 关卡构造的一些限制——有一堵很大的墙我跳不过去,所以卡住了。这提示我:harness 可以再处理一些常识性改进与边界情形,把应用进一步打磨。
原文:Reading through the logs, it was clear that the evaluator kept the implementation in line with the spec. Each sprint, it walked through the sprint contract’s test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. The contracts were granular—Sprint 3 alone had 27 criteria covering the level editor—and the evaluator’s findings were specific enough to act on without extra investigation. The table below shows several examples of issues our evaluator identified:
通读 logs,可以清楚看到 evaluator 把实现牢牢拽在 spec 上。每个 sprint,它都会沿着 sprint contract 的测试准则走一遍,通过 Playwright 跑动运行中的应用,对任何偏离预期行为的地方都开 bug。这些 contract 颗粒度很细——光 Sprint 3 就有 27 条覆盖 level editor 的准则——而且 evaluator 的发现具体到不需要额外调查就可以直接处理。下表是 evaluator 标出的几个问题样例:
| Contract criterion | Evaluator finding |
|---|---|
| Rectangle fill tool allows click-drag to fill a rectangular area with selected tile | FAIL — Tool only places tiles at drag start/end points instead of filling the region. fillRectangle function exists but isn’t triggered properly on mouseUp. |
| User can select and delete placed entity spawn points | FAIL — Delete key handler at LevelEditor.tsx:892 requires both selection and selectedEntityId to be set, but clicking an entity only sets selectedEntityId. Condition should be selection || (selectedEntityId && activeLayer === 'entity'). |
| User can reorder animation frames via API | FAIL — PUT /frames/reorder route defined after /{frame_id} routes. FastAPI matches reorder as a frame_id integer and returns 422: “unable to parse string as an integer.” |
🟢 译者注:这张表是全文最具说服力的 demo。注意 evaluator 的反馈格式——「定位到具体文件:行号」「指出错误条件」「给出建议修复表达式」——这是把 LLM QA 从「找 bug」推到「可以直接被 generator 编译进 patch」的工程范式。
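🟢 译者注(示意代码):上表第三行的 bug(`/frames/reorder` 注册在 `/{frame_id}` 之后,被按声明顺序优先匹配成整数参数而返回 422)可以用一个极简「按注册顺序取首个匹配」的路由器复现。下面是译者写的示意,不是 FastAPI 源码,但匹配语义与之对应:同前缀下,静态路由必须注册在动态路由之前:

```python
# 极简首个匹配路由器:演示「静态路由注册在动态路由之后会被遮蔽」这类 bug。
import re

class Router:
    def __init__(self):
        self.routes: list[tuple[re.Pattern, str]] = []

    def add(self, pattern: str, name: str) -> None:
        # 把 {frame_id} 这类动态段编译成 [^/]+,静态段按字面匹配
        regex = re.sub(r"\{[^/]+\}", r"[^/]+", pattern)
        self.routes.append((re.compile(f"^{regex}$"), name))

    def match(self, path: str) -> str:
        for pat, name in self.routes:   # 按注册顺序,命中即返回
            if pat.match(path):
                return name
        return "404"

buggy = Router()
buggy.add("/frames/{frame_id}", "update_frame")   # 动态路由在前
buggy.add("/frames/reorder", "reorder")           # 静态路由被遮蔽,永远匹配不到

fixed = Router()
fixed.add("/frames/reorder", "reorder")           # 修复:静态路由先注册
fixed.add("/frames/{frame_id}", "update_frame")
```

在 buggy 版本里,`/frames/reorder` 会命中动态路由并把 `"reorder"` 当作 `frame_id` 传下去——对应 FastAPI 那条「unable to parse string as an integer」的 422 报错。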
原文:Getting the evaluator to perform at this level took work. Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren’t a big deal and approve the work anyway. It also tended to test superficially, rather than probing edge cases, so more subtle bugs often slipped through. The tuning loop was to read the evaluator’s logs, find examples where its judgment diverged from mine, and update the QAs prompt to solve for those issues. It took several rounds of this development loop before the evaluator was grading in a way that I found reasonable. Even then, the harness output showed the limits of the model’s QAing capabilities: small layout issues, interactions that felt unintuitive in places, and undiscovered bugs in more deeply nested features that the evaluator hadn’t exercised thoroughly. There was clearly more verification headroom to capture with further tuning. But compared to the solo run, where the central feature of the application simply didn’t work, the lift was obvious.
把 evaluator 调到这个水平,是要花功夫的。开箱即用的 Claude 是一个糟糕的 QA agent。早期跑的时候,我亲眼看到它识别出一个确实存在的问题,然后又在自己脑子里把自己说服「这事没那么大」,最后还是批准了这份工作。它也倾向于做表面化的测试,而不是去探边界情形,所以更微妙的 bug 经常漏掉。我的调优循环是:读 evaluator 的 logs,找出它的判断与我的判断分歧的例子,然后更新 QA 的 prompt 来解决那些问题。这套开发循环跑了好几轮,evaluator 的打分才变得在我看来合理。即便如此,harness 的产出仍然暴露出模型 QA 能力的边界:小的布局问题、有些地方交互不直观,以及在 evaluator 没充分演练过的深层嵌套 feature 里没被发现的 bug。显然,继续调优还能再啃下一块验证的 headroom。但相对 solo run——在那里应用的核心 feature 根本不能用——这个 lift 是显而易见的。
🟢 译者注:这一段是对「为什么 evaluator 不能开箱即用」最坦诚的供述——很多读者读到「LLM-as-judge」会以为只要换个 prompt 就行,Anthropic 这里明确说了:你需要花数轮 dev loop,边读 logs 边迭代 prompt,才能让 evaluator 的判断标准与你对齐。
在 harness 上迭代
原文:The first set of harness results was encouraging, but it was also bulky, slow, and expensive. The logical next step was to find ways to simplify the harness without degrading its performance. This was partly common sense and partly a function of a more general principle: every component in a harness encodes an assumption about what the model can’t do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve. Our blog post Building Effective Agents frames the underlying idea as “find the simplest solution possible, and only increase complexity when needed,” and it’s a pattern that shows up consistently for anyone maintaining an agent harness.
第一组 harness 结果令人鼓舞,但它也笨重、慢、且昂贵。下一步合乎逻辑的事情,就是找办法在不降级性能的前提下简化 harness。这一半是常识,一半来自一条更一般的原则:harness 里的每一个组件都编码了一个『模型独自做不到某事』的假设;这些假设值得被压力测试——既因为它们可能本来就错,也因为它们会随着模型变强而很快过期。我们之前的博文《Building Effective Agents》把背后的思路总结为「先找最简方案,只在需要时增加复杂度」,这是任何一个维护 agent harness 的人都会反复看到的模式。
🟢 译者注:这条是 Anthropic 在 agent 工程上的核心方法论——harness 是「短期假设的物化」,不是「永久基础设施」。模型每升一代,所有 harness 组件都要被重新审视一遍。
原文:In my first attempt to simplify, I cut the harness back radically and tried a few creative new ideas, but I wasn’t able to replicate the performance of the original. It also became difficult to tell which pieces of the harness design were actually load-bearing, and in what ways. Based on that experience, I moved to a more methodical approach, removing one component at a time and reviewing what impact it had on the final result.
第一次尝试简化时,我把 harness 砍得很激进,还试了几个创新想法,但没能复现原版的性能。这也让我难以分辨 harness 设计里哪些部件其实是 load-bearing(承重)的、又是以什么方式承重的。基于这段经验,我换成更有方法的做法——一次只去掉一个组件,看它对最终结果的影响。
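🟢 译者注:「一次只去掉一个组件」本质上就是消融实验(ablation)。下面用几行 Python 示意这种实验矩阵的生成方式(组件名为举例,非原文所列):baseline 配置跑一次,然后每个配置恰好比 baseline 少一个组件,这样每次对比都能把性能差异归因到单一部件上。

```python
# 假设的 harness 组件集合,仅作示意
BASELINE = {"planner", "sprints", "evaluator", "context_resets"}

def ablation_runs(components: set[str]) -> list[set[str]]:
    """生成消融实验矩阵:第一个是完整 baseline,
    其后每个配置恰好移除一个组件。"""
    runs = [set(components)]
    for c in sorted(components):
        runs.append(components - {c})
    return runs
```

相比「一次砍掉一大片」,这种矩阵的代价是要多跑几次完整 build,但换来的是每个组件是否 load-bearing 的清晰归因——这正是作者从激进简化失败后退回的方法。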
原文:As I was going through these iteration cycles, we also released Opus 4.6, which provided further motivation to reduce harness complexity. There was good reason to expect 4.6 would need less scaffolding than 4.5 did. From our launch blog: “[Opus 4.6] plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes.” It also improved substantially on long-context retrieval. These were all capabilities the harness had been built to supplement.
在我跑这些迭代循环的过程中,我们发布了 Opus 4.6,这进一步刺激了我去削减 harness 复杂度。有充分理由相信 4.6 需要的 scaffolding 比 4.5 更少。来自我们 launch blog 的描述:「[Opus 4.6] plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes.」(规划更仔细、能维持 agent 任务更久、在更大代码库中更可靠、有更强的 code review 与 debug 能力以抓住自己的错。)它在 long-context retrieval 上也有大幅改进——而这些恰恰都是 harness 当初被设计出来去补的能力。
移除 sprint 结构
原文:I started by removing the sprint construct entirely. The sprint structure had helped to decompose work into chunks for the model to work coherently. Given the improvements in Opus 4.6, there was good reason to believe that the model could natively handle the job without this sort of decomposition.
我从彻底移除 sprint 结构开始。sprint 结构原本帮助把工作切成小块,让模型保持连贯。考虑到 Opus 4.6 的改进,有充分理由相信模型可以原生处理这件事,不再需要这种拆分。
原文:I kept both the planner and evaluator, as each continued to add obvious value. Without the planner, the generator under-scoped: given the raw prompt, it would start building without first speccing its work, and end up creating a less feature-rich application than the planner did.
planner 和 evaluator 我都保留了,因为两者都继续提供明显价值。没有 planner,generator 会 under-scope(范围太保守):给它原始 prompt,它会不先 spec 就开干,最终做出来的应用 feature 不如 planner 推动出来的丰富。
原文:With the sprint construct removed, I moved the evaluator to a single pass at the end of the run rather than grading per sprint. Since the model was much more capable, it changed how load-bearing the evaluator was for certain runs, with its usefulness depending on where the task sat relative to what the model could do reliably on its own. On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. On 4.6, the model’s raw capability increased, so the boundary moved outward. Tasks that used to need the evaluator’s check to be implemented coherently were now often within what the generator handled well on its own, and for tasks within that boundary, the evaluator became unnecessary overhead. But for the parts of the build that were still at the edge of the generator’s capabilities, the evaluator continued to give real lift.
移除 sprint 结构之后,我把 evaluator 改成在整个运行结束时做一次 single pass(单次扫描),而不是每个 sprint 打一次分。因为模型能力强了很多,evaluator 在某些运行里的承重程度也跟着变了——它的有用性取决于任务相对于「模型独自能可靠完成的边界」处在哪里。在 4.5 上,这条边界离我们很近——我们的 build 正好在 generator 单干能力的边缘,evaluator 抓到了贯穿整个 build 的实质性问题。到 4.6,模型的原始能力上升,这条边界外推了。原本需要 evaluator 把关才能连贯实现的任务,现在往往已落在 generator 自己就能处理好的范围之内;对这条边界以内的任务,evaluator 反而变成多余开销。但对仍然处在 generator 能力边界上的部分,evaluator 继续提供真实的提升。
原文:The practical implication is that the evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo.
实际启示是:evaluator 不是一个固定的「要 / 不要」决策。当任务超出当前模型独自能可靠完成的范围时,它就值这个成本。
原文:Alongside the structural simplification, I also added prompting to improve how the harness built AI features into each app, specifically getting the generator to build a proper agent that could drive the app’s own functionality through tools. That took real iteration, since the relevant knowledge is recent enough that Claude’s training data covers it thinly. But with enough tuning, the generator was building agents correctly.
在做结构简化的同时,我也加了一些 prompting 来改进 harness 在每个应用里嵌入 AI features 的方式,具体说就是让 generator 构造一个真正的 agent——能通过 tools 驱动应用自己的功能。这一步花了真功夫,因为相关知识太新,Claude 训练数据里的覆盖很薄。但调够了之后,generator 能正确构造出 agent。
🟢 译者注:这里 Anthropic 自己也撞上了一个有意思的现象——「让 LLM 写一个调用 Tool Use 的 LLM agent」,因为这部分知识在训练数据里覆盖薄,需要靠 prompting 补齐。这对所有正在做「AI 写 AI 应用」的团队是一个直接的工程提醒。
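🟢 译者注:原文没有给出 generator 构造出的 agent 长什么样。下面是一个与 API 无关的极简 tool-use 循环示意(`set_tempo`、`add_track` 等工具名与 `model_step` 回调均为假设,非原文内容):模型每步提出一个工具调用,harness 执行后把结果回填进对话历史,直到模型宣布完成。「驱动应用自身功能的 agent」在结构上就是这样一个循环。

```python
import json

# 假设的工具注册表:把应用自身的功能暴露给 agent
TOOLS = {
    "set_tempo": lambda bpm: {"tempo": bpm},
    "add_track": lambda name: {"track": name},
}

def run_agent(model_step, user_goal: str, max_steps: int = 8) -> list[dict]:
    """极简 tool-use 循环:model_step(history) 返回
    {"tool": ..., "args": {...}} 或 {"done": True}。
    harness 执行工具、把结果追加进 history,直到完成或步数耗尽。"""
    history = [{"role": "user", "content": user_goal}]
    results = []
    for _ in range(max_steps):
        action = model_step(history)
        if "done" in action:
            break
        out = TOOLS[action["tool"]](**action["args"])
        results.append(out)
        history.append({"role": "tool", "content": json.dumps(out)})
    return results
```

真实实现里 `model_step` 会是一次模型调用;这里抽成回调,是为了说明「generator 要学会写的」正是这层循环加工具注册表,而这层知识恰恰是训练数据覆盖薄的部分。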
更新版 harness 的结果
原文:To put the updated harness to the test, I used the following prompt to generate a Digital Audio Workstation (DAW), a music production program for composing, recording, and mixing songs:
Build a fully featured DAW in the browser using the Web Audio API.
为了测试更新后的 harness,我用下面这条 prompt 生成一个 Digital Audio Workstation(DAW)——一个用于作曲、录音和混音的音乐制作程序:
Build a fully featured DAW in the browser using the Web Audio API. (在浏览器里用 Web Audio API 构建一个功能完整的 DAW。)
原文:The run was still lengthy and expensive, at about 4 hours and $124 in token costs.
这次运行依然又长又贵——大约 4 小时,token 成本约 $124。
原文:Most of the time went to the builder, which ran coherently for over two hours without the sprint decomposition that Opus 4.5 had needed.
大部分时间花在 builder 上,它没有 Opus 4.5 当年需要的 sprint 拆分,连续连贯地跑了两个多小时。
| Agent & Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build (Round 1) | 2 hr 7 min | $71.08 |
| QA (Round 1) | 8.8 min | $3.24 |
| Build (Round 2) | 1 hr 2 min | $36.89 |
| QA (Round 2) | 6.8 min | $3.09 |
| Build (Round 3) | 10.9 min | $5.88 |
| QA (Round 3) | 9.6 min | $4.06 |
| Total V2 Harness | 3 hr 50 min | $124.70 |
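🟢 译者注:表中 Build (Round 1) 的 2 hr 7 min 即 127 分钟。可以用几行 Python 核对各阶段加总,确实得到表末的 3 hr 50 min 与 $124.70:

```python
# 各阶段 (时长/分钟, 成本/美元),从上表转录
phases = {
    "Planner":         (4.7,   0.46),
    "Build (Round 1)": (127.0, 71.08),
    "QA (Round 1)":    (8.8,   3.24),
    "Build (Round 2)": (62.0,  36.89),
    "QA (Round 2)":    (6.8,   3.09),
    "Build (Round 3)": (10.9,  5.88),
    "QA (Round 3)":    (9.6,   4.06),
}
total_min = sum(m for m, _ in phases.values())
total_cost = sum(c for _, c in phases.values())
hours = int(total_min // 60)
print(f"{hours} hr {total_min - hours * 60:.0f} min, ${total_cost:.2f}")
# → 3 hr 50 min, $124.70
```

顺带一提,三轮 QA 合计约 $10.4,仅占总成本的 8% 左右——evaluator 相对 builder 是便宜的那一环。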
原文:As with the previous harness, the planner expanded the one-line prompt into a full spec. From the logs, I could see the generator model did a good job planning the app and the agent design, wiring the agent up, and testing it before handing off to QA.
和之前的 harness 一样,planner 把一行 prompt 扩成了一份完整 spec。从 logs 看,generator 模型在规划应用与 agent 设计、把 agent 接起来、并在交给 QA 之前自测这些事情上做得都很好。
原文:That being said, the QA agent still caught real gaps. In its first-round feedback, it noted:
This is a strong app with excellent design fidelity, solid AI agent, and good backend. The main failure point is Feature Completeness — while the app looks impressive and the AI integration works well, several core DAW features are display-only without interactive depth: clips can’t be dragged/moved on the timeline, there are no instrument UI panels (synth knobs, drum pads), and no visual effect editors (EQ curves, compressor meters). These aren’t edge cases — they’re the core interactions that make a DAW usable, and the spec explicitly calls for them.
话虽如此,QA agent 还是抓到了实打实的缺口。第一轮反馈里它写道:
这是一个很强的应用,设计忠实度高,AI agent 扎实,后端也好。主要失分点在 Feature Completeness(功能完整性)——虽然应用看起来很出彩、AI 集成工作良好,但有几个核心 DAW 功能只是「展示性」的,没有交互深度:clips 不能在时间轴上拖动 / 移动,没有乐器 UI 面板(合成器旋钮、鼓垫),也没有视觉化的效果编辑器(EQ 曲线、压缩器电平表)。这些不是边角情况——这些就是让 DAW 真正可用的核心交互,而且 spec 里明确要求过它们。
原文:In its second round feedback, it again caught several functionality gaps:
Remaining gaps:
- Audio recording is still stub-only (button toggles but no mic capture)
- Clip resize by edge drag and clip split not implemented
- Effect visualizations are numeric sliders, not graphical (no EQ curve)
第二轮反馈里,它又抓到几处功能缺口:
剩余缺口:
- 录音仍然只是 stub(按钮能切换但没有麦克风捕获)
- 通过边缘拖拽调整 clip 长度、以及 clip 切分,都还没实现
- 效果可视化只是数字滑块,没有图形化(没有 EQ 曲线)
原文:The generator was still liable to miss details or stub features when left to its own devices, and the QA still added value in catching those last mile issues for the generator to fix.
把 generator 完全放任不管,它仍然容易漏掉细节、或者把 feature 留成 stub;QA 在抓那些「最后一公里」问题、让 generator 去修这件事情上,仍然在创造价值。
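🟢 译者注:上面三轮 Build/QA 的节奏,对应的编排逻辑可以示意如下(`build`、`qa` 均为假设的回调,非原文代码):QA 在每轮完整 build 之后做一次 single pass,通过则收工,否则把反馈喂回下一轮 build,直到批准或轮次预算用尽。

```python
def run_harness(build, qa, max_rounds: int = 3):
    """V2 harness 的 Build/QA 循环骨架:
    build(feedback) 产出一版应用,qa(app) 返回 (是否批准, 反馈)。
    批准即停;否则带着反馈进入下一轮,最多 max_rounds 轮。"""
    feedback = None
    app = None
    for round_no in range(1, max_rounds + 1):
        app = build(feedback)
        approved, feedback = qa(app)
        if approved:
            return app, round_no
    return app, max_rounds
```

结合上文「evaluator 是条件性开销」的结论,真实 harness 里还可以在任务明显落在模型可靠边界之内时直接令 `max_rounds = 1`、跳过 QA,省下那部分成本。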
原文:Based on the prompt, I was expecting a program where I could create melodies, harmonies, and drum patterns, arrange them into a song, and get help from an integrated agent along the way. The video below shows the result.
按照 prompt,我期待的是这样一个程序:能创建旋律、和声、鼓点,把它们编排成一首歌,过程中有一个集成的 agent 一路协助。下面的视频展示了结果。
原文:The app is far from a professional music production program, and the agent’s song composition skills could clearly use a lot of work. Additionally, Claude can’t actually hear, which made the QA feedback loop less effective with respect to musical taste.
这个应用离专业音乐制作程序还很远,agent 的作曲技能也明显还有大量提升空间。另外 Claude 实际上听不到声音,这让 QA 反馈循环在「音乐品味」这个维度上效果打了折扣。
🟢 译者注:这是文中难得一见的「诚实硬约束承认」——Claude 不能听音频,所以 evaluator-loop 在音乐品味维度上是断的;同样地,如果未来要做视频 / 音乐 / 触觉这类多模态产品,evaluator 必须接入对应模态的感知能力。
原文:But the final app had all the core pieces of a functional music production program: a working arrangement view, mixer, and transport running in the browser. Beyond that, I was able to put together a short song snippet entirely through prompting: the agent set the tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives for song composition were present, and the agent could drive them autonomously, using tools to create a simple production from end to end. You might say it’s not pitch-perfect yet—but it’s getting there.
但最终的应用具备了一个可用音乐制作程序的所有核心部件:一个能用的编排视图、混音器,以及一套跑在浏览器里的 transport。除此之外,我完全通过 prompting 拼出了一个短小的歌曲片段:agent 设了 tempo 和调性,铺了一段旋律,搭了一条鼓轨,调了混音器的电平,还加了 reverb。作曲所需的核心 primitives 都在,而且 agent 能自主驱动它们,用 tools 端到端地做出一个简单制作。你或许会说它还没到 pitch-perfect(完美音准)——但它在路上了。
🟢 译者注:文末这个 “pitch-perfect” 是双关——既是音乐术语(完美音准),也呼应整篇 harness 设计的迭代结果「还没完美但在收敛」。
接下来会怎样
原文:As models continue to improve, we can roughly expect them to be capable of working for longer, and on more complex tasks. In some cases, that will mean the scaffold surrounding the model matters less over time, and developers can wait for the next model and see certain problems solve themselves. On the other hand, the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.
随着模型继续变强,我们大致可以预期它们能工作更久、应对更复杂的任务。在某些情况下,这意味着围绕模型的脚手架的重要性会随时间下降——开发者可以等下一个模型,看着某些问题自己解决掉。另一方面,模型越强,可以发展出来的 harness 空间也越大——它们能处理超出模型 baseline 的复杂任务。
原文:With this in mind, there are a few lessons from this work worth carrying forward. It is always good practice to experiment with the model you’re building against, read its traces on realistic problems, and tune its performance to achieve your desired outcomes. When working on more complex tasks, there is sometimes headroom from decomposing the task and applying specialized agents to each aspect of the problem. And when a new model lands, it is generally good practice to re-examine a harness, stripping away pieces that are no longer load-bearing to performance and adding new pieces to achieve greater capability that may not have been possible before.
带着这个思路,这次工作里有几条经验值得带走:始终拿你实际面向构建的那个模型做实验、阅读它在真实问题上的 traces、并把它的性能调到你想要的结果——这些都是好做法。在更复杂的任务上,把任务拆开、给问题的每个方面派上专门的 agent,有时能拿到额外的 headroom。每当一个新模型落地,通常都应该重新审视 harness——剥掉那些对性能不再 load-bearing 的部件,加上能让你触及之前不可能的能力的新部件。
原文:From this work, my conviction is that the space of interesting harness combinations doesn’t shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.
通过这次工作,我形成了一个信念:有趣的 harness 组合空间不会随着模型变强而收缩,它只是移动——而 AI engineers 真正有意思的工作,是不断去找下一个新颖的组合。
🟢 译者注:这是全文最重要的一句结论——也是和 Addy Osmani 民间版的 Harness Engineering 互相印证的一句。模型变强不是 harness 工程师失业的理由,而是工作面位移的理由。
Appendix(附录):planner agent 生成的示例 spec
原文:Example plan generated by planner agent.
planner agent 生成的示例 plan。
RetroForge - 2D Retro Game Maker
Overview
RetroForge is a web-based creative studio for designing and building 2D retro-style video games. It combines the nostalgic charm of classic 8-bit and 16-bit game aesthetics with modern, intuitive editing tools—enabling anyone from hobbyist creators to indie developers to bring their game ideas to life without writing traditional code.
The platform provides four integrated creative modules: a tile-based Level Editor for designing game worlds, a pixel-art Sprite Editor for crafting visual assets, a visual Entity Behavior system for defining game logic, and an instant Playable Test Mode for real-time gameplay testing. By weaving AI assistance throughout (powered by Claude), RetroForge accelerates the creative process—helping users generate sprites, design levels, and configure behaviors through natural language interaction.
RetroForge targets creators who love retro gaming aesthetics but want modern conveniences. Whether recreating the platformers, RPGs, or action games of their childhood, or inventing entirely new experiences within retro constraints, users can prototype rapidly, iterate visually, and share their creations with others.
Features
1. Project Dashboard & Management
The Project Dashboard is the home base for all creative work in RetroForge. Users need a clear, organized way to manage their game projects—creating new ones, returning to works-in-progress, and understanding what each project contains at a glance.
User Stories: As a user, I want to:
- Create a new game project with a name and description, so that I can begin designing my game
- See all my existing projects displayed as visual cards showing the project name, last modified date, and a thumbnail preview, so that I can quickly find and continue my work
- Open any project to enter the full game editor workspace, so that I can work on my game
- Delete projects I no longer need, with a confirmation dialog to prevent accidents, so that I can keep my workspace organized
- Duplicate an existing project as a starting point for a new game, so that I can reuse my previous work
Project Data Model: Each project contains:
- Project metadata (name, description, created/modified timestamps)
- Canvas settings (resolution: e.g., 256x224, 320x240, or 160x144)
- Tile size configuration (8x8, 16x16, or 32x32 pixels)
- Color palette selection
- All associated sprites, tilesets, levels, and entity definitions
...
🟢 译者注:这份是 planner agent 从一句话 prompt(「Create a 2D retro game maker…」)扩写出来的产品 spec 节选。注意三件事:
- 写作风格不像 task list,而像产品策划文档——有 Overview、目标用户、Features 章节,每个 feature 下都有「User Stories: As a user, I want to…」格式。
- technical 细节克制——只到 Project Data Model 这种结构层级,不会一上来就指定具体的库或文件结构;这正是文中说的「constrain on deliverables, let agents figure out the path」。
- AI features 是织入式的——「By weaving AI assistance throughout (powered by Claude)」直接写进 Overview。这是因为 planner 的 prompt 里就要求「find opportunities to weave AI features into the product specs」。
译者总评
- harness 不是永久基础设施,而是「短期假设的物化」。整篇文章最值得抄下来的一句是:「every component in a harness encodes an assumption about what the model can’t do on its own.」每升级一代模型,所有组件都要被重新 stress test。Sonnet 4.5 → Opus 4.5 直接干掉 context resets;Opus 4.5 → Opus 4.6 直接干掉 sprint 结构。中文 AI 团队尤其要警惕「写一次跑三年」的工程惯性。
- 可分级的评分准则,是把「主观品味」装进自动化反馈环路的入场券。「is this design beautiful?」打不了分,「does this follow our principles for good design?」可以打分。前端工程师可以直接照搬这套四维评分准则(design quality / originality / craft / functionality)、再叠一层 Playwright MCP 让 evaluator 实地点页面,这是对所有「AI 产 UI」管线最直接可复用的工程范式。
- evaluator 是「条件性昂贵开销」,不是固定结构件。当任务超出当前模型独自能可靠完成的边界时,evaluator 值钱;在边界内,它就是无谓 overhead。这句话颠覆了许多团队默认「multi-agent 一定比 single-agent 强」的偏见。
- 「LLM-as-judge」开箱即用是糟糕的——必须经历多轮 dev loop 把它训成「持怀疑态度」。文中坦白说 Claude 默认是「poor QA agent」,会自己说服自己批准坏代码。要让它好用,必须读 logs、找分歧、改 prompt,反复几轮。这条对所有正在做 AI QA / AI code review 的产品经理都是清醒剂。
- Sprint Contract 是把模糊 spec 转成可测试目标的关键编排环节。由 generator 提议「要建什么、如何验证」,evaluator 审核到双方一致后才允许写代码——这一步等同于把「产品需求评审」搬到了 agent 之间。这种 contract-first、validator-side acceptance criteria 的范式,值得迁移到任何 long-running、多 agent 协作的工程场景里。
调研来源
- 原文: https://www.anthropic.com/engineering/harness-design-long-running-apps
- 配套精读:Cursor SDK + Zed 1.0 与 Addy Osmani 三连的 Harness Engineering(民间版)
- Anthropic 早期 harness 论文: https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
- Anthropic 《Building Effective Agents》: https://www.anthropic.com/research/building-effective-agents
- Anthropic 《Effective Context Engineering for AI Agents》: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- frontend design skill 源码: https://github.com/anthropics/claude-code/blob/main/plugins/frontend-design/skills/frontend-design/SKILL.md
- Claude Agent SDK: https://platform.claude.com/docs/en/agent-sdk/overview
- Ralph Wiggum 方法(社区): https://ghuntley.com/ralph/
📝 配套精读 + 译者点评:Cursor SDK + Zed 1.0:编辑器赛道转向 / Addy Osmani 三连