· FRONTEND-2026-RADAR · 2026.05.05 · 62 MIN ·

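Addy Osmani: Agent Harness Engineering (Full Text)

Addy Osmani’s 2026-04-19 post, paragraph by paragraph, with translator’s notes. Agent = Model + Harness; Top 30 → Top 5 by changing only the harness; the “skill issue” reframe and the ratchet discipline. · by 思扬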



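Original: https://addyosmani.com/blog/agent-harness-engineering/
Author: Addy Osmani (Google Director, Cloud AI / Gemini)
Published: April 19, 2026
Scope of this edition: the complete text, paragraph by paragraph, with translator’s notes. Companion close reading: the Addy Osmani trilogy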

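Translator’s Foreword

This post is the hands-on finale of Addy Osmani’s Q1 2026 trilogy (Factory Model / Comprehension Debt / Harness Engineering). The first two defined how engineering paradigms are being redrawn in the AI era and what the risks of the maintenance phase are. This one lifts the agent + harness abstraction to the level of an engineering category: like IDEs or test frameworks, it has its own best practices and pattern language.

If you are still doing prompt engineering rather than harness engineering in 2026, you may already be a version behind. The model (Claude / GPT / Gemini) is the runtime; the harness (Claude Code / Cursor / Codex / Aider / Cline and others) is the OS. Viv Trivedy put the name harness engineering forward on Twitter, Anthropic’s engineering team systematized it on their blog, and Addy’s post turns loose engineering folklore into a methodology.

Before reading, it helps to have a working idea of the infrastructure around Claude Code: hooks, sandbox, subagents, MCP, AGENTS.md. If you have used none of it, start with the companion close reading and the Playwright MCP close reading.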

Agent Harness Engineering

A coding agent is the model plus everything you build around it. Harness engineering treats that scaffolding as a real artifact, and it tightens every time the agent slips.

Roughly: anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.

We’ve spent the last two years arguing about models. Which one is smartest, which one writes the cleanest React, which one hallucinates less. That conversation is fine as far as it goes, but it’s missing the other half of the system. The model is one input into a running agent. The rest is the harness: the prompts, tools, context policies, hooks, sandboxes, subagents, feedback loops, and recovery paths wrapped around the model so it can actually finish something.

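🟢 Translator’s note: harness is a hard word to translate. Literally it is the tack strapped onto a horse; by extension, a device that constrains, drives, and lets the thing get its work done. Chinese-language writers have rendered it as 驾驭层 (“control layer”), 脚手架 (“scaffolding”), or 工装 (“tooling rig”), none of them precise. This edition leaves harness untranslated; it is the core term that recurs throughout Addy’s trilogy.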

A decent model with a great harness beats a great model with a bad harness. I’ve watched this play out on my own work over and over. And increasingly the interesting engineering isn’t in picking the model, it’s in designing the scaffolding around it.

That discipline now has a name. Viv Trivedy coined the term harness engineering, and his “Anatomy of an Agent Harness” post is the cleanest derivation of what a harness actually is and why each piece exists. Dex Horthy has been tracking the pattern as it emerges. HumanLayer frames most agent failures as “skill issues” that come down to configuration rather than model weights. Anthropic’s engineering team has published what I think is the best public breakdown of how to design a harness for long-running work. And Birgitta Böckeler has a good overview of what this looks like from the user’s side.

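🟢 Translator’s note: “skill issue” is gamer slang meaning “the game isn’t the problem, you’re just bad at it.” HumanLayer borrows it ironically: most claims that “the model isn’t good enough” really mean “your harness isn’t configured well.” The reframe is effective.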

This post is my attempt to pull those threads together.

What Is a Harness?

Viv’s one-liner does most of the work:

Agent = Model + Harness. If you’re not the model, you’re the harness.

A harness is every piece of code, configuration, and execution logic that isn’t the model itself. A raw model is not an agent. It becomes one once a harness gives it state, tool execution, feedback loops, and enforceable constraints.

Anatomy of a harness. The model sits in the middle; the harness surrounds it and provides context injection, control flow, action, persistence, and observation.

Concretely, a harness includes:

System prompts, CLAUDE.md, AGENTS.md, skill files, and subagent prompts
Tools, skills, MCP servers, and their descriptions
Bundled infrastructure (filesystem, sandbox, browser)
Orchestration logic (subagent spawning, handoffs, model routing)
Hooks and middleware for deterministic execution (compaction, continuation, lint checks)
Observability (logs, traces, cost and latency metering)

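🟢 Translator’s note: CLAUDE.md and AGENTS.md are project-level system prompt files that live at the repository root; the first is read by Claude Code, the second by agent tools in general. MCP is Anthropic’s Model Context Protocol; see the MCP close reading. Compaction is the process of summarizing older turns to free up tokens when the context window is nearly full.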

Simon Willison reduces the loop part to its essence: an agent is a system that “runs tools in a loop to achieve a goal.” The skill is in the design of both the tools and the loop.
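To make “runs tools in a loop to achieve a goal” concrete, here is a minimal sketch in Python. It is not any real harness’s loop: the `TOOLS` registry, the `scripted_model` stand-in, and the `"finish"` action are all hypothetical names chosen for illustration, and a real model call would replace the scripted function.

```python
from typing import Callable

# Hypothetical tool registry: names and behaviours are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "upper": lambda text: text.upper(),
    "reverse": lambda text: text[::-1],
}


def scripted_model(history: list[str]) -> tuple[str, str]:
    """Stand-in for a model call: returns (action, argument)."""
    if len(history) == 1:            # only the goal is in the history so far
        return ("upper", history[0])
    return ("finish", history[-1])   # stop and hand back the last observation


def run_agent(goal: str, model=scripted_model, max_steps: int = 10) -> str:
    """Ask the model for an action, run the tool, feed the result back, repeat."""
    history = [goal]
    for _ in range(max_steps):            # hard cap so the loop always terminates
        action, arg = model(history)
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)  # the harness, not the model, executes the tool
        history.append(observation)       # the observation becomes context for the next turn
    raise RuntimeError("step budget exhausted before the model finished")
```

Both halves the paragraph names are visible here: the tool set is one design surface, and the loop (step cap, what gets appended to the history, how the loop ends) is the other.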

If that sounds like a lot of surface area, it is. And it’s your surface area, not the model provider’s. Claude Code, Cursor, Codex, Aider, Cline: these are all harnesses. The model underneath is sometimes the same, but the behaviour you experience is dominated by what the harness does.

coding agent = AI model(s) + harness

This equation, articulated by Viv and echoed by HumanLayer, is where the work actually lives. The debate over the left-hand side is loud. Most of the actual leverage sits on the right.

The “Skill Issue” Reframe

There’s a pattern I watch engineers fall into. The agent does something dumb, the engineer blames the model, and the blame gets filed under “wait for the next version.”

The harness-engineering mindset rejects that default. The failure is usually legible. The agent didn’t know about a convention, so you add it to AGENTS.md. The agent ran a destructive command, so you add a hook that blocks it. The agent got lost in a 40-step task, so you split it into a planner and an executor. The agent kept “finishing” broken code, so you wire a typecheck back-pressure signal into the loop.

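🟢 Translator’s note: back-pressure is a term familiar from reactive programming and stream processing. Here it means that the error output of a failed typecheck is pushed back into the agent’s context, forcing it to correct course. It is one of the core hook patterns.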

HumanLayer says: “it’s not a model problem. It’s a configuration problem.” Harness engineering is what happens when you take that seriously.

There’s a striking data point that shows up in both Viv’s write-up and HumanLayer’s. On Terminal Bench 2.0, Claude Opus 4.6 running inside Claude Code scores far lower than the same model running in a custom harness. Viv’s team moved a coding agent from Top 30 to Top 5 by changing only the harness. Models get post-training coupled to the harness they were trained against. Moving them into a different harness, with better tools for your codebase, a tighter prompt, and sharper back-pressure, can unlock capability the original harness was leaving on the floor.

This is the opposite of the “just wait for GPT-6” narrative.

Key line: The gap between what today’s models can do and what you see them doing is largely a harness gap.

The Ratchet: Every Mistake Becomes a Rule

The most important habit in harness engineering is treating agent mistakes as permanent signals. Not one-off stories to laugh about, not “bad runs” to retry. Signals.

If the agent ships a PR with a commented-out test and I merge it by accident, that’s an input. The next version of my AGENTS.md says “never comment out tests; delete them or fix them.” The next version of my pre-commit hook greps for .skip( and xit( in the diff. The next version of my reviewer subagent flags commented-out tests as a blocker.
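The pre-commit check described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Addy’s actual hook: the function name `find_disabled_tests` is hypothetical, and the diff text is assumed to come from something like `git diff --cached` in unified format.

```python
import re

# The two patterns named in the text: `.skip(` and `xit(`.
# The word boundary before `xit` keeps `process.exit(` from matching.
DISABLED_TEST = re.compile(r"\.skip\(|\bxit\(")


def find_disabled_tests(diff_text: str) -> list[str]:
    """Return added lines in a unified diff that disable a test."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with '+'; '+++' is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if DISABLED_TEST.search(line):
                hits.append(line[1:].strip())
    return hits
```

A hook built on this would exit non-zero and print the offending lines whenever the returned list is non-empty. Only added lines are checked, so deleting an old `xit(` never trips the rule.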

You only add constraints when you’ve seen a real failure. You only remove them when a capable model has made them redundant. Every line in a good AGENTS.md should be traceable back to a specific thing that went wrong.

This is also why harness engineering is a discipline rather than a framework. The right harness for your codebase is shaped by your failure history. You can’t download it.

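🟢 Translator’s note: this is my favourite passage. A ratchet is a mechanical device that tightens in one direction only and does not slip back. Addy uses it to say that a harness accumulates in one direction: each failure tightens it a notch, and a constraint comes out only once a stronger model has made it redundant. This is isomorphic to the traditional software engineering cycle of test case → bug reproduction → regression test, just moved up to the level of prompts, hooks, and subagent prompts.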

Deriving the Harness from Behaviour

The framing from Viv that I find most useful when I’m actually designing a harness is to start from the behaviour you want and derive the harness piece that delivers it. His pattern: behaviour we want (or want to fix) → harness design to help the model achieve this.

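Every harness component traces back to a behaviour the model can’t deliver on its own. “Work durably with real data” maps to filesystem and git; “write and execute code” maps to bash and code execution; “safe execution + defaults” maps to sandboxed environments and tools; “remember new knowledge” maps to memory files, web search, and MCP; “stay stable in long contexts” maps to compaction, tool offloading, and skills; “long-horizon tasks” maps to Ralph loops, planning, and verification.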

The useful thing about deriving it this way is that every harness component has a specific job. If you can’t name the behaviour a component exists to deliver, it probably shouldn’t be there.

The rest of this section walks the pieces in roughly the order Viv does, with the specific patterns I’ve found worth stealing.

Filesystem and Git: Durable State

The filesystem is the most foundational primitive, and it tends to be underrated because it’s boring. Models can only directly operate on what fits in context. Without a filesystem, you’re copy-pasting into a chat window, and that isn’t a workflow.

Once you have a filesystem, the agent gets a workspace to read data, code, and docs; a place to offload intermediate work instead of holding it in context; and a surface where multiple agents and humans can coordinate through shared files. Adding Git on top gives you versioning for free, so the agent can track progress, roll back errors, and branch experiments.

Most of the other harness primitives end up pointing at the filesystem for something.

Bash and Code Execution: General-Purpose Tools

The main agent loop today is a ReAct loop: the model reasons, takes an action via a tool call, observes the result, and repeats. But a harness can only execute the tools it has logic for. You can try to pre-build a tool for every possible action, or you can give the agent bash and let it build the tools it needs on the fly.

🟢 Translator’s note: ReAct = Reasoning + Acting, named in the 2022 paper by Yao et al. It is the basic loop pattern behind nearly every coding agent today.

Willison’s take on this is that agents already excel at shell commands; most tasks collapse to a few well-chosen CLI invocations. Harnesses still ship focused tools, but bash plus code execution has become the default general-purpose strategy for autonomous problem solving. It’s the difference between teaching someone to use a single kitchen gadget and handing them a kitchen.

Sandboxes and Default Tools

Bash is only useful if it runs somewhere safe. Running agent-generated code on your laptop is risky, and a single local environment doesn’t scale to many parallel agents.

Sandboxes give agents an isolated operating environment. Instead of executing locally, the harness connects to a sandbox to run code, inspect files, install dependencies, and verify work. You can allow-list commands, enforce network isolation, spin up new environments on demand, and tear them down when the task is done.
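Of the controls listed above, command allow-listing is the easiest to sketch. The Python below is a deliberately conservative illustration: the `ALLOWED` set and the `is_allowed` name are hypothetical, and the check is a gate in front of a sandbox, not a replacement for real isolation.

```python
import shlex

# Hypothetical allow-list; a real one would come from harness configuration.
ALLOWED = {"ls", "cat", "git", "pytest", "npm"}

# Shell metacharacters that could chain or redirect into another program.
SHELL_META = set(";|&><`$")


def is_allowed(command: str) -> bool:
    """Allow a command only if its program is listed and it uses no shell operators."""
    if any(ch in SHELL_META for ch in command):
        return False              # conservative: also rejects quoted metacharacters
    try:
        argv = shlex.split(command)
    except ValueError:            # e.g. unbalanced quotes
        return False
    return bool(argv) and argv[0] in ALLOWED
```

Rejecting every metacharacter is blunt, and it will refuse some harmless commands. The trade-off is intentional in this sketch: only the first program name is checked, so anything that could smuggle in a second program has to be refused outright.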

A good sandbox ships with good defaults: pre-installed language runtimes and packages, Git and test CLIs, a headless browser for web interaction. Browsers, logs, screenshots, and test runners are what let the agent observe its own work and close the self-verification loop.

The model doesn’t configure its execution environment. Deciding where the agent runs, what’s available, and how it verifies its output are all harness-level calls.

Memory and Search: Continual Learning

Models have no additional knowledge beyond their weights and what’s currently in context. Without the ability to edit weights, the only way to add knowledge is through context injection.

The filesystem is again the primitive. Harnesses support memory file standards like AGENTS.md that get injected on every start. As the agent edits that file, the harness reloads it, and knowledge from one session carries into the next. This is a crude but effective form of continual learning.
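The inject-on-start, edit, reload cycle can be sketched with two small functions. Everything here is an assumption made for illustration: the function names, the “# Project memory” heading, and the one-bullet-per-lesson format are not part of any AGENTS.md standard.

```python
from pathlib import Path


def build_system_prompt(base_prompt: str, memory_path: Path) -> str:
    """Inject the memory file into the prompt; called at every session start."""
    memory = memory_path.read_text(encoding="utf-8") if memory_path.exists() else ""
    return f"{base_prompt}\n\n# Project memory\n{memory}".rstrip()


def record_lesson(memory_path: Path, lesson: str) -> None:
    """Append a lesson as a bullet, skipping exact duplicates."""
    existing = memory_path.read_text(encoding="utf-8") if memory_path.exists() else ""
    if f"- {lesson}" in existing.splitlines():
        return
    with memory_path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {lesson}\n")
```

Because `build_system_prompt` rereads the file each time, a lesson recorded in one session is present in the next without any other state being kept.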

For knowledge that didn’t exist at training time (new library versions, current docs, today’s data) web search and MCP tools like Context7 bridge the cutoff. These are useful primitives to bake into the harness rather than leaving to the user.

🟢 Translator’s note: Context7 is a popular MCP server that feeds up-to-date npm / library documentation to the agent in real time, working around the model’s training cutoff.

Fighting Context Rot

Context rot is the observation that models get worse at reasoning and completing tasks as the context window fills up. Context is scarce, and harnesses are largely delivery mechanisms for good context engineering.

🟢 Translator’s note: context rot is meant literally: the context decays. Anthropic’s engineering team brings it up repeatedly. The point is that even with a 1M-token window, a model with a full window does worse than it would on the same problem at 50k tokens. So the harness has to manage context actively rather than wait until the window is full.

Three techniques show up repeatedly:

Compaction. When the window gets close to full, something has to give. Letting the API error is not an option for a production harness, so the harness intelligently summarizes and offloads older context so the agent can keep working.

Tool-call offloading. Large tool outputs (think 2,000-line log files) clutter context without adding much signal. The harness keeps the head and tail tokens above a threshold and offloads the full output to the filesystem, where the agent can read it on demand.

Skills with progressive disclosure. Loading every tool and MCP into context at startup degrades performance before the agent takes a single action. Skills let the harness reveal instructions and tools only when the task actually calls for them.

🟢 Translator’s note: Anthropic introduced the Skill (SKILL.md) specification in early 2026; see the SKILL.md close reading. Its core idea is on-demand loading rather than stuffing everything in at startup.
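Of the three techniques, tool-call offloading is the most mechanical, so it makes a good sketch. The version below counts lines rather than tokens to stay dependency-free; that substitution, the `offload_output` name, and the default thresholds are all assumptions made for illustration.

```python
from pathlib import Path


def offload_output(output: str, store_dir: Path, name: str,
                   max_lines: int = 40, keep: int = 10) -> str:
    """Keep head and tail of an oversized tool output; write the rest to disk."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output                          # small outputs pass through untouched
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / f"{name}.log"
    path.write_text(output, encoding="utf-8")  # full output stays readable on demand
    omitted = len(lines) - 2 * keep
    marker = f"... [{omitted} lines omitted; full output at {path}] ..."
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])
```

The marker line carries the file path on purpose: it tells the agent where to look if the head and tail turn out not to be enough.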

Anthropic’s harness post adds one more technique for the really long jobs: full context resets, where the harness tears the session down and rebuilds it from a compact hand-off file. They’re explicit that compaction alone wasn’t sufficient for long tasks; sometimes you need to start fresh with a structured brief. This is closer to how humans onboard a new engineer than to how we usually think about “memory.”

Long-Horizon Execution: Ralph Loops, Planning, Verification

Autonomous long-horizon work is the holy grail and the hardest thing to get right. Today’s models suffer from early stopping, poor decomposition of complex problems, and incoherence as work stretches across multiple context windows. The harness has to design around all of that.

I’ve written about autonomous coding loops like the Ralph Loop before in self-improving agents and in my 2026 trends piece, but it’s worth restating in this framing: a hook intercepts the model’s attempt to exit and re-injects the original prompt into a fresh context window, forcing the agent to continue against a completion goal. Each iteration starts clean but reads state from the previous one through the filesystem. It’s a surprisingly simple trick for turning a single-session agent into a multi-session one, and it’s the kind of primitive you’d never derive from “just use a smarter model.”

🟢 Translator’s note: the Ralph Loop (the Ralph Wiggum technique) is named after the kid in The Simpsons who keeps earnestly repeating the same line. Geoffrey Huntley popularized the name. Literally: the agent says it is done, I hand it the same prompt again, and repeat until it really is done. Simple to the point of ugly, effective to the point of scary. The next piece, “Long-running Agents,” covers it in detail.
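The shape of a Ralph Loop can be sketched as follows. Note the substitution: the post describes a hook that intercepts the model’s exit, whereas this sketch uses a plain outer loop, which is the simplest way to show the same idea. `run_session` and `is_done` are hypothetical callables the caller supplies; neither is a real SDK function.

```python
from pathlib import Path
from typing import Callable


def ralph_loop(prompt: str,
               run_session: Callable[[str, Path], None],
               is_done: Callable[[Path], bool],
               workdir: Path,
               max_iterations: int = 20) -> int:
    """Re-run a fresh session with the same prompt until the completion check passes.

    Returns the number of sessions that were needed.
    """
    for iteration in range(1, max_iterations + 1):
        # Each session starts with a clean context; state carries over only
        # through files that earlier sessions left in workdir.
        run_session(prompt, workdir)
        if is_done(workdir):      # completion is judged by the harness, not the agent
            return iteration
    raise RuntimeError("completion goal not reached within the iteration budget")
```

The completion check lives outside the agent, which is what stops an early “I’m done” from ending the work, and the iteration cap keeps a goal that is never reached from looping forever.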

Planning is when the model decomposes a goal into a sequence of steps, usually into a plan file on disk. The harness supports this with prompting and reminders about how to use the plan file. After each step, the agent checks its work via self-verification: hooks run a pre-defined test suite and loop failures back to the model with the error text, or the model reviews its own output against explicit criteria.

Planner / generator / evaluator splits. Anthropic’s long-running harness work is explicit that separating generation from evaluation into distinct agents outperforms self-evaluation, because agents reliably skew positive when grading their own work. It’s GANs for prose. The related pattern is the sprint contract, where the generator and evaluator negotiate what “done” actually means before code gets written. In my own workflows, writing down the done-condition before starting has caught more scope drift than any prompt change I’ve ever made.

🟢 Translator’s note: GAN (Generative Adversarial Network) is Goodfellow’s 2014 work, in which a generator and a discriminator are trained against each other. Addy carries the analogy over to prose-level agent orchestration: a generator agent versus an evaluator agent.
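A stripped-down sketch of the generator/evaluator split with a done-condition written up front might look like this. It simplifies heavily: the evaluator is a list of deterministic checks rather than a second agent, and `Criterion`, `evaluate`, and `generate_until_accepted` are hypothetical names, not Anthropic’s design.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One line of the 'sprint contract': a named, checkable done-condition."""
    name: str
    check: Callable[[str], bool]


def evaluate(artifact: str, contract: list[Criterion]) -> list[str]:
    """Evaluator role: return the names of the criteria the artifact fails."""
    return [c.name for c in contract if not c.check(artifact)]


def generate_until_accepted(generate: Callable[[list[str]], str],
                            contract: list[Criterion],
                            max_rounds: int = 5) -> str:
    """Alternate generation and evaluation until every criterion passes."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        artifact = generate(feedback)            # generator sees only the failed criteria
        feedback = evaluate(artifact, contract)  # it never grades its own work
        if not feedback:
            return artifact
    raise RuntimeError(f"contract not met; still failing: {feedback}")
```

The contract is fixed before the first call to `generate`. Writing “done” down first is the part of the pattern the post credits with catching scope drift.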

Hooks: The Enforcement Layer

Hooks are what separate “I told the agent to do X” from “the system enforces X.”

A hook is a script that runs at a specific lifecycle point: before a tool call, after a file edit, before commit, on session start. They’re the right place for things the agent should never forget but often does. Run typecheck and lint and tests after every edit and surface failures. Block destructive bash (rm -rf, git push --force, DROP TABLE). Require approval before opening a PR or pushing to main. Auto-format on write so the agent doesn’t waste tokens on whitespace.
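As an illustration of the “block destructive bash” case, here is a minimal check a hook could run before a tool call. The patterns cover only the three examples named in the text; the `check_command` name and the message format are assumptions, and how a given harness passes the command to a hook is not shown.

```python
import re
from typing import Optional

# Deny-list built from the three examples in the text; a real list would be longer.
DESTRUCTIVE_PATTERNS = [
    # rm with both -r and -f among its short flags, in either order
    (re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+"), "recursive force delete"),
    # conservative: also catches --force-with-lease
    (re.compile(r"\bgit\s+push\b.*--force"), "force push"),
    (re.compile(r"\bdrop\s+table\b", re.IGNORECASE), "DROP TABLE"),
]


def check_command(command: str) -> Optional[str]:
    """Return a human-readable reason if the command is blocked, else None."""
    for pattern, reason in DESTRUCTIVE_PATTERNS:
        if pattern.search(command):
            return f"blocked: {reason} ({command!r})"
    return None
```

A regex deny-list is easy to evade, so it works best as one layer alongside sandboxing and approval gates. What it does give you is enforcement without relying on the agent remembering the rule.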

The principle HumanLayer highlights and I’ve come to agree with is:

Key line: success is silent, failures are verbose.

If typecheck passes, the agent hears nothing. If it fails, the error text gets injected into the loop and the agent self-corrects. That makes the feedback loop almost free in the common case and directly actionable when something goes wrong.
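“Success is silent, failures are verbose” maps directly onto a return type: nothing on success, the error text on failure. The sketch below assumes the check is any command-line tool that signals failure with a non-zero exit code; `run_check` is a hypothetical helper, not part of any harness SDK.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_check(argv: Sequence[str], timeout: float = 120.0) -> Optional[str]:
    """Run a check command; return None if it passes, its output if it fails."""
    proc = subprocess.run(list(argv), capture_output=True, text=True, timeout=timeout)
    if proc.returncode == 0:
        return None                                # success adds nothing to the context
    details = (proc.stdout + proc.stderr).strip()  # failure: hand back everything it printed
    return f"`{' '.join(argv)}` failed (exit {proc.returncode}):\n{details}"


if __name__ == "__main__":
    # Demo with the Python interpreter itself standing in for a type checker.
    message = run_check([sys.executable, "-c", "import sys; sys.exit('type error: x')"])
    if message is not None:
        print(message)  # in a harness, this text would be injected into the agent loop
```

The harness appends the returned string to the conversation only when it is not `None`, so a passing check costs no tokens at all.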

`AGENTS.md` and Tool Choice

The flat markdown rulebook at the root of your repo is still the single highest-leverage configuration point, because it lands in the system prompt every turn. Conventions go here: package manager, test framework, formatting, “never touch /legacy,” “always use our logger.” Two hard-won lessons:

Keep it short. HumanLayer keeps theirs under 60 lines. Every line is competing for attention, and more rules make each rule matter less.

Key line: Pilot’s checklist, not style guide.

Earn each line. Rules should trace to a specific past failure or a hard external constraint. If they don’t, they’re noise. Ratchet; don’t brainstorm.

Same discipline applies to tools. Each tool’s name, description, and schema gets stamped into the prompt every request. Ten focused tools outperform fifty overlapping ones because the model can hold the menu in its head. HumanLayer also flags a real security concern here: tool descriptions populate the prompt, so any MCP server you install is trusted text the model will read. A sloppy or malicious MCP can prompt-inject your agent before you’ve typed anything.

🟢 Translator’s note: this is a security risk that began to surface in the MCP ecosystem in 2026. An ordinary user who casually installs a third-party MCP is effectively inserting its description strings into their own system prompt on every turn. See the MCP close reading.

What It Looks Like in Practice

The clearest public picture I’ve seen of a mature harness is Fareed Khan’s (estimated) breakdown of Claude Code’s architecture, and it’s worth sitting with the diagram for a minute.

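Claude Code architecture, labelled by layer: input layer (user interface, session manager, permission gate); knowledge layer (skill registry, context compressor, task graph, memory store); integration layer (MCP runtime and external servers); execution layer (tool dispatch, streaming runtime, prompt cache); output layer (returns verified task results); observability layer (event bus, background executor); multi-agent layer (subagent spawning, teammate mailboxes, FSM protocol, autonomous board, worktree isolator). The master agent loop sits in the middle, and arrows from every layer point to it.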

Almost every concept from the previous section shows up on this diagram as a named component. Context injection is the knowledge layer. Loop state lives in the memory store and the worktree isolator. Destructive-action hooks sit behind the permission gate. Subagent context firewalls are the entire multi-agent layer. The tool dispatch registry is where MCP servers and bash both plug in. Khan’s argument is the same as Viv’s, just worked through a shipping product: Claude Code’s trajectory is about the harness at least as much as about the model underneath it.

Harnesses Don’t Shrink, They Move

One of the better observations in the Anthropic write-up is that as models improve, the space of interesting harness combinations doesn’t shrink. It moves.

The naive story is that better models make harnesses obsolete. If the model can plan, no planner. If the model is coherent at long horizons, no context resets. And yes, Opus 4.6 largely killed the context-anxiety failure mode (Sonnet 4.5 used to wrap up work prematurely as it approached what it thought was its context limit), which means a whole class of anxiety-mitigation scaffolding I was writing six months ago is now dead code.

But the ceiling moved with the model. Tasks that were unreachable are in play, and they have their own failure modes. The anxiety scaffolding goes away, and in its place you need a multi-day memory policy, or a harness that coordinates three specialized agents, or evaluators for design quality in generated UIs. The assumptions shift, and so does the scaffolding that encodes them.

Anthropic puts it cleanly:

Key line: every component in a harness encodes an assumption about what the model can’t do on its own.

When the model gets better at something, that component becomes load-bearing for nothing and should come out. When the model unlocks something new, new scaffolding is needed to reach the new ceiling.

The Model-Harness Training Loop

The other thing that’s happening, which Viv names explicitly, is a feedback loop between harness design and model training.

The model-harness training loop. A useful primitive is discovered in a harness, gets standardized into the product, is used in training the next generation of models, and the next generation gets better at using that primitive. The cycle repeats.

Today’s agent products are post-trained with harnesses in the loop. The model gets specifically better at the actions the harness designers think it should be good at: filesystem operations, bash, planning, subagent dispatch. That’s why Opus 4.6 feels different inside Claude Code than inside someone else’s harness, and it’s why changing a tool’s logic sometimes causes strange regressions. A genuinely general model wouldn’t care whether you used apply_patch or str_replace, but co-training creates overfitting.

The practical implication is twofold. A harness is a living system, not a config file you set up once. And the “best” harness isn’t necessarily the one the model was trained inside; it’s the one designed for your task. Viv’s Top 30 to Top 5 Terminal Bench jump is the clearest proof point I’ve seen.

Harness-as-a-Service

Viv’s other contribution is the HaaS framing: Harness-as-a-Service. The observation is that we’re moving from building on LLM APIs (which give you a completion) to building on harness APIs (which give you a runtime). The Claude Agent SDK, the Codex SDK, and the OpenAI Agents SDK all point in the same direction. You get the loop, the tools, the context management, the hooks, and the sandbox primitives out of the box, and you customize them.

The shift matters because the default path used to be: build your own loop, wire up your own tool-calling, handle your own conversation state, invent your own approval flow. Now the default path is: pick a harness framework, configure it along the four pillars (system prompt, tools, context, subagents), and put the rest of your effort into domain-specific prompt and tool design.

That’s what makes “skill issue” tractable. You’re not rebuilding an agent from scratch every time something goes wrong. You’re tuning a configuration surface that’s already well-factored.

Viv’s line on this is also the best argument for starting messy:

Key line: good agent building is an exercise in iteration. You can’t do iterations if you don’t have a v0.1.

Where This Is Going

Look at the top coding agents side by side (Claude Code, Cursor, Codex, Aider, Cline) and they look more like each other than their underlying models do. The models are different. The harness patterns are converging. I don’t think that’s an accident. It’s the industry slowly finding the load-bearing pieces of scaffolding that turn a generative model into something that can ship.

Viv’s framing of the open problems is the one I find most exciting:

orchestrating many agents working in parallel on a shared codebase;
agents that analyze their own traces to identify and fix harness-level failure modes;
harnesses that dynamically assemble the right tools and context just-in-time for a given task instead of being pre-configured at startup.

That last one, in particular, feels like where harnesses stop being static config and start becoming something closer to a compiler.

🟢 Translator’s note: comparing a harness to a compiler is a dense analogy. A compiler translates a high-level language into instructions a machine can execute; a dynamic harness translates “what I want done” into “the toolset and prompt-injection sequence that work best for this agent in this context.” LangGraph, Letta, CrewAI and others are all experimenting in this direction, but none is anywhere near compiler-grade maturity.

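Translator’s Summary

Harness engineering is a new engineering category in 2026: like test frameworks and IDEs, it has its own best practices, its own papers, and its own benchmarks. Stop arguing about models and fix the harness first.
The ratchet is the core ritual: every agent mistake becomes a permanent rule (an AGENTS.md line, a hook, a subagent prompt). A good AGENTS.md traces back to specific failures; it is not a “best practices” list copied from ChatGPT.
success is silent, failures are verbose: the design philosophy behind hooks. A passing typecheck says nothing; a failure pushes its error text back into the loop. The pattern generalizes to any agent feedback design.
HaaS is the new default path: don’t write the loop from scratch. Use the Claude Agent SDK, the Codex SDK, or the OpenAI Agents SDK and do your domain customization on their four pillars. This advice directly challenges the assemble-it-yourself approach of the LangChain / LangGraph generation.
Top 30 → Top 5: Viv’s team moved a coding agent from Top 30 to Top 5 on Terminal Bench by changing only the harness, and on Terminal Bench 2.0 the same Claude Opus 4.6 scores far lower inside Claude Code than in a custom harness. These are the strongest “harness > model” data points the post cites. If your team is still chasing the newest model, the model may not be your bottleneck.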

🔗 Sources

Original: https://addyosmani.com/blog/agent-harness-engineering/
Companion close reading: the Addy Osmani trilogy
Related reading: Anthropic, “Harness design for long-running application development”
Related reading: Viv Trivedy, “Anatomy of an Agent Harness”
Related reading: HumanLayer, “Skill issue: harness engineering for coding agents”
Related reading: Birgitta Böckeler on martinfowler.com
