Captain's Log #3

Can a coding agent figure out your product?

AVP is brand new, so no AI agent has heard of it. We tried five ways to deliver it to a coding agent, scored each on the same six tasks, and let AVP grade itself. The fanciest channels lost to a plain doc dropped in the context window.

Every team shipping for AI agents asks the same thing: when an agent meets our product for the first time, how do we get it productive fast? Write a Skill? Stand up an MCP server? Drop an llms.txt? Just tell it to read --help? Everyone has an opinion; nobody has a number.

So we measured it. AVP (the Agent Voyager Project) is an open standard for running agents and recording what they do, and it postdates every current model's training, so it is a clean test subject: the agent genuinely does not know it. We gave Goose and Claude Code (both on cheap Haiku) six questions about operating AVP — which CLI command runs an eval, which scorer grades table fidelity, what the ParseBench dataset id is — and varied only one thing: the channel the knowledge arrived through. Each channel is one AVP commission; the answers are graded by the avp CLI's own scorer.

The leaderboard

Accuracy over six questions, with mean cost and turns per question. The first three rows are the same channel — a doc in the system prompt — in three different formats.

| Asset | Channel | Goose | Claude Code |
| --- | --- | --- | --- |
| llms.txt | Doc in the system prompt | 100% · $0.0024 · 1.0 turns | 100% · $0.0021 · 1.0 turns |
| AGENTS.md | Doc in the system prompt | 100% · $0.0014 · 1.0 turns | 100% · $0.0041 · 1.0 turns |
| raw docs dump | Doc in the system prompt | 83% · $0.0016 · 1.0 turns | 83% · $0.0053 · 1.0 turns |
| explore the CLI | Run the CLI | 100% · $0.17 · 9.7 turns | 100% · $0.02 · 7.3 turns |
| MCP knowledge server | Tool-fetched retrieval | 50% · $0.52 · 19.2 turns | 50% · $0.03 · 7.7 turns |
| Agent Skill | Inline skill | 33% · $0.65 · 22.8 turns | 17% · $0.08 · 14.4 turns |
| nothing | Control | 0% · $0.0009 · 1.0 turns | 0% · $0.0031 · 1.0 turns |

llms.txt: Perfect, one turn, a fifth of a cent. The answer is already in front of the model.
AGENTS.md: Identical channel to llms.txt, different text, and identical result. The format doesn't matter.
raw docs dump: Same channel again, but our excerpt never spelled out the acronym, so it missed exactly that one question. Completeness is the whole game.
explore the CLI: Also 100% (the CLI is legible enough to self-teach), but 7-10 turns of poking, 10-80x the cost of a doc.
MCP knowledge server: Same facts as the docs, fetched tool-by-tool: half the accuracy, up to $1.22 and 33 turns on a single question.
Agent Skill: Loses on completeness: the skill covers the wire format, not the CLI, so both agents flail on CLI questions (40+ turns) and still miss.
nothing: Zero, as it should be. AVP postdates the models' training.

A doc in the context window wins — and the format barely matters

The top of the board is one channel: put the docs in the system prompt. We tried it three ways — an llms.txt, an AGENTS.md, and a raw dump of README/spec excerpts. The first two went six-for-six on both agents, in a single turn, for about a fifth of a cent; llms.txt versus AGENTS.md made no measurable difference. The reason is boring and important: the answer was already in front of the model. No tool calls, no fetching, no deciding what to look up.
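As a rough illustration of what "put the docs in the system prompt" means in practice, here is a minimal Python sketch. The function names, the `<doc>` wrapper, and the base instruction text are assumptions made for this example; they are not taken from the AVP repo or from how Goose or Claude Code assemble their prompts.

```python
from pathlib import Path


def load_doc(path: str) -> str:
    """Read an onboarding doc (for example llms.txt or AGENTS.md) from disk."""
    return Path(path).read_text(encoding="utf-8")


def build_system_prompt(doc_text: str,
                        base_instructions: str = "You are a coding agent.") -> str:
    """Place the whole doc in the system prompt so no tool call is needed.

    The <doc> delimiters are a hypothetical convention for this sketch.
    """
    return (
        f"{base_instructions}\n\n"
        "Product documentation (answer from this before exploring):\n"
        "<doc>\n"
        f"{doc_text.strip()}\n"
        "</doc>"
    )
```

With this shape, swapping llms.txt for AGENTS.md only changes the string passed in, which matches the observation that the file format made no measurable difference.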

What did matter was completeness. The raw docs dump — same channel, same one-turn cost — scored 83%, and the per-question view below shows exactly why: it missed one question, the AVP acronym, because our excerpt never spelled it out. Not the format, not the channel. A single hole in the doc, and the agent falls straight through it.
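One way to catch a hole like the missing acronym before the agent does is a simple coverage check over the doc. The sketch below is an assumption about how such a check could look, not something the eval itself runs; the fact names and the helper `missing_facts` are hypothetical.

```python
def missing_facts(doc_text: str, required_facts: dict[str, list[str]]) -> list[str]:
    """Return the names of facts for which no accepted phrasing appears in the doc.

    Matching is a case-insensitive substring test, which is deliberately crude:
    it flags obvious gaps but cannot tell whether the doc explains a fact well.
    """
    haystack = doc_text.casefold()
    return [
        name
        for name, phrasings in required_facts.items()
        if not any(p.casefold() in haystack for p in phrasings)
    ]


# Hypothetical excerpt that, like the raw docs dump, never expands the acronym.
excerpt = "AVP records what agents do. To evaluate, run `avp eval run`."
required = {
    "acronym": ["Agent Voyager Project"],
    "run an eval": ["avp eval run"],
}
print(missing_facts(excerpt, required))
```

Running a check like this against each onboarding artifact, with one entry per question you expect an agent to answer, would surface the gap as a list entry instead of a failed run.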

Where each channel actually breaks

The top-line accuracy hides the interesting part. Here is which of the six questions each strategy got right, across both agents:

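| strategy \ question | AVP acronym | ParseBench scorer | ParseBench dataset | run an eval | validate a commission | exact-match scorer |
| --- | --- | --- | --- | --- | --- | --- |
| llms.txt | ✓✓ | ✓✓ | ✓✓ | ✓✓ | ✓✓ | ✓✓ |
| AGENTS.md | ✓✓ | ✓✓ | ✓✓ | ✓✓ | ✓✓ | ✓✓ |
| explore the CLI | ✓✓ | ✓✓ | ✓✓ | ✓✓ | ✓✓ | ✓✓ |
| raw docs dump | ·· | ✓✓ | ✓✓ | ✓✓ | ✓✓ | ✓✓ |
| MCP server | ✓✓ | ✓· | ✓· | ·· | ·· | ✓✓ |
| Agent Skill | ✓✓ | ·· | ·· | ·· | ·· | ✓· |
| nothing | ·· | ·· | ·· | ·· | ·· | ·· |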

Each cell: how many of the two agents (Goose, Claude Code) got that question right. ✓✓ both · ✓· one · ·· neither.

You can read the failure modes straight off it. The docs dump misses one cell, the acronym. The Skill knows the acronym (it's in the skill) but is blank almost everywhere else, with a single hit from one agent on the exact-match scorer, because the skill doesn't cover the CLI. The MCP server is a patchwork, and which cells it hits differs between the two agents, because it depends on each model driving the search tools well. The clean sweeps are the in-context docs and, more surprisingly, letting the agent explore the CLI.
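The matrix and the leaderboard are two views of the same data, and the arithmetic linking them is short. The sketch below transcribes each cell as the number of agents (out of two) that answered correctly and recomputes the combined accuracy per strategy; the variable and function names are ours, chosen for illustration.

```python
# Per-question hits copied from the matrix: 2 = both agents, 1 = one, 0 = neither.
# Column order: acronym, ParseBench scorer, ParseBench dataset,
#               run an eval, validate a commission, exact-match scorer.
MATRIX = {
    "llms.txt":        [2, 2, 2, 2, 2, 2],
    "AGENTS.md":       [2, 2, 2, 2, 2, 2],
    "explore the CLI": [2, 2, 2, 2, 2, 2],
    "raw docs dump":   [0, 2, 2, 2, 2, 2],
    "MCP server":      [2, 1, 1, 0, 0, 2],
    "Agent Skill":     [2, 0, 0, 0, 0, 1],
    "nothing":         [0, 0, 0, 0, 0, 0],
}
AGENTS = 2


def combined_accuracy(cells: list[int]) -> float:
    """Fraction of (agent, question) pairs answered correctly."""
    return sum(cells) / (AGENTS * len(cells))


for strategy, cells in MATRIX.items():
    print(f"{strategy:16s} {combined_accuracy(cells):.0%}")
```

The raw docs dump comes out at 83% and the MCP server at 50%, matching the leaderboard. The Agent Skill comes out at 25% combined, which is the average of the 33% and 17% reported per agent.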

Letting the agent explore your CLI also works — if your CLI is legible

We gave one group no docs at all, just “the avp CLI is installed, go run it and figure it out.” It also went six-for-six. The agents ran avp --help, scaffolded an example, read the output, and found every answer — a genuinely good sign about the product: the CLI teaches itself. But it cost 7 to 10 turns of poking around, 10 to 80 times the price of just handing over the doc. Self-service onboarding is real, but the agent pays for your missing documentation in tokens and latency, every run.
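The cost multiple can be checked directly from the leaderboard. This sketch compares the "explore the CLI" row with the llms.txt row for each agent; which doc row to use as the baseline is our choice for the example, and the exact multiple shifts if you compare against AGENTS.md or the raw dump instead.

```python
# Mean cost per question in USD, copied from the leaderboard.
COST = {
    "Goose":       {"llms.txt": 0.0024, "explore the CLI": 0.17},
    "Claude Code": {"llms.txt": 0.0021, "explore the CLI": 0.02},
}


def cost_ratio(agent: str) -> float:
    """How many times more CLI exploration cost than the in-context doc."""
    row = COST[agent]
    return row["explore the CLI"] / row["llms.txt"]


for agent in COST:
    print(f"{agent}: about {cost_ratio(agent):.0f}x the cost of the doc")
```

Against llms.txt this gives roughly 10x for Claude Code and roughly 70x for Goose, and that overhead is paid again on every run that has to rediscover the CLI.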

The fancy channels underperformed — and AVP recorded every wasted turn

The MCP knowledge server had the exact same facts as the winning docs, exposed as search-and-read tools. It scored 50%. Forced to fetch knowledge tool-by-tool, the agents often queried the wrong doc, gave up, or ground on for dozens of turns. This is one such run, on the same question the docs answered in a single turn for about a fifth of a cent:

What AVP recorded for one run: Goose · MCP knowledge server · “ParseBench dataset” · ended in a wrong answer

101 events · 33 model turns · 32 tool calls · $1.22 spent

| Turn | Cost so far |
| --- | --- |
| 1 | $0.0002 |
| 5 | $0.08 |
| 10 | $0.19 |
| 20 | $0.42 |
| 33 | $1.22 |

The Agent Skill loses the same way the docs dump did, only worse: on completeness. The skill we attached covers the AVP wire format but not the CLI, so when asked a CLI question the agent believes the answer is findable and flails — 40-plus turns and over a dollar on a single question. A partial onboarding artifact is the worst outcome of all: the control, handed nothing and no tools, failed instantly for a tenth of a cent; the half-informed agent burned a thousand times that, chasing an answer that was never in the doc. The lesson isn't “skip skills” — it's that an onboarding artifact is only as good as its coverage, and a hole in it is expensive.

The agent tried to cheat

Our first draft asked wire-format trivia: “what is the source on every AVP event?” The cold control — no docs, no knowledge — started passing it. The trajectory showed why: the agent ran cat on its own log file, which AVP writes as it goes, and every event in it carries source: avp://agent. It answered the quiz by reading its own homework. We rewrote every question to ask about the CLI layer, which never appears in an agent's own runtime artifacts. Worth remembering if you ever benchmark an agent on a system it is actively running.
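A pre-flight check along these lines could flag that kind of leak before any agent runs: scan everything the agent can read in its working directory for the expected answers. This is a sketch under our own assumptions (the function name, the question ids, and the plain substring match are all hypothetical) and is not part of the avp CLI.

```python
from pathlib import Path


def find_leaks(workdir: str, expected_answers: dict[str, str]) -> dict[str, list[str]]:
    """Map each question id to the files under workdir that already contain its answer."""
    leaks: dict[str, list[str]] = {}
    for path in sorted(Path(workdir).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than fail the pre-flight check
        for question_id, answer in expected_answers.items():
            if answer in text:
                leaks.setdefault(question_id, []).append(str(path))
    return leaks
```

A static scan only covers files that exist before the run starts. A log written while the agent works, as in the case above, would need the same check repeated on the working directory after a dry run with the control.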

Advice for businesses

If you are building for agents (and you should be): write a complete llms.txt or AGENTS.md, keep your CLI legible, and don't make the agent dig.

Curious how we ran it? The whole eval is open in the AVP repo: avp-cli/examples/onboarding. Clone it and run avp eval run onboarding.eval.json yourself.