Same model, same PDFs, six commissions.

Read the specAVP v0.1 — wire format, envelopes, conformance

Task: Parse a PDF page into semantic HTML, tables reconstructed cell-by-cell.
Configurations: step-by-step · baseline · with-anthropic-pdf-skill · with-pdfplumber-mcp · terse-prompt · few-shot
Dataset: ParseBench (LlamaIndex). Ten table-category pages, identical across runs.
Model: claude-haiku-4-5-20251001, identical across all six configurations.
Runs: 10 pages per configuration. 60 runs total. Reported numbers are means.
Scoring: Structural HTML comparison against the reference: column headers, row count, cell content, and merged-cell topology.
Snapshot: Run on 2026-05-13. Eval version parsebench-table·20260513-014902. SDK: claude-agent-sdk.

`step-by-step`

accuracy: 95%
cost / run: $0.33
steps: 2.5

The only configuration that asked the agent to check its work before submitting. Won on accuracy and cost less than baseline, because the agent didn't have to retry.

`baseline`

accuracy: 82%
cost / run: $0.35
steps: 2.7

Structured prompt, no self-check. The control group.

`with-anthropic-pdf-skill`

accuracy: 81%
cost / run: $0.31
steps: 2.8

Baseline plus a bundled PDF skill in context. Same accuracy as baseline — the skill didn't move the needle here.

`with-pdfplumber-mcp`

accuracy: 70%
cost / run: $0.20
steps: 2.2

Terse prompt plus a specialized MCP server for table extraction. Cheapest run, but accuracy dropped — the model reading the PDF natively beat the specialized tool.

`terse-prompt`

accuracy: 68%
cost / run: $0.59
steps: 3.4

One sentence of guidance. The agent burned nearly twice the baseline cost figuring out a workflow on its own.

`few-shot`

accuracy: 67%
cost / run: $0.82
steps: 3.4

Given a worked example to pattern-match against. The example anchored the agent on a structure that didn't fit every page. Worst accuracy, highest cost.