Same model, same PDFs, six commissions.

Read the spec: AVP v0.1 (wire format, envelopes, conformance)
Task
Parse a PDF page into semantic HTML, tables reconstructed cell-by-cell.
Configurations
step-by-step · baseline · with-anthropic-pdf-skill · with-pdfplumber-mcp · terse-prompt · few-shot
Dataset
ParseBench (LlamaIndex). Ten table-category pages, identical across runs.
Model
claude-haiku-4-5-20251001, identical across all six configurations.
Runs
10 pages per configuration. 60 runs total. Reported numbers are per-configuration means over the 10 pages.
Scoring
Structural HTML comparison against the reference: column headers, row count, cell content, and merged-cell topology.
Snapshot
Run on 2026-05-13. Eval version parsebench-table·20260513-014902. SDK: claude-agent-sdk.
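
The Scoring entry above names four things that get compared: column headers, row count, cell content, and merged-cell topology. The sketch below shows one way such a structural comparison could look. It is not the ParseBench scorer: the equal weighting of the four parts, the exact-match rules, and the names TableParser, parse_table, and score are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every table row as a list of (tag, text, rowspan, colspan) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [tag, text parts, rowspan, colspan] while inside a cell

    def _flush(self):
        # Close the open cell, if any, and normalise its whitespace.
        if self._cell is not None and self.rows:
            tag, parts, rowspan, colspan = self._cell
            text = " ".join("".join(parts).split())
            self.rows[-1].append((tag, text, rowspan, colspan))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush()
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._flush()
            a = dict(attrs)
            self._cell = [tag, [], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush()


def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return [row for row in parser.rows if row]


def score(candidate_html, reference_html):
    """Hypothetical scorer: four structural checks, equally weighted (an assumption)."""
    cand, ref = parse_table(candidate_html), parse_table(reference_html)

    def headers(rows):
        return [text for row in rows for tag, text, _, _ in row if tag == "th"]

    def merges(rows):
        # Position and extent of every cell that spans more than one row or column.
        return [(r, c, rs, cs) for r, row in enumerate(rows)
                for c, (_, _, rs, cs) in enumerate(row) if rs > 1 or cs > 1]

    ref_cells = [(r, c, text) for r, row in enumerate(ref)
                 for c, (_, text, _, _) in enumerate(row)]
    cand_cells = {(r, c): text for r, row in enumerate(cand)
                  for c, (_, text, _, _) in enumerate(row)}
    matched = sum(1 for r, c, text in ref_cells if cand_cells.get((r, c)) == text)

    parts = {
        "headers": float(headers(cand) == headers(ref)),
        "row_count": float(len(cand) == len(ref)),
        "cells": matched / len(ref_cells) if ref_cells else 1.0,
        "merges": float(merges(cand) == merges(ref)),
    }
    parts["total"] = sum(parts.values()) / 4
    return parts
```

Under these assumptions, a candidate that reproduces every cell's text but drops a colspan would still lose the whole merged-cell component. Treating topology as all-or-nothing is one possible design choice, and a real scorer may grade it more gradually.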

step-by-step

accuracy
95%
cost / run
$0.33
steps
2.5

The only configuration that asked the agent to check its work before submitting. Highest accuracy, and still cheaper than baseline ($0.33 vs. $0.35 per run), because the agent didn't have to retry.

baseline

accuracy
82%
cost / run
$0.35
steps
2.7

Structured prompt, no self-check. The control group.

with-anthropic-pdf-skill

accuracy
81%
cost / run
$0.31
steps
2.8

Baseline plus a bundled PDF skill in context. Accuracy was within one point of baseline (81% vs. 82%), so the skill didn't move the needle here.

with-pdfplumber-mcp

accuracy
70%
cost / run
$0.20
steps
2.2

Terse prompt plus a specialized MCP server for table extraction. Cheapest configuration at $0.20 per run, but accuracy fell 12 points below baseline (70% vs. 82%): the model reading the PDF natively beat the specialized tool.

terse-prompt

accuracy
68%
cost / run
$0.59
steps
3.4

One sentence of guidance. The agent burned about 1.7 times the baseline cost ($0.59 vs. $0.35 per run) figuring out a workflow on its own.

few-shot

accuracy
67%
cost / run
$0.82
steps
3.4

Given a worked example to pattern-match against. The example anchored the agent on a structure that didn't fit every page. Worst accuracy, highest cost.
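
The comparisons against baseline in the cards above can be recomputed from the reported means. The sketch below copies the six accuracy, cost, and step figures from this page and derives the accuracy difference in points and the cost ratio for each configuration. The names RESULTS and versus_baseline are illustrative and not part of the eval harness.

```python
# Reported means from the cards above: (accuracy %, cost per run in USD, steps).
RESULTS = {
    "step-by-step": (95, 0.33, 2.5),
    "baseline": (82, 0.35, 2.7),
    "with-anthropic-pdf-skill": (81, 0.31, 2.8),
    "with-pdfplumber-mcp": (70, 0.20, 2.2),
    "terse-prompt": (68, 0.59, 3.4),
    "few-shot": (67, 0.82, 3.4),
}


def versus_baseline(results, baseline="baseline"):
    """Return accuracy delta (points) and cost ratio for each configuration."""
    base_acc, base_cost, _ = results[baseline]
    comparison = {}
    for name, (acc, cost, _steps) in results.items():
        comparison[name] = {
            "accuracy_delta_pts": acc - base_acc,
            "cost_ratio": round(cost / base_cost, 2),
        }
    return comparison


# Print one line per configuration, e.g. accuracy delta and cost relative to baseline.
for name, row in versus_baseline(RESULTS).items():
    print(f"{name:26s} {row['accuracy_delta_pts']:+4d} pts  x{row['cost_ratio']:.2f} cost")
```

With only ten pages per configuration, differences of a point or two are best read as ties rather than rankings.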