Benchmark Paper Plan

Objective

Research paper showing the impact of RTFM on the quality, time, and cost of Claude Code on real development tasks (FeatureBench).

4 experimental conditions (11 tasks × 4 configs)

| Config | Description |
| --- | --- |
| A: Standard | Original FeatureBench prompt (provides files + interfaces) — unrealistic |
| B: Discovery baseline | Realistic prompt (paths stripped, --discovery) without RTFM |
| C: RTFM FTS | Discovery prompt + RTFM with FTS only, pre-parsed DB |
| D: RTFM + Embeddings | Discovery prompt + RTFM hybrid search, pre-generated DB (FTS+embeddings) |

Realistic setup protocol (IMPORTANT)

Configs C AND D must use pre-built DBs mounted as volumes. In real usage, RTFM is already initialized in the project — on-the-fly sync is a test protocol artifact, not a user reality.

  • Config C: mount the pre-parsed FTS-only DB (same principle as D)
  • Config D: mount the pre-generated FTS+embeddings DB (already done)
  • Parsing/indexing time is reported as "initialization cost" in the paper, not as "setup time per run"

RTFM initialization costs (one-time per project)

| Repo | Books | Chunks | Parse+FTS | +Embeddings | DB (FTS) | DB (FTS+Embed) |
| --- | --- | --- | --- | --- | --- | --- |
| metaflow | 876 | ~5,060 | ~10s | +161s | 12 MB | 22 MB |
| pydantic | 771 | ~14,762 | ~15s | +444s | 18 MB | 48 MB |
| astropy | 1,123 | ~41,231 | ~30s | +1,232s | 52 MB | 133 MB |
| mlflow | 8,260 | 180,262 | 78s | +5,368s | 234 MB | 592 MB |

Embedding throughput is roughly constant at ~33 chunks/sec on CPU, so embedding time scales linearly with chunk count.
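A quick sanity check of the linearity claim against the table above, using only the reported chunk counts and measured embedding times:

```python
# Predicted embedding time at ~33 chunks/sec vs. the measured "+Embeddings" column
chunks   = {"metaflow": 5_060, "pydantic": 14_762, "astropy": 41_231, "mlflow": 180_262}
measured = {"metaflow": 161,   "pydantic": 444,    "astropy": 1_232,  "mlflow": 5_368}  # seconds

for repo, n in chunks.items():
    print(f"{repo:9s} predicted ~{n / 33:5.0f}s   measured {measured[repo]}s")
# metaflow ~153s vs 161s, pydantic ~447s vs 444s,
# astropy ~1249s vs 1232s, mlflow ~5462s vs 5368s -> roughly linear
```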

Setup per run (with pre-built DBs)

| Step | Config C | Config D |
| --- | --- | --- |
| Install RTFM | ~18s ([mcp]) | ~30s ([mcp,embeddings]) |
| Copy pre-built DB | ~1s | ~1s (592 MB mlflow) |
| Warm fastembed model | N/A | ~17s |
| Total setup | ~20s | ~50s |

TODO: Modify claude_code_rtfm.py (Config C) to copy the pre-parsed FTS DB instead of syncing on the fly. Create FTS-only DBs for each repo.
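A minimal sketch of that change, based on the DB layout described in the TODO section below (/opt/rtfm-dbs-fts/<repo>/library.db). The destination path .rtfm/library.db and the rtfm sync CLI call are assumptions, to be replaced by whatever RTFM actually expects:

```python
import shutil
import subprocess
from pathlib import Path

def setup_rtfm_fts_db(repo: str, project_dir: Path) -> None:
    """Copy the pre-parsed FTS-only DB into the project; fall back to on-the-fly sync."""
    prebuilt = Path("/opt/rtfm-dbs-fts") / repo / "library.db"   # mounted volume (see TODO section)
    target = project_dir / ".rtfm" / "library.db"                # ASSUMPTION: RTFM's expected DB location
    target.parent.mkdir(parents=True, exist_ok=True)
    if prebuilt.exists():
        shutil.copy2(prebuilt, target)                           # ~1s even for the 234 MB mlflow FTS DB
    else:
        # Fallback: on-the-fly sync (test-protocol artifact, reported as initialization cost)
        subprocess.run(["rtfm", "sync"], cwd=project_dir, check=True)  # ASSUMPTION: CLI name/verb
```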

Metrics to collect per run

Performance

  • Total time (wall clock)
  • Agent time (excluding RTFM setup)
  • RTFM setup time (install + copy DB + warm model)

Cost

  • Tokens input / output / cache read
  • Cost $ (via Claude Code total_cost_usd)
  • Number of turns (API round-trips)

Quality

  • Resolve rate: the test passes or not (binary, evaluated by FeatureBench fb eval)
  • F2P pass rate: percentage of fail-to-pass tests that pass
  • Patch size (chars)
  • Patch correctness (the patch touches the right files)

Tool usage

  • Number of calls per tool (Grep, Glob, Read, Edit, Bash, rtfm_search, rtfm_expand, etc.)
  • Discovery vs coding tools ratio
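One possible operationalization of that ratio (a sketch; which tools count as "discovery" and which as "coding" is an assumption the paper should pin down):

```python
from collections import Counter

# ASSUMPTION: the discovery/coding split below is a choice to be stated in the paper.
DISCOVERY_TOOLS = {"Grep", "Glob", "Read", "rtfm_search", "rtfm_expand", "rtfm_discover"}
CODING_TOOLS    = {"Edit", "Write", "Bash"}

def discovery_coding_ratio(tool_calls: Counter) -> float:
    """Ratio of discovery-tool calls to coding-tool calls for a single run."""
    discovery = sum(n for tool, n in tool_calls.items() if tool in DISCOVERY_TOOLS)
    coding    = sum(n for tool, n in tool_calls.items() if tool in CODING_TOOLS)
    return discovery / coding if coding else float("inf")
```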

RTFM transparency (amortized costs)

  • Parsing time per repo (one-time)
  • Embedding time per repo (one-time)
  • Generated DB size
  • Fastembed cold start (~17s, one-time per session)

11 tasks (4 images, no-GPU, lite split level 1)

metaflow (1 task, 624 books, 5060 chunks)

  • Netflix__metaflow.test_stub_generator

pydantic (1 task, 771 books, 14762 chunks)

  • pydantic__pydantic.test_deprecated_fields

astropy (2 tasks, ~1122 books)

  • astropy__astropy.test_quantity_erfa_ufuncs
  • astropy__astropy.test_table

mlflow (7 tasks, 8260 books, 180262 chunks)

  • mlflow__mlflow.test_validation
  • mlflow__mlflow.test_judge_tool_search_traces
  • mlflow__mlflow.test_serialization
  • mlflow__mlflow.test_span
  • mlflow__mlflow.test_trace
  • mlflow__mlflow.test_databricks_tracing_utils
  • mlflow__mlflow.test_responses_agent

Historical data — Config A (Standard, Feb 22, Sonnet 4.0)

Source: benchmark_final_results.jsonl — 10 tasks (no fb eval)

| Task | Base dur (s) | RTFM dur (s) | Delta | Base turns | RTFM turns | RTFM searches |
| --- | --- | --- | --- | --- | --- | --- |
| test_stub_generator (metaflow) | 565 | 354 | -37% | 60 | 35 | 4 |
| test_quantity_erfa_ufuncs (astropy) | 536 | 606 | +13% | 62 | 83 | 12 |
| test_table (astropy) | 675 | 632 | -6% | 86 | 89 | 8 |
| test_databricks_tracing (mlflow) | 524 | 688 | +31% | 52 | 79 | 11 |
| test_judge_tool (mlflow) | 571 | 457 | -20% | 69 | 68 | 18 |
| test_responses_agent (mlflow) | 746 | 1022 | +37% | 84 | 79 | 2 |
| test_serialization (mlflow) | 481 | 509 | +6% | 47 | 47 | 6 |
| test_span (mlflow) | 707 | 958 | +36% | 81 | 116 | 4 |
| test_trace (mlflow) | 676 | 782 | +16% | 54 | 99 | 4 |
| test_validation (mlflow) | 701 | 386 | -45% | 45 | 41 | 3 |

WARNING: no eval → we don't know if the tests actually pass.

Historical data — Config B/C Discovery (Feb 25, Sonnet 4.0, mlflow only)

Source: 13 runs in ~/projects/FeatureBench/runs/2026-02-25__*

| Task | Config | Duration (s) | Cost ($) | Turns | RTFM calls | F2P | Resolved |
| --- | --- | --- | --- | --- | --- | --- | --- |
| test_validation | B | 596 | $2.22 | 76 | 0 | 6/11 (54.5%) | No |
| test_validation | C | 372 | $1.31 | 60 | 1 | 11/11 (100%) | YES |
| test_databricks_tracing | B | 667 | $2.86 | 61 | 0 | 11/18 (61.1%) | No |
| test_databricks_tracing | C | 441 | $2.12 | 51 | 5 | 13/18 (72.2%) | No |
| test_judge_tool | B | 427 | $1.42 | 58 | 0 | 3/18 (16.7%) | No |
| test_judge_tool | C | 500 | $1.55 | 58 | 8 | 3/18 (16.7%) | No |
| test_responses_agent | B | TIMEOUT (~1200) | ~$10.64 | ~178 | 0 | - | - |
| test_responses_agent | C | 917 | $3.58 | 101 | 15 | 0/1 (0%) | No |

Runs Feb 27-28 (Sonnet 4.0, OAuth MAX, post-FastEmbed, pre-parsed DBs)

test_responses_agent — Full A/B/C/D matrix (worst case)

| Metric | Config A (Standard) | Config B (Discovery) | Config C (FTS) | Config D (Embed+) |
| --- | --- | --- | --- | --- |
| Resolved | No (3.5% F2P) | No (TIMEOUT) | No (0%) | No (0%) |
| Total duration | 1156s | TIMEOUT (1283s) | 1175s | 1013s |
| Agent duration | 1019s | ~1200s (killed) | 872s | 872s |
| Cost (Claude) | $5.03 | N/A (timeout) | $3.66 | $3.84 |
| Cost (calculated) | $8.34 | $6.90 | - | - |
| Turns | 118 | 139 (incomplete) | 91 | 103 |
| Tool calls | 117 | 138 (incomplete) | 90 | 102 |
| Read | 39 | 50 | 27 | 43 |
| Grep | 7 | 39 | 15 | 7 |
| Edit | 40 | 21 | 26 | 24 |
| Bash | 22 | 18 | - | - |
| RTFM search | 0 | 0 | 10 | 13 |
| Cache read | 23,810,024 | 18,933,950 | 8,676,183 | 8,959,902 |
| Output tokens | 529 | 682 | 40,096 | 49,653 |
| Patch size | 92,474 chars | 0 (timeout) | 50,539 chars | 91,048 chars |
| Interfaces covered | 15/15 | 0/15 | 8/15 | 12/15 |
| Files modified | 15 | 0 | 8 (+2 RTFM) | 12 (+2 RTFM) |

Key observations (4 configs, same task):

  1. No config resolves the task — Sonnet 4.0 cannot handle 78K chars of prompt + 15 interfaces.
  2. Config B (discovery) times out at 1200s — without paths AND without RTFM, the agent is lost among 8,260 files.
  3. Config A (standard) covers the 15 interfaces (paths in the prompt) but still fails (3.5% F2P).
  4. Config D (embeddings) covers 12/15 — embeddings guide the agent toward the right files better than FTS alone (8/15).
  5. Patch size is correlated with interfaces covered: A ≈ D ≈ 91K >> C ≈ 50K >> B = 0.
  6. Configs C and D have the lowest cache usage (8-9M vs 19-24M for A/B) — RTFM reduces context.
  7. Config A has the most cache read (24M) — paths in the prompt direct the agent to all files, but it reads them in full.

test_responses_agent — Sonnet 4.6 attempt (abandoned, insufficient quota)

Exploratory tests of Configs A and D with Sonnet 4.6, timeout 2400s (40 min):

| Metric | S4.0 Config A | S4.0 Config D | S4.6 Config A | S4.6 Config D |
| --- | --- | --- | --- | --- |
| Duration | 1156s | 1013s | TIMEOUT (2480s) | TIMEOUT (2545s) |
| Turns | 118 | 103 | 357 | 657 |
| Tool calls | 117 | 102 | 355 | 650 |
| Read | 39 | 43 | 129 | 306 |
| Edit | 40 | 24 | 55 | 87 |
| Bash | 22 | - | 159 | 176 |
| RTFM search | 0 | 13 | 0 | 18 |
| Subagents | 0 | 0 | 1 | 8 |
| Cache read | 23.8M | 9.0M | 54.5M | 49.5M |
| Cost (calculated) | $8.34 | $3.84 | $21.65 | $20.86 |

Sonnet 4.6 observations:

  • Works 3-6x more than 4.0 (657 vs 102 tool calls in Config D).
  • Runs tests via Bash (159-176 calls), unlike 4.0 (0 test runs).
  • Uses subagents to parallelize (8 in Config D).
  • But does not finish within 40 min — MAX quota insufficient to explore further.
  • RTFM amplifies the exploratory behavior (657 turns in D vs 357 in A).
  • Key argument: at nearly identical duration and cost (~2500s, ~$21), Config D (RTFM) does 657 turns vs 357 (Config A) — 2x more useful work for the same price. Cost/turn is lower with RTFM ($0.032 vs $0.061) because cache read per turn drops (75K vs 153K): the agent goes directly to the right files instead of exploring everything.
  • Abandoned: the task is too heavy even for 4.6 and not representative of the real use case.
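The per-turn figures in the key argument follow directly from the table above:

```python
# Cost per turn and cache-read tokens per turn for the Sonnet 4.6 runs
print(21.65 / 357, 20.86 / 657)      # ≈ $0.061/turn (Config A) vs ≈ $0.032/turn (Config D)
print(54.5e6 / 357, 49.5e6 / 657)    # ≈ 153K (A) vs ≈ 75K (D) cache-read tokens per turn
```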

Failure analysis: test_responses_agent (Sonnet 4.0)

Why 0% resolution despite RTFM:

  1. Disproportionate task: 78K chars of prompt, 15 interfaces, ground truth = 226K chars across 60 files. This is an outlier in FeatureBench.

  2. Config C: ImportError — the agent only implemented 7/15 interfaces. It said: "Due to space constraints, let me focus on the most critical interfaces" and abandoned the 8 most complex ones. The test cannot be imported.

  3. Config D: SyntaxError — the agent covered 12/15 interfaces (embeddings guided it better) but an editing bug on responses.py corrupted the file. 4 successive Edits to insert output_to_responses_items_stream ended up turning a continue into continue(chunks:... → immediate SyntaxError.

  4. No agent ran tests: neither pytest nor even python -c "import ...". Errors would have been detected immediately.

  5. Lost in the middle: Config D's TodoList contained 9 items instead of 15. The 3 missing interfaces (responses_helpers.py, data_validation.py, models/model.py) are those appearing at the END of the 78K char prompt.

  6. The FeatureBench prompt does NOT ask the agent to run the tests: it says "pytest will be used to test" (passive voice) — never "run the tests yourself". There is no incentive for a feedback loop.

RTFM impact still positive:

  • D covers 12/15 interfaces vs 7/15 for C (+71%).
  • D produces a 91K char patch vs 50K (+80%).
  • Hybrid search guides the agent to the right files: +59% Read, -53% Grep.
  • But Sonnet 4.0 cannot handle 78K chars of prompt with 15 complex interfaces.

This is not a retrieval problem, it is a model capability problem.

test_stub_generator (metaflow) — ABCD matrix (small repo, 624 books)

| Metric | Config A (Standard) | Config B (Discovery) | Config C (FTS) | Config D (Embed+) |
| --- | --- | --- | --- | --- |
| Resolved | YES (100%) | YES (100%) | YES (100%) | No (96.8%) |
| F2P | 31/31 | 31/31 | 31/31 | 30/31 |
| Total duration | 370s | 395s | 454s | 541s |
| Cost (Claude) | $0.97 | $1.07 | $1.30 | $1.44 |
| Cost (calculated) | $1.44 | $1.65 | $2.08 | $2.33 |
| Turns | 42 | 51 | 58 | 67 |
| Tool calls | 42 | 51 | 58 | 67 |
| Grep | 20 | 15 | 16 | 21 |
| Read | 4 | 10 | 16 | 17 |
| Edit | 4 | 5 | 7 | 5 |
| Bash | 5 | 14 | 5 | 18 |
| RTFM search | 0 | 0 | 1 | 1 |
| RTFM expand | 0 | 0 | 1 | 0 |
| RTFM discover | 0 | 0 | 1 | 0 |
| Patch size | 27K | 22K | 22K | 23K |

Observations (small repo — easy task):

  1. All 4 configs resolve the task (except D, which misses 1 test out of 31: test_class_stub_generation).
  2. Config A is the fastest — paths in the prompt eliminate discovery.
  3. Config B (discovery) resolves equally well in +7% time — on a small repo, direct navigation suffices.
  4. RTFM provides no measurable advantage on a 624-book repo:
     • Config C is +23% slower than A and +22% more expensive.
     • Config D is +46% slower, +48% more expensive, and misses 1 test.
  5. The RTFM agent barely uses RTFM (1-2 calls) — it navigates directly because the repo is small.
  6. More Read calls with RTFM (16-17 vs 4-10) — overhead without gain.

Conclusion: RTFM does not help on small repos. Setup overhead and MCP calls add latency without benefit when the agent can navigate directly. RTFM is designed for large codebases where discovery is the bottleneck.

test_validation (mlflow) — ABCD matrix (large repo, 8260 books)

| Metric | Config A (Standard) | Config B (Discovery) | Config C (FTS) | Config D (Embed+) |
| --- | --- | --- | --- | --- |
| Resolved | No | No | YES | YES |
| F2P | 6/11 (55%) | 7/11 (64%) | 11/11 (100%) | 11/11 (100%) |
| Total duration | 479s | 459s | 732s | 605s |
| Cost (Claude) | $1.12 | $1.50 | $2.42 | $1.33 |
| Cost (calculated) | $1.87 | $2.54 | $4.04 | $2.23 |
| Turns | 44 | 53 | 81 | 50 |
| Tool calls | 44 | 53 | 81 | 50 |
| Grep | 8 | 6 | 7 | 13 |
| Read | 9 | 13 | 23 | 12 |
| Edit | 4 | 6 | 12 | 5 |
| Bash | 18 | 22 | 20 | 9 |
| Glob | 0 | 0 | 7 | 1 |
| RTFM search | 0 | 0 | 3 | 2 |
| RTFM expand | 0 | 0 | 2 | 0 |
| Patch size | 16K | 15K | 24K | 23K |

Observations (large repo — medium task):

  1. RTFM raises resolution from 55-64% to 100% — the most striking result of the benchmark.
  2. A and B fail on the same tests: test_validate_scorers_invalid_all_scorers, test_validate_data_with_correctness, test_validate_data_missing_columns → tests that require understanding distant modules (validation.py ↔ scorers.py ↔ data.py).
  3. Config D is the most efficient: 50 turns vs 81 (C), $2.23 vs $4.04 (C) → embeddings guide the agent directly to the right files, with less random navigation.
  4. Config C is slower but resolves: FTS suffices when the terms are in the prompt.
  5. More Read in C (23) than in D (12): without embeddings, the agent reads more files to find the right one.
  6. Less Bash in D (9) than in A/B (18-22): the RTFM agent codes more and debugs less.
  7. Larger patch with RTFM (23-24K vs 15-16K): more complete implementation.

Conclusion: on a large repo, RTFM changes the outcome. The agent without RTFM cannot find dependencies between modules → incomplete implementation → failed tests.

Prompt difference A vs B/C/D

The discovery mode only removes 751 chars out of 78,036 (< 1%):

  • 16 "Path: /testbed/..." lines removed
  • "under the specified path" → "Explore the existing codebase to determine where"
  • Everything else (78K) is identical: description, interfaces, signatures, docstrings
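For reference, the transformation amounts to roughly the following (a sketch; the exact regex and wording used by FeatureBench's --discovery flag may differ):

```python
import re

def to_discovery_prompt(prompt: str) -> str:
    """Turn the standard FeatureBench prompt into the discovery variant (sketch)."""
    # Drop the "Path: /testbed/..." lines that pin each interface to a file
    prompt = re.sub(r"^Path: /testbed/.*\n", "", prompt, flags=re.MULTILINE)
    # Replace the location hint with an instruction to discover the location
    return prompt.replace(
        "under the specified path",
        "Explore the existing codebase to determine where",
    )
```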

Infrastructure

FeatureBench Agents

  • claude_code.py — Config A and B (standard Claude Code)
  • claude_code_rtfm.py — Config C (FTS, on-the-fly sync → TO BE MODIFIED for pre-parsed DB)
  • claude_code_rtfm_embed.py — Config D (pre-generated DB, hybrid search)

Benchmark Script

  • run_benchmark.sh — 4 configs × N tasks, auto eval + metrics
  • Metrics → reports/benchmark/metrics.jsonl
  • Tool usage parsed from content blocks in stream output
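A minimal sketch of the metrics extraction, assuming Claude Code's stream-json output (assistant messages carrying tool_use content blocks, plus a final result event with total_cost_usd and num_turns; the exact field names should be checked against the real stream):

```python
import json
from collections import Counter

def parse_run_metrics(stream_path: str) -> dict:
    """Count tool calls per tool and pull cost/turn totals from a stream-json log."""
    tool_calls = Counter()
    cost_usd = num_turns = None
    with open(stream_path) as f:
        for line in f:
            event = json.loads(line)
            # ASSUMPTION: assistant events expose content blocks; tool_use blocks carry a "name"
            for block in (event.get("message") or {}).get("content") or []:
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    tool_calls[block["name"]] += 1
            if event.get("type") == "result":        # ASSUMPTION: final summary event
                cost_usd = event.get("total_cost_usd")
                num_turns = event.get("num_turns")
    return {"tool_calls": dict(tool_calls), "cost_usd": cost_usd, "turns": num_turns}
```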

Auth: OAuth MAX only (no API key)

  • Credentials copied from local → PC2
  • Token auto-refreshes via refresh_token
  • Need to refresh before each batch run (scp ~/.claude/.credentials.json)

TODO — Next session

Preparation (before launching runs)

  • Create pre-parsed FTS-only DBs for each repo — DONE 28/02 → /mnt/data/rtfm-dbs-fts/: metaflow 12 MB, pydantic 18 MB, astropy 52 MB, mlflow 234 MB
  • Modify claude_code_rtfm.py (Config C) to copy the pre-parsed FTS DB — DONE 28/02 → copies from /opt/rtfm-dbs-fts/<repo>/library.db, fallback to sync if absent
  • Refresh OAuth credentials on PC2 — DONE 28/02

Priority runs

  • Launch A and B on test_responses_agent — IN PROGRESS (A launched 28/02)
  • Launch A/B/C/D on test_validation — DONE 28/02 (RTFM C+D resolve, A+B don't!)
  • Launch A/B/C/D on test_stub_generator (metaflow) — DONE 28/02 (3/4 resolve; D misses 1 test, 30/31)

Full runs

  • Re-run Config A with eval (11 tasks including pydantic)
  • Launch Config B/C/D on non-mlflow tasks (metaflow, astropy, pydantic)
  • Re-launch Config B for serialization and responses_agent (previously failed)
  • Decide the number of repetitions per condition (for statistical significance)
  • Full matrix 11 tasks × 4 configs