RTFM Benchmark Results — Complete Analysis¶

Date: 2026-03-02 Version: RTFM v0.3.1

Table of Contents¶

Study Design
B10 Real Task (Musicology)
FeatureBench 4-Condition Study
Sonnet 4.6 Model Comparison
Key Findings
Limitations
Raw Data References

Study Design¶

Configurations¶

Config	Description	RTFM	File paths
A	Standard prompt	No	Visible
B	Discovery baseline	No	Stripped from prompt
C	RTFM FTS	Yes (FTS5)	Stripped (agent must discover via RTFM)
D	RTFM + Embeddings	Yes (FTS5 + semantic)	Stripped (agent must discover via RTFM)

What B vs C/D tests¶

B strips file paths from prompts like C/D but has no RTFM. This isolates the RTFM effect from the path-stripping effect. If C outperforms B, the gain comes from RTFM's search, not from having paths in the prompt.

Metrics¶

F2P (Fail-to-Pass): fraction of failing tests the agent made pass
Resolved: 100% F2P (all tests pass)
Cost: API cost in USD
Duration: wall clock time in seconds
RTFM calls: total rtfm_search + rtfm_expand + rtfm_discover calls

B10 Real Task (Musicology)¶

Task: "Rédige l'article B10" — generate a scholarly article from indexed research notes. Model: Claude Opus 4 — Repo: musicology-phd (Obsidian vault, ~40 articles)

Iterative A→H sessions¶

8 sessions tested progressively, same prompt, fixing RTFM issues between iterations.

Key sessions (single runs, not averaged)¶

Session	Config	Duration	Cost	Tokens	RTFM calls	Language	Article
A	RTFM v1	12m00s	~$13.00	3.95M	18	ENGLISH (bug)	36K/8 sections
B	No RTFM (baseline)	8m16s	$22.61	8.21M	0	French	31K/10 sections
C	RTFM v2	11m23s	~$5.11	5.87M	7	French	38.5K/14 sections
D	RTFM v2+	15m48s	$2.62	3.97M	12	French	31.4K/18 sections
H	RTFM v3 final	6m58s	$11.14	3.22M	—	French	—

H vs B (final comparison)¶

Metric	B (No RTFM)	H (RTFM v3)	Delta
Duration	8m16s	6m58s	-16%
Cost	$22.61	$11.14	-51%
Tokens	8.21M	3.22M	-61%

Behavioral Analysis¶

Session A (RTFM v1) — what went wrong: - 18 RTFM calls: 7 necessary (39%), 6 redundant (33%), 5 useless (28%) - Indexed English translations (_en/B4.md) overwrote French originals (slug collision bug) - Agent saw lang: en in results → wrote entire article in English - Each call injected ~2500 tokens → 45K tokens of RTFM overhead

Session B (No RTFM) — why it's expensive: - 85 tool calls (37 direct + 48 in subagents) for file discovery - 15 dead ends (wrong directories, missing files) — but each was cheap (<1s, ~0 tokens) - Parallel subagents made discovery fast (~2min) - Correct language because no English sources surfaced

Session C (RTFM v2) — progressive disclosure works: - Only 7 RTFM calls (vs 18 in v1) thanks to metadata-only search - ~300 tokens per search result instead of ~2500 - Language correct (FR/EN distinction visible in results) - Still 29 redundant Glob/Read calls alongside RTFM

Session H (RTFM v3 final): - Agent trusts RTFM results, minimal redundant exploration - -51% cost, -16% duration vs baseline

Fixes between iterations¶

v1→v2: Metadata-only search (300 tokens vs 2500), absolute file paths, language metadata
v2→v2+: Anti-duplication template, CLI removed, hooks for remember
v2+→v3: FTS default (no 6min MiniLM cold start), pure data output, "search first" template

Data source¶

Session transcripts: ~/.claude/projects/musicology-phd/ on dev machine
Analysis: memory/ab_test_b10_analysis.md in this repo

FeatureBench 4-Condition Study¶

Model: Claude Sonnet 4 (claude-sonnet-4-20250514) — Timeout: 20 min Platform: FeatureBench (SWE-bench-style Docker), 4 repos, 11 tasks (level 1)

Aggregate Results (Sonnet 4)¶

Config	Task	N	Duration(s)	Cost($)	Tokens(M)	Turns	RTFM	F2P	Resolved
A	test_stub_generator	4	400	1.21	2.56	52	0.0	N/A	0/4
A	test_table	1	635	2.41	5.60	95	0.0	25.6%	0/1
A	test_validation	3	471	1.27	2.71	47	0.0	N/A	0/3
B	test_stub_generator	3	460	1.32	2.81	58	0.0	N/A	0/3
B	test_table	1	737	3.27	8.34	126	0.0	30.2%	0/1
B	test_validation	3	482	1.36	3.00	53	0.0	N/A	0/3
C	test_stub_generator	9	560	1.52	3.35	65	3.7	100% (2/9)	2/9
C	test_table	1	823	2.89	7.01	112	9.0	23.3%	0/1
C	test_validation	8	583	1.65	3.72	62	5.9	81.8% (2/8)	1/8
D	test_stub_generator	8	634	1.50	3.29	63	3.6	100% (2/8)	2/8
D	test_table	1	847	4.61	12.24	114	32.0	30.2%	0/1
D	test_validation	8	627	1.54	3.45	58	7.0	59.1% (2/8)	0/8
A	test_responses_agent	1	1156	5.03	12.08	118	0.0	N/A	0/1
D	test_responses_agent	1	1013	3.84	9.12	103	14.0	N/A	0/1

Per-Task Analysis¶

test_stub_generator (mlflow, 8259 files)¶

RTFM's strongest result. C and D are the only configs that ever resolve the task.

Config	Resolved	Resolution rate	Avg cost	Avg duration
A	0/4	0%	$1.21	400s
B	0/3	0%	$1.32	460s
C	2/9	22%	$1.52	560s
D	2/8	25%	$1.50	634s

Why RTFM helps: The task requires finding stub_generator.py in a large repo (8259 files). Without RTFM, agents can't locate it within the timeout. RTFM surfaces the relevant file directly.

Why it's not 100%: Even with RTFM pointing to the right file, the code change is non-trivial — the agent must understand the stub generation logic and make the correct modification. 75% of RTFM runs find the file but fail on implementation.

test_validation (mlflow, 8259 files)¶

Config	Resolved	Best F2P	Avg cost	Avg duration
A	0/3	N/A	$1.27	471s
B	0/3	N/A	$1.36	482s
C	1/8	11/11 (100%)	$1.65	583s
D	0/8	7/11 (64%)	$1.54	627s

C resolves once (11/11 F2P) in 8 runs. D gets close (7/11) but never fully resolves. A/B never produce evaluable patches.

Key insight: FTS (C) outperforms embeddings (D) here. The search terms are exact function/class names — FTS excels at exact matching while embeddings add noise from semantically similar but irrelevant results.

test_table (astropy, 3515 files)¶

Config	F2P	Cost	Tokens	Duration	RTFM calls
A	11/43 (25.6%)	$2.41	5.60M	635s	0
B	13/43 (30.2%)	$3.27	8.34M	737s	0
C	10/43 (23.3%)	$2.89	7.01M	823s	9
D	13/43 (30.2%)	$4.61	12.24M	847s	32

RTFM's weakest result. No config resolves. RTFM adds overhead without improving F2P.

Why RTFM hurts here: This is a medium-sized repo where standard tools (Grep, Glob) work well. The task requires modifying table.py which is easy to find. RTFM adds 9-32 extra calls that consume tokens and time without value. D is particularly bad: 32 RTFM calls, 2x cost of A, same F2P.

Behavioral analysis (from stream logs): - A (95 turns): Focused — finds the file quickly, spends most time on implementation. 70 Read, 73 Bash, 25 Grep. - B (126 turns): Path-stripped, so more exploration. Higher Grep (25) and Read (75) counts. - C (112 turns): 9 RTFM calls early on, then pivots to standard tools. The RTFM results don't add value over what Grep would find instantly. - D (114 turns): 32 RTFM calls throughout — agent keeps going back to RTFM for every question. 103 Read + 59 Grep + 32 RTFM = massive token consumption.

test_responses_agent (mlflow, 8259 files)¶

Only A and D tested (1 run each). Neither resolves. D uses 14 RTFM calls and costs $3.84 vs A's $5.03 — RTFM may help with cost but sample size is too small to conclude.

Sonnet 4.6 Model Comparison¶

Model: Claude Sonnet 4.6 (claude-sonnet-4-6) — Task: test_table (astropy) Timeout: 20 min (1200s)

Results¶

Config	Duration	Tools	RTFM	Tokens	F2P
A	TIMEOUT	192	0	~21.9M	0%
B	TIMEOUT	185	0	~22.8M	0%
C	TIMEOUT	204	9	~23.0M	0%
D	TIMEOUT	233	10	~28.8M	0%

Tool Call Breakdown¶

Tool	A	B	C	D
Read	70	75	86	103
Bash	73	57	64	37
Grep	25	25	19	59
Edit	13	20	20	15
TodoWrite	6	7	4	6
Agent	2	1	1	1
Glob	3	0	1	2
RTFM search	—	—	4	5
RTFM expand	—	—	4	4
RTFM discover	—	—	1	1

Sonnet 4.6 vs Sonnet 4 (test_table, all configs)¶

Metric	Sonnet 4 (avg)	Sonnet 4.6 (all)	Delta
Duration	760s	1200s (TIMEOUT)	+58%
Tokens	8.30M	24.1M	+190%
Tool calls	112	204	+82%
F2P	25.8%	0%	-100%
Patch produced	Yes	No	—

Sonnet 4.6 is catastrophically worse on this benchmark: - 3x more tokens consumed - All 4 configs TIMEOUT at 20 min with 0% F2P - No config produces any patch (0 bytes) - The model appears to spend more time exploring/reading without converging on a solution

This is not an RTFM-specific finding — all configs fail equally. The model may need different prompt engineering or longer timeouts for SWE-bench-style tasks.

Key Findings¶

Where RTFM helps¶

Large repos with non-obvious file locations (test_stub_generator)
A/B: 0% resolution rate → C/D: 22-25% resolution rate
RTFM is the difference between "never solves" and "sometimes solves"
The gain is purely from file discovery — RTFM surfaces relevant files the agent can't find with Grep/Glob in time
Real-world multi-source knowledge tasks (B10 article generation)
-51% cost, -16% duration, -61% tokens vs baseline
RTFM allows the agent to navigate a complex knowledge base (40+ articles, research notes, drafts) efficiently
Without RTFM, the agent resorts to expensive exploration (85 tool calls, 48 in subagents)
FTS outperforms embeddings for code tasks
C (FTS) matches or beats D (FTS+embeddings) on every FeatureBench task
test_validation: C resolves 1/8, D resolves 0/8
test_table: D costs 2x more than C with same F2P
Code search terms are often exact identifiers — FTS is ideal, embeddings add noise

Where RTFM is neutral¶

test_validation: C resolves 1x in 8 runs (12.5%), vs 0% for A/B/D. The improvement exists but is marginal and could be variance with this sample size.

Where RTFM hurts¶

Medium repos with obvious file structures (test_table, astropy)
C: 23.3% F2P vs A: 25.6% — slightly worse
D: 30.2% F2P but $4.61 vs A: $2.41 — same F2P at 2x cost
RTFM calls consume tokens and time without aiding discovery
Agent's natural tools (Grep for "class Table", Glob for "table.py") are faster and cheaper
Embeddings always increase cost, rarely improve results
D averages +24% cost vs C across all tasks
D never outperforms C on resolution rate
The 6-min MiniLM cold start (eliminated in v3 by defaulting to FTS) was a critical blocker
Even post-fix, embedding search adds ~20% more RTFM calls (agent queries more because semantic results feel "helpful")

Limitations¶

Methodological¶

Small sample sizes: Most task/config combinations have 1-9 runs. Variance is high — a single lucky/unlucky run shifts averages significantly.
Single model: All FeatureBench runs use Sonnet 4. Results may not generalize to Opus or other models.
Limited task diversity: Only 4 unique tasks with full ABCD data. The B10 task is fundamentally different from FeatureBench tasks (knowledge generation vs. bug fixing).
No statistical tests: With N=1-9 per condition, we can't compute meaningful p-values. All comparisons should be treated as directional, not conclusive.

Tool-specific¶

RTFM overhead scales with repo familiarity: On repos the agent already "knows" (common patterns, obvious file names), RTFM is pure overhead. The benefit emerges only when files are hard to find.
Embedding search (D) consistently underperforms FTS (C): For code tasks where search terms are identifiers, semantic similarity is counterproductive.
Agent calibration problem: The agent sometimes over-relies on RTFM (D on test_table: 32 calls) or under-relies on it. The CLAUDE.md template helps but doesn't fully control behavior.

Raw Data References¶

Files on PC2 (roomi@192.168.1.28)¶

Path	Description
`~/projects/FeatureBench/reports/benchmark/metrics.jsonl`	All benchmark metrics (64 entries, all runs)
`~/projects/FeatureBench/runs/`	Sonnet 4 run outputs (A/B/C/D for all tasks)
`~/projects/FeatureBench/runs_s46/`	Sonnet 4.6 run outputs (test_table only)
`~/projects/FeatureBench/run_benchmark.sh`	Main benchmark script
`/mnt/data/rtfm-dbs/<repo>/library.db`	Pre-generated RTFM databases per repo

Stream log format¶

Each run contains claude_code_stream_output.jsonl with events:

{"type": "assistant", "message": {"content": [{"type": "tool_use", "name": "Read", "input": {...}}]}}
{"type": "result", "result": {"usage": {...}, "costUSD": 1.23, "numTurns": 45}}

Files in this repo¶

Path	Description
`paper/benchmark_results.md`	This file
`paper/ab_test_b10_analysis.md`	Detailed B10 session analysis (A→D)
`paper/benchmark_paper.md`	Paper plan (EMSE)
`paper/BACKLOG.md`	Research backlog

Local cache¶

Path	Description
`/tmp/metrics_raw.jsonl`	Copy of metrics.jsonl downloaded from PC2

Per-Run Raw Data¶

test_stub_generator — All runs (Sonnet 4)¶

Config	Run	Duration	Cost	Tokens	Turns	RTFM	F2P	Resolved
A	r1	370s	$0.97	1.84M	43	0	—	No
A	r1	580s	$1.96	4.29M	84	0	—	No
A	r2	583s	$1.82	4.02M	77	0	—	No
B	r1	395s	$1.07	2.21M	52	0	—	No
B	r1	462s	$1.37	2.99M	64	0	—	No
B	r2	523s	$1.52	3.23M	59	0	—	No
C	r1	454s	$1.30	2.78M	59	2	—	No
C	r1	535s	$1.63	3.64M	66	4	—	No
C	r1	486s	$1.48	3.35M	67	3	—	No
C	r1	491s	$1.29	2.72M	58	2	—	No
C	r1	551s	$1.37	3.00M	46	6	—	No
C	r1	750s	$1.79	3.94M	72	2	—	No
C	r1	529s	$1.29	2.69M	58	5	31/31	Yes
C	r2	702s	$2.06	4.89M	89	3	—	No
C	r2	540s	$1.46	3.20M	66	6	31/31	Yes
D	r1	541s	$1.44	3.20M	68	1	—	No
D	r1	523s	$0.86	1.65M	37	3	—	No
D	r1	535s	$1.31	2.92M	54	5	—	No
D	r1	859s	$2.05	4.61M	92	2	—	No
D	r1	670s	$1.48	3.35M	57	7	—	No
D	r1	584s	$1.51	3.26M	66	3	31/31	Yes
D	r2	659s	$1.54	3.30M	58	8	—	No
D	r2	699s	$1.83	4.01M	70	0	31/31	Yes

test_validation — All runs (Sonnet 4)¶

Config	Run	Duration	Cost	Tokens	Turns	RTFM	F2P	Resolved
A	r1	479s	$1.12	2.41M	45	0	—	No
A	r1	440s	$1.18	2.41M	42	0	—	No
A	r2	493s	$1.51	3.31M	55	0	—	No
B	r1	459s	$1.50	3.38M	54	0	—	No
B	r1	509s	$1.28	2.82M	56	0	—	No
B	r2	477s	$1.30	2.81M	48	0	—	No
C	r1	732s	$2.42	5.74M	82	5	—	No
C	r1	516s	$1.57	3.52M	66	16	—	No
C	r1	722s	$2.23	5.27M	77	3	—	No
C	r1	495s	$1.41	3.13M	56	10	—	No
C	r1	618s	$1.24	2.50M	53	6	11/11	Yes
C	r2	541s	$1.57	3.57M	60	2	—	No
C	r2	590s	$1.43	3.17M	55	2	7/11	No
C	r3	450s	$1.31	2.83M	47	3	—	No
D	r1	605s	$1.33	2.97M	51	2	—	No
D	r1	684s	$1.58	3.67M	61	11	—	No
D	r1	651s	$1.68	3.71M	59	1	—	No
D	r1	578s	$1.19	2.44M	44	3	—	No
D	r1	784s	$2.08	4.90M	80	20	6/11	No
D	r2	522s	$1.46	3.28M	53	8	—	No
D	r2	673s	$1.75	3.99M	65	6	7/11	No
D	r3	518s	$1.23	2.64M	47	5	—	No

test_table — All runs (Sonnet 4)¶

Config	Duration	Cost	Tokens	Turns	RTFM	F2P	Resolved
A	635s	$2.41	5.60M	95	0	11/43 (25.6%)	No
B	737s	$3.27	8.34M	126	0	13/43 (30.2%)	No
C	823s	$2.89	7.01M	112	9	10/43 (23.3%)	No
D	847s	$4.61	12.24M	114	32	13/43 (30.2%)	No

test_table — Sonnet 4.6 (all TIMEOUT at 1200s)¶

Config	Tools	RTFM	Read	Bash	Grep	Edit
A	192	0	70	73	25	13
B	185	0	75	57	25	20
C	204	9	86	64	19	20
D	233	10	103	37	59	15

test_responses_agent (Sonnet 4, 1 run each)¶

Config	Duration	Cost	Tokens	Turns	RTFM	Resolved
A	1156s	$5.03	12.08M	118	0	No
D	1013s	$3.84	9.12M	103	14	No