RTFM Benchmark Results — Complete Analysis¶
Date: 2026-03-02 Version: RTFM v0.3.1
Table of Contents¶
- Study Design
- B10 Real Task (Musicology)
- FeatureBench 4-Condition Study
- Sonnet 4.6 Model Comparison
- Key Findings
- Limitations
- Raw Data References
Study Design¶
Configurations¶
| Config | Description | RTFM | File paths |
|---|---|---|---|
| A | Standard prompt | No | Visible |
| B | Discovery baseline | No | Stripped from prompt |
| C | RTFM FTS | Yes (FTS5) | Stripped (agent must discover via RTFM) |
| D | RTFM + Embeddings | Yes (FTS5 + semantic) | Stripped (agent must discover via RTFM) |
What B vs C/D tests¶
B strips file paths from prompts like C/D but has no RTFM. This isolates the RTFM effect from the path-stripping effect. If C outperforms B, the gain comes from RTFM's search, not from having paths in the prompt.
Metrics¶
- F2P (Fail-to-Pass): fraction of failing tests the agent made pass
- Resolved: 100% F2P (all tests pass)
- Cost: API cost in USD
- Duration: wall clock time in seconds
- RTFM calls: total rtfm_search + rtfm_expand + rtfm_discover calls
B10 Real Task (Musicology)¶
Task: "Rédige l'article B10" — generate a scholarly article from indexed research notes. Model: Claude Opus 4 — Repo: musicology-phd (Obsidian vault, ~40 articles)
Iterative A→H sessions¶
8 sessions tested progressively, same prompt, fixing RTFM issues between iterations.
Key sessions (single runs, not averaged)¶
| Session | Config | Duration | Cost | Tokens | RTFM calls | Language | Article |
|---|---|---|---|---|---|---|---|
| A | RTFM v1 | 12m00s | ~$13.00 | 3.95M | 18 | ENGLISH (bug) | 36K/8 sections |
| B | No RTFM (baseline) | 8m16s | $22.61 | 8.21M | 0 | French | 31K/10 sections |
| C | RTFM v2 | 11m23s | ~$5.11 | 5.87M | 7 | French | 38.5K/14 sections |
| D | RTFM v2+ | 15m48s | $2.62 | 3.97M | 12 | French | 31.4K/18 sections |
| H | RTFM v3 final | 6m58s | $11.14 | 3.22M | — | French | — |
H vs B (final comparison)¶
| Metric | B (No RTFM) | H (RTFM v3) | Delta |
|---|---|---|---|
| Duration | 8m16s | 6m58s | -16% |
| Cost | $22.61 | $11.14 | -51% |
| Tokens | 8.21M | 3.22M | -61% |
Behavioral Analysis¶
Session A (RTFM v1) — what went wrong:
- 18 RTFM calls: 7 necessary (39%), 6 redundant (33%), 5 useless (28%)
- Indexed English translations (_en/B4.md) overwrote French originals (slug collision bug)
- Agent saw lang: en in results → wrote entire article in English
- Each call injected ~2500 tokens → 45K tokens of RTFM overhead
Session B (No RTFM) — why it's expensive: - 85 tool calls (37 direct + 48 in subagents) for file discovery - 15 dead ends (wrong directories, missing files) — but each was cheap (<1s, ~0 tokens) - Parallel subagents made discovery fast (~2min) - Correct language because no English sources surfaced
Session C (RTFM v2) — progressive disclosure works: - Only 7 RTFM calls (vs 18 in v1) thanks to metadata-only search - ~300 tokens per search result instead of ~2500 - Language correct (FR/EN distinction visible in results) - Still 29 redundant Glob/Read calls alongside RTFM
Session H (RTFM v3 final): - Agent trusts RTFM results, minimal redundant exploration - -51% cost, -16% duration vs baseline
Fixes between iterations¶
- v1→v2: Metadata-only search (300 tokens vs 2500), absolute file paths, language metadata
- v2→v2+: Anti-duplication template, CLI removed, hooks for remember
- v2+→v3: FTS default (no 6min MiniLM cold start), pure data output, "search first" template
Data source¶
- Session transcripts:
~/.claude/projects/musicology-phd/on dev machine - Analysis:
memory/ab_test_b10_analysis.mdin this repo
FeatureBench 4-Condition Study¶
Model: Claude Sonnet 4 (claude-sonnet-4-20250514) — Timeout: 20 min
Platform: FeatureBench (SWE-bench-style Docker), 4 repos, 11 tasks (level 1)
Aggregate Results (Sonnet 4)¶
| Config | Task | N | Duration(s) | Cost($) | Tokens(M) | Turns | RTFM | F2P | Resolved |
|---|---|---|---|---|---|---|---|---|---|
| A | test_stub_generator | 4 | 400 | 1.21 | 2.56 | 52 | 0.0 | N/A | 0/4 |
| A | test_table | 1 | 635 | 2.41 | 5.60 | 95 | 0.0 | 25.6% | 0/1 |
| A | test_validation | 3 | 471 | 1.27 | 2.71 | 47 | 0.0 | N/A | 0/3 |
| B | test_stub_generator | 3 | 460 | 1.32 | 2.81 | 58 | 0.0 | N/A | 0/3 |
| B | test_table | 1 | 737 | 3.27 | 8.34 | 126 | 0.0 | 30.2% | 0/1 |
| B | test_validation | 3 | 482 | 1.36 | 3.00 | 53 | 0.0 | N/A | 0/3 |
| C | test_stub_generator | 9 | 560 | 1.52 | 3.35 | 65 | 3.7 | 100% (2/9) | 2/9 |
| C | test_table | 1 | 823 | 2.89 | 7.01 | 112 | 9.0 | 23.3% | 0/1 |
| C | test_validation | 8 | 583 | 1.65 | 3.72 | 62 | 5.9 | 81.8% (2/8) | 1/8 |
| D | test_stub_generator | 8 | 634 | 1.50 | 3.29 | 63 | 3.6 | 100% (2/8) | 2/8 |
| D | test_table | 1 | 847 | 4.61 | 12.24 | 114 | 32.0 | 30.2% | 0/1 |
| D | test_validation | 8 | 627 | 1.54 | 3.45 | 58 | 7.0 | 59.1% (2/8) | 0/8 |
| A | test_responses_agent | 1 | 1156 | 5.03 | 12.08 | 118 | 0.0 | N/A | 0/1 |
| D | test_responses_agent | 1 | 1013 | 3.84 | 9.12 | 103 | 14.0 | N/A | 0/1 |
Per-Task Analysis¶
test_stub_generator (mlflow, 8259 files)¶
RTFM's strongest result. C and D are the only configs that ever resolve the task.
| Config | Resolved | Resolution rate | Avg cost | Avg duration |
|---|---|---|---|---|
| A | 0/4 | 0% | $1.21 | 400s |
| B | 0/3 | 0% | $1.32 | 460s |
| C | 2/9 | 22% | $1.52 | 560s |
| D | 2/8 | 25% | $1.50 | 634s |
Why RTFM helps: The task requires finding stub_generator.py in a large repo (8259 files). Without RTFM, agents can't locate it within the timeout. RTFM surfaces the relevant file directly.
Why it's not 100%: Even with RTFM pointing to the right file, the code change is non-trivial — the agent must understand the stub generation logic and make the correct modification. 75% of RTFM runs find the file but fail on implementation.
test_validation (mlflow, 8259 files)¶
| Config | Resolved | Best F2P | Avg cost | Avg duration |
|---|---|---|---|---|
| A | 0/3 | N/A | $1.27 | 471s |
| B | 0/3 | N/A | $1.36 | 482s |
| C | 1/8 | 11/11 (100%) | $1.65 | 583s |
| D | 0/8 | 7/11 (64%) | $1.54 | 627s |
C resolves once (11/11 F2P) in 8 runs. D gets close (7/11) but never fully resolves. A/B never produce evaluable patches.
Key insight: FTS (C) outperforms embeddings (D) here. The search terms are exact function/class names — FTS excels at exact matching while embeddings add noise from semantically similar but irrelevant results.
test_table (astropy, 3515 files)¶
| Config | F2P | Cost | Tokens | Duration | RTFM calls |
|---|---|---|---|---|---|
| A | 11/43 (25.6%) | $2.41 | 5.60M | 635s | 0 |
| B | 13/43 (30.2%) | $3.27 | 8.34M | 737s | 0 |
| C | 10/43 (23.3%) | $2.89 | 7.01M | 823s | 9 |
| D | 13/43 (30.2%) | $4.61 | 12.24M | 847s | 32 |
RTFM's weakest result. No config resolves. RTFM adds overhead without improving F2P.
Why RTFM hurts here: This is a medium-sized repo where standard tools (Grep, Glob) work well. The task requires modifying table.py which is easy to find. RTFM adds 9-32 extra calls that consume tokens and time without value. D is particularly bad: 32 RTFM calls, 2x cost of A, same F2P.
Behavioral analysis (from stream logs): - A (95 turns): Focused — finds the file quickly, spends most time on implementation. 70 Read, 73 Bash, 25 Grep. - B (126 turns): Path-stripped, so more exploration. Higher Grep (25) and Read (75) counts. - C (112 turns): 9 RTFM calls early on, then pivots to standard tools. The RTFM results don't add value over what Grep would find instantly. - D (114 turns): 32 RTFM calls throughout — agent keeps going back to RTFM for every question. 103 Read + 59 Grep + 32 RTFM = massive token consumption.
test_responses_agent (mlflow, 8259 files)¶
Only A and D tested (1 run each). Neither resolves. D uses 14 RTFM calls and costs $3.84 vs A's $5.03 — RTFM may help with cost but sample size is too small to conclude.
Sonnet 4.6 Model Comparison¶
Model: Claude Sonnet 4.6 (claude-sonnet-4-6) — Task: test_table (astropy)
Timeout: 20 min (1200s)
Results¶
| Config | Duration | Tools | RTFM | Tokens | Patch | F2P |
|---|---|---|---|---|---|---|
| A | TIMEOUT | 192 | 0 | ~21.9M | 0 bytes | 0% |
| B | TIMEOUT | 185 | 0 | ~22.8M | 0 bytes | 0% |
| C | TIMEOUT | 204 | 9 | ~23.0M | 0 bytes | 0% |
| D | TIMEOUT | 233 | 10 | ~28.8M | 0 bytes | 0% |
Tool Call Breakdown¶
| Tool | A | B | C | D |
|---|---|---|---|---|
| Read | 70 | 75 | 86 | 103 |
| Bash | 73 | 57 | 64 | 37 |
| Grep | 25 | 25 | 19 | 59 |
| Edit | 13 | 20 | 20 | 15 |
| TodoWrite | 6 | 7 | 4 | 6 |
| Agent | 2 | 1 | 1 | 1 |
| Glob | 3 | 0 | 1 | 2 |
| RTFM search | — | — | 4 | 5 |
| RTFM expand | — | — | 4 | 4 |
| RTFM discover | — | — | 1 | 1 |
Sonnet 4.6 vs Sonnet 4 (test_table, all configs)¶
| Metric | Sonnet 4 (avg) | Sonnet 4.6 (all) | Delta |
|---|---|---|---|
| Duration | 760s | 1200s (TIMEOUT) | +58% |
| Tokens | 8.30M | 24.1M | +190% |
| Tool calls | 112 | 204 | +82% |
| F2P | 25.8% | 0% | -100% |
| Patch produced | Yes | No | — |
Sonnet 4.6 is catastrophically worse on this benchmark: - 3x more tokens consumed - All 4 configs TIMEOUT at 20 min with 0% F2P - No config produces any patch (0 bytes) - The model appears to spend more time exploring/reading without converging on a solution
This is not an RTFM-specific finding — all configs fail equally. The model may need different prompt engineering or longer timeouts for SWE-bench-style tasks.
Key Findings¶
Where RTFM helps¶
- Large repos with non-obvious file locations (test_stub_generator)
- A/B: 0% resolution rate → C/D: 22-25% resolution rate
- RTFM is the difference between "never solves" and "sometimes solves"
-
The gain is purely from file discovery — RTFM surfaces relevant files the agent can't find with Grep/Glob in time
-
Real-world multi-source knowledge tasks (B10 article generation)
- -51% cost, -16% duration, -61% tokens vs baseline
- RTFM allows the agent to navigate a complex knowledge base (40+ articles, research notes, drafts) efficiently
-
Without RTFM, the agent resorts to expensive exploration (85 tool calls, 48 in subagents)
-
FTS outperforms embeddings for code tasks
- C (FTS) matches or beats D (FTS+embeddings) on every FeatureBench task
- test_validation: C resolves 1/8, D resolves 0/8
- test_table: D costs 2x more than C with same F2P
- Code search terms are often exact identifiers — FTS is ideal, embeddings add noise
Where RTFM is neutral¶
- test_validation: C resolves 1x in 8 runs (12.5%), vs 0% for A/B/D. The improvement exists but is marginal and could be variance with this sample size.
Where RTFM hurts¶
- Medium repos with obvious file structures (test_table, astropy)
- C: 23.3% F2P vs A: 25.6% — slightly worse
- D: 30.2% F2P but $4.61 vs A: $2.41 — same F2P at 2x cost
- RTFM calls consume tokens and time without aiding discovery
-
Agent's natural tools (Grep for "class Table", Glob for "table.py") are faster and cheaper
-
Embeddings always increase cost, rarely improve results
- D averages +24% cost vs C across all tasks
- D never outperforms C on resolution rate
- The 6-min MiniLM cold start (eliminated in v3 by defaulting to FTS) was a critical blocker
- Even post-fix, embedding search adds ~20% more RTFM calls (agent queries more because semantic results feel "helpful")
Limitations¶
Methodological¶
- Small sample sizes: Most task/config combinations have 1-9 runs. Variance is high — a single lucky/unlucky run shifts averages significantly.
- Single model: All FeatureBench runs use Sonnet 4. Results may not generalize to Opus or other models.
- Limited task diversity: Only 4 unique tasks with full ABCD data. The B10 task is fundamentally different from FeatureBench tasks (knowledge generation vs. bug fixing).
- No statistical tests: With N=1-9 per condition, we can't compute meaningful p-values. All comparisons should be treated as directional, not conclusive.
Tool-specific¶
- RTFM overhead scales with repo familiarity: On repos the agent already "knows" (common patterns, obvious file names), RTFM is pure overhead. The benefit emerges only when files are hard to find.
- Embedding search (D) consistently underperforms FTS (C): For code tasks where search terms are identifiers, semantic similarity is counterproductive.
- Agent calibration problem: The agent sometimes over-relies on RTFM (D on test_table: 32 calls) or under-relies on it. The CLAUDE.md template helps but doesn't fully control behavior.
Raw Data References¶
Files on PC2 (roomi@192.168.1.28)¶
| Path | Description |
|---|---|
~/projects/FeatureBench/reports/benchmark/metrics.jsonl |
All benchmark metrics (64 entries, all runs) |
~/projects/FeatureBench/runs/ |
Sonnet 4 run outputs (A/B/C/D for all tasks) |
~/projects/FeatureBench/runs_s46/ |
Sonnet 4.6 run outputs (test_table only) |
~/projects/FeatureBench/run_benchmark.sh |
Main benchmark script |
/mnt/data/rtfm-dbs/<repo>/library.db |
Pre-generated RTFM databases per repo |
Stream log format¶
Each run contains claude_code_stream_output.jsonl with events:
{"type": "assistant", "message": {"content": [{"type": "tool_use", "name": "Read", "input": {...}}]}}
{"type": "result", "result": {"usage": {...}, "costUSD": 1.23, "numTurns": 45}}
Files in this repo¶
| Path | Description |
|---|---|
paper/benchmark_results.md |
This file |
paper/ab_test_b10_analysis.md |
Detailed B10 session analysis (A→D) |
paper/benchmark_paper.md |
Paper plan (EMSE) |
paper/BACKLOG.md |
Research backlog |
Local cache¶
| Path | Description |
|---|---|
/tmp/metrics_raw.jsonl |
Copy of metrics.jsonl downloaded from PC2 |
Per-Run Raw Data¶
test_stub_generator — All runs (Sonnet 4)¶
| Config | Run | Duration | Cost | Tokens | Turns | RTFM | F2P | Resolved |
|---|---|---|---|---|---|---|---|---|
| A | r1 | 370s | $0.97 | 1.84M | 43 | 0 | — | No |
| A | r1 | 580s | $1.96 | 4.29M | 84 | 0 | — | No |
| A | r2 | 583s | $1.82 | 4.02M | 77 | 0 | — | No |
| B | r1 | 395s | $1.07 | 2.21M | 52 | 0 | — | No |
| B | r1 | 462s | $1.37 | 2.99M | 64 | 0 | — | No |
| B | r2 | 523s | $1.52 | 3.23M | 59 | 0 | — | No |
| C | r1 | 454s | $1.30 | 2.78M | 59 | 2 | — | No |
| C | r1 | 535s | $1.63 | 3.64M | 66 | 4 | — | No |
| C | r1 | 486s | $1.48 | 3.35M | 67 | 3 | — | No |
| C | r1 | 491s | $1.29 | 2.72M | 58 | 2 | — | No |
| C | r1 | 551s | $1.37 | 3.00M | 46 | 6 | — | No |
| C | r1 | 750s | $1.79 | 3.94M | 72 | 2 | — | No |
| C | r1 | 529s | $1.29 | 2.69M | 58 | 5 | 31/31 | Yes |
| C | r2 | 702s | $2.06 | 4.89M | 89 | 3 | — | No |
| C | r2 | 540s | $1.46 | 3.20M | 66 | 6 | 31/31 | Yes |
| D | r1 | 541s | $1.44 | 3.20M | 68 | 1 | — | No |
| D | r1 | 523s | $0.86 | 1.65M | 37 | 3 | — | No |
| D | r1 | 535s | $1.31 | 2.92M | 54 | 5 | — | No |
| D | r1 | 859s | $2.05 | 4.61M | 92 | 2 | — | No |
| D | r1 | 670s | $1.48 | 3.35M | 57 | 7 | — | No |
| D | r1 | 584s | $1.51 | 3.26M | 66 | 3 | 31/31 | Yes |
| D | r2 | 659s | $1.54 | 3.30M | 58 | 8 | — | No |
| D | r2 | 699s | $1.83 | 4.01M | 70 | 0 | 31/31 | Yes |
test_validation — All runs (Sonnet 4)¶
| Config | Run | Duration | Cost | Tokens | Turns | RTFM | F2P | Resolved |
|---|---|---|---|---|---|---|---|---|
| A | r1 | 479s | $1.12 | 2.41M | 45 | 0 | — | No |
| A | r1 | 440s | $1.18 | 2.41M | 42 | 0 | — | No |
| A | r2 | 493s | $1.51 | 3.31M | 55 | 0 | — | No |
| B | r1 | 459s | $1.50 | 3.38M | 54 | 0 | — | No |
| B | r1 | 509s | $1.28 | 2.82M | 56 | 0 | — | No |
| B | r2 | 477s | $1.30 | 2.81M | 48 | 0 | — | No |
| C | r1 | 732s | $2.42 | 5.74M | 82 | 5 | — | No |
| C | r1 | 516s | $1.57 | 3.52M | 66 | 16 | — | No |
| C | r1 | 722s | $2.23 | 5.27M | 77 | 3 | — | No |
| C | r1 | 495s | $1.41 | 3.13M | 56 | 10 | — | No |
| C | r1 | 618s | $1.24 | 2.50M | 53 | 6 | 11/11 | Yes |
| C | r2 | 541s | $1.57 | 3.57M | 60 | 2 | — | No |
| C | r2 | 590s | $1.43 | 3.17M | 55 | 2 | 7/11 | No |
| C | r3 | 450s | $1.31 | 2.83M | 47 | 3 | — | No |
| D | r1 | 605s | $1.33 | 2.97M | 51 | 2 | — | No |
| D | r1 | 684s | $1.58 | 3.67M | 61 | 11 | — | No |
| D | r1 | 651s | $1.68 | 3.71M | 59 | 1 | — | No |
| D | r1 | 578s | $1.19 | 2.44M | 44 | 3 | — | No |
| D | r1 | 784s | $2.08 | 4.90M | 80 | 20 | 6/11 | No |
| D | r2 | 522s | $1.46 | 3.28M | 53 | 8 | — | No |
| D | r2 | 673s | $1.75 | 3.99M | 65 | 6 | 7/11 | No |
| D | r3 | 518s | $1.23 | 2.64M | 47 | 5 | — | No |
test_table — All runs (Sonnet 4)¶
| Config | Duration | Cost | Tokens | Turns | RTFM | F2P | Resolved |
|---|---|---|---|---|---|---|---|
| A | 635s | $2.41 | 5.60M | 95 | 0 | 11/43 (25.6%) | No |
| B | 737s | $3.27 | 8.34M | 126 | 0 | 13/43 (30.2%) | No |
| C | 823s | $2.89 | 7.01M | 112 | 9 | 10/43 (23.3%) | No |
| D | 847s | $4.61 | 12.24M | 114 | 32 | 13/43 (30.2%) | No |
test_table — Sonnet 4.6 (all TIMEOUT at 1200s)¶
| Config | Tools | RTFM | Read | Bash | Grep | Edit | Patch |
|---|---|---|---|---|---|---|---|
| A | 192 | 0 | 70 | 73 | 25 | 13 | 0 bytes |
| B | 185 | 0 | 75 | 57 | 25 | 20 | 0 bytes |
| C | 204 | 9 | 86 | 64 | 19 | 20 | 0 bytes |
| D | 233 | 10 | 103 | 37 | 59 | 15 | 0 bytes |
test_responses_agent (Sonnet 4, 1 run each)¶
| Config | Duration | Cost | Tokens | Turns | RTFM | Resolved |
|---|---|---|---|---|---|---|
| A | 1156s | $5.03 | 12.08M | 118 | 0 | No |
| D | 1013s | $3.84 | 9.12M | 103 | 14 | No |