Skip to content

RTFM Benchmark Results — Complete Analysis

Date: 2026-03-02 Version: RTFM v0.3.1

Table of Contents

  1. Study Design
  2. B10 Real Task (Musicology)
  3. FeatureBench 4-Condition Study
  4. Sonnet 4.6 Model Comparison
  5. Key Findings
  6. Limitations
  7. Raw Data References

Study Design

Configurations

Config Description RTFM File paths
A Standard prompt No Visible
B Discovery baseline No Stripped from prompt
C RTFM FTS Yes (FTS5) Stripped (agent must discover via RTFM)
D RTFM + Embeddings Yes (FTS5 + semantic) Stripped (agent must discover via RTFM)

What B vs C/D tests

B strips file paths from prompts like C/D but has no RTFM. This isolates the RTFM effect from the path-stripping effect. If C outperforms B, the gain comes from RTFM's search, not from having paths in the prompt.

Metrics

  • F2P (Fail-to-Pass): fraction of failing tests the agent made pass
  • Resolved: 100% F2P (all tests pass)
  • Cost: API cost in USD
  • Duration: wall clock time in seconds
  • RTFM calls: total rtfm_search + rtfm_expand + rtfm_discover calls

B10 Real Task (Musicology)

Task: "Rédige l'article B10" — generate a scholarly article from indexed research notes. Model: Claude Opus 4 — Repo: musicology-phd (Obsidian vault, ~40 articles)

Iterative A→H sessions

8 sessions tested progressively, same prompt, fixing RTFM issues between iterations.

Key sessions (single runs, not averaged)

Session Config Duration Cost Tokens RTFM calls Language Article
A RTFM v1 12m00s ~$13.00 3.95M 18 ENGLISH (bug) 36K/8 sections
B No RTFM (baseline) 8m16s $22.61 8.21M 0 French 31K/10 sections
C RTFM v2 11m23s ~$5.11 5.87M 7 French 38.5K/14 sections
D RTFM v2+ 15m48s $2.62 3.97M 12 French 31.4K/18 sections
H RTFM v3 final 6m58s $11.14 3.22M French

H vs B (final comparison)

Metric B (No RTFM) H (RTFM v3) Delta
Duration 8m16s 6m58s -16%
Cost $22.61 $11.14 -51%
Tokens 8.21M 3.22M -61%

Behavioral Analysis

Session A (RTFM v1) — what went wrong: - 18 RTFM calls: 7 necessary (39%), 6 redundant (33%), 5 useless (28%) - Indexed English translations (_en/B4.md) overwrote French originals (slug collision bug) - Agent saw lang: en in results → wrote entire article in English - Each call injected ~2500 tokens → 45K tokens of RTFM overhead

Session B (No RTFM) — why it's expensive: - 85 tool calls (37 direct + 48 in subagents) for file discovery - 15 dead ends (wrong directories, missing files) — but each was cheap (<1s, ~0 tokens) - Parallel subagents made discovery fast (~2min) - Correct language because no English sources surfaced

Session C (RTFM v2) — progressive disclosure works: - Only 7 RTFM calls (vs 18 in v1) thanks to metadata-only search - ~300 tokens per search result instead of ~2500 - Language correct (FR/EN distinction visible in results) - Still 29 redundant Glob/Read calls alongside RTFM

Session H (RTFM v3 final): - Agent trusts RTFM results, minimal redundant exploration - -51% cost, -16% duration vs baseline

Fixes between iterations

  1. v1→v2: Metadata-only search (300 tokens vs 2500), absolute file paths, language metadata
  2. v2→v2+: Anti-duplication template, CLI removed, hooks for remember
  3. v2+→v3: FTS default (no 6min MiniLM cold start), pure data output, "search first" template

Data source

  • Session transcripts: ~/.claude/projects/musicology-phd/ on dev machine
  • Analysis: memory/ab_test_b10_analysis.md in this repo

FeatureBench 4-Condition Study

Model: Claude Sonnet 4 (claude-sonnet-4-20250514) — Timeout: 20 min Platform: FeatureBench (SWE-bench-style Docker), 4 repos, 11 tasks (level 1)

Aggregate Results (Sonnet 4)

Config Task N Duration(s) Cost($) Tokens(M) Turns RTFM F2P Resolved
A test_stub_generator 4 400 1.21 2.56 52 0.0 N/A 0/4
A test_table 1 635 2.41 5.60 95 0.0 25.6% 0/1
A test_validation 3 471 1.27 2.71 47 0.0 N/A 0/3
B test_stub_generator 3 460 1.32 2.81 58 0.0 N/A 0/3
B test_table 1 737 3.27 8.34 126 0.0 30.2% 0/1
B test_validation 3 482 1.36 3.00 53 0.0 N/A 0/3
C test_stub_generator 9 560 1.52 3.35 65 3.7 100% (2/9) 2/9
C test_table 1 823 2.89 7.01 112 9.0 23.3% 0/1
C test_validation 8 583 1.65 3.72 62 5.9 81.8% (2/8) 1/8
D test_stub_generator 8 634 1.50 3.29 63 3.6 100% (2/8) 2/8
D test_table 1 847 4.61 12.24 114 32.0 30.2% 0/1
D test_validation 8 627 1.54 3.45 58 7.0 59.1% (2/8) 0/8
A test_responses_agent 1 1156 5.03 12.08 118 0.0 N/A 0/1
D test_responses_agent 1 1013 3.84 9.12 103 14.0 N/A 0/1

Per-Task Analysis

test_stub_generator (mlflow, 8259 files)

RTFM's strongest result. C and D are the only configs that ever resolve the task.

Config Resolved Resolution rate Avg cost Avg duration
A 0/4 0% $1.21 400s
B 0/3 0% $1.32 460s
C 2/9 22% $1.52 560s
D 2/8 25% $1.50 634s

Why RTFM helps: The task requires finding stub_generator.py in a large repo (8259 files). Without RTFM, agents can't locate it within the timeout. RTFM surfaces the relevant file directly.

Why it's not 100%: Even with RTFM pointing to the right file, the code change is non-trivial — the agent must understand the stub generation logic and make the correct modification. 75% of RTFM runs find the file but fail on implementation.

test_validation (mlflow, 8259 files)

Config Resolved Best F2P Avg cost Avg duration
A 0/3 N/A $1.27 471s
B 0/3 N/A $1.36 482s
C 1/8 11/11 (100%) $1.65 583s
D 0/8 7/11 (64%) $1.54 627s

C resolves once (11/11 F2P) in 8 runs. D gets close (7/11) but never fully resolves. A/B never produce evaluable patches.

Key insight: FTS (C) outperforms embeddings (D) here. The search terms are exact function/class names — FTS excels at exact matching while embeddings add noise from semantically similar but irrelevant results.

test_table (astropy, 3515 files)

Config F2P Cost Tokens Duration RTFM calls
A 11/43 (25.6%) $2.41 5.60M 635s 0
B 13/43 (30.2%) $3.27 8.34M 737s 0
C 10/43 (23.3%) $2.89 7.01M 823s 9
D 13/43 (30.2%) $4.61 12.24M 847s 32

RTFM's weakest result. No config resolves. RTFM adds overhead without improving F2P.

Why RTFM hurts here: This is a medium-sized repo where standard tools (Grep, Glob) work well. The task requires modifying table.py which is easy to find. RTFM adds 9-32 extra calls that consume tokens and time without value. D is particularly bad: 32 RTFM calls, 2x cost of A, same F2P.

Behavioral analysis (from stream logs): - A (95 turns): Focused — finds the file quickly, spends most time on implementation. 70 Read, 73 Bash, 25 Grep. - B (126 turns): Path-stripped, so more exploration. Higher Grep (25) and Read (75) counts. - C (112 turns): 9 RTFM calls early on, then pivots to standard tools. The RTFM results don't add value over what Grep would find instantly. - D (114 turns): 32 RTFM calls throughout — agent keeps going back to RTFM for every question. 103 Read + 59 Grep + 32 RTFM = massive token consumption.

test_responses_agent (mlflow, 8259 files)

Only A and D tested (1 run each). Neither resolves. D uses 14 RTFM calls and costs $3.84 vs A's $5.03 — RTFM may help with cost but sample size is too small to conclude.


Sonnet 4.6 Model Comparison

Model: Claude Sonnet 4.6 (claude-sonnet-4-6) — Task: test_table (astropy) Timeout: 20 min (1200s)

Results

Config Duration Tools RTFM Tokens Patch F2P
A TIMEOUT 192 0 ~21.9M 0 bytes 0%
B TIMEOUT 185 0 ~22.8M 0 bytes 0%
C TIMEOUT 204 9 ~23.0M 0 bytes 0%
D TIMEOUT 233 10 ~28.8M 0 bytes 0%

Tool Call Breakdown

Tool A B C D
Read 70 75 86 103
Bash 73 57 64 37
Grep 25 25 19 59
Edit 13 20 20 15
TodoWrite 6 7 4 6
Agent 2 1 1 1
Glob 3 0 1 2
RTFM search 4 5
RTFM expand 4 4
RTFM discover 1 1

Sonnet 4.6 vs Sonnet 4 (test_table, all configs)

Metric Sonnet 4 (avg) Sonnet 4.6 (all) Delta
Duration 760s 1200s (TIMEOUT) +58%
Tokens 8.30M 24.1M +190%
Tool calls 112 204 +82%
F2P 25.8% 0% -100%
Patch produced Yes No

Sonnet 4.6 is catastrophically worse on this benchmark: - 3x more tokens consumed - All 4 configs TIMEOUT at 20 min with 0% F2P - No config produces any patch (0 bytes) - The model appears to spend more time exploring/reading without converging on a solution

This is not an RTFM-specific finding — all configs fail equally. The model may need different prompt engineering or longer timeouts for SWE-bench-style tasks.


Key Findings

Where RTFM helps

  1. Large repos with non-obvious file locations (test_stub_generator)
  2. A/B: 0% resolution rate → C/D: 22-25% resolution rate
  3. RTFM is the difference between "never solves" and "sometimes solves"
  4. The gain is purely from file discovery — RTFM surfaces relevant files the agent can't find with Grep/Glob in time

  5. Real-world multi-source knowledge tasks (B10 article generation)

  6. -51% cost, -16% duration, -61% tokens vs baseline
  7. RTFM allows the agent to navigate a complex knowledge base (40+ articles, research notes, drafts) efficiently
  8. Without RTFM, the agent resorts to expensive exploration (85 tool calls, 48 in subagents)

  9. FTS outperforms embeddings for code tasks

  10. C (FTS) matches or beats D (FTS+embeddings) on every FeatureBench task
  11. test_validation: C resolves 1/8, D resolves 0/8
  12. test_table: D costs 2x more than C with same F2P
  13. Code search terms are often exact identifiers — FTS is ideal, embeddings add noise

Where RTFM is neutral

  1. test_validation: C resolves 1x in 8 runs (12.5%), vs 0% for A/B/D. The improvement exists but is marginal and could be variance with this sample size.

Where RTFM hurts

  1. Medium repos with obvious file structures (test_table, astropy)
  2. C: 23.3% F2P vs A: 25.6% — slightly worse
  3. D: 30.2% F2P but $4.61 vs A: $2.41 — same F2P at 2x cost
  4. RTFM calls consume tokens and time without aiding discovery
  5. Agent's natural tools (Grep for "class Table", Glob for "table.py") are faster and cheaper

  6. Embeddings always increase cost, rarely improve results

  7. D averages +24% cost vs C across all tasks
  8. D never outperforms C on resolution rate
  9. The 6-min MiniLM cold start (eliminated in v3 by defaulting to FTS) was a critical blocker
  10. Even post-fix, embedding search adds ~20% more RTFM calls (agent queries more because semantic results feel "helpful")

Limitations

Methodological

  • Small sample sizes: Most task/config combinations have 1-9 runs. Variance is high — a single lucky/unlucky run shifts averages significantly.
  • Single model: All FeatureBench runs use Sonnet 4. Results may not generalize to Opus or other models.
  • Limited task diversity: Only 4 unique tasks with full ABCD data. The B10 task is fundamentally different from FeatureBench tasks (knowledge generation vs. bug fixing).
  • No statistical tests: With N=1-9 per condition, we can't compute meaningful p-values. All comparisons should be treated as directional, not conclusive.

Tool-specific

  • RTFM overhead scales with repo familiarity: On repos the agent already "knows" (common patterns, obvious file names), RTFM is pure overhead. The benefit emerges only when files are hard to find.
  • Embedding search (D) consistently underperforms FTS (C): For code tasks where search terms are identifiers, semantic similarity is counterproductive.
  • Agent calibration problem: The agent sometimes over-relies on RTFM (D on test_table: 32 calls) or under-relies on it. The CLAUDE.md template helps but doesn't fully control behavior.

Raw Data References

Files on PC2 (roomi@192.168.1.28)

Path Description
~/projects/FeatureBench/reports/benchmark/metrics.jsonl All benchmark metrics (64 entries, all runs)
~/projects/FeatureBench/runs/ Sonnet 4 run outputs (A/B/C/D for all tasks)
~/projects/FeatureBench/runs_s46/ Sonnet 4.6 run outputs (test_table only)
~/projects/FeatureBench/run_benchmark.sh Main benchmark script
/mnt/data/rtfm-dbs/<repo>/library.db Pre-generated RTFM databases per repo

Stream log format

Each run contains claude_code_stream_output.jsonl with events:

{"type": "assistant", "message": {"content": [{"type": "tool_use", "name": "Read", "input": {...}}]}}
{"type": "result", "result": {"usage": {...}, "costUSD": 1.23, "numTurns": 45}}

Files in this repo

Path Description
paper/benchmark_results.md This file
paper/ab_test_b10_analysis.md Detailed B10 session analysis (A→D)
paper/benchmark_paper.md Paper plan (EMSE)
paper/BACKLOG.md Research backlog

Local cache

Path Description
/tmp/metrics_raw.jsonl Copy of metrics.jsonl downloaded from PC2

Per-Run Raw Data

test_stub_generator — All runs (Sonnet 4)

Config Run Duration Cost Tokens Turns RTFM F2P Resolved
A r1 370s $0.97 1.84M 43 0 No
A r1 580s $1.96 4.29M 84 0 No
A r2 583s $1.82 4.02M 77 0 No
B r1 395s $1.07 2.21M 52 0 No
B r1 462s $1.37 2.99M 64 0 No
B r2 523s $1.52 3.23M 59 0 No
C r1 454s $1.30 2.78M 59 2 No
C r1 535s $1.63 3.64M 66 4 No
C r1 486s $1.48 3.35M 67 3 No
C r1 491s $1.29 2.72M 58 2 No
C r1 551s $1.37 3.00M 46 6 No
C r1 750s $1.79 3.94M 72 2 No
C r1 529s $1.29 2.69M 58 5 31/31 Yes
C r2 702s $2.06 4.89M 89 3 No
C r2 540s $1.46 3.20M 66 6 31/31 Yes
D r1 541s $1.44 3.20M 68 1 No
D r1 523s $0.86 1.65M 37 3 No
D r1 535s $1.31 2.92M 54 5 No
D r1 859s $2.05 4.61M 92 2 No
D r1 670s $1.48 3.35M 57 7 No
D r1 584s $1.51 3.26M 66 3 31/31 Yes
D r2 659s $1.54 3.30M 58 8 No
D r2 699s $1.83 4.01M 70 0 31/31 Yes

test_validation — All runs (Sonnet 4)

Config Run Duration Cost Tokens Turns RTFM F2P Resolved
A r1 479s $1.12 2.41M 45 0 No
A r1 440s $1.18 2.41M 42 0 No
A r2 493s $1.51 3.31M 55 0 No
B r1 459s $1.50 3.38M 54 0 No
B r1 509s $1.28 2.82M 56 0 No
B r2 477s $1.30 2.81M 48 0 No
C r1 732s $2.42 5.74M 82 5 No
C r1 516s $1.57 3.52M 66 16 No
C r1 722s $2.23 5.27M 77 3 No
C r1 495s $1.41 3.13M 56 10 No
C r1 618s $1.24 2.50M 53 6 11/11 Yes
C r2 541s $1.57 3.57M 60 2 No
C r2 590s $1.43 3.17M 55 2 7/11 No
C r3 450s $1.31 2.83M 47 3 No
D r1 605s $1.33 2.97M 51 2 No
D r1 684s $1.58 3.67M 61 11 No
D r1 651s $1.68 3.71M 59 1 No
D r1 578s $1.19 2.44M 44 3 No
D r1 784s $2.08 4.90M 80 20 6/11 No
D r2 522s $1.46 3.28M 53 8 No
D r2 673s $1.75 3.99M 65 6 7/11 No
D r3 518s $1.23 2.64M 47 5 No

test_table — All runs (Sonnet 4)

Config Duration Cost Tokens Turns RTFM F2P Resolved
A 635s $2.41 5.60M 95 0 11/43 (25.6%) No
B 737s $3.27 8.34M 126 0 13/43 (30.2%) No
C 823s $2.89 7.01M 112 9 10/43 (23.3%) No
D 847s $4.61 12.24M 114 32 13/43 (30.2%) No

test_table — Sonnet 4.6 (all TIMEOUT at 1200s)

Config Tools RTFM Read Bash Grep Edit Patch
A 192 0 70 73 25 13 0 bytes
B 185 0 75 57 25 20 0 bytes
C 204 9 86 64 19 20 0 bytes
D 233 10 103 37 59 15 0 bytes

test_responses_agent (Sonnet 4, 1 run each)

Config Duration Cost Tokens Turns RTFM Resolved
A 1156s $5.03 12.08M 118 0 No
D 1013s $3.84 9.12M 103 14 No