A/B Test: Writing B10 with vs without RTFM
Date: 2026-02-21
Sessions analyzed
- Session A (with RTFM): 66c7e81b-00e4-4bde-a9e4-d70e2cbb8a3d in musicology-phd
- Session B (without RTFM): 586966f3-50e0-41b6-be1e-1b7b9dc449be in musicology-phd
- Identical prompt, given in French: "j'aimerai que tu me fasses la redaction de l'article B10" ("I would like you to write article B10")
Overall results
| Metric | Session A (RTFM) | Session B (without) | Delta |
| --- | --- | --- | --- |
| Duration | 12 min | 8 min 16s | +45% |
| Total tokens | 3.95M | 1.70M | +132% |
| Cost | ~$13 | ~$6 | +117% |
| Direct tool calls | 38 | 37 | +3% |
| Total tool calls (subagents) | 38 | 85 | -55% |
| Output language | English (WRONG) | French (correct) | regression |
| Article size | 36K chars, 8 sections | 31K chars, 10 sections, glossary | |
Finding: RTFM performs worse on all 4 axes (duration, tokens, cost, quality).
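The Delta column can be reproduced from the raw figures. A minimal sketch, assuming Delta means the relative change of Session A against Session B as the baseline; the helper name pct_delta is ours and not part of RTFM:

```python
def pct_delta(a: float, b: float) -> int:
    """Relative change of a versus baseline b, as a rounded percentage."""
    return round((a - b) / b * 100)

# Figures copied from the results table: (Session A, Session B).
metrics = {
    "duration_s": (12 * 60, 8 * 60 + 16),
    "tokens_millions": (3.95, 1.70),
    "cost_usd": (13, 6),
    "direct_tool_calls": (38, 37),
    "total_tool_calls": (38, 85),
}

for name, (a, b) in metrics.items():
    print(f"{name}: {pct_delta(a, b):+d}%")
```

The cost figures are themselves approximate (~$13, ~$6), so the +117% is only as precise as those inputs.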
Behavioral analysis Session A (with RTFM)
RTFM calls: 18 total (10 search + 8 expand)
| Category | Count | % |
| --- | --- | --- |
| Necessary (info used) | 7 | 39% |
| Redundant | 6 | 33% |
| Useless/poorly targeted | 5 | 28% |
Necessary calls in detail:
- rtfm_expand(kanban, "B10") → confirmed B10 = "EBNF of BP3"
- rtfm_expand(b10_ebnf_bp3, ...) → existing draft/spec
- rtfm_search("B8 three directions") → B8 in English (caused the language bug)
- rtfm_search("B4 flags weights") → B4 in English (caused the language bug)
- rtfm_expand(paper1, "EBNF Appendix") → main EBNF content (the real contribution)
- rtfm_expand(l3_ebnf, "notation") → EBNF variants (content not on disk)
- rtfm_expand(b8, "article style") → TikZ style and layout
Redundant calls:
- 3 initial searches to find B10 (Kanban was sufficient)
- Search for BLOG_STRATEGY while it was read in parallel via Read
- Re-expand of B4 (same content as before)
- Search for the MOC when a Glob would have sufficed
Critical problem: language
- RTFM indexed articles B4 and B8 from the _en/ directory (English translations)
- Corpus published = 39 articles, ALL with lang: en
- The agent saw lang: en in the metadata and concluded "the B articles are in English"
- The existing B10 draft was in French → ignored
- Root cause: an indexing bug. FR and EN share the same book_slug, so EN overwrites FR
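The overwrite can be pictured as two documents written under the same key. A minimal sketch, assuming the index behaves like a mapping keyed by book_slug and that the EN file is indexed after the FR one. The slug value b4_flags comes from the fix notes later in this report; the dictionary and the index_document function are illustrative stand-ins, not RTFM's actual storage code:

```python
# Hypothetical in-memory stand-in for the index: one entry per book_slug.
index: dict[str, dict[str, str]] = {}

def index_document(slug: str, path: str, lang: str) -> None:
    """Store a document under its slug; a later write with the same slug wins."""
    index[slug] = {"path": path, "lang": lang}

# FR original and EN translation share the slug, so the second write replaces the first.
index_document("b4_flags", "B4.md", "fr")
index_document("b4_flags", "_en/B4.md", "en")

print(index["b4_flags"]["lang"])  # en
print(len(index))                 # 1
```

With only the EN entry left, every result the agent sees for B4 carries lang: en, which matches the behavior observed in Session A.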
Cost per RTFM call
Each search/expand injects ~2000-3000 tokens into the context (content + metadata + hints).
18 calls × ~2500 tokens = ~45000 tokens of RTFM context alone.
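The same estimate with the stated 2000-3000 token range carried through as bounds. A minimal sketch; the variable names are ours:

```python
RTFM_CALLS = 18                       # 10 search + 8 expand in Session A
TOKENS_PER_CALL = (2000, 2500, 3000)  # low, midpoint, high estimate per call

low, mid, high = (RTFM_CALLS * t for t in TOKENS_PER_CALL)
print(low, mid, high)  # 36000 45000 54000
```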
rtfm_remember: never called
The instruction existed in CLAUDE.md but the agent ignored it. The learned corpus was empty.
Behavioral analysis Session B (without RTFM)
| Category | Count | % |
| --- | --- | --- |
| Necessary | 17 | 46% |
| Dead ends | 15 | 41% |
| Redundant | 3 | 8% |
| Partially useful | 2 | 5% |
Strategy: parallelization via subagents
- Subagent 1 (general-purpose, 2m14s, 22 calls): reading R2 + Paper 1
- Subagent 2 (Explore, 1m04s, 26 calls): search for published articles + style
- Both in parallel → heavy research in ~2min
- The RTFM agent did all its searches sequentially
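Running the two subagents concurrently bounds the research phase by the slower one rather than by their sum. A minimal sketch using the two subagent durations above; it ignores scheduling overhead, which the session log does not report:

```python
# Subagent wall-clock durations from Session B, in seconds.
subagent_1 = 2 * 60 + 14   # general-purpose
subagent_2 = 1 * 60 + 4    # Explore

parallel = max(subagent_1, subagent_2)   # both run at once
sequential = subagent_1 + subagent_2     # one after the other

print(parallel, sequential)  # 134 198
```

The parallel figure (134 s) is the "~2min" quoted above; run back to back, the same work would have taken 198 s.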
Dead ends (15 calls, but FAST and FREE)
- 5 Globs in the wrong directory (git repo instead of Obsidian vault)
- 8 Globs for online published files (B1-B8, L3 not in the vault)
- 3 Reads with wrong filename (Formalisme vs Formalisation)
- Cost: <1s and 0 context tokens each
Correct language: why?
- B9 (only complete article found on disk) is in French
- Project CLAUDE.md says "Writing language: French"
- The B10 draft was in French
- Without RTFM, the agent didn't see the English translations → no confusion
Root cause diagnosis
| Cause | Impact | Detail |
| --- | --- | --- |
| RTFM calls expensive in tokens | +132% tokens | Each call injects ~2500 tokens vs ~0 for a failed Glob |
| No RTFM parallelization | +45% duration | MCP = main agent only, no subagents |
| RTFM surfaced misleading content | Quality ↓ | EN articles caused incorrect language choice |
| Slug collision FR/EN bug | Quality ↓ | _en/B4.md overwrites B4.md because same slug |
| rtfm_remember not used | Value = 0 | The only unique feature did not work |
| Instructions too aggressive | Overhead | "NEVER Grep/Glob" forced RTFM for everything |
Fixes implemented (this session)
1. Metadata-only results (DONE)
- rtfm_search and rtfm_context return ONLY metadata (title, file, score, chunk count, lang)
- Zero content in level-0 results
- ~300 tokens for 5 results instead of ~2500
- Content is read via rtfm_expand only when necessary
2. File paths in results (DONE)
- Each result shows file: path/to/file.md
- Eliminates redundant Globs to locate files
3. Language in results (DONE)
- Each result shows lang: en or lang: fr
- The agent can distinguish FR/EN versions
4. CLI context and expand for subagents (DONE)
- rtfm --db .rtfm/library.db context "subject" via Bash
- rtfm --db .rtfm/library.db expand "slug" "query" via Bash
- Task/Explore subagents can use RTFM
5. CLAUDE.md template rewritten (DONE)
- No more "NEVER Grep/Glob"
- RTFM = knowledge/memory, Grep/Glob = code editing
- Subagents allowed via CLI
- Remember = MANDATORY dedicated section
6. Active reminder for remember (DONE)
- When the learned corpus is empty, search/context results display a reminder
- "⚠ Learned corpus is empty. Use rtfm_remember()"
7. Fix slug collision (IN PROGRESS)
- _path_to_slug() includes the parent directory in the slug
- _en/B4.md → en--b4_flags vs B4.md → b4_flags
- Sync passes the slug to the parser via metadata
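One way to read fix 7 is as a directory prefix added to the base slug. A minimal sketch, assuming the base slug (b4_flags) is computed elsewhere, that leading underscores are stripped from the directory name, and that -- is the separator, as the en--b4_flags example suggests. The function below is a hypothetical stand-in; the real _path_to_slug() is not shown in this report:

```python
from pathlib import PurePosixPath

def path_to_slug_sketch(path: str, base_slug: str) -> str:
    """Prefix the base slug with the parent directory name, if the file has one."""
    parent = PurePosixPath(path).parent.name.lstrip("_")
    return f"{parent}--{base_slug}" if parent else base_slug

print(path_to_slug_sketch("_en/B4.md", "b4_flags"))  # en--b4_flags
print(path_to_slug_sketch("B4.md", "b4_flags"))      # b4_flags
```

Because the two paths now map to different slugs, the EN translation can no longer replace the FR original in the index.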
Tests: 33/33 MCP tests pass
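The metadata-only, file-path, language, and reminder fixes together define what a level-0 result looks like. A minimal sketch, assuming a result carries exactly the five metadata fields listed above. The SearchHit class, its field names, the sample scores and chunk counts, and the one-line layout are illustrative and not RTFM's actual output format:

```python
from dataclasses import dataclass

REMINDER = "⚠ Learned corpus is empty. Use rtfm_remember()"

@dataclass
class SearchHit:
    """Metadata-only result: no document content is included."""
    title: str
    file: str
    score: float
    chunks: int
    lang: str

def render(hits: list[SearchHit], learned_corpus_size: int) -> str:
    """Format hits one per line; append the reminder when nothing has been learned yet."""
    lines = [
        f"{h.title} | file: {h.file} | score: {h.score:.2f} | chunks: {h.chunks} | lang: {h.lang}"
        for h in hits
    ]
    if learned_corpus_size == 0:
        lines.append(REMINDER)
    return "\n".join(lines)

# Placeholder values, for illustration only.
hits = [
    SearchHit("B4", "_en/B4.md", 0.80, 12, "en"),
    SearchHit("B4", "B4.md", 0.79, 12, "fr"),
]
print(render(hits, learned_corpus_size=0))
```

Keeping each hit to a single line of metadata is what makes the drop from ~2500 to ~300 tokens plausible: the agent pays for content only when it calls rtfm_expand.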
Session C: RTFM v2 post-fixes (2026-02-21)
- Session C (RTFM v2): 268c7e8f-591a-4b52-ae59-10e668fb7d5b in musicology-phd
- Identical prompt: "j'aimerai que tu me fasses la rédaction de l'article B10"
Session C results
| Metric | Value |
| --- | --- |
| Duration | 11 min 23s |
| Total tokens | 5.87M (of which 4.91M cache read) |
| Estimated cost | ~$5.11 |
| Tool calls | 64 (37 main + 27 subagent) |
| RTFM calls | 7 (3 context + 3 expand + 1 remember) |
| Language | French (correct) |
| Article | 38.5K chars, 14 sections (including glossary + refs) |
v2 fixes validated
- Correct language (FR/EN distinction visible in results)
- Metadata-only works (7 calls vs 18 in v1)
- rtfm_remember used (written B10 indexed in learned corpus)
- Cost divided by 2.5 vs v1 ($5.11 vs $13)
- Subagent used to parallelize research
Remaining issues
- Duration +38% vs session B without RTFM (11m23 vs 8m16)
- Duplication: 15 Glob + 14 Read in parallel with RTFM calls
- Subagent fumbles (15 Bash) instead of relying on RTFM results
- Inflated tokens (5.87M vs 1.70M without RTFM) — cache read mitigates the cost
A/B/C comparison table
| Metric | A (RTFM v1) | B (without) | C (RTFM v2) | Best |
| --- | --- | --- | --- | --- |
| Duration | 12 min | 8m16s | 11m23s | B |
| Cost | ~$13 | ~$6 | ~$5.11 | C |
| Language | ENGLISH (bug) | FR | FR | B/C |
| Size | 36K/8 sections | 31K/10 | 38.5K/14 | C |
| RTFM calls | 18 | 0 | 7 | C |
| remember | never | n/a | yes | C |
Session D: RTFM v2+ refined template (2026-02-21)
- Session D: 1f783e5e-8d6f-4bff-be5d-d0e5d48fd95c in musicology-phd
- Identical prompt
Session D results
| Metric | Value |
| --- | --- |
| Duration | 15m48s (the worst) |
| Total tokens | 3.97M (of which 3.6M cache read) |
| Estimated cost | $2.62 (the best) |
| Tool calls | 32 (0 subagents) |
| RTFM calls | 12 (3 context + 2 search + 7 expand) |
| Language | French (correct) |
| Article | 31.4K chars, 14+4 sections |
| remember | NO (regression) |
Issues identified
- CLI rtfm context via Bash fails (venv not in PATH)
- Read after expand (B10 expand 27K then Read x3)
- Glob after context (5 redundant Glob Blog/B*)
- Duplicated expand (b11_ast_bp3 x2)
- remember forgotten despite instructions
Fixes implemented post-D
- CLAUDE.md template v3: explicit anti-duplication, CLI removed
- Stop hook: blocks stopping if rtfm_remember not called
- PostToolUse hook: stamp when rtfm_remember is called
- SessionStart hook: clear the stamp at the start
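The three hooks cooperate through a stamp file. A minimal sketch of that logic, assuming the stamp is an empty file at a hypothetical path (.rtfm/remember.stamp) and that the tool name ends with rtfm_remember. How these functions are registered with the agent's hook mechanism, and the exact shape of the allow/block response, are not described in this report and are assumptions here:

```python
from pathlib import Path

STAMP = Path(".rtfm/remember.stamp")  # hypothetical location

def on_session_start(stamp: Path = STAMP) -> None:
    """SessionStart: clear any stamp left over from a previous session."""
    stamp.unlink(missing_ok=True)

def on_post_tool_use(tool_name: str, stamp: Path = STAMP) -> None:
    """PostToolUse: record that rtfm_remember was called."""
    if tool_name.endswith("rtfm_remember"):
        stamp.parent.mkdir(parents=True, exist_ok=True)
        stamp.touch()

def on_stop(stamp: Path = STAMP) -> dict:
    """Stop: refuse to stop until the stamp exists (response shape is illustrative)."""
    if stamp.exists():
        return {"decision": "allow"}
    return {"decision": "block", "reason": "Call rtfm_remember before stopping."}
```

Clearing the stamp at session start matters: without it, a stamp written in an earlier session would let a later session stop without ever calling rtfm_remember.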
A/B/C/D comparison table
| Metric | A (v1) | B (without) | C (v2) | D (v2+) | Best |
| --- | --- | --- | --- | --- | --- |
| Duration | 12m | 8m16s | 11m23s | 15m48s | B |
| Cost | ~$13 | ~$6 | ~$5.11 | $2.62 | D |
| Language | BUG | FR | FR | FR | B/C/D |
| Article | 36K/8 | 31K/10 | 38.5K/14 | 31.4K/18 | C |
| Tool calls | 38 | 85 | 64 | 32 | D |
| Duplication | N/A | N/A | ~29 | ~10 | D |
| remember | no | n/a | yes | no | C |