# FeatureBench Integration Notes
## Setup
- Repo: `/mnt/d/Claude/FeatureBench/`
- Installed via `uv sync` (Python 3.12+)
- Config: `config.toml` (gitignored, contains no API keys)
- CLI: `fb infer`, `fb eval`, `fb pull`
- Extract metrics: `python extract_metrics.py <run_dir> <task_id>`
## OAuth/MAX Integration (custom patch)
- Modified `run_infer.py` → `_build_volumes()` method
- When `ANTHROPIC_API_KEY` is empty, mounts `~/.claude/` → `/root/.claude/` in Docker
- Claude Code CLI inside Docker uses the host's OAuth session (MAX subscription)
## RTFM Integration
- Agent: `ClaudeCodeRTFMAgent` in `agents/claude_code_rtfm.py`
- Extends `ClaudeCodeAgent` with a `pre_run_setup()` hook
- Steps: `python -m pip install /opt/rtfm[mcp]` → `rtfm init` → `rtfm sync` → write CLAUDE.md → gitignore
- RTFM source auto-detected: `_find_rtfm_source()` checks the `RTFM_SRC` env var, `~/projects/rtfm`, `~/projects/biblirag`, `/mnt/d/Claude/RTFM`, `/mnt/d/Claude/biblirag`
- Key fix (2026-02-25): must use `python -m pip install` (not `pip install`) to target the conda env
- Key fix (2026-02-25): the volume mount path was hardcoded to a WSL path; now auto-detected
## A/B Benchmark Results (2026-02-22) — 10 tasks
RTFM LOSES in aggregate: +3.4% duration, +15% turns, +20% cache read tokens. Win rate: 4/10 tasks.
| Task | Base dur | RTFM dur | Delta | Files indexed |
|---|---|---|---|---|
| metaflow/stub_generator | 565s | 354s | -37% | 623 |
| astropy/erfa_ufuncs | 536s | 606s | +13% | 1122 |
| astropy/table | 675s | 632s | -6% | 1122 |
| mlflow/databricks_tracing | 524s | 688s | +31% | 8258 |
| mlflow/judge_tool | 571s | 457s | -20% | 8259 |
| mlflow/responses_agent | 746s | 1022s | +37% | 8258 |
| mlflow/serialization | 481s | 509s | +6% | 8259 |
| mlflow/span | 707s | 958s | +36% | 8259 |
| mlflow/trace | 676s | 782s | +16% | 8259 |
| mlflow/validation | 701s | 386s | -45% | 8259 |
## Key Findings
- RTFM wins on small repos (metaflow 623 files: -37%)
- RTFM loses on large repos (mlflow 8259 files: most tasks slower)
- Problem: on large repos, RTFM search adds noise → agent spends more time searching than coding
- Grep calls dropped by 57, but were replaced by 116 additional Bash calls (`rtfm search`), which are slower
- No `rtfm context` used, almost no `rtfm expand` — agent only uses `rtfm search`
## Improvement Ideas
- Filter indexed files (skip vendored, tests, generated code)
- Limit RTFM search results (top 3 instead of top 5+)
- Better CLAUDE.md template: "search once, then code" not "search repeatedly"
- Hybrid mode: RTFM for discovery, Grep for precision
- Don't index repos > 2000 files in full — index only key directories
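The first idea (filtering indexed files) could look like the sketch below; the skip patterns are illustrative examples, not RTFM's actual defaults:

```python
from pathlib import PurePosixPath

# Directory names and filename suffixes to exclude from indexing
# (illustrative; tune per repo).
SKIP_PARTS = {"vendor", "vendored", "third_party", "tests", "test", "node_modules"}
SKIP_SUFFIXES = ("_pb2.py", ".min.js")  # generated code


def should_index(path: str) -> bool:
    """Return True if a repo-relative path is worth indexing."""
    p = PurePosixPath(path)
    if any(part in SKIP_PARTS for part in p.parts):
        return False
    return not p.name.endswith(SKIP_SUFFIXES)
```

On a repo like mlflow (8259 files indexed), a filter like this would shrink the index substantially and cut search noise.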
## Task Availability (lite split = 30 tasks)
- 11 tasks WITHOUT GPU requirement
- 19 tasks WITH GPU requirement
- pydantic image failed to download (Docker WSL2 SIGBUS bug)
- All valid results: `benchmark_final_results.jsonl`
## FeatureBench-Discovery Mode (2026-02-23)
Original FeatureBench prompts give exact file paths plus interfaces, which makes
discovery tools useless. Discovery mode (`--discovery` flag) strips `Path:` lines
but keeps interface signatures, so the agent must explore the codebase to find
the correct locations.
Implementation: `_strip_interface_paths()` in `run_infer.py`
- Removes all `Path: /testbed/...` lines via regex
- Replaces the boilerplate "The value of Path declares..." with "Explore the codebase..."
- CLI: `fb infer --discovery -a claude_code_rtfm ...`
- Stored in `run_metadata.json` as `discovery_mode: true`
Benchmark plan: run 2 configs on the 11 no-GPU tasks:
1. `claude_code --discovery` (baseline with open prompts)
2. `claude_code_rtfm --discovery` (RTFM with open prompts)

Compare: solve rate, duration, tool calls.
## First Discovery A/B Result (2026-02-25, mlflow/validation, Sonnet 4)
| Metric | Baseline | RTFM |
|---|---|---|
| Total time | 11m24s | 9m27s |
| Setup (RTFM) | 0 | 108s (install+init+sync 8259 books) |
| Agent work | ~10m | ~6m15s |
| Patch size | 16,927 chars | 22,485 chars |
| RTFM tool calls | 0 | 15 |
| Gain | - | -17% total, -37% agent time |
PC2: `roomi@192.168.1.28`, runs stored in `~/projects/FeatureBench/runs/`
Methodology justification: Same repos, same tests, same interfaces. Only variable = whether agent has RTFM search tools. Evaluates whether pre-indexed search reduces discovery overhead on large codebases.
## Discovery A/B Batch (2026-02-25, in progress)
- 6 mlflow tasks running via `run_mlflow_ab.sh` on PC2 (PID started ~11:52)
- Script: baseline → RTFM → eval both → `generate_report.py` → markdown report
- Reports → `~/projects/FeatureBench/reports/discovery-ab/`
- Docker on PC2: NTFS partition cannot host overlay2, must stay on ext4 root
- Strategy for remaining 4 tasks: one image at a time (pull, run, remove)
  - metaflow (1 task), astropy (2 tasks), pydantic (1 task)
## Files Modified in FeatureBench
- `featurebench/infer/run_infer.py` — `_build_volumes()`, CLI choices, `_strip_interface_paths()`
- `featurebench/infer/models.py` — `discovery_mode` field on InferConfig + RunMetadata
- `featurebench/infer/agents/claude_code_rtfm.py` — new agent (MCP tools in ALLOWED_TOOLS)
- `featurebench/infer/agents/__init__.py` — agent registration
- `config.toml` — `[infer_config.claude_code_rtfm]` section
- `extract_metrics.py` — metrics extraction script
- `benchmark_final_results.jsonl` — clean results for all 10 pairs