Changelog
[0.24.1] — 2026-06-07
Fixed — rtfm queue retry-failed no longer raises on duplicate failures
When a pile of similar files all fail with the same shape of error (the 1330 broken EPUBs on viasophia), retry_failed tried to move them all to pending in one UPDATE and the unique-pending index rejected the second twin. The whole operation rolled back and nothing was retried. Same class of bug as the reaper case in 0.24.0.
retry_failed now coalesces before the bulk update:
- failed rows whose twin is already pending → dropped (the pending one wins).
- failed rows that share (type, payload) with another failed row → only the one with the highest attempts survives, the rest are dropped.
Two regression tests added.
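The coalescing rule can be sketched in plain Python. This is a minimal illustration, not the actual retry_failed code: the row dicts, the coalesce_failed name and the in-memory set of pending keys are assumptions made for the example.

```python
from __future__ import annotations

from typing import Iterable

Key = tuple[str, str]  # (type, payload)


def coalesce_failed(failed: Iterable[dict], pending_keys: set[Key]) -> tuple[list[dict], list[dict]]:
    """Split failed rows into (rows safe to requeue, rows to drop).

    Each row is assumed to look like {"id", "type", "payload", "attempts"}.
    """
    best: dict[Key, dict] = {}
    dropped: list[dict] = []
    for row in failed:
        key = (row["type"], row["payload"])
        if key in pending_keys:
            # A twin is already pending: the pending one wins.
            dropped.append(row)
            continue
        current = best.get(key)
        if current is None:
            best[key] = row
        elif row["attempts"] > current["attempts"]:
            # Keep only the failed twin with the highest attempt count.
            dropped.append(current)
            best[key] = row
        else:
            dropped.append(row)
    return list(best.values()), dropped
```

After this pass every surviving row has a distinct (type, payload) that is not already pending, so a single bulk UPDATE to pending cannot trip the unique-pending index.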
[0.24.0] — 2026-06-07
Fixed — rtfm sync no longer hangs forever after a worker crash
When a worker died mid-job (OOM-kill, WSL reboot, hard SIGKILL) the row stayed running in the queue forever, with no live worker behind it. rtfm sync's exit condition was pending == 0 AND running == 0, so one zombie row meant the command waited indefinitely — a real cron once burned 4 h 24 min in this state before contributing to a host crash. Diagnosed and reproduced on this very repo (35 zombies accumulated since May 21).
- New zombie reaper in Queue.reap_zombies(). Decides what's a zombie by reading worker_state.json rather than by timestamps: if no worker is alive, or if the live worker is on a different current_job_id, the row is a zombie. A 3 h started_at fallback covers the rarer case where the worker is alive but stuck. Zombies with attempts >= 3 are marked failed instead of being requeued, so a single poisonous file can't loop forever.
- Auto-reap at worker boot: first thing the worker does on startup, before draining anything.
- Auto-reap inside _watch_jobs: one-shot before the wait loop + every 10 s during. So rtfm sync is now self-healing: if a worker dies while you watch, the next reap cycles its in-flight row.
- --timeout <seconds> flag on rtfm sync: explicit ceiling. Returns exit code 2 on timeout (worker keeps draining in the background).
- rtfm queue reap: manual remediation command with verbose per-row output (id, type, attempts, started_at, file path). Use after an unexpected hang.
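A rough sketch of the reaper's decision for one running row, under stated assumptions: the function name, the "alive" key and the row shape are hypothetical, and only the rules (worker state first, 3 h fallback, attempts >= 3 means failed) come from the entry above.

```python
from __future__ import annotations

from datetime import datetime, timedelta

STUCK_AFTER = timedelta(hours=3)  # started_at fallback for an alive-but-stuck worker
MAX_ATTEMPTS = 3                  # zombies at or above this are failed, not requeued


def classify_running_row(row: dict, worker: dict | None, now: datetime) -> str:
    """Return "keep", "requeue" or "fail" for a row in running status.

    row is assumed to carry id, attempts and started_at (a datetime);
    worker is the parsed worker_state.json, or None when no worker is alive.
    """
    alive = worker is not None and worker.get("alive", False)
    on_this_job = alive and worker.get("current_job_id") == row["id"]
    stuck = now - row["started_at"] > STUCK_AFTER
    if on_this_job and not stuck:
        return "keep"  # a live worker really is processing this row
    # Otherwise the row is a zombie: cap the retries so one bad file can't loop.
    return "fail" if row["attempts"] >= MAX_ATTEMPTS else "requeue"
```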
Fixed — EPUBs with missing internal images now index instead of failing
The ebooklib reader raised on the first manifest item missing from the ZIP (typically an interrupted-download EPUB with a missing image). 1329 EPUBs were stuck in failed on the viasophia repo for this reason. Now:
- We detect the "no item named …" error specifically and fall back to a tolerant ZIP walker that iterates .xhtml / .html members directly, ignoring the manifest.
- Chunks extracted via the fallback carry source_status: "incomplete" in their metadata, so callers can spot them.
The Weil EPUB that triggered the issue extracts 191 chunks in the fallback path, instead of 0.
Schema
No migration. The reaper uses the existing started_at column. The dedup logic handles edge cases where multiple zombie rows share the same (type, payload): it keeps the one with the most attempts and deletes the duplicates, so the unique-pending index can't reject the requeue.
[0.23.0] — 2026-05-25
Added — worker respawn is now fully autonomous (no manual action)
Up to 0.22 the worker self-exited cleanly on version drift or memory pressure, but the respawn still required a hook to fire (next user prompt in Claude Code) or an explicit rtfm worker restart-all. If the user installed an upgrade and walked away, the queue could sit idle for hours. Two new layers close the gap, both fully automatic:
- Fork-helper at clean exit. When the worker self-exits (version drift / RSS over threshold), it forks a tiny detached process just before exiting. That helper sleeps ~6 s (enough for the worker's lock to be released) and calls ensure_worker_running, which spawns a fresh worker with the up-to-date code. SIGTERM (explicit rtfm worker stop) leaves the worker stopped; the helper only fires on self-managed exits.
- Lazy version check in every CLI command. At the top of cli.main(), throttled to once per minute via a marker file at ~/.rtfm/last-version-check, we scan every registered worker's worker_state.json for its installed_version and compare to the version the running CLI just loaded. Any mismatch silently triggers rtfm worker restart-all in the background. So the moment you run any rtfm command after pip install, every project's worker gets refreshed without you doing anything.
WorkerState now includes installed_version (populated at worker startup) so the CLI can detect drift without spawning anything.
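The throttled check can be pictured roughly like this. It is a sketch only: both function names are made up for the example, and using the marker file's mtime as the clock is an assumption about how the throttle might work, not a description of cli.main().

```python
from __future__ import annotations

import time
from pathlib import Path

CHECK_INTERVAL_SECONDS = 60  # "once per minute"


def should_check_versions(marker: Path, now: float | None = None) -> bool:
    """Return True at most once per interval, using the marker's mtime as the clock."""
    now = time.time() if now is None else now
    try:
        last = marker.stat().st_mtime
    except FileNotFoundError:
        last = 0.0
    if now - last < CHECK_INTERVAL_SECONDS:
        return False
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()  # record that a check happened just now
    return True


def drifted_workers(states: dict[str, dict], cli_version: str) -> list[str]:
    """Projects whose worker_state reports a different installed_version than the CLI."""
    return [project for project, state in states.items()
            if state.get("installed_version") != cli_version]
```

A non-empty result from drifted_workers is the point at which a background restart-all would be triggered.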
New doc page docs/worker-lifecycle.md explaining the three respawn layers (fork-helper, lazy CLI check, session hooks), what stays manual (only a hard kill -9 recovery), and where to look when something feels off. The doc is written for an end-user with no internals knowledge.
[0.22.0] — 2026-05-25
Added — rtfm worker restart-all for post-install respawn
Since 0.19 the worker self-exits when a new version lands on disk, but nothing immediately respawns it — until the next user prompt fires ensure_worker_running via a hook. If you don't interact with the project for hours, the worker stays dead and the queue stalls. Bit us yesterday: musicology's worker exited at 22:14 on the 0.20 → 0.21 bump and sat idle until 11:00 the next day.
- New project registry at ~/.rtfm/workers.json: every ensure_worker_running / _spawn_worker_direct adds the project. Persistent across sessions.
- New action rtfm worker restart-all: reads the registry, cycles every registered worker (SIGTERM → wait → SIGKILL fallback → drop stale state → respawn). Reports old PID → new PID per project. Use this as the standard post-pip install / post-pipx install step.
After a deploy, the canonical sequence is now: upgrade the package (pip install or pipx install), then run rtfm worker restart-all.
[0.21.0] — 2026-05-24
Added — rtfm failed + richer rtfm check failure detail
For bibliography-manager agents that need to route on why a file isn't searchable:
- rtfm failed: flat machine-readable list of every job in failed status, with bucket (short stable category) + first line of the actual error + filepath + corpus. Filters: --type, --corpus, --bucket. Default format JSON; -f text groups by bucket for human reading. Exit 0 when nothing's failed, 1 otherwise, so rtfm failed && echo all-clean works in shell pipelines.
- rtfm check now adds ingest_failure_reason / ingest_failure_error and ocr_failure_reason / ocr_failure_error to its JSON output. Pulls the most recent failure for that file. Empty (null) when the file isn't in a failed state.
Failure buckets so far: pdf-format-invalid, file-vanished, duplicate-content, memory-exceeded, pdftext-other, ocr-tesseract-error, other, unknown. New buckets are easy to add (single helper _failure_bucket in rtfm/cli.py).
[0.20.0] — 2026-05-24
Added — memory guard prevents OOM-kill of the whole worker
A pathological PDF once made the worker consume ~13 GB of RSS, which triggered the kernel OOM-killer; the worker died without a graceful exit, lost in-flight state, and required manual recovery. Two layers of defence now:
- RLIMIT_AS cap at startup (default 8 GB, configurable via RTFM_WORKER_MEMORY_LIMIT_GB). The next allocation past the cap raises MemoryError, catchable by the per-job handler, which marks the job failed and moves on. Converts a kernel SIGKILL into a normal Python exception.
- RSS polling at every idle tick. Above WORKER_RSS_EXIT_MB (5 GB) the worker exits cleanly; the next hook respawns a fresh process. Catches slow leaks that wouldn't trip the per-alloc cap.
Opt out with RTFM_WORKER_MEMORY_LIMIT_GB=0 when running marker-pdf, whose ML models legitimately need 3-8 GB.
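For readers unfamiliar with the first layer, here is a minimal sketch of an address-space cap using the standard resource module (Unix only). The helper names and the run_job wrapper are illustrative assumptions; only the environment variable, the 8 GB default and the "0 disables" rule come from the entry above.

```python
from __future__ import annotations

import resource  # Unix-only standard library module

GIB = 1024 ** 3


def memory_limit_bytes(env: dict[str, str], default_gb: float = 8.0) -> int | None:
    """Parse RTFM_WORKER_MEMORY_LIMIT_GB; 0 (or less) means no cap."""
    gb = float(env.get("RTFM_WORKER_MEMORY_LIMIT_GB", default_gb))
    return None if gb <= 0 else int(gb * GIB)


def apply_memory_cap(limit: int | None) -> None:
    """Lower the soft RLIMIT_AS so an oversized allocation raises MemoryError."""
    if limit is None:
        return
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)  # the soft limit may not exceed the hard limit
    resource.setrlimit(resource.RLIMIT_AS, (limit, hard))


def run_job(job) -> str:
    """Per-job wrapper: a MemoryError fails this job instead of killing the worker."""
    try:
        job()
        return "done"
    except MemoryError:
        return "failed"
```

A worker would call apply_memory_cap(memory_limit_bytes(dict(os.environ))) once at startup, before draining anything.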
Test suite: 568 passed.
[0.19.0] — 2026-05-23
Added — worker self-restarts after a version upgrade
A long-running worker keeps the code it imported into memory at startup; a fresh pip install --force-reinstall (or pipx install) writes new files on disk but the running worker silently ignores them. Bit this project once already — workers from 0.15 era kept handling jobs while 0.17 / 0.18 lived on disk, so the new handle_scan never fired and ~1200 PDFs sat unindexed.
Now at every idle tick the worker compares importlib.metadata.version("rtfm-ai") against the version it captured at startup. If they diverge, it logs version changed on disk, exiting for restart and exits cleanly. The next hook (UserPromptSubmit / PostToolUse) calls ensure_worker_running, which spawns a fresh worker with the up-to-date code. Source-checkout developers are unaffected — when either side reports "unknown" (no installed metadata) the check is a no-op.
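The comparison itself is small. A sketch, with hypothetical function names, of how the "unknown" no-op rule might look around importlib.metadata:

```python
from importlib import metadata


def installed_version(dist: str = "rtfm-ai") -> str:
    """Version recorded in the installed metadata, or "unknown" in a source checkout."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return "unknown"


def should_exit_for_restart(at_startup: str, on_disk: str) -> bool:
    """True when the version on disk diverged from the one captured at startup."""
    if "unknown" in (at_startup, on_disk):
        return False  # no installed metadata on either side: the check is a no-op
    return at_startup != on_disk
```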
rtfm check, the CLI command introduced earlier today, also gains the ocr_attempted / ocr_pending / ocr_failed triplet (and the same ingest_* split) so consumers can route differently: pending → wait, failed → escalate to human, neither → fully done.
Test suite: 566 passed.
[0.18.0] — 2026-05-23
Changed — every DB write now goes through the worker (no more inline path)
The CLI, the hooks and the slash commands stop touching the DB directly. They become producers that enqueue jobs into a single 7-level priority queue; the worker daemon is the only consumer. This removes a whole class of bugs (concurrent writes from different RTFM versions, inline OCR that blocks the user's terminal for hours, hooks that ran a full destructive sync on every prompt) and makes the system observable: a rtfm sync shows live queue progress instead of a long opaque blocking call.
Seven priority lanes, lowest number wins:
- P0 = explicit user (slash commands, manual CLI invocations)
- P1 = scan: detect changes in a source
- P2 = remove: drop a vanished file from the index
- P3 = ingest: parse one file → chunks
- P4 = reconcile / vacuum: short maintenance
- P5 = embed: vectorise a batch of chunks
- P6 = ocr: OCR a page-range of a scanned PDF
What changed concretely:
- New job types: scan, remove, reconcile, vacuum, each with its own handler in rtfm/core/handlers.py. The scan handler subsumes the old _scan_once method on the worker and the destructive sync() removed-path, including the mass-removal circuit breaker from 0.16.0.
- Worker periodic ticks (_maybe_scan, _maybe_reconcile) now just enqueue jobs. The work happens in handlers. Queue dedup (UNIQUE(type, payload) WHERE status='pending') keeps the queue clean across repeated ticks.
- CLI: every mutating command (rtfm sync, rtfm gc, rtfm doctor, rtfm reindex, rtfm vacuum, rtfm backfill-pages) becomes "enqueue P0 + watch progress + exit". --background skips the watching loop. rtfm sync --inline is gone (the inline path is gone). cli.py shrank 2587 → 2274 lines.
- DB migration is automatic: pre-0.18 DBs had a CHECK(type IN ('ingest','embed','ocr')) on work_queue that blocked the new job types. The first 0.18+ Queue open rebuilds the table in place, rows preserved.
- Docs: docs/architecture.md rewritten for the new model (priority table, handler list, periodic-tick semantics).
Test suite: 552 → 565 passed (24 skipped).
[0.17.0] — 2026-05-23
Fixed — stop indexing our own state directory (feedback loop)
Some live DBs ballooned absurdly (RTFM 2.3 GB for 441 books, tradingbot 8.5 GB for 59 books). Forensics: the parser registry was happily ingesting .rtfm/library.db itself — every chunk of the index became more rows, which the next sync re-ingested, snowballing. New default excludes block this and other generic noise:
- .rtfm/: RTFM's own state dir (library.db, logs, locks). Indexing it is always a bug.
- .cache/: generic cache dirs (import caches, browser caches, build caches), always noise.
- Honor root .gitignore: when pathspec is installed (now a core dep), scan_directory() filters out anything matched by the project's own .gitignore. Reuses what the user has already declared as ignored artifacts rather than maintaining a parallel exclude list. Nested .gitignore files in subdirs are not walked (root-only), which covers the vast majority of real-world setups while keeping the scan simple. Opt out with honor_gitignore=False.
To purge the historical garbage on an already-polluted DB, run once: rtfm sync --force-remove (the mass-removal circuit breaker from 0.16 will otherwise block the cleanup since 90%+ of "files" disappear under the new excludes).
[0.16.0] — 2026-05-21
Fixed — sync no longer wipes a corpus on an incomplete scan (data-loss bug)
A live corpus on NTFS-via-WSL lost ~500 fully-indexed PDFs (and their embeddings). Root cause: the session hooks ran a full sync() of every source on every prompt. While an external process was reorganising files on flaky NTFS, a scan caught a moment when hundreds of files were temporarily absent → sync() flagged them removed → delete_book destroyed their chunks; a later gc then purged the now-orphaned embeddings. The background worker was never the cause — its idle-scan only ever adds, never deletes.
- Mass-removal circuit breaker in sync(): refuses a removal batch that is both large (≥ REMOVE_CIRCUIT_MIN_FILES, default 25) and a big fraction of the corpus (≥ REMOVE_CIRCUIT_RATIO, default 25%), the signature of an incomplete scan, not real deletions. Index left intact; a warning is surfaced. Override with rtfm sync --force-remove (or force_remove=True) for deliberate bulk deletes.
- File-list mode never deletes: when sync(files=[...]) is given a partial list, files not in that list are no longer treated as removed (their absence from a partial list is not evidence of deletion).
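The breaker condition reduces to two thresholds that must both trip. A small sketch, assuming a hypothetical removal_blocked helper and that the ratio is taken against the number of currently indexed files:

```python
REMOVE_CIRCUIT_MIN_FILES = 25   # default from the entry above
REMOVE_CIRCUIT_RATIO = 0.25     # default 25%


def removal_blocked(n_removed: int, n_indexed: int, force_remove: bool = False) -> bool:
    """True when a removal batch looks like an incomplete scan rather than real deletions."""
    if force_remove or n_indexed == 0:
        return False
    large = n_removed >= REMOVE_CIRCUIT_MIN_FILES
    big_fraction = n_removed / n_indexed >= REMOVE_CIRCUIT_RATIO
    return large and big_fraction
```

Requiring both conditions keeps small corpora usable (deleting 10 of 20 files is allowed) while still catching the "hundreds of files vanished at once" signature.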
Changed — lightweight hooks: the worker does the work
The Claude Code hooks no longer run a full sync() (which re-MD5'd the entire corpus on every prompt — slow on NTFS, and the trigger for the data-loss bug above). New design:
- UserPromptSubmit / Stop → only revive the background worker if it died. No scan, no hashing, nothing on the user's hot path.
- PostToolUse (Write|Edit|MultiEdit) → enqueue the one file the agent just wrote as a P1 ingest job (mapped to its source/corpus, gated on a registered parser). Non-destructive: only ever adds work.
- Discovery of new/changed/moved files across all sources is the worker's non-destructive idle-scan. New install_hook registers all three; re-running is idempotent.
[0.15.0] — 2026-05-21
Changed — OCR: tesseract backend by default, split into page tranches
marker (Surya models) is excellent but unusable for OCR on CPU: on a real corpus every big scan (Narmour 499p, Eco 253p, Chomsky…) either timed out at 20 min or OOM-crashed during layout. New default OCR path:
- extract_with_tesseract: renders each page via pypdfium2 (already a dep) and OCRs it with tesseract (fast C binary, no multi-GB ML models → no OOM/timeout). Multilingual (eng, fra, + indic packs). Languages auto-filtered to those actually installed.
- Page-range splitting: a scanned book is OCR'd in tranches of PAGES_PER_OCR_JOB = 50, one P3 job each. A 600-page book becomes ~12 short, independently-resumable jobs instead of one hour-long block that monopolises the worker. enqueue_ocr_jobs() does the split; P1, rtfm doctor --enqueue-ocr and backfill-pages --enqueue-ocr all use it.
- Idempotent append: Library.append_ocr_chunks(book_slug, chunks, page_lo, page_hi) deletes that page range then inserts, so re-running a tranche (retry) never duplicates and other tranches stay intact. Each tranche enqueues P2 embedding for just its new chunks.
- Config: ocr_backend (tesseract default | marker | auto), ocr_langs (default eng+fra; set e.g. eng+fra+tam+hin+san for Indic-script scans).
- New [ocr] extra: pytesseract, pypdfium2, Pillow (+ the system tesseract binary).
- pages_to_chunks() extracted from PDFParser.parse and shared with the OCR handler so OCR'd pages produce identical chunk shapes. 6 new tests (split ranges, idempotent tranche append).
Trade-off: tesseract is excellent on clean print (your scanned books) but weaker than marker on heavy maths/tables/multi-column. For making text searchable it's the right call on a GPU-less machine; marker stays available via ocr_backend: marker.
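The tranche split is simple arithmetic. A sketch with a hypothetical split_page_ranges helper; the 1-based inclusive (page_lo, page_hi) convention is an assumption, not something the entry above specifies.

```python
PAGES_PER_OCR_JOB = 50


def split_page_ranges(page_count: int, size: int = PAGES_PER_OCR_JOB) -> list[tuple[int, int]]:
    """Inclusive 1-based (page_lo, page_hi) tranches covering page_count pages."""
    return [(lo, min(lo + size - 1, page_count))
            for lo in range(1, page_count + 1, size)]
```

For a 600-page book this yields 12 ranges, from (1, 50) to (551, 600); a shorter final tranche is clipped to the page count.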
[0.14.1] — 2026-05-21
Fixed
- rtfm.core.embeddings no longer hard-imports numpy at module load. numpy is part of the [embeddings] extra, but reconcile() (and the queue handlers) need only the metadata helpers (resolve_model, DEFAULT_MODEL), which don't touch numpy. The top-level import numpy made test_reconcile fail in the core/dev CI matrix (ModuleNotFoundError: numpy). numpy is now imported lazily inside the functions that use it, with from __future__ import annotations keeping the np.ndarray type hints from evaluating at import time.
[0.14.0] — 2026-05-21
Fixed — embeddings no longer leak when chunks are deleted
Library._get_conn now sets PRAGMA foreign_keys = ON. SQLite has FK enforcement off by default, so the chunk_embeddings → chunks ON DELETE CASCADE never fired: every re-ingest/delete_book left the old embeddings behind as orphans (a real index had 197k orphans = 19% of all embeddings). They didn't pollute search (the semantic query JOINs on chunks, excluding them) but wasted disk. With FKs on, deleting a chunk removes its embedding.
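The effect is easy to reproduce with an in-memory database. This sketch uses a cut-down two-table schema (not the real one) and sets the pragma explicitly in both directions so the result does not depend on how SQLite was compiled:

```python
import sqlite3


def orphan_count(enforce_fk: bool) -> int:
    """Delete a chunk and count the embeddings left behind."""
    conn = sqlite3.connect(":memory:")
    # The pragma is per-connection and must be set outside a transaction.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce_fk else 'OFF'}")
    conn.executescript(
        """
        CREATE TABLE chunks (id INTEGER PRIMARY KEY);
        CREATE TABLE chunk_embeddings (
            chunk_id INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
            vec BLOB
        );
        INSERT INTO chunks (id) VALUES (1);
        INSERT INTO chunk_embeddings (chunk_id, vec) VALUES (1, x'00');
        DELETE FROM chunks WHERE id = 1;
        """
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM chunk_embeddings").fetchone()
    conn.close()
    return count
```

With enforcement off the cascade is silently skipped and the embedding row survives as an orphan; with it on, the row goes away with its chunk.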
Added — self-healing reconciliation (rtfm gc + idle worker pass)
A live pipeline is never perfectly consistent (interrupted syncs, re-ingests, moves). Rather than try to prevent every gap, RTFM now reconciles the index periodically:
- rtfm.core.reconcile.reconcile(): purges orphan embeddings and re-queues every chunk missing an embedding as P2 jobs.
- The worker runs it automatically while idle (every RECONCILE_INTERVAL_SECONDS = 3600, only when the queue is empty, so it never races an in-flight re-ingest/move, and an orphan only ever means "chunk gone for good" since move_file preserves chunk ids).
- rtfm gc [--vacuum] [--force]: manual trigger. Refuses while the worker is busy (reconciliation is only safe at rest); --force overrides; --vacuum reclaims disk after purging.
This also surfaces and self-heals un-embedded chunks — content that exists but was never embedded (e.g. after an inline/--no-embeddings sync), so it's invisible to semantic search until reconciled. 5 new tests in test_reconcile.py, incl. a regression that FK=ON cascades the delete.
[0.13.0] — 2026-05-21
Fixed — half the supported formats were never scanned
DEFAULT_EXTENSIONS was a hand-maintained list of 27 extensions that omitted 27 formats RTFM has a parser for: csv, tsv, xlsx, sqlite, sqlite3, db, epub, mobi, azw, azw3, docx, odt, rtf, fb2, djvu, ipynb, sql, and several languages (kotlin, swift, lua, r, perl, scala, …). Those files were silently ignored unless a source declared extensions explicitly. DEFAULT_EXTENSIONS is now derived from the parser registry (default_extensions()), so every format with a parser is scanned — 56 extensions, and any newly-added parser is picked up automatically.
Added — rtfm reindex (targeted refresh after a parser change)
When a parser is improved (e.g. the 0.12.0 tabular fix), the affected files need re-ingesting — but their content hash is unchanged, so rtfm sync skips them and --force would re-ingest everything (including a thousand PDFs mid-embed). rtfm reindex enqueues P1 ingest jobs only for a chosen category, leaving the rest of the queue and in-flight embeddings untouched:
rtfm reindex --ext csv,tsv,xlsx,sqlite,db # after a tabular parser fix
rtfm reindex --parser csv # by parser name
rtfm reindex --ext pdf --corpus icm-bibliography
P1 jobs preempt pending P2/P3, so the refresh runs first. This is the "nominal" way to roll out a parser change to an existing index.
[0.12.0] — 2026-05-21
Changed — tabular parsers index the whole file, not a sample
CSV/TSV, XLSX and SQLite were samplers, not parsers: they indexed only the header + a handful of rows (CSV 8, XLSX 6, SQLite 5). A value on row 5000 was invisible to search. They now index every row, so the full table is searchable.
- CSV/TSV (csv_parser.py): overview chunk (columns + inferred types) then all rows in size-bounded data chunks. Each row rendered as col=value | col=value (every value tied to its column for FTS/semantic match), full values (no more 80-char cell truncation), header repeated per chunk. Streamed, so memory stays bounded on million-row files.
- XLSX (xlsx.py): same treatment per sheet: schema chunk + all-rows data chunks, via read_only iter_rows.
- SQLite (sqlite_parser.py): per table, schema chunk + all rows streamed with fetchmany(500). BLOB columns keep a <blob NB> placeholder (binary, not text-searchable); text/numeric values kept in full. FK edges unchanged.
Type inference still samples the first ~50 rows (it doesn't need the whole file). Trade-off: indexing a large table produces many more chunks → bigger index and more embeddings, which is the cost of "everything searchable". Tests updated/added across all three parsers (full-content, column-context, large-file, no-truncation).
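To make the row rendering concrete, here is a rough streaming sketch for the CSV case. The function name, the max_chars budget and the exact header line format are assumptions for illustration; only the col=value | col=value rendering, the per-chunk header and the streaming behaviour come from the entry above.

```python
import csv
from typing import Iterable, Iterator


def iter_data_chunks(lines: Iterable[str], max_chars: int = 2000,
                     delimiter: str = ",") -> Iterator[str]:
    """Yield size-bounded data chunks, each starting with the column header."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    header = " | ".join(reader.fieldnames or [])
    rows: list[str] = []
    size = 0
    for record in reader:
        # Tie every value to its column so a match carries its context.
        rendered = " | ".join(f"{col}={val}" for col, val in record.items())
        if rows and size + len(rendered) > max_chars:
            yield "\n".join([header, *rows])
            rows, size = [], 0
        rows.append(rendered)
        size += len(rendered)
    if rows:
        yield "\n".join([header, *rows])
```

Because it is a generator over an iterable of lines, passing an open file handle keeps only one chunk's worth of rows in memory at a time.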
[0.11.2] — 2026-05-20
Changed — PDF health scan hardened for unattended corpus runs
A cross-team freeze post-mortem (a sibling tool ran two poppler-based PDF scanners in parallel on a DrvFs/9p mount; corrupt files wedged pdfinfo in uninterruptible D-state, full-document pdftotext on big healthy PDFs saturated I/O) drove three hardening changes so RTFM can scan an entire corpus in the background without freezing WSL:
- Page sampling: measure_pdf_text(path, sample_pages=10) now text-extracts only the first ~10 pages. The scan signal (≈0 chars/page) is unambiguous there; extracting a 700-page book in full was pure I/O waste. Verdict unchanged, ~10× faster per large file (Narmour 499p: 0.5 s vs seconds).
- Buffer read: the file bytes are read in Python (path.read_bytes(), an interruptible syscall we own) and handed to pypdfium2 as a buffer, instead of letting pdfium open the path and block on the slow mount. RTFM was already subprocess-free (pypdfium2 in-process), so it never had the D-state child problem in the first place.
- No two scanners at once: rtfm doctor refuses to run while the worker is busy (use --force to override, or stop the worker). One PDF scanner per mount.
measure_pdf_text now also returns sampled_pages. backfill-pages no longer overwrites total_chars (the sampled count isn't the document total) — it writes page_count and bases the scan verdict on the freshly-sampled real text.
[0.11.1] — 2026-05-20
Fixed — scan detection reads the file, not the DB
A cross-check against a hand-curated 28-PDF list exposed a flaw: scan detection (and backfill-pages) computed chars/page from the stored books.total_chars, which can be stale (different file revision, prior OCR run). It made a genuine 0-char scan (Chomsky 1957) look like text. Now the density is measured from the real file every time.
- parsers.pdf.measure_pdf_text(path): opens via pypdfium2, extracts the real text of every page, returns {pages, chars, chars_per_page, error}. A non-None error is a distinct "unreadable" state (pdfium "Data format error" on corrupt files); such files can't be OCR'd by marker either (same backend), so they need re-acquisition, not OCR.
- backfill-pages rewrites both page_count and a freshly-measured total_chars, and only flags readable scans.
Added — format sniffing + rtfm doctor
- core.sniff.detect_real_format(path): magic-byte detection (pdf / zip / epub / docx / xlsx / pptx / html / rtf / gzip / empty). Catches files saved with a lying extension (e.g. an EPUB named .pdf).
- The P1 ingest handler no longer queues OCR for a .pdf that isn't really a PDF (marker would fail too).
- New rtfm doctor: diagnoses every indexed PDF into ok / scan / unreadable / wrong-format / missing by reading the real file. Flags: --enqueue-ocr (queue P3 for readable scans), --fix-extensions (rename mislabeled files on disk so a re-sync routes them to the right parser).
- 11 new tests in test_sniff.py.
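Magic-byte detection in its simplest form looks like the sketch below. It is deliberately reduced: sniff_magic is a hypothetical name, it works on the first bytes rather than a path, the "unknown" fallback is an assumption, and it does not tell EPUB, DOCX, XLSX or PPTX apart (those are ZIP containers, so distinguishing them means looking inside the archive).

```python
def sniff_magic(head: bytes) -> str:
    """Guess a file's real format from its leading bytes."""
    if not head:
        return "empty"
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # also the container for epub / docx / xlsx / pptx
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    return "unknown"
```

A file named book.pdf whose first bytes sniff as "zip" is the "lying extension" case: OCR would be wasted on it, and renaming is the right fix.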
[0.11.0] — 2026-05-20
Added — deterministic scanned-PDF detection
The "is this PDF a scan that needs OCR?" decision is now based on text density (chars per page), not the chunk count. On a real corpus the chunk-count heuristic was badly wrong: of 143 low-chunk PDF candidates, only 4 were actual scans, while ~113 had plenty of text that the chunker had merged into 1-2 large chunks. Conversely, scans that produced 1-2 junk chunks slipped past the old chunks == 0 test entirely.
- PDFParser.parse() writes the real pypdfium2 page count into the shared metadata dict, so Library._index_chunks persists it to the new use of books.page_count.
- Library._index_chunks returns pages in its stats and stores page_count (via COALESCE, so a re-ingest never nulls it).
- handlers._pdf_is_scan(stats): deterministic test, chars / pages < SCAN_CHARS_PER_PAGE (20). Falls back to the zero-chunk signal only when no page count is available. The P1 ingest handler uses it to decide whether to enqueue a P3 OCR job.
- New rtfm backfill-pages [--enqueue-ocr]: fills books.page_count for already-indexed PDFs (cheap: pypdfium2 page count, no text extraction), reports which are provably scans, and optionally enqueues P3 OCR jobs for them (enabling ocr_fallback if needed).
- 4 new tests in rtfm/tests/test_handlers.py.
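The density test fits in a few lines. In this sketch the stats keys (pages, chars, chunks) and the public function name are assumptions; the threshold and the zero-chunk fallback follow the description above.

```python
SCAN_CHARS_PER_PAGE = 20


def pdf_is_scan(stats: dict) -> bool:
    """Deterministic scan test: very low text density per page."""
    pages = stats.get("pages") or 0
    if pages > 0:
        return stats.get("chars", 0) / pages < SCAN_CHARS_PER_PAGE
    # No page count available: fall back to the old zero-chunk signal.
    return stats.get("chunks", 0) == 0
```

Note how a scan that produced a couple of junk chunks is still caught (density near zero), while a text PDF merged into one big chunk is not flagged.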
[0.10.6] — 2026-05-19
Fixed
- rtfm sync no longer crashes with database is locked when the Library connection has an open implicit transaction at the moment the Queue tries BEGIN IMMEDIATE. Two connections to the same SQLite DB from the same Python process see each other as locked even in WAL mode; busy_timeout doesn't help in that intra-process case. _cmd_sync_enqueue now commits the Library connection right before every batch enqueue.
[0.10.5] — 2026-05-19
Fixed
- Queue.enqueue_many wraps each batch in a single BEGIN IMMEDIATE transaction (was N individual auto-commits) and retries up to 3× on transient database is locked.
- busy_timeout bumped from 10 s → 60 s for multi-MCP-server setups (3+ Claude Code sessions on the same project).
[0.10.4] — 2026-05-19
Changed — single consumer process + MD5 enqueue
Two corrections to the 0.10.3 design after a real-world run on the user's musicology-phd project.
- rtfm sync enqueue now uses compute_diff (MD5), not quick_diff (size + mtime). On a 4400-job sample the previous quick-diff path was ~14 % waste: ~10 % cross-corpus duplicates (a file already in the DB under another corpus with the same MD5) and ~4 % mtime false-positives (NTFS-via-WSL re-touching files without content change). Quick-diff missed the cross-corpus case entirely.
- Cross-corpus moves are now applied inline during rtfm sync, before any enqueue, via Library.move_file(new_corpus=...). The book row's corpus is updated and its chunks / embeddings / tags follow (FK on chunk_id, not on the on-disk path). This is the work-preservation guarantee the user asked for: a file moved between configured corpora keeps the embeddings already paid for.
Changed — one process, no more watcher
rtfm/core/watcher.py and rtfm/tests/test_watcher.py are gone. The periodic scan is folded into the worker idle loop: when the priority queue is empty, the worker runs the same MD5-based scan + cross-corpus move logic itself, then sleeps. One project = one process, exactly as the user originally specified.
- New --scan-interval SECONDS option on rtfm worker start (default 30 s). The worker reads it via rtfm worker-daemon --scan-interval.
- rtfm watch [start|stop|status] and watch-daemon are removed.
- rtfm status keeps showing the Worker / Queue: section unchanged; it was already worker-only.
The 0.10.3 watcher made sense in isolation but doubled the daemon footprint for no benefit: scanning is cheap (quick_diff had been ~ms per file; compute_diff is the new cost and only runs while the queue is empty, so a long ingest or OCR run is never paused to scan).
[0.10.3] — 2026-05-19
Added — filesystem watcher + enriched status (queue phase 4)
- rtfm watch [start|stop|status]: a polling daemon that scans every configured source every 30 s (configurable via --poll) and enqueues P1 ingest jobs for new/modified files. Auto-spawns the worker after each scan that found something. Held by an exclusive flock on .rtfm/watcher.lock (one watcher per project), with .rtfm/watcher_state.json for status. Combined with the worker, a file you save now lands in the index within ~30 s, automatically, without any manual rtfm sync.
- Polling (not inotify) chosen on purpose: RTFM frequently indexes Obsidian vaults on /mnt/d/… (NTFS via WSL), where inotify does not propagate events. The poll uses quick_diff (size + mtime, no MD5), so a 30 s tick is cheap even on huge corpora.
- rtfm status shows a new Worker / Queue: section when relevant: worker status (running/idle/busy), current job preview, per-type counts (ingest, embed, ocr) with pending / running / done / failed breakdown. Silent on projects that never used the queue path.
- New module rtfm/core/watcher.py (Watcher, WatcherLock, watcher_running, state primitives). cli_worker.ensure_watcher_running() mirrors ensure_worker_running(). 10 new unit tests in rtfm/tests/test_watcher.py.
Phase 4 closes the queue redesign loop: producers (CLI, hooks, watcher) → priority queue → worker (one process, three priorities, bounded resources). From here on the user can edit a file and the index catches up on its own.
[0.10.2] — 2026-05-19
Added — P3 OCR handler (queue phase 3)
The OCR pass is now a P3 job in the unified worker. Pipeline:
P1 ingest (PDF, ocr_fallback=true)
├─ pdftext yields ≥1 chunk → ingest OK, enqueue P2 follow-up
└─ pdftext yields 0 chunks → enqueue P3 OCR for this same file
P3 ocr
├─ delete the empty book P1 left behind
├─ re-ingest with PDFParser(backend="marker") — marker runs in
│ an isolated subprocess (0.9.5) so its 3-8 GB of model RAM
│ is reclaimed between PDFs
└─ enqueue P2 follow-up for the freshly OCR'd chunks
P3 sits below P1 / P2 in the queue, so a freshly-edited markdown file is always indexed before the worker burns CPU on a slow OCR run.
- handlers.handle_ocr: P3 handler. Drops any empty book P1 left behind, re-ingests with the marker backend, updates indexed_files, then enqueues P2 follow-up so the OCR'd chunks reach the embedding column on their own.
- handlers.handle_ingest (existing) now detects zero-chunk PDFs and auto-enqueues a P3 job iff ocr_fallback: true is set in .rtfm/config.json. Skips the P2 follow-up in that case (no point embedding an empty book).
- rtfm sync --ocr is queue-based by default: persists ocr_fallback: true (idempotent), enqueues a P3 for every previously-flagged scan from .rtfm/seen_scans.json, auto-spawns the worker. The legacy detached ocr-worker daemon is still reachable via rtfm sync --inline --ocr and will be removed in 0.11.
- 3 new tests in rtfm/tests/test_handlers.py (auto-enqueue P3 with fallback on; no P3 with fallback off; reject non-PDF payloads).
Phase 3 closes the queue redesign the user asked for: one process, three priorities (ingest > embed > OCR), per-file granularity for responsive preemption, bounded resources (nice 19 + ionice -c 3 + marker subprocess isolation).
[0.10.1] — 2026-05-19
Added — P2 embed handler (queue phase 2)
The priority-queue worker now drains P2 embed jobs in addition to P1 ingest. The full pipeline is:
producer ─► P1 ingest job ─► worker ─► parse + index + upsert tracking
─► enqueue N P2 jobs (chunks of the new book,
split at EMBED_BATCH_SIZE=64)
producer ─► P2 embed job ─► worker ─► fastembed batch → chunk_embeddings
- Library.embed_chunks_by_id(chunk_ids, model=None): embed a specific list of chunk ids. Skips chunks that already carry an embedding for the active model (idempotent retry). The 500-id chunked filter dodges SQLite's parameter limit even for huge backfills.
- Library.chunk_ids_for_book(slug) and Library.chunk_ids_without_embedding(corpus=None): small helpers used by the P1 follow-up enqueue and by rtfm embed in queue mode.
- handlers.handle_embed: P2 handler. Loads chunk_ids from payload, calls embed_chunks_by_id. Empty payload is a no-op (so a malformed enqueue doesn't fail the job).
- handlers.handle_ingest (existing) now enqueues a P2 batch per EMBED_BATCH_SIZE=64 chunks of the newly-created book, so chunks reach the embedding column on their own, no manual rtfm embed needed.
- rtfm embed is queue-based by default: scans for chunks missing an embedding, splits at EMBED_BATCH_SIZE, enqueues P2 jobs, auto-spawns the worker, returns immediately. --inline and --force keep the legacy blocking path (CI / re-embedding the whole DB).
- 5 new tests in rtfm/tests/test_handlers.py. Fixed an INSERT … ON CONFLICT(chunk_id, model) clause to match the table's actual UNIQUE(chunk_id) constraint.
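Both batch sizes mentioned above rely on the same slicing idea. A sketch, with hypothetical helper names and a one-column stand-in for chunk_embeddings, of splitting ids into P2-sized batches and of filtering already-embedded ids 500 at a time so a single query never carries too many bound parameters:

```python
import sqlite3

EMBED_BATCH_SIZE = 64
SQL_FILTER_BATCH = 500


def batched(ids: list[int], size: int) -> list[list[int]]:
    """Split ids into consecutive slices of at most size elements."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def already_embedded(conn: sqlite3.Connection, chunk_ids: list[int]) -> set[int]:
    """Ids that already have an embedding row, queried in bounded IN (...) lists."""
    found: set[int] = set()
    for part in batched(chunk_ids, SQL_FILTER_BATCH):
        marks = ",".join("?" * len(part))
        rows = conn.execute(
            f"SELECT chunk_id FROM chunk_embeddings WHERE chunk_id IN ({marks})", part
        )
        found.update(row[0] for row in rows)
    return found
```

Skipping the ids returned by already_embedded is what makes a retried embed job idempotent.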
Coming¶
Phase 3 (P3 OCR handler — folds the existing OCR daemon into the unified worker) lands in 0.10.2.
[0.10.0] — 2026-05-19¶
Added — priority-queue worker (MVP / phase 1)¶
The work model moves from "every command blocks on a full-tree sync" to a single in-project background daemon that drains a priority queue. Producers (CLI, hooks, MCP tools) enqueue per-file jobs; the worker picks them up by priority. Ingestion (P1) preempts embeddings (P2) which preempts OCR (P3), so a file you just edited is indexed before any embedding/OCR backlog. Granularity is one file per job, so preemption is responsive (next-job boundary).
- New work_queue table in .rtfm/library.db with priority + status + dedup index on (type, payload) WHERE status='pending', so multiple producers can safely enqueue concurrently.
- rtfm.core.queue.Queue: atomic enqueue / dequeue (single-statement UPDATE ... RETURNING), mark_done / mark_failed, stats / list_pending / list_failed, retry_failed, clear_done. 13 unit tests.
- rtfm.core.worker.Worker: single-threaded loop, dispatch by job type, atomic state snapshot to .rtfm/worker_state.json, exclusive flock on .rtfm/worker.lock so at most one worker drains a project at a time.
- rtfm.core.handlers.handle_ingest: P1 worker handler. Equivalent to the per-file path of the legacy inline sync (parse → ingest → upsert tracking), but isolated to a single file.
- rtfm sync is now queue-based by default: scans configured sources, enqueues P1 jobs for new/modified files, auto-spawns the worker daemon (at nice 19 + ionice -c 3 when available), returns immediately. --inline keeps the legacy blocking sync for CI / scripted use; --ocr, --no-embeddings, --files, explicit path, --dry-run, --force also stay on the legacy path.
- New CLI commands:
  - rtfm worker [start|stop|status]: manage the daemon directly.
  - rtfm worker-daemon: hidden; the actual loop, invoked by ensure_worker_running().
  - rtfm queue [stats|list|failed|clear-done|retry-failed]: inspect & manage the queue (--limit, --keep).
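The dedup index on (type, payload) WHERE status='pending' is a partial unique index: it constrains pending rows only, so a finished job never blocks a fresh enqueue of the same file. The sketch below illustrates that behaviour with the standard sqlite3 module. The column set, the index name and the enqueue function are assumptions made for illustration, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work_queue (
    id       INTEGER PRIMARY KEY,
    type     TEXT NOT NULL,
    payload  TEXT NOT NULL,
    priority INTEGER NOT NULL,
    status   TEXT NOT NULL DEFAULT 'pending'
);
-- Partial unique index: at most one *pending* row per (type, payload).
CREATE UNIQUE INDEX idx_pending_dedup
    ON work_queue(type, payload) WHERE status = 'pending';
""")


def enqueue(job_type: str, payload: str, priority: int) -> bool:
    """Insert a job; return False when an identical pending job exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO work_queue(type, payload, priority) "
        "VALUES (?, ?, ?)",
        (job_type, payload, priority),
    )
    return cur.rowcount == 1


first = enqueue("ingest", "notes/a.md", 1)
duplicate = enqueue("ingest", "notes/a.md", 1)  # ignored: a pending twin exists
conn.execute("UPDATE work_queue SET status = 'done' WHERE payload = 'notes/a.md'")
again = enqueue("ingest", "notes/a.md", 1)      # accepted: no pending twin left
print(first, duplicate, again)  # True False True
```

Letting the database reject the duplicate means producers need no coordination among themselves before enqueueing.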
Coming¶
Phase 2 (P2 embed handler — chunks-without-embeddings as scheduled jobs) and Phase 3 (P3 OCR handler — folds the existing OCR daemon into the unified worker) will land in 0.10.1 / 0.10.2.
[0.9.5] — 2026-05-18¶
Fixed¶
- OCR no longer accumulates RAM across PDFs. marker.models.create_model_dict() loads 3-8 GB of ML state (layout + OCR + table + reading-order pipelines) and caches it at module level; marker never releases it. The old in-process loop in extract_with_marker() re-loaded those models for every PDF without freeing the previous run, so a long rtfm sync --ocr on WSL (16 GB cap) climbed past the ceiling, swapped on NTFS, and froze the whole VM. Now each PDF is OCR'd in a one-shot Python subprocess (subprocess.run); the OS reclaims the full footprint when the child exits. Adds a 20-min per-PDF timeout (PDFExtractionError instead of an indefinite hang) and a structured JSON protocol between worker and host. 3 new tests in rtfm/tests/test_pdf_parser.py.
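The one-shot subprocess pattern can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the child here is a stub that echoes a JSON document instead of running marker, the JSON field names are invented, and PDFExtractionError is redefined locally so the block is self-contained.

```python
import json
import subprocess
import sys

# Stand-in for the real OCR child: it only echoes a JSON result.
CHILD_CODE = r"""
import json, sys
path = sys.argv[1]
print(json.dumps({"ok": True, "path": path, "text": "stub text"}))
"""


class PDFExtractionError(RuntimeError):
    """Raised when the child times out, fails, or returns malformed output."""


def ocr_in_subprocess(pdf_path: str, timeout_s: float = 20 * 60) -> dict:
    """Run one extraction in a child process so its memory dies with it."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, pdf_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        raise PDFExtractionError(f"OCR timed out after {timeout_s}s") from exc
    if proc.returncode != 0:
        raise PDFExtractionError(proc.stderr.strip() or "OCR child failed")
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise PDFExtractionError("malformed child output") from exc


result = ocr_in_subprocess("scan_45.pdf")
print(result["ok"], result["path"])
```

Whatever the child allocates is returned to the OS when it exits, which is the property the fix relies on.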
[0.9.4] — 2026-05-18¶
Changed¶
- Claude Code hooks: targeted per-turn sync instead of full-tree rescan. Before 0.9.4 the UserPromptSubmit and Stop hooks both iterated every configured source (e.g. 35 sources for a multi-vault project) at every turn: ~30–60s per hook, fighting an rtfm sync --ocr daemon for SQLite write locks on multi-session setups and producing 100+ redundant scans per hour. The new design is event-driven:
  - New PostToolUse hook (rtfm_record_edit.py, matcher Write|Edit|MultiEdit|NotebookEdit) appends each touched file_path to .rtfm/touched_files.tmp in O(1).
  - Stop (rtfm_stop_sync.py) reads that queue, groups files by their longest-matching configured source, and runs sync(files=[...]) only for those files. Empty queue → instant no-op.
  - UserPromptSubmit (rtfm_sync.py) is now just a safety-net drain for orphan queues left behind by sessions abandoned before their Stop hook ran.
  - Net effect: zero-cost hooks on turns with no edits; sub-second sync on turns with 1–5 edits; never re-scans untouched sources; no more lock contention with the OCR daemon.
  - hooks.json updated to register PostToolUse.
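Grouping touched files by their longest-matching configured source could look like the sketch below. The function name, the POSIX-style paths and the choice to drop files outside every source are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def group_by_source(files: Iterable[str], sources: Iterable[str]) -> Dict[str, List[str]]:
    """Map each file to the deepest configured source that contains it."""
    source_paths = sorted((PurePosixPath(s) for s in sources),
                          key=lambda p: len(p.parts), reverse=True)
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in files:
        path = PurePosixPath(name)
        for src in source_paths:  # deepest sources are tried first
            if src == path or src in path.parents:
                groups[str(src)].append(name)
                break  # files outside every source are dropped
    return dict(groups)


touched = ["/vault/projects/a.md", "/vault/notes.md", "/tmp/scratch.txt"]
grouped = group_by_source(touched, ["/vault", "/vault/projects"])
print(grouped)
```

Trying the deepest source first matters when sources nest: a file under /vault/projects is synced once, by the more specific source, rather than by /vault as well.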
[0.9.3] — 2026-05-18¶
Fixed¶
- Sync no longer drops embeddings with "Model paraphrase-multilingual-MiniLM-L12-v2 is not supported in TextEmbedding" on DBs created by older RTFM versions. Early releases stored the short, unqualified model name (paraphrase-multilingual-MiniLM-L12-v2) in chunk_embeddings.model. Recent fastembed releases only accept the fully-qualified sentence-transformers/... form, so reusing the DB's active model on a fresh sync threw mid-batch and silently disabled embedding generation for every new chunk. resolve_model() now suffix-matches a short name back to the registered fully-qualified entry, and Library.generate_embeddings() normalizes the DB-stored name through resolve_model before handing it to fastembed. 4 new tests in rtfm/tests/test_embeddings.py::TestResolveModel.
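A suffix match of this kind can be sketched in a few lines. The registry list below is a hypothetical stand-in, and the real resolve_model may differ in signature and error handling; the sketch matches on the last path component and refuses ambiguous names.

```python
from typing import Iterable

# Hypothetical registry used only for this illustration.
REGISTERED_MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "BAAI/bge-small-en-v1.5",
]


def resolve_model(name: str, registered: Iterable[str] = REGISTERED_MODELS) -> str:
    """Return the fully-qualified entry for `name`, trying an exact match first."""
    candidates = list(registered)
    if name in candidates:
        return name
    # Suffix match: compare against the part after the last "/".
    matches = [m for m in candidates if m.split("/")[-1] == name]
    if len(matches) == 1:
        return matches[0]
    raise ValueError(f"unknown or ambiguous model name: {name!r}")


print(resolve_model("paraphrase-multilingual-MiniLM-L12-v2"))
```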
[0.9.2] — 2026-05-18¶
Fixed¶
- move_file() no longer crashes with UNIQUE constraint failed: indexed_files.filepath. Previously the cross-corpus move pass did a plain DELETE + INSERT on the tracking table, which raised mid-sync as soon as the target filepath already had a row (typical when only the corpus name changes in config.json and every cross-move has old_filepath == new_filepath). The DELETE had already run when the INSERT threw, so the books row was repointed at the new corpus but its tracking entry was gone, leaving thousands of "orphan" books with no indexed_files mapping. Replaced with an INSERT ... ON CONFLICT(filepath) DO UPDATE (same pattern as update_indexed_file) and an explicit DELETE old_filepath only when it differs from new_filepath. 2 new regression tests in rtfm/tests/test_cross_corpus_move.py (test_corpus_rename_in_place_no_unique_conflict reproduces the user-facing scenario; test_move_file_preexisting_target_filepath is a belt-and-braces unit test). Full suite: 501 passed.
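The upsert-then-conditional-delete pattern described above can be sketched with the standard sqlite3 module. The three-column table and the function signature are simplifications assumed for this example; the real indexed_files table carries more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indexed_files ("
    " filepath TEXT PRIMARY KEY, corpus TEXT NOT NULL, book_id INTEGER NOT NULL)"
)
conn.execute("INSERT INTO indexed_files VALUES ('notes/a.md', 'old', 1)")


def move_file(old_filepath: str, new_filepath: str, new_corpus: str, book_id: int) -> None:
    """Repoint a tracking row without ever leaving the book untracked."""
    with conn:  # one transaction: both statements apply, or neither does
        conn.execute(
            "INSERT INTO indexed_files(filepath, corpus, book_id) VALUES (?, ?, ?) "
            "ON CONFLICT(filepath) DO UPDATE SET "
            "corpus = excluded.corpus, book_id = excluded.book_id",
            (new_filepath, new_corpus, book_id),
        )
        if old_filepath != new_filepath:
            conn.execute("DELETE FROM indexed_files WHERE filepath = ?", (old_filepath,))


move_file("notes/a.md", "notes/a.md", "new", 1)  # corpus rename in place
rows = conn.execute("SELECT * FROM indexed_files").fetchall()
print(rows)  # [('notes/a.md', 'new', 1)]
```

With the upsert first and the delete guarded, the in-place corpus rename touches exactly one row and cannot strand a book.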
[0.9.1] — 2026-05-18¶
Fixed¶
- MCP tools now coerce numeric params passed as strings. Some MCP clients/LLMs emit "limit": "5" instead of "limit": 5; downstream comparisons like len(results) >= limit in library.search() then crashed with TypeError: '>=' not supported between instances of 'int' and 'str'. Affected rtfm_search, rtfm_context, rtfm_books, rtfm_expand, and rtfm_history. New _coerce_int / _coerce_float helpers in rtfm/mcp.py cast incoming values, fall back to the documented default on unparseable input, and reject bool (which is a subclass of int in Python). 9 new regression tests in rtfm/tests/test_mcp.py. Full suite: 487 passed.
[0.9.0] — 2026-05-18¶
Added¶
- rtfm sync --ocr runs as a detached background daemon. Marker-based OCR takes minutes per scanned PDF, hours for a real corpus; the previous foreground implementation died with the terminal or the Claude Code hook timeout, losing the entire run. The command now: (1) refuses to relaunch if another daemon is already running (shows live progress and PID instead), (2) persists ocr_fallback: true in .rtfm/config.json, (3) invalidates the hash of every PDF in .rtfm/seen_scans.json so the worker's incremental sync re-ingests them, (4) forks a subprocess.Popen(..., start_new_session=True) worker (immune to parent SIGHUP) and exits immediately with the daemon's PID. New internal rtfm ocr-worker subcommand does the actual sync.
- Resumable: the worker writes its live state to .rtfm/ocr_state.json (atomic temp+rename) with pid, status, total, done, current_file, started_at, last_update. If the daemon is killed mid-run, the next rtfm sync --ocr resumes from where the incremental sync left off: files already OCR'd have a real hash and are skipped.
- rtfm status now surfaces the OCR daemon when one is present:
  - Live: OCR running (PID 12345): 23/156 PDFs (15%), 1h20m elapsed, ETA ~6h\n current: scan_45.pdf
  - Dead-but-resumable: OCR interrupted at 23/156 (...). Resume: rtfm sync --ocr
- /rtfm.status and /rtfm.ocr slash command prompts updated to highlight the daemon state and never wait/poll.
- New module rtfm/core/ocr_daemon.py exposes the helpers (pid_alive, read_state, write_state, daemon_running, format_progress) and the on-disk ocr_state.json schema.
- 14 new unit tests in rtfm/tests/test_ocr_daemon.py cover PID liveness, atomic write semantics, the running-detection logic, malformed-JSON tolerance, and the progress renderer for running/crashed states. Full suite: 475 passed.
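The atomic temp+rename write and the malformed-JSON tolerance can be sketched as follows. The helper names match those listed above, but the bodies are assumptions written for illustration, not the contents of rtfm/core/ocr_daemon.py.

```python
import json
import os
import tempfile
from pathlib import Path


def write_state(path: Path, state: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over `path`."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except BaseException:
        os.unlink(tmp_name)  # never leave a stray temp file behind
        raise


def read_state(path: Path) -> dict:
    """Return the stored state, or {} when the file is missing or malformed."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return {}


with tempfile.TemporaryDirectory() as root:
    state_path = Path(root) / "ocr_state.json"
    write_state(state_path, {"pid": 12345, "status": "running", "done": 23, "total": 156})
    roundtrip = read_state(state_path)
    state_path.write_text("{truncated", encoding="utf-8")  # simulate a corrupt file
    after_corruption = read_state(state_path)

print(roundtrip["done"], after_corruption)
```

A reader polling the file sees either the previous complete snapshot or the new one, never a half-written document.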
Changed¶
- rtfm sync --ocr no longer accepts running in the foreground. (If you really need a foreground run for debugging, invoke rtfm ocr-worker directly; it's hidden from --help but documented in rtfm/core/ocr_daemon.py.)
[0.8.9] — 2026-05-18¶
Added¶
- Cross-corpus move detection by content hash. When a file is reorganised across corpus boundaries (e.g. moved from an Obsidian Projets/ into Publications/ when those map to different RTFM corpora), compute_diff() now spots the hash match against library.list_indexed_files() (all corpora) and transfers ownership instead of treating the file as deleted-in-A + added-in-B. The book row is updated in place, so chunks, embeddings, and tags all survive (they reference chunk_id, not the on-disk path). Critical when expensive computation has already been done: semantic embeddings, OCR output, manual tagging.
- library.move_file(..., new_corpus=...) is the new entry point. The same on-disk filepath cannot belong to two corpora at once (table constraint), so this is also a safe partition guarantee.
- 3 new tests in rtfm/tests/test_cross_corpus_move.py covering chunk-id preservation across the move, regression on in-corpus moves, and the "really new file" path. Full suite: 461 passed.
[0.8.8] — 2026-05-18¶
Added¶
- /rtfm.status slash command. Wraps rtfm status --health so the user can check index health from the Claude Code / menu without dropping to a terminal. Returns the full status (books, chunks, corpora, embeddings, last sync, parsers, extras) plus pending-sync counts and known scan suspects. Defined in commands/rtfm.status.md.
[0.8.7] — 2026-05-17¶
Fixed¶
- Slash command moved to the correct location and renamed to /rtfm.ocr. In 0.8.6 the file lived at .claude-plugin/commands/ocr.md, which is not a directory scanned by Claude Code; plugin slash commands must sit in commands/ at the plugin root (per the official Plugins reference). Renamed to commands/rtfm.ocr.md, so the command surfaces as /rtfm.ocr in the slash menu once the marketplace plugin is updated (/plugin marketplace update roomi-fields then reinstall rtfm@roomi-fields).
[0.8.6] — 2026-05-17¶
Added¶
- /rtfm:ocr slash command. Users who install RTFM via /plugin install rtfm@roomi-fields now get a Claude Code slash command that wraps rtfm sync --ocr: pick it from the / menu, the agent runs the command, summarises results, and confirms persistent OCR fallback is active. Defined in .claude-plugin/commands/ocr.md.
Fixed¶
- rtfm sync --ocr now works from any directory. When invoked outside a .rtfm/ project (no config to persist into), the flag still forces ocr_fallback=True for the current run. Previously it was silently ignored: the persistent flag could only be saved when a .rtfm/ was reachable, and the run itself fell back to pdftext-only.
[0.8.5] — 2026-05-17¶
Added¶
- One-shot rtfm sync --ocr: persistent OCR fallback for scanned PDFs. Activates an ocr_fallback: true flag in .rtfm/config.json and re-runs sync with force=True so previously-empty scans get OCR'd immediately. From then on, every sync (CLI or auto via hook) instantiates PDFParser(backend='auto') for PDFs: it tries pdftext first (fast, ~ms) and only falls back to marker-pdf (slow OCR) when no text was extractable. The user runs the command once; new scans added to indexed sources are OCR'd automatically by the next sync. Successfully OCR'd files drop off .rtfm/seen_scans.json so rtfm status reflects the real remaining backlog.
- PDFParser gains a backend='auto' mode that does the pdftext → marker fallback in-process. Existing pdftext and marker modes are unchanged. Picks the cheap backend by default; only spends OCR cycles on real scans.
- Periodic progress reporting inside sync(). New progress_interval parameter (seconds) emits a heartbeat line via on_progress("progress", "", "K/N files, Xmin elapsed, ~Ymin remaining") while the inner loop runs. CLI auto-enables a 10-minute interval when --ocr is set; --progress-every N overrides. Long OCR passes no longer look frozen.
- ACTION REQUIRED blocks now propose a concrete copy-pastable command. Both the MCP rtfm_sync tool and the auto-sync hook print ON APPROVAL RUN: rtfm sync --ocr (instead of the previous "install [pdf] and re-sync" phrasing) and explicitly tell the user that the command is one-shot: future scans are handled automatically.
Changed¶
- The hook (UserPromptSubmit + Stop) reads ocr_fallback from .rtfm/config.json and propagates it to the inner sync() call, so the auto-sync respects the persistent flag.
- _print_health_warnings() now adapts its message: when OCR fallback is already on but scans still survive, it tells the user the PDFs are likely corrupt rather than re-suggesting OCR.
[0.8.4] — 2026-05-17¶
Fixed¶
- rtfm status and the auto-sync hook no longer block on remote/NTFS sources. 0.8.3 reduced the status-health diff from "hash every file" to "stat every file"; on a small local repo that's instant, but on a 1700-file Obsidian vault sitting on NTFS via WSL even os.stat() adds up to ~90 seconds per source. Two changes:
  - rtfm status now keeps the index-health pending counts behind an opt-in --health flag. The default rtfm status runs in well under a second again, and known scan suspects (a single JSON read) are still shown unconditionally.
  - The UserPromptSubmit hook bounds its pre-sync diff to a 2-second total budget. If the budget is exhausted before all sources are scanned, the "indexing N files" announcement is silently skipped and the actual sync proceeds normally; the post-sync ✓ RTFM sync summary still fires.
[0.8.3] — 2026-05-17¶
Fixed¶
- rtfm status no longer hangs on large corpora. The "Index health" section introduced in 0.8.1 ran sync(..., dry_run=True) for every configured source, which computes the MD5 of every tracked file: fine on a small repo, but a hard wait on corpora with hundreds of large PDFs (e.g. research libraries). Replaced by a new quick_diff() helper in rtfm/core/sync.py that compares path presence + on-disk st_size against the stored tracking metadata. The same helper now also feeds the UserPromptSubmit hook's "indexing N files" announcement. Trade-off: an in-place edit that does not change the file size can be missed by quick_diff; the real rtfm sync still uses the hash diff for correctness.
- Tests: 3 new in rtfm/tests/test_sync_health.py covering the added / modified-by-size / removed paths of quick_diff.
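The presence-plus-size comparison can be sketched as below. The signature (a dict of tracked sizes and a list of on-disk paths) is an assumption for illustration; the real quick_diff() in rtfm/core/sync.py reads its tracking metadata from the library.

```python
import os
import tempfile
from typing import Dict, Iterable, List, Tuple


def quick_diff(tracked: Dict[str, int],
               on_disk: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Classify paths as added / modified / removed using stat() only."""
    paths = list(on_disk)
    present = set(paths)
    added = [p for p in paths if p not in tracked]
    modified = [p for p in paths
                if p in tracked and os.stat(p).st_size != tracked[p]]
    removed = [p for p in tracked if p not in present]
    return added, modified, removed


with tempfile.TemporaryDirectory() as root:
    same, grown, new = (os.path.join(root, n) for n in ("same.md", "grown.md", "new.md"))
    for path, text in ((same, "abc"), (grown, "abcdef"), (new, "x")):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    # Stored sizes: grown.md was 3 bytes when indexed, gone.md no longer exists.
    tracked = {same: 3, grown: 3, os.path.join(root, "gone.md"): 10}
    added, modified, removed = quick_diff(tracked, [same, grown, new])

print(len(added), len(modified), len(removed))  # 1 1 1
```

No file content is read, which is why a same-size edit slips through and the hash diff stays in charge of the real sync.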
[0.8.2] — 2026-05-17¶
Fixed¶
- rtfm.__version__ no longer reports "0.0.0" to installed users. rtfm/__init__.py was looking up importlib.metadata.version("rtfm") but the distribution name on PyPI is rtfm-ai (the rtfm import name was already taken by an unrelated package). The lookup raised PackageNotFoundError silently and fell back to "0.0.0", which leaked into every place that reads __version__: the CLI, the MCP server stats output, and rtfm status. Now uses version("rtfm-ai") and adds a regression test (rtfm/tests/test_version.py) that fails if __version__ drifts from pyproject.toml.
[0.8.1] — 2026-05-17¶
Added¶
- Sync health signals: RTFM no longer swallows scanned PDFs silently. SyncResult now exposes suspect_scans (PDFs that parsed without error but produced 0 chunks, almost always image-only scans needing OCR) and empty_files (other 0-chunk parses). The CLI, MCP server and the auto-sync hook all surface this state instead of silently treating it as a successful sync.
  - rtfm sync (CLI) prints a localized warning block listing the suspect PDFs and the OCR install path.
  - rtfm_sync (MCP) emits an "ACTION REQUIRED — surface to the user verbatim" block, the same format used when the pdf extra is missing, so the agent raises it with the user instead of moving on.
  - UserPromptSubmit hook dry-runs the diff first; announces "→ RTFM: indexing N files..." when there are ≥ 50 new/modified files, prints "✓ RTFM sync: +A ~M -R files (Xs)" when something actually changed, and forwards new scan warnings as the same ACTION REQUIRED block. Already-reported scans are tracked in .rtfm/seen_scans.json so the warning does not repeat on every turn.
- rtfm status: new "Index health" section. Reports pending added / modified / removed files relative to the configured sources (best-effort dry-run) and known scan suspects. Answers the question "is my index up to date?" in one command.
- Tests: 9 new in rtfm/tests/test_sync_health.py covering SyncResult shape, sync-time classification, the CLI warning helper, and the MCP ACTION REQUIRED block. Full suite: 448 passed, 17 skipped.
[0.8.0] — 2026-05-16¶
Added¶
- 7 new document parsers: ebook and office formats. RTFM now indexes EPUB, MOBI/AZW/AZW3, FB2, DJVU, DOCX, ODT, and RTF in addition to the existing 15 formats.
  - epub (extra [epub]: ebooklib, beautifulsoup4): walks the spine in reading order, one chunk group per chapter, OPF title/author lifted into metadata.
  - mobi_parser (extra [mobi]: mobi, beautifulsoup4): Kindle MOBI/AZW/AZW3, DRM-free only; DRM-protected files surface a clean MOBIExtractionError.
  - fb2: FictionBook XML, zero external dependency (stdlib xml.etree). Sections become chapters, <title-info> becomes title/author.
  - djvu: DJVU via the djvutxt system binary from djvulibre-bin (no Python dep), one chunk group per page.
  - docx (extra [office]: python-docx, odfpy, striprtf): paragraphs walked in document order, Heading 1/2/3 styles cut sections, tables flattened to cell | cell. core_properties.title / author lifted into metadata.
  - odt (extra [office]): same shape as docx, sections cut by text:h with text:outline-level. Metadata via dc:title / dc:creator.
  - rtf (extra [office]): text-only extraction via striprtf; RTF has no native hierarchy so chunking is paragraph-based.
- Shared chunking helpers in rtfm/parsers/_chunking.py (split_into_paragraphs, merge_short_paragraphs, split_on_sentence, slugify, content_hash, estimate_page). New parsers reuse these; the older markdown.py and pdf.py keep their own copies for now (no behaviour change).
- New tests: rtfm/tests/test_ebook_parsers.py and rtfm/tests/test_office_parsers.py. Fixtures synthesise minimal files in-process; tests importorskip cleanly when an optional dep is absent.
[0.7.2] — 2026-05-06¶
Fixed¶
- MCP server connection: bin/rtfm-serve now executable. The shell launchers (rtfm-serve, rtfm-hook, rtfm-install-extras) were checked into git with mode 100644 (no exec bit) because they were authored on a WSL/NTFS filesystem that does not preserve the POSIX exec bit. Claude Code clones plugins respecting the git index modes, so on Linux/macOS the MCP server failed to start with no helpful error in the /plugin UI ("rtfm MCP · failed"). Index permissions are now 100755 for the three shell launchers; .cmd siblings keep 100644 (Windows ignores the exec bit). To receive the fix: /plugin marketplace update roomi-fields then /reload-plugins.
[0.7.1] — 2026-05-06¶
Changed¶
- Distribution: marketplace consolidated. The standalone roomi-fields/rtfm marketplace is retired; RTFM now ships exclusively through the aggregator marketplace roomi-fields/claude-plugins. Install command changes: /plugin marketplace add roomi-fields/claude-plugins then /plugin install rtfm@roomi-fields. The plugin itself is unchanged: same bin/rtfm-serve, same hooks, same skills. Existing users of the standalone marketplace should run /plugin marketplace remove rtfm and re-install via the aggregator.
No code changes — the wheel is byte-identical to 0.7.0. This release exists to carry the version bump in .claude-plugin/plugin.json and signal the marketplace migration to PyPI users via the release feed.
[0.7.0] — 2026-05-04¶
Added¶
- Generic JSON schema mappings: declaratively map any JSON schema to chunks and edges via YAML files in .rtfm/mappings/, no Python required. Drop a mapping file (matched by $schema URL or by a discriminator like type: foo) and matching JSON files are extracted into typed chunks at sync time. The system replaces what would otherwise be N format-specific parsers (NotebookLM exports, Linear/Jira dumps, OpenAPI specs, structured logs…) with one extensibility point that lives outside RTFM. Mini-templating engine ({{ dotted.path }} only; no eval, no Jinja). 35 new tests, zero new dependencies. See docs/json-mappings.md.
- NotebookLM integration recipe: docs/notebooklm-integration.md covers both the zero-friction markdown path and the typed JSON path, with a ready-to-copy nblm-answer.yaml mapping for notebooklm-mcp batch outputs.
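A {{ dotted.path }} substitution with no eval can be sketched with a single regular expression. The function names and the choice to render missing keys as an empty string are assumptions for this example; docs/json-mappings.md remains the reference for the actual behaviour.

```python
import re
from typing import Any, Mapping

_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(data: Any, dotted: str) -> Any:
    """Walk `data` along a dotted path; missing keys yield an empty string."""
    current = data
    for key in dotted.split("."):
        if isinstance(current, Mapping) and key in current:
            current = current[key]
        else:
            return ""
    return current


def render(template: str, data: Mapping[str, Any]) -> str:
    """Substitute every {{ dotted.path }} placeholder; nothing is evaluated."""
    return _PLACEHOLDER.sub(lambda m: str(lookup(data, m.group(1))), template)


doc = {"question": {"text": "What is RTFM?"}, "answer": {"score": 0.9}}
rendered = render("Q: {{ question.text }} (score {{answer.score}})", doc)
print(rendered)  # Q: What is RTFM? (score 0.9)
```

Restricting placeholders to key lookups keeps mapping files declarative: a YAML file can select data but cannot run code.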
Changed¶
- JSONParser consults MappingRegistry.find_mapping(data) before falling back to the generic structural parser. Plain JSON files are unaffected.
- Library.__init__ autoloads mappings from <db_dir>/mappings/*.{yaml,yml,json}.
[0.6.0] — 2026-05-04¶
Added¶
- SQLite parser (.sqlite, .sqlite3, .db): read-only URI connection. Emits an overview chunk (tables, views, indexes, triggers + row counts), then per-table schema + sample chunks. Foreign keys extracted as EdgeCandidate(relation_type="fk"). FTS5 shadow tables filtered. .db extension validated by SQLite magic bytes to avoid false positives.
- Jupyter parser (.ipynb): groups cells by markdown heading, code cells wrapped in python code fences, outputs dropped (often huge / low-signal). Zero deps.
- TOML parser (.toml): one chunk per top-level table; emits depends_on edges for pyproject.toml (PEP 621, Poetry, build-system) and Cargo.toml. Uses stdlib tomllib (3.11+) with tomli fallback; gracefully unregistered if neither is importable.
- CSV/TSV parser (.csv, .tsv): dialect sniffing (delimiter), overview chunk with column types via lightweight inference (int/float/bool/text), sample chunk (first N rows aligned). Streams rows so big files don't blow memory.
- XLSX parser (.xlsx): per-workbook overview + per-sheet schema + per-sheet sample. Optional dependency: pip install rtfm-ai[xlsx] (openpyxl). Uses read_only=True for huge workbooks.
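Validating a .db file by magic bytes relies on the fixed 16-byte header string that starts every SQLite 3 database file. A minimal sketch of such a check, with looks_like_sqlite as a hypothetical helper name:

```python
import sqlite3
import tempfile
from pathlib import Path

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def looks_like_sqlite(path: Path) -> bool:
    """Return True when the file starts with the SQLite 3 header string."""
    try:
        with path.open("rb") as handle:
            return handle.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC
    except OSError:
        return False


with tempfile.TemporaryDirectory() as root:
    real = Path(root) / "library.db"
    conn = sqlite3.connect(real)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()
    conn.close()
    fake = Path(root) / "thumbs.db"  # same extension, unrelated content
    fake.write_bytes(b"not a database")
    real_ok, fake_ok = looks_like_sqlite(real), looks_like_sqlite(fake)

print(real_ok, fake_ok)  # True False
```

The check costs one 16-byte read, so it can run on every .db candidate before a connection is attempted.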
Changed¶
- Parser count: 10 → 15.
- pyproject.toml: new optional extras [xlsx] (openpyxl).
[0.5.0] — 2026-04-16¶
Added — native Claude Code plugin¶
- /plugin marketplace add roomi-fields/rtfm + /plugin install rtfm@rtfm: zero pip required on user side.
- Pure-Python MCP server (rtfm/_mcp/, ~300 LOC): drops the upstream mcp SDK, no pydantic, no cryptography, no native binaries. JSON-RPC 2.0 over stdio, schemas inferred from type hints + docstrings.
- Cross-platform launchers (bin/): POSIX sh + Windows .cmd, auto-resolve python3 / python / py, dodge the Microsoft Store python3 stub.
- Plugin hooks: SessionStart bootstraps the project, UserPromptSubmit throttled sync (30s), Stop final sync.
- Skills: /rtfm:search, /rtfm:expand, /rtfm:install-embeddings (FastEmbed ONNX ~85 MB), /rtfm:install-pdf (~50 MB), /rtfm:install-pdf-full (CPU-only torch + marker-pdf, ~1.5 GB, isolated venv in $CLAUDE_PLUGIN_DATA, no PEP 668 conflicts).
Fixed¶
- Short files no longer silently skipped: single-header markdown, title-only LaTeX sections, Python modules without classes, short legal articles. Affects markdown, pdf, python, latex, xml_legifrance, html_bofip.
- Memory history preserved on file deletion: sync(retain_history=None) no longer cascades deletes through books.id → file_versions.book_id. Restores the "unlimited version history" promise of the memory hook. Default (retain_history=50) unchanged.
Changed¶
- Dropped mcp>=1.0.0 dependency. Only pyyaml remains.
- README: plugin install promoted to primary path; pip install rtfm-ai kept as fallback for Cursor, Codex, Claude Desktop chat, other MCP clients.
[0.4.0] — 2026-04-09¶
Added — Obsidian Vault Integration¶
- rtfm vault command: detects Obsidian vaults (.obsidian/), auto-proposes corpus mappings from folder structure, generates _rtfm/ navigation files (Obsidian-native: wikilinks, YAML frontmatter Dataview-queryable, callouts, Mermaid).
- Wikilink resolution: [[wikilinks]] resolved to actual files following Obsidian rules (basename match case-insensitive, path-suffix [[folder/Note]], disambiguation by path distance). Resolved links become graph edges → powers hub detection + centrality ranking.
- _rtfm/ auto-generated navigation: index.md (corpus list, top connected docs), graph.md (hubs, orphans, broken links, Mermaid), recent.md (auto-updates on sync), corpus/*.md (per-corpus indexes).
- Karpathy 3-layer repo restructure: raw/ (source), docs/ (compiled wiki), CLAUDE.md (schema).
- Docs: Obsidian Vault Guide, Architecture, Parsers Guide, Positioning.
Stats¶
- 357 tests pass, 0 regressions; 32 new tests (wikilink + vault integration); 7,100+ LOC added.
[0.3.1] — 2026-03-01¶
Changed¶
- rtfm_expand reads raw file lines: content is now read from disk between line_start and line_end, guaranteeing line numbers match Read / Edit exactly.
- Strict path resolution: rtfm_expand uses exact path matching instead of fuzzy slug lookup. No more ambiguous results from duplicate files.
- CLAUDE.md template mentions rtfm_expand: guides agents to use rtfm_search then rtfm_expand instead of defaulting to Read.
- Batch corpus resolution: search formatting resolves corpus paths in a single query instead of per-result SQL.
Fixed¶
- Markdown/LaTeX parser line_start off-by-one: content line numbers now point to the first content line after the header.
- Double search removed in expand query mode: it was falling back to unscoped search, causing irrelevant matches.
Added¶
- count parameter for rtfm_expand: read multiple consecutive chunks in one call.
- End-to-end search→expand→Edit test: proves line numbers from expand match the real file.
[0.3.0] — 2026-02-27¶
Removed¶
- biblirag dissociation: removed all RAG/question-answering code (ask.py, llm.py, cmd_ask, Citation, GroundingResult, Answer models). RTFM is now a pure retrieval layer.
- Legacy code: removed src/ (biblirag legacy), config/, extract.py, query.py, requirements.txt.
- Gemini dependency: no more LLM client code. RTFM indexes and retrieves; generation is the agent's job.
[0.2.3] — 2026-02-25¶
Fixed¶
- Dynamic version: __version__ now reads from importlib.metadata instead of a hardcoded string, so it stays in sync with pyproject.toml.
- rtfm_books pagination: MCP tool now returns per-corpus summary + paginated listing (default 50 books/page) with limit / offset params. Previously dumped all books at once (~18k tokens for large repos).
[0.2.2] — 2026-02-24¶
Fixed¶
- Auto-enable MCP in Claude Code settings: rtfm init now adds rtfm to enabledMcpjsonServers in .claude/settings.json and .claude/settings.local.json. Previously the server was configured in .mcp.json but not activated, causing it to silently disappear from /mcp.
- Simplified CLAUDE.md template: replaced verbose 30-line workflow with concise 4-line instruction (search, Read, Edit). Less prescriptive, better agent compliance.
- CLI progressive disclosure: rtfm search now deduplicates results by source and shows metadata-only output with absolute file paths, matching the MCP server format.
- Semantic search slug extraction: fixed slug parsing in library.py for semantic search results.
[0.2.0] — 2026-02-21¶
Added¶
- Config auto-detection: .rtfm/ directory found automatically (like .git/), no more --db on every command
- Source management: rtfm add, rtfm sources to register directories for recurring sync
- Multi-source sync: rtfm sync (no args) syncs all registered sources from .rtfm/config.json
- rtfm serve: start MCP server directly from CLI (replaces python -m rtfm.mcp)
- rtfm context / rtfm expand: CLI commands for progressive disclosure
- rtfm monitor: tail live MCP and hook activity
- Progressive disclosure in MCP: search/context return metadata-only (file paths, scores, chunk counts), expand returns full content
- Absolute path resolution: search results include absolute file paths so agents can Read() directly
- End-of-content marker: expand output ends with ⏹ to prevent "file seems truncated" false positives
- Dual auto-sync hooks: UserPromptSubmit (every 30s) + Stop (final sync)
- Corpus-prefixed slugs: FR/EN translations get distinct slugs (e.g. published--b4-flags vs published-en--b4-flags)
- Language in search results: lang: fr / lang: en shown when available from frontmatter
Changed¶
- FTS as default search: rtfm_search defaults to search_type="fts" instead of "hybrid" (avoids 6min MiniLM cold start)
- Data/instruction separation: search results contain pure data (file paths, slugs, scores), no inline instructions
- CLAUDE.md template: simplified to "RTFM first, then Read", "NEVER Glob for research"
- Hook architecture: simplified from 4 hooks to 2 (UserPromptSubmit + Stop)
Removed¶
- rtfm_remember tool: replaced by scratch files + auto-sync (simpler, same result)
- Inline rtfm_expand() hints in search results: replaced by file: / slug: pure data fields
Performance (benchmarked on real tasks)¶
- -51% cost vs no-RTFM ($11.14 vs $22.61)
- -16% duration (6m58s vs 8m16s)
- -61% tokens (3.22M vs 8.21M)
[0.1.0] — 2026-02-15¶
Added¶
- Full-text search with SQLite FTS5 (porter stemming)
- Semantic search with sentence embeddings (paraphrase-multilingual-MiniLM-L12-v2)
- Hybrid search (FTS5 + semantic)
- 10 smart parsers: Markdown, Python (AST), LaTeX, YAML, JSON, Shell, PDF, Legifrance XML, BOFiP HTML, plain text
- MCP server with tools: rtfm_search, rtfm_context, rtfm_discover, rtfm_stats, rtfm_sync, rtfm_ingest, rtfm_tags, rtfm_books, rtfm_tag_chunks, rtfm_remove
- rtfm init: one-command project setup (database, .mcp.json, CLAUDE.md, auto-sync hook, .gitignore)
- rtfm_context: progressive disclosure for AI agents (lazy indexing, hybrid search)
- rtfm_discover: fast project structure scan (~1 second)
- Incremental sync with file hash tracking and corpus isolation
- Auto-sync hook for Claude Code (UserPromptSubmit, throttled to 30s)
- Background embedding generation in MCP server (model cached in memory)
- Multi-corpus support for organizing documents by source
- Tag management (manual + batch tagging)
- Article versioning for legal documents (history, date lookup, diff)
- CLI with search, semantic-search, stats, status, sync, init, embed, books, corpora, tags, schema commands
- Python API (Library, SearchResults with to_dict/to_json/to_markdown/to_prompt)
- LLM-ready exports with to_prompt() (XML-structured context)
- --force flag for re-indexing all files
- Extensible metadata (domain-specific fields stored as JSON)