@@ -273,6 +273,48 @@ These are guidelines, not hard minimums. A boilerplate README genuinely has 0
 extractable facts. But a 200-page QA manual with 5 facts means you are skimming
 and need to go deeper.
 
+## Reading Large Documents
+
+Not all documents can be processed in a single read. Attention degrades over
+long inputs: facts from the middle and end of a long file get thinner coverage
+than facts from the beginning.
+
+**Determine the reading strategy before you start extracting:**
+
+**Small documents (under ~5,000 lines or ~50 pages):**
+Read the entire file in one pass. Extract facts normally.
+
+**Large documents (over ~5,000 lines or ~50 pages):**
+Read and extract in sections. Do NOT read the entire file at once.
+
+1. **First pass: structure scan.** Read the first 200 lines with the Read tool
+   (use `offset: 0, limit: 200`) to understand the document structure: table of
+   contents, section headers, chapter boundaries, or natural break points.
+
+2. **Plan sections.** Divide the document into sections of ~2,000-3,000 lines
+   (~30-40 pages) based on the structure you found. Use natural boundaries
+   (chapters, sections, modules) when possible. If none exist (e.g., a flat
+   CSV or continuous log), use fixed-size chunks with a 100-line overlap.
+
+3. **Process each section independently.** For each section:
+   a. Read ONLY that section using `offset` and `limit` on the Read tool.
+   b. Extract facts from that section.
+   c. Write facts immediately (batch write to the fact file).
+   d. Display: "Section N/M: extracted K facts (lines X-Y)"
+
+4. **Track sections.** After all sections, verify that the ranges you read cover
+   every line of the document. If any range was skipped, go back and read it.
+   Display: "Document complete: N facts from M sections (lines 1-total)"
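
The fixed-size fallback in step 2 and the coverage check in step 4 can be sketched roughly as follows. This is a minimal illustration, not part of the change itself: the function names `plan_chunks` and `find_gaps` are hypothetical, the default chunk size of 2,500 lines is just one value inside the suggested range, and offsets are assumed to be zero-based, as the `offset: 0` example in step 1 suggests.

```python
from typing import List, Tuple

Chunk = Tuple[int, int]  # (offset, limit), zero-based offset (assumption)


def plan_chunks(total_lines: int, chunk_size: int = 2500, overlap: int = 100) -> List[Chunk]:
    """Plan fixed-size reads that overlap by `overlap` lines (hypothetical helper)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks: List[Chunk] = []
    offset = 0
    while offset < total_lines:
        limit = min(chunk_size, total_lines - offset)
        chunks.append((offset, limit))
        if offset + limit >= total_lines:
            break  # this chunk reaches the end of the file
        offset += chunk_size - overlap  # step back by the overlap
    return chunks


def find_gaps(chunks: List[Chunk], total_lines: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) line ranges that no chunk covered."""
    gaps: List[Tuple[int, int]] = []
    covered_to = 0
    for offset, limit in sorted(chunks):
        if offset > covered_to:
            gaps.append((covered_to, offset))
        covered_to = max(covered_to, offset + limit)
    if covered_to < total_lines:
        gaps.append((covered_to, total_lines))
    return gaps


if __name__ == "__main__":
    plan = plan_chunks(6000)
    for n, (offset, limit) in enumerate(plan, start=1):
        # Report 1-based line numbers, matching the "lines X-Y" display format.
        print(f"Section {n}/{len(plan)}: lines {offset + 1}-{offset + limit}")
    print("Skipped ranges:", find_gaps(plan, 6000))
```

Checking coverage by range rather than by summing lines read avoids a false mismatch when chunks overlap, since overlapping reads count some lines twice.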
+
+**Why this matters:** A 200-page manual read in one pass might yield 30 facts,
+heavily weighted toward the first 50 pages. The same manual read in 5 sections
+of 40 pages each might yield 60-80 facts with even coverage. The section boundary
+forces Claude to give full attention to every part of the document.
+
+**Code files:** Most code files are under 5,000 lines and can be read in one pass.
+For very large programs (e.g., 10,000+ line COBOL), split at paragraph/section
+boundaries (COBOL) or subroutine boundaries (RPG) rather than arbitrary line counts.
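
Splitting at natural boundaries instead of arbitrary line counts could look something like the sketch below. It is an illustration under stated assumptions: `split_at_boundaries` and `looks_like_cobol_section` are hypothetical names, and the boundary test (a line ending in `SECTION.`) is a deliberately simplified stand-in for real COBOL or RPG parsing.

```python
from typing import Callable, List, Sequence, Tuple


def looks_like_cobol_section(line: str) -> bool:
    """Simplified, assumed boundary test: a line ending in ' SECTION.'."""
    return line.strip().upper().endswith(" SECTION.")


def split_at_boundaries(
    lines: Sequence[str],
    is_boundary: Callable[[str], bool],
    max_lines: int = 3000,
) -> List[Tuple[int, int]]:
    """Group whole units into (offset, limit) reads of at most `max_lines`.

    A single unit longer than `max_lines` is kept whole rather than cut mid-unit.
    """
    # Candidate cut points: start of file, every boundary line, end of file.
    cuts = sorted({0, len(lines)} | {i for i, ln in enumerate(lines) if is_boundary(ln)})
    sections: List[Tuple[int, int]] = []
    start = prev = 0
    for cut in cuts[1:]:
        # Close the current section at the previous boundary if extending
        # it to this cut would exceed the size budget.
        if cut - start > max_lines and prev > start:
            sections.append((start, prev - start))
            start = prev
        prev = cut
    if len(lines) > start:
        sections.append((start, len(lines) - start))
    return sections


if __name__ == "__main__":
    demo = ["MAIN SECTION.", "a", "b", "c", "LOAD SECTION.", "d", "e", "SAVE SECTION.", "f", "g"]
    print(split_at_boundaries(demo, looks_like_cobol_section, max_lines=5))
```

Because every cut lands on a unit boundary, these sections need no overlap; the overlap is only a safeguard for fixed-size chunks that may cut through a unit.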
+
 ## After Extraction
 
 ### Critical: Write in Batches to Avoid Output Limits