hyparam · platypii · May 25, 2026 · May 25, 2026 · May 25, 2026 · May 25, 2026
diff --git a/ARCH.md b/ARCH.md
@@ -0,0 +1,123 @@
+# Architecture
+
+HypGrep is a serverless grep over Parquet files in S3. A client (browser, Node, etc.) reads a small precomputed index over HTTP range requests, then reads only the source byte ranges that could plausibly match. The dominant cost being optimized is **per-query transfer bytes** — every byte the client pulls is wall-clock latency and dollar cost. File size on disk is a secondary metric.
+
+## Index format
+
+A HypGrep index is itself a Parquet file with two columns:
+
+| column | type | encoding | meaning |
+|---|---|---|---|
+| `ngram` | STRING | DELTA_BYTE_ARRAY | a lowercase n-character substring |
+| `blockId` | INT32 | DELTA_BINARY_PACKED | logical block this n-gram appears in |
+
+Rows are sorted by `(ngram, blockId)`. The set of `blockId`s for a given n-gram is its posting list. The source file is divided into logical blocks of `block_size` consecutive rows (independent of Parquet row groups). The index records, for each block, the set of distinct n-grams present across every indexed string column.
+
+Key-value metadata on the index file:
+
+```
+hypgrep.version          # format version
+hypgrep.block_size       # rows per block
+hypgrep.ngram_length     # n-gram size (default 5)
+hypgrep.text_columns     # which columns were indexed
+hypgrep.source_rows      # total rows in source
+hypgrep.source_bytelength # source file size (lets the client pre-size buffers)
+```
+
+Storing source size in the metadata lets the client construct an `AsyncBuffer` for the source without a separate HEAD request.
+
+## Query path
+
+For a query string `Q`:
+
+1. **Tokenize the query into n-grams.** Lowercase; split on `/[^a-z0-9]+/`; emit every n-character window of each alphanumeric run. Whitespace-separated words are unioned (substring AND across words).
+2. **Push-down filter on the index** with `{ ngram: { $in: queryNgrams } }`. Hyparquet only reads the row-group(s) covering those n-grams.
+3. **Intersect.** Group returned rows by `blockId`; keep blocks that hit *every* query n-gram.
+4. **For each candidate block, in order**, read the row range from the source with `parquetReadObjects({ rowStart, rowEnd })` and `useOffsetIndex: true`. Both files are wrapped in `cachedAsyncBuffer` for the duration of the query so adjacent blocks within a Parquet row group don't re-fetch its pages.
+5. **Per-row substring filter.** For each row, verify `value.toLowerCase().includes(needle)` for every whitespace-separated query word.
+6. **`limit` short-circuits** as soon as enough matches have been yielded — subsequent blocks are never read.
+
+## Key tuning decisions
+
+### Why n = 5
+
+Shorter n-grams fail to prune prose at any reasonable block size — every block of natural-language text contains every common 3- or 4-character window, so trigram intersection returns *every* block as a candidate and we end up scanning the whole source. n = 5 has a large enough universe (~285K alnum 5-grams vs ~47K trigrams) that distinguishing windows like `rverl` (`serverless`) or `ichor` (`petrichor`) prune effectively. Per-query source transfer on a 420 MB Wikipedia dump:
+
+| query | n=3 | n=4 | n=5 |
+|---|---:|---:|---:|
+| eigenvalue (67 hits) | 839 MB | 205 MB | 189 MB |
+| petrichor (0) | 839 MB | 839 MB | 316 MB |
+| serverless (0) | 839 MB | 839 MB | 29 MB |
+
+Tradeoff: index size grows with n. n=3 → 1.2 MB, n=5 → 15 MB on Wikipedia. The goal weighs query bytes far more than index size, so the larger index pays for itself many times over after the first few queries.
+
+### Why blockSize = 100
+
+Smaller blocks improve pruning — when a candidate block is selected, fewer "wasted" non-matching rows ride along. Going from 500 to 100 cuts source transfer roughly in half on absent/rare-string queries:
+
+| query | b=500 | b=100 |
+|---|---:|---:|
+| petrichor | 301 MB | 157 MB |
+| serverless | 28 MB | 0 MB |
+
+Tradeoff: smaller blocks mean more `(ngram, blockId)` postings. Index grows 15 → 25 MB. Again favorable under the goal.
+
+### Per-block reads + `cachedAsyncBuffer`
+
+Both source and index files are wrapped with `cachedAsyncBuffer` for the duration of a query. Parquet readers commonly re-fetch the same byte range across calls (page footer + page data, or overlapping row-group pages when adjacent blocks fall in the same row group). The cache memoizes slices by `(start, end)`, so duplicate fetches are free.
+
+An earlier version coalesced contiguous candidate blocks into single `parquetReadObjects` calls to avoid the re-fetch. That worked for transfer bytes but defeated `limit`: a `limit: 10` query against a string that matched every block still pulled the entire coalesced run. Switching back to per-block reads (relying on the cache to dedupe pages) costs identical bytes on full scans while letting `limit` short-circuit on the next block boundary. Measured impact:
+
+| query | without cache | with cache |
+|---|---:|---:|
+| eigenvalue | 195 MB | 150 MB |
+| petrichor | 106 MB | 59 MB |
+| quantum entanglement | 159 MB | 107 MB |
+
+## End-to-end bytes on Wikipedia (420 MB source, 24 MB index)
+
+No limit (every match):
+
+| query | matches | total | index | source |
+|---|---:|---:|---:|---:|
+| eigenvalue | 67 | 150.4 MB | 724 KB | 150 MB |
+| petrichor | 0 | 59.6 MB | 646 KB | 59 MB |
+| serverless | 0 | 0.7 MB | 689 KB | 0 MB |
+| quantum entanglement | 24 | 107.7 MB | 870 KB | 107 MB |
+| wikipedia | 156289 | 401.3 MB | 708 KB | 401 MB |
+
+`limit: 10` (typical client usage):
+
+| query | matches | total | index | source |
+|---|---:|---:|---:|---:|
+| eigenvalue | 10 | 42.5 MB | 724 KB | 42 MB |
+| petrichor | 0 | 59.6 MB | 646 KB | 59 MB |
+| serverless | 0 | 0.7 MB | 689 KB | 0 MB |
+| quantum entanglement | 10 | 40.2 MB | 870 KB | 39 MB |
+| wikipedia | 10 | 14.1 MB | 708 KB | 13 MB |
+
+**Per-query index transfer is bounded at ~700 KB** regardless of selectivity — that's the property that makes the design serverless-friendly. The source transfer scales with how many blocks legitimately match, and `limit` lets selective UIs (first 10 hits) clip it further.
+
+## API surface
+
+- `createIndex({ sourceFile, indexFile, blockSize?, ngramLength?, ... })` — build the index Parquet.
+- `queryIndex({ query, indexFile, indexMetadata? })` — return candidate blocks. Streaming-friendly: just metadata.
+- `parquetFind({ query, url, limit?, ... })` — async generator of matching rows in natural order (Ctrl+F semantics).
+- `parquetSearch({ query, url, limit?, ... })` — async generator of matching rows ranked by occurrence count.
+
+Both reader functions accept either a `url` (with optional `asyncBufferFactory`) or pre-loaded `sourceFile` / `indexFile` AsyncBuffers.
+
+## Known limitations and where to push next
+
+1. **Queries shorter than `ngramLength` (default 5) chars** return no results. Acceptable: short n-grams aren't selective in prose anyway. Workaround if needed: store multi-length n-grams (e.g. 3 *and* 5) and pick at query time.
+2. **Common-everywhere words** (e.g. `wikipedia` in a Wikipedia dump) match every block; the source transfer is unavoidable because every row really does match. No n-gram strategy fixes this.
+3. **Per-row precision is missing.** A candidate block is scanned in full even if only one row matches. Storing per-`(ngram, block)` row bitmaps would let queries narrow the source read to specific rows, at the cost of a ~3× larger index. Probably the highest-value next step for sparse queries that match a small number of rows spread across many blocks.
+4. **No-limit queries on dense matches do more work than they need to.** When every block matches (e.g. `wikipedia` against a Wikipedia dump), exhausting all matches takes ~24 s of CPU because we process blocks one at a time. The bytes are minimal; the time is CPU/parsing overhead. Real clients should pass a `limit`.
+5. **Automatic regex literal extraction is not implemented.** Callers can still run regex by passing a `rowFilter` predicate and choosing a `query` string that names a literal substring the regex requires (so the index can prune). What's not done: parsing a `RegExp` automatically and deriving the mandatory n-gram set (Zoekt's planner). The index format wouldn't need to change.
+6. **Index files are not back-compatible.** `hypgrep.version` exists but we don't ship a multi-version reader. Reasonable for a 0.x package.
+
+## Dependencies
+
+- `hyparquet` — Parquet reader with push-down filtering, offset index, and `cachedAsyncBuffer`.
+- `hyparquet-writer` — Parquet writer with `DELTA_BYTE_ARRAY` / `DELTA_BINARY_PACKED` encodings that keep the index small.
+- `hyparquet-compressors` — compression codecs.
diff --git a/README.md b/README.md
@@ -3,11 +3,11 @@
 [![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)
 ![coverage](https://img.shields.io/badge/Coverage-95-darkred)
 
-Build a compact full-text search index for a Parquet file using [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer).
+Build a compact n-gram search index for a Parquet file using [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer). Queries are case-insensitive substring matches — grep semantics over a precomputed index.
 
 ## Why?
 
-Enable efficient full-text search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.
+Enable efficient grep-style search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.
 
 Perfect for serverless architectures where you want to offer search capabilities without managing infrastructure.
 
@@ -26,7 +26,7 @@ hypgrep dataset.parquet [dataset.index.parquet]
 
 ## Find rows in a parquet file in JavaScript
 
-Use `parquetFind` to find rows matching a query while preserving natural row order (like Ctrl+F):
+Use `parquetFind` to find rows containing the query as a substring while preserving natural row order (like Ctrl+F):
 
 ```javascript
 import { parquetFind } from 'hypgrep'
@@ -39,9 +39,27 @@ for await (const row of parquetFind({
 }
 ```
 
+Whitespace-separated words are ANDed: `'foo bar'` matches rows containing both `foo` and `bar` as substrings. Queries shorter than the indexed n-gram length (default 5) return no results.
+
+### Regex (via `rowFilter`)
+
+Pass a `rowFilter` callback to override the default substring match. The `query` is still used to prune candidate blocks; the callback decides which rows to keep:
+
+```javascript
+const re = /eigen\w*value/i
+
+for await (const row of parquetFind({
+  query: 'eigen',
+  rowFilter: row => re.test(row.text),
+  url: '...',
+})) ...
+```
+
+Picking a `query` that names a literal substring the regex requires is important — without it the index can't prune, and you'll scan everything.
+
 ## Ranked search
 
-Use `parquetSearch` to rank results by BM25 relevance score (like a search engine):
+Use `parquetSearch` to rank results by total occurrence count of the query words:
 
 ```javascript
 import { parquetSearch } from 'hypgrep'
@@ -50,7 +68,7 @@ for await (const row of parquetSearch({
   query: 'serverless',
   url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
 })) {
-  console.log(row) // highest relevance first
+  console.log(row) // most matches first
 }
 ```
 

diff --git a/benchmark.js b/benchmark.js
@@ -3,101 +3,95 @@ import { asyncBufferFromFile } from 'hyparquet'
 import { fileWriter } from 'hyparquet-writer'
 import { pipeline } from 'stream/promises'
 import { createIndex } from './src/createIndex.js'
-import { queryIndex } from './src/queryIndex.js'
+import { parquetFind } from './src/parquetFind.js'
 
 const url = 'https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.en/train-00000-of-00041.parquet'
 const filename = 'example.parquet'
 const indexFilename = 'example.index.parquet'
 
-// Download test parquet file if needed
+// Download source if needed
 let stat = await fs.stat(filename).catch(() => undefined)
 if (!stat) {
   console.log('downloading ' + url)
   const res = await fetch(url)
   if (!res.ok) throw new Error(res.statusText)
   await pipeline(res.body, createWriteStream(filename))
   stat = await fs.stat(filename)
-  console.log('downloaded example.parquet', stat.size.toLocaleString(), 'bytes')
 }
 
-// Create search index
+// Build index if needed
 let indexStat = await fs.stat(indexFilename).catch(() => undefined)
 if (!indexStat) {
-  console.log('\n=== Creating Search Index ===')
-  const indexStartTime = performance.now()
-
+  console.log('building ' + indexFilename)
+  const t0 = performance.now()
   const sourceFile = await asyncBufferFromFile(filename)
   const indexFile = fileWriter(indexFilename)
   await createIndex({ sourceFile, indexFile })
   indexStat = await fs.stat(indexFilename)
-
-  const indexMs = performance.now() - indexStartTime
-  console.log(`created index in ${indexMs.toFixed(0)} ms`)
-  console.log(`index size: ${indexStat.size.toLocaleString()} bytes (${(indexStat.size / stat.size * 100).toFixed(1)}% of source)`)
-} else {
-  console.log('\n=== Using Existing Search Index ===')
-  console.log(`index size: ${indexStat.size.toLocaleString()} bytes (${(indexStat.size / stat.size * 100).toFixed(1)}% of source)`)
+  console.log(`built in ${((performance.now() - t0) / 1000).toFixed(1)}s`)
 }
 
-// Load the index for querying
-const indexFile = await asyncBufferFromFile(indexFilename)
-
-// Augment the AsyncBuffer with instrumentation
-indexFile.sliceCount = 0
-indexFile.bytesRead = 0
-const originalSlice = indexFile.slice.bind(indexFile)
-indexFile.slice = function (start, end) {
-  indexFile.sliceCount++
-  indexFile.bytesRead += end - start
-  return originalSlice(start, end)
+console.log(`source: ${(stat.size / 1024 / 1024).toFixed(1)} MB`)
+console.log(`index:  ${(indexStat.size / 1024 / 1024).toFixed(1)} MB (${(indexStat.size / stat.size * 100).toFixed(2)}% of source)`)
+
+/**
+ * @param {import('hyparquet').AsyncBuffer} buf
+ * @returns {import('hyparquet').AsyncBuffer & {fetches: number, bytes: number}}
+ */
+function instrument(buf) {
+  const wrapper = {
+    byteLength: buf.byteLength,
+    fetches: 0,
+    bytes: 0,
+    /**
+     * @param {number} start
+     * @param {number} [end]
+     * @returns {Promise<ArrayBuffer> | ArrayBuffer}
+     */
+    slice(start, end) {
+      wrapper.fetches += 1
+      wrapper.bytes += (end ?? buf.byteLength) - start
+      return buf.slice(start, end)
+    },
+  }
+  return wrapper
 }
 
-// Test queries
 const queries = [
-  { name: 'Rare term', query: 'eigenvalue' },
-  { name: 'Common term', query: 'wikipedia' },
-  { name: 'Multi-term', query: 'united states history' },
+  'eigenvalue',
+  'petrichor',
+  'serverless',
+  'quantum entanglement',
+  'wikipedia',
 ]
 
-console.log('\n=== Search Performance ===')
-
-const queryTimes = []
-const queryRequestCounts = []
-const queryBytesRead = []
-
-for (const { name, query } of queries) {
-  // Reset stats for this query
-  indexFile.sliceCount = 0
-  indexFile.bytesRead = 0
-
-  const queryStartTime = performance.now()
-
-  const results = await queryIndex({ query, indexFile })
-
-  const queryMs = performance.now() - queryStartTime
-  queryTimes.push(queryMs)
-  queryRequestCounts.push(indexFile.sliceCount)
-  queryBytesRead.push(indexFile.bytesRead)
-
-  console.log(`\n${name}: "${query}"`)
-  console.log(`  Query time: ${queryMs.toFixed(2)} ms`)
-  console.log(`  Requests: ${indexFile.sliceCount}`)
-  console.log(`  Bytes read: ${indexFile.bytesRead.toLocaleString()}`)
-  console.log(`  Matching blocks: ${results.blocks.length}`)
-
-  if (results.blocks.length > 0) {
-    console.log('  Top 3 blocks by relevance:')
-    for (let i = 0; i < Math.min(3, results.blocks.length); i++) {
-      const result = results.blocks[i]
-      console.log(`    Block ${result.blockId}: rows ${result.rowStart}-${result.rowEnd}, score: ${result.score}`)
-    }
+for (const limit of [Infinity, 10]) {
+  console.log()
+  console.log('limit = ' + (limit === Infinity ? 'all' : limit))
+  console.log('query                 matches  ms     idx_KB  src_MB  total')
+  console.log('--------------------- -------  -----  ------  ------  -----')
+  for (const query of queries) {
+    const idx = instrument(await asyncBufferFromFile(indexFilename))
+    const src = instrument(await asyncBufferFromFile(filename))
+    const t0 = performance.now()
+    let matches = 0
+    // eslint-disable-next-line no-unused-vars
+    for await (const _row of parquetFind({
+      url: filename,
+      sourceFile: src,
+      indexFile: idx,
+      query,
+      limit,
+    })) matches += 1
+    const ms = performance.now() - t0
+    const total = (idx.bytes + src.bytes) / 1024 / 1024
+    console.log(
+      query.padEnd(21) +
+      ' ' + String(matches).padStart(7) +
+      '  ' + ms.toFixed(0).padStart(5) +
+      '  ' + (idx.bytes / 1024).toFixed(0).padStart(6) +
+      '  ' + (src.bytes / 1024 / 1024).toFixed(0).padStart(6) +
+      '  ' + total.toFixed(1).padStart(5) + ' MB'
+    )
   }
 }
-
-// Summary
-console.log('\n=== Summary ===')
-console.log(`Source file: ${stat.size.toLocaleString()} bytes`)
-console.log(`Index file: ${indexStat.size.toLocaleString()} bytes`)
-console.log(`Average query time: ${(queryTimes.reduce((sum, time) => sum + time, 0) / queryTimes.length).toFixed(2)} ms`)
-console.log(`Average requests per query: ${(queryRequestCounts.reduce((sum, count) => sum + count, 0) / queryRequestCounts.length).toFixed(1)}`)
-console.log(`Average bytes read per query: ${(queryBytesRead.reduce((sum, bytes) => sum + bytes, 0) / queryBytesRead.length).toLocaleString()}`)
diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "hypgrep",
-  "version": "0.1.1",
+  "version": "0.2.0",
   "author": "Hyperparam",
   "homepage": "https://hyperparam.app",
   "license": "MIT",
@@ -41,16 +41,16 @@
     "test": "vitest run"
   },
   "dependencies": {
-    "hyparquet": "1.25.8",
+    "hyparquet": "1.26.0",
     "hyparquet-compressors": "1.1.1",
-    "hyparquet-writer": "0.15.1"
+    "hyparquet-writer": "0.15.2"
   },
   "devDependencies": {
     "@types/node": "25.9.1",
-    "@vitest/coverage-v8": "4.1.6",
+    "@vitest/coverage-v8": "4.1.7",
     "eslint": "9.39.4",
-    "eslint-plugin-jsdoc": "62.9.0",
+    "eslint-plugin-jsdoc": "63.0.0",
     "typescript": "6.0.3",
-    "vitest": "4.1.6"
+    "vitest": "4.1.7"
   }
 }
diff --git a/src/constants.js b/src/constants.js
@@ -1,8 +1,17 @@
 // Version of the parquet index format
 export const hypGrepVersion = 0
 
-// Number of rows per virtual block
-export const defaultBlockSize = 500
+// Number of rows per virtual block. Smaller blocks improve query selectivity
+// (less wasted source-byte transfer per candidate block) at the cost of a larger
+// index file. Measured on Wikipedia: dropping from 500 to 100 cuts source bytes
+// roughly in half on absent/rare-string queries while ~67% larger index file.
+export const defaultBlockSize = 100
 
 // Row group size in the index file
 export const defaultIndexRowGroupSize = 40000
+
+// Length of n-grams emitted into the index. Tuned for prose selectivity:
+// shorter n-grams (3-4) fail to prune Wikipedia-scale blocks because every
+// block contains every common short window; n=5 dramatically reduces the
+// candidate-block set for selective substrings.
+export const defaultNgramLength = 5