Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
123 changes: 123 additions & 0 deletions ARCH.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# Architecture

HypGrep is a serverless grep over Parquet files in S3. A client (browser, Node, etc.) reads a small precomputed index over HTTP range requests, then reads only the source byte ranges that could plausibly match. The dominant cost being optimized is **per-query transfer bytes** — every byte the client pulls is wall-clock latency and dollar cost. File size on disk is a secondary metric.

## Index format

A HypGrep index is itself a Parquet file with two columns:

| column | type | encoding | meaning |
|---|---|---|---|
| `ngram` | STRING | DELTA_BYTE_ARRAY | a lowercase n-character substring |
| `blockId` | INT32 | DELTA_BINARY_PACKED | logical block this n-gram appears in |

Rows are sorted by `(ngram, blockId)`. The set of `blockId`s for a given n-gram is its posting list. The source file is divided into logical blocks of `block_size` consecutive rows (independent of Parquet row groups). The index records, for each block, the set of distinct n-grams present across every indexed string column.

Key-value metadata on the index file:

```
hypgrep.version # format version
hypgrep.block_size # rows per block
hypgrep.ngram_length # n-gram size (default 5)
hypgrep.text_columns # which columns were indexed
hypgrep.source_rows # total rows in source
hypgrep.source_bytelength # source file size (lets the client pre-size buffers)
```

Storing source size in the metadata lets the client construct an `AsyncBuffer` for the source without a separate HEAD request.

## Query path

For a query string `Q`:

1. **Tokenize the query into n-grams.** Lowercase; split on `/[^a-z0-9]+/`; emit every n-character window of each alphanumeric run. Whitespace-separated words are unioned (substring AND across words).
2. **Push-down filter on the index** with `{ ngram: { $in: queryNgrams } }`. Hyparquet only reads the row-group(s) covering those n-grams.
3. **Intersect.** Group returned rows by `blockId`; keep blocks that hit *every* query n-gram.
4. **For each candidate block, in order**, read the row range from the source with `parquetReadObjects({ rowStart, rowEnd })` and `useOffsetIndex: true`. Both files are wrapped in `cachedAsyncBuffer` for the duration of the query so adjacent blocks within a Parquet row group don't re-fetch its pages.
5. **Per-row substring filter.** For each row, verify `value.toLowerCase().includes(needle)` for every whitespace-separated query word.
6. **`limit` short-circuits** as soon as enough matches have been yielded — subsequent blocks are never read.

## Key tuning decisions

### Why n = 5

Shorter n-grams fail to prune prose at any reasonable block size — every block of natural-language text contains every common 3- or 4-character window, so trigram intersection returns *every* block as a candidate and we end up scanning the whole source. n = 5 has a large enough universe (~285K alnum 5-grams vs ~47K trigrams) that distinguishing windows like `rverl` (`serverless`) or `ichor` (`petrichor`) prune effectively. Per-query source transfer on a 420 MB Wikipedia dump:

| query | n=3 | n=4 | n=5 |
|---|---:|---:|---:|
| eigenvalue (67 hits) | 839 MB | 205 MB | 189 MB |
| petrichor (0) | 839 MB | 839 MB | 316 MB |
| serverless (0) | 839 MB | 839 MB | 29 MB |

Tradeoff: index size grows with n. n=3 → 1.2 MB, n=5 → 15 MB on Wikipedia. The goal weighs query bytes far more than index size, so the larger index pays for itself many times over after the first few queries.

### Why blockSize = 100

Smaller blocks improve pruning — when a candidate block is selected, fewer "wasted" non-matching rows ride along. Going from 500 to 100 cuts source transfer roughly in half on absent/rare-string queries:

| query | b=500 | b=100 |
|---|---:|---:|
| petrichor | 301 MB | 157 MB |
| serverless | 28 MB | 0 MB |

Tradeoff: smaller blocks mean more `(ngram, blockId)` postings. Index grows 15 → 25 MB. Again favorable under the goal.

### Per-block reads + `cachedAsyncBuffer`

Both source and index files are wrapped with `cachedAsyncBuffer` for the duration of a query. Parquet readers commonly re-fetch the same byte range across calls (page footer + page data, or overlapping row-group pages when adjacent blocks fall in the same row group). The cache memoizes slices by `(start, end)`, so duplicate fetches are free.

An earlier version coalesced contiguous candidate blocks into single `parquetReadObjects` calls to avoid the re-fetch. That worked for transfer bytes but defeated `limit`: a `limit: 10` query against a string that matched every block still pulled the entire coalesced run. Switching back to per-block reads (relying on the cache to dedupe pages) costs identical bytes on full scans while letting `limit` short-circuit on the next block boundary. Measured impact:

| query | without cache | with cache |
|---|---:|---:|
| eigenvalue | 195 MB | 150 MB |
| petrichor | 106 MB | 59 MB |
| quantum entanglement | 159 MB | 107 MB |

## End-to-end bytes on Wikipedia (420 MB source, 24 MB index)

No limit (every match):

| query | matches | total | index | source |
|---|---:|---:|---:|---:|
| eigenvalue | 67 | 150.4 MB | 724 KB | 150 MB |
| petrichor | 0 | 59.6 MB | 646 KB | 59 MB |
| serverless | 0 | 0.7 MB | 689 KB | 0 MB |
| quantum entanglement | 24 | 107.7 MB | 870 KB | 107 MB |
| wikipedia | 156289 | 401.3 MB | 708 KB | 401 MB |

`limit: 10` (typical client usage):

| query | matches | total | index | source |
|---|---:|---:|---:|---:|
| eigenvalue | 10 | 42.5 MB | 724 KB | 42 MB |
| petrichor | 0 | 59.6 MB | 646 KB | 59 MB |
| serverless | 0 | 0.7 MB | 689 KB | 0 MB |
| quantum entanglement | 10 | 40.2 MB | 870 KB | 39 MB |
| wikipedia | 10 | 14.1 MB | 708 KB | 13 MB |

**Per-query index transfer is bounded at ~700 KB** regardless of selectivity — that's the property that makes the design serverless-friendly. The source transfer scales with how many blocks legitimately match, and `limit` lets selective UIs (first 10 hits) clip it further.

## API surface

- `createIndex({ sourceFile, indexFile, blockSize?, ngramLength?, ... })` — build the index Parquet.
- `queryIndex({ query, indexFile, indexMetadata? })` — return candidate blocks. Streaming-friendly: just metadata.
- `parquetFind({ query, url, limit?, ... })` — async generator of matching rows in natural order (Ctrl+F semantics).
- `parquetSearch({ query, url, limit?, ... })` — async generator of matching rows ranked by occurrence count.

Both reader functions accept either a `url` (with optional `asyncBufferFactory`) or pre-loaded `sourceFile` / `indexFile` AsyncBuffers.

## Known limitations and where to push next

1. **Queries shorter than `ngramLength` (default 5) chars** return no results. Acceptable: short n-grams aren't selective in prose anyway. Workaround if needed: store multi-length n-grams (e.g. 3 *and* 5) and pick at query time.
2. **Common-everywhere words** (e.g. `wikipedia` in a Wikipedia dump) match every block; the source transfer is unavoidable because every row really does match. No n-gram strategy fixes this.
3. **Per-row precision is missing.** A candidate block is scanned in full even if only one row matches. Storing per-`(ngram, block)` row bitmaps would let queries narrow the source read to specific rows, at the cost of a ~3× larger index. Probably the highest-value next step for sparse queries that match a small number of rows spread across many blocks.
4. **No-limit queries on dense matches do more work than they need to.** When every block matches (e.g. `wikipedia` against a Wikipedia dump), exhausting all matches takes ~24 s of CPU because we process blocks one at a time. The bytes are minimal; the time is CPU/parsing overhead. Real clients should pass a `limit`.
5. **Automatic regex literal extraction is not implemented.** Callers can still run regex by passing a `rowFilter` predicate and choosing a `query` string that names a literal substring the regex requires (so the index can prune). What's not done: parsing a `RegExp` automatically and deriving the mandatory n-gram set (Zoekt's planner). The index format wouldn't need to change.
6. **Index files are not back-compatible.** `hypgrep.version` exists but we don't ship a multi-version reader. Reasonable for a 0.x package.

## Dependencies

- `hyparquet` — Parquet reader with push-down filtering, offset index, and `cachedAsyncBuffer`.
- `hyparquet-writer` — Parquet writer with `DELTA_BYTE_ARRAY` / `DELTA_BINARY_PACKED` encodings that keep the index small.
- `hyparquet-compressors` — compression codecs.
28 changes: 23 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,11 @@
[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)
![coverage](https://img.shields.io/badge/Coverage-95-darkred)

Build a compact full-text search index for a Parquet file using [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer).
Build a compact n-gram search index for a Parquet file using [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer). Queries are case-insensitive substring matches — grep semantics over a precomputed index.

## Why?

Enable efficient full-text search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.
Enable efficient grep-style search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.

Perfect for serverless architectures where you want to offer search capabilities without managing infrastructure.

Expand All @@ -26,7 +26,7 @@ hypgrep dataset.parquet [dataset.index.parquet]

## Find rows in a parquet file in JavaScript

Use `parquetFind` to find rows matching a query while preserving natural row order (like Ctrl+F):
Use `parquetFind` to find rows containing the query as a substring while preserving natural row order (like Ctrl+F):

```javascript
import { parquetFind } from 'hypgrep'
Expand All @@ -39,9 +39,27 @@ for await (const row of parquetFind({
}
```

Whitespace-separated words are ANDed: `'foo bar'` matches rows containing both `foo` and `bar` as substrings. Queries shorter than the indexed n-gram length (default 5) return no results.

### Regex (via `rowFilter`)

Pass a `rowFilter` callback to override the default substring match. The `query` is still used to prune candidate blocks; the callback decides which rows to keep:

```javascript
const re = /eigen\w*value/i

for await (const row of parquetFind({
query: 'eigen',
rowFilter: row => re.test(row.text),
url: '...',
})) ...
```

Picking a `query` that names a literal substring the regex requires is important — without it the index can't prune, and you'll scan everything.

## Ranked search

Use `parquetSearch` to rank results by BM25 relevance score (like a search engine):
Use `parquetSearch` to rank results by total occurrence count of the query words:

```javascript
import { parquetSearch } from 'hypgrep'
Expand All @@ -50,7 +68,7 @@ for await (const row of parquetSearch({
query: 'serverless',
url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
console.log(row) // highest relevance first
console.log(row) // most matches first
}
```

Expand Down
132 changes: 63 additions & 69 deletions benchmark.js
Original file line number Diff line number Diff line change
Expand Up @@ -3,101 +3,95 @@ import { asyncBufferFromFile } from 'hyparquet'
import { fileWriter } from 'hyparquet-writer'
import { pipeline } from 'stream/promises'
import { createIndex } from './src/createIndex.js'
import { queryIndex } from './src/queryIndex.js'
import { parquetFind } from './src/parquetFind.js'

const url = 'https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.en/train-00000-of-00041.parquet'
const filename = 'example.parquet'
const indexFilename = 'example.index.parquet'

// Download test parquet file if needed
// Download source if needed
let stat = await fs.stat(filename).catch(() => undefined)
if (!stat) {
console.log('downloading ' + url)
const res = await fetch(url)
if (!res.ok) throw new Error(res.statusText)
await pipeline(res.body, createWriteStream(filename))
stat = await fs.stat(filename)
console.log('downloaded example.parquet', stat.size.toLocaleString(), 'bytes')
}

// Create search index
// Build index if needed
let indexStat = await fs.stat(indexFilename).catch(() => undefined)
if (!indexStat) {
console.log('\n=== Creating Search Index ===')
const indexStartTime = performance.now()

console.log('building ' + indexFilename)
const t0 = performance.now()
const sourceFile = await asyncBufferFromFile(filename)
const indexFile = fileWriter(indexFilename)
await createIndex({ sourceFile, indexFile })
indexStat = await fs.stat(indexFilename)

const indexMs = performance.now() - indexStartTime
console.log(`created index in ${indexMs.toFixed(0)} ms`)
console.log(`index size: ${indexStat.size.toLocaleString()} bytes (${(indexStat.size / stat.size * 100).toFixed(1)}% of source)`)
} else {
console.log('\n=== Using Existing Search Index ===')
console.log(`index size: ${indexStat.size.toLocaleString()} bytes (${(indexStat.size / stat.size * 100).toFixed(1)}% of source)`)
console.log(`built in ${((performance.now() - t0) / 1000).toFixed(1)}s`)
}

// Load the index for querying
const indexFile = await asyncBufferFromFile(indexFilename)

// Augment the AsyncBuffer with instrumentation
indexFile.sliceCount = 0
indexFile.bytesRead = 0
const originalSlice = indexFile.slice.bind(indexFile)
indexFile.slice = function (start, end) {
indexFile.sliceCount++
indexFile.bytesRead += end - start
return originalSlice(start, end)
console.log(`source: ${(stat.size / 1024 / 1024).toFixed(1)} MB`)
console.log(`index: ${(indexStat.size / 1024 / 1024).toFixed(1)} MB (${(indexStat.size / stat.size * 100).toFixed(2)}% of source)`)

/**
* @param {import('hyparquet').AsyncBuffer} buf
* @returns {import('hyparquet').AsyncBuffer & {fetches: number, bytes: number}}
*/
function instrument(buf) {
const wrapper = {
byteLength: buf.byteLength,
fetches: 0,
bytes: 0,
/**
* @param {number} start
* @param {number} [end]
* @returns {Promise<ArrayBuffer> | ArrayBuffer}
*/
slice(start, end) {
wrapper.fetches += 1
wrapper.bytes += (end ?? buf.byteLength) - start
return buf.slice(start, end)
},
}
return wrapper
}

// Test queries
const queries = [
{ name: 'Rare term', query: 'eigenvalue' },
{ name: 'Common term', query: 'wikipedia' },
{ name: 'Multi-term', query: 'united states history' },
'eigenvalue',
'petrichor',
'serverless',
'quantum entanglement',
'wikipedia',
]

console.log('\n=== Search Performance ===')

const queryTimes = []
const queryRequestCounts = []
const queryBytesRead = []

for (const { name, query } of queries) {
// Reset stats for this query
indexFile.sliceCount = 0
indexFile.bytesRead = 0

const queryStartTime = performance.now()

const results = await queryIndex({ query, indexFile })

const queryMs = performance.now() - queryStartTime
queryTimes.push(queryMs)
queryRequestCounts.push(indexFile.sliceCount)
queryBytesRead.push(indexFile.bytesRead)

console.log(`\n${name}: "${query}"`)
console.log(` Query time: ${queryMs.toFixed(2)} ms`)
console.log(` Requests: ${indexFile.sliceCount}`)
console.log(` Bytes read: ${indexFile.bytesRead.toLocaleString()}`)
console.log(` Matching blocks: ${results.blocks.length}`)

if (results.blocks.length > 0) {
console.log(' Top 3 blocks by relevance:')
for (let i = 0; i < Math.min(3, results.blocks.length); i++) {
const result = results.blocks[i]
console.log(` Block ${result.blockId}: rows ${result.rowStart}-${result.rowEnd}, score: ${result.score}`)
}
for (const limit of [Infinity, 10]) {
console.log()
console.log('limit = ' + (limit === Infinity ? 'all' : limit))
console.log('query matches ms idx_KB src_MB total')
console.log('--------------------- ------- ----- ------ ------ -----')
for (const query of queries) {
const idx = instrument(await asyncBufferFromFile(indexFilename))
const src = instrument(await asyncBufferFromFile(filename))
const t0 = performance.now()
let matches = 0
// eslint-disable-next-line no-unused-vars
for await (const _row of parquetFind({
url: filename,
sourceFile: src,
indexFile: idx,
query,
limit,
})) matches += 1
const ms = performance.now() - t0
const total = (idx.bytes + src.bytes) / 1024 / 1024
console.log(
query.padEnd(21) +
' ' + String(matches).padStart(7) +
' ' + ms.toFixed(0).padStart(5) +
' ' + (idx.bytes / 1024).toFixed(0).padStart(6) +
' ' + (src.bytes / 1024 / 1024).toFixed(0).padStart(6) +
' ' + total.toFixed(1).padStart(5) + ' MB'
)
}
}

// Summary
console.log('\n=== Summary ===')
console.log(`Source file: ${stat.size.toLocaleString()} bytes`)
console.log(`Index file: ${indexStat.size.toLocaleString()} bytes`)
console.log(`Average query time: ${(queryTimes.reduce((sum, time) => sum + time, 0) / queryTimes.length).toFixed(2)} ms`)
console.log(`Average requests per query: ${(queryRequestCounts.reduce((sum, count) => sum + count, 0) / queryRequestCounts.length).toFixed(1)}`)
console.log(`Average bytes read per query: ${(queryBytesRead.reduce((sum, bytes) => sum + bytes, 0) / queryBytesRead.length).toLocaleString()}`)
12 changes: 6 additions & 6 deletions package.json
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
{
"name": "hypgrep",
"version": "0.1.1",
"version": "0.2.0",
"author": "Hyperparam",
"homepage": "https://hyperparam.app",
"license": "MIT",
Expand Down Expand Up @@ -41,16 +41,16 @@
"test": "vitest run"
},
"dependencies": {
"hyparquet": "1.25.8",
"hyparquet": "1.26.0",
"hyparquet-compressors": "1.1.1",
"hyparquet-writer": "0.15.1"
"hyparquet-writer": "0.15.2"
},
"devDependencies": {
"@types/node": "25.9.1",
"@vitest/coverage-v8": "4.1.6",
"@vitest/coverage-v8": "4.1.7",
"eslint": "9.39.4",
"eslint-plugin-jsdoc": "62.9.0",
"eslint-plugin-jsdoc": "63.0.0",
"typescript": "6.0.3",
"vitest": "4.1.6"
"vitest": "4.1.7"
}
}
13 changes: 11 additions & 2 deletions src/constants.js
Original file line number Diff line number Diff line change
@@ -1,8 +1,17 @@
// Version of the parquet index format
export const hypGrepVersion = 0

// Number of rows per virtual block
export const defaultBlockSize = 500
// Number of rows per virtual block. Smaller blocks improve query selectivity
// (less wasted source-byte transfer per candidate block) at the cost of a larger
// index file. Measured on Wikipedia: dropping from 500 to 100 cuts source bytes
// roughly in half on absent/rare-string queries while ~67% larger index file.
export const defaultBlockSize = 100

// Row group size in the index file
export const defaultIndexRowGroupSize = 40000

// Length of n-grams emitted into the index. Tuned for prose selectivity:
// shorter n-grams (3-4) fail to prune Wikipedia-scale blocks because every
// block contains every common short window; n=5 dramatically reduces the
// candidate-block set for selective substrings.
export const defaultNgramLength = 5
Loading