code-chunk

AST-aware code chunking for semantic search and RAG pipelines.

Uses tree-sitter to split source code at semantic boundaries (functions, classes, methods) rather than arbitrary character limits. Each chunk includes rich context: scope chain, imports, siblings, and entity signatures.

Features
How It Works
Installation
Quickstart
Edge Runtimes (WASM)
API Reference
License

Features

AST-aware: Splits at semantic boundaries; only entities larger than maxChunkSize are split, and then at statement boundaries
Rich context: Scope chain, imports, siblings, entity signatures
Contextualized text: Pre-formatted for embedding models
Multi-language: TypeScript, JavaScript, Python, Rust, Go, Java
Streaming: Process large files incrementally
Effect support: First-class Effect integration
Edge-ready: Works in Cloudflare Workers and other edge runtimes via WASM

How It Works

Traditional text splitters chunk code by character count or line breaks, often cutting functions in half or separating related code. code-chunk takes a different approach:

1. Parse

Source code is parsed into an Abstract Syntax Tree (AST) using tree-sitter. This gives us a structured representation of the code that understands language grammar.

2. Extract

We traverse the AST to extract semantic entities: functions, methods, classes, interfaces, types, and imports. For each entity, we capture:

Name and type
Full signature (e.g., async getUser(id: string): Promise<User>)
Docstring/comments if present
Byte and line ranges

3. Build Scope Tree

Entities are organized into a hierarchical scope tree that captures nesting relationships. A method inside a class knows its parent; a nested function knows its containing function. This enables us to provide scope context like UserService > getUser.
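
As a rough illustration of how nesting can be recovered from byte ranges, the sketch below derives a scope chain for a given byte offset. The `Entity` shape and the `scopeChainAt` helper are hypothetical names made up for this example; they are not code-chunk's internal types or API.

```typescript
// Hypothetical entity shape for illustration; not code-chunk's internal type.
interface Entity {
  name: string
  type: 'class' | 'function' | 'method'
  start: number // inclusive byte offset
  end: number   // exclusive byte offset
}

// Return the entities that enclose `offset`, ordered outermost first.
function scopeChainAt(entities: Entity[], offset: number): Entity[] {
  return entities
    .filter((e) => e.start <= offset && offset < e.end)
    .sort((a, b) => (b.end - b.start) - (a.end - a.start)) // widest range first
}

const entities: Entity[] = [
  { name: 'UserService', type: 'class', start: 0, end: 200 },
  { name: 'getUser', type: 'method', start: 40, end: 120 },
  { name: 'helper', type: 'function', start: 210, end: 260 },
]

const scopeLabel = scopeChainAt(entities, 50).map((e) => e.name).join(' > ')
console.log(scopeLabel) // UserService > getUser
```

Sorting by range width works here because an enclosing entity always spans at least as many bytes as anything nested inside it.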

4. Chunk

Code is split at semantic boundaries while respecting the maxChunkSize limit. The chunker:

Prefers to keep complete entities together
Splits oversized entities at logical points (statement boundaries)
Never cuts mid-expression or mid-statement
Merges small adjacent chunks to reduce fragmentation
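
The packing behaviour described above can be approximated with a greedy pass over adjacent entities. This is a simplified sketch, not the library's algorithm: the `Sized` type and `packEntities` function are hypothetical, sizes are plain byte counts, and the statement-level splitting of oversized entities is left out.

```typescript
// Hypothetical simplified model: each entity is just a name and a byte size.
interface Sized {
  name: string
  size: number
}

// Greedily pack adjacent entities into groups no larger than maxChunkSize.
// An entity larger than the limit ends up alone in its group; the further
// statement-level splitting of such entities is not modelled here.
function packEntities(items: Sized[], maxChunkSize: number): Sized[][] {
  const groups: Sized[][] = []
  let current: Sized[] = []
  let currentSize = 0

  for (const item of items) {
    // Close the current group if adding this item would exceed the limit.
    if (current.length > 0 && currentSize + item.size > maxChunkSize) {
      groups.push(current)
      current = []
      currentSize = 0
    }
    current.push(item)
    currentSize += item.size
  }
  if (current.length > 0) groups.push(current)
  return groups
}

const packed = packEntities(
  [
    { name: 'imports', size: 200 },
    { name: 'constructor', size: 400 },
    { name: 'getUser', size: 900 },
    { name: 'bigMethod', size: 2000 },
    { name: 'tiny', size: 100 },
  ],
  1500,
)
console.log(packed.map((g) => g.map((e) => e.name)))
// [ [ 'imports', 'constructor', 'getUser' ], [ 'bigMethod' ], [ 'tiny' ] ]
```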

5. Enrich with Context

Each chunk is enriched with contextual metadata:

Scope chain: Where this code lives (e.g., inside which class/function)
Entities: What's defined in this chunk
Siblings: What comes before/after (for continuity)
Imports: What dependencies are used

This context is formatted into contextualizedText, optimized for embedding models to understand semantic relationships.

Installation

bun add code-chunk
# or
npm install code-chunk

Quickstart

Basic Usage

import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  console.log(c.text)
  console.log(c.context.scope)    // [{ name: 'UserService', type: 'class' }]
  console.log(c.context.entities) // [{ name: 'getUser', type: 'method', ... }]
}

Using Contextualized Text for Embeddings

Use contextualizedText for better embedding quality in RAG systems:

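// `embed` and `vectorDB` stand in for your embedding client and vector store.
const filepath = 'src/user.ts'

for (const c of chunks) {
  const embedding = await embed(c.contextualizedText)
  await vectorDB.upsert({
    id: `${filepath}:${c.index}`,
    embedding,
    metadata: { filepath, lines: c.lineRange }
  })
}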

The contextualizedText prepends semantic context to the raw code:

# src/services/user.ts
# Scope: UserService
# Defines: async getUser(id: string): Promise<User>
# Uses: Database
# After: constructor

  async getUser(id: string): Promise<User> {
    return this.db.query('SELECT * FROM users WHERE id = ?', [id])
  }
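
To make the layout of that header concrete, here is a minimal sketch that reproduces the format shown above. It is not the library's `formatChunkWithContext`; the `ContextSketch` shape and `withContextHeader` function are hypothetical and only mirror the fields visible in the example output.

```typescript
// Hypothetical, simplified context shape used only by this sketch.
interface ContextSketch {
  filepath: string
  scope: string[]   // enclosing scopes, outermost first
  defines: string[] // signatures of entities defined in the chunk
  uses: string[]    // imported names the chunk refers to
  after?: string    // name of the preceding sibling, if any
}

// Prepend "# ..." header lines to the raw chunk text, skipping empty fields.
function withContextHeader(text: string, ctx: ContextSketch): string {
  const header = [`# ${ctx.filepath}`]
  if (ctx.scope.length > 0) header.push(`# Scope: ${ctx.scope.join(' > ')}`)
  if (ctx.defines.length > 0) header.push(`# Defines: ${ctx.defines.join(', ')}`)
  if (ctx.uses.length > 0) header.push(`# Uses: ${ctx.uses.join(', ')}`)
  if (ctx.after) header.push(`# After: ${ctx.after}`)
  return `${header.join('\n')}\n\n${text}`
}

const contextualized = withContextHeader(
  '  async getUser(id: string): Promise<User> { /* ... */ }',
  {
    filepath: 'src/services/user.ts',
    scope: ['UserService'],
    defines: ['async getUser(id: string): Promise<User>'],
    uses: ['Database'],
    after: 'constructor',
  },
)
console.log(contextualized)
```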

Streaming Large Files

Consume chunks one at a time as they are generated, instead of waiting for the full array:

import { chunkStream } from 'code-chunk'

for await (const c of chunkStream('src/large.ts', code)) {
  await handleChunk(c) // handleChunk: placeholder for your own async handler
}

Reusable Chunker

Create a chunker instance when processing multiple files with the same config:

import { createChunker } from 'code-chunk'

const chunker = createChunker({
  maxChunkSize: 2048,
  contextMode: 'full',
  siblingDetail: 'signatures',
})

for (const file of files) {
  const chunks = await chunker.chunk(file.path, file.content)
}

Effect Integration

For Effect-based pipelines:

import { chunkStreamEffect } from 'code-chunk'
import { Effect, Stream } from 'effect'

const program = Stream.runForEach(
  chunkStreamEffect('src/utils.ts', code),
  (chunk) => Effect.log(chunk.text)
)

await Effect.runPromise(program)

Edge Runtimes (WASM)

The default entry point uses Node.js APIs to load tree-sitter WASM files from the filesystem. For edge runtimes, use the code-chunk/wasm entry point, which accepts pre-loaded WASM binaries instead.

Cloudflare Workers

import { createChunker } from 'code-chunk/wasm'

import treeSitterWasm from 'web-tree-sitter/tree-sitter.wasm'
import typescriptWasm from 'tree-sitter-typescript/tree-sitter-typescript.wasm'
import javascriptWasm from 'tree-sitter-javascript/tree-sitter-javascript.wasm'

export default {
  async fetch(request: Request): Promise<Response> {
    const chunker = await createChunker({
      treeSitter: treeSitterWasm,
      languages: {
        typescript: typescriptWasm,
        javascript: javascriptWasm,
      },
    })

    const code = await request.text()
    const chunks = await chunker.chunk('input.ts', code)

    return Response.json(chunks)
  },
}

WasmConfig

The createChunker function from code-chunk/wasm accepts a WasmConfig object:

interface WasmConfig {
  treeSitter: WasmBinary
  languages: Partial<Record<Language, WasmBinary>>
}

type WasmBinary = Uint8Array | ArrayBuffer | Response | string

treeSitter: The web-tree-sitter runtime WASM binary
languages: Map of language names to their grammar WASM binaries

Only include the languages you need to minimize bundle size.

WASM Errors

The WASM entry point throws specific errors:

WasmParserError: Parser initialization or parsing failed
WasmGrammarError: No WASM binary provided for requested language
WasmChunkingError: Chunking process failed
UnsupportedLanguageError: File extension not recognized

import {
  WasmParserError,
  WasmGrammarError,
  WasmChunkingError,
  UnsupportedLanguageError,
} from 'code-chunk/wasm'

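try {
  const chunks = await chunker.chunk('input.ts', code)
} catch (error) {
  if (error instanceof WasmGrammarError) {
    // The grammar for this file's language was not passed in `languages`.
    console.error(`Language not loaded: ${error.language}`)
  } else {
    throw error // let other failures propagate instead of swallowing them
  }
}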

API Reference

`chunk(filepath, code, options?)`

Chunk source code into semantic pieces with context.

Parameters:

filepath: File path (used for language detection)
code: Source code string
options: Optional configuration

Returns: Promise<Chunk[]>

Throws: ChunkingError, UnsupportedLanguageError

`chunkStream(filepath, code, options?)`

Stream chunks as they're generated. Useful for large files.

Returns: AsyncGenerator<Chunk>

Note: chunk.totalChunks is -1 in streaming mode (unknown upfront).

`chunkStreamEffect(filepath, code, options?)`

Effect-native streaming API for composable pipelines.

Returns: Stream.Stream<Chunk, ChunkingError | UnsupportedLanguageError>

`createChunker(options?)`

Create a reusable chunker instance with default options.

Returns: Chunker with chunk() and stream() methods

`createChunker(config, options?)` (WASM)

Create a chunker for edge runtimes with pre-loaded WASM binaries.

import { createChunker } from 'code-chunk/wasm'

Parameters:

config: WasmConfig with treeSitter and languages WASM binaries
options: Optional ChunkOptions

Returns: Promise<Chunker>

Throws: WasmParserError, WasmGrammarError, WasmChunkingError, UnsupportedLanguageError

`WasmParser`

Low-level parser class for edge runtimes. Use this when you need direct access to parsing without chunking.

import { WasmParser } from 'code-chunk/wasm'

// config is a WasmConfig, as described in the WasmConfig section
const parser = new WasmParser(config)
await parser.init()

const result = await parser.parse(code, 'typescript')
console.log(result.tree.rootNode)

`formatChunkWithContext(text, context, overlapText?)`

Format chunk text with semantic context prepended. Useful for custom embedding pipelines.

Returns: string

`detectLanguage(filepath)`

Detect programming language from file extension.

Returns: Language | null

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `maxChunkSize` | `number` | `1500` | Maximum chunk size in bytes |
| `contextMode` | `'none' \| 'minimal' \| 'full'` | `'full'` | How much context to include |
| `siblingDetail` | `'none' \| 'names' \| 'signatures'` | `'signatures'` | Level of sibling detail |
| `filterImports` | `boolean` | `false` | Filter out import statements |
| `language` | `Language` | auto | Override language detection |
| `overlapLines` | `number` | `10` | Lines from previous chunk to include in `contextualizedText` |
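
One plausible reading of `overlapLines`, assuming the overlap is simply the tail of the previous chunk's text, is sketched below. The `lastLines` helper is hypothetical and the library may compute its overlap differently; the sketch only shows what "lines from the previous chunk" could look like.

```typescript
// Return the last `count` lines of `text`, or '' when count is not positive.
function lastLines(text: string, count: number): string {
  if (count <= 0) return '' // slice(-0) would return every line
  return text.split('\n').slice(-count).join('\n')
}

const previousChunk = 'const a = 1\nconst b = 2\nconst c = a + b'
const overlap = lastLines(previousChunk, 2)
console.log(overlap)
// const b = 2
// const c = a + b
```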

Supported Languages

| Language | Extensions |
| --- | --- |
| TypeScript | `.ts`, `.tsx`, `.mts`, `.cts` |
| JavaScript | `.js`, `.jsx`, `.mjs`, `.cjs` |
| Python | `.py`, `.pyi` |
| Rust | `.rs` |
| Go | `.go` |
| Java | `.java` |
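
Extension-based detection over the table above can be sketched as a simple lookup. This is an illustration, not the implementation of `detectLanguage`: the `LanguageName` type and `languageForPath` function are hypothetical, and the lowercase language identifiers are an assumption extrapolated from the `'typescript'` and `'javascript'` names used elsewhere in this README.

```typescript
// Assumed identifiers; only 'typescript' and 'javascript' appear in this README.
type LanguageName = 'typescript' | 'javascript' | 'python' | 'rust' | 'go' | 'java'

// Extension table copied from the Supported Languages section.
const EXTENSIONS = new Map<string, LanguageName>([
  ['.ts', 'typescript'], ['.tsx', 'typescript'], ['.mts', 'typescript'], ['.cts', 'typescript'],
  ['.js', 'javascript'], ['.jsx', 'javascript'], ['.mjs', 'javascript'], ['.cjs', 'javascript'],
  ['.py', 'python'], ['.pyi', 'python'],
  ['.rs', 'rust'],
  ['.go', 'go'],
  ['.java', 'java'],
])

// Map a POSIX-style path to a language, or null if the extension is unknown.
function languageForPath(filepath: string): LanguageName | null {
  const base = filepath.slice(filepath.lastIndexOf('/') + 1)
  const dot = base.lastIndexOf('.')
  if (dot <= 0) return null // no extension, or a dotfile such as ".gitignore"
  return EXTENSIONS.get(base.slice(dot).toLowerCase()) ?? null
}

console.log(languageForPath('src/user.ts'))  // typescript
console.log(languageForPath('docs/notes.md')) // null
```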

Errors

ChunkingError: Thrown when chunking fails (parsing error, extraction error, etc.)

UnsupportedLanguageError: Thrown when the file extension is not supported

Both errors have a _tag property for Effect-style error handling.
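
A tagged property like this lets callers branch with a plain `switch` rather than `instanceof`. The sketch below uses stand-in classes so that it runs on its own; in real code the errors would be imported from code-chunk. The exact `_tag` string values are an assumption (here taken to match the class names), and `describeFailure` is a hypothetical helper.

```typescript
// Stand-in classes so the sketch is self-contained; real code would import
// ChunkingError and UnsupportedLanguageError from 'code-chunk'.
// Assumption: each _tag value matches its class name.
class ChunkingErrorSketch extends Error {
  readonly _tag = 'ChunkingError' as const
}

class UnsupportedLanguageErrorSketch extends Error {
  readonly _tag = 'UnsupportedLanguageError' as const
}

type ChunkFailure = ChunkingErrorSketch | UnsupportedLanguageErrorSketch

// Discriminate on _tag; the compiler checks that every case is covered.
function describeFailure(error: ChunkFailure): string {
  switch (error._tag) {
    case 'ChunkingError':
      return `chunking failed: ${error.message}`
    case 'UnsupportedLanguageError':
      return `unsupported file: ${error.message}`
  }
}

console.log(describeFailure(new UnsupportedLanguageErrorSketch('notes.txt')))
// unsupported file: notes.txt
```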

License

MIT
