feat: Streaming JSON Parsing API #8

Open
vnixx wants to merge 4 commits into main from feat/streaming-parser

@vnixx vnixx commented Apr 25, 2026

Summary

Add streaming/incremental JSON parsing support to ReerJSON.

New Types

Bottom layer (JSONValue):

  • JSONStreamParser — push-based streaming parser with two modes:
    • .jsonLines: extract multiple JSON documents from a byte stream (NDJSON/SSE)
    • .jsonArray: parse elements from a large JSON array one by one, O(1) memory
  • JSONIncrementalReader — accumulate chunks for large single-document parsing

Codable layer:

  • StreamingJSONLinesDecoder<T> / StreamingJSONArrayDecoder<T> — typed streaming decoders

AsyncSequence adapters:

  • JSONValueStream / JSONValueByteStream / DecodingStream
  • AsyncSequence.jsonValues() / .decode() convenience extensions

Implementation

  • Uses yyjson's YYJSON_READ_STOP_WHEN_DONE + yyjson_doc_get_read_size() for accurate byte-level buffer management
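  • Internal buffer with lazy compaction: once readOffset exceeds half the buffer size, the unread bytes are moved to the front with memmove
  • JSON Array mode uses a state machine to strip the structural tokens ([, the separating commas, and ]) and parse each element individually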
  • Zero impact on existing parsing paths
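
The lazy compaction rule above can be illustrated with a minimal sketch. This is not the PR's implementation: `CompactingBuffer` and its members are hypothetical names, and `Array.removeFirst(_:)` stands in for the `memmove` the PR describes.

```swift
/// Hypothetical illustration of lazy compaction; not ReerJSON's actual buffer.
struct CompactingBuffer {
    private(set) var storage: [UInt8] = []
    private(set) var readOffset = 0

    /// Bytes that have been appended but not yet consumed.
    var unreadCount: Int { storage.count - readOffset }

    /// The unread tail of the buffer.
    var unread: ArraySlice<UInt8> { storage[readOffset...] }

    mutating func append(_ bytes: [UInt8]) {
        storage.append(contentsOf: bytes)
    }

    /// Marks `count` bytes as consumed. The consumed prefix is only dropped
    /// once it makes up more than half of the storage, so most calls are O(1).
    mutating func consume(_ count: Int) {
        precondition(count >= 0 && count <= unreadCount)
        readOffset += count
        if readOffset > storage.count / 2 {
            // Shifts the unread tail to index 0 (stand-in for memmove).
            storage.removeFirst(readOffset)
            readOffset = 0
        }
    }
}
```

Deferring the shift until the consumed prefix dominates keeps the amortized cost of consuming small values low, at the price of temporarily holding some dead bytes.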

Testing

  • 33 new tests covering JSON Lines, JSON Array, incremental reader, edge cases, and Codable layer
  • All 755 tests pass (722 existing + 33 new), zero regressions

Note on yyjson_incr_* API

yyjson 0.12.0's incremental API (yyjson_incr_new/read/free) requires all data to be pre-loaded in the buffer — len only controls how far each parse step reads. It cannot handle dynamically appended data between reads. Therefore, network streaming uses STOP_WHEN_DONE instead.

vnixx and others added 4 commits April 25, 2026 21:50
- JSONStreamParser: push-based streaming parser for JSON Lines and JSON Array modes
  - JSON Lines: extracts multiple JSON documents from a byte stream using STOP_WHEN_DONE
  - JSON Array: state machine to parse elements from a large JSON array one by one
  - Internal buffer with lazy compaction for efficient memory management

- JSONIncrementalReader: accumulates chunks for large single-document parsing

- StreamingJSONLinesDecoder / StreamingJSONArrayDecoder: Codable-layer streaming decoders

- JSONValueStream / DecodingStream: AsyncSequence adapters for async byte streams

- AsyncSequence extensions: .jsonValues() and .decode() convenience methods

- Document.streamParse: internal API using yyjson_doc_get_read_size for accurate byte counting

- 33 new tests covering JSON Lines, JSON Array, incremental, edge cases, and Codable layer
- All 755 tests pass (722 existing + 33 new) with zero regressions

Reject malformed array streams consistently and avoid copying the unread buffer for each parsed value.

Co-authored-by: Cursor <cursoragent@cursor.com>

Several issues were found while reviewing the streaming JSON parsing API:

* Numeric cross-chunk splitting was silently truncating values.
  yyjson with STOP_WHEN_DONE will happily parse "1" out of a buffer
  whose true content is "12345" — there is no way for the parser to
  tell from within yyjson that the number could be extended. The
  stream parser now defers any value whose parse ends exactly at the
  current buffer end and only commits it once the next chunk arrives
  (or finalize() confirms it's the last token). Strings, objects,
  arrays, and literals continue to work as before since yyjson
  detects truncation for those itself.

* JSONIncrementalReader used substring matching on the error message
  (e.g. error.message.contains("Unexpected end")) to detect
  "need more data". This was fragile and could misclassify a real
  syntactic error whose human-readable message coincidentally
  contained those words. JSONError now carries the yyjson read error
  code and the reader switches on the code directly.

* JSONIncrementalReader was declared @unchecked Sendable but was not
  thread-safe — feed/finish mutated state without synchronization.
  Added an internal LockedState so concurrent feed/finish calls are
  serialized; added a concurrent-feed test.

* StreamingJSONLinesDecoder / StreamingJSONArrayDecoder were
  serializing each parsed JSONValue back to JSON text via .data()
  and then re-parsing it through ReerJSONDecoder, i.e. parsing each
  value three times. JSONStreamParser now also exposes an internal
  byte-slice API (parseSlices / finalizeSlices) that the streaming
  decoders use directly, eliminating the round-trip.

* parseOneValue was appending YYJSON_PADDING_SIZE zero bytes to the
  buffer and then removing them on every value. yyjson_read_opts in
  non-INSITU mode allocates its own padded buffer internally, so this
  was unnecessary churn — removed.

* JSONValueByteStream was building per-chunk Data by appending one
  byte at a time inside withUnsafeMutableBytes (which couldn't even
  cross await boundaries). Rewritten to read into a [UInt8] and
  construct the Data once.

* Clarified docs: documented the cross-chunk numeric deferral rule,
  the per-feed parse cost of JSONIncrementalReader, and the
  thread-safety guarantees. Removed the redundant
  options.contains(.json5) check on top of .allowTrailingCommas (the
  former includes the latter) and kept the OR as a clarity comment.
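
The cross-chunk numeric deferral rule from the first item can be sketched in isolation. The example below is a simplified stand-in, not the PR's code: `DeferringIntegerStream` is a hypothetical type that only handles integers separated by spaces or newlines, and it shows just the rule that a token running to the end of the buffer is held back until more input or `finalize()` arrives.

```swift
/// Hypothetical sketch of the deferral rule; not ReerJSON's JSONStreamParser.
struct DeferringIntegerStream {
    private var pending: [UInt8] = []

    /// Returns only the integers known to be complete after this chunk.
    mutating func feed(_ chunk: String) -> [Int] {
        pending.append(contentsOf: chunk.utf8)
        var values: [Int] = []
        var tokenStart: Int? = nil
        var consumed = 0
        for (index, byte) in pending.enumerated() {
            let isSeparator = byte == 0x0A || byte == 0x20 // newline or space
            if isSeparator {
                if let start = tokenStart {
                    let text = String(decoding: pending[start..<index], as: UTF8.self)
                    // Tokens that are not integers are dropped in this sketch.
                    if let value = Int(text) { values.append(value) }
                    tokenStart = nil
                }
                consumed = index + 1
            } else if tokenStart == nil {
                tokenStart = index
            }
        }
        // A token that reaches the end of the buffer stays pending: the next
        // chunk might extend it ("12" followed by "345").
        pending.removeFirst(consumed)
        return values
    }

    /// Confirms that no more input follows and flushes the deferred token.
    mutating func finalize() -> [Int] {
        let text = String(decoding: pending, as: UTF8.self)
        pending.removeAll()
        return Int(text).map { [$0] } ?? []
    }
}
```

Feeding `"12"` and then `"345\n6"` yields nothing for the first chunk, `12345` for the second, and `6` only from `finalize()`.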

13 new tests cover number/float boundary splitting in both modes,
finalize() flushing of values without trailing newline, structural
errors not being mistaken for needMore, and concurrent access.
All 768 tests pass (719 pre-existing + 49 streaming).
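
As an illustration of why structural errors must not be mistaken for needMore, the sketch below contrasts classification by error code with classification by message substring. All names here (`ReadErrorCode`, `ParseError`, `FeedOutcome`, `classify`, `classifyByMessage`) are hypothetical stand-ins and do not reflect the actual JSONError or yyjson definitions.

```swift
import Foundation

/// Hypothetical subset of parser error codes, for illustration only.
enum ReadErrorCode {
    case unexpectedEnd
    case unexpectedCharacter
    case invalidNumber
    case invalidString
}

struct ParseError: Error {
    let code: ReadErrorCode
    let message: String
}

enum FeedOutcome: Equatable {
    case needMoreData
    case failed
}

/// Classifies by code: only a truncated input asks for more data.
func classify(_ error: ParseError) -> FeedOutcome {
    switch error.code {
    case .unexpectedEnd:
        return .needMoreData
    case .unexpectedCharacter, .invalidNumber, .invalidString:
        return .failed
    }
}

/// The fragile variant: any message containing the phrase looks like truncation.
func classifyByMessage(_ error: ParseError) -> FeedOutcome {
    return error.message.contains("Unexpected end") ? .needMoreData : .failed
}

// A real syntax error whose message happens to contain "Unexpected end".
let tricky = ParseError(code: .unexpectedCharacter,
                        message: "Unexpected end of key, found '}'")
print(classify(tricky))          // failed
print(classifyByMessage(tricky)) // needMoreData
```

Switching on the code also lets the compiler flag any newly added case that has not been classified.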

Co-authored-by: Cursor <cursoragent@cursor.com>

`JSONDocument` is `~Copyable`, and `Optional<~Copyable>` is not yet
supported on Swift 5.10 (the toolchain used by the Linux CI). The
`JSONIncrementalReader.feed(_:)` API previously returned
`JSONDocument?`, which compiled on macOS Swift 6 but failed on Linux
with:

  error: noncopyable type 'JSONDocument' cannot be used with generic
  type 'Optional<Wrapped>' yet

Replace the optional return with a dedicated non-copyable enum:

  public enum JSONIncrementalReadResult: ~Copyable {
      case ready(JSONDocument)
      case needMoreData
  }

This compiles cleanly on every supported toolchain and keeps the
call-site ergonomic (switch with a let-binding instead of optional
chaining).
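
A self-contained sketch of the call-site pattern follows, assuming a toolchain with noncopyable type support (Swift 5.9 or later). `Document`, `ReadResult`, `feed(buffered:expected:)`, and `describe(_:)` are hypothetical stand-ins that mirror the shape of the enum above; they are not the ReerJSON API.

```swift
/// Stand-in for a noncopyable document type.
struct Document: ~Copyable {
    let byteCount: Int
}

/// Mirrors the shape of JSONIncrementalReadResult with stand-in types.
enum ReadResult: ~Copyable {
    case ready(Document)
    case needMoreData
}

/// Hypothetical feed: reports a document once enough bytes are buffered.
func feed(buffered: Int, expected: Int) -> ReadResult {
    if buffered >= expected {
        return .ready(Document(byteCount: buffered))
    }
    return .needMoreData
}

/// Consumes the result and binds the document with a let-pattern.
func describe(_ result: consuming ReadResult) -> String {
    switch consume result {
    case .ready(let doc):
        return "ready (\(doc.byteCount) bytes)"
    case .needMoreData:
        return "need more data"
    }
}

print(describe(feed(buffered: 3, expected: 10)))
print(describe(feed(buffered: 10, expected: 10)))
```

Because the enum itself is noncopyable, ownership of the document moves into the `let` binding inside the switch instead of being copied out.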

Updated tests and doc examples accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>