Feat/low memory ingestion #61

Merged
koenvo merged 3 commits into main from feat/low-memory-ingestion
Mar 17, 2026

@koenvo (Contributor) commented Mar 17, 2026

No description provided.

koenvo added 3 commits March 12, 2026 17:46
Replace BytesIO with BufferedStream (SpooledTemporaryFile, 5MB threshold)
throughout the fetch and store pipeline to avoid loading large files fully
into memory.

- Add BufferedStream to utils: stays in memory up to 5MB, spills to disk
- Stream HTTP response body via iter_content(1MB chunks) into BufferedStream,
  hashing on the fly — no more response.content loading full body into memory
- Add http_decompress=True option to retrieve_http: stream-decompresses gzip
  content (e.g. .json.gz from S3) without double-compressing on store
- Use BufferedStream in _prepare_write_stream and _prepare_read_stream for
  compress/decompress in the store
- DraftFile.stream typed as BufferedStream with coercing validator for
  backwards compatibility (accepts BytesIO, bytes, or any readable)
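
For illustration, the spooling and hash-on-the-fly behaviour described in this commit message could look roughly like the sketch below. This is not the code from the PR: the helper name `spool_chunks` is hypothetical, and SHA-256 is an assumed hash algorithm (the commit message does not say which one is used). Only `SpooledTemporaryFile`, the 5MB threshold, and the 1MB chunk size come from the text above.

```python
import hashlib
import tempfile
from typing import BinaryIO, Iterable, Tuple

# 5MB threshold, as stated in the commit message.
SPOOL_THRESHOLD = 5 * 1024 * 1024


def spool_chunks(chunks: Iterable[bytes]) -> Tuple[BinaryIO, str, int]:
    """Hypothetical helper: copy chunks into a spooled temp file.

    The file stays in memory until SPOOL_THRESHOLD bytes are written,
    then transparently rolls over to disk. The digest is updated per
    chunk, so the full body never has to be held in memory at once.
    """
    stream = tempfile.SpooledTemporaryFile(max_size=SPOOL_THRESHOLD, mode="w+b")
    digest = hashlib.sha256()  # assumption: the PR does not name the algorithm
    size = 0
    for chunk in chunks:
        if not chunk:  # skip empty keep-alive chunks
            continue
        stream.write(chunk)
        digest.update(chunk)
        size += len(chunk)
    stream.seek(0)  # rewind so the caller can read from the start
    return stream, digest.hexdigest(), size


# With requests, the chunks would come from a streamed response, e.g.:
#   response = requests.get(url, stream=True)
#   stream, sha, size = spool_chunks(response.iter_content(chunk_size=1024 * 1024))
```

Taking any iterable of byte chunks keeps the helper independent of the HTTP client, which makes it easy to exercise with plain lists in tests.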
Replace BytesIO with BufferedStream (SpooledTemporaryFile, 5MB threshold)
throughout the fetch and store pipeline to avoid loading large files into memory.

- Add BufferedStream to utils: stays in memory up to 5MB, spills to disk
- Stream HTTP response via iter_content(1MB chunks) into BufferedStream,
  hashing on the fly — no more response.content loading full body into memory
- Detect gzip content (magic bytes) once in retrieve_http, store as
  DraftFile.content_compression_method — no re-reading the stream later
- Gzip files are stored as-is (no recompression CPU cost); size is read
  from the gzip trailer so file.size always reflects uncompressed data size
- _prepare_write_stream uses content_compression_method to skip compression
  for already-compressed files, and returns the actual compression_method
  used so File metadata is always correct
- DraftFile.stream typed as BufferedStream with coercing validator for
  backwards compatibility (accepts BytesIO, bytes, or any readable)
- Extract detect_compression() and gzip_uncompressed_size() into utils
- Set DraftFile.content_compression_method once on fetch, used by store
- Add tests for detect_compression, gzip_uncompressed_size, and http fetch
- Format with black
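
As a rough sketch of the two helpers named above: the function names `detect_compression` and `gzip_uncompressed_size` come from the commit message, but the signatures, return values, and bodies below are assumptions, not the code from this PR. Gzip files start with the magic bytes `1f 8b`, and the last four bytes of a gzip member hold the uncompressed size (ISIZE) as a little-endian unsigned 32-bit integer.

```python
import gzip
import io
import struct
from typing import BinaryIO, Optional

GZIP_MAGIC = b"\x1f\x8b"


def detect_compression(stream: BinaryIO) -> Optional[str]:
    """Return "gzip" if the stream starts with the gzip magic bytes.

    Assumed signature. The stream position is restored afterwards, so
    the caller can keep reading from where it was.
    """
    pos = stream.tell()
    head = stream.read(2)
    stream.seek(pos)
    return "gzip" if head == GZIP_MAGIC else None


def gzip_uncompressed_size(stream: BinaryIO) -> int:
    """Read the uncompressed size from the gzip trailer (ISIZE field).

    Assumed signature. Requires a seekable stream. ISIZE is the size
    modulo 2**32 and describes only the last member, so the value is
    wrong for payloads of 4 GiB or more and for multi-member files.
    """
    pos = stream.tell()
    stream.seek(-4, io.SEEK_END)
    (size,) = struct.unpack("<I", stream.read(4))
    stream.seek(pos)
    return size
```

Reading the trailer avoids decompressing the file just to learn its size, which matches the stated goal of storing gzip files as-is without extra CPU cost.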
@koenvo koenvo merged commit 3bc8373 into main Mar 17, 2026
12 checks passed
@koenvo koenvo deleted the feat/low-memory-ingestion branch March 17, 2026 10:48