feat: S3 object storage offloading for V3 bucket data #673

Draft
Sleepful wants to merge 6 commits into compressed-bucket-storage from s3-offloading

@Sleepful Sleepful commented Jun 11, 2026

Collaborator

Summary

Offload BucketDataDocumentV3.ops[] arrays to object storage (S3), keeping only a metadata shell in MongoDB. The service reads S3 objects and streams ops to clients using the existing wire protocol — no protocol changes. Object storage is optional at configuration level; when not configured, ops remain inline in MongoDB as today.

Changes

  • ObjectStorage interface: put/get/delete contract, decouples storage backend from core logic
  • S3ObjectStorage: production implementation using @aws-sdk/client-s3 (static import)
  • MemoryObjectStorage: in-memory test double (no Docker/S3 needed in CI)
  • Config: optional object_storage: section in MongoStorageConfig with type s3
  • Write path: flushBucketData() uploads zstd-compressed BSON ops to S3, inserts metadata shell in MongoDB
  • Read path: getBucketDataBatchImpl() parallel pre-fetches S3 objects, patches doc.ops before existing decode loop
  • Compaction: compactSingleBucket() and clearBucketLeading() are S3-aware (read/write/cleanup old refs)
  • Injection: objectStorage? threaded through the full chain, all optional fields, zero breaking changes
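
The storage contract and its in-memory test double might look roughly like this. It is a minimal sketch, not the code in this PR: only the put/get/delete contract and the MemoryObjectStorage name come from the description above. The method signatures, the use of Uint8Array, and returning undefined for a missing object are assumptions.

```typescript
// Hypothetical shape of the storage contract; signatures are assumed.
interface ObjectStorage {
  put(path: string, data: Uint8Array): Promise<void>;
  get(path: string): Promise<Uint8Array | undefined>;
  delete(path: string): Promise<void>;
}

// In-memory test double: keeps objects in a Map so tests need no S3 or Docker.
class MemoryObjectStorage implements ObjectStorage {
  private readonly objects = new Map<string, Uint8Array>();

  async put(path: string, data: Uint8Array): Promise<void> {
    // Copy so later mutation of the caller's buffer cannot change stored data.
    this.objects.set(path, data.slice());
  }

  async get(path: string): Promise<Uint8Array | undefined> {
    const found = this.objects.get(path);
    return found === undefined ? undefined : found.slice();
  }

  async delete(path: string): Promise<void> {
    this.objects.delete(path);
  }

  // Test helper: number of objects currently stored.
  get size(): number {
    return this.objects.size;
  }
}
```

Because callers only see the interface, the same write, read, and compaction code can run against either backend.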

Design Decisions

  • No inline threshold: all documents offload to S3 when configured. A general threshold is deferred to a follow-up.
  • S3 path format: bucket-data/<group>/<def>/<bucket>/<minOp>-<maxOp>
  • Zstd whole-object compression: entire BSON ops array compressed as one blob
  • Batch sizing: compressed_size * 3 heuristic keeps batch memory bounded for S3-backed docs
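
As an illustration of the path format and the batch sizing heuristic, here is a small sketch. The function and field names (objectPath, estimatedBytes, takeBatch, ChunkKey) are hypothetical and not taken from the PR; only the path layout and the factor of 3 come from the description above.

```typescript
// Identifiers for one chunk of bucket data. Field names are illustrative.
interface ChunkKey {
  group: string;
  def: string;
  bucket: string;
  minOp: bigint | number;
  maxOp: bigint | number;
}

// Builds bucket-data/<group>/<def>/<bucket>/<minOp>-<maxOp>.
function objectPath(key: ChunkKey): string {
  return `bucket-data/${key.group}/${key.def}/${key.bucket}/${key.minOp}-${key.maxOp}`;
}

// Rough decompressed-size estimate for an S3-backed document.
const DECOMPRESSED_ESTIMATE_FACTOR = 3;

function estimatedBytes(compressedSize: number): number {
  return compressedSize * DECOMPRESSED_ESTIMATE_FACTOR;
}

// Returns how many leading documents fit within limitBytes, using the
// estimate above. Always takes at least one document so progress is made.
function takeBatch(compressedSizes: number[], limitBytes: number): number {
  let total = 0;
  let count = 0;
  for (const size of compressedSizes) {
    const next = total + estimatedBytes(size);
    if (count > 0 && next > limitBytes) break;
    total = next;
    count++;
  }
  return count;
}
```

Counting only the metadata shell would let a batch grow far beyond its intended size, which is the failure mode the multiplier guards against.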

Manual Verification [TODO]

S3ObjectStorage is not exercised in CI. To manually validate before shipping:

  1. Start MinIO: minio server /tmp/minio-data --console-address :9001
  2. Configure object_storage with type: s3 and endpoint: http://localhost:9000
  3. Write/read/compact via client API, verify with mc ls

forcePathStyle: true is already set when endpoint is present (required by MinIO).
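
The endpoint-dependent setting can be sketched as a pure function that maps the config section to client options. This is an assumption-laden illustration: type and endpoint appear in the PR description, but the bucket and region field names, the default region, and the toClientOptions helper are hypothetical. The resulting object uses the region, endpoint, and forcePathStyle options that S3Client in @aws-sdk/client-s3 accepts.

```typescript
// Hypothetical subset of the object_storage config section.
interface ObjectStorageConfig {
  type: 's3';
  bucket: string; // assumed field name
  region?: string; // assumed field name
  endpoint?: string; // set for MinIO or another S3-compatible server
}

// Plain object shaped like the options passed to the S3 client constructor.
interface ClientOptions {
  region: string;
  endpoint?: string;
  forcePathStyle?: boolean;
}

function toClientOptions(config: ObjectStorageConfig): ClientOptions {
  // Default region is an assumption made for this sketch.
  const options: ClientOptions = { region: config.region ?? 'us-east-1' };
  if (config.endpoint !== undefined) {
    options.endpoint = config.endpoint;
    // Path-style addressing (http://host/bucket/key), required by MinIO.
    options.forcePathStyle = true;
  }
  return options;
}
```

For the MinIO setup above, toClientOptions({ type: 's3', bucket: 'test', endpoint: 'http://localhost:9000' }) yields options with forcePathStyle set, while a config without an endpoint leaves it unset.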

Sleepful added 6 commits June 9, 2026 18:49
Add failing tests in storage_s3_writing.test.ts that exercise the
MemoryObjectStorage helper and confirm the S3 write path guard condition
works. Thread the objectStorage option through the storage stack
(MongoBucketStorage, MongoSyncBucketStorage, MongoBucketBatch,
PersistedBatch) so it is available for future implementation.

Model changes: make ops optional in BucketDataDocumentV3 to support
storage_ref-only documents. Add StorageRef type and loadBucketDataDocument
guard for empty ops. Add S3ObjectStorage config type and object_storage
config field. Add @aws-sdk/client-s3 and @mongodb-js/zstd dependencies.

Update existing compacting tests to use non-null assertions on ops since
it is now optional.

Implement S3 offloading in PersistedBatchV3: BSON-serialize, zstd-compress
and upload bucket data chunks to objectStorage. Insert metadata shells
with storage_ref in MongoDB instead of inline ops. Update Phase 2b test
assertions with non-null accessors now that the write path works.

Add storage_s3_reading.test.ts with 3 failing tests for the S3 read path:
round-trip write/read, missing S3 object handling, and mixed inline+S3
batch reads. All 3 must fail until the read path fetches from S3.

…test

Pre-fetch and decompress S3 objects for storage_ref docs during
getBucketDataBatch so ops from S3-backed documents are included
in bucket data responses and size tracking.

Add red test for S3-aware compaction (Phase 2d): verifies that
compacted_state is populated correctly, S3 objects are cleaned,
MongoDB docs are replaced, and read path survives compaction.
This test fails because compactSingleBucket does not yet fetch
ops from S3-backed storage_ref documents.

Compaction now pre-fetches S3-backed ops before decode, uploads new S3
objects after rechunking, and cleans up old storage_refs after transaction
commit. Batch size calculation accounts for storage_ref.compressed_size.

S3ObjectStorage implements the ObjectStorage interface using
@aws-sdk/client-s3, wired through MongoStorageProvider when config specifies
object_storage.type: s3.

- Align S3 path format: write and compact both use maxOp (_id.o)
  suffix (minOp-maxOp-maxOp), not minOp
- Scale compaction batch size by compressed_size * 3 for S3-backed
  docs, matching the read path multiplier
- clearBucketLeading(): upload CLEAR doc and boundary survivors to
  S3 when objectStorage is configured, with old ref cleanup after
  the transaction
- Fix compaction test: allow S3 path reuse when op ranges don't
  change after dedup

- Remove dead `compression` field from StorageRef interface and all sites
- Add comments explaining compressed_size * 3 heuristic for byte tracking
- Simplify S3 paths from ${minOp}-${maxOp}-${maxOp} to ${minOp}-${maxOp}
- Invert objectStorage guards: inline path first, S3 as else branch
- loadBucketDataDocument() now throws on undefined ops (empty arrays still ok)
- Set doc.ops = [] in S3 fetch error catch blocks for graceful skip

changeset-bot Bot commented Jun 11, 2026

⚠️ No Changeset found

Latest commit: e17627c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@rkistner rkistner left a comment

Contributor

This looks quite promising, and I like the structure.

Some initial high-level comments:

  1. Currently there are various places in the code doing the same compression/decompression and serialize/deserialize logic. Should we perhaps do this in a wrapper class for ObjectStorage? E.g. a BucketDataObjectStorage that wraps ObjectStorage and does that logic?
  2. Node.js now has built-in zstd support, but I haven't checked how its API and performance compare with @mongodb-js/zstd. Since we're already using @mongodb-js/zstd implicitly, that should be fine.
  3. We do need a threshold for inlining ops directly in MongoDB storage before we can merge and release this: S3 has too much overhead for storing, say, individual 100-byte operations.
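
Points 1 and 3 could be combined in a wrapper along these lines. This is only a sketch of the idea: the BucketDataObjectStorage name is the one suggested above, but the OpsCodec interface, the StoredOps result type, the method names, and applying the threshold to the encoded size are all assumptions. The codec is injected so the sketch stays dependency-free; in the PR it would correspond to BSON serialization plus zstd compression.

```typescript
// Assumed storage contract (put/get/delete), repeated here for completeness.
interface ObjectStorage {
  put(path: string, data: Uint8Array): Promise<void>;
  get(path: string): Promise<Uint8Array | undefined>;
  delete(path: string): Promise<void>;
}

type Op = Record<string, unknown>;

// Serialization and compression live behind one interface (hypothetical).
interface OpsCodec {
  encode(ops: Op[]): Promise<Uint8Array>;
  decode(bytes: Uint8Array): Promise<Op[]>;
}

// Either the ops stay inline in the database, or a reference is stored.
type StoredOps =
  | { kind: 'inline'; ops: Op[] }
  | { kind: 'ref'; path: string; compressed_size: number };

class BucketDataObjectStorage {
  private readonly storage: ObjectStorage;
  private readonly codec: OpsCodec;
  private readonly inlineThresholdBytes: number;

  constructor(storage: ObjectStorage, codec: OpsCodec, inlineThresholdBytes: number) {
    this.storage = storage;
    this.codec = codec;
    this.inlineThresholdBytes = inlineThresholdBytes;
  }

  // Encodes once, then decides: small payloads stay inline, large ones upload.
  async putOps(path: string, ops: Op[]): Promise<StoredOps> {
    const bytes = await this.codec.encode(ops);
    if (bytes.byteLength < this.inlineThresholdBytes) {
      return { kind: 'inline', ops };
    }
    await this.storage.put(path, bytes);
    return { kind: 'ref', path, compressed_size: bytes.byteLength };
  }

  // Fetches and decodes; a missing object is an error, not an empty result.
  async getOps(path: string): Promise<Op[]> {
    const bytes = await this.storage.get(path);
    if (bytes === undefined) {
      throw new Error(`Missing object: ${path}`);
    }
    return this.codec.decode(bytes);
  }
}
```

With this shape, the write path, read path, and compaction would each call putOps or getOps instead of repeating the encode and decode steps.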

await session.endSession();
}

// After commit: delete old S3 objects (best-effort)

Not a blocker for initial testing, but it could be problematic if we leave orphaned objects in the S3 bucket indefinitely (either from the delete request failing, or from, say, a process crash/restart between the commit and the delete).

Is there some way we can ensure these are cleaned up eventually? Maybe persisting a "delete queue" in MongoDB, or running a periodic cleanup job (maybe part of the compact job)?
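
One possible shape for the "delete queue" idea is sketched below. Everything here is hypothetical: the DeleteQueue class, its methods, and the PendingDelete fields are illustrative names. An in-memory array stands in for what would be a MongoDB collection, written in the same transaction as the compaction so that a crash between commit and delete leaves a record to retry.

```typescript
// A deletion that still has to happen. In a real implementation this would
// be a persisted document; here a plain array stands in for the collection.
interface PendingDelete {
  path: string;
  attempts: number;
}

class DeleteQueue {
  private pending: PendingDelete[] = [];

  // Record paths to delete (conceptually inside the compaction transaction).
  enqueue(paths: string[]): void {
    for (const path of paths) {
      this.pending.push({ path, attempts: 0 });
    }
  }

  // Tries each pending delete once. Failures are re-queued for the next run,
  // so a periodic job eventually removes every orphan.
  async drain(deleteObject: (path: string) => Promise<void>): Promise<number> {
    // Take a snapshot so entries enqueued during the drain are not lost.
    const batch = this.pending;
    this.pending = [];
    let deleted = 0;
    for (const entry of batch) {
      try {
        await deleteObject(entry.path);
        deleted++;
      } catch {
        this.pending.push({ path: entry.path, attempts: entry.attempts + 1 });
      }
    }
    return deleted;
  }

  get length(): number {
    return this.pending.length;
  }
}
```

A periodic job, possibly the compact job itself, could call drain with the storage backend's delete function.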

Comment on lines +500 to +503
// Track sizes: for S3 docs multiply compressed_size by 3 as a rough
// decompressed estimate to keep chunk byte tracking bounded. Without a
// multiplier, metadata shells (~200 bytes) would let thousands of
// S3-backed docs pack into a single chunk before splitting.

We already have the size on the MongoDB document. Could we use that instead of the estimate?
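
A sketch of that suggestion, assuming the metadata shell carries a recorded byte size. The size field name and its meaning (uncompressed bytes recorded at write time) are assumptions, as is the trackedBytes helper; storage_ref.compressed_size is the field mentioned in the commits.

```typescript
// Hypothetical view of a bucket data document for size tracking.
interface DocShell {
  size?: number; // assumed field: byte size recorded when the doc was written
  storage_ref?: { path: string; compressed_size: number };
}

// Prefer the recorded size; fall back to the compressed_size * 3 estimate.
function trackedBytes(doc: DocShell): number {
  if (doc.size !== undefined) {
    return doc.size;
  }
  if (doc.storage_ref !== undefined) {
    return doc.storage_ref.compressed_size * 3;
  }
  return 0;
}
```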

Comment on lines +492 to +493
this.logger.warn(`Failed to fetch/decompress S3 object ${doc.storage_ref?.path}: ${err}`);
doc.ops = [];

This should be a hard error - setting doc.ops = [] may result in data inconsistencies.
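
A hard-error variant could look like the sketch below. The loadOps helper, the RefDoc type, and the injected fetchOps callback are hypothetical names for illustration. The point is that a failed fetch propagates with the object path attached instead of being replaced by an empty ops array.

```typescript
// Hypothetical minimal view of a document that may be S3-backed.
interface RefDoc {
  storage_ref?: { path: string };
  ops?: unknown[];
}

// Returns the document's ops, fetching them when they live in object storage.
// fetchOps is assumed to download, decompress, and deserialize one object.
async function loadOps(
  doc: RefDoc,
  fetchOps: (path: string) => Promise<unknown[]>
): Promise<unknown[]> {
  if (doc.storage_ref === undefined) {
    if (doc.ops === undefined) {
      throw new Error('Document has neither ops nor storage_ref');
    }
    return doc.ops;
  }
  const path = doc.storage_ref.path;
  try {
    return await fetchOps(path);
  } catch (err) {
    // Fail the whole read: silently returning [] would hide missing data.
    throw new Error(`Failed to fetch/decompress S3 object ${path}: ${String(err)}`);
  }
}
```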
