feat: S3 object storage offloading for V3 bucket data #673
Add failing tests in storage_s3_writing.test.ts that exercise the MemoryObjectStorage helper and confirm the S3 write path guard condition works. Thread the objectStorage option through the storage stack (MongoBucketStorage, MongoSyncBucketStorage, MongoBucketBatch, PersistedBatch) so it is available for future implementation. Model changes: make ops optional in BucketDataDocumentV3 to support storage_ref-only documents. Add StorageRef type and loadBucketDataDocument guard for empty ops. Add S3ObjectStorage config type and object_storage config field. Add @aws-sdk/client-s3 and @mongodb-js/zstd dependencies. Update existing compacting tests to use non-null assertions on ops since it is now optional.
Implement S3 offloading in PersistedBatchV3: BSON-serialize, zstd-compress and upload bucket data chunks to objectStorage. Insert metadata shells with storage_ref in MongoDB instead of inline ops. Update Phase 2b test assertions with non-null accessors now that the write path works. Add storage_s3_reading.test.ts with 3 failing tests for the S3 read path: round-trip write/read, missing S3 object handling, and mixed inline+S3 batch reads. All 3 must fail until the read path fetches from S3.
…test Pre-fetch and decompress S3 objects for storage_ref docs during getBucketDataBatch so ops from S3-backed documents are included in bucket data responses and size tracking. Add red test for S3-aware compaction (Phase 2d): verifies that compacted_state is populated correctly, S3 objects are cleaned, MongoDB docs are replaced, and read path survives compaction. This test fails because compactSingleBucket does not yet fetch ops from S3-backed storage_ref documents.
Compaction now pre-fetches S3-backed ops before decode, uploads new S3 objects after rechunking, and cleans up old storage_refs after transaction commit. Batch size calculation accounts for storage_ref.compressed_size. S3ObjectStorage implements the ObjectStorage interface using @aws-sdk/client-s3, wired through MongoStorageProvider when config specifies object_storage.type: s3.
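The write and read paths described in these commits can be sketched end to end. This is a minimal illustration, not the PR's code: only the names `ObjectStorage`, `MemoryObjectStorage`, `StorageRef`, `storage_ref`, `path` and `compressed_size` come from the text above. The method signatures, the `BucketDoc` shape, the helpers `writeShell` and `prefetchOps`, and the JSON placeholder codec (standing in for BSON + zstd) are all assumptions.

```typescript
// Assumed put/get/delete contract; the real signatures may differ.
interface ObjectStorage {
  put(path: string, data: Uint8Array): Promise<void>;
  get(path: string): Promise<Uint8Array | undefined>;
  delete(path: string): Promise<void>;
}

// In-memory stand-in, similar in spirit to the test helper named above.
class MemoryObjectStorage implements ObjectStorage {
  private readonly objects = new Map<string, Uint8Array>();
  async put(path: string, data: Uint8Array) { this.objects.set(path, data); }
  async get(path: string) { return this.objects.get(path); }
  async delete(path: string) { this.objects.delete(path); }
}

interface StorageRef { path: string; compressed_size: number; }
// Hypothetical document shape: either inline ops or a storage_ref.
interface BucketDoc { ops?: string[]; storage_ref?: StorageRef; }

// Placeholder codec (assumption, ASCII only): the PR uses BSON + zstd here.
const encode = (ops: string[]): Uint8Array =>
  Uint8Array.from(JSON.stringify(ops), (c) => c.charCodeAt(0));
const decode = (data: Uint8Array): string[] =>
  JSON.parse(Array.from(data, (b) => String.fromCharCode(b)).join(''));

// Write path: upload the ops and keep only a metadata shell for MongoDB.
async function writeShell(storage: ObjectStorage, path: string, ops: string[]): Promise<BucketDoc> {
  const data = encode(ops);
  await storage.put(path, data);
  return { storage_ref: { path, compressed_size: data.byteLength } };
}

// Read path: fetch all referenced objects in parallel, then patch doc.ops
// so that an existing decode loop can treat every document as inline.
async function prefetchOps(storage: ObjectStorage, docs: BucketDoc[]): Promise<void> {
  await Promise.all(docs.map(async (doc) => {
    if (doc.storage_ref == null) return; // inline document, nothing to fetch
    const data = await storage.get(doc.storage_ref.path);
    if (data == null) throw new Error(`Missing object ${doc.storage_ref.path}`);
    doc.ops = decode(data);
  }));
}
```

In this sketch a missing object is treated as an error rather than an empty document; how the PR should handle that case is discussed in the review below.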
- Align S3 path format: write and compact both use maxOp (_id.o) suffix (minOp-maxOp-maxOp), not minOp
- Scale compaction batch size by compressed_size * 3 for S3-backed docs, matching the read path multiplier
- clearBucketLeading(): upload CLEAR doc and boundary survivors to S3 when objectStorage is configured, with old ref cleanup after the transaction
- Fix compaction test: allow S3 path reuse when op ranges don't change after dedup
- Remove dead `compression` field from StorageRef interface and all sites
- Add comments explaining compressed_size * 3 heuristic for byte tracking
- Simplify S3 paths from ${minOp}-${maxOp}-${maxOp} to ${minOp}-${maxOp}
- Invert objectStorage guards: inline path first, S3 as else branch
- loadBucketDataDocument() now throws on undefined ops (empty arrays still ok)
- Set doc.ops = [] in S3 fetch error catch blocks for graceful skip
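The simplified path format and the size heuristic from this commit can be sketched as two small pure functions. The path layout `bucket-data/<group>/<def>/<bucket>/<minOp>-<maxOp>` and the factor of 3 come from the PR description; the function names, parameter types and the `inlineBytes` field are hypothetical.

```typescript
interface StorageRef { path: string; compressed_size: number; }
type OpId = bigint | number;

// Build the object key for one chunk of bucket data.
function objectPath(group: string | number, def: string, bucket: string, minOp: OpId, maxOp: OpId): string {
  return `bucket-data/${group}/${def}/${bucket}/${minOp}-${maxOp}`;
}

// Rough ratio of decompressed to compressed size used for byte tracking.
const DECOMPRESSED_ESTIMATE_FACTOR = 3;

// Estimate how many bytes a document contributes to a batch.
// Assumption: inline documents count their own size, S3-backed documents
// count only the scaled compressed size of the referenced object.
function trackedSize(doc: { inlineBytes: number; storage_ref?: StorageRef }): number {
  if (doc.storage_ref == null) return doc.inlineBytes;
  return doc.storage_ref.compressed_size * DECOMPRESSED_ESTIMATE_FACTOR;
}
```

With this estimate a 200-byte metadata shell that references a 50 KB compressed object counts as roughly 150 KB, so batches split long before thousands of shells accumulate.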
rkistner left a comment
This looks quite promising, and I like the structure.
Some initial high-level comments:
- Currently there are various places in the code doing the same compression/decompression and serialize/deserialize logic. Should we perhaps do this in a wrapper class for ObjectStorage? E.g. a BucketDataObjectStorage that wraps ObjectStorage and does that logic?
- Node.js now has built-in zstd support, but I haven't checked how the APIs and performance compare with @mongodb-js/zstd. Since we're already using @mongodb-js/zstd implicitly, that should be fine.
- We do need a threshold for inlining ops directly in MongoDB storage before we can merge & release this: S3 has too much overhead for storing, say, individual 100-byte operations.
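One way to read the first and third suggestions together is a wrapper that owns the serialize/compress step and also decides whether a chunk is worth offloading. The sketch below is hypothetical: `BucketDataObjectStorage` is the name proposed in the comment, but the `Codec` abstraction, the `Stored` union, the method names and the byte-based threshold are assumptions made for illustration. The codec is injected so the example does not depend on a particular BSON or zstd library.

```typescript
interface ObjectStorage {
  put(path: string, data: Uint8Array): Promise<void>;
  get(path: string): Promise<Uint8Array | undefined>;
}

// Serialization plus compression in one place (e.g. BSON + zstd).
interface Codec<T> {
  encode(value: T): Uint8Array;
  decode(data: Uint8Array): T;
}

interface StorageRef { path: string; compressed_size: number; }
type Stored<Op> = { ops: Op[] } | { storage_ref: StorageRef };

class BucketDataObjectStorage<Op> {
  private readonly storage: ObjectStorage;
  private readonly codec: Codec<Op[]>;
  private readonly inlineThresholdBytes: number;

  constructor(storage: ObjectStorage, codec: Codec<Op[]>, inlineThresholdBytes: number) {
    this.storage = storage;
    this.codec = codec;
    this.inlineThresholdBytes = inlineThresholdBytes;
  }

  // Small chunks stay inline; larger ones are uploaded and replaced by a ref.
  async store(path: string, ops: Op[]): Promise<Stored<Op>> {
    const data = this.codec.encode(ops);
    if (data.byteLength < this.inlineThresholdBytes) return { ops };
    await this.storage.put(path, data);
    return { storage_ref: { path, compressed_size: data.byteLength } };
  }

  // Callers get ops back regardless of where they were stored.
  async load(stored: Stored<Op>): Promise<Op[]> {
    if ('ops' in stored) return stored.ops;
    const data = await this.storage.get(stored.storage_ref.path);
    if (data == null) throw new Error(`Missing object ${stored.storage_ref.path}`);
    return this.codec.decode(data);
  }
}
```

With this shape the write, read and compaction paths would each call `store` and `load` instead of repeating the encode and decode steps.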
    await session.endSession();
    }

    // After commit: delete old S3 objects (best-effort)
Not a blocker for initial testing, but it could be problematic if we leave orphaned documents in the bucket indefinitely (either from the delete request failing, or from say a process crash/restart between the commit and the delete).
Is there some way we can ensure these are cleaned up eventually? Maybe persisting a "delete queue" in mongodb, or running a periodic cleanup job (maybe part of the compact job)?
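The "delete queue" idea could look roughly like this. It is a sketch under stated assumptions, not a proposal for the actual schema: `MemoryDeleteQueue` stands in for a MongoDB collection, `drainDeleteQueue` and the `attempts` counter are hypothetical, and the sketch assumes that the paths are enqueued inside the same transaction that replaces the documents, so a crash between commit and delete still leaves a record to retry.

```typescript
interface PendingDelete { path: string; attempts: number; }

// In-memory stand-in for a persisted queue of object paths to delete.
class MemoryDeleteQueue {
  private readonly entries = new Map<string, PendingDelete>();

  async enqueue(paths: string[]): Promise<void> {
    for (const path of paths) {
      if (!this.entries.has(path)) this.entries.set(path, { path, attempts: 0 });
    }
  }

  async peek(limit: number): Promise<PendingDelete[]> {
    return Array.from(this.entries.values()).slice(0, limit);
  }

  async remove(path: string): Promise<void> {
    this.entries.delete(path);
  }
}

// Run after commit and again from a periodic job (for example, compaction).
// An entry is removed only after its object delete succeeded.
async function drainDeleteQueue(
  queue: MemoryDeleteQueue,
  deleteObject: (path: string) => Promise<void>,
  limit = 100
): Promise<number> {
  let deleted = 0;
  for (const entry of await queue.peek(limit)) {
    try {
      await deleteObject(entry.path);
      await queue.remove(entry.path);
      deleted += 1;
    } catch (err) {
      entry.attempts += 1; // stays queued and is retried on the next run
    }
  }
  return deleted;
}
```

This only works if repeating a delete for an already-removed object is harmless, which the sketch assumes.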
    // Track sizes: for S3 docs multiply compressed_size by 3 as a rough
    // decompressed estimate to keep chunk byte tracking bounded. Without a
    // multiplier, metadata shells (~200 bytes) would let thousands of
    // S3-backed docs pack into a single chunk before splitting.
We already have the size on the mongodb document - could we use that instead of the estimate?
    this.logger.warn(`Failed to fetch/decompress S3 object ${doc.storage_ref?.path}: ${err}`);
    doc.ops = [];
This should be a hard error - setting doc.ops = [] may result in data inconsistencies.
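A hard-error variant might wrap the fetch so that the failure carries the object path and propagates to the caller instead of being replaced with an empty array. The names `ObjectFetchError` and `loadOpsOrThrow` are hypothetical, and `fetchAndDecode` stands in for whatever function performs the S3 get plus decompression.

```typescript
// Error type that records which object could not be loaded.
class ObjectFetchError extends Error {
  readonly path: string;

  constructor(path: string, cause: unknown) {
    super(`Failed to fetch/decompress S3 object ${path}: ${String(cause)}`);
    this.name = 'ObjectFetchError';
    this.path = path;
  }
}

// Either returns the decoded ops or throws; it never substitutes [].
async function loadOpsOrThrow<T>(
  path: string,
  fetchAndDecode: (path: string) => Promise<T[]>
): Promise<T[]> {
  try {
    return await fetchAndDecode(path);
  } catch (err) {
    throw new ObjectFetchError(path, err);
  }
}
```

Failing the whole batch lets the caller retry later, whereas silently skipping a document would hand clients a bucket with missing operations.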
Summary

Offload `BucketDataDocumentV3.ops[]` arrays to object storage (S3), keeping only a metadata shell in MongoDB. The service reads S3 objects and streams ops to clients using the existing wire protocol, with no protocol changes. Object storage is optional at the configuration level; when it is not configured, ops remain inline in MongoDB as today.

Changes

- `put`/`get`/`delete` contract, decouples storage backend from core logic
- `@aws-sdk/client-s3` (static import)
- `object_storage:` section in `MongoStorageConfig` with type `s3`
- `flushBucketData()` uploads zstd-compressed BSON ops to S3, inserts metadata shell in MongoDB
- `getBucketDataBatchImpl()` parallel pre-fetches S3 objects, patches `doc.ops` before existing decode loop
- `compactSingleBucket()` and `clearBucketLeading()` are S3-aware (read/write/cleanup old refs)
- `objectStorage?` threaded through the full chain, all optional fields, zero breaking changes

Design Decisions

- `bucket-data/<group>/<def>/<bucket>/<minOp>-<maxOp>`
- `compressed_size * 3` heuristic keeps batch memory bounded for S3-backed docs

Manual Verification [TODO]

S3ObjectStorage is not exercised in CI. To manually validate before shipping:

- `minio server /tmp/minio-data --console-address :9001`
- `object_storage` with `endpoint: http://localhost:9000` and S3 type
- `mc ls`
- `forcePathStyle: true` is already set when `endpoint` is present (required by MinIO).
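The `forcePathStyle` behaviour mentioned above can be illustrated as a small mapping from config to client options. The config shape (`bucket`, `region`) and the function name `toS3ClientOptions` are assumptions, since the PR text only names the `object_storage` section, its `s3` type and the `endpoint` field. `region`, `endpoint` and `forcePathStyle` are options accepted by `S3Client` in `@aws-sdk/client-s3`; the sketch builds a plain options object rather than importing the SDK.

```typescript
// Hypothetical config shape for the object_storage section.
interface S3ObjectStorageConfig {
  type: 's3';
  bucket: string;
  region: string;
  endpoint?: string; // e.g. http://localhost:9000 for a local MinIO
}

// Subset of the options that would be passed to the S3 client constructor.
interface S3ClientOptions {
  region: string;
  endpoint?: string;
  forcePathStyle?: boolean;
}

// A custom endpoint implies path-style addressing, which MinIO requires.
function toS3ClientOptions(config: S3ObjectStorageConfig): S3ClientOptions {
  if (config.endpoint == null) return { region: config.region };
  return { region: config.region, endpoint: config.endpoint, forcePathStyle: true };
}
```

Against a real AWS bucket no endpoint is set, so the client keeps its default addressing style.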