feat(azure_blob sink): add append blob support via blob_type option#25627
feat(azure_blob sink): add append blob support via blob_type option#25627Danielku15 wants to merge 2 commits into
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e5734e634c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if self.blob_type == AzureBlobType::Append && batch.max_bytes.is_none() { | ||
| batch.max_bytes = Some(APPEND_BLOB_MAX_BLOCK_BYTES); |
There was a problem hiding this comment.
Respect omitted append batch max_bytes
For append configs that set any other [batch] field but omit max_bytes (for example batch.timeout_secs), BatchConfig's per-field serde default has already populated batch.max_bytes with the bulk default of 10_000_000. This condition is therefore false, and the later limit_max_bytes(4 MiB) rejects the config at startup even though the user did not explicitly configure max_bytes, contradicting the new append-mode default.
Useful? React with 👍 / 👎.
| AzureBlobType::Append => { | ||
| append_blob( | ||
| &blob_client.append_blob_client(), | ||
| request.blob_data, | ||
| request.content_type, | ||
| request.content_encoding, | ||
| ) | ||
| .await |
There was a problem hiding this comment.
Serialize append writes per blob
When blob_type = append, these calls still go through the existing sink driver with configurable/adaptive request concurrency, so two batches targeting the same partition can execute append_blob at the same time. Azure orders appended blocks by the order the service receives them, not the original event order, so under request.concurrency > 1 (or after adaptive concurrency ramps up) flushes to the same blob can be persisted out of order; append mode needs per-blob serialization or an equivalent ordering guard.
Useful? React with 👍 / 👎.
| match append_client | ||
| .append_block(RequestContent::from(data.to_vec()), data_len, None) | ||
| .instrument(info_span!("request").or_current()) | ||
| .await |
There was a problem hiding this comment.
Avoid retrying non-idempotent appends
This append mutates the blob, but the request is still wrapped in the existing retry policy, which retries timeouts and Azure 5xx/429 errors by cloning and replaying the same AzureBlobRequest. If Azure commits the block and the client then observes a timeout or transient response error, the retry path appends the same batch again; append mode needs an append-position/ETag condition or retries disabled for unsafe cases.
Useful? React with 👍 / 👎.
| content_type: &str, | ||
| content_encoding: Option<&str>, | ||
| ) -> Result<(), Box<dyn std::error::Error + Send + Sync>> { | ||
| let data_len = data.len() as u64; |
There was a problem hiding this comment.
Enforce the limit on encoded append payloads
The startup check limits the batcher's pre-encoding event size, but Azure enforces the append-block limit on the serialized/compressed body length used here. With blob_type: append and JSON logs containing many escapable characters (or gzip overhead around an incompressible near-limit batch), Vector can produce data.len() > 4 MiB and send it to append_block, causing Azure to reject the whole batch despite the config validation.
Useful? React with 👍 / 👎.
| /// - `append`: each batch appends to the same blob. | ||
| /// `blob_append_uuid` defaults to `false`; `blob_time_format` defaults to `%Y-%m-%d`. | ||
| /// Multiple batches within the same time window write to the same blob. |
There was a problem hiding this comment.
Document SAS permissions for appends
For SAS-authenticated configs, the existing connection-string guidance only tells users to grant Read/Create permissions, which can pass healthcheck and blob creation but is insufficient for Append Block. This new append option should document that SAS tokens also need Add or Write permission; otherwise users following the component docs get 403s on every flush in append mode.
Useful? React with 👍 / 👎.
| let request_metadata = request_metadata_builder.build(&payload); | ||
| let request = request_options.build_request(metadata, request_metadata, payload); | ||
|
|
||
| let expected_date = Utc::now().format("%Y-%m-%d").to_string(); |
There was a problem hiding this comment.
Capture one timestamp in date-sensitive tests
This assertion calls Utc::now() after build_request, which already formatted its key using an earlier Utc::now(). If the test runs across a UTC date boundary between those calls, the generated key legitimately contains the previous day while the expected value uses the new day, creating a rare but avoidable flaky failure; capture the time window once or inject a clock into the request builder test.
Useful? React with 👍 / 👎.
Summary
Note
This PR overlaps with with #25545
I kept this PR as draft to update it with the tag support once the other PR is merged. append blobs will also need tagging/metadata support. Feel free to review this PR already.
Adds
blob_type: appendsupport to theazure_blobsink, implementing #19397.The default behavior (
blob_type: block) is unchanged. Whenblob_type: appendis set, each flush appends to a stable-named Azure Append Blob rather than creating a new uniquely-named blob per batch. This is the natural model for continuous log streaming where you want a single growing file per time window.Key design decisions:
blob_time_formatdefaults to%Y-%m-%d(daily rotation) andblob_append_uuiddefaults tofalsefor append type, matching the expected append use case. Both can still be overridden.append_blockfirst (hot path = 1 API call for an existing blob), create the blob on 404, retry. A 409 Conflict on create is swallowed — it means a concurrent writer created the blob first.batch.max_bytesis automatically defaulted to 4 MiB whenblob_type: appendis configured and the setting is not explicitly set. Values above 4 MiB are rejected at startup with a clear error.gunzip,zstd -d) handle these correctly.Vector configuration
Minimal append blob configuration:
This produces a single blob per day (e.g.
app/2024-07-18/2024-07-18.log) and appends each batch to it. Defaults are:blob_time_format: "%Y-%m-%d",blob_append_uuid: false,batch.max_bytes: 4194304.Explicit batch size and custom rotation:
How did you test this PR?
cargo test --no-default-features --features sinks-azure_blob sinks::azure_blob): 27 tests pass, including new tests for:blob_time_formatblob_typeisblockblob_type = "append"blob_type: appendwith no explicitbatch.max_bytessucceeds at startup (C1 regression test)blob_type: appendwithbatch.max_bytes > 4 MiBfails at startup with amax_bytes exceedserrorvalidate().limit_max_bytes()rejection for oversized valuescargo vdev int test azure): tested against Azurite (local Azure emulator) covering:Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changeloglabel to this PR.References
Notes
@vectordotdev/vectorto reach out to us regarding this PR.pre-pushhook, please see this template.make fmtmake check-clippy(if there are failures it's possible some of them can be fixed withmake clippy-fix)make testgit merge origin masterandgit push.Cargo.lock), pleaserun
make build-licensesto regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.