From beea2bba88afcd48e3100648135edd61062a4509 Mon Sep 17 00:00:00 2001 From: David Riddle Date: Thu, 12 Mar 2026 15:44:48 -0500 Subject: [PATCH 1/2] Add RFC draft: Usage Snapshots for App and Service Usage Baselines --- toc/rfc/rfc-draft-usage-snapshots.md | 62 ++++++++++++++++++++++++++++ 1 file changed, 62 insertions(+) create mode 100644 toc/rfc/rfc-draft-usage-snapshots.md diff --git a/toc/rfc/rfc-draft-usage-snapshots.md b/toc/rfc/rfc-draft-usage-snapshots.md new file mode 100644 index 000000000..d3af53e71 --- /dev/null +++ b/toc/rfc/rfc-draft-usage-snapshots.md @@ -0,0 +1,62 @@ +# Meta +[meta]: #meta +- Name: Usage Snapshots for App and Service Usage Baselines +- Start Date: 2026-03-12 +- Author(s): @joyvuu-dave +- Status: Draft +- RFC Pull Request: [community#XXXX](https://github.com/cloudfoundry/community/pull/XXXX) + + +## Summary + +This RFC proposes new V3 API endpoints for capturing point-in-time usage snapshots of all running app processes and service instances. Snapshots provide a non-destructive way for billing consumers to establish a baseline of current platform usage, tied to a checkpoint in the usage event stream. The feature is purely additive to Cloud Controller and does not modify any existing endpoints or behavior. + +## Problem + +When a new billing consumer wants to start processing usage events, the START/CREATE events for long-running apps and services have often already been pruned (default 31-day retention). The only current way to establish a baseline is `destructively_purge_all_and_reseed`, which truncates the entire event table and synthesizes new events for running resources. This breaks any existing consumer's event stream -- their checkpoint IDs become invalid, and there is no way to scope the reset to a single consumer. See [Issue #4182](https://github.com/cloudfoundry/cloud_controller_ng/issues/4182) for further discussion. + +The result is a gap in the V3 API: there is no safe way for a new consumer to onboard without risking data loss for consumers that are already operating. + +## Proposal + +Cloud Controller should support two new resource types -- app usage snapshots and service usage snapshots -- each exposed under `/v3/app_usage/snapshots` and `/v3/service_usage/snapshots` respectively. Each resource type supports `POST` (async, admin-write), `GET` list, `GET` by GUID, and `GET` chunks (admin-read). Only one snapshot generation MAY be in progress at a time; concurrent requests SHOULD return `409 Conflict`. + +A snapshot captures every running process (or service instance) on the platform at the moment of generation, organized into chunks of up to 50 items grouped by space. Each snapshot records a `checkpoint_event_guid` referencing the most recent usage event at the time the snapshot was created. This checkpoint is what bridges the snapshot to the event stream: a consumer reads the snapshot for its baseline, then begins polling usage events with `after_guid` set to the checkpoint, ensuring no gap or overlap between the two data sources. + +An app usage snapshot response looks like this: + +```json +{ + "guid": "abc-123", + "created_at": "2026-01-14T10:00:00Z", + "completed_at": "2026-01-14T10:00:03Z", + "checkpoint_event_guid": "def-456", + "checkpoint_event_created_at": "2026-01-14T09:59:58Z", + "summary": { + "instance_count": 15234, + "app_count": 2500, + "organization_count": 42, + "space_count": 156, + "chunk_count": 200 + }, + "links": { + "self": { "href": "/v3/app_usage/snapshots/abc-123" }, + "checkpoint_event": { "href": "/v3/app_usage_events/def-456" }, + "chunks": { "href": "/v3/app_usage/snapshots/abc-123/chunks" } + } +} +``` + +An earlier approach based on consumer registration was [prototyped](https://github.com/joyvuu-dave/cloud_controller_ng/tree/usage_consumer) but coupled consumer lifecycle to the event cleanup job, creating circular dependencies and an unsolvable zombie consumer problem -- dead consumers block pruning indefinitely, and there's no clean fix short of per-consumer heartbeats. Snapshots avoid this entirely by separating the baseline concern from the event stream. + +Snapshots work well in conjunction with the already-reviewed keep-running-records change ([PR #4646](https://github.com/cloudfoundry/cloud_controller_ng/pull/4646)), which prevents start events from being pruned while apps and services are still running. Together they eliminate the need for `destructively_purge_all_and_reseed` when onboarding new billing consumers. + +Daily cleanup jobs SHOULD remove completed snapshots older than a configurable retention period (default 31 days) and any in-progress snapshots that have been stuck for more than one hour. Snapshot generation is atomic -- if interrupted, it rolls back completely so no partial snapshots can exist. + +This proposal is scoped to Cloud Controller and does not modify any existing API surface. + +A reference implementation is available in [PR #4858](https://github.com/cloudfoundry/cloud_controller_ng/pull/4858). Community input is welcome -- in particular, whether a `DELETE` endpoint for manual snapshot removal and operator-configurable chunk sizes would be useful. + +## Possible Future Work + +CF CLI commands for listing and requesting snapshots (e.g. `cf app-usage-snapshot`, `cf service-usage-snapshot`) would improve the operator experience. The initial proposal focuses on the API surface, which billing systems will consume directly. From e5dbd540ec0f8501d11f098601e0e331c95517ac Mon Sep 17 00:00:00 2001 From: David Riddle Date: Thu, 12 Mar 2026 15:45:59 -0500 Subject: [PATCH 2/2] Update RFC Pull Request link to community#1449 --- toc/rfc/rfc-draft-usage-snapshots.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/toc/rfc/rfc-draft-usage-snapshots.md b/toc/rfc/rfc-draft-usage-snapshots.md index d3af53e71..b55640b0b 100644 --- a/toc/rfc/rfc-draft-usage-snapshots.md +++ b/toc/rfc/rfc-draft-usage-snapshots.md @@ -4,7 +4,7 @@ - Start Date: 2026-03-12 - Author(s): @joyvuu-dave - Status: Draft -- RFC Pull Request: [community#XXXX](https://github.com/cloudfoundry/community/pull/XXXX) +- RFC Pull Request: [community#1449](https://github.com/cloudfoundry/community/pull/1449) ## Summary