Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 62 additions & 0 deletions toc/rfc/rfc-draft-usage-snapshots.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
# Meta
[meta]: #meta
- Name: Usage Snapshots for App and Service Usage Baselines
- Start Date: 2026-03-12
- Author(s): @joyvuu-dave
- Status: Draft
- RFC Pull Request: [community#1449](https://github.com/cloudfoundry/community/pull/1449)


## Summary

This RFC proposes new V3 API endpoints for capturing point-in-time usage snapshots of all running app processes and service instances. Snapshots provide a non-destructive way for billing consumers to establish a baseline of current platform usage, tied to a checkpoint in the usage event stream. The feature is purely additive to Cloud Controller and does not modify any existing endpoints or behavior.

## Problem

When a new billing consumer wants to start processing usage events, the START/CREATE events for long-running apps and services have often already been pruned (default 31-day retention). The only current way to establish a baseline is `destructively_purge_all_and_reseed`, which truncates the entire event table and synthesizes new events for running resources. This breaks any existing consumer's event stream -- their checkpoint IDs become invalid, and there is no way to scope the reset to a single consumer. See [Issue #4182](https://github.com/cloudfoundry/cloud_controller_ng/issues/4182) for further discussion.

The result is a gap in the V3 API: there is no safe way for a new consumer to onboard without risking data loss for consumers that are already operating.

## Proposal

Cloud Controller should support two new resource types -- app usage snapshots and service usage snapshots -- each exposed under `/v3/app_usage/snapshots` and `/v3/service_usage/snapshots` respectively. Each resource type supports `POST` (async, admin-write), `GET` list, `GET` by GUID, and `GET` chunks (admin-read). Only one snapshot generation MAY be in progress at a time; concurrent requests SHOULD return `409 Conflict`.

A snapshot captures every running process (or service instance) on the platform at the moment of generation, organized into chunks of up to 50 items grouped by space. Each snapshot records a `checkpoint_event_guid` referencing the most recent usage event at the time the snapshot was created. This checkpoint is what bridges the snapshot to the event stream: a consumer reads the snapshot for its baseline, then begins polling usage events with `after_guid` set to the checkpoint, ensuring no gap or overlap between the two data sources.

An app usage snapshot response looks like this:

```json
{
"guid": "abc-123",
"created_at": "2026-01-14T10:00:00Z",
"completed_at": "2026-01-14T10:00:03Z",
"checkpoint_event_guid": "def-456",
"checkpoint_event_created_at": "2026-01-14T09:59:58Z",
"summary": {
"instance_count": 15234,
"app_count": 2500,
"organization_count": 42,
"space_count": 156,
"chunk_count": 200
},
"links": {
"self": { "href": "/v3/app_usage/snapshots/abc-123" },
"checkpoint_event": { "href": "/v3/app_usage_events/def-456" },
"chunks": { "href": "/v3/app_usage/snapshots/abc-123/chunks" }
}
}
```

An earlier approach based on consumer registration was [prototyped](https://github.com/joyvuu-dave/cloud_controller_ng/tree/usage_consumer) but coupled consumer lifecycle to the event cleanup job, creating circular dependencies and an unsolvable zombie consumer problem -- dead consumers block pruning indefinitely, and there's no clean fix short of per-consumer heartbeats. Snapshots avoid this entirely by separating the baseline concern from the event stream.

Snapshots work well in conjunction with the already-reviewed keep-running-records change ([PR #4646](https://github.com/cloudfoundry/cloud_controller_ng/pull/4646)), which prevents start events from being pruned while apps and services are still running. Together they eliminate the need for `destructively_purge_all_and_reseed` when onboarding new billing consumers.

Daily cleanup jobs SHOULD remove completed snapshots older than a configurable retention period (default 31 days) and any in-progress snapshots that have been stuck for more than one hour. Snapshot generation is atomic -- if interrupted, it rolls back completely so no partial snapshots can exist.

This proposal is scoped to Cloud Controller and does not modify any existing API surface.

A reference implementation is available in [PR #4858](https://github.com/cloudfoundry/cloud_controller_ng/pull/4858). Community input is welcome -- in particular, whether a `DELETE` endpoint for manual snapshot removal and operator-configurable chunk sizes would be useful.

## Possible Future Work

CF CLI commands for listing and requesting snapshots (e.g. `cf app-usage-snapshot`, `cf service-usage-snapshot`) would improve the operator experience. The initial proposal focuses on the API surface, which billing systems will consume directly.