
ttl more keys #1644

Open
luke-lombardi wants to merge 1 commit into main from ll/ttl-more-keys

luke-lombardi (Contributor) commented Jun 5, 2026

Summary by cubic

Add TTLs to more Redis keys to prevent stale state, and harden the checkpoint workflow with validations, timeouts, and better signaling. Also adds GPU options to cache benchmarks and minor metrics/instance fixes.

  • Bug Fixes

    • Set TTL on endpoint request token counters when created, using task timeout (fallback to DefaultEndpointRequestTimeoutS); refresh TTL on acquire/release; tests added.
    • Set TTL on pod container connection counters when first created (podContainerConnectionTimeout); test added.
    • Clean up scheduler:container:worker:index on worker removal; test verifies key deletion.
    • Use i.StubConfig for container request secrets to avoid stale config in endpoint instances.
    • Rename sandbox startup metric from network.ip_scan to network.ip_load.
  • New Features

    • Checkpoint validations in UpdateConfig: reject checkpoint-enabled configs for serve stubs, for multi-GPU or multiple-GPU-type configs, and when workspace storage is missing; test added.
    • More robust checkpoint creation: unblock runner by writing the “checkpoint complete” signal on persistence failure; add per-phase logs and sizes for runtime checkpoint, filesystem copy, archive/hash, upload, and cache store; use a deadline for runtime checkpoint and a 5s timeout for Create/UpdateCheckpoint RPCs.
    • Benchmarks (b9bench): add --sandbox-gpu and --sandbox-gpu-count flags and include them in reports.
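
To illustrate the TTL-on-create pattern from the Bug Fixes list, here is a minimal Go sketch. It is not the code from this PR: `ttlStore`, `fakeClock`, `IncrWithTTL`, and `requestTTL` are hypothetical names, the store is an in-memory stand-in for Redis, and the "fall back when the task timeout is not positive" rule is an assumption. Against a real Redis the same idea would typically pair `INCR` with `EXPIRE`.

```go
package main

import (
	"sync"
	"time"
)

// fakeClock is a manually advanced clock so expiry can be shown without sleeping.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time       { return c.t }
func (c *fakeClock) AdvanceSeconds(n int) { c.t = c.t.Add(time.Duration(n) * time.Second) }

// entry is a counter value plus the instant at which it expires.
type entry struct {
	value     int64
	expiresAt time.Time
}

// ttlStore is a hypothetical in-memory stand-in for a Redis counter with a TTL.
type ttlStore struct {
	mu    sync.Mutex
	data  map[string]*entry
	clock *fakeClock
}

func newTTLStore(clock *fakeClock) *ttlStore {
	return &ttlStore{data: make(map[string]*entry), clock: clock}
}

// live returns the unexpired entry for key, or nil. Callers must hold s.mu.
func (s *ttlStore) live(key string) *entry {
	e, ok := s.data[key]
	if !ok {
		return nil
	}
	if !s.clock.Now().Before(e.expiresAt) {
		delete(s.data, key) // expired: behave as if the key never existed
		return nil
	}
	return e
}

// IncrWithTTL adjusts the counter and (re)sets its expiry in one locked step,
// so the key never exists without a TTL.
func (s *ttlStore) IncrWithTTL(key string, delta int64, ttl time.Duration) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	e := s.live(key)
	if e == nil {
		e = &entry{}
		s.data[key] = e
	}
	e.value += delta
	e.expiresAt = s.clock.Now().Add(ttl)
	return e.value
}

// Exists reports whether key is present and unexpired.
func (s *ttlStore) Exists(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.live(key) != nil
}

// requestTTL uses the task timeout, falling back to a default when it is not
// positive (the exact fallback condition is an assumption).
func requestTTL(taskTimeoutS, defaultTimeoutS int) time.Duration {
	if taskTimeoutS <= 0 {
		taskTimeoutS = defaultTimeoutS
	}
	return time.Duration(taskTimeoutS) * time.Second
}
```

Calling `IncrWithTTL` on both acquire (`+1`) and release (`-1`) refreshes the expiry each time, so a counter only disappears after a full TTL with no activity, and a crashed caller cannot leave it behind indefinitely.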

Written for commit 5fe0a80. Summary will update on new commits.


cubic-dev-ai (Bot, Contributor) left a comment

1 issue found across 13 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/api/v1/stub.go">

<violation number="1" location="pkg/api/v1/stub.go:619">
P2: Checkpoint policy is duplicated and already inconsistent between create and update paths. This can make API behavior diverge for serve stubs and increases future drift risk.</violation>
</file>


Comment thread pkg/api/v1/stub.go
	}

	if stubConfig.CheckpointEnabled {
		if stub.Type.IsServe() {


P2: Checkpoint policy is duplicated and already inconsistent between create and update paths. This can make API behavior diverge for serve stubs and increases future drift risk.
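
One way to address the duplication is to move the policy into a single function that both the create and update handlers call. The sketch below is hypothetical: `checkpointRequest` and its fields do not mirror the real `StubConfig`, and only the serve-stub message comes from the diff in this PR; the other error messages are made up for illustration.

```go
package main

import "errors"

// checkpointRequest holds only what the policy needs. Field names are
// hypothetical and do not mirror the real StubConfig.
type checkpointRequest struct {
	CheckpointEnabled   bool
	IsServe             bool
	GPUTypes            []string
	GPUCount            int
	HasWorkspaceStorage bool
}

var (
	// Message taken from the diff; the remaining messages are illustrative.
	errServeStub          = errors.New("Checkpoints are not supported for serve stubs")
	errMultipleGPUTypes   = errors.New("checkpoints are not supported with multiple GPU types")
	errMultiGPU           = errors.New("checkpoints are not supported with more than one GPU")
	errNoWorkspaceStorage = errors.New("checkpoints require workspace storage")
)

// validateCheckpointPolicy is the single source of truth for checkpoint
// eligibility; create and update paths would both call it.
func validateCheckpointPolicy(r checkpointRequest) error {
	if !r.CheckpointEnabled {
		return nil // nothing to validate when checkpoints are off
	}
	switch {
	case r.IsServe:
		return errServeStub
	case len(r.GPUTypes) > 1:
		return errMultipleGPUTypes
	case r.GPUCount > 1:
		return errMultiGPU
	case !r.HasWorkspaceStorage:
		return errNoWorkspaceStorage
	}
	return nil
}
```

Each handler would then translate a non-nil error into its own response (for example `HTTPBadRequest(err.Error())` on the update path), so the rules cannot drift apart between the two.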

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At pkg/api/v1/stub.go, line 619:

<comment>Checkpoint policy is duplicated and already inconsistent between create and update paths. This can make API behavior diverge for serve stubs and increases future drift risk.</comment>

<file context>
@@ -615,6 +615,21 @@ func (g *StubGroup) UpdateConfig(ctx echo.Context) error {
 	}
 
+	if stubConfig.CheckpointEnabled {
+		if stub.Type.IsServe() {
+			return HTTPBadRequest("Checkpoints are not supported for serve stubs")
+		}
</file context>
