Skip to content

Try to find nodes that support the volume type#3013

Open
djeebus wants to merge 12 commits into
mainfrom
try-to-find-matching-nodes
Open

Try to find nodes that support the volume type#3013
djeebus wants to merge 12 commits into
mainfrom
try-to-find-matching-nodes

Conversation

@djeebus

@djeebus djeebus commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Old API behavior:

  • on orchestrator node boot: nodes get custom labels, but no default labels
  • on sandbox create: get required sandbox creation labels associated with the team, add "default" to the labels of nodes if they have no labels, add "default" to labels of sandbox creation request if they have no label, find nodes whose labels match, schedule sandbox on those nodes. if they have no volume, try the next node, keep going
  • on volume creation/deletion, randomize list of orchestrators, try on each one until you succeed

New API behavior:

  • on orchestrator node boot, node gets (custom label OR default label) AND persistent volume type labels
  • on sandbox create: get required sandbox creation labels associated with the team, add "default" to the labels of sandbox creation request if they have no label, generate persistent volume type labels, find all nodes that match, schedule sandbox on those nodes.
  • volume creation/deletion, randomize list of orchesetrators, but try those with matching volume labels first. if those don't work, try the rest until you succeed or exhaust the options
  flowchart TB
      subgraph OLD["🕐 OLD — sandbox create"]
          direction TB
          OB["Node boot:<br/>node gets custom labels only<br/>(no default, no volume labels)"]
          OS1["Get team's required<br/>sandbox-creation labels"]
          OS2["If a node has no labels,<br/>add 'default' to the node"]
          OS3["If request has no labels,<br/>add 'default' to the request"]
          OS4["Find nodes whose labels match"]
          OS5["Schedule on a matching node;<br/>if no room, try the next match"]
          OB --> OS1 --> OS2 --> OS3 --> OS4 --> OS5
      end
  
      subgraph NEW["✨  NEW — sandbox create"]
          direction TB
          NB["Node boot:<br/>node gets (custom OR default label)<br/>AND persistent-volume-type labels"]

          F1{"sandbox-label-based-scheduling<br/>flag on?"}
          ANY["No label filtering:<br/>schedule on ANY node"]

          NS1["Get team's required<br/>sandbox-creation labels<br/>('default' if team has none)"]
          F2{"sandbox-volume-label-based-scheduling<br/>flag on?"}
          NS3["Add persistent<br/>volume-type labels"]
          NS4["Find ALL nodes matching<br/>(sandbox + volume labels)"]
          NS4b["Find ALL nodes matching<br/>sandbox labels only"]
          NS5["Schedule on those nodes"]

          NB --> F1
          F1 -- no --> ANY
          F1 -- yes --> NS1 --> F2
          F2 -- yes --> NS3 --> NS4 --> NS5
          F2 -- no --> NS4b --> NS5
      end   

      OS1 -.migration.-> NS1
Loading
  flowchart TB
      subgraph OLD["🕐 OLD — volume create / delete"]
          direction TB
          OB["Node boot:<br/>node gets custom labels only<br/>(no volume labels)"]
          OV1["Randomize list of<br/>ALL orchestrators"]
          OV2["Try each one in order<br/>until one succeeds"]
          OB --> OV1 --> OV2
      end

      subgraph NEW["✨ NEW — volume create / delete"]
          direction TB
          NB["Node boot:<br/>node gets persistent-volume-type labels"]
          NV1["Randomize list of orchestrators"]
          NV2["Try nodes WITH matching<br/>volume label first"]
          NV3{"Succeeded?"}
          NV4["Fall back to the rest,<br/>try until success or<br/>options exhausted"]
          NV5(["Done"])
          NB --> NV1 --> NV2 --> NV3
          NV3 -- yes --> NV5
          NV3 -- "no (retryable)" --> NV4 --> NV5
      end

      OV1 -.migration.-> NV1```
Loading

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@cursor

cursor Bot commented Jun 15, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Changes core sandbox placement and volume orchestration routing; misconfigured labels or flags could mis-schedule sandboxes or delay volume ops until fallback paths succeed.

Overview
This change steers volume create/delete and sandbox placement toward orchestrator nodes that advertise the right persistent volume type, instead of treating every node in a cluster the same.

For volumes, orchestrator retries now prefer nodes whose labels include persistent-volume-type=<type>, with shuffled order within the preferred and fallback pools. A LaunchDarkly flag can disable falling back to nodes without that label once the fleet is fully labeled.

For sandboxes, required scheduling labels are built in one place: team scheduling labels (defaulting to default when empty), and when enabled, extra labels derived from mounted volume types so sandboxes with persistent volumes land on capable nodes. Default node labels are applied when a node reports none, and label compatibility no longer injects defaults at comparison time—empty requirements mean no label filter at that layer.

New feature flags gate volume-type sandbox scheduling and unmatched-node volume fallback; cluster context for flags is aligned to use a resolved cluster UUID consistently.

Reviewed by Cursor Bugbot for commit 7eaaa4f. Bugbot is set up for automated code reviews on this repo. Configure here.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The node pointer retrieved from the nodes slice in findVolumeNodes can potentially be nil, which would cause a nil pointer dereference panic when calling node.Labels(). A nil check should be added before accessing its methods to ensure safety.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +173 to +178
for _, node := range nodes {
labels := node.Labels()
if _, ok := labels[expectedLabel]; ok {
matchingNodes = append(matchingNodes, node)
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The node pointer retrieved from the nodes slice can potentially be nil, which would cause a nil pointer dereference panic when calling node.Labels(). Adding a nil check before accessing its methods ensures safety.

	for _, node := range nodes {
		if node == nil {
			continue
		}
		labels := node.Labels()
		if _, ok := labels[expectedLabel]; ok {
			matchingNodes = append(matchingNodes, node)
		}
	}

@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Comment thread packages/api/internal/handlers/volume_util.go Outdated
Comment thread packages/api/internal/handlers/volume_util.go Outdated
Comment thread packages/api/internal/orchestrator/create_instance.go
Comment thread packages/api/internal/orchestrator/create_instance.go Outdated

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c721a5e. Configure here.

Comment thread packages/api/internal/orchestrator/placement/label_compatibility.go
@dobrac

dobrac commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

For the volumes, wondering if the specific handling for volumes in scheduling is necessary. In theory just the labels should be enough and then pipe through the existing pipeline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants