# Meta
[meta]: #meta
- Name: Dynamic Disks Support in BOSH
- Start Date: 2026-01-07
- Author(s): @mariash, @Alphasite
- Status: Draft <!-- Acceptable values: Draft, Approved, On Hold, Superseded -->
- RFC Pull Request: https://github.com/cloudfoundry/community/pull/1401

## Summary

This RFC proposes adding support for dynamic disk management in BOSH Director. The feature introduces a Director API that allows BOSH-deployed workloads running on BOSH-managed VMs to request persistent disks during their lifecycle, and to have the Director orchestrate the creation, attachment, detachment, and deletion of those disks using existing CPIs.

The intent is to make BOSH Director the central control plane for disk management across orchestrators such as Cloud Foundry, Kubernetes, CI systems, and other platforms deployed by BOSH.

## Problem

### Current limitations

The current BOSH disk model is almost entirely manifest-driven and static:

* Disks are defined with fixed sizes in deployment manifests.
* The Director creates and attaches disks only at deploy time.
* Runtime workflows on the VM have no supported way to request additional persistent storage.
* Any external orchestrator (e.g., Kubernetes CSI) must rely on its own IAAS-specific disk management, bypassing the Director.

This leads to several concrete problems:

### A. Fragmented Control Planes

When BOSH deploys another orchestrator (for example Kubernetes), that orchestrator typically brings its own storage subsystem. This means a deployment ends up with two independent systems managing IAAS disks.

### B. Coordinated Lifecycle Safety on Individual VMs

External disk managers have no awareness of the BOSH Director's per-VM lifecycle operations. When IAAS-specific automation performs disk workflows independently, it can unintentionally conflict with operations the Director is executing on the same VM, such as stop, start, and restart.

### C. Increased Operational Complexity

Operators deploying platforms like Kubernetes clusters via BOSH must:

* Configure IAAS storage access in addition to BOSH access.
* Maintain separate attach/detach automation.

### D. High-Speed Storage Requirements for Stateful Workloads

Local disks attached directly to VMs are essential for workloads that depend on predictable, low-latency, high-throughput persistent storage. In particular, distributed systems built on consensus algorithms such as Raft require fast durable writes for correct and performant operation. Today, such workloads are limited to external storage services like databases or remote file systems like NFS and SMB, which do not provide the low-latency, high-throughput performance these stateful workloads require.

### E. Lack of Flexibility for Stateful Workloads

Platforms and orchestrators deployed by BOSH, including Cloud Foundry, CI systems, and Kubernetes clusters, regularly move long-running processes between VMs as part of normal operations such as rolling updates and failure recovery. Workloads that depend on local persistent disks for high-speed durable storage require that storage to follow the workload as it moves between VMs.

## Proposal

### Goals

* Preserve backwards compatibility.
* Provide a Director API that can be used by authorized clients to manage disks as first-class resources.
* Ensure that dynamic disk management does not conflict with standard BOSH workflows such as deployments, upgrades, and VM lifecycle operations.
* Ensure that disk device discovery is handled consistently across IAAS providers.
* Ensure correct dynamic disk detachment and cleanup as part of VM and deployment lifecycle operations.
* Keep all IAAS interaction inside existing CPIs.

### BOSH Director API

Extend the BOSH Director to expose an authorized API for disk operations that provides, detaches, and deletes disks for a VM:

* `POST /dynamic_disks/provide`
* `POST /dynamic_disks/:disk_name/detach`
* `DELETE /dynamic_disks/:disk_name`

#### POST /dynamic_disks/provide

Accepts:

* `disk_name` - a unique disk identifier
* `disk_size` - size of the disk in MB (the standard unit used in BOSH for disk size)
* `disk_pool_name` - name of the disk pool provided in the BOSH cloud config
* `instance_id` - ID of the BOSH VM instance requesting the disk
* `metadata` - disk metadata that will be set in the IAAS on the disk resource

Returns: `disk_cid`
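
As a sketch, a provide request body might look like the following. All field values here are hypothetical illustrations, not examples from an actual deployment:

```json
{
  "disk_name": "raft-data-0",
  "disk_size": 10240,
  "disk_pool_name": "fast-ssd",
  "instance_id": "etcd/0f3e4d5c-6a7b-4c8d-9e0f-112233445566",
  "metadata": { "orchestrator": "kubernetes" }
}
```

A successful response would carry the IAAS disk identifier, e.g. `{"disk_cid": "vol-0abc123"}` (again, a hypothetical value).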

As part of this API call, the Director should schedule a ProvideDynamicDisk job.

If the disk already exists and is attached to a different VM, the job should fail.

If the disk already exists and is attached to the requested VM, the job should succeed, as the call is idempotent.

If the disk exists and is not attached to any VM, it will be attached to the requested VM using the CPI `attach_disk` call.

If the disk does not exist, it will be created using the CPI `create_disk` call and then attached to the requested VM using the `attach_disk` call.

When the disk is attached, the Director should tell the BOSH Agent to resolve the device path using the device path resolver it is configured with and create a symlink to that resolved path in `/var/vcap/data/dynamic_disks`.

If metadata is provided and differs from the existing metadata, the job should call the `set_disk_metadata` CPI method.
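
The branching above can be sketched as follows. This is a minimal illustration, not the Director's implementation; `disks`, `cpi`, and `agent` are hypothetical stand-ins for the Director's dynamic disk records, the CPI client, and the BOSH Agent client:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DiskRecord:
    """Hypothetical record the Director keeps per dynamic disk."""
    disk_cid: str
    attached_vm: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def provide_dynamic_disk(disks, cpi, agent, disk_name, disk_size,
                         disk_pool_name, instance_id, metadata=None):
    """Sketch of the ProvideDynamicDisk job's decision logic."""
    metadata = metadata or {}
    record = disks.get(disk_name)

    if record is None:
        # Disk does not exist: create it via the CPI, then attach below.
        disk_cid = cpi.create_disk(disk_size, disk_pool_name)
        record = disks[disk_name] = DiskRecord(disk_cid=disk_cid)
    elif record.attached_vm not in (None, instance_id):
        # Attached to a different VM: the job fails.
        raise RuntimeError(
            f"disk {disk_name} is attached to {record.attached_vm}")

    if record.attached_vm is None:
        cpi.attach_disk(instance_id, record.disk_cid)
        record.attached_vm = instance_id
        # Agent resolves the device path and symlinks it under
        # /var/vcap/data/dynamic_disks (hypothetical agent call name).
        agent.create_dynamic_disk_symlink(instance_id, record.disk_cid,
                                          disk_name)

    if metadata and metadata != record.metadata:
        cpi.set_disk_metadata(record.disk_cid, metadata)
        record.metadata = metadata

    # Idempotent: already-attached-to-this-VM calls return the same CID.
    return record.disk_cid
```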
#### POST /dynamic_disks/:disk_name/detach

Accepts:

* `disk_name` - a unique disk identifier

Returns: success or failure

As part of this API call, the Director should schedule a DetachDynamicDisk job. Detach is treated as a desired-state operation: no validation is required of the VM to which the disk is attached, and if the disk is already detached the operation should succeed. This ensures that the operation is idempotent.

If a disk with the specified name doesn't exist, this job should fail.

If the disk is attached, this operation should call the `detach_disk` CPI method.

When the disk is detached, the Director should call the `remove_dynamic_disk` BOSH Agent method. The BOSH Agent should remove the resolved device symlink in `/var/vcap/data/dynamic_disks`.
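
The desired-state semantics can be sketched as follows; as before, `disks`, `cpi`, and `agent` are hypothetical stand-ins, not actual Director classes:

```python
def detach_dynamic_disk(disks, cpi, agent, disk_name):
    """Sketch of the DetachDynamicDisk job: desired-state, idempotent."""
    record = disks.get(disk_name)
    if record is None:
        # Unknown disk names fail; everything else converges to "detached".
        raise RuntimeError(f"unknown disk {disk_name}")
    if record.attached_vm is not None:
        cpi.detach_disk(record.attached_vm, record.disk_cid)
        # Agent removes the resolved device symlink from
        # /var/vcap/data/dynamic_disks.
        agent.remove_dynamic_disk(record.attached_vm, disk_name)
        record.attached_vm = None
    # Already-detached disks fall through: the call succeeds unchanged.
```

Because the job only acts when the recorded state differs from the desired state, retries and duplicate requests are harmless.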
#### DELETE /dynamic_disks/:disk_name

Accepts:

* `disk_name` - a unique disk identifier

Returns: success or failure

As part of this API call, the Director should schedule a DeleteDynamicDisk job.

This job should succeed if the disk does not exist, since this call is idempotent.

If the disk exists, the job should delete the disk using the `delete_disk` CPI method and then remove the disk's entry from the dynamic disk records.

### VM lock

All VM operations should be protected by a VM lock, including starting, stopping, restarting, recreating, and deleting a VM, as well as attaching and detaching disks to a VM. This provides coordination between VM lifecycle and disk management operations, ensuring that the VM and its attached storage remain in a consistent and safe state.
### Dedicated worker queue

Disk management operations are executed in a dedicated worker queue to ensure that runtime storage workflows do not block or degrade standard BOSH operations such as deployments, upgrades, and VM lifecycle operations.
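
The isolation can be sketched with a separate worker pool draining its own queue; this is an illustrative sketch, not the Director's actual job-runner code:

```python
import queue
import threading

# Hypothetical sketch: disk jobs run on their own worker pool so they
# never contend with deployment workers for queue slots.
disk_queue = queue.Queue()


def disk_worker():
    while True:
        job = disk_queue.get()
        if job is None:  # shutdown sentinel
            break
        job()            # e.g. a ProvideDynamicDisk or DetachDynamicDisk job
        disk_queue.task_done()


workers = [threading.Thread(target=disk_worker, daemon=True) for _ in range(2)]
for w in workers:
    w.start()
```

Deployment jobs would go to a different queue served by different workers, so a burst of disk requests cannot starve a deploy.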
### Dynamic disk lifecycle integration

* When a VM is recreated, all dynamically attached disks MUST be safely detached before VM deletion and may be reattached to the replacement VM if requested by the orchestrator.
* When a VM is deleted, all dynamically attached disks MUST be detached prior to VM deletion.
* When a deployment is deleted, all dynamic disks associated with that deployment MUST be deleted.
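
The VM-recreate rule above can be sketched as follows; `disks`, `cpi`, `delete_vm`, and `create_vm` are hypothetical stand-ins for the Director's disk records, CPI client, and VM lifecycle steps:

```python
def recreate_vm_with_dynamic_disks(disks, cpi, instance_id,
                                   delete_vm, create_vm, reattach=True):
    """Sketch: detach dynamic disks before deleting the VM, then
    reattach them to the replacement VM if requested."""
    attached = [r for r in disks.values() if r.attached_vm == instance_id]

    # Detach MUST happen before the VM is deleted.
    for record in attached:
        cpi.detach_disk(instance_id, record.disk_cid)
        record.attached_vm = None

    delete_vm(instance_id)
    new_id = create_vm()

    # Reattachment is optional and driven by the orchestrator's request.
    if reattach:
        for record in attached:
            cpi.attach_disk(new_id, record.disk_cid)
            record.attached_vm = new_id
    return new_id
```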
