@@ -8,6 +8,7 @@ This directory contains the Terraform configuration for deploying kdevops infras
 - [Prerequisites](#prerequisites)
 - [Quick Start](#quick-start)
 - [Dynamic Configuration](#dynamic-configuration)
+- [Tier-Based GPU Selection](#tier-based-gpu-selection)
 - [SSH Key Security](#ssh-key-security)
 - [Configuration Options](#configuration-options)
 - [Provider Limitations](#provider-limitations)
@@ -111,6 +112,101 @@ scripts/lambda-cli --output json pricing list
 
 For more details on the dynamic configuration system, see [Dynamic Cloud Kconfig Documentation](../../docs/dynamic-cloud-kconfig.md).
 
+## Tier-Based GPU Selection
+
+Lambda Labs supports tier-based GPU selection with automatic fallback. Instead of specifying
+a single instance type, you can specify a maximum tier and kdevops will automatically select
+the highest available GPU within that tier.
+
+### How It Works
+
+1. **Specify Maximum Tier**: Choose a tier group like `H100_OR_LESS`
+2. **Capacity Check**: The system queries the Lambda Labs API for available instances
+3. **Tier Fallback**: The system tries each tier from highest to lowest until one is available
+4. **Auto-Provision**: kdevops deploys to the first region with available capacity
+
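The fallback loop in the steps above can be sketched in Python. This is a hedged illustration only: the tier list and the `select_tier` helper are hypothetical names, and the real logic lives in `scripts/lambdalabs_select_tier.py`.

```python
# Illustrative sketch of tier fallback; names are hypothetical, not the
# actual kdevops implementation.

# Fallback order for the H100_OR_LESS tier group (highest tier first).
H100_OR_LESS = [
    "gpu_1x_h100_sxm5",
    "gpu_1x_h100_pcie",
    "gpu_1x_a100_sxm4",
    "gpu_1x_a100",
    "gpu_1x_a6000",
    "gpu_1x_rtx6000",
    "gpu_1x_a10",
]

def select_tier(tier_order, capacity):
    """Return (instance_type, region) for the highest available tier.

    `capacity` maps instance type -> regions with free capacity, as
    reported by a capacity check against the Lambda Labs API.
    """
    for instance_type in tier_order:
        regions = capacity.get(instance_type, [])
        if regions:
            # Deploy to the first region with available capacity.
            return instance_type, regions[0]
    return None, None  # nothing available anywhere in this tier group

# Example: only the PCIe H100 currently has capacity, so the loop
# falls back past the SXM variant.
print(select_tier(H100_OR_LESS, {"gpu_1x_h100_pcie": ["us-west-1"]}))
```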
+### Single GPU Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `GH200_OR_LESS` | GH200 → H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | Maximum performance |
+| `H100_OR_LESS` | H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | High performance |
+| `A100_OR_LESS` | A100-SXM → A100 → A6000 → RTX6000 → A10 | Cost-effective |
+| `A6000_OR_LESS` | A6000 → RTX6000 → A10 | Budget-friendly |
+
+### Multi-GPU (8x) Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `8X_B200_OR_LESS` | 8x B200 → 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | Maximum multi-GPU |
+| `8X_H100_OR_LESS` | 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | High-end multi-GPU |
+| `8X_A100_OR_LESS` | 8x A100-80 → 8x A100 → 8x V100 | Cost-effective multi-GPU |
+
+### Quick Start with Tier Selection
+
+```bash
+# Single GPU - best available up to H100
+make defconfig-lambdalabs-h100-or-less
+make bringup
+
+# Single GPU - best available up to GH200
+make defconfig-lambdalabs-gh200-or-less
+make bringup
+
+# 8x GPU - best available up to H100
+make defconfig-lambdalabs-8x-h100-or-less
+make bringup
+```
+
+### Checking Capacity
+
+Before deploying, you can check current GPU availability:
+
+```bash
+# Check all available GPU instances
+python3 scripts/lambdalabs_check_capacity.py
+
+# Check specific instance type
+python3 scripts/lambdalabs_check_capacity.py --instance-type gpu_1x_h100_sxm5
+
+# JSON output for scripting
+python3 scripts/lambdalabs_check_capacity.py --json
+```
+
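The capacity data behind these checks comes from the Lambda Labs instance-types API. A minimal sketch of interpreting such a response follows; the exact response shape is an assumption based on the public API, and the response is mocked here rather than fetched, so no API key is needed.

```python
# Sketch of parsing a Lambda Labs /api/v1/instance-types style response.
# The field names below are assumptions, not a verified schema.

def regions_with_capacity(api_response, instance_type):
    """Return region names that currently have capacity for instance_type."""
    entry = api_response.get("data", {}).get(instance_type)
    if entry is None:
        return []
    return [r["name"] for r in entry.get("regions_with_capacity_available", [])]

# Minimal mocked response for illustration.
sample = {
    "data": {
        "gpu_1x_h100_sxm5": {
            "regions_with_capacity_available": [{"name": "us-west-1"}],
        },
        "gpu_1x_a10": {
            "regions_with_capacity_available": [],
        },
    }
}

print(regions_with_capacity(sample, "gpu_1x_h100_sxm5"))  # ['us-west-1']
print(regions_with_capacity(sample, "gpu_1x_a10"))        # []
```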
+### Tier Selection Script
+
+The tier selection script finds the best available GPU:
+
+```bash
+# Find best single GPU up to H100
+python3 scripts/lambdalabs_select_tier.py h100-or-less --verbose
+
+# Find best 8x GPU up to H100
+python3 scripts/lambdalabs_select_tier.py 8x-h100-or-less --verbose
+
+# List all available tier groups
+python3 scripts/lambdalabs_select_tier.py --list-tiers
+```
+
+Example output:
+```
+Checking tier group: h100-or-less
+Tiers to check (highest to lowest): h100-sxm, h100-pcie, a100-sxm, a100, a6000, rtx6000, a10
+
+Checking tier 'h100-sxm': gpu_1x_h100_sxm5
+Checking gpu_1x_h100_sxm5... ✓ AVAILABLE in us-west-1
+
+Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
+gpu_1x_h100_sxm5 us-west-1
+```
+
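The final line of the example output above pairs the instance type with its region, which suggests it can be consumed by other tooling. A hedged sketch of parsing that line (the output is simulated here; in practice you would capture stdout from `scripts/lambdalabs_select_tier.py`):

```python
# Hypothetical automation sketch: parse the trailing
# "<instance_type> <region>" line of the tier selection output.
sample_output = """Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
gpu_1x_h100_sxm5 us-west-1"""

# Take the last non-empty line and split it into its two fields.
instance_type, region = sample_output.strip().splitlines()[-1].split()
print(instance_type, region)  # gpu_1x_h100_sxm5 us-west-1
```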
+### Benefits of Tier-Based Selection
+
+- **Higher Success Rate**: Automatically falls back to available GPUs
+- **No Manual Intervention**: System handles capacity changes
+- **Best Performance**: Always gets the highest tier available
+- **Simple Configuration**: One defconfig covers multiple GPU types
+
 ## SSH Key Security
 
 ### Automatic Unique Keys (Default - Recommended)
@@ -168,6 +264,11 @@ The default configuration automatically:
 |--------|-------------|----------|
 | `defconfig-lambdalabs` | Smart instance + unique SSH keys | Production (recommended) |
 | `defconfig-lambdalabs-shared-key` | Smart instance + shared SSH key | Legacy/testing |
+| `defconfig-lambdalabs-gh200-or-less` | Best single GPU up to GH200 | Maximum performance |
+| `defconfig-lambdalabs-h100-or-less` | Best single GPU up to H100 | High performance |
+| `defconfig-lambdalabs-a100-or-less` | Best single GPU up to A100 | Cost-effective |
+| `defconfig-lambdalabs-8x-b200-or-less` | Best 8-GPU up to B200 | Maximum multi-GPU |
+| `defconfig-lambdalabs-8x-h100-or-less` | Best 8-GPU up to H100 | High-end multi-GPU |
 
 ### Manual Configuration
 
@@ -274,6 +375,8 @@ The Lambda Labs Terraform provider (elct9620/lambdalabs v0.3.0) has significant
 |--------|---------|
 | `lambdalabs_api.py` | Main API integration, generates Kconfig |
 | `lambdalabs_smart_inference.py` | Smart instance/region selection |
+| `lambdalabs_check_capacity.py` | Check GPU availability across regions |
+| `lambdalabs_select_tier.py` | Tier-based GPU selection with fallback |
 | `lambdalabs_ssh_keys.py` | SSH key management |
 | `lambdalabs_list_instances.py` | List running instances |
 | `lambdalabs_credentials.py` | Manage API credentials |