
Commit 02efba9
Merge pull request #71 from linux-kdevops/cel/terraform-docs
Update terraform-related documentation
2 parents 8a5387d + 304c6a3

3 files changed: 136 additions & 2 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -526,7 +526,7 @@ Below are sections which get into technical details of how kdevops works.
 * [Linux distribution support](docs/linux-distro-support.md)
 * [Overriding all Ansible role options with one file](docs/ansible-override.md)
 * [kdevops Vagrant support](docs/kdevops-vagrant.md)
-* [kdevops terraform support - cloud setup with kdevops](docs/kdevops-terraform.md)
+* [kdevops terraform and cloud provider support](docs/kdevops-terraform.md) - AWS, Azure, GCE, OCI, Lambda Labs, DataCrunch
 * [kdevops local Ansible roles](docs/ansible-roles.md)
 * [Tutorial on building your own custom Vagrant boxes](docs/custom-vagrant-boxes.md)

docs/kdevops-terraform.md

Lines changed: 32 additions & 1 deletion
@@ -7,10 +7,13 @@ a Terraform plan.
 Terraform is used to deploy your development hosts on cloud virtual machines.
 Below is the list of cloud providers currently supported:
 
+**Traditional Cloud Providers:**
 * azure - Microsoft Azure
 * aws - Amazon Web Services
 * gce - Google Cloud Compute
 * oci - Oracle Cloud Infrastructure
+
+**Neoclouds (GPU-optimized):**
 * datacrunch - DataCrunch GPU Cloud
 * lambdalabs - Lambda Labs GPU Cloud

@@ -271,7 +274,18 @@ If your Ansible controller (where you run "make bringup") and your
 test instances operate inside the same subnet, you can disable the
 TERRAFORM_OCI_ASSIGN_PUBLIC_IP option for better network security.
 
-### DataCrunch - GPU Cloud Provider
+## Neoclouds
+
+A neocloud is a new type of specialized cloud provider that focuses on offering
+high-performance computing, particularly GPU-as-a-Service, to handle demanding
+AI and machine learning workloads. Unlike traditional, general-purpose cloud
+providers such as AWS or Azure, neoclouds are purpose-built for AI, with
+infrastructure optimized for raw speed, specialized hardware such as dense GPU
+clusters, and tailored services such as fast deployment and simplified pricing.
+
+kdevops supports the following neocloud providers:
+
+### DataCrunch
 
 kdevops supports DataCrunch, a cloud provider specialized in GPU computing
 with competitive pricing for NVIDIA A100, H100, B200, and B300 instances.
@@ -450,3 +464,20 @@ provider_installation {
 ```
 
 For more information, visit: https://datacrunch.io/
+
+### Lambda Labs
+
+kdevops supports Lambda Labs, a cloud provider focused on GPU instances for
+machine learning workloads with competitive pricing.
+
+For detailed documentation on Lambda Labs integration, including tier-based
+GPU selection, smart instance selection, and dynamic Kconfig generation, see:
+
+* [Lambda Labs Dynamic Cloud Kconfig](dynamic-cloud-kconfig.md) - Dynamic configuration generation for Lambda Labs
+* [Lambda Labs CLI Reference](lambda-cli.1) - Man page for the lambda-cli tool
+
+Lambda Labs offers various GPU instance types including A10, A100, and H100
+configurations. kdevops provides smart selection features that automatically
+choose the cheapest available instance type and region.
+
+For more information, visit: https://lambdalabs.com/
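The "cheapest available instance type and region" logic mentioned above amounts to a price-ordered scan over instance types with free capacity. A minimal sketch in the spirit of that selection; the price values and the shapes of the pricing and availability maps are illustrative, not the real Lambda Labs API schema:

```python
# Sketch of cheapest-available selection. Prices and data shapes are
# illustrative assumptions, not real Lambda Labs pricing or API output.

def cheapest_available(pricing, availability):
    """pricing: instance type -> $/hr; availability: type -> free regions.
    Return (type, region, price) for the lowest-priced type that has at
    least one region available, or None if nothing is available."""
    candidates = [
        (price, itype, availability[itype][0])
        for itype, price in pricing.items()
        if availability.get(itype)
    ]
    if not candidates:
        return None
    price, itype, region = min(candidates)
    return itype, region, price

# Example: the cheapest type (a10) has no capacity, so a100 is chosen.
pricing = {"gpu_1x_a10": 0.75, "gpu_1x_a100_sxm4": 1.29, "gpu_1x_h100_sxm5": 2.49}
availability = {"gpu_1x_a100_sxm4": ["us-east-1"], "gpu_1x_h100_sxm5": ["us-west-1"]}
print(cheapest_available(pricing, availability))
# ('gpu_1x_a100_sxm4', 'us-east-1', 1.29)
```

The real implementation lives in `lambdalabs_smart_inference.py`, which also queries live pricing and availability.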

terraform/lambdalabs/README.md

Lines changed: 103 additions & 0 deletions
@@ -8,6 +8,7 @@ This directory contains the Terraform configuration for deploying kdevops infras
 - [Prerequisites](#prerequisites)
 - [Quick Start](#quick-start)
 - [Dynamic Configuration](#dynamic-configuration)
+- [Tier-Based GPU Selection](#tier-based-gpu-selection)
 - [SSH Key Security](#ssh-key-security)
 - [Configuration Options](#configuration-options)
 - [Provider Limitations](#provider-limitations)
@@ -111,6 +112,101 @@ scripts/lambda-cli --output json pricing list
 
 For more details on the dynamic configuration system, see [Dynamic Cloud Kconfig Documentation](../../docs/dynamic-cloud-kconfig.md).
 
+## Tier-Based GPU Selection
+
+Lambda Labs supports tier-based GPU selection with automatic fallback. Instead of specifying
+a single instance type, you can specify a maximum tier and kdevops will automatically select
+the highest available GPU within that tier.
+
+### How It Works
+
+1. **Specify Maximum Tier**: Choose a tier group like `H100_OR_LESS`
+2. **Capacity Check**: The system queries the Lambda Labs API for available instances
+3. **Tier Fallback**: Tries each tier from highest to lowest until one is available
+4. **Auto-Provision**: Deploys to the first region with available capacity
+
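The fallback in steps 2 through 4 is a first-match scan over an ordered tier list. A minimal sketch, assuming a simplified capacity map; the tier list is abbreviated and the instance-type names other than `gpu_1x_h100_sxm5` are assumptions, not the real schema used by `scripts/lambdalabs_select_tier.py`:

```python
# Tier fallback sketch: walk the tier group from highest to lowest and
# pick the first instance type with capacity in any region. The list is
# abbreviated; names other than gpu_1x_h100_sxm5 are assumed.
H100_OR_LESS = [
    "gpu_1x_h100_sxm5",   # h100-sxm
    "gpu_1x_a100_sxm4",   # a100-sxm (name assumed)
    "gpu_1x_a6000",       # a6000 (name assumed)
    "gpu_1x_a10",         # a10 (name assumed)
]

def select_tier(tier_group, capacity):
    """capacity maps instance type -> regions with free capacity.
    Returns (instance_type, region), or (None, None) if nothing is free."""
    for instance_type in tier_group:
        regions = capacity.get(instance_type, [])
        if regions:
            return instance_type, regions[0]  # first region with capacity
    return None, None

# H100s are sold out here, so the scan falls back to the A100 tier.
capacity = {"gpu_1x_a100_sxm4": ["us-west-1"], "gpu_1x_a10": ["us-east-1"]}
print(select_tier(H100_OR_LESS, capacity))
# ('gpu_1x_a100_sxm4', 'us-west-1')
```

Because the scan stops at the first available tier, a deployment always lands on the highest tier that currently has capacity.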
+### Single GPU Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `GH200_OR_LESS` | GH200 → H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | Maximum performance |
+| `H100_OR_LESS` | H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | High performance |
+| `A100_OR_LESS` | A100-SXM → A100 → A6000 → RTX6000 → A10 | Cost-effective |
+| `A6000_OR_LESS` | A6000 → RTX6000 → A10 | Budget-friendly |
+
+### Multi-GPU (8x) Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `8X_B200_OR_LESS` | 8x B200 → 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | Maximum multi-GPU |
+| `8X_H100_OR_LESS` | 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | High-end multi-GPU |
+| `8X_A100_OR_LESS` | 8x A100-80 → 8x A100 → 8x V100 | Cost-effective multi-GPU |
+
+### Quick Start with Tier Selection
+
+```bash
+# Single GPU - best available up to H100
+make defconfig-lambdalabs-h100-or-less
+make bringup
+
+# Single GPU - best available up to GH200
+make defconfig-lambdalabs-gh200-or-less
+make bringup
+
+# 8x GPU - best available up to H100
+make defconfig-lambdalabs-8x-h100-or-less
+make bringup
+```
+
+### Checking Capacity
+
+Before deploying, you can check current GPU availability:
+
+```bash
+# Check all available GPU instances
+python3 scripts/lambdalabs_check_capacity.py
+
+# Check specific instance type
+python3 scripts/lambdalabs_check_capacity.py --instance-type gpu_1x_h100_sxm5
+
+# JSON output for scripting
+python3 scripts/lambdalabs_check_capacity.py --json
+```
+
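The `--json` flag makes the capacity report consumable by other tooling. A sketch of filtering such a report for instance types that actually have capacity; the schema shown here (instance type mapped to a list of free regions) is an assumption for illustration, not the documented output of `lambdalabs_check_capacity.py`:

```python
import json

# Assumed --json shape: instance type -> regions with free capacity.
# The real schema emitted by lambdalabs_check_capacity.py may differ.
report = json.loads("""
{
  "gpu_1x_h100_sxm5": [],
  "gpu_1x_a100_sxm4": ["us-west-1", "us-east-1"],
  "gpu_1x_a10": ["us-east-1"]
}
""")

# Keep only instance types with at least one available region.
available = {itype: regions for itype, regions in report.items() if regions}
for itype, regions in sorted(available.items()):
    print(f"{itype}: {', '.join(regions)}")
```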
+### Tier Selection Script
+
+The tier selection script finds the best available GPU:
+
+```bash
+# Find best single GPU up to H100
+python3 scripts/lambdalabs_select_tier.py h100-or-less --verbose
+
+# Find best 8x GPU up to H100
+python3 scripts/lambdalabs_select_tier.py 8x-h100-or-less --verbose
+
+# List all available tier groups
+python3 scripts/lambdalabs_select_tier.py --list-tiers
+```
+
+Example output:
+```
+Checking tier group: h100-or-less
+Tiers to check (highest to lowest): h100-sxm, h100-pcie, a100-sxm, a100, a6000, rtx6000, a10
+
+Checking tier 'h100-sxm': gpu_1x_h100_sxm5
+Checking gpu_1x_h100_sxm5... ✓ AVAILABLE in us-west-1
+
+Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
+gpu_1x_h100_sxm5 us-west-1
+```
+
+### Benefits of Tier-Based Selection
+
+- **Higher Success Rate**: Automatically falls back to available GPUs
+- **No Manual Intervention**: System handles capacity changes
+- **Best Performance**: Always gets the highest tier available
+- **Simple Configuration**: One defconfig covers multiple GPU types
+
 ## SSH Key Security
 
 ### Automatic Unique Keys (Default - Recommended)
@@ -168,6 +264,11 @@ The default configuration automatically:
 |--------|-------------|----------|
 | `defconfig-lambdalabs` | Smart instance + unique SSH keys | Production (recommended) |
 | `defconfig-lambdalabs-shared-key` | Smart instance + shared SSH key | Legacy/testing |
+| `defconfig-lambdalabs-gh200-or-less` | Best single GPU up to GH200 | Maximum performance |
+| `defconfig-lambdalabs-h100-or-less` | Best single GPU up to H100 | High performance |
+| `defconfig-lambdalabs-a100-or-less` | Best single GPU up to A100 | Cost-effective |
+| `defconfig-lambdalabs-8x-b200-or-less` | Best 8-GPU up to B200 | Maximum multi-GPU |
+| `defconfig-lambdalabs-8x-h100-or-less` | Best 8-GPU up to H100 | High-end multi-GPU |
 
 ### Manual Configuration

@@ -274,6 +375,8 @@ The Lambda Labs Terraform provider (elct9620/lambdalabs v0.3.0) has significant
 |--------|---------|
 | `lambdalabs_api.py` | Main API integration, generates Kconfig |
 | `lambdalabs_smart_inference.py` | Smart instance/region selection |
+| `lambdalabs_check_capacity.py` | Check GPU availability across regions |
+| `lambdalabs_select_tier.py` | Tier-based GPU selection with fallback |
 | `lambdalabs_ssh_keys.py` | SSH key management |
 | `lambdalabs_list_instances.py` | List running instances |
 | `lambdalabs_credentials.py` | Manage API credentials |
