@@ -8,6 +8,7 @@ This directory contains the Terraform configuration for deploying kdevops infras
 - [Prerequisites](#prerequisites)
 - [Quick Start](#quick-start)
 - [Dynamic Configuration](#dynamic-configuration)
+- [Tier-Based GPU Selection](#tier-based-gpu-selection)
 - [SSH Key Security](#ssh-key-security)
 - [Configuration Options](#configuration-options)
 - [Provider Limitations](#provider-limitations)
@@ -111,6 +112,101 @@ scripts/lambda-cli --output json pricing list
 
 For more details on the dynamic configuration system, see [Dynamic Cloud Kconfig Documentation](../../docs/dynamic-cloud-kconfig.md).
 
+## Tier-Based GPU Selection
+
+Lambda Labs supports tier-based GPU selection with automatic fallback. Instead of specifying
+a single instance type, you can specify a maximum tier and kdevops will automatically select
+the highest available GPU within that tier.
+
+### How It Works
+
+1. **Specify Maximum Tier**: Choose a tier group like `H100_OR_LESS`
+2. **Capacity Check**: The system queries the Lambda Labs API for available instances
+3. **Tier Fallback**: The system tries each tier from highest to lowest until one is available
+4. **Auto-Provision**: kdevops deploys to the first region with available capacity
+
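The fallback loop in the steps above can be sketched in Python. This is a hedged illustration only: the tier list and the `select_tier` helper are hypothetical names, and the real logic lives in `scripts/lambdalabs_select_tier.py`.

```python
# Illustrative sketch of tier fallback; names are hypothetical, not the
# actual kdevops implementation.

# Fallback order for the H100_OR_LESS tier group (highest tier first).
H100_OR_LESS = [
    "gpu_1x_h100_sxm5",
    "gpu_1x_h100_pcie",
    "gpu_1x_a100_sxm4",
    "gpu_1x_a100",
    "gpu_1x_a6000",
    "gpu_1x_rtx6000",
    "gpu_1x_a10",
]

def select_tier(tier_order, capacity):
    """Return (instance_type, region) for the highest available tier.

    `capacity` maps instance type -> regions with free capacity, as
    reported by a capacity check against the Lambda Labs API.
    """
    for instance_type in tier_order:
        regions = capacity.get(instance_type, [])
        if regions:
            # Deploy to the first region with available capacity.
            return instance_type, regions[0]
    return None, None  # nothing available anywhere in this tier group

# Example: only the PCIe H100 currently has capacity, so the loop
# falls back past the SXM variant.
print(select_tier(H100_OR_LESS, {"gpu_1x_h100_pcie": ["us-west-1"]}))
```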
+### Single GPU Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `GH200_OR_LESS` | GH200 → H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | Maximum performance |
+| `H100_OR_LESS` | H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | High performance |
+| `A100_OR_LESS` | A100-SXM → A100 → A6000 → RTX6000 → A10 | Cost-effective |
+| `A6000_OR_LESS` | A6000 → RTX6000 → A10 | Budget-friendly |
+
+### Multi-GPU (8x) Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `8X_B200_OR_LESS` | 8x B200 → 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | Maximum multi-GPU |
+| `8X_H100_OR_LESS` | 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | High-end multi-GPU |
+| `8X_A100_OR_LESS` | 8x A100-80 → 8x A100 → 8x V100 | Cost-effective multi-GPU |
+
+### Quick Start with Tier Selection
+
+```bash
+# Single GPU - best available up to H100
+make defconfig-lambdalabs-h100-or-less
+make bringup
+
+# Single GPU - best available up to GH200
+make defconfig-lambdalabs-gh200-or-less
+make bringup
+
+# 8x GPU - best available up to H100
+make defconfig-lambdalabs-8x-h100-or-less
+make bringup
+```
+
+### Checking Capacity
+
+Before deploying, you can check current GPU availability:
+
+```bash
+# Check all available GPU instances
+python3 scripts/lambdalabs_check_capacity.py
+
+# Check specific instance type
+python3 scripts/lambdalabs_check_capacity.py --instance-type gpu_1x_h100_sxm5
+
+# JSON output for scripting
+python3 scripts/lambdalabs_check_capacity.py --json
+```
+
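The capacity data behind these checks comes from the Lambda Labs instance-types API. A minimal sketch of interpreting such a response follows; the exact response shape is an assumption based on the public API, and the response is mocked here rather than fetched, so no API key is needed.

```python
# Sketch of parsing a Lambda Labs /api/v1/instance-types style response.
# The field names below are assumptions, not a verified schema.

def regions_with_capacity(api_response, instance_type):
    """Return region names that currently have capacity for instance_type."""
    entry = api_response.get("data", {}).get(instance_type)
    if entry is None:
        return []
    return [r["name"] for r in entry.get("regions_with_capacity_available", [])]

# Minimal mocked response for illustration.
sample = {
    "data": {
        "gpu_1x_h100_sxm5": {
            "regions_with_capacity_available": [{"name": "us-west-1"}],
        },
        "gpu_1x_a10": {
            "regions_with_capacity_available": [],
        },
    }
}

print(regions_with_capacity(sample, "gpu_1x_h100_sxm5"))  # ['us-west-1']
print(regions_with_capacity(sample, "gpu_1x_a10"))        # []
```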
+### Tier Selection Script
+
+The tier selection script finds the best available GPU:
+
+```bash
+# Find best single GPU up to H100
+python3 scripts/lambdalabs_select_tier.py h100-or-less --verbose
+
+# Find best 8x GPU up to H100
+python3 scripts/lambdalabs_select_tier.py 8x-h100-or-less --verbose
+
+# List all available tier groups
+python3 scripts/lambdalabs_select_tier.py --list-tiers
+```
+
+Example output:
+```
+Checking tier group: h100-or-less
+Tiers to check (highest to lowest): h100-sxm, h100-pcie, a100-sxm, a100, a6000, rtx6000, a10
+
+Checking tier 'h100-sxm': gpu_1x_h100_sxm5
+Checking gpu_1x_h100_sxm5... ✓ AVAILABLE in us-west-1
+
+Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
+gpu_1x_h100_sxm5 us-west-1
+```
+
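The final line of the example output above pairs the instance type with its region, which suggests it can be consumed by other tooling. A hedged sketch of parsing that line (the output is simulated here; in practice you would capture stdout from `scripts/lambdalabs_select_tier.py`):

```python
# Hypothetical automation sketch: parse the trailing
# "<instance_type> <region>" line of the tier selection output.
sample_output = """Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
gpu_1x_h100_sxm5 us-west-1"""

# Take the last non-empty line and split it into its two fields.
instance_type, region = sample_output.strip().splitlines()[-1].split()
print(instance_type, region)  # gpu_1x_h100_sxm5 us-west-1
```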
+### Benefits of Tier-Based Selection
+
+- **Higher Success Rate**: Automatically falls back to available GPUs
+- **No Manual Intervention**: System handles capacity changes
+- **Best Performance**: Always gets the highest tier available
+- **Simple Configuration**: One defconfig covers multiple GPU types
+
 ## SSH Key Security
 
 ### Automatic Unique Keys (Default - Recommended)
@@ -168,6 +264,11 @@ The default configuration automatically:
 |--------|-------------|----------|
 | `defconfig-lambdalabs` | Smart instance + unique SSH keys | Production (recommended) |
 | `defconfig-lambdalabs-shared-key` | Smart instance + shared SSH key | Legacy/testing |
+| `defconfig-lambdalabs-gh200-or-less` | Best single GPU up to GH200 | Maximum performance |
+| `defconfig-lambdalabs-h100-or-less` | Best single GPU up to H100 | High performance |
+| `defconfig-lambdalabs-a100-or-less` | Best single GPU up to A100 | Cost-effective |
+| `defconfig-lambdalabs-8x-b200-or-less` | Best 8-GPU up to B200 | Maximum multi-GPU |
+| `defconfig-lambdalabs-8x-h100-or-less` | Best 8-GPU up to H100 | High-end multi-GPU |
 
 ### Manual Configuration
 
@@ -274,6 +375,8 @@ The Lambda Labs Terraform provider (elct9620/lambdalabs v0.3.0) has significant
 |--------|---------|
 | `lambdalabs_api.py` | Main API integration, generates Kconfig |
 | `lambdalabs_smart_inference.py` | Smart instance/region selection |
+| `lambdalabs_check_capacity.py` | Check GPU availability across regions |
+| `lambdalabs_select_tier.py` | Tier-based GPU selection with fallback |
 | `lambdalabs_ssh_keys.py` | SSH key management |
 | `lambdalabs_list_instances.py` | List running instances |
 | `lambdalabs_credentials.py` | Manage API credentials |