diff --git a/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb new file mode 100644 index 0000000..2313fd7 --- /dev/null +++ b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb @@ -0,0 +1,1007 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Comprehensive OSFT Training Tutorial\n", + "\n", + "This notebook provides a comprehensive guide to Orthogonal Subspace Fine-Tuning (OSFT) using the training_hub library. We'll cover:\n", + "\n", + "- **All available parameters** and their detailed explanations\n", + "- **Single-node and multi-node training** configurations\n", + "- **Popular model examples** (Qwen 3 0.6B, Llama 3.1 8B Instruct, Phi 4 Mini, etc.)\n", + "- **Best practices and troubleshooting**\n", + "\n", + "OSFT (Orthogonal Subspace Fine-Tuning) is an algorithm based on [Nayak et al. (2025), arXiv:2504.07097](https://arxiv.org/abs/2504.07097) that enables continual training of pre-trained or instruction-tuned models **without** catastrophic forgetting and **without** needing replay buffers or supplementary datasets.\n", + "\n", + "This tutorial serves as both a learning resource and a template you can adapt for your specific continual learning needs.\n", + "\n", + "**Note:** For production workflows, we also provide focused example scripts for popular models: `scripts/osft_qwen_example.py`, `scripts/osft_llama_example.py`, and `scripts/osft_phi_example.py`, which offer more consistent logging.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What is OSFT?\n", + "\n", + "OSFT (Orthogonal Subspace Fine-Tuning) is a continual learning algorithm that allows you to adapt pre-trained or instruction-tuned models to new domains **without catastrophic forgetting**. Based on [Nayak et al. 
(2025), arXiv:2504.07097](https://arxiv.org/abs/2504.07097), OSFT fundamentally changes how we approach model adaptation.\n", + "\n", + "### Key Innovation\n", + "\n", + "Traditional fine-tuning updates all model parameters, which can overwrite previously learned knowledge. OSFT instead:\n", + "1. **Identifies orthogonal subspaces** in the model's weight matrices\n", + "2. **Restricts updates to these subspaces**, preserving existing knowledge\n", + "3. **Eliminates the need for replay buffers** or supplementary datasets\n", + "\n", + "### OSFT vs Traditional Fine-Tuning\n", + "\n", + "| Aspect | Traditional SFT | OSFT |\n", + "|--------|----------------|------|\n", + "| **Catastrophic Forgetting** | Common problem | Prevented by design |\n", + "| **Data Requirements** | Needs replay/mixed data | Only new domain data |\n", + "| **Preservation Method** | Data mixing ratios | Algorithm (math guarantees) |\n", + "| **Memory Usage** | Similar | Similar |\n", + "| **Complexity** | Complex data pipelines | Simple, direct |\n", + "\n", + "### When to Use OSFT\n", + "\n", + "**Perfect for:**\n", + "- Adding domain-specific knowledge (medical, legal, technical)\n", + "- Adapting to new languages or dialects\n", + "- Customizing instruction formats\n", + "- Continual learning across multiple domains\n", + "- Any scenario where you need to preserve existing capabilities\n", + "\n", + "**Not needed for:**\n", + "- Training from scratch\n", + "- Base model pre-training\n", + "- When you want to completely replace model behavior\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Understanding the Key Parameter: `unfreeze_rank_ratio`\n", + "\n", + "The `unfreeze_rank_ratio` is the most important OSFT-specific parameter. 
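Before looking at concrete settings, here is a toy numpy sketch of the idea behind `unfreeze_rank_ratio` (illustrative only — this is **not** training_hub's implementation): decompose a weight matrix with an SVD, treat the dominant singular directions as frozen carriers of existing knowledge, and leave only the lowest-importance fraction open for updates.

```python
import numpy as np

# Toy illustration of the rank split behind unfreeze_rank_ratio.
# NOT training_hub's implementation -- just the geometric idea:
# decompose a weight matrix, freeze the dominant singular directions
# (which encode existing knowledge), and keep only the lowest-rank
# fraction trainable.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # stand-in for one weight matrix
U, S, Vt = np.linalg.svd(W)         # S is sorted largest-first

unfreeze_rank_ratio = 0.25
rank = len(S)
n_trainable = int(rank * unfreeze_rank_ratio)  # lowest-importance directions
n_frozen = rank - n_trainable                  # dominant directions, preserved

print(f"total rank: {rank}, frozen: {n_frozen}, trainable: {n_trainable}")
```

The ratio simply sets which fraction of singular directions `unfreeze_rank_ratio` leaves open for updates.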
It controls the balance between preservation and adaptation.\n", + "\n", + "### What Does It Do?\n", + "\n", + "- Controls **how much of each weight matrix** can be updated during training\n", + "- Range: `0.0` to `1.0`\n", + "- Lower values = more preservation, slower adaptation\n", + "- Higher values = more adaptation, slightly less preservation\n", + "\n", + "### Visual Intuition\n", + "\n", + "Think of a weight matrix as a building:\n", + "- `unfreeze_rank_ratio = 0.1`: You can only renovate 10% of the rooms\n", + "- `unfreeze_rank_ratio = 0.3`: You can renovate 30% of the rooms\n", + "- `unfreeze_rank_ratio = 1.0`: You can renovate the entire building (standard fine-tuning)\n", + "\n", + "The \"rooms\" you renovate are carefully chosen to be orthogonal to existing knowledge, preventing damage to what's already there.\n", + "\n", + "### Recommended Settings by Use Case\n", + "\n", + "| Use Case | Recommended Ratio | Why? |\n", + "|----------|-------------------|------|\n", + "| **Minor format adjustments** | 0.1-0.15 | Minimal changes needed |\n", + "| **Domain vocabulary addition** | 0.15-0.25 | Add terms without losing general knowledge |\n", + "| **Domain specialization** | 0.25-0.35 | Balance preservation and new expertise |\n", + "| **Major capability expansion** | 0.35-0.5 | Significant new learning required |\n", + "| **Complete repurposing** | >0.5 | Rarely needed, approaching standard fine-tuning |\n", + "\n", + "### Practical Guidelines\n", + "\n", + "```python\n", + "# Conservative: Maximum preservation\n", + "unfreeze_rank_ratio = 0.2 # Great for adding specialized knowledge\n", + "\n", + "# Balanced: Good for most use cases \n", + "unfreeze_rank_ratio = 0.3 # Ideal default for domain adaptation\n", + "\n", + "# Aggressive: When you need significant changes\n", + "unfreeze_rank_ratio = 0.4 # Use when preservation is less critical\n", + "```\n", + "\n", + "**Pro tip:** Start conservative (0.2-0.3) and increase only if needed. 
It's easier to train again with a higher ratio than to recover lost capabilities!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "\n", + "data_path = \"./test_osft_data.jsonl\"\n", + "if not os.path.exists(data_path):\n", + "    print(f\"Creating dummy dataset at {data_path}\")\n", + "    dummy_data = [\n", + "        {\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I am doing well, thank you! How can I help you today?\"}]}\n", + "    ] * 10\n", + "    with open(data_path, \"w\") as f:\n", + "        for d in dummy_data:\n", + "            f.write(json.dumps(d) + \"\\n\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The `target_patterns` Parameter (Advanced Users Only)\n", + "\n", + "There's an optional `target_patterns` parameter that allows targeting specific model layers for OSFT:\n", + "\n", + "```python\n", + "target_patterns = None  # Default: applies OSFT to all appropriate layers (RECOMMENDED)\n", + "```\n", + "\n", + "**⚠️ Important:** This is an expert-level parameter. Unless you have deep knowledge of model architecture and a specific reason to limit OSFT to certain layers, **leave it as `None`**.\n", + "\n", + "If you do need to use it, it performs simple substring matching on module names:\n", + "- `target_patterns = [\"attention\"]` → Targets modules with \"attention\" in the name\n", + "- `target_patterns = [\"mlp\"]` → Targets modules with \"mlp\" in the name\n", + "\n", + "**For 99% of users:** Just use the default (`None`) and let OSFT handle layer selection automatically. 
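If you do reach for it, the substring matching can be pictured like this (the module names below are made up for illustration; real names depend on the model architecture):

```python
# Hypothetical illustration of the substring matching described above.
# Module names are invented for the example; the real names depend on
# the model you load.
target_patterns = ["attention", "mlp"]
module_names = [
    "model.layers.0.self_attention.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.embed_tokens",
]
selected = [
    name for name in module_names
    if any(pattern in name for pattern in target_patterns)
]
print(selected)  # embed_tokens matches neither pattern
```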
The algorithm knows what it's doing.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "Before running this notebook, install the required dependencies:\n", + "\n", + "```bash\n", + "# Install training-hub (brings in mini-trainer and other deps)\n", + "pip install training-hub\n", + "\n", + "# Install PyTorch 2.9+ with CUDA 12.9 support (required by mini-trainer 0.7+)\n", + "pip install torch==2.9.0+cu129 torchvision==0.24.0+cu129 --index-url https://download.pytorch.org/whl/cu129\n", + "\n", + "# Install a compatible NCCL build (needed if system NCCL mismatches your CUDA driver)\n", + "pip install nvidia-nccl-cu12 nvidia-nvshmem-cu12\n", + "```\n", + "\n", + "> **Note:** If `flash-attn` is not available on your platform, the notebook sets `TESTING=true` to fall back to PyTorch SDPA attention. If `liger-kernel` is not installed, set `use_liger = False` in the parameters cell." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup and Imports\n", + "\n", + "Let's start by importing the necessary libraries and setting up our environment.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import training_hub for OSFT training\n", + "from training_hub import osft\n", + "\n", + "# Standard library imports\n", + "import os\n", + "import time\n", + "from datetime import datetime\n", + "from pathlib import Path\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Format Requirements\n", + "\n", + "Before configuring your training, ensure your data is in the correct format. 
OSFT uses the mini-trainer backend, which supports both standard messages format and pre-processed datasets.\n", + "\n", + "### Required Format: JSONL with Messages\n", + "\n", + "Your training data must be a **JSON Lines (.jsonl)** file where each line contains a conversation sample:\n", + "\n", + "```json\n", + "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I'm doing well, thank you! How can I help you today?\"}]}\n", + "{\"messages\": [{\"role\": \"user\", \"content\": \"What is machine learning?\"}, {\"role\": \"assistant\", \"content\": \"Machine learning is a subset of artificial intelligence...\"}]}\n", + "```\n", + "\n", + "### Message Structure\n", + "\n", + "Each conversation contains a `messages` array with message objects having:\n", + "- **`role`**: One of `\"system\"`, `\"user\"`, `\"assistant\"`, or `\"pretraining\"`\n", + "- **`content`**: The text content of the message\n", + "- **`reasoning_content`** (optional): Additional reasoning traces\n", + "\n", + "### Masking Control with `unmask_messages` Parameter\n", + "\n", + "Control which parts of the conversation are used for training loss:\n", + "\n", + "#### Standard Instruction Tuning (default)\n", + "```python\n", + "osft(..., unmask_messages=False) # Only assistant responses used for loss\n", + "```\n", + "- **Trains only on assistant responses** (standard instruction-following)\n", + "- System messages are always masked (ignored for loss)\n", + "- User messages are masked\n", + "- Assistant messages are unmasked (used for loss calculation)\n", + "\n", + "#### Pretraining Mode\n", + "```python\n", + "osft(..., unmask_messages=True) # All content except system messages used for loss\n", + "```\n", + "- **Trains on all content except system messages**\n", + "- System messages are always masked\n", + "- User and assistant messages are both unmasked\n", + "- 
Useful for pretraining-style data where the model should learn from all text\n", + "\n", + "### Pre-processed Dataset Option\n", + "\n", + "If you have pre-processed data with `input_ids` and `labels` fields:\n", + "\n", + "```json\n", + "{\"input_ids\": [1, 2, 3, ...], \"labels\": [1, 2, 3, ...]}\n", + "```\n", + "\n", + "Use with:\n", + "```python\n", + "osft(..., use_processed_dataset=True)\n", + "```\n", + "\n", + "### Data Path Configuration\n", + "\n", + "When configuring your training, point to your JSONL file:\n", + "\n", + "```python\n", + "data_path = \"/path/to/your/training_data.jsonl\" # Your messages-format JSONL file\n", + "```\n", + "\n", + "The training pipeline will automatically:\n", + "1. Load and validate your JSONL data\n", + "2. Apply chat templates based on your model\n", + "3. Handle masking according to the `unmask_messages` setting\n", + "4. Process the data for efficient training\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model Configuration Examples\n", + "\n", + "Here are configuration examples for popular models. 
These serve as starting points - adjust based on your specific hardware and continual learning requirements.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# MODEL CONFIGURATION EXAMPLES FOR OSFT\n", + "# These are example configurations - adjust based on your hardware and requirements\n", + "# =============================================================================\n", + "\n", + "# Example 1: Qwen 3 0.6B\n", + "qwen_example = {\n", + "    \"model_name\": \"Qwen 3 0.6B\",\n", + "    \"model_path\": \"Qwen/Qwen3-0.6B\", # HuggingFace model name or local path\n", + "    \"example_unfreeze_rank_ratio\": 0.25, # Conservative for preserving multilingual capabilities\n", + "    \"example_max_tokens_per_gpu\": 2048,\n", + "    \"example_max_seq_len\": 2048, # Conservative context length for a 0.6B model\n", + "    \"example_batch_size\": 1,\n", + "    \"example_learning_rate\": 5e-6,\n", + "    \"notes\": \"Excellent for domain adaptation while preserving multilingual capabilities\"\n", + "}\n", + "\n", + "# Example 2: Llama 3.1 8B Instruct\n", + "llama_example = {\n", + "    \"model_name\": \"Llama 3.1 8B Instruct\",\n", + "    \"model_path\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", # HuggingFace model name or local path\n", + "    \"example_unfreeze_rank_ratio\": 0.3, # Slightly higher for more adaptation freedom\n", + "    \"example_max_tokens_per_gpu\": 10000,\n", + "    \"example_max_seq_len\": 8192, # Supports up to 128K but 8K is common\n", + "    \"example_batch_size\": 128,\n", + "    \"example_learning_rate\": 5e-6,\n", + "    \"notes\": \"Ideal for adding specialized knowledge without losing general capabilities\"\n", + "}\n", + "\n", + "# Example 3: Phi 4 Mini\n", + "phi_example = {\n", + "    \"model_name\": \"Phi 4 Mini\",\n", + "    \"model_path\": \"microsoft/Phi-4-mini-instruct\", # HuggingFace model name or local path\n", + "    
\"example_unfreeze_rank_ratio\": 0.25, # Conservative for smaller model\n", + " \"example_max_tokens_per_gpu\": 8192,\n", + " \"example_max_seq_len\": 4096,\n", + " \"example_batch_size\": 64,\n", + " \"example_learning_rate\": 5e-6,\n", + " \"notes\": \"Efficient for edge deployment with continual adaptation\"\n", + "}\n", + "\n", + "# Example 4: Generic 7B Base Model\n", + "generic_7b_example = {\n", + " \"model_name\": \"Generic 7B Base\",\n", + " \"model_path\": \"/path/to/your-7b-model\", # Local path to model directory\n", + " \"example_unfreeze_rank_ratio\": 0.3, # Balanced preservation vs adaptation\n", + " \"example_max_tokens_per_gpu\": 10000,\n", + " \"example_max_seq_len\": 4096,\n", + " \"example_batch_size\": 128,\n", + " \"example_learning_rate\": 5e-6,\n", + " \"notes\": \"Good baseline for most 7B instruction-tuned models\"\n", + "}\n", + "\n", + "# Example 5: Smaller Model (1B-3B)\n", + "small_model_example = {\n", + " \"model_name\": \"Small Model (1B-3B)\",\n", + " \"model_path\": \"/path/to/small-model\", # Local path or HuggingFace name\n", + " \"example_unfreeze_rank_ratio\": 0.4, # Higher ratio for smaller models\n", + " \"example_max_tokens_per_gpu\": 16_000,\n", + " \"example_max_seq_len\": 4096,\n", + " \"example_batch_size\": 128,\n", + " \"example_learning_rate\": 3e-5,\n", + " \"notes\": \"Smaller models can handle more aggressive adaptation\"\n", + "}\n", + "\n", + "# =============================================================================\n", + "# SELECT YOUR CONFIGURATION\n", + "# =============================================================================\n", + "\n", + "# Choose one of the examples above as a starting point\n", + "selected_example = qwen_example # Change this to your preferred example\n", + "\n", + "print(f\"Selected Example: {selected_example['model_name']}\")\n", + "print(f\"Model Path: {selected_example['model_path']}\")\n", + "print(f\"OSFT Unfreeze Rank Ratio: 
{selected_example['example_unfreeze_rank_ratio']}\")\n", + "print(f\"Example Max Tokens per GPU: {selected_example['example_max_tokens_per_gpu']:,}\")\n", + "print(f\"Example Max Sequence Length: {selected_example['example_max_seq_len']:,}\")\n", + "print(f\"Example Batch Size: {selected_example['example_batch_size']:,}\")\n", + "print(f\"Example Learning Rate: {selected_example['example_learning_rate']}\")\n", + "print(f\"Notes: {selected_example['notes']}\")\n", + "print(\"\\nRemember: OSFT preserves original capabilities without needing replay buffers!\")\n", + "print(\"  Adjust unfreeze_rank_ratio based on preservation vs adaptation needs.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Complete Parameter Reference\n", + "\n", + "Let's configure all available OSFT parameters with detailed explanations.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# COMPLETE OSFT PARAMETER CONFIGURATION\n", + "# =============================================================================\n", + "\n", + "# Experiment identification\n", + "experiment_name = \"osft_comprehensive_example\"\n", + "timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", + "full_experiment_name = f\"{experiment_name}_{timestamp}\"\n", + "\n", + "# =============================================================================\n", + "# REQUIRED PARAMETERS\n", + "# =============================================================================\n", + "\n", + "model_path = selected_example[\"model_path\"] # HuggingFace model name or local path\n", + "data_path = \"./test_osft_data.jsonl\" # Path to training data in JSONL format\n", + "ckpt_output_dir = f\"checkpoints/{full_experiment_name}\" # Where to save checkpoints\n", + "unfreeze_rank_ratio = 
selected_example[\"example_unfreeze_rank_ratio\"] # OSFT-specific parameter\n", + "effective_batch_size = selected_example[\"example_batch_size\"] # Effective batch size for training\n", + "max_tokens_per_gpu = selected_example[\"example_max_tokens_per_gpu\"] # Maximum tokens per GPU (memory limit)\n", + "max_seq_len = selected_example[\"example_max_seq_len\"] # Maximum sequence length\n", + "learning_rate = selected_example[\"example_learning_rate\"] # Learning rate for training\n", + "\n", + "print(\"๐ Required Parameters (all must be specified):\")\n", + "print(f\" โข model_path: {model_path}\")\n", + "print(f\" โข data_path: {data_path}\")\n", + "print(f\" โข ckpt_output_dir: {ckpt_output_dir}\")\n", + "print(f\" โข unfreeze_rank_ratio: {unfreeze_rank_ratio}\")\n", + "print(f\" โข effective_batch_size: {effective_batch_size}\")\n", + "print(f\" โข max_tokens_per_gpu: {max_tokens_per_gpu:,}\")\n", + "print(f\" โข max_seq_len: {max_seq_len:,}\")\n", + "print(f\" โข learning_rate: {learning_rate}\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# OSFT-SPECIFIC PARAMETERS\n", + "# =============================================================================\n", + "\n", + "target_patterns = None # Optional: Patterns to match specific modules for OSFT\n", + "# Example: [\"*attention*\", \"*mlp*\"] to target attention and MLP layers\n", + "\n", + "print(\"๐ง OSFT-Specific Parameters:\")\n", + "print(f\" unfreeze_rank_ratio: {unfreeze_rank_ratio} - Controls how much of each matrix is unfrozen\")\n", + "print(f\" โข 0.1-0.3: Conservative, maximum preservation\")\n", + "print(f\" โข 0.3-0.5: Balanced adaptation\")\n", + "print(f\" โข >0.5: Rarely needed for typical use cases\")\n", + "print(f\" target_patterns: {target_patterns} - Optional patterns for selecting specific modules\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# 
TRAINING HYPERPARAMETERS\n", + "# =============================================================================\n", + "\n", + "num_epochs = 1 # Number of training epochs\n", + "seed = 42 # Random seed for reproducibility\n", + "lr_scheduler = \"cosine\" # Learning rate scheduler\n", + "lr_scheduler_kwargs = {} # Scheduler parameters\n", + "warmup_steps = 0 # Number of warmup steps\n", + "\n", + "print(\"Training Hyperparameters:\")\n", + "print(f\"  effective_batch_size: {effective_batch_size} - Effective batch size for training\")\n", + "print(f\"  learning_rate: {learning_rate} - Learning rate for model updates\")\n", + "print(f\"  num_epochs: {num_epochs} - Number of training epochs\")\n", + "print(f\"  lr_scheduler: '{lr_scheduler}' - Learning rate scheduler type\")\n", + "print(f\"  lr_scheduler_kwargs: {lr_scheduler_kwargs} - Scheduler parameters\")\n", + "print(f\"  warmup_steps: {warmup_steps} - Number of warmup steps\")\n", + "print(f\"  seed: {seed} - Random seed for reproducibility\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# MEMORY AND PERFORMANCE PARAMETERS\n", + "# =============================================================================\n", + "\n", + "use_liger = True # Use Liger kernels for efficiency\n", + "\n", + "print(\"Memory and Performance Parameters:\")\n", + "print(f\"  max_tokens_per_gpu: {max_tokens_per_gpu:,} - Maximum tokens per GPU (hard cap for memory)\")\n", + "print(f\"  max_seq_len: {max_seq_len:,} - Maximum sequence length\")\n", + "print(f\"  use_liger: {use_liger} - Use Liger kernels for efficiency\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# DATA PROCESSING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "data_output_dir = \"/dev/shm/osft_data\" # Directory for 
processed data (RAM disk for speed)\n", + "use_processed_dataset = False # Whether data is pre-processed\n", + "unmask_messages = False # Whether to unmask all messages for pretraining-style learning\n", + "\n", + "print(\"Data Processing Parameters:\")\n", + "print(f\"  data_path: '{data_path}' - Path to training data (JSONL format)\")\n", + "print(f\"  data_output_dir: '{data_output_dir}' - Directory to save processed data\")\n", + "print(f\"  use_processed_dataset: {use_processed_dataset} - Whether to use pre-processed data\")\n", + "print(f\"  unmask_messages: {unmask_messages} - Whether to unmask all messages\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# CHECKPOINTING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "checkpoint_at_epoch = True # Whether to checkpoint at each epoch\n", + "save_final_checkpoint = True # Whether to save final checkpoint\n", + "\n", + "print(\"Checkpointing Parameters:\")\n", + "print(f\"  ckpt_output_dir: '{ckpt_output_dir}' - Directory to save checkpoints\")\n", + "print(f\"  checkpoint_at_epoch: {checkpoint_at_epoch} - Whether to checkpoint at each epoch\")\n", + "print(f\"  save_final_checkpoint: {save_final_checkpoint} - Whether to save final checkpoint\")\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Distributed Training Configuration\n", + "\n", + "Configure distributed training for both single-node and multi-node setups.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# DISTRIBUTED TRAINING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "# Configuration options for different setups\n", + "distributed_configs = {\n", + "    
\"single_gpu_dev\": {\n", + " \"nproc_per_node\": 1,\n", + " \"nnodes\": 1,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 1,\n", + " \"rdzv_endpoint\": \"127.0.0.1:29500\",\n", + " \"description\": \"Development setup with single GPU\"\n", + " },\n", + " \"single_node_8gpu\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 1,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 100,\n", + " \"rdzv_endpoint\": \"127.0.0.1:29500\",\n", + " \"description\": \"Single node with 8 GPUs\"\n", + " },\n", + " \"multi_node_master\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 2, # 2 nodes\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 42,\n", + " # master node IP\n", + " \"rdzv_endpoint\": \"10.241.128.23:1738\", # Replace with actual master IP\n", + " \"description\": \"Multi-node master (rank 0) - 4 nodes total\"\n", + " },\n", + " \"multi_node_worker\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 2, # 2 nodes\n", + " \"node_rank\": 1, # Change this for each worker node (1, 2, 3, ...)\n", + " \"rdzv_id\": 42,\n", + " \"rdzv_endpoint\": \"10.241.128.23:1738\", # Same as master\n", + " \"description\": \"Multi-node worker (rank 1) - change rank for each worker\"\n", + " }\n", + "}\n", + "\n", + "# Select your distributed configuration\n", + "selected_distributed = \"single_gpu_dev\" # Change this to match your setup\n", + "dist_config = distributed_configs[selected_distributed]\n", + "\n", + "# Extract distributed training parameters\n", + "nproc_per_node = dist_config[\"nproc_per_node\"] # Number of processes (GPUs) per node\n", + "nnodes = dist_config[\"nnodes\"] # Total number of nodes\n", + "node_rank = dist_config[\"node_rank\"] # Rank of this node (0 to nnodes-1)\n", + "rdzv_id = dist_config[\"rdzv_id\"] # Unique job ID for rendezvous\n", + "rdzv_endpoint = dist_config[\"rdzv_endpoint\"] # Master node endpoint for multi-node training\n", + "\n", + "# Calculate total resources\n", + "total_gpus = nproc_per_node * nnodes\n", + "per_gpu_batch_size = 
effective_batch_size // total_gpus\n", + "\n", + "print(\"Distributed Training Parameters:\")\n", + "print(f\"  Configuration: {dist_config['description']}\")\n", + "print(f\"  nproc_per_node: {nproc_per_node} - Number of processes (GPUs) per node\")\n", + "print(f\"  nnodes: {nnodes} - Total number of nodes\")\n", + "print(f\"  node_rank: {node_rank} - Rank of this node (0 to nnodes-1)\")\n", + "print(f\"  rdzv_id: {rdzv_id} - Unique job ID for rendezvous\")\n", + "print(f\"  rdzv_endpoint: '{rdzv_endpoint}' - Master node endpoint for multi-node training\")\n", + "print()\n", + "print(f\"Resource Calculation:\")\n", + "print(f\"  Total GPUs: {total_gpus} ({nproc_per_node} × {nnodes})\")\n", + "print(f\"  Effective batch size: {effective_batch_size}\")\n", + "print(f\"  Approximate per-GPU batch size: {per_gpu_batch_size}\")\n", + "print(f\"  (Actual micro-batch size determined automatically by gradient accumulation)\")\n", + "print()\n", + "\n", + "# Multi-node setup instructions\n", + "if nnodes > 1:\n", + "    print(\"Multi-Node Setup Instructions:\")\n", + "    print(f\"  1. Ensure all nodes can reach the master at {rdzv_endpoint}\")\n", + "    print(f\"  2. Use the same rdzv_id ({rdzv_id}) on all nodes\")\n", + "    print(f\"  3. Set node_rank to 0 for master, 1,2,3... for workers\")\n", + "    print(f\"  4. 
Start training on ALL nodes simultaneously\")\n", + "    print()\n", + "\n", + "# OSFT-specific multi-node considerations\n", + "print(\"OSFT Multi-Node Considerations:\")\n", + "print(\"  • OSFT works seamlessly across multiple nodes\")\n", + "print(\"  • No special replay buffer coordination needed (unlike SFT)\")\n", + "print(\"  • Each node processes its data portion with the same unfreeze_rank_ratio\")\n", + "print(\"  • Gradients are synchronized automatically across all nodes\")\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Execute Training\n", + "\n", + "Now let's run the actual OSFT training with all our configured parameters.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# TRAINING EXECUTION\n", + "# =============================================================================\n", + "\n", + "print(\"Starting OSFT Training\")\n", + "print(\"=\" * 60)\n", + "print(f\"Experiment: {full_experiment_name}\")\n", + "print(f\"Model: {selected_example['model_name']}\")\n", + "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node × {nnodes} nodes)\")\n", + "print(f\"Configuration: {dist_config['description']}\")\n", + "print(f\"Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n", + "print()\n", + "print(\"OSFT Advantages:\")\n", + "print(\"  • No catastrophic forgetting\")\n", + "print(\"  • No replay buffer needed\")\n", + "print(\"  • Preserves original model capabilities\")\n", + "print()\n", + "\n", + "# Prepare all training parameters\n", + "training_params = {\n", + "    # Required parameters\n", + "    'model_path': model_path,\n", + "    'data_path': data_path,\n", + "    'ckpt_output_dir': ckpt_output_dir,\n", + "    'unfreeze_rank_ratio': unfreeze_rank_ratio,\n", + "    'effective_batch_size': effective_batch_size,\n", + "    'max_tokens_per_gpu': 
max_tokens_per_gpu,\n", + "    'max_seq_len': max_seq_len,\n", + "    'learning_rate': learning_rate,\n", + "\n", + "    # Optional OSFT-specific parameters\n", + "    'target_patterns': target_patterns,\n", + "\n", + "    # Training duration\n", + "    'num_epochs': num_epochs,\n", + "\n", + "    # Data processing parameters\n", + "    'data_output_dir': data_output_dir,\n", + "    'use_processed_dataset': use_processed_dataset,\n", + "    'unmask_messages': unmask_messages,\n", + "    'warmup_steps': warmup_steps,\n", + "\n", + "    # Optimization parameters\n", + "    'use_liger': use_liger,\n", + "    'seed': seed,\n", + "    'lr_scheduler': lr_scheduler,\n", + "    'lr_scheduler_kwargs': lr_scheduler_kwargs,\n", + "\n", + "    # Checkpointing parameters\n", + "    'checkpoint_at_epoch': checkpoint_at_epoch,\n", + "    'save_final_checkpoint': save_final_checkpoint,\n", + "\n", + "    # Distributed training parameters\n", + "    'nproc_per_node': nproc_per_node,\n", + "    'nnodes': nnodes,\n", + "    'node_rank': node_rank,\n", + "    'rdzv_id': rdzv_id,\n", + "    'rdzv_endpoint': rdzv_endpoint,\n", + "}\n", + "\n", + "# Display final configuration summary\n", + "print(\"Final Training Configuration:\")\n", + "for key, value in training_params.items():\n", + "    if value is not None: # Only show non-None values\n", + "        print(f\"  {key}: {value}\")\n", + "\n", + "print(\"\\n\" + \"=\"*60)\n", + "print(\"Training starting...\")\n", + "print(\"=\"*60)\n", + "\n", + "# Execute training\n", + "start_time = time.time()\n", + "\n", + "try:\n", + "    result = osft(**training_params)\n", + "\n", + "    end_time = time.time()\n", + "    duration = end_time - start_time\n", + "\n", + "    print(\"\\n\" + \"=\"*60)\n", + "    print(\"OSFT Training completed successfully!\")\n", + "    print(f\"Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n", + "    print(f\"Checkpoints saved to: {ckpt_output_dir}\")\n", + "    print(\"=\"*60)\n", + "    print()\n", + "    print(\"What you've achieved with OSFT:\")\n", + "    print(\"  • Model 
adapted to new domain/task\")\n", + "    print(\"  • Original capabilities preserved\")\n", + "    print(\"  • No catastrophic forgetting occurred\")\n", + "    print(\"  • Far less regression testing needed before deployment\")\n", + "\n", + "except Exception as e:\n", + "    end_time = time.time()\n", + "    duration = end_time - start_time\n", + "\n", + "    print(\"\\n\" + \"=\"*60)\n", + "    print(f\"Training failed after {duration/60:.1f} minutes\")\n", + "    print(f\"Error: {e}\")\n", + "    print(\"=\"*60)\n", + "\n", + "    print(\"\\nQuick Troubleshooting Checklist:\")\n", + "    print(\"  - Check that model_path exists or is a valid HuggingFace model name\")\n", + "    print(\"  - Verify data_path points to a valid JSONL file\")\n", + "    print(\"  - Ensure ckpt_output_dir parent directory exists and is writable\")\n", + "    print(\"  - Try reducing max_tokens_per_gpu if you see OOM errors\")\n", + "    print(\"  - Try adjusting unfreeze_rank_ratio (lower = more preservation)\")\n", + "    print(\"  - For multi-node: verify network connectivity and endpoints\")\n", + "    print(\"  - Check that mini-trainer backend dependencies are installed\")\n", + "\n", + "    raise\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Post-Training Analysis\n", + "\n", + "After training completes, let's analyze the results and provide guidance for next steps.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# POST-TRAINING ANALYSIS AND NEXT STEPS\n", + "# =============================================================================\n", + "\n", + "print(\"Post-Training Analysis\")\n", + "print(\"=\" * 50)\n", + "\n", + "# Check for saved checkpoints\n", + "checkpoint_dir = ckpt_output_dir\n", + "\n", + "if os.path.exists(checkpoint_dir):\n", + "    checkpoints = [d for d in os.listdir(checkpoint_dir)\n", + "                   if 
os.path.isdir(os.path.join(checkpoint_dir, d))]\n", + "    \n", + "    if checkpoints:\n", + "        print(f\"✅ Found {len(checkpoints)} checkpoint(s):\")\n", + "        for ckpt in sorted(checkpoints):\n", + "            ckpt_path = os.path.join(checkpoint_dir, ckpt)\n", + "            print(f\" 📁 {ckpt}\")\n", + "        \n", + "        # Identify the final checkpoint\n", + "        final_checkpoint = sorted(checkpoints)[-1]\n", + "        final_checkpoint_path = os.path.join(checkpoint_dir, final_checkpoint)\n", + "        \n", + "        print(f\"\\n🎯 Final model checkpoint: {final_checkpoint_path}\")\n", + "        \n", + "        # Provide model loading example\n", + "        print(f\"\\n💻 Model Loading Example:\")\n", + "        print(f\"```python\")\n", + "        print(f\"from transformers import AutoModelForCausalLM, AutoTokenizer\")\n", + "        print(f\"\")\n", + "        print(f\"# Load your OSFT-adapted model\")\n", + "        print(f\"model = AutoModelForCausalLM.from_pretrained('{final_checkpoint_path}')\")\n", + "        print(f\"tokenizer = AutoTokenizer.from_pretrained('{final_checkpoint_path}')\")\n", + "        print(f\"\")\n", + "        print(f\"# Test the model - it should maintain original capabilities\")\n", + "        print(f\"# while excelling at your new domain/task\")\n", + "        print(f\"inputs = tokenizer('Your domain-specific prompt:', return_tensors='pt')\")\n", + "        print(f\"outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)\")\n", + "        print(f\"response = tokenizer.decode(outputs[0], skip_special_tokens=True)\")\n", + "        print(f\"print(response)\")\n", + "        print(f\"```\")\n", + "    else:\n", + "        print(f\"❌ No checkpoints found in {checkpoint_dir}\")\n", + "else:\n", + "    print(f\"❌ Checkpoint directory not found: {checkpoint_dir}\")\n", + "\n", + "# Training summary\n", + "print(f\"\\n📊 Training Summary:\")\n", + "print(f\" Model: {selected_example['model_name']}\")\n", + "print(f\" Algorithm: OSFT (Orthogonal Subspace Fine-Tuning)\")\n", + "print(f\" Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n", + "print(f\" Epochs: {num_epochs}\")\n", + "print(f\" Global Batch Size: 
{effective_batch_size}\")\n", + "print(f\" Learning Rate: {learning_rate}\")\n", + "print(f\" Max Tokens per GPU: {max_tokens_per_gpu:,}\")\n", + "print(f\" Max Sequence Length: {max_seq_len:,}\")\n", + "print(f\" Total GPUs: {total_gpus}\")\n", + "print(f\" Distributed Config: {dist_config['description']}\")\n", + "\n", + "# OSFT-specific validation recommendations\n", + "print(f\"\\n🧪 OSFT-Specific Validation Steps:\")\n", + "print(f\" 1. **Test Original Capabilities**: Verify the model still performs well on\")\n", + "print(f\" general tasks it was originally trained for\")\n", + "print(f\" 2. **Test New Domain**: Confirm improved performance on your target domain\")\n", + "print(f\" 3. **No Regression Testing Needed**: Unlike SFT, OSFT preserves capabilities\")\n", + "print(f\" by design, reducing validation overhead\")\n", + "print(f\" 4. **Compare with Base Model**: Run side-by-side comparisons to see\")\n", + "print(f\" improvements without degradation\")\n", + "\n", + "# Next steps recommendations\n", + "print(f\"\\n🚀 Recommended Next Steps:\")\n", + "print(f\" 1. 🎯 Test on domain-specific evaluation sets\")\n", + "print(f\" 2. 📊 Compare performance with base model on both general and domain tasks\")\n", + "print(f\" 3. 📈 If more adaptation is needed, slightly increase unfreeze_rank_ratio\")\n", + "print(f\" 4. 💡 If too much change occurred, reduce unfreeze_rank_ratio\")\n", + "print(f\" 5. 📝 Document the unfreeze_rank_ratio that works best for your use case\")\n", + "print(f\" 6. 
🚢 Deploy with confidence - no catastrophic forgetting!\")\n", + "\n", + "# Performance optimization tips\n", + "print(f\"\\n⚡ OSFT-Specific Optimization Tips:\")\n", + "print(f\" • Current unfreeze_rank_ratio ({unfreeze_rank_ratio}):\")\n", + "if unfreeze_rank_ratio < 0.2:\n", + "    print(f\" Very conservative - great preservation, slower adaptation\")\n", + "    print(f\" Consider increasing to 0.25-0.3 if you need more adaptation\")\n", + "elif unfreeze_rank_ratio < 0.35:\n", + "    print(f\" Balanced - good preservation with reasonable adaptation\")\n", + "    print(f\" This is ideal for most use cases\")\n", + "else:\n", + "    print(f\" Aggressive - faster adaptation, slightly less preservation\")\n", + "    print(f\" Consider reducing it if you see any capability degradation\")\n", + "\n", + "print(f\" • Memory usage is similar to SFT - adjust max_tokens_per_gpu as needed\")\n", + "print(f\" • For production: use the script version for better logging and resumption\")\n", + "\n", + "print(f\"\\n✨ OSFT Training Complete!\")\n", + "print(f\"Your model has been successfully adapted without forgetting!\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parameter Reference Summary\n", + "\n", + "Quick reference for all OSFT parameters and their purposes.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Required Parameters\n", + "\n", + "| Parameter | Description | Example Values |\n", + "|-----------|-------------|----------------|\n", + "| `model_path` | Path to the model to fine-tune | `\"Qwen/Qwen2.5-7B\"`, `\"/path/to/model\"` |\n", + "| `data_path` | Path to the training data | `\"/path/to/train.jsonl\"` |\n", + "| `ckpt_output_dir` | Directory to save checkpoints | `\"/path/to/checkpoints\"` |\n", + "| `unfreeze_rank_ratio` | **OSFT-specific**: Controls preservation vs adaptation | `0.25`, `0.3`, `0.4` |\n", + "| `effective_batch_size` | Effective batch size for training | `64`, `128`, `256` |\n", + "| 
`max_tokens_per_gpu` | Maximum tokens per GPU (memory limit) | `16384`, `25000`, `40000` |\n", + "| `max_seq_len` | Maximum sequence length | `2048`, `8192`, `32768` |\n", + "| `learning_rate` | Learning rate for training | `1e-5`, `2e-5`, `5e-6` |\n", + "\n", + "### OSFT-Specific Parameters\n", + "\n", + "| Parameter | Description | Recommended Values | Use Case |\n", + "|-----------|-------------|-------------------|----------|\n", + "| `unfreeze_rank_ratio` | Controls how much of each matrix is unfrozen | `0.1-0.3` | Conservative preservation |\n", + "| | | `0.3-0.5` | Balanced adaptation |\n", + "| | | `>0.5` | Rarely needed |\n", + "| `target_patterns` | Optional patterns to match specific modules | `None` | Default (all modules) |\n", + "\n", + "### Training Configuration Parameters\n", + "\n", + "| Parameter | Description | Default/Example |\n", + "|-----------|-------------|-----------------|\n", + "| `num_epochs` | Number of training epochs | `1` |\n", + "| `seed` | Random seed for reproducibility | `42` |\n", + "| `use_liger` | Enable Liger kernels for efficiency | `True` |\n", + "| `warmup_steps` | Number of warmup steps | `0` |\n", + "| `lr_scheduler` | Learning rate scheduler | `\"cosine\"` |\n", + "| `lr_scheduler_kwargs` | Additional scheduler parameters | `{\"eta_min\": 1e-6}` |\n", + "\n", + "### Data Processing Parameters\n", + "\n", + "| Parameter | Description | Default/Example |\n", + "|-----------|-------------|-----------------|\n", + "| `data_output_dir` | Directory to save processed data | Defaults to `f\"{ckpt_output_dir}/_internal_data_processing\"`; `\"/dev/shm\"` (shared memory) is recommended |\n", + "| `use_processed_dataset` | Use pre-processed data with input_ids/labels | `False` |\n", + "| `unmask_messages` | Unmask all messages for pretraining-style learning | `False` |\n", + "\n", + "### Checkpointing Parameters\n", + "\n", + "| Parameter | Description | Recommended |\n", + "|-----------|-------------|-------------|\n", + 
"| `checkpoint_at_epoch` | Whether to checkpoint at each epoch | `True` |\n", + "| `save_final_checkpoint` | Whether to save final checkpoint | `True` |\n", + "\n", + "### Distributed Training Parameters\n", + "\n", + "| Parameter | Description | Example Values |\n", + "|-----------|-------------|----------------|\n", + "| `nproc_per_node` | Number of processes (GPUs) per node | `1`, `4`, `8` |\n", + "| `nnodes` | Total number of nodes | `1`, `2`, `4` |\n", + "| `node_rank` | Rank of this node (0 to nnodes-1) | `0` (master), `1`, `2`... |\n", + "| `rdzv_id` | Unique job ID for rendezvous | `42`, `100` |\n", + "| `rdzv_endpoint` | Master node endpoint for multi-node training | `\"127.0.0.1:29500\"` |\n", + "\n", + "### Unfreeze Rank Ratio Guidelines\n", + "\n", + "| Use Case | Recommended Ratio | Rationale |\n", + "|----------|-------------------|-----------|\n", + "| **Minor format changes** | 0.1-0.15 | Maximum preservation, minimal changes |\n", + "| **Domain vocabulary addition** | 0.15-0.25 | Add specialized terms without losing general knowledge |\n", + "| **Domain specialization** | 0.25-0.35 | Balance between preservation and adaptation |\n", + "| **Major capability expansion** | 0.35-0.5 | More freedom for significant new capabilities |\n", + "| **Complete repurposing** | >0.5 | Rarely needed, approaching standard fine-tuning |\n", + "\n", + "### OSFT vs SFT Key Differences\n", + "\n", + "| Aspect | OSFT | SFT |\n", + "|--------|------|-----|\n", + "| **Catastrophic Forgetting** | Prevented by design | Requires replay buffers |\n", + "| **Data Requirements** | Only new domain data | Needs mixed/replay data |\n", + "| **Memory Usage** | Similar to SFT | Similar to OSFT |\n", + "| **Key Parameter** | `unfreeze_rank_ratio` | N/A |\n", + "| **Backend** | mini-trainer | instructlab-training |\n", + "| **Best For** | Continual learning, domain adaptation | Initial fine-tuning |\n", + "\n", + "### Popular Model Examples for OSFT\n", + "\n", + "| Model | 
HuggingFace Path | Recommended `unfreeze_rank_ratio` | `max_tokens_per_gpu` |\n", + "|-------|------------------|-----------------------------------|----------------------|\n", + "| Qwen 2.5 7B | `Qwen/Qwen2.5-7B-Instruct` | 0.25 | 10000 |\n", + "| Llama 3.1 8B | `meta-llama/Meta-Llama-3.1-8B-Instruct` | 0.3 | 10000 |\n", + "| Phi 4 Mini | `microsoft/Phi-4-mini-instruct` | 0.25 | 15000 |\n", + "\n", + "### Script Alternative\n", + "\n", + "For production workloads or long-running training, use the script version:\n", + "\n", + "```bash\n", + "# Qwen example\n", + "python scripts/osft_qwen_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints \\\n", + " --unfreeze-rank-ratio 0.25\n", + "\n", + "# Llama example\n", + "python scripts/osft_llama_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints \\\n", + " --unfreeze-rank-ratio 0.3\n", + "\n", + "# Phi example\n", + "python scripts/osft_phi_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints \\\n", + " --unfreeze-rank-ratio 0.25\n", + "```\n", + "\n", + "### When to Use OSFT vs SFT\n", + "\n", + "**Use OSFT when:**\n", + "- Adding domain-specific knowledge to an already-trained model\n", + "- Need to preserve original capabilities without regression\n", + "- Don't have access to original training data for replay\n", + "- Want to avoid catastrophic forgetting\n", + "- Performing continual learning across multiple domains\n", + "\n", + "**Use SFT when:**\n", + "- Training a model from scratch or base model\n", + "- Have comprehensive training data covering all desired capabilities \n", + "- Don't need to preserve specific prior behaviors\n", + "- Performing initial instruction tuning\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": 
"ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb b/docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb new file mode 100644 index 0000000..cfb19a3 --- /dev/null +++ b/docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb @@ -0,0 +1,1679 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Comprehensive SFT Training Tutorial\n", + "\n", + "This notebook provides a comprehensive guide to Supervised Fine-Tuning (SFT) using the training_hub library. We'll cover:\n", + "\n", + "- **All available parameters** and their detailed explanations\n", + "- **Single-node and multi-node training** configurations\n", + "- **Popular model examples** (Qwen 2.5 7B Instruct, Llama 3.1 8B Instruct, Phi 4 Mini, etc.)\n", + "- **Best practices and troubleshooting**\n", + "\n", + "This tutorial serves as both a learning resource and a template you can adapt for your specific fine-tuning needs.\n", + "\n", + "**Note:** For production workflows, we also provide focused example scripts for popular models: `scripts/sft_qwen_example.py`, `scripts/sft_llama_example.py`, and `scripts/sft_phi_example.py` with better logging consistency." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup and Imports\n", + "\n", + "Let's start by importing the necessary libraries and setting up our environment.\n", + "\n", + "Install `training-hub` if it's not installed yet.\n", + "```\n", + "export UV_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple\n", + "export UV_HTTP_TIMEOUT=300\n", + "pip install uv -i https://pypi.tuna.tsinghua.edu.cn/simple\n", + "uv pip install -q training-hub -i https://pypi.tuna.tsinghua.edu.cn/simple\n", + "```\n", + "\n", + "Reinstall PyTorch to match your CUDA version (e.g. for CUDA 12.4, install torch==2.6.0 built with cu124 support):\n", + "\n", + "```\n", + "uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# Import training_hub for SFT training\n", + "from training_hub import sft\n", + "\n", + "# Standard library imports\n", + "import os\n", + "import time\n", + "from datetime import datetime\n", + "from pathlib import Path" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "\n", + "data_path = \"./test_sft_data.jsonl\"\n", + "if not os.path.exists(data_path):\n", + "    print(f\"Creating dummy dataset at {data_path}\")\n", + "    dummy_data = [\n", + "        {\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I am doing well, thank you! 
How can I help you today?\"}]}\n", + " ] * 10\n", + " with open(data_path, \"w\") as f:\n", + " for d in dummy_data:\n", + " f.write(json.dumps(d) + \"\\n\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Format Requirements\n", + "\n", + "Before configuring your training, ensure your data is in the correct format. Training Hub uses the instructlab-training backend, which expects data in a specific **messages format**.\n", + "\n", + "### Required Format: JSONL with Messages\n", + "\n", + "Your training data must be a **JSON Lines (.jsonl)** file where each line contains a conversation sample:\n", + "\n", + "```json\n", + "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I'm doing well, thank you! How can I help you today?\"}]}\n", + "{\"messages\": [{\"role\": \"user\", \"content\": \"What is machine learning?\"}, {\"role\": \"assistant\", \"content\": \"Machine learning is a subset of artificial intelligence...\"}]}\n", + "```\n", + "\n", + "### Message Structure\n", + "\n", + "Each conversation contains a `messages` array with message objects having:\n", + "- **`role`**: One of `\"system\"`, `\"user\"`, `\"assistant\"`, or `\"pretraining\"`\n", + "- **`content`**: The text content of the message\n", + "- **`reasoning_content`** (optional): Additional reasoning traces\n", + "\n", + "### Masking Behavior with `unmask` Field\n", + "\n", + "You can control which parts of the conversation are used for training loss by adding an `unmask` metadata field:\n", + "\n", + "#### Standard Instruction Tuning (default)\n", + "```json\n", + "{\"messages\": [...]}\n", + "```\n", + "or\n", + "```json\n", + "{\"messages\": [...], \"unmask\": false}\n", + "```\n", + "- **Trains only on assistant responses** (standard instruction-following)\n", + "- System messages are always masked (ignored for loss)\n", + 
"- User messages are masked\n", + "- Assistant messages are unmasked (used for loss calculation)\n", + "\n", + "#### Pretraining Mode\n", + "```json\n", + "{\"messages\": [...], \"unmask\": true}\n", + "```\n", + "- **Trains on all content except system messages**\n", + "- System messages are always masked\n", + "- User and assistant messages are both unmasked\n", + "- Useful for pretraining-style data where the model should learn from all text\n", + "\n", + "### Example Data Formats\n", + "\n", + "**Standard SFT (instruction-following):**\n", + "```json\n", + "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a coding assistant.\"}, {\"role\": \"user\", \"content\": \"Write a Python function to calculate factorial\"}, {\"role\": \"assistant\", \"content\": \"Here's a Python function to calculate factorial:\\n\\n```python\\ndef factorial(n):\\n if n == 0 or n == 1:\\n return 1\\n return n * factorial(n - 1)\\n```\"}]}\n", + "```\n", + "\n", + "**Pretraining-style (learn from all content):**\n", + "```json\n", + "{\"messages\": [{\"role\": \"user\", \"content\": \"The capital of France is\"}, {\"role\": \"assistant\", \"content\": \"Paris.\"}], \"unmask\": true}\n", + "```\n", + "\n", + "### Data Path Configuration\n", + "\n", + "When configuring your training, point to your JSONL file:\n", + "\n", + "```python\n", + "data_path = \"/path/to/your/training_data.jsonl\" # Your messages-format JSONL file\n", + "```\n", + "\n", + "The training pipeline will automatically:\n", + "1. Load and validate your JSONL data\n", + "2. Apply chat templates based on your model\n", + "3. Handle masking according to the `unmask` setting\n", + "4. Process the data for efficient training" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model Configuration Examples\n", + "\n", + "Here are configuration examples for popular models. These serve as starting points - adjust based on your specific hardware and requirements." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Selected Example: Qwen 3 0.6B\n", + "Model Path: /opt/app-root/src/qwen3-0.6b\n", + "Example Max Tokens per GPU: 2,048\n", + "Example Max Sequence Length: 2,048\n", + "Example Batch Size: 1\n", + "Example Learning Rate: 1e-05\n", + "\n", + "\ud83d\udca1 Remember: These are example configurations. Adjust based on your hardware and requirements.\n" + ] + } + ], + "source": [ + "# =============================================================================\n", + "# MODEL CONFIGURATION EXAMPLES\n", + "# These are example configurations - adjust based on your hardware and requirements\n", + "# =============================================================================\n", + "\n", + "# Example 1: Qwen 3 0.6B\n", + "qwen_example = {\n", + "    \"model_name\": \"Qwen 3 0.6B\",\n", + "    \"model_path\": \"/opt/app-root/src/Qwen3-0.6B\",  # HuggingFace model name or local path\n", + "    \"example_max_tokens_per_gpu\": 2048,\n", + "    \"example_max_seq_len\": 2048,\n", + "    \"example_batch_size\": 1,\n", + "    \"example_learning_rate\": 1e-5,\n", + "}\n", + "\n", + "# Example 2: Llama 3.1 8B Instruct\n", + "llama_example = {\n", + "    \"model_name\": \"Llama 3.1 8B Instruct\",\n", + "    \"model_path\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",  # HuggingFace model name or local path\n", + "    \"example_max_tokens_per_gpu\": 18000,\n", + "    \"example_max_seq_len\": 16384,\n", + "    \"example_batch_size\": 128,\n", + "    \"example_learning_rate\": 1e-5,\n", + "}\n", + "\n", + "# Example 3: Phi 4 Mini\n", + "phi_example = {\n", + "    \"model_name\": \"Phi 4 Mini\",\n", + "    \"model_path\": \"microsoft/Phi-4-mini-instruct\",  # HuggingFace model name or local path\n", + "    \"example_max_tokens_per_gpu\": 25000,\n", + "    \"example_max_seq_len\": 8192,\n", + "    \"example_batch_size\": 64,\n", + "    \"example_learning_rate\": 5e-6,\n", + "}\n", + 
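"\n", + "# (Illustrative sanity check - not part of the training_hub API; it assumes\n", + "# only the example-config keys defined in this cell.) A single sample of\n", + "# max_seq_len tokens must fit in the per-GPU token budget, so\n", + "# example_max_tokens_per_gpu should be at least example_max_seq_len.\n", + "def check_example(cfg):\n", + "    assert cfg[\"example_max_seq_len\"] <= cfg[\"example_max_tokens_per_gpu\"]\n", + "    return cfg\n", + "\n", + "# usage: check_example(qwen_example)\n", + 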
"\n", + "# Example 4: Generic 7B Base Model\n", + "generic_7b_example = {\n", + " \"model_name\": \"Generic 7B Base\",\n", + " \"model_path\": \"/path/to/your-7b-model\", # Local path to model directory\n", + " \"example_max_tokens_per_gpu\": 25000,\n", + " \"example_max_seq_len\": 20000,\n", + " \"example_batch_size\": 256,\n", + " \"example_learning_rate\": 2e-5,\n", + "}\n", + "\n", + "# Example 5: Smaller Model (1B-3B)\n", + "small_model_example = {\n", + " \"model_name\": \"Small Model (1B-3B)\",\n", + " \"model_path\": \"/path/to/small-model\", # Local path or HuggingFace name\n", + " \"example_max_tokens_per_gpu\": 40000,\n", + " \"example_max_seq_len\": 32768,\n", + " \"example_batch_size\": 512,\n", + " \"example_learning_rate\": 3e-5,\n", + "}\n", + "\n", + "# =============================================================================\n", + "# SELECT YOUR CONFIGURATION\n", + "# =============================================================================\n", + "\n", + "# Choose one of the examples above as a starting point\n", + "selected_example = qwen_example # Change this to your preferred example\n", + "\n", + "print(f\"Selected Example: {selected_example['model_name']}\")\n", + "print(f\"Model Path: {selected_example['model_path']}\")\n", + "print(f\"Example Max Tokens per GPU: {selected_example['example_max_tokens_per_gpu']:,}\")\n", + "print(f\"Example Max Sequence Length: {selected_example['example_max_seq_len']:,}\")\n", + "print(f\"Example Batch Size: {selected_example['example_batch_size']:,}\")\n", + "print(f\"Example Learning Rate: {selected_example['example_learning_rate']}\")\n", + "print(\"\\n\ud83d\udca1 Remember: These are example configurations. Adjust based on your hardware and requirements.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Complete Parameter Reference\n", + "\n", + "Let's configure all available SFT parameters with detailed explanations." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\ud83d\udccb Required Parameters:\n", + " model_path: Path to the model to fine-tune (HuggingFace name or local path)\n", + " data_path: Path to the training data (JSONL format)\n", + " ckpt_output_dir: Directory to save checkpoints\n", + "\n", + "\ud83c\udfaf Core Training Parameters:\n", + " num_epochs: 3 - Number of training epochs\n", + " effective_batch_size: 1 - Effective batch size for training\n", + " learning_rate: 1e-05 - Learning rate for training\n", + " max_seq_len: 2,048 - Maximum sequence length\n", + " max_tokens_per_gpu: 2,048 - Maximum tokens per GPU in a mini-batch (hard-cap for memory to avoid OOMs). Used to automatically calculate mini-batch size and gradient accumulation to maintain the desired effective_batch_size while staying within memory limits.\n", + "\n", + "\ud83d\udcbe Data Processing Parameters:\n", + " data_output_dir: '/dev/shm' - Directory to save processed data\n", + " warmup_steps: 100 - Number of warmup steps\n", + "\n", + "\ud83d\udcbe Checkpointing Parameters:\n", + " save_samples: 0 - Number of samples to save after training (0 disables saving based on sample count)\n", + " checkpoint_at_epoch: True - Whether to checkpoint at each epoch\n", + " accelerate_full_state_at_epoch: True - Whether to save full state at epoch for automatic checkpoint resumption\n", + "\n" + ] + } + ], + "source": [ + "# =============================================================================\n", + "# COMPLETE SFT PARAMETER CONFIGURATION\n", + "# =============================================================================\n", + "\n", + "# Experiment identification\n", + "experiment_name = \"sft_comprehensive_example\"\n", + "timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", + "full_experiment_name = f\"{experiment_name}_{timestamp}\"\n", + "\n", + "# 
=============================================================================\n", + "# REQUIRED PARAMETERS\n", + "# =============================================================================\n", + "\n", + "model_path = selected_example[\"model_path\"] # HuggingFace model name or local path\n", + "data_path = \"./test_sft_data.jsonl\" # Path to training data in JSONL format\n", + "ckpt_output_dir = f\"checkpoints/{full_experiment_name}\" # Where to save checkpoints\n", + "\n", + "print(\"\ud83d\udccb Required Parameters:\")\n", + "print(f\" model_path: Path to the model to fine-tune (HuggingFace name or local path)\")\n", + "print(f\" data_path: Path to the training data (JSONL format)\")\n", + "print(f\" ckpt_output_dir: Directory to save checkpoints\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# CORE TRAINING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "num_epochs = 3 # Number of training epochs\n", + "effective_batch_size = selected_example[\"example_batch_size\"] # Effective batch size for training\n", + "learning_rate = selected_example[\"example_learning_rate\"] # Learning rate for training\n", + "max_seq_len = selected_example[\"example_max_seq_len\"] # Maximum sequence length\n", + "max_tokens_per_gpu = selected_example[\"example_max_tokens_per_gpu\"] # Maximum tokens per GPU in a mini-batch (hard-cap for memory to avoid OOMs)\n", + "\n", + "print(\"\ud83c\udfaf Core Training Parameters:\")\n", + "print(f\" num_epochs: {num_epochs} - Number of training epochs\")\n", + "print(f\" effective_batch_size: {effective_batch_size} - Effective batch size for training\")\n", + "print(f\" learning_rate: {learning_rate} - Learning rate for training\")\n", + "print(f\" max_seq_len: {max_seq_len:,} - Maximum sequence length\")\n", + "print(f\" max_tokens_per_gpu: {max_tokens_per_gpu:,} - Maximum tokens per GPU in a mini-batch 
(hard-cap for memory to avoid OOMs). Used to automatically calculate mini-batch size and gradient accumulation to maintain the desired effective_batch_size while staying within memory limits.\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# DATA AND PROCESSING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "data_output_dir = \"/dev/shm\" # Directory to save processed data\n", + "warmup_steps = 100 # Number of warmup steps\n", + "\n", + "print(\"\ud83d\udcbe Data Processing Parameters:\")\n", + "print(f\" data_output_dir: '{data_output_dir}' - Directory to save processed data\")\n", + "print(f\" warmup_steps: {warmup_steps} - Number of warmup steps\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# CHECKPOINTING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "save_samples = 0 # Number of samples to save after training (0 disables saving based on sample count)\n", + "checkpoint_at_epoch = True # Whether to checkpoint at each epoch\n", + "accelerate_full_state_at_epoch = True # Whether to save full state at epoch for automatic checkpoint resumption\n", + "\n", + "print(\"\ud83d\udcbe Checkpointing Parameters:\")\n", + "print(f\" save_samples: {save_samples} - Number of samples to save after training (0 disables saving based on sample count)\")\n", + "print(f\" checkpoint_at_epoch: {checkpoint_at_epoch} - Whether to checkpoint at each epoch\")\n", + "print(f\" accelerate_full_state_at_epoch: {accelerate_full_state_at_epoch} - Whether to save full state at epoch for automatic checkpoint resumption\")\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Distributed Training Configuration\n", + "\n", + "Configure distributed training for both single-node and 
multi-node setups." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\ud83d\udda5\ufe0f Distributed Training Parameters:\n", + " Configuration: Development setup with single GPU\n", + " nproc_per_node: 1 - Number of processes (GPUs) per node\n", + " nnodes: 1 - Total number of nodes\n", + " node_rank: 0 - Rank of this node (0 to nnodes-1)\n", + " rdzv_id: 1 - Unique job ID for rendezvous\n", + " rdzv_endpoint: '127.0.0.1:29500' - Master node endpoint for multi-node training\n", + "\n", + "\ud83d\udcca Resource Calculation:\n", + " Total GPUs: 1 (1 \u00d7 1)\n", + " Effective batch size: 1\n", + " Approximate per-GPU batch size: 1\n", + " (Actual micro-batch size determined automatically by gradient accumulation)\n", + "\n" + ] + } + ], + "source": [ + "# =============================================================================\n", + "# DISTRIBUTED TRAINING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "# Configuration options for different setups\n", + "distributed_configs = {\n", + " \"single_gpu_dev\": {\n", + " \"nproc_per_node\": 1,\n", + " \"nnodes\": 1,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 1,\n", + " \"rdzv_endpoint\": \"127.0.0.1:29500\",\n", + " \"description\": \"Development setup with single GPU\"\n", + " },\n", + " \"single_node_8gpu\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 1,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 100,\n", + " \"rdzv_endpoint\": \"127.0.0.1:29500\",\n", + " \"description\": \"Single node with 8 GPUs\"\n", + " },\n", + " \"multi_node_master\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 4,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 42,\n", + " \"rdzv_endpoint\": \"10.0.0.1:29500\", # Replace with actual master IP\n", + " \"description\": \"Multi-node master (rank 0) - 4 nodes total\"\n", + " },\n", + " 
\"multi_node_worker\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 4,\n", + " \"node_rank\": 1, # Change this for each worker node (1, 2, 3, ...)\n", + " \"rdzv_id\": 42,\n", + " \"rdzv_endpoint\": \"10.0.0.1:29500\", # Same as master\n", + " \"description\": \"Multi-node worker (rank 1) - change rank for each worker\"\n", + " }\n", + "}\n", + "\n", + "# Select your distributed configuration\n", + "selected_distributed = \"single_gpu_dev\" # Change this to match your setup\n", + "dist_config = distributed_configs[selected_distributed]\n", + "\n", + "# Extract distributed training parameters\n", + "nproc_per_node = dist_config[\"nproc_per_node\"] # Number of processes (GPUs) per node\n", + "nnodes = dist_config[\"nnodes\"] # Total number of nodes\n", + "node_rank = dist_config[\"node_rank\"] # Rank of this node (0 to nnodes-1)\n", + "rdzv_id = dist_config[\"rdzv_id\"] # Unique job ID for rendezvous\n", + "rdzv_endpoint = dist_config[\"rdzv_endpoint\"] # Master node endpoint for multi-node training\n", + "\n", + "# Calculate total resources\n", + "total_gpus = nproc_per_node * nnodes\n", + "per_gpu_batch_size = effective_batch_size // total_gpus\n", + "\n", + "print(\"\ud83d\udda5\ufe0f Distributed Training Parameters:\")\n", + "print(f\" Configuration: {dist_config['description']}\")\n", + "print(f\" nproc_per_node: {nproc_per_node} - Number of processes (GPUs) per node\")\n", + "print(f\" nnodes: {nnodes} - Total number of nodes\")\n", + "print(f\" node_rank: {node_rank} - Rank of this node (0 to nnodes-1)\")\n", + "print(f\" rdzv_id: {rdzv_id} - Unique job ID for rendezvous\")\n", + "print(f\" rdzv_endpoint: '{rdzv_endpoint}' - Master node endpoint for multi-node training\")\n", + "print()\n", + "print(f\"\ud83d\udcca Resource Calculation:\")\n", + "print(f\" Total GPUs: {total_gpus} ({nproc_per_node} \u00d7 {nnodes})\")\n", + "print(f\" Effective batch size: {effective_batch_size}\")\n", + "print(f\" Approximate per-GPU batch size: 
{per_gpu_batch_size}\")\n", + "print(f\" (Actual micro-batch size determined automatically by gradient accumulation)\")\n", + "print()\n", + "\n", + "# Multi-node setup instructions\n", + "if nnodes > 1:\n", + " print(\"\ud83d\udd27 Multi-Node Setup Instructions:\")\n", + " print(f\" 1. Ensure all nodes can reach the master at {rdzv_endpoint}\")\n", + " print(f\" 2. Use the same rdzv_id ({rdzv_id}) on all nodes\")\n", + " print(f\" 3. Set node_rank to 0 for master, 1,2,3... for workers\")\n", + " print(f\" 4. Start training on ALL nodes simultaneously\")\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Execute Training\n", + "\n", + "Now let's run the actual OSFT training with all our configured parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\ud83d\ude80 Starting SFT Training\n", + "============================================================\n", + "Experiment: sft_comprehensive_example_20260323_072149\n", + "Model: Qwen 3 0.6B\n", + "Total GPUs: 1 (1 per node \u00d7 1 nodes)\n", + "Configuration: Development setup with single GPU\n", + "\n", + "\ud83d\udccb Final Training Configuration:\n", + " model_path: /opt/app-root/src/qwen3-0.6b\n", + " data_path: ./test_sft_data.jsonl\n", + " ckpt_output_dir: checkpoints/sft_comprehensive_example_20260323_072149\n", + " num_epochs: 3\n", + " effective_batch_size: 1\n", + " learning_rate: 1e-05\n", + " max_seq_len: 2048\n", + " max_tokens_per_gpu: 2048\n", + " data_output_dir: /dev/shm\n", + " warmup_steps: 100\n", + " save_samples: 0\n", + " checkpoint_at_epoch: True\n", + " accelerate_full_state_at_epoch: True\n", + " nproc_per_node: 1\n", + " nnodes: 1\n", + " node_rank: 0\n", + " rdzv_id: 1\n", + " rdzv_endpoint: 127.0.0.1:29500\n", + " disable_flash_attn: True\n", + "\n", + "============================================================\n", + "\u23f3 
Training starting...\n", + "============================================================\n" + ] + }, + { + "data": { + "text/html": [ + "
[07:21:52] INFO Starting training setup... main_ds.py:591\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m[07:21:52]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Starting training setup\u001b[33m...\u001b[0m \u001b]8;id=399409;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=389230;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#591\u001b\\\u001b[2m591\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "data": { + "text/html": [ + "
WARNING num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3. arrow_dataset.py:3123\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=592580;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=152011;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "data": { + "text/html": [ + "
[07:21:55] WARNING num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3. arrow_dataset.py:3123\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m[07:21:55]\u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=209439;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=307155;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "9ca4f757279149fdb5e5bbeb03e9bfd7", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Converting samples into input_ids and labels... (num_proc=3): 100%|##########| 3/3 [00:00, ? examples/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
[07:22:07] INFO ten largest length percentiles: data_process.py:1355\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m[07:22:07]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m ten largest length percentiles: \u001b]8;id=670115;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=495753;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1355\u001b\\\u001b[2m1355\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 90th: 70 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 90th: \u001b[1;36m70\u001b[0m \u001b]8;id=657600;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=118760;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 91th: 70 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 91th: \u001b[1;36m70\u001b[0m \u001b]8;id=996028;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=273551;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 92th: 71 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 92th: \u001b[1;36m71\u001b[0m \u001b]8;id=658621;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=94733;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 93th: 71 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 93th: \u001b[1;36m71\u001b[0m \u001b]8;id=246082;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=931645;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 94th: 72 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 94th: \u001b[1;36m72\u001b[0m \u001b]8;id=452894;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=347117;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 95th: 73 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 95th: \u001b[1;36m73\u001b[0m \u001b]8;id=626760;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=619372;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 96th: 73 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 96th: \u001b[1;36m73\u001b[0m \u001b]8;id=265428;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=452333;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 97th: 74 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 97th: \u001b[1;36m74\u001b[0m \u001b]8;id=841787;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=465368;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 98th: 74 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 98th: \u001b[1;36m74\u001b[0m \u001b]8;id=252732;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=895643;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 99th: 75 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 99th: \u001b[1;36m75\u001b[0m \u001b]8;id=731576;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=8096;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 100th: 76 data_process.py:1358\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 100th: \u001b[1;36m76\u001b[0m \u001b]8;id=559594;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=56715;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO at 2048 max sequence length, the number of samples to be dropped is 0 data_process.py:1362\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m at \u001b[1;36m2048\u001b[0m max sequence length, the number of samples to be dropped is \u001b[1;36m0\u001b[0m \u001b]8;id=663979;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=152605;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1362\u001b\\\u001b[2m1362\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO (0.00 of total) data_process.py:1367\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m \u001b[1m(\u001b[0m\u001b[1;36m0.00\u001b[0m of total\u001b[1m)\u001b[0m \u001b]8;id=915262;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=203311;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1367\u001b\\\u001b[2m1367\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 0th: 28 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 0th: \u001b[1;36m28\u001b[0m \u001b]8;id=674658;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=54929;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 1th: 28 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 1th: \u001b[1;36m28\u001b[0m \u001b]8;id=856039;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=914854;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 2th: 28 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 2th: \u001b[1;36m28\u001b[0m \u001b]8;id=556445;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=385141;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 3th: 29 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 3th: \u001b[1;36m29\u001b[0m \u001b]8;id=738851;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=147455;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 4th: 29 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 4th: \u001b[1;36m29\u001b[0m \u001b]8;id=628289;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=34700;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 5th: 29 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 5th: \u001b[1;36m29\u001b[0m \u001b]8;id=243709;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=947817;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 6th: 30 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 6th: \u001b[1;36m30\u001b[0m \u001b]8;id=318583;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=443338;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 7th: 30 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 7th: \u001b[1;36m30\u001b[0m \u001b]8;id=143291;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=65864;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 8th: 30 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 8th: \u001b[1;36m30\u001b[0m \u001b]8;id=411537;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=957882;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 9th: 31 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 9th: \u001b[1;36m31\u001b[0m \u001b]8;id=831237;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=856583;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO quantile 10th: 31 data_process.py:1378\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 10th: \u001b[1;36m31\u001b[0m \u001b]8;id=612081;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=731268;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO at 20 min sequence length, the number of samples to be dropped is 0 data_process.py:1382\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m at \u001b[1;36m20\u001b[0m min sequence length, the number of samples to be dropped is \u001b[1;36m0\u001b[0m \u001b]8;id=102879;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=675812;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1382\u001b\\\u001b[2m1382\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "data": { + "text/html": [ + "
WARNING num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3. arrow_dataset.py:3123\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=881091;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=320766;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "037b02187d454cd880a6d947d3353fb0", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Filter (num_proc=3): 0%| | 0/3 [00:00, ? examples/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO Samples Previews... data_process.py:1392\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Samples Previews\u001b[33m...\u001b[0m \u001b]8;id=64443;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=282143;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1392\u001b\\\u001b[2m1392\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "data": { + "text/html": [ + "
WARNING num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3. arrow_dataset.py:3123\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=75781;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=607291;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "6a16e69d33d54d0da44ea627def0e9cd", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Filtering out pretraining samples (num_proc=3): 0%| | 0/3 [00:00, ? examples/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "data": { + "text/html": [ + "
WARNING num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3. arrow_dataset.py:3123\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=397717;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=196121;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "f72a54e41a794200833ea3262ce90f54", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Filtering out pretraining samples (num_proc=3): 0%| | 0/3 [00:00, ? examples/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[35mOriginal Input: <|im_start|>user\n", + "What is machine learning?<|im_end|>\n", + "<|im_start|>assistant\n", + "
[07:22:08] WARNING num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3. arrow_dataset.py:3123\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m[07:22:08]\u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=60262;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=179809;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "52504f1d6ee449c6b4e2b8a216b127fe", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Validating unmask tokens not in data (num_proc=3): 0%| | 0/3 [00:00, ? examples/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "95f860ee16624fa5a74029005869986b", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Creating json from Arrow format: 0%| | 0/1 [00:00, ?ba/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
[07:22:09] INFO Running training command as subprocess: torchrun --nproc-per-node=1 --nnodes=1 main_ds.py:794\n", + " --node-rank=0 --rdzv-id=1 --rdzv-endpoint=127.0.0.1:29500 \n", + " /opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py \n", + " --model_name_or_path=/opt/app-root/src/qwen3-0.6b \n", + " --data_path=/dev/shm/data.jsonl \n", + " --output_dir=checkpoints/sft_comprehensive_example_20260323_072149 \n", + " --num_epochs=3 --effective_batch_size=1 --learning_rate=1e-05 \n", + " --num_warmup_steps=100 --save_samples=0 --log_level=INFO --max_batch_len=2048 \n", + " --seed=42 --adamw_weight_decay=0.0 --adamw_beta1=0.9 --adamw_beta2=0.95 \n", + " --adamw_eps=1e-08 --checkpoint_at_epoch --accelerate_full_state_at_epoch \n", + " --disable_flash_attn --distributed_training_framework=fsdp \n", + " --fsdp_sharding_strategy=HYBRID_SHARD \n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m[07:22:09]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Running training command as subprocess: torchrun --nproc-per-\u001b[33mnode\u001b[0m=\u001b[1;36m1\u001b[0m --\u001b[33mnnodes\u001b[0m=\u001b[1;36m1\u001b[0m \u001b]8;id=725937;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=274171;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#794\u001b\\\u001b[2m794\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m --node-\u001b[33mrank\u001b[0m=\u001b[1;36m0\u001b[0m --rdzv-\u001b[33mid\u001b[0m=\u001b[1;36m1\u001b[0m --rdzv-\u001b[33mendpoint\u001b[0m=\u001b[1;92m127\u001b[0m\u001b[1;92m.0.0.1\u001b[0m:\u001b[1;36m29500\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[35m/opt/app-root/lib64/python3.12/site-packages/instructlab/training/\u001b[0m\u001b[95mmain_ds.py\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m 
--\u001b[33mmodel_name_or_path\u001b[0m=\u001b[35m/opt/app-root/src/\u001b[0m\u001b[95mqwen3-0.6b\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mdata_path\u001b[0m=\u001b[35m/dev/shm/\u001b[0m\u001b[95mdata.jsonl\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33moutput_dir\u001b[0m=\u001b[35mcheckpoints\u001b[0m/sft_comprehensive_example_20260323_072149 \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mnum_epochs\u001b[0m=\u001b[1;36m3\u001b[0m --\u001b[33meffective_batch_size\u001b[0m=\u001b[1;36m1\u001b[0m --\u001b[33mlearning_rate\u001b[0m=\u001b[1;36m1e\u001b[0m\u001b[1;36m-05\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mnum_warmup_steps\u001b[0m=\u001b[1;36m100\u001b[0m --\u001b[33msave_samples\u001b[0m=\u001b[1;36m0\u001b[0m --\u001b[33mlog_level\u001b[0m=\u001b[35mINFO\u001b[0m --\u001b[33mmax_batch_len\u001b[0m=\u001b[1;36m2048\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mseed\u001b[0m=\u001b[1;36m42\u001b[0m --\u001b[33madamw_weight_decay\u001b[0m=\u001b[1;36m0\u001b[0m\u001b[1;36m.0\u001b[0m --\u001b[33madamw_beta1\u001b[0m=\u001b[1;36m0\u001b[0m\u001b[1;36m.9\u001b[0m --\u001b[33madamw_beta2\u001b[0m=\u001b[1;36m0\u001b[0m\u001b[1;36m.95\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33madamw_eps\u001b[0m=\u001b[1;36m1e\u001b[0m\u001b[1;36m-08\u001b[0m --checkpoint_at_epoch --accelerate_full_state_at_epoch \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --disable_flash_attn --\u001b[33mdistributed_training_framework\u001b[0m=\u001b[35mfsdp\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mfsdp_sharding_strategy\u001b[0m=\u001b[35mHYBRID_SHARD\u001b[0m \u001b[2m \u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py:22: UserWarning: DeepSpeed 
CPU Optimizer is not available. Some features may be unavailable.\n", + " warnings.warn(\n", + "/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py:34: UserWarning: DeepSpeed is not available. Some features may be unavailable.\n", + " warnings.warn(\n", + "Loading weights: 100%|| 311/311 [00:04<00:00, 64.15it/s]]]]]]]] \n", + "The tied weights mapping and config for this model specifies to tie model.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning\n", + "Generating train split: 3 examples [00:00, 565.78 examples/s]\n", + "\u001b[2;36m[07:22:34]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m \u001b[33mnum_gpus\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mavg_sample_len\u001b[0m=\u001b[1;36m50\u001b[0m\u001b[1;36m.000\u001b[0m, \u001b]8;id=234053;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=146316;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#499\u001b\\\u001b[2m499\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m \u001b[33meffective_batch_size\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mmax_batch_len_per_gpu\u001b[0m=\u001b[1;36m2048\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mpacking_max_batch_len\u001b[0m=\u001b[1;36m2048\u001b[0m, \u001b[33mnum_batches\u001b[0m=\u001b[1;36m3\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mavg_samples_per_batch\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1;36m.000\u001b[0m, \u001b[33mtotal_samples\u001b[0m=\u001b[1;36m3\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m sharding_strategy is deprecated in favor 
\u001b]8;id=91161;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/dataclasses.py\u001b\\\u001b[2mdataclasses.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=619176;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/dataclasses.py#1962\u001b\\\u001b[2m1962\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m of reshard_after_forward. This will be \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m removed in a future version of \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m Accelerate. \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m Detected kernel version \u001b[1;36m3.10\u001b[0m.\u001b[1;36m0\u001b[0m, which is below \u001b]8;id=529903;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/other.py\u001b\\\u001b[2mother.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=631262;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/other.py#513\u001b\\\u001b[2m513\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m the recommended minimum of \u001b[1;36m5.5\u001b[0m.\u001b[1;36m0\u001b[0m; this can \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m cause the process to hang. It is recommended to \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m upgrade the kernel to the minimum version or \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m higher. \u001b[2m \u001b[0m\n", + "/opt/app-root/lib64/python3.12/site-packages/torch/distributed/fsdp/_init_utils.py:444: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.HYBRID_SHARD since the world size is 1.\n", + " warnings.warn(\n", + "/opt/app-root/lib64/python3.12/site-packages/accelerate/accelerator.py:1992: UserWarning: Upcasted low precision parameters in Qwen3ForCausalLM because mixed precision turned on in FSDP. 
Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight.\n", + " warnings.warn(\n", + "/opt/app-root/lib64/python3.12/site-packages/accelerate/accelerator.py:1992: UserWarning: Upcasted low precision parameters in Qwen3DecoderLayer because mixed precision turned on in FSDP. Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, self_attn.q_norm.weight, self_attn.k_norm.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight.\n", + " warnings.warn(\n", + "/opt/app-root/lib64/python3.12/site-packages/accelerate/accelerator.py:1998: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.\n", + " warnings.warn(\n", + "Epoch 0: 0%| | 0/3 [00:00, ?it/s][HAMI-core ERROR (pid:4849 thread=139739523683904 allocator.c:54)]: Device 0 OOM 8619558912 / 8589934592\n", + "[HAMI-core ERROR (pid:4849 thread=139739523683904 allocator.c:54)]: Device 0 OOM 8860731392 / 8589934592\n", + "\u001b[2;36m[07:22:48]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Epoch: \u001b[1;36m0\u001b[0m, Step: \u001b[1;36m1\u001b[0m, Rank: \u001b[1;36m0\u001b[0m, loss = \u001b[1;36m0.601831\u001b[0m, \u001b]8;id=735392;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=571412;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#231\u001b\\\u001b[2m231\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m grad_accum_steps = \u001b[1;36m1\u001b[0m \u001b[2m \u001b[0m\n", + "[HAMI-core ERROR (pid:4849 thread=139741838102784 allocator.c:54)]: Device 0 OOM 8825079808 / 8589934592\n", + "[HAMI-core ERROR (pid:4849 thread=139741838102784 allocator.c:54)]: Device 0 OOM 9447933952 / 8589934592\n", + "[HAMI-core ERROR (pid:4849 thread=139741838102784 allocator.c:54)]: Device 0 
OOM 9447933952 / 8589934592\n", + "[rank0]: Traceback (most recent call last):\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\", line 1111, in
[07:22:55] ERROR Training subprocess has not exited yet. Sending SIGTERM. Process code: 1 main_ds.py:824\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m[07:22:55]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31mERROR \u001b[0m Training subprocess has not exited yet. Sending SIGTERM. Process code: \u001b[1;36m1\u001b[0m \u001b]8;id=361092;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=987123;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#824\u001b\\\u001b[2m824\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
INFO Waiting for process to exit, 60s... main_ds.py:830\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Waiting for process to exit, 60s\u001b[33m...\u001b[0m \u001b]8;id=771553;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=303331;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#830\u001b\\\u001b[2m830\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "============================================================\n", + "\u274c Training failed after 1.1 minutes\n", + "Error: Suffered a failure during distributed training. Please see the training logs for more context.\n", + "============================================================\n", + "\n", + "\ud83d\udd0d Quick Troubleshooting Checklist:\n", + " \u25a1 Check that model_path exists or is a valid HuggingFace model name\n", + " \u25a1 Verify data_path points to valid JSONL file\n", + " \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\n", + " \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\n", + " \u25a1 For multi-node: verify network connectivity and endpoints\n", + " \u25a1 Check that all file paths are accessible from the training process\n" + ] + }, + { + "ename": "RuntimeError", + "evalue": "Suffered a failure during distributed training. 
Please see the training logs for more context.", + "output_type": "error", + "traceback": [ + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", + "\u001b[31mRuntimeError\u001b[39m Traceback (most recent call last)", + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[12]\u001b[39m\u001b[32m, line 59\u001b[39m\n\u001b[32m 56\u001b[39m start_time = time.time()\n\u001b[32m 58\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m---> \u001b[39m\u001b[32m59\u001b[39m result = \u001b[43msft\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mtraining_params\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 61\u001b[39m end_time = time.time()\n\u001b[32m 62\u001b[39m duration = end_time - start_time\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/training_hub/algorithms/sft.py:355\u001b[39m, in \u001b[36msft\u001b[39m\u001b[34m(model_path, data_path, ckpt_output_dir, backend, num_epochs, effective_batch_size, learning_rate, max_seq_len, max_tokens_per_gpu, data_output_dir, save_samples, warmup_steps, accelerate_full_state_at_epoch, checkpoint_at_epoch, is_pretraining, block_size, document_column_name, beta1, beta2, eps, weight_decay, nproc_per_node, nnodes, node_rank, rdzv_id, rdzv_endpoint, master_addr, master_port, wandb_project, wandb_entity, wandb_run_name, tensorboard_log_dir, mlflow_tracking_uri, mlflow_experiment_name, mlflow_run_name, **kwargs)\u001b[39m\n\u001b[32m 352\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01m.\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m create_algorithm\n\u001b[32m 354\u001b[39m algorithm = create_algorithm(\u001b[33m'\u001b[39m\u001b[33msft\u001b[39m\u001b[33m'\u001b[39m, backend)\n\u001b[32m--> \u001b[39m\u001b[32m355\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[43malgorithm\u001b[49m\u001b[43m.\u001b[49m\u001b[43mtrain\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 356\u001b[39m \u001b[43m \u001b[49m\u001b[43mmodel_path\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmodel_path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 357\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_path\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 358\u001b[39m \u001b[43m \u001b[49m\u001b[43mckpt_output_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mckpt_output_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 359\u001b[39m \u001b[43m \u001b[49m\u001b[43mnum_epochs\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnum_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 360\u001b[39m \u001b[43m \u001b[49m\u001b[43meffective_batch_size\u001b[49m\u001b[43m=\u001b[49m\u001b[43meffective_batch_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 361\u001b[39m \u001b[43m \u001b[49m\u001b[43mlearning_rate\u001b[49m\u001b[43m=\u001b[49m\u001b[43mlearning_rate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 362\u001b[39m \u001b[43m \u001b[49m\u001b[43mmax_seq_len\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmax_seq_len\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 363\u001b[39m \u001b[43m \u001b[49m\u001b[43mmax_tokens_per_gpu\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmax_tokens_per_gpu\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 364\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_output_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_output_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 365\u001b[39m \u001b[43m \u001b[49m\u001b[43msave_samples\u001b[49m\u001b[43m=\u001b[49m\u001b[43msave_samples\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 366\u001b[39m \u001b[43m \u001b[49m\u001b[43mwarmup_steps\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwarmup_steps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 367\u001b[39m \u001b[43m 
\u001b[49m\u001b[43maccelerate_full_state_at_epoch\u001b[49m\u001b[43m=\u001b[49m\u001b[43maccelerate_full_state_at_epoch\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 368\u001b[39m \u001b[43m \u001b[49m\u001b[43mcheckpoint_at_epoch\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcheckpoint_at_epoch\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 369\u001b[39m \u001b[43m \u001b[49m\u001b[43mis_pretraining\u001b[49m\u001b[43m=\u001b[49m\u001b[43mis_pretraining\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 370\u001b[39m \u001b[43m \u001b[49m\u001b[43mblock_size\u001b[49m\u001b[43m=\u001b[49m\u001b[43mblock_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 371\u001b[39m \u001b[43m \u001b[49m\u001b[43mdocument_column_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdocument_column_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 372\u001b[39m \u001b[43m \u001b[49m\u001b[43mbeta1\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbeta1\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 373\u001b[39m \u001b[43m \u001b[49m\u001b[43mbeta2\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbeta2\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 374\u001b[39m \u001b[43m \u001b[49m\u001b[43meps\u001b[49m\u001b[43m=\u001b[49m\u001b[43meps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 375\u001b[39m \u001b[43m \u001b[49m\u001b[43mweight_decay\u001b[49m\u001b[43m=\u001b[49m\u001b[43mweight_decay\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 376\u001b[39m \u001b[43m \u001b[49m\u001b[43mnproc_per_node\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnproc_per_node\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 377\u001b[39m \u001b[43m \u001b[49m\u001b[43mnnodes\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnnodes\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 378\u001b[39m \u001b[43m \u001b[49m\u001b[43mnode_rank\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnode_rank\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 379\u001b[39m \u001b[43m \u001b[49m\u001b[43mrdzv_id\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrdzv_id\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 380\u001b[39m \u001b[43m 
\u001b[49m\u001b[43mrdzv_endpoint\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrdzv_endpoint\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 381\u001b[39m \u001b[43m \u001b[49m\u001b[43mmaster_addr\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmaster_addr\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 382\u001b[39m \u001b[43m \u001b[49m\u001b[43mmaster_port\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmaster_port\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 383\u001b[39m \u001b[43m \u001b[49m\u001b[43mwandb_project\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwandb_project\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 384\u001b[39m \u001b[43m \u001b[49m\u001b[43mwandb_entity\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwandb_entity\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 385\u001b[39m \u001b[43m \u001b[49m\u001b[43mwandb_run_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwandb_run_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 386\u001b[39m \u001b[43m \u001b[49m\u001b[43mtensorboard_log_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtensorboard_log_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 387\u001b[39m \u001b[43m \u001b[49m\u001b[43mmlflow_tracking_uri\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmlflow_tracking_uri\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 388\u001b[39m \u001b[43m \u001b[49m\u001b[43mmlflow_experiment_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmlflow_experiment_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 389\u001b[39m \u001b[43m \u001b[49m\u001b[43mmlflow_run_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmlflow_run_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 390\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\n\u001b[32m 391\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/training_hub/algorithms/sft.py:213\u001b[39m, in \u001b[36mSFTAlgorithm.train\u001b[39m\u001b[34m(self, model_path, data_path, ckpt_output_dir, num_epochs, 
effective_batch_size, learning_rate, max_seq_len, max_tokens_per_gpu, data_output_dir, save_samples, warmup_steps, accelerate_full_state_at_epoch, checkpoint_at_epoch, is_pretraining, block_size, document_column_name, beta1, beta2, eps, weight_decay, nproc_per_node, nnodes, node_rank, rdzv_id, rdzv_endpoint, master_addr, master_port, wandb_project, wandb_entity, wandb_run_name, tensorboard_log_dir, mlflow_tracking_uri, mlflow_experiment_name, mlflow_run_name, **kwargs)\u001b[39m\n\u001b[32m 209\u001b[39m params[key] = value\n\u001b[32m 211\u001b[39m params.update(kwargs)\n\u001b[32m--> \u001b[39m\u001b[32m213\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mbackend\u001b[49m\u001b[43m.\u001b[49m\u001b[43mexecute_training\u001b[49m\u001b[43m(\u001b[49m\u001b[43mparams\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/training_hub/algorithms/sft.py:70\u001b[39m, in \u001b[36mInstructLabTrainingSFTBackend.execute_training\u001b[39m\u001b[34m(self, algorithm_params)\u001b[39m\n\u001b[32m 67\u001b[39m torchrun_args = TorchrunArgs(**final_torchrun_params)\n\u001b[32m 69\u001b[39m \u001b[38;5;66;03m# Execute training\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m70\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mrun_training\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 71\u001b[39m \u001b[43m \u001b[49m\u001b[43mtorch_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtorchrun_args\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 72\u001b[39m \u001b[43m \u001b[49m\u001b[43mtrain_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtraining_args\u001b[49m\n\u001b[32m 73\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/instructlab/training/__init__.py:41\u001b[39m, in \u001b[36mrun_training\u001b[39m\u001b[34m(torch_args, train_args)\u001b[39m\n\u001b[32m 
38\u001b[39m \u001b[38;5;66;03m# Local\u001b[39;00m\n\u001b[32m 39\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01m.\u001b[39;00m\u001b[34;01mmain_ds\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m run_training\n\u001b[32m---> \u001b[39m\u001b[32m41\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mrun_training\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtorch_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtorch_args\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtrain_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtrain_args\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py:841\u001b[39m, in \u001b[36mrun_training\u001b[39m\u001b[34m(torch_args, train_args)\u001b[39m\n\u001b[32m 839\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m interrupt\n\u001b[32m 840\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m failure:\n\u001b[32m--> \u001b[39m\u001b[32m841\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\n\u001b[32m 842\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mSuffered a failure during distributed training. Please see the training logs for more context.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 843\u001b[39m )\n", + "\u001b[31mRuntimeError\u001b[39m: Suffered a failure during distributed training. Please see the training logs for more context." 
+ ] + } + ], + "source": [ + "# =============================================================================\n", + "# TRAINING EXECUTION\n", + "# =============================================================================\n", + "\n", + "print(\"\ud83d\ude80 Starting SFT Training\")\n", + "print(\"=\" * 60)\n", + "print(f\"Experiment: {full_experiment_name}\")\n", + "print(f\"Model: {selected_example['model_name']}\")\n", + "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node \u00d7 {nnodes} nodes)\")\n", + "print(f\"Configuration: {dist_config['description']}\")\n", + "print()\n", + "\n", + "# Prepare all training parameters\n", + "training_params = {\n", + " # Required parameters\n", + " 'model_path': model_path,\n", + " 'data_path': data_path,\n", + " 'ckpt_output_dir': ckpt_output_dir,\n", + " \n", + " # Core training parameters\n", + " 'num_epochs': num_epochs,\n", + " 'effective_batch_size': effective_batch_size,\n", + " 'learning_rate': learning_rate,\n", + " 'max_seq_len': max_seq_len,\n", + " 'max_tokens_per_gpu': max_tokens_per_gpu,\n", + " \n", + " # Data and processing parameters\n", + " 'data_output_dir': data_output_dir,\n", + " 'warmup_steps': warmup_steps,\n", + " 'save_samples': save_samples,\n", + " \n", + " # Checkpointing parameters\n", + " 'checkpoint_at_epoch': checkpoint_at_epoch,\n", + " 'accelerate_full_state_at_epoch': accelerate_full_state_at_epoch,\n", + " \n", + " # Distributed training parameters\n", + " 'nproc_per_node': nproc_per_node,\n", + " 'nnodes': nnodes,\n", + " 'node_rank': node_rank,\n", + " 'rdzv_id': rdzv_id,\n", + " 'rdzv_endpoint': rdzv_endpoint,\n", + "\n", + " 'disable_flash_attn': True\n", + "}\n", + "\n", + "# Display final configuration summary\n", + "print(\"\ud83d\udccb Final Training Configuration:\")\n", + "for key, value in training_params.items():\n", + " print(f\" {key}: {value}\")\n", + "\n", + "print(\"\\n\" + \"=\"*60)\n", + "print(\"\u23f3 Training starting...\")\n", + "print(\"=\"*60)\n", + 
"\n", + "# Execute training\n", + "start_time = time.time()\n", + "\n", + "try:\n", + " result = sft(**training_params)\n", + " \n", + " end_time = time.time()\n", + " duration = end_time - start_time\n", + " \n", + " print(\"\\n\" + \"=\"*60)\n", + " print(\"\u2705 Training completed successfully!\")\n", + " print(f\"\u23f1\ufe0f Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n", + " print(f\"\ud83d\udcc1 Checkpoints saved to: {ckpt_output_dir}\")\n", + " print(\"=\"*60)\n", + " \n", + "except Exception as e:\n", + " end_time = time.time()\n", + " duration = end_time - start_time\n", + " \n", + " print(\"\\n\" + \"=\"*60)\n", + " print(f\"\u274c Training failed after {duration/60:.1f} minutes\")\n", + " print(f\"Error: {e}\")\n", + " print(\"=\"*60)\n", + " \n", + " print(\"\\n\ud83d\udd0d Quick Troubleshooting Checklist:\")\n", + " print(\" \u25a1 Check that model_path exists or is a valid HuggingFace model name\")\n", + " print(\" \u25a1 Verify data_path points to valid JSONL file\")\n", + " print(\" \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\")\n", + " print(\" \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\")\n", + " print(\" \u25a1 For multi-node: verify network connectivity and endpoints\")\n", + " print(\" \u25a1 Check that all file paths are accessible from the training process\")\n", + " \n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Post-Training Analysis\n", + "\n", + "After training completes, let's analyze the results and provide guidance for next steps." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# POST-TRAINING ANALYSIS AND NEXT STEPS\n", + "# =============================================================================\n", + "\n", + "print(\"\ud83d\udcca Post-Training Analysis\")\n", + "print(\"=\" * 50)\n", + "\n", + "# Check for saved checkpoints\n", + "checkpoint_dir = f\"{ckpt_output_dir}/hf_format\"\n", + "\n", + "if os.path.exists(checkpoint_dir):\n", + " checkpoints = [d for d in os.listdir(checkpoint_dir) \n", + " if os.path.isdir(os.path.join(checkpoint_dir, d))]\n", + " \n", + " if checkpoints:\n", + " print(f\"\u2705 Found {len(checkpoints)} checkpoint(s):\")\n", + " for ckpt in sorted(checkpoints):\n", + " ckpt_path = os.path.join(checkpoint_dir, ckpt)\n", + " print(f\" \ud83d\udcc1 {ckpt}\")\n", + " \n", + " # Identify the final checkpoint\n", + " final_checkpoint = sorted(checkpoints)[-1]\n", + " final_checkpoint_path = os.path.join(checkpoint_dir, final_checkpoint)\n", + " \n", + " print(f\"\\n\ud83c\udfaf Final model checkpoint: {final_checkpoint_path}\")\n", + " \n", + " # Provide model loading example\n", + " print(f\"\\n\ud83d\udcbb Model Loading Example:\")\n", + " print(f\"```python\")\n", + " print(f\"from transformers import AutoModelForCausalLM, AutoTokenizer\")\n", + " print(f\"\")\n", + " print(f\"# Load your fine-tuned model\")\n", + " print(f\"model = AutoModelForCausalLM.from_pretrained('{final_checkpoint_path}')\")\n", + " print(f\"tokenizer = AutoTokenizer.from_pretrained('{final_checkpoint_path}')\")\n", + " print(f\"\")\n", + " print(f\"# Generate text\")\n", + " print(f\"inputs = tokenizer('Your prompt here:', return_tensors='pt')\")\n", + " print(f\"outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)\")\n", + " print(f\"response = tokenizer.decode(outputs[0], skip_special_tokens=True)\")\n", + " 
print(f\"print(response)\")\n", + " print(f\"```\")\n", + " else:\n", + " print(f\"\u274c No checkpoints found in {checkpoint_dir}\")\n", + "else:\n", + " print(f\"\u274c Checkpoint directory not found: {checkpoint_dir}\")\n", + "\n", + "# Training summary\n", + "print(f\"\\n\ud83d\udcc8 Training Summary:\")\n", + "print(f\" Model: {selected_example['model_name']}\")\n", + "print(f\" Epochs: {num_epochs}\")\n", + "print(f\" Global Batch Size: {effective_batch_size}\")\n", + "print(f\" Learning Rate: {learning_rate}\")\n", + "print(f\" Max Tokens per GPU: {max_tokens_per_gpu:,}\")\n", + "print(f\" Max Sequence Length: {max_seq_len:,}\")\n", + "print(f\" Total GPUs: {total_gpus}\")\n", + "print(f\" Distributed Config: {dist_config['description']}\")\n", + "\n", + "# Next steps recommendations\n", + "print(f\"\\n\ud83d\ude80 Recommended Next Steps:\")\n", + "print(f\" 1. \ud83e\uddea Test your model with sample inputs to verify it's working\")\n", + "print(f\" 2. \ud83d\udcca Evaluate performance on your validation/test datasets\")\n", + "print(f\" 3. \ud83d\udd04 Compare outputs with the original base model\")\n", + "print(f\" 4. \ud83c\udfaf Fine-tune hyperparameters if needed (learning rate, batch size)\")\n", + "print(f\" 5. \ud83d\udcdd Document your configuration and results for reproducibility\")\n", + "print(f\" 6. 
\ud83d\udea2 Deploy for inference using your preferred serving framework\")\n", + "\n", + "# Performance optimization tips\n", + "print(f\"\\n\u26a1 Performance Optimization Tips:\")\n", + "print(f\" \u2022 If training was slow: increase max_tokens_per_gpu or effective_batch_size\")\n", + "print(f\" \u2022 If you hit OOM errors: reduce max_tokens_per_gpu or effective_batch_size\")\n", + "print(f\" \u2022 For better convergence: try different learning rates or warmup_steps\")\n", + "print(f\" \u2022 For production training: consider using the script version for better logging\")\n", + "\n", + "print(f\"\\n\u2728 SFT Training Complete!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parameter Reference Summary\n", + "\n", + "Quick reference for all SFT parameters and their purposes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Core Parameters\n", + "\n", + "| Parameter | Required | Description | Example Values |\n", + "|-----------|----------|-------------|----------------|\n", + "| `model_path` | \u2705 | Path to the model to fine-tune | `\"Qwen/Qwen2.5-7B\"`, `\"/path/to/model\"` |\n", + "| `data_path` | \u2705 | Path to the training data | `\"/path/to/train.jsonl\"` |\n", + "| `ckpt_output_dir` | \u2705 | Directory to save checkpoints | `\"/path/to/checkpoints\"` |\n", + "| `num_epochs` | \u274c | Number of training epochs | `1`, `3`, `5` |\n", + "| `effective_batch_size` | \u274c | Effective batch size for training | `64`, `128`, `256` |\n", + "| `learning_rate` | \u274c | Learning rate for training | `1e-5`, `2e-5`, `5e-6` |\n", + "| `max_seq_len` | \u274c | Maximum sequence length | `2048`, `8192`, `16384` |\n", + "| `max_tokens_per_gpu` | \u274c | Maximum tokens per GPU in a mini-batch (hard-cap for memory) | `15000`, `25000`, `40000` |\n", + "\n", + "### Data Processing Parameters\n", + "\n", + "| Parameter | Description | Default/Example |\n", + 
"|-----------|-------------|------------------|\n", + "| `data_output_dir` | Directory to save processed data | `\"/dev/shm\"` (RAM disk) |\n", + "| `warmup_steps` | Number of warmup steps | `100`, `500` |\n", + "\n", + "### Checkpointing Parameters\n", + "\n", + "| Parameter | Description | Recommended |\n", + "|-----------|-------------|-------------|\n", + "| `checkpoint_at_epoch` | Whether to checkpoint at each epoch | `True` |\n", + "| `accelerate_full_state_at_epoch` | Whether to save full state at epoch for automatic checkpoint resumption | `True` |\n", + "| `save_samples` | Number of samples to save after training (0 disables) | `1000`, `0` (disabled) |\n", + "\n", + "### Distributed Training Parameters\n", + "\n", + "| Parameter | Description | Example Values |\n", + "|-----------|-------------|----------------|\n", + "| `nproc_per_node` | Number of processes (GPUs) per node | `1`, `4`, `8` |\n", + "| `nnodes` | Total number of nodes | `1`, `2`, `4` |\n", + "| `node_rank` | Rank of this node (0 to nnodes-1) | `0` (master), `1`, `2`... |\n", + "| `rdzv_id` | Unique job ID for rendezvous | `42`, `100` |\n", + "| `rdzv_endpoint` | Master node endpoint for multi-node training | `\"127.0.0.1:29500\"` |\n", + "\n", + "### Memory Optimization Guidelines\n", + "\n", + "- **Start conservative**: Begin with lower `max_tokens_per_gpu` values and increase gradually\n", + "- **Monitor usage**: Watch GPU memory during training and adjust accordingly\n", + "- **Balance batch size**: Larger `effective_batch_size` can improve training stability\n", + "- **Use RAM disk**: Set `data_output_dir=\"/dev/shm\"` for faster data loading\n", + "\n", + "### Multi-Node Setup Checklist\n", + "\n", + "1. \u2705 Ensure network connectivity between all nodes\n", + "2. \u2705 Use the same `rdzv_id` and `rdzv_endpoint` on all nodes\n", + "3. \u2705 Set unique `node_rank` for each node (0, 1, 2, ...)\n", + "4. \u2705 Verify all nodes can access model and data paths\n", + "5. 
\u2705 Start training simultaneously on all nodes\n", + "\n", + "### Popular Model Examples\n", + "\n", + "| Model | HuggingFace Path | Example Config |\n", + "|-------|------------------|----------------|\n", + "| Qwen 2.5 7B | `Qwen/Qwen2.5-7B-Instruct` | `max_tokens_per_gpu=20000` |\n", + "| Llama 3.1 8B | `meta-llama/Meta-Llama-3.1-8B-Instruct` | `max_tokens_per_gpu=18000` |\n", + "| Phi 4 Mini | `microsoft/Phi-4-mini-instruct` | `max_tokens_per_gpu=25000` |\n", + "\n", + "### Script Alternative\n", + "\n", + "For production workloads or long-running training, use the script version:\n", + "\n", + "```bash\n", + "python scripts/sft_qwen_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints\n", + "\n", + "python scripts/sft_llama_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints\n", + "\n", + "python scripts/sft_phi_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints\n", + "```" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.12", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx new file mode 100644 index 0000000..ac63978 --- /dev/null +++ b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx @@ -0,0 +1,198 @@ +--- +weight: 25 +--- + +# Fine-tuning LLMs with Training Hub + +## Background + +`training_hub` is a Python library that provides a unified, high-level API for running Supervised Fine-Tuning (SFT) and Orthogonal Subspace 
Fine-Tuning (OSFT) on large language models. It abstracts away the complexity of distributed training configuration, memory management, and backend orchestration, letting you focus on experiment parameters. + +**Key benefits:** + +- **Unified API**: A single function call (`sft(...)` or `osft(...)`) handles single-GPU, multi-GPU, and multi-node training without changing your code. +- **Automatic memory management**: The `max_tokens_per_gpu` parameter caps GPU memory usage and automatically computes micro-batch size and gradient accumulation to maintain your target `effective_batch_size`. +- **OSFT for continual learning**: The `osft` function implements [Nayak et al. (2025), arXiv:2504.07097](https://arxiv.org/abs/2504.07097), which restricts weight updates to orthogonal subspaces, preventing catastrophic forgetting without replay buffers or supplementary datasets. +- **Production-ready**: Built-in checkpointing, experiment tracking, and Liger kernel support for throughput efficiency. + +### SFT vs OSFT + +| Aspect | SFT | OSFT | +|--------|-----|------| +| **Use case** | Initial instruction tuning, base model fine-tuning | Continual domain adaptation of already-tuned models | +| **Catastrophic forgetting** | Requires mixed/replay data to mitigate | Prevented algorithmically | +| **Key parameter** | Standard hyperparameters | `unfreeze_rank_ratio` (0.0–1.0) | +| **Backend** | instructlab-training | mini-trainer | + +## Requirements + +- **Alauda AI** and **Alauda AI Workbench** must be installed in your cluster. +- A Workbench (Notebook) instance with: + - Access to install Python packages from the internet (or a configured internal PyPI mirror). + - GPU resources attached (at least one NVIDIA GPU). + - Sufficient shared storage for model checkpoints. +- A HuggingFace model (local path or model name resolvable from the instance). +- Training data in **JSONL format** (see [Data Format](#data-format) below). 
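The GPU requirement above is easy to confirm from a terminal or notebook cell before installing anything heavier. This is a minimal, stdlib-only sketch that queries `nvidia-smi` when it is on the `PATH` (it assumes nothing about which ML framework is installed; the helper name `gpu_check` is illustrative, not part of training_hub):

```python
import shutil
import subprocess

def gpu_check():
    """Return (gpu_visible, detail) by querying nvidia-smi if it is on PATH."""
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return False, "nvidia-smi not found on PATH"
    try:
        out = subprocess.run([smi, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"nvidia-smi failed: {exc}"
    # `nvidia-smi -L` prints one "GPU N: <name> (UUID: ...)" line per device
    gpus = [line for line in out.stdout.splitlines() if line.startswith("GPU ")]
    return bool(gpus), "\n".join(gpus) if gpus else out.stderr.strip()

visible, detail = gpu_check()
print(f"GPU visible: {visible}\n{detail}")
```

If `visible` is `False` on a Workbench instance that should have a GPU attached, check the instance's resource configuration before proceeding.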
+ +## Data Format + +Training data must be a JSON Lines (`.jsonl`) file where each line is a conversation: + +```json +{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a subset of AI..."}]} +``` + +Supported `role` values: `system`, `user`, `assistant`, `pretraining`. + +**Masking behavior:** + +- **SFT (default)**: only assistant responses contribute to the training loss. Add `"unmask": true` to a sample to include all non-system content in the loss (pretraining style). +- **OSFT**: controlled via the `unmask_messages` parameter (`False` by default; set `True` for pretraining style). + +Pre-processed datasets with `input_ids` and `labels` fields are also supported via `use_processed_dataset=True`. + +## Download Notebooks and Run Examples + +Two comprehensive tutorial notebooks are provided. Download them to your Workbench instance and execute them cell by cell. + +| Notebook | Algorithm | Download | +|----------|-----------|----------| +| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [Download sft_comprehensive_tutorial.ipynb](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) | +| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [Download osft_comprehensive_tutorial.ipynb](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) | + +### Step 1: Install Dependencies + +Open a terminal in your Workbench instance and install `training-hub`: + +```bash +pip install training-hub +``` + +### Step 2: Upload or Prepare Data + +Place your `.jsonl` training file in a path accessible to the notebook, for example `/data/train.jsonl`. + +### Step 3: Open and Configure the Notebook + +Open the downloaded notebook in your Workbench instance. 
The key cells to configure are: + +**Select your model** (both notebooks): + +```python +# Change to your model's HuggingFace name or local path +model_path = "Qwen/Qwen2.5-7B-Instruct" +``` + +Bundled model presets cover Qwen 2.5 7B, Llama 3.1 8B, Phi 4 Mini, and generic 7B/small models. + +**Set required paths** (both notebooks): + +```python +data_path = "/path/to/your/training_data.jsonl" +ckpt_output_dir = "/path/to/checkpoints/my_experiment" +``` + +**OSFT only – set the orthogonality ratio:** + +```python +unfreeze_rank_ratio = 0.25  # 0.1–0.3 conservative, 0.3–0.5 balanced +``` + +**Select distributed configuration:** + +```python +selected_distributed = "single_node_8gpu"  # or "single_gpu_dev", "multi_node_master", etc. +``` + +### Step 4: Execute Training + +Run all cells in sequence. The final training cell calls either: + +```python +# SFT +from training_hub import sft +result = sft( +    model_path=model_path, +    data_path=data_path, +    ckpt_output_dir=ckpt_output_dir, +    effective_batch_size=128, +    max_tokens_per_gpu=20000, +    max_seq_len=16384, +    learning_rate=1e-5, +    num_epochs=3, +    nproc_per_node=8, +    # ... additional optional parameters +) + +# OSFT +from training_hub import osft +result = osft( +    model_path=model_path, +    data_path=data_path, +    ckpt_output_dir=ckpt_output_dir, +    unfreeze_rank_ratio=0.25, +    effective_batch_size=128, +    max_tokens_per_gpu=10000, +    max_seq_len=8192, +    learning_rate=5e-6, +    num_epochs=1, +    nproc_per_node=8, +    # ... additional optional parameters +) +``` + +Checkpoints are written to `ckpt_output_dir` at the end of each epoch (configurable via `checkpoint_at_epoch`). 
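To make the interaction between `effective_batch_size` and the per-GPU micro-batch concrete, here is back-of-the-envelope arithmetic. This is a sketch of the general relationship only, not `training_hub`'s exact internal algorithm; the library derives the micro-batch size from `max_tokens_per_gpu` automatically, and the micro-batch of 4 below is an assumed value for illustration.

```python
import math

def grad_accum_steps(effective_batch_size: int, micro_batch_size: int, world_size: int) -> int:
    """Gradient-accumulation steps needed so that each optimizer step
    consumes roughly effective_batch_size samples across all ranks."""
    samples_per_step = micro_batch_size * world_size
    return math.ceil(effective_batch_size / samples_per_step)

# With the SFT example above (effective_batch_size=128, nproc_per_node=8),
# if the token budget permits a micro-batch of 4 sequences per GPU:
print(grad_accum_steps(128, 4, 8))  # -> 4
```

Raising `max_tokens_per_gpu` lets the trainer fit larger micro-batches, which lowers the accumulation count while keeping the same effective batch size.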
+ +## Key Parameters + +### Common Parameters (SFT and OSFT) + +| Parameter | Required | Description | +|-----------|----------|-------------| +| `model_path` | Yes | HuggingFace model name or local path | +| `data_path` | Yes | Path to JSONL training data | +| `ckpt_output_dir` | Yes | Directory to save checkpoints | +| `effective_batch_size` | Yes | Global effective batch size | +| `max_tokens_per_gpu` | Yes | Per-GPU token budget; controls memory and auto-computes micro-batch size | +| `max_seq_len` | Yes | Maximum sequence length | +| `learning_rate` | Yes | Optimizer learning rate | +| `num_epochs` | No | Training epochs (default: `1`) | +| `lr_scheduler` | No | Scheduler type, e.g. `"cosine"` | +| `warmup_steps` | No | Linear warmup steps (default: `0`) | +| `use_liger` | No | Enable Liger kernels for efficiency (default: `True` for OSFT) | +| `seed` | No | Random seed (default: `42`) | +| `data_output_dir` | No | Processed data cache dir; use `"/dev/shm"` for RAM-disk speed | +| `use_processed_dataset` | No | Skip tokenization if data has `input_ids`/`labels` | +| `checkpoint_at_epoch` | No | Save checkpoint each epoch (default: `True`) | +| `save_final_checkpoint` | No | Save a final checkpoint after training (default: `True`) | +| `nproc_per_node` | No | GPUs per node | +| `nnodes` | No | Total nodes (default: `1`) | +| `node_rank` | No | This node's rank (default: `0`) | +| `rdzv_id` | No | Rendezvous job ID | +| `rdzv_endpoint` | No | Master node `host:port` for multi-node | + +### OSFT-specific Parameters + +| Parameter | Required | Description | +|-----------|----------|-------------| +| `unfreeze_rank_ratio` | Yes | Fraction of each weight matrix that can be updated (0.0–1.0). Lower = more preservation. 
| +| `unmask_messages` | No | If `True`, trains on all non-system content (pretraining style) | +| `target_patterns` | No | Substring patterns to restrict OSFT to specific layers (default: `None`, all layers) | + +### Multi-node Training + +For multi-node jobs, run the notebook (or equivalent script) on every node simultaneously with matching `rdzv_id` and `rdzv_endpoint`, varying only `node_rank` per node: + +```python +# Master node (node_rank=0) +nproc_per_node = 8 +nnodes = 2 +node_rank = 0 +rdzv_id = 42 +rdzv_endpoint = "10.0.0.1:29500" + +# Worker node (node_rank=1) +node_rank = 1 # all other params identical +``` + +All nodes must have network connectivity to the `rdzv_endpoint` before training begins.
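Before starting the job, each worker can confirm that the rendezvous endpoint is actually reachable with a short standard-library check. This is an illustrative helper, not part of `training_hub`; the endpoint value is the example one from the configuration above.

```python
import socket

def endpoint_reachable(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the rendezvous host:port succeeds."""
    host, port = endpoint.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

# Run this on every worker node before launching training.
if not endpoint_reachable("10.0.0.1:29500", timeout=1.0):
    print("cannot reach rendezvous endpoint; check firewall/network before training")
```

A `False` result usually points to a firewall rule, a wrong `rdzv_endpoint`, or the master-side process not yet listening on the port.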