From 5edfb2915afd0fbe274ee30af7ed8a469fbc4524 Mon Sep 17 00:00:00 2001 From: Wu Yi Date: Mon, 23 Mar 2026 21:20:07 +0800 Subject: [PATCH 1/3] add docs for training hub --- .../how_to/osft_comprehensive_tutorial.ipynb | 1007 ++++++++++ .../how_to/sft_comprehensive_tutorial.ipynb | 1679 +++++++++++++++++ .../how_to/training_hub_fine_tuning.mdx | 198 ++ 3 files changed, 2884 insertions(+) create mode 100644 docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb create mode 100644 docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb create mode 100644 docs/en/workbench/how_to/training_hub_fine_tuning.mdx diff --git a/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb new file mode 100644 index 0000000..ebd6edd --- /dev/null +++ b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb @@ -0,0 +1,1007 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Comprehensive OSFT Training Tutorial\n", + "\n", + "This notebook provides a comprehensive guide to Orthogonal Subspace Fine-Tuning (OSFT) using the training_hub library. We'll cover:\n", + "\n", + "- **All available parameters** and their detailed explanations\n", + "- **Single-node and multi-node training** configurations\n", + "- **Popular model examples** (Qwen 2.5 7B Instruct, Llama 3.1 8B Instruct, Phi 4 Mini, etc.)\n", + "- **Best practices and troubleshooting**\n", + "\n", + "OSFT (Orthogonal Subspace Fine-Tuning) is an algorithm based on [Nayak et al. 
(2025), arXiv:2504.07097](https://arxiv.org/abs/2504.07097) that enables continual training of pre-trained or instruction-tuned models **without** catastrophic forgetting and **without** needing replay buffers or supplementary datasets.\n", + "\n", + "This tutorial serves as both a learning resource and a template you can adapt for your specific continual learning needs.\n", + "\n", + "**Note:** For production workflows, we also provide focused example scripts for popular models: `scripts/osft_qwen_example.py`, `scripts/osft_llama_example.py`, and `scripts/osft_phi_example.py` with better logging consistency.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What is OSFT?\n", + "\n", + "OSFT (Orthogonal Subspace Fine-Tuning) is a continual learning algorithm that allows you to adapt pre-trained or instruction-tuned models to new domains **without catastrophic forgetting**. Based on [Nayak et al. (2025), arXiv:2504.07097](https://arxiv.org/abs/2504.07097), OSFT fundamentally changes how we approach model adaptation.\n", + "\n", + "### Key Innovation\n", + "\n", + "Traditional fine-tuning updates all model parameters, which can overwrite previously learned knowledge. OSFT instead:\n", + "1. **Identifies orthogonal subspaces** in the model's weight matrices\n", + "2. **Restricts updates to these subspaces**, preserving existing knowledge\n", + "3. 
**Eliminates the need for replay buffers** or supplementary datasets\n", + "\n", + "### OSFT vs Traditional Fine-Tuning\n", + "\n", + "| Aspect | Traditional SFT | OSFT |\n", + "|--------|----------------|------|\n", + "| **Catastrophic Forgetting** | Common problem | Prevented by design |\n", + "| **Data Requirements** | Needs replay/mixed data | Only new domain data |\n", + "| **Preservation Method** | Data mixing ratios | Algorithm (math guarantees) |\n", + "| **Memory Usage** | Similar | Similar |\n", + "| **Complexity** | Complex data pipelines | Simple, direct |\n", + "\n", + "### When to Use OSFT\n", + "\n", + "**Perfect for:**\n", + "- Adding domain-specific knowledge (medical, legal, technical)\n", + "- Adapting to new languages or dialects\n", + "- Customizing instruction formats\n", + "- Continual learning across multiple domains\n", + "- Any scenario where you need to preserve existing capabilities\n", + "\n", + "**Not needed for:**\n", + "- Training from scratch\n", + "- Base model pre-training\n", + "- When you want to completely replace model behavior\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Understanding the Key Parameter: `unfreeze_rank_ratio`\n", + "\n", + "The `unfreeze_rank_ratio` is the most important OSFT-specific parameter. 
It controls the balance between preservation and adaptation.\n", + "\n", + "### What Does It Do?\n", + "\n", + "- Controls **how much of each weight matrix** can be updated during training\n", + "- Range: `0.0` to `1.0`\n", + "- Lower values = more preservation, slower adaptation\n", + "- Higher values = more adaptation, slightly less preservation\n", + "\n", + "### Visual Intuition\n", + "\n", + "Think of a weight matrix as a building:\n", + "- `unfreeze_rank_ratio = 0.1`: You can only renovate 10% of the rooms\n", + "- `unfreeze_rank_ratio = 0.3`: You can renovate 30% of the rooms\n", + "- `unfreeze_rank_ratio = 1.0`: You can renovate the entire building (standard fine-tuning)\n", + "\n", + "The \"rooms\" you renovate are carefully chosen to be orthogonal to existing knowledge, preventing damage to what's already there.\n", + "\n", + "### Recommended Settings by Use Case\n", + "\n", + "| Use Case | Recommended Ratio | Why? |\n", + "|----------|-------------------|------|\n", + "| **Minor format adjustments** | 0.1-0.15 | Minimal changes needed |\n", + "| **Domain vocabulary addition** | 0.15-0.25 | Add terms without losing general knowledge |\n", + "| **Domain specialization** | 0.25-0.35 | Balance preservation and new expertise |\n", + "| **Major capability expansion** | 0.35-0.5 | Significant new learning required |\n", + "| **Complete repurposing** | >0.5 | Rarely needed, approaching standard fine-tuning |\n", + "\n", + "### Practical Guidelines\n", + "\n", + "```python\n", + "# Conservative: Maximum preservation\n", + "unfreeze_rank_ratio = 0.2 # Great for adding specialized knowledge\n", + "\n", + "# Balanced: Good for most use cases \n", + "unfreeze_rank_ratio = 0.3 # Ideal default for domain adaptation\n", + "\n", + "# Aggressive: When you need significant changes\n", + "unfreeze_rank_ratio = 0.4 # Use when preservation is less critical\n", + "```\n", + "\n", + "**Pro tip:** Start conservative (0.2-0.3) and increase only if needed. 
It's easier to train again with a higher ratio than to recover lost capabilities!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "\n", + "data_path = \"./test_osft_data.jsonl\"\n", + "if not os.path.exists(data_path):\n", + " print(f\"Creating dummy dataset at {data_path}\")\n", + " dummy_data = [\n", + " {\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I am doing well, thank you! How can I help you today?\"}]}\n", + " ] * 10\n", + " with open(data_path, \"w\") as f:\n", + " for d in dummy_data:\n", + " f.write(json.dumps(d) + \"\\n\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The `target_patterns` Parameter (Advanced Users Only)\n", + "\n", + "There's an optional `target_patterns` parameter that allows targeting specific model layers for OSFT:\n", + "\n", + "```python\n", + "target_patterns = None # Default: applies OSFT to all appropriate layers (RECOMMENDED)\n", + "```\n", + "\n", + "**\u26a0\ufe0f Important:** This is an expert-level parameter. Unless you have deep knowledge of model architecture and a specific reason to limit OSFT to certain layers, **leave it as `None`**.\n", + "\n", + "If you do need to use it, it performs simple substring matching on module names:\n", + "- `target_patterns = [\"attention\"]` \u2192 Targets modules with \"attention\" in the name\n", + "- `target_patterns = [\"mlp\"]` \u2192 Targets modules with \"mlp\" in the name\n", + "\n", + "**For 99% of users:** Just use the default (`None`) and let OSFT handle layer selection automatically. 
The algorithm knows what it's doing.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "Before running this notebook, install the required dependencies:\n", + "\n", + "```bash\n", + "# Install training-hub (brings in mini-trainer and other deps)\n", + "pip install training-hub\n", + "\n", + "# Install PyTorch 2.9+ with CUDA 12.9 support (required by mini-trainer 0.7+)\n", + "pip install torch==2.9.0+cu129 torchvision==0.24.0+cu129 --index-url https://download.pytorch.org/whl/cu129\n", + "\n", + "# Install a compatible NCCL build (needed if system NCCL mismatches your CUDA driver)\n", + "pip install nvidia-nccl-cu12 nvidia-nvshmem-cu12\n", + "```\n", + "\n", + "> **Note:** If `flash-attn` is not available on your platform, the notebook sets `TESTING=true` to fall back to PyTorch SDPA attention. If `liger-kernel` is not installed, set `use_liger = False` in the parameters cell." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup and Imports\n", + "\n", + "Let's start by importing the necessary libraries and setting up our environment.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import training_hub for OSFT training\n", + "from training_hub import osft\n", + "\n", + "# Standard library imports\n", + "import os\n", + "import time\n", + "from datetime import datetime\n", + "from pathlib import Path\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Format Requirements\n", + "\n", + "Before configuring your training, ensure your data is in the correct format. 
OSFT uses the mini-trainer backend, which supports both standard messages format and pre-processed datasets.\n", + "\n", + "### Required Format: JSONL with Messages\n", + "\n", + "Your training data must be a **JSON Lines (.jsonl)** file where each line contains a conversation sample:\n", + "\n", + "```json\n", + "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I'm doing well, thank you! How can I help you today?\"}]}\n", + "{\"messages\": [{\"role\": \"user\", \"content\": \"What is machine learning?\"}, {\"role\": \"assistant\", \"content\": \"Machine learning is a subset of artificial intelligence...\"}]}\n", + "```\n", + "\n", + "### Message Structure\n", + "\n", + "Each conversation contains a `messages` array with message objects having:\n", + "- **`role`**: One of `\"system\"`, `\"user\"`, `\"assistant\"`, or `\"pretraining\"`\n", + "- **`content`**: The text content of the message\n", + "- **`reasoning_content`** (optional): Additional reasoning traces\n", + "\n", + "### Masking Control with `unmask_messages` Parameter\n", + "\n", + "Control which parts of the conversation are used for training loss:\n", + "\n", + "#### Standard Instruction Tuning (default)\n", + "```python\n", + "osft(..., unmask_messages=False) # Only assistant responses used for loss\n", + "```\n", + "- **Trains only on assistant responses** (standard instruction-following)\n", + "- System messages are always masked (ignored for loss)\n", + "- User messages are masked\n", + "- Assistant messages are unmasked (used for loss calculation)\n", + "\n", + "#### Pretraining Mode\n", + "```python\n", + "osft(..., unmask_messages=True) # All content except system messages used for loss\n", + "```\n", + "- **Trains on all content except system messages**\n", + "- System messages are always masked\n", + "- User and assistant messages are both unmasked\n", + "- 
Useful for pretraining-style data where the model should learn from all text\n", + "\n", + "### Pre-processed Dataset Option\n", + "\n", + "If you have pre-processed data with `input_ids` and `labels` fields:\n", + "\n", + "```json\n", + "{\"input_ids\": [1, 2, 3, ...], \"labels\": [1, 2, 3, ...]}\n", + "```\n", + "\n", + "Use with:\n", + "```python\n", + "osft(..., use_processed_dataset=True)\n", + "```\n", + "\n", + "### Data Path Configuration\n", + "\n", + "When configuring your training, point to your JSONL file:\n", + "\n", + "```python\n", + "data_path = \"/path/to/your/training_data.jsonl\" # Your messages-format JSONL file\n", + "```\n", + "\n", + "The training pipeline will automatically:\n", + "1. Load and validate your JSONL data\n", + "2. Apply chat templates based on your model\n", + "3. Handle masking according to the `unmask_messages` setting\n", + "4. Process the data for efficient training\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model Configuration Examples\n", + "\n", + "Here are configuration examples for popular models. 
These serve as starting points - adjust based on your specific hardware and continual learning requirements.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# MODEL CONFIGURATION EXAMPLES FOR OSFT\n", + "# These are example configurations - adjust based on your hardware and requirements\n", + "# =============================================================================\n", + "\n", + "# Example 1: Qwen 2.5 7B Instruct\n", + "qwen_example = {\n", + " \"model_name\": \"Qwen 2.5 7B Instruct\",\n", + " \"model_path\": \"/opt/app-root/src/Qwen3-0.6B\", # HuggingFace model name or local path\n", + " \"example_unfreeze_rank_ratio\": 0.25, # Conservative for preserving multilingual capabilities\n", + " \"example_max_tokens_per_gpu\": 2048,\n", + " \"example_max_seq_len\": 2048, # Qwen 2.5 supports long context\n", + " \"example_batch_size\": 1,\n", + " \"example_learning_rate\": 5e-6, \n", + " \"notes\": \"Excellent for domain adaptation while preserving multilingual capabilities\"\n", + "}\n", + "\n", + "# Example 2: Llama 3.1 8B Instruct\n", + "llama_example = {\n", + " \"model_name\": \"Llama 3.1 8B Instruct\",\n", + " \"model_path\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", # HuggingFace model name or local path\n", + " \"example_unfreeze_rank_ratio\": 0.3, # Slightly higher for more adaptation freedom\n", + " \"example_max_tokens_per_gpu\": 10000,\n", + " \"example_max_seq_len\": 8192, # Supports up to 128K but 8K is common\n", + " \"example_batch_size\": 128,\n", + " \"example_learning_rate\": 5e-6,\n", + " \"notes\": \"Ideal for adding specialized knowledge without losing general capabilities\"\n", + "}\n", + "\n", + "# Example 3: Phi 4 Mini\n", + "phi_example = {\n", + " \"model_name\": \"Phi 4 Mini\",\n", + " \"model_path\": \"microsoft/Phi-4-mini-instruct\", # HuggingFace model name or local path\n", + " 
\"example_unfreeze_rank_ratio\": 0.25, # Conservative for smaller model\n", + " \"example_max_tokens_per_gpu\": 8192,\n", + " \"example_max_seq_len\": 4096,\n", + " \"example_batch_size\": 64,\n", + " \"example_learning_rate\": 5e-6,\n", + " \"notes\": \"Efficient for edge deployment with continual adaptation\"\n", + "}\n", + "\n", + "# Example 4: Generic 7B Base Model\n", + "generic_7b_example = {\n", + " \"model_name\": \"Generic 7B Base\",\n", + " \"model_path\": \"/path/to/your-7b-model\", # Local path to model directory\n", + " \"example_unfreeze_rank_ratio\": 0.3, # Balanced preservation vs adaptation\n", + " \"example_max_tokens_per_gpu\": 10000,\n", + " \"example_max_seq_len\": 4096,\n", + " \"example_batch_size\": 128,\n", + " \"example_learning_rate\": 5e-6,\n", + " \"notes\": \"Good baseline for most 7B instruction-tuned models\"\n", + "}\n", + "\n", + "# Example 5: Smaller Model (1B-3B)\n", + "small_model_example = {\n", + " \"model_name\": \"Small Model (1B-3B)\",\n", + " \"model_path\": \"/path/to/small-model\", # Local path or HuggingFace name\n", + " \"example_unfreeze_rank_ratio\": 0.4, # Higher ratio for smaller models\n", + " \"example_max_tokens_per_gpu\": 16_000,\n", + " \"example_max_seq_len\": 4096,\n", + " \"example_batch_size\": 128,\n", + " \"example_learning_rate\": 3e-5,\n", + " \"notes\": \"Smaller models can handle more aggressive adaptation\"\n", + "}\n", + "\n", + "# =============================================================================\n", + "# SELECT YOUR CONFIGURATION\n", + "# =============================================================================\n", + "\n", + "# Choose one of the examples above as a starting point\n", + "selected_example = qwen_example # Change this to your preferred example\n", + "\n", + "print(f\"Selected Example: {selected_example['model_name']}\")\n", + "print(f\"Model Path: {selected_example['model_path']}\")\n", + "print(f\"OSFT Unfreeze Rank Ratio: 
{selected_example['example_unfreeze_rank_ratio']}\")\n", + "print(f\"Example Max Tokens per GPU: {selected_example['example_max_tokens_per_gpu']:,}\")\n", + "print(f\"Example Max Sequence Length: {selected_example['example_max_seq_len']:,}\")\n", + "print(f\"Example Batch Size: {selected_example['example_batch_size']:,}\")\n", + "print(f\"Example Learning Rate: {selected_example['example_learning_rate']}\")\n", + "print(f\"Notes: {selected_example['notes']}\")\n", + "print(\"\\n\ud83d\udca1 Remember: OSFT preserves original capabilities without needing replay buffers!\")\n", + "print(\" Adjust unfreeze_rank_ratio based on preservation vs adaptation needs.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Complete Parameter Reference\n", + "\n", + "Let's configure all available OSFT parameters with detailed explanations.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# COMPLETE OSFT PARAMETER CONFIGURATION\n", + "# =============================================================================\n", + "\n", + "# Experiment identification\n", + "experiment_name = \"osft_comprehensive_example\"\n", + "timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", + "full_experiment_name = f\"{experiment_name}_{timestamp}\"\n", + "\n", + "# =============================================================================\n", + "# REQUIRED PARAMETERS\n", + "# =============================================================================\n", + "\n", + "# TODO: revert these overrides after we've concluded training\n", + "model_path = selected_example[\"model_path\"] # HuggingFace model name or local path\n", + "data_path = \"./test_osft_data.jsonl\" # Path to training data in JSONL format\n", + "ckpt_output_dir = f\"checkpoints/{full_experiment_name}\" # Where to save checkpoints\n", + 
"unfreeze_rank_ratio = selected_example[\"example_unfreeze_rank_ratio\"] # OSFT-specific parameter\n", + "effective_batch_size = selected_example[\"example_batch_size\"] # Effective batch size for training\n", + "max_tokens_per_gpu = selected_example[\"example_max_tokens_per_gpu\"] # Maximum tokens per GPU (memory limit)\n", + "max_seq_len = selected_example[\"example_max_seq_len\"] # Maximum sequence length\n", + "learning_rate = selected_example[\"example_learning_rate\"] # Learning rate for training\n", + "\n", + "print(\"\ud83d\udccb Required Parameters (all must be specified):\")\n", + "print(f\" \u2022 model_path: {model_path}\")\n", + "print(f\" \u2022 data_path: {data_path}\")\n", + "print(f\" \u2022 ckpt_output_dir: {ckpt_output_dir}\")\n", + "print(f\" \u2022 unfreeze_rank_ratio: {unfreeze_rank_ratio}\")\n", + "print(f\" \u2022 effective_batch_size: {effective_batch_size}\")\n", + "print(f\" \u2022 max_tokens_per_gpu: {max_tokens_per_gpu:,}\")\n", + "print(f\" \u2022 max_seq_len: {max_seq_len:,}\")\n", + "print(f\" \u2022 learning_rate: {learning_rate}\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# OSFT-SPECIFIC PARAMETERS\n", + "# =============================================================================\n", + "\n", + "target_patterns = None # Optional: Patterns to match specific modules for OSFT\n", + "# Example: [\"*attention*\", \"*mlp*\"] to target attention and MLP layers\n", + "\n", + "print(\"\ud83d\udd27 OSFT-Specific Parameters:\")\n", + "print(f\" unfreeze_rank_ratio: {unfreeze_rank_ratio} - Controls how much of each matrix is unfrozen\")\n", + "print(f\" \u2022 0.1-0.3: Conservative, maximum preservation\")\n", + "print(f\" \u2022 0.3-0.5: Balanced adaptation\")\n", + "print(f\" \u2022 >0.5: Rarely needed for typical use cases\")\n", + "print(f\" target_patterns: {target_patterns} - Optional patterns for selecting specific modules\")\n", + "print()\n", + "\n", + "# 
=============================================================================\n", + "# TRAINING HYPERPARAMETERS\n", + "# =============================================================================\n", + "\n", + "# num_epochs = 3 # Number of training epochs\n", + "num_epochs = 1 # Number of training epochs\n", + "seed = 42 # Random seed for reproducibility\n", + "lr_scheduler = \"cosine\" # Learning rate scheduler\n", + "lr_scheduler_kwargs = {} # Scheduler parameters\n", + "warmup_steps = 0 # Number of warmup steps\n", + "\n", + "print(\"\ud83c\udfaf Training Hyperparameters:\")\n", + "print(f\" effective_batch_size: {effective_batch_size} - Effective batch size for training\")\n", + "print(f\" learning_rate: {learning_rate} - Learning rate for model updates\")\n", + "print(f\" num_epochs: {num_epochs} - Number of training epochs\")\n", + "print(f\" lr_scheduler: '{lr_scheduler}' - Learning rate scheduler type\")\n", + "print(f\" lr_scheduler_kwargs: {lr_scheduler_kwargs} - Scheduler parameters\")\n", + "print(f\" warmup_steps: {warmup_steps} - Number of warmup steps\")\n", + "print(f\" seed: {seed} - Random seed for reproducibility\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# MEMORY AND PERFORMANCE PARAMETERS\n", + "# =============================================================================\n", + "\n", + "use_liger = True # Use Liger kernels for efficiency\n", + "\n", + "print(\"\u26a1 Memory and Performance Parameters:\")\n", + "print(f\" max_tokens_per_gpu: {max_tokens_per_gpu:,} - Maximum tokens per GPU (hard-cap for memory)\")\n", + "print(f\" max_seq_len: {max_seq_len:,} - Maximum sequence length\")\n", + "print(f\" use_liger: {use_liger} - Use Liger kernels for efficiency\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# DATA PROCESSING PARAMETERS\n", + "# 
=============================================================================\n", + "\n", + "data_output_dir = \"/dev/shm/osft_data\" # Directory for processed data (RAM disk for speed)\n", + "use_processed_dataset = False # Whether data is pre-processed\n", + "unmask_messages = False # Whether to unmask all messages for pretraining-style learning\n", + "\n", + "print(\"\ud83d\udcbe Data Processing Parameters:\")\n", + "print(f\" data_path: '{data_path}' - Path to training data (JSONL format)\")\n", + "print(f\" data_output_dir: '{data_output_dir}' - Directory to save processed data\")\n", + "print(f\" use_processed_dataset: {use_processed_dataset} - Whether to use pre-processed data\")\n", + "print(f\" unmask_messages: {unmask_messages} - Whether to unmask all messages\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# CHECKPOINTING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "checkpoint_at_epoch = True # Whether to checkpoint at each epoch\n", + "save_final_checkpoint = True # Whether to save final checkpoint\n", + "\n", + "print(\"\ud83d\udcbe Checkpointing Parameters:\")\n", + "print(f\" ckpt_output_dir: '{ckpt_output_dir}' - Directory to save checkpoints\")\n", + "print(f\" checkpoint_at_epoch: {checkpoint_at_epoch} - Whether to checkpoint at each epoch\")\n", + "print(f\" save_final_checkpoint: {save_final_checkpoint} - Whether to save final checkpoint\")\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Distributed Training Configuration\n", + "\n", + "Configure distributed training for both single-node and multi-node setups.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# DISTRIBUTED TRAINING PARAMETERS\n", + "# 
=============================================================================\n", + "\n", + "# Configuration options for different setups\n", + "distributed_configs = {\n", + " \"single_gpu_dev\": {\n", + " \"nproc_per_node\": 1,\n", + " \"nnodes\": 1,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 1,\n", + " \"rdzv_endpoint\": \"127.0.0.1:29500\",\n", + " \"description\": \"Development setup with single GPU\"\n", + " },\n", + " \"single_node_8gpu\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 1,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 100,\n", + " \"rdzv_endpoint\": \"127.0.0.1:29500\",\n", + " \"description\": \"Single node with 8 GPUs\"\n", + " },\n", + " \"multi_node_master\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 2, # 2 nodes\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 42,\n", + " # master node IP\n", + " \"rdzv_endpoint\": \"10.241.128.23:1738\", # Replace with actual master IP\n", + " \"description\": \"Multi-node master (rank 0) - 2 nodes total\"\n", + " },\n", + " \"multi_node_worker\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 2, # 2 nodes\n", + " \"node_rank\": 1, # Change this for each worker node (1, 2, 3, ...)\n", + " \"rdzv_id\": 42,\n", + " \"rdzv_endpoint\": \"10.241.128.23:1738\", # Same as master\n", + " \"description\": \"Multi-node worker (rank 1) - change rank for each worker\"\n", + " }\n", + "}\n", + "\n", + "# Select your distributed configuration\n", + "selected_distributed = \"single_gpu_dev\" # Change this to match your setup\n", + "dist_config = distributed_configs[selected_distributed]\n", + "\n", + "# Extract distributed training parameters\n", + "nproc_per_node = dist_config[\"nproc_per_node\"] # Number of processes (GPUs) per node\n", + "nnodes = dist_config[\"nnodes\"] # Total number of nodes\n", + "node_rank = dist_config[\"node_rank\"] # Rank of this node (0 to nnodes-1)\n", + "rdzv_id = dist_config[\"rdzv_id\"] # Unique job ID for rendezvous\n", + "rdzv_endpoint = 
dist_config[\"rdzv_endpoint\"] # Master node endpoint for multi-node training\n", + "\n", + "# Calculate total resources\n", + "total_gpus = nproc_per_node * nnodes\n", + "per_gpu_batch_size = effective_batch_size // total_gpus\n", + "\n", + "print(\"\ud83d\udda5\ufe0f Distributed Training Parameters:\")\n", + "print(f\" Configuration: {dist_config['description']}\")\n", + "print(f\" nproc_per_node: {nproc_per_node} - Number of processes (GPUs) per node\")\n", + "print(f\" nnodes: {nnodes} - Total number of nodes\")\n", + "print(f\" node_rank: {node_rank} - Rank of this node (0 to nnodes-1)\")\n", + "print(f\" rdzv_id: {rdzv_id} - Unique job ID for rendezvous\")\n", + "print(f\" rdzv_endpoint: '{rdzv_endpoint}' - Master node endpoint for multi-node training\")\n", + "print()\n", + "print(f\"\ud83d\udcca Resource Calculation:\")\n", + "print(f\" Total GPUs: {total_gpus} ({nproc_per_node} \u00d7 {nnodes})\")\n", + "print(f\" Effective batch size: {effective_batch_size}\")\n", + "print(f\" Approximate per-GPU batch size: {per_gpu_batch_size}\")\n", + "print(f\" (Actual micro-batch size determined automatically by gradient accumulation)\")\n", + "print()\n", + "\n", + "# Multi-node setup instructions\n", + "if nnodes > 1:\n", + " print(\"\ud83d\udd27 Multi-Node Setup Instructions:\")\n", + " print(f\" 1. Ensure all nodes can reach the master at {rdzv_endpoint}\")\n", + " print(f\" 2. Use the same rdzv_id ({rdzv_id}) on all nodes\")\n", + " print(f\" 3. Set node_rank to 0 for master, 1,2,3... for workers\")\n", + " print(f\" 4. 
Start training on ALL nodes simultaneously\")\n", + " print()\n", + "\n", + "# OSFT-specific multi-node considerations\n", + "print(\"\ud83d\udcdd OSFT Multi-Node Considerations:\")\n", + "print(\" \u2022 OSFT works seamlessly across multiple nodes\")\n", + "print(\" \u2022 No special replay buffer coordination needed (unlike SFT)\")\n", + "print(\" \u2022 Each node processes its data portion with the same unfreeze_rank_ratio\")\n", + "print(\" \u2022 Gradients are synchronized automatically across all nodes\")\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Execute Training\n", + "\n", + "Now let's run the actual OSFT training with all our configured parameters.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# TRAINING EXECUTION\n", + "# =============================================================================\n", + "\n", + "print(\"\ud83d\ude80 Starting OSFT Training\")\n", + "print(\"=\" * 60)\n", + "print(f\"Experiment: {full_experiment_name}\")\n", + "print(f\"Model: {selected_example['model_name']}\")\n", + "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node \u00d7 {nnodes} nodes)\")\n", + "print(f\"Configuration: {dist_config['description']}\")\n", + "print(f\"Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n", + "print()\n", + "print(\"\u2728 OSFT Advantages:\")\n", + "print(\" \u2022 No catastrophic forgetting\")\n", + "print(\" \u2022 No replay buffer needed\")\n", + "print(\" \u2022 Preserves original model capabilities\")\n", + "print()\n", + "\n", + "# Prepare all training parameters\n", + "training_params = {\n", + " # Required parameters\n", + " 'model_path': model_path,\n", + " 'data_path': data_path,\n", + " 'ckpt_output_dir': ckpt_output_dir,\n", + " 'unfreeze_rank_ratio': unfreeze_rank_ratio,\n", + " 'effective_batch_size': 
effective_batch_size,\n", + " 'max_tokens_per_gpu': max_tokens_per_gpu,\n", + " 'max_seq_len': max_seq_len,\n", + " 'learning_rate': learning_rate,\n", + " \n", + " # Optional OSFT-specific parameters\n", + " 'target_patterns': target_patterns,\n", + " \n", + " # Training duration\n", + " 'num_epochs': num_epochs,\n", + " \n", + " # Data processing parameters\n", + " 'data_output_dir': data_output_dir,\n", + " 'use_processed_dataset': use_processed_dataset,\n", + " 'unmask_messages': unmask_messages,\n", + " 'warmup_steps': warmup_steps,\n", + " \n", + " # Optimization parameters\n", + " 'use_liger': use_liger,\n", + " 'seed': seed,\n", + " 'lr_scheduler': lr_scheduler,\n", + " 'lr_scheduler_kwargs': lr_scheduler_kwargs,\n", + " \n", + " # Checkpointing parameters\n", + " 'checkpoint_at_epoch': checkpoint_at_epoch,\n", + " 'save_final_checkpoint': save_final_checkpoint,\n", + " \n", + " # Distributed training parameters\n", + " 'nproc_per_node': nproc_per_node,\n", + " 'nnodes': nnodes,\n", + " 'node_rank': node_rank,\n", + " 'rdzv_id': rdzv_id,\n", + " 'rdzv_endpoint': rdzv_endpoint,\n", + "}\n", + "\n", + "# Display final configuration summary\n", + "print(\"\ud83d\udccb Final Training Configuration:\")\n", + "for key, value in training_params.items():\n", + " if value is not None: # Only show non-None values\n", + " print(f\" {key}: {value}\")\n", + "\n", + "print(\"\\n\" + \"=\"*60)\n", + "print(\"\u23f3 Training starting...\")\n", + "print(\"=\"*60)\n", + "\n", + "# Execute training\n", + "start_time = time.time()\n", + "\n", + "try:\n", + " result = osft(**training_params)\n", + " \n", + " end_time = time.time()\n", + " duration = end_time - start_time\n", + " \n", + " print(\"\\n\" + \"=\"*60)\n", + " print(\"\u2705 OSFT Training completed successfully!\")\n", + " print(f\"\u23f1\ufe0f Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n", + " print(f\"\ud83d\udcc1 Checkpoints saved to: {ckpt_output_dir}\")\n", + " print(\"=\"*60)\n", + 
" print()\n", + " print(\"\ud83c\udfaf What you've achieved with OSFT:\")\n", + " print(\" \u2022 Model adapted to new domain/task\")\n", + " print(\" \u2022 Original capabilities preserved\")\n", + " print(\" \u2022 No catastrophic forgetting occurred\")\n", + " print(\" \u2022 Ready for deployment without regression testing!\")\n", + " \n", + "except Exception as e:\n", + " end_time = time.time()\n", + " duration = end_time - start_time\n", + " \n", + " print(\"\\n\" + \"=\"*60)\n", + " print(f\"\u274c Training failed after {duration/60:.1f} minutes\")\n", + " print(f\"Error: {e}\")\n", + " print(\"=\"*60)\n", + " \n", + " print(\"\\n\ud83d\udd0d Quick Troubleshooting Checklist:\")\n", + " print(\" \u25a1 Check that model_path exists or is a valid HuggingFace model name\")\n", + " print(\" \u25a1 Verify data_path points to valid JSONL file\")\n", + " print(\" \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\")\n", + " print(\" \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\")\n", + " print(\" \u25a1 Try adjusting unfreeze_rank_ratio (lower = more preservation)\")\n", + " print(\" \u25a1 For multi-node: verify network connectivity and endpoints\")\n", + " print(\" \u25a1 Check that mini-trainer backend dependencies are installed\")\n", + " \n", + " raise\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Post-Training Analysis\n", + "\n", + "After training completes, let's analyze the results and provide guidance for next steps.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# POST-TRAINING ANALYSIS AND NEXT STEPS\n", + "# =============================================================================\n", + "\n", + "print(\"\ud83d\udcca Post-Training Analysis\")\n", + "print(\"=\" * 50)\n", + "\n", + "# Check for saved checkpoints\n", + 
"checkpoint_dir = ckpt_output_dir\n", + "\n", + "if os.path.exists(checkpoint_dir):\n", + " checkpoints = [d for d in os.listdir(checkpoint_dir) \n", + " if os.path.isdir(os.path.join(checkpoint_dir, d))]\n", + " \n", + " if checkpoints:\n", + " print(f\"\u2705 Found {len(checkpoints)} checkpoint(s):\")\n", + " for ckpt in sorted(checkpoints):\n", + " ckpt_path = os.path.join(checkpoint_dir, ckpt)\n", + " print(f\" \ud83d\udcc1 {ckpt}\")\n", + " \n", + " # Identify the final checkpoint\n", + " final_checkpoint = sorted(checkpoints)[-1]\n", + " final_checkpoint_path = os.path.join(checkpoint_dir, final_checkpoint)\n", + " \n", + " print(f\"\\n\ud83c\udfaf Final model checkpoint: {final_checkpoint_path}\")\n", + " \n", + " # Provide model loading example\n", + " print(f\"\\n\ud83d\udcbb Model Loading Example:\")\n", + " print(f\"```python\")\n", + " print(f\"from transformers import AutoModelForCausalLM, AutoTokenizer\")\n", + " print(f\"\")\n", + " print(f\"# Load your OSFT-adapted model\")\n", + " print(f\"model = AutoModelForCausalLM.from_pretrained('{final_checkpoint_path}')\")\n", + " print(f\"tokenizer = AutoTokenizer.from_pretrained('{final_checkpoint_path}')\")\n", + " print(f\"\")\n", + " print(f\"# Test the model - it should maintain original capabilities\")\n", + " print(f\"# while excelling at your new domain/task\")\n", + " print(f\"inputs = tokenizer('Your domain-specific prompt:', return_tensors='pt')\")\n", + " print(f\"outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)\")\n", + " print(f\"response = tokenizer.decode(outputs[0], skip_special_tokens=True)\")\n", + " print(f\"print(response)\")\n", + " print(f\"```\")\n", + " else:\n", + " print(f\"\u274c No checkpoints found in {checkpoint_dir}\")\n", + "else:\n", + " print(f\"\u274c Checkpoint directory not found: {checkpoint_dir}\")\n", + "\n", + "# Training summary\n", + "print(f\"\\n\ud83d\udcc8 Training Summary:\")\n", + "print(f\" Model: {selected_example['model_name']}\")\n", 
+ "print(f\" Algorithm: OSFT (Orthogonal Subspace Fine-Tuning)\")\n", + "print(f\" Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n", + "print(f\" Epochs: {num_epochs}\")\n", + "print(f\" Effective Batch Size: {effective_batch_size}\")\n", + "print(f\" Learning Rate: {learning_rate}\")\n", + "print(f\" Max Tokens per GPU: {max_tokens_per_gpu:,}\")\n", + "print(f\" Max Sequence Length: {max_seq_len:,}\")\n", + "print(f\" Total GPUs: {total_gpus}\")\n", + "print(f\" Distributed Config: {dist_config['description']}\")\n", + "\n", + "# OSFT-specific validation recommendations\n", + "print(f\"\\n\ud83e\uddea OSFT-Specific Validation Steps:\")\n", + "print(f\" 1. **Test Original Capabilities**: Verify the model still performs well on\")\n", + "print(f\" general tasks it was originally trained for\")\n", + "print(f\" 2. **Test New Domain**: Confirm improved performance on your target domain\")\n", + "print(f\" 3. **Lighter Regression Testing**: OSFT is designed to preserve capabilities,\")\n", + "print(f\" which can reduce, but does not remove, validation overhead\")\n", + "print(f\" 4. **Compare with Base Model**: Run side-by-side comparisons to check for\")\n", + "print(f\" improvements without degradation\")\n", + "\n", + "# Next steps recommendations\n", + "print(f\"\\n\ud83d\ude80 Recommended Next Steps:\")\n", + "print(f\" 1. \ud83c\udfaf Test on domain-specific evaluation sets\")\n", + "print(f\" 2. \ud83d\udcca Compare performance with base model on both general and domain tasks\")\n", + "print(f\" 3. \ud83d\udd04 If more adaptation is needed, slightly increase unfreeze_rank_ratio\")\n", + "print(f\" 4. \ud83d\udca1 If too much change occurred, reduce unfreeze_rank_ratio\")\n", + "print(f\" 5. \ud83d\udcdd Document the unfreeze_rank_ratio that works best for your use case\")\n", + "print(f\" 6. 
\ud83d\udea2 Deploy once evaluation confirms no regression on general tasks\")\n", + "\n", + "# Performance optimization tips\n", + "print(f\"\\n\u26a1 OSFT-Specific Optimization Tips:\")\n", + "print(f\" \u2022 Current unfreeze_rank_ratio ({unfreeze_rank_ratio}):\")\n", + "if unfreeze_rank_ratio < 0.2:\n", + " print(f\" Very conservative - great preservation, slower adaptation\")\n", + " print(f\" Consider increasing to 0.25-0.3 if you need more adaptation\")\n", + "elif unfreeze_rank_ratio < 0.35:\n", + " print(f\" Balanced - good preservation with reasonable adaptation\")\n", + " print(f\" This is a reasonable starting point for most use cases\")\n", + "else:\n", + " print(f\" Aggressive - faster adaptation, slightly less preservation\")\n", + " print(f\" Consider reducing it if you see any capability degradation\")\n", + "\n", + "print(f\" \u2022 Memory usage is similar to SFT - adjust max_tokens_per_gpu as needed\")\n", + "print(f\" \u2022 For production: use the script version for better logging and resumption\")\n", + "\n", + "print(f\"\\n\u2728 OSFT Training Complete!\")\n", + "print(f\"Next: evaluate the adapted model on general and domain tasks before deployment.\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parameter Reference Summary\n", + "\n", + "Quick reference for all OSFT parameters and their purposes.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Required Parameters\n", + "\n", + "| Parameter | Description | Example Values |\n", + "|-----------|-------------|----------------|\n", + "| `model_path` | Path to the model to fine-tune | `\"Qwen/Qwen2.5-7B\"`, `\"/path/to/model\"` |\n", + "| `data_path` | Path to the training data | `\"/path/to/train.jsonl\"` |\n", + "| `ckpt_output_dir` | Directory to save checkpoints | `\"/path/to/checkpoints\"` |\n", + "| `unfreeze_rank_ratio` | **OSFT-specific**: Controls preservation vs adaptation | `0.25`, `0.3`, `0.4` |\n", + "| `effective_batch_size` | Effective batch size for training | `64`, 
`128`, `256` |\n", + "| `max_tokens_per_gpu` | Maximum tokens per GPU (memory limit) | `16384`, `25000`, `40000` |\n", + "| `max_seq_len` | Maximum sequence length | `2048`, `8192`, `32768` |\n", + "| `learning_rate` | Learning rate for training | `1e-5`, `2e-5`, `5e-6` |\n", + "\n", + "### OSFT-Specific Parameters\n", + "\n", + "| Parameter | Description | Recommended Values | Use Case |\n", + "|-----------|-------------|-------------------|----------|\n", + "| `unfreeze_rank_ratio` | Controls how much of each matrix is unfrozen | `0.1-0.3` | Conservative preservation |\n", + "| | | `0.3-0.5` | Balanced adaptation |\n", + "| | | `>0.5` | Rarely needed |\n", + "| `target_patterns` | Optional patterns to match specific modules | `None` | Default (all modules) |\n", + "\n", + "### Training Configuration Parameters\n", + "\n", + "| Parameter | Description | Default/Example |\n", + "|-----------|-------------|-----------------|\n", + "| `num_epochs` | Number of training epochs | `1` |\n", + "| `seed` | Random seed for reproducibility | `42` |\n", + "| `use_liger` | Enable Liger kernels for efficiency | `False` |\n", + "| `warmup_steps` | Number of warmup steps | `0` |\n", + "| `lr_scheduler` | Learning rate scheduler | `\"cosine\"` |\n", + "| `lr_scheduler_kwargs` | Additional scheduler parameters | `{\"eta_min\": 1e-6}` |\n", + "\n", + "### Data Processing Parameters\n", + "\n", + "| Parameter | Description | Default/Example |\n", + "|-----------|-------------|-----------------|\n", + "| `data_output_dir` | Directory to save processed data | Defaults to `f\"{ckpt_output_dir}/_internal_data_processing\"`, Recommended value is `\"/dev/shm\"` (shared memory) |\n", + "| `use_processed_dataset` | Use pre-processed data with input_ids/labels | `False` |\n", + "| `unmask_messages` | Unmask all messages for pretraining-style learning | `False` |\n", + "\n", + "### Checkpointing Parameters\n", + "\n", + "| Parameter | Description | Recommended |\n", + 
"|-----------|-------------|-------------|\n", + "| `checkpoint_at_epoch` | Whether to checkpoint at each epoch | `True` |\n", + "| `save_final_checkpoint` | Whether to save final checkpoint | `True` |\n", + "\n", + "### Distributed Training Parameters\n", + "\n", + "| Parameter | Description | Example Values |\n", + "|-----------|-------------|----------------|\n", + "| `nproc_per_node` | Number of processes (GPUs) per node | `1`, `4`, `8` |\n", + "| `nnodes` | Total number of nodes | `1`, `2`, `4` |\n", + "| `node_rank` | Rank of this node (0 to nnodes-1) | `0` (master), `1`, `2`... |\n", + "| `rdzv_id` | Unique job ID for rendezvous | `42`, `100` |\n", + "| `rdzv_endpoint` | Master node endpoint for multi-node training | `\"127.0.0.1:29500\"` |\n", + "\n", + "### Unfreeze Rank Ratio Guidelines\n", + "\n", + "| Use Case | Recommended Ratio | Rationale |\n", + "|----------|-------------------|-----------|\n", + "| **Minor format changes** | 0.1-0.15 | Maximum preservation, minimal changes |\n", + "| **Domain vocabulary addition** | 0.15-0.25 | Add specialized terms without losing general knowledge |\n", + "| **Domain specialization** | 0.25-0.35 | Balance between preservation and adaptation |\n", + "| **Major capability expansion** | 0.35-0.5 | More freedom for significant new capabilities |\n", + "| **Complete repurposing** | >0.5 | Rarely needed, approaching standard fine-tuning |\n", + "\n", + "### OSFT vs SFT Key Differences\n", + "\n", + "| Aspect | OSFT | SFT |\n", + "|--------|------|-----|\n", + "| **Catastrophic Forgetting** | Prevented by design | Requires replay buffers |\n", + "| **Data Requirements** | Only new domain data | Needs mixed/replay data |\n", + "| **Memory Usage** | Similar to SFT | Similar to OSFT |\n", + "| **Key Parameter** | `unfreeze_rank_ratio` | N/A |\n", + "| **Backend** | mini-trainer | instructlab-training |\n", + "| **Best For** | Continual learning, domain adaptation | Initial fine-tuning |\n", + "\n", + "### Popular Model 
Examples for OSFT\n", + "\n", + "| Model | HuggingFace Path | Recommended `unfreeze_rank_ratio` | `max_tokens_per_gpu` |\n", + "|-------|------------------|-----------------------------------|----------------------|\n", + "| Qwen 2.5 7B | `Qwen/Qwen2.5-7B-Instruct` | 0.25 | 10000 |\n", + "| Llama 3.1 8B | `meta-llama/Meta-Llama-3.1-8B-Instruct` | 0.3 | 10000 |\n", + "| Phi 4 Mini | `microsoft/Phi-4-mini-instruct` | 0.25 | 15000 |\n", + "\n", + "### Script Alternative\n", + "\n", + "For production workloads or long-running training, use the script version:\n", + "\n", + "```bash\n", + "# Qwen example\n", + "python scripts/osft_qwen_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints \\\n", + " --unfreeze-rank-ratio 0.25\n", + "\n", + "# Llama example\n", + "python scripts/osft_llama_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints \\\n", + " --unfreeze-rank-ratio 0.3\n", + "\n", + "# Phi example\n", + "python scripts/osft_phi_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints \\\n", + " --unfreeze-rank-ratio 0.25\n", + "```\n", + "\n", + "### When to Use OSFT vs SFT\n", + "\n", + "**Use OSFT when:**\n", + "- Adding domain-specific knowledge to an already-trained model\n", + "- Need to preserve original capabilities without regression\n", + "- Don't have access to original training data for replay\n", + "- Want to avoid catastrophic forgetting\n", + "- Performing continual learning across multiple domains\n", + "\n", + "**Use SFT when:**\n", + "- Training a model from scratch or base model\n", + "- Have comprehensive training data covering all desired capabilities \n", + "- Don't need to preserve specific prior behaviors\n", + "- Performing initial instruction tuning\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": 
{ + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb b/docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb new file mode 100644 index 0000000..cfb19a3 --- /dev/null +++ b/docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb @@ -0,0 +1,1679 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Comprehensive SFT Training Tutorial\n", + "\n", + "This notebook provides a comprehensive guide to Supervised Fine-Tuning (SFT) using the training_hub library. We'll cover:\n", + "\n", + "- **All available parameters** and their detailed explanations\n", + "- **Single-node and multi-node training** configurations\n", + "- **Popular model examples** (Qwen 2.5 7B Instruct, Llama 3.1 8B Instruct, Phi 4 Mini, etc.)\n", + "- **Best practices and troubleshooting**\n", + "\n", + "This tutorial serves as both a learning resource and a template you can adapt for your specific fine-tuning needs.\n", + "\n", + "**Note:** For production workflows, we also provide focused example scripts for popular models: `scripts/sft_qwen_example.py`, `scripts/sft_llama_example.py`, and `scripts/sft_phi_example.py` with better logging consistency." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup and Imports\n", + "\n", + "Let's start by importing the necessary libraries and setting up our environment.\n", + "\n", + "Install `training-hub` if it is not already installed.\n", + "```\n", + "export UV_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple\n", + "export UV_HTTP_TIMEOUT=300\n", + "pip install uv -i https://pypi.tuna.tsinghua.edu.cn/simple\n", + "uv pip install -q training-hub -i https://pypi.tuna.tsinghua.edu.cn/simple\n", + "```\n", + "\n", + "Reinstall PyTorch to match your CUDA version (for example, for CUDA 12.4, install torch==2.6.0 with cu124 support):\n", + "\n", + "```\n", + "uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# Import training_hub for SFT training\n", + "from training_hub import sft\n", + "\n", + "# Standard library imports\n", + "import os\n", + "import time\n", + "from datetime import datetime\n", + "from pathlib import Path" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "\n", + "data_path = \"./test_sft_data.jsonl\"\n", + "if not os.path.exists(data_path):\n", + " print(f\"Creating dummy dataset at {data_path}\")\n", + " dummy_data = [\n", + " {\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I am doing well, thank you! 
How can I help you today?\"}]}\n", + " ] * 10\n", + " with open(data_path, \"w\") as f:\n", + " for d in dummy_data:\n", + " f.write(json.dumps(d) + \"\\n\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Format Requirements\n", + "\n", + "Before configuring your training, ensure your data is in the correct format. Training Hub uses the instructlab-training backend, which expects data in a specific **messages format**.\n", + "\n", + "### Required Format: JSONL with Messages\n", + "\n", + "Your training data must be a **JSON Lines (.jsonl)** file where each line contains a conversation sample:\n", + "\n", + "```json\n", + "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I'm doing well, thank you! How can I help you today?\"}]}\n", + "{\"messages\": [{\"role\": \"user\", \"content\": \"What is machine learning?\"}, {\"role\": \"assistant\", \"content\": \"Machine learning is a subset of artificial intelligence...\"}]}\n", + "```\n", + "\n", + "### Message Structure\n", + "\n", + "Each conversation contains a `messages` array with message objects having:\n", + "- **`role`**: One of `\"system\"`, `\"user\"`, `\"assistant\"`, or `\"pretraining\"`\n", + "- **`content`**: The text content of the message\n", + "- **`reasoning_content`** (optional): Additional reasoning traces\n", + "\n", + "### Masking Behavior with `unmask` Field\n", + "\n", + "You can control which parts of the conversation are used for training loss by adding an `unmask` metadata field:\n", + "\n", + "#### Standard Instruction Tuning (default)\n", + "```json\n", + "{\"messages\": [...]}\n", + "```\n", + "or\n", + "```json\n", + "{\"messages\": [...], \"unmask\": false}\n", + "```\n", + "- **Trains only on assistant responses** (standard instruction-following)\n", + "- System messages are always masked (ignored for loss)\n", + 
"- User messages are masked\n", + "- Assistant messages are unmasked (used for loss calculation)\n", + "\n", + "#### Pretraining Mode\n", + "```json\n", + "{\"messages\": [...], \"unmask\": true}\n", + "```\n", + "- **Trains on all content except system messages**\n", + "- System messages are always masked\n", + "- User and assistant messages are both unmasked\n", + "- Useful for pretraining-style data where the model should learn from all text\n", + "\n", + "### Example Data Formats\n", + "\n", + "**Standard SFT (instruction-following):**\n", + "```json\n", + "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a coding assistant.\"}, {\"role\": \"user\", \"content\": \"Write a Python function to calculate factorial\"}, {\"role\": \"assistant\", \"content\": \"Here's a Python function to calculate factorial:\\n\\n```python\\ndef factorial(n):\\n if n == 0 or n == 1:\\n return 1\\n return n * factorial(n - 1)\\n```\"}]}\n", + "```\n", + "\n", + "**Pretraining-style (learn from all content):**\n", + "```json\n", + "{\"messages\": [{\"role\": \"user\", \"content\": \"The capital of France is\"}, {\"role\": \"assistant\", \"content\": \"Paris.\"}], \"unmask\": true}\n", + "```\n", + "\n", + "### Data Path Configuration\n", + "\n", + "When configuring your training, point to your JSONL file:\n", + "\n", + "```python\n", + "data_path = \"/path/to/your/training_data.jsonl\" # Your messages-format JSONL file\n", + "```\n", + "\n", + "The training pipeline will automatically:\n", + "1. Load and validate your JSONL data\n", + "2. Apply chat templates based on your model\n", + "3. Handle masking according to the `unmask` setting\n", + "4. Process the data for efficient training" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model Configuration Examples\n", + "\n", + "Here are configuration examples for popular models. These serve as starting points - adjust based on your specific hardware and requirements." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Selected Example: Qwen 3 0.6B\n", + "Model Path: /opt/app-root/src/qwen3-0.6b\n", + "Example Max Tokens per GPU: 2,048\n", + "Example Max Sequence Length: 2,048\n", + "Example Batch Size: 1\n", + "Example Learning Rate: 1e-05\n", + "\n", + "\ud83d\udca1 Remember: These are example configurations. Adjust based on your hardware and requirements.\n" + ] + } + ], + "source": [ + "# =============================================================================\n", + "# MODEL CONFIGURATION EXAMPLES\n", + "# These are example configurations - adjust based on your hardware and requirements\n", + "# =============================================================================\n", + "\n", + "# Example 1: Qwen 3 0.6B\n", + "qwen_example = {\n", + " \"model_name\": \"Qwen 3 0.6B\",\n", + " \"model_path\": \"/opt/app-root/src/Qwen3-0.6B\", # HuggingFace model name or local path\n", + " \"example_max_tokens_per_gpu\": 2048,\n", + " \"example_max_seq_len\": 2048,\n", + " \"example_batch_size\": 1,\n", + " \"example_learning_rate\": 1e-5,\n", + "}\n", + "\n", + "# Example 2: Llama 3.1 8B Instruct\n", + "llama_example = {\n", + " \"model_name\": \"Llama 3.1 8B Instruct\",\n", + " \"model_path\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", # HuggingFace model name or local path\n", + " \"example_max_tokens_per_gpu\": 18000,\n", + " \"example_max_seq_len\": 16384,\n", + " \"example_batch_size\": 128,\n", + " \"example_learning_rate\": 1e-5,\n", + "}\n", + "\n", + "# Example 3: Phi 4 Mini\n", + "phi_example = {\n", + " \"model_name\": \"Phi 4 Mini\",\n", + " \"model_path\": \"microsoft/Phi-4-mini-instruct\", # HuggingFace model name or local path\n", + " \"example_max_tokens_per_gpu\": 25000,\n", + " \"example_max_seq_len\": 8192,\n", + " \"example_batch_size\": 64,\n", + " \"example_learning_rate\": 5e-6,\n", + "}\n", + 
"\n", + "# Example 4: Generic 7B Base Model\n", + "generic_7b_example = {\n", + " \"model_name\": \"Generic 7B Base\",\n", + " \"model_path\": \"/path/to/your-7b-model\", # Local path to model directory\n", + " \"example_max_tokens_per_gpu\": 25000,\n", + " \"example_max_seq_len\": 20000,\n", + " \"example_batch_size\": 256,\n", + " \"example_learning_rate\": 2e-5,\n", + "}\n", + "\n", + "# Example 5: Smaller Model (1B-3B)\n", + "small_model_example = {\n", + " \"model_name\": \"Small Model (1B-3B)\",\n", + " \"model_path\": \"/path/to/small-model\", # Local path or HuggingFace name\n", + " \"example_max_tokens_per_gpu\": 40000,\n", + " \"example_max_seq_len\": 32768,\n", + " \"example_batch_size\": 512,\n", + " \"example_learning_rate\": 3e-5,\n", + "}\n", + "\n", + "# =============================================================================\n", + "# SELECT YOUR CONFIGURATION\n", + "# =============================================================================\n", + "\n", + "# Choose one of the examples above as a starting point\n", + "selected_example = qwen_example # Change this to your preferred example\n", + "\n", + "print(f\"Selected Example: {selected_example['model_name']}\")\n", + "print(f\"Model Path: {selected_example['model_path']}\")\n", + "print(f\"Example Max Tokens per GPU: {selected_example['example_max_tokens_per_gpu']:,}\")\n", + "print(f\"Example Max Sequence Length: {selected_example['example_max_seq_len']:,}\")\n", + "print(f\"Example Batch Size: {selected_example['example_batch_size']:,}\")\n", + "print(f\"Example Learning Rate: {selected_example['example_learning_rate']}\")\n", + "print(\"\\n\ud83d\udca1 Remember: These are example configurations. Adjust based on your hardware and requirements.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Complete Parameter Reference\n", + "\n", + "Let's configure all available SFT parameters with detailed explanations." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\ud83d\udccb Required Parameters:\n", + " model_path: Path to the model to fine-tune (HuggingFace name or local path)\n", + " data_path: Path to the training data (JSONL format)\n", + " ckpt_output_dir: Directory to save checkpoints\n", + "\n", + "\ud83c\udfaf Core Training Parameters:\n", + " num_epochs: 3 - Number of training epochs\n", + " effective_batch_size: 1 - Effective batch size for training\n", + " learning_rate: 1e-05 - Learning rate for training\n", + " max_seq_len: 2,048 - Maximum sequence length\n", + " max_tokens_per_gpu: 2,048 - Maximum tokens per GPU in a mini-batch (hard-cap for memory to avoid OOMs). Used to automatically calculate mini-batch size and gradient accumulation to maintain the desired effective_batch_size while staying within memory limits.\n", + "\n", + "\ud83d\udcbe Data Processing Parameters:\n", + " data_output_dir: '/dev/shm' - Directory to save processed data\n", + " warmup_steps: 100 - Number of warmup steps\n", + "\n", + "\ud83d\udcbe Checkpointing Parameters:\n", + " save_samples: 0 - Number of samples to save after training (0 disables saving based on sample count)\n", + " checkpoint_at_epoch: True - Whether to checkpoint at each epoch\n", + " accelerate_full_state_at_epoch: True - Whether to save full state at epoch for automatic checkpoint resumption\n", + "\n" + ] + } + ], + "source": [ + "# =============================================================================\n", + "# COMPLETE SFT PARAMETER CONFIGURATION\n", + "# =============================================================================\n", + "\n", + "# Experiment identification\n", + "experiment_name = \"sft_comprehensive_example\"\n", + "timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", + "full_experiment_name = f\"{experiment_name}_{timestamp}\"\n", + "\n", + "# 
=============================================================================\n", + "# REQUIRED PARAMETERS\n", + "# =============================================================================\n", + "\n", + "model_path = selected_example[\"model_path\"] # HuggingFace model name or local path\n", + "data_path = \"./test_sft_data.jsonl\" # Path to training data in JSONL format\n", + "ckpt_output_dir = f\"checkpoints/{full_experiment_name}\" # Where to save checkpoints\n", + "\n", + "print(\"\ud83d\udccb Required Parameters:\")\n", + "print(f\" model_path: Path to the model to fine-tune (HuggingFace name or local path)\")\n", + "print(f\" data_path: Path to the training data (JSONL format)\")\n", + "print(f\" ckpt_output_dir: Directory to save checkpoints\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# CORE TRAINING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "num_epochs = 3 # Number of training epochs\n", + "effective_batch_size = selected_example[\"example_batch_size\"] # Effective batch size for training\n", + "learning_rate = selected_example[\"example_learning_rate\"] # Learning rate for training\n", + "max_seq_len = selected_example[\"example_max_seq_len\"] # Maximum sequence length\n", + "max_tokens_per_gpu = selected_example[\"example_max_tokens_per_gpu\"] # Maximum tokens per GPU in a mini-batch (hard-cap for memory to avoid OOMs)\n", + "\n", + "print(\"\ud83c\udfaf Core Training Parameters:\")\n", + "print(f\" num_epochs: {num_epochs} - Number of training epochs\")\n", + "print(f\" effective_batch_size: {effective_batch_size} - Effective batch size for training\")\n", + "print(f\" learning_rate: {learning_rate} - Learning rate for training\")\n", + "print(f\" max_seq_len: {max_seq_len:,} - Maximum sequence length\")\n", + "print(f\" max_tokens_per_gpu: {max_tokens_per_gpu:,} - Maximum tokens per GPU in a mini-batch 
(hard-cap for memory to avoid OOMs). Used to automatically calculate mini-batch size and gradient accumulation to maintain the desired effective_batch_size while staying within memory limits.\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# DATA AND PROCESSING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "data_output_dir = \"/dev/shm\" # Directory to save processed data\n", + "warmup_steps = 100 # Number of warmup steps\n", + "\n", + "print(\"\ud83d\udcbe Data Processing Parameters:\")\n", + "print(f\" data_output_dir: '{data_output_dir}' - Directory to save processed data\")\n", + "print(f\" warmup_steps: {warmup_steps} - Number of warmup steps\")\n", + "print()\n", + "\n", + "# =============================================================================\n", + "# CHECKPOINTING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "save_samples = 0 # Number of samples to save after training (0 disables saving based on sample count)\n", + "checkpoint_at_epoch = True # Whether to checkpoint at each epoch\n", + "accelerate_full_state_at_epoch = True # Whether to save full state at epoch for automatic checkpoint resumption\n", + "\n", + "print(\"\ud83d\udcbe Checkpointing Parameters:\")\n", + "print(f\" save_samples: {save_samples} - Number of samples to save after training (0 disables saving based on sample count)\")\n", + "print(f\" checkpoint_at_epoch: {checkpoint_at_epoch} - Whether to checkpoint at each epoch\")\n", + "print(f\" accelerate_full_state_at_epoch: {accelerate_full_state_at_epoch} - Whether to save full state at epoch for automatic checkpoint resumption\")\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Distributed Training Configuration\n", + "\n", + "Configure distributed training for both single-node and 
multi-node setups." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\ud83d\udda5\ufe0f Distributed Training Parameters:\n", + " Configuration: Development setup with single GPU\n", + " nproc_per_node: 1 - Number of processes (GPUs) per node\n", + " nnodes: 1 - Total number of nodes\n", + " node_rank: 0 - Rank of this node (0 to nnodes-1)\n", + " rdzv_id: 1 - Unique job ID for rendezvous\n", + " rdzv_endpoint: '127.0.0.1:29500' - Master node endpoint for multi-node training\n", + "\n", + "\ud83d\udcca Resource Calculation:\n", + " Total GPUs: 1 (1 \u00d7 1)\n", + " Effective batch size: 1\n", + " Approximate per-GPU batch size: 1\n", + " (Actual micro-batch size determined automatically by gradient accumulation)\n", + "\n" + ] + } + ], + "source": [ + "# =============================================================================\n", + "# DISTRIBUTED TRAINING PARAMETERS\n", + "# =============================================================================\n", + "\n", + "# Configuration options for different setups\n", + "distributed_configs = {\n", + " \"single_gpu_dev\": {\n", + " \"nproc_per_node\": 1,\n", + " \"nnodes\": 1,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 1,\n", + " \"rdzv_endpoint\": \"127.0.0.1:29500\",\n", + " \"description\": \"Development setup with single GPU\"\n", + " },\n", + " \"single_node_8gpu\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 1,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 100,\n", + " \"rdzv_endpoint\": \"127.0.0.1:29500\",\n", + " \"description\": \"Single node with 8 GPUs\"\n", + " },\n", + " \"multi_node_master\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 4,\n", + " \"node_rank\": 0,\n", + " \"rdzv_id\": 42,\n", + " \"rdzv_endpoint\": \"10.0.0.1:29500\", # Replace with actual master IP\n", + " \"description\": \"Multi-node master (rank 0) - 4 nodes total\"\n", + " },\n", + " 
\"multi_node_worker\": {\n", + " \"nproc_per_node\": 8,\n", + " \"nnodes\": 4,\n", + " \"node_rank\": 1, # Change this for each worker node (1, 2, 3, ...)\n", + " \"rdzv_id\": 42,\n", + " \"rdzv_endpoint\": \"10.0.0.1:29500\", # Same as master\n", + " \"description\": \"Multi-node worker (rank 1) - change rank for each worker\"\n", + " }\n", + "}\n", + "\n", + "# Select your distributed configuration\n", + "selected_distributed = \"single_gpu_dev\" # Change this to match your setup\n", + "dist_config = distributed_configs[selected_distributed]\n", + "\n", + "# Extract distributed training parameters\n", + "nproc_per_node = dist_config[\"nproc_per_node\"] # Number of processes (GPUs) per node\n", + "nnodes = dist_config[\"nnodes\"] # Total number of nodes\n", + "node_rank = dist_config[\"node_rank\"] # Rank of this node (0 to nnodes-1)\n", + "rdzv_id = dist_config[\"rdzv_id\"] # Unique job ID for rendezvous\n", + "rdzv_endpoint = dist_config[\"rdzv_endpoint\"] # Master node endpoint for multi-node training\n", + "\n", + "# Calculate total resources\n", + "total_gpus = nproc_per_node * nnodes\n", + "per_gpu_batch_size = effective_batch_size // total_gpus\n", + "\n", + "print(\"\ud83d\udda5\ufe0f Distributed Training Parameters:\")\n", + "print(f\" Configuration: {dist_config['description']}\")\n", + "print(f\" nproc_per_node: {nproc_per_node} - Number of processes (GPUs) per node\")\n", + "print(f\" nnodes: {nnodes} - Total number of nodes\")\n", + "print(f\" node_rank: {node_rank} - Rank of this node (0 to nnodes-1)\")\n", + "print(f\" rdzv_id: {rdzv_id} - Unique job ID for rendezvous\")\n", + "print(f\" rdzv_endpoint: '{rdzv_endpoint}' - Master node endpoint for multi-node training\")\n", + "print()\n", + "print(f\"\ud83d\udcca Resource Calculation:\")\n", + "print(f\" Total GPUs: {total_gpus} ({nproc_per_node} \u00d7 {nnodes})\")\n", + "print(f\" Effective batch size: {effective_batch_size}\")\n", + "print(f\" Approximate per-GPU batch size: 
{per_gpu_batch_size}\")\n", + "print(f\" (Actual micro-batch size determined automatically by gradient accumulation)\")\n", + "print()\n", + "\n", + "# Multi-node setup instructions\n", + "if nnodes > 1:\n", + " print(\"\ud83d\udd27 Multi-Node Setup Instructions:\")\n", + " print(f\" 1. Ensure all nodes can reach the master at {rdzv_endpoint}\")\n", + " print(f\" 2. Use the same rdzv_id ({rdzv_id}) on all nodes\")\n", + " print(f\" 3. Set node_rank to 0 for master, 1,2,3... for workers\")\n", + " print(f\" 4. Start training on ALL nodes simultaneously\")\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Execute Training\n", + "\n", + "Now let's run the actual SFT training with all our configured parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\ud83d\ude80 Starting SFT Training\n", + "============================================================\n", + "Experiment: sft_comprehensive_example_20260323_072149\n", + "Model: Qwen 3 0.6B\n", + "Total GPUs: 1 (1 per node \u00d7 1 nodes)\n", + "Configuration: Development setup with single GPU\n", + "\n", + "\ud83d\udccb Final Training Configuration:\n", + " model_path: /opt/app-root/src/qwen3-0.6b\n", + " data_path: ./test_sft_data.jsonl\n", + " ckpt_output_dir: checkpoints/sft_comprehensive_example_20260323_072149\n", + " num_epochs: 3\n", + " effective_batch_size: 1\n", + " learning_rate: 1e-05\n", + " max_seq_len: 2048\n", + " max_tokens_per_gpu: 2048\n", + " data_output_dir: /dev/shm\n", + " warmup_steps: 100\n", + " save_samples: 0\n", + " checkpoint_at_epoch: True\n", + " accelerate_full_state_at_epoch: True\n", + " nproc_per_node: 1\n", + " nnodes: 1\n", + " node_rank: 0\n", + " rdzv_id: 1\n", + " rdzv_endpoint: 127.0.0.1:29500\n", + " disable_flash_attn: True\n", + "\n", + "============================================================\n", + "\u23f3 
Training starting...\n", + "============================================================\n" + ] + }, + { + "data": { + "text/html": [ + "
[07:21:52] INFO     Starting training setup...                                                       main_ds.py:591\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m[07:21:52]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Starting training setup\u001b[33m...\u001b[0m \u001b]8;id=399409;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=389230;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#591\u001b\\\u001b[2m591\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "data": { + "text/html": [ + "
           WARNING  num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.      arrow_dataset.py:3123\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=592580;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=152011;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "data": { + "text/html": [ + "
[07:21:55] WARNING  num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.      arrow_dataset.py:3123\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m[07:21:55]\u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=209439;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=307155;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "9ca4f757279149fdb5e5bbeb03e9bfd7", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Converting samples into input_ids and labels... (num_proc=3): 100%|##########| 3/3 [00:00[07:22:07] INFO ten largest length percentiles: data_process.py:1355\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m[07:22:07]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m ten largest length percentiles: \u001b]8;id=670115;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=495753;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1355\u001b\\\u001b[2m1355\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 90th: 70                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 90th: \u001b[1;36m70\u001b[0m \u001b]8;id=657600;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=118760;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 91th: 70                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 91th: \u001b[1;36m70\u001b[0m \u001b]8;id=996028;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=273551;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 92th: 71                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 92th: \u001b[1;36m71\u001b[0m \u001b]8;id=658621;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=94733;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 93th: 71                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 93th: \u001b[1;36m71\u001b[0m \u001b]8;id=246082;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=931645;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 94th: 72                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 94th: \u001b[1;36m72\u001b[0m \u001b]8;id=452894;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=347117;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 95th: 73                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 95th: \u001b[1;36m73\u001b[0m \u001b]8;id=626760;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=619372;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 96th: 73                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 96th: \u001b[1;36m73\u001b[0m \u001b]8;id=265428;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=452333;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 97th: 74                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 97th: \u001b[1;36m74\u001b[0m \u001b]8;id=841787;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=465368;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 98th: 74                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 98th: \u001b[1;36m74\u001b[0m \u001b]8;id=252732;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=895643;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 99th: 75                                                          data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 99th: \u001b[1;36m75\u001b[0m \u001b]8;id=731576;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=8096;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 100th: 76                                                         data_process.py:1358\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 100th: \u001b[1;36m76\u001b[0m \u001b]8;id=559594;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=56715;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     at 2048 max sequence length, the number of samples to be dropped is 0      data_process.py:1362\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m at \u001b[1;36m2048\u001b[0m max sequence length, the number of samples to be dropped is \u001b[1;36m0\u001b[0m \u001b]8;id=663979;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=152605;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1362\u001b\\\u001b[2m1362\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     (0.00 of total)                                                            data_process.py:1367\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m \u001b[1m(\u001b[0m\u001b[1;36m0.00\u001b[0m of total\u001b[1m)\u001b[0m \u001b]8;id=915262;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=203311;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1367\u001b\\\u001b[2m1367\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 0th: 28                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 0th: \u001b[1;36m28\u001b[0m \u001b]8;id=674658;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=54929;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 1th: 28                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 1th: \u001b[1;36m28\u001b[0m \u001b]8;id=856039;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=914854;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 2th: 28                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 2th: \u001b[1;36m28\u001b[0m \u001b]8;id=556445;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=385141;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 3th: 29                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 3th: \u001b[1;36m29\u001b[0m \u001b]8;id=738851;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=147455;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 4th: 29                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 4th: \u001b[1;36m29\u001b[0m \u001b]8;id=628289;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=34700;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 5th: 29                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 5th: \u001b[1;36m29\u001b[0m \u001b]8;id=243709;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=947817;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 6th: 30                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 6th: \u001b[1;36m30\u001b[0m \u001b]8;id=318583;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=443338;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 7th: 30                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 7th: \u001b[1;36m30\u001b[0m \u001b]8;id=143291;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=65864;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 8th: 30                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 8th: \u001b[1;36m30\u001b[0m \u001b]8;id=411537;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=957882;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 9th: 31                                                           data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 9th: \u001b[1;36m31\u001b[0m \u001b]8;id=831237;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=856583;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     quantile 10th: 31                                                          data_process.py:1378\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m quantile 10th: \u001b[1;36m31\u001b[0m \u001b]8;id=612081;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=731268;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     at 20 min sequence length, the number of samples to be dropped is 0        data_process.py:1382\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m at \u001b[1;36m20\u001b[0m min sequence length, the number of samples to be dropped is \u001b[1;36m0\u001b[0m \u001b]8;id=102879;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=675812;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1382\u001b\\\u001b[2m1382\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "data": { + "text/html": [ + "
           WARNING  num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.      arrow_dataset.py:3123\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=881091;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=320766;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "037b02187d454cd880a6d947d3353fb0", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Filter (num_proc=3): 0%| | 0/3 [00:00 INFO Samples Previews... data_process.py:1392\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Samples Previews\u001b[33m...\u001b[0m \u001b]8;id=64443;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=282143;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1392\u001b\\\u001b[2m1392\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n" + ] + }, + { + "data": { + "text/html": [ + "
           WARNING  num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.      arrow_dataset.py:3123\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=75781;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=607291;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "6a16e69d33d54d0da44ea627def0e9cd", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Filtering out pretraining samples (num_proc=3): 0%| | 0/3 [00:00 WARNING num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3. arrow_dataset.py:3123\n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. 
\u001b]8;id=397717;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=196121;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "f72a54e41a794200833ea3262ce90f54", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Filtering out pretraining samples (num_proc=3): 0%| | 0/3 [00:00user\n", + "What is machine learning?<|im_end|>\n", + "<|im_start|>assistant\n", + "\n", + "\n", + "\n", + "\n", + "Machine learning is a subset of artificial intelligence...<|im_end|>\n", + "\n", + "\u001b[0m\n", + "\u001b[33mInstruction ex sample 1: <|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|>Machine learning is a subset of artificial intelligence...<|im_end|><|MASK|>\u001b[0m\n", + "\u001b[35mOriginal Input: <|im_start|>system\n", + "You are a coding assistant.<|im_end|>\n", + "<|im_start|>user\n", + "Write a Python function to calculate factorial<|im_end|>\n", + "<|im_start|>assistant\n", + "\n", + "\n", + "\n", + "\n", + "Here's a Python function to calculate factorial:\n", + "\n", + "```python\n", + "def factorial(n):\n", + " if n == 0 or n == 1:\n", + " return 1\n", + " return n * factorial(n - 1)\n", + "```<|im_end|>\n", + "\n", + "\u001b[0m\n", + "\u001b[33mInstruction ex sample 2: <|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|>Here's a Python function to calculate factorial:\n", + "\n", + "```python\n", + "def factorial(n):\n", + " if n == 0 or n == 
1:\n", + " return 1\n", + " return n * factorial(n - 1)\n", + "```<|im_end|><|MASK|>\u001b[0m\n", + "\u001b[35mOriginal Input: <|im_start|>system\n", + "You are a helpful assistant.<|im_end|>\n", + "<|im_start|>user\n", + "Hello, how are you?<|im_end|>\n", + "<|im_start|>assistant\n", + "\n", + "\n", + "\n", + "\n", + "I'm doing well, thank you! How can I help you today?<|im_end|>\n", + "\n", + "\u001b[0m\n", + "\u001b[33mInstruction ex sample 3: <|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|>I'm doing well, thank you! How can I help you today?<|im_end|><|MASK|>\u001b[0m\n" + ] + }, + { + "data": { + "text/html": [ + "
[07:22:08] WARNING  num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.      arrow_dataset.py:3123\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m[07:22:08]\u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m. \u001b]8;id=60262;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=179809;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "52504f1d6ee449c6b4e2b8a216b127fe", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Validating unmask tokens not in data (num_proc=3): 0%| | 0/3 [00:00[07:22:09] INFO Running training command as subprocess: torchrun --nproc-per-node=1 --nnodes=1 main_ds.py:794\n", + " --node-rank=0 --rdzv-id=1 --rdzv-endpoint=127.0.0.1:29500 \n", + " /opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py \n", + " --model_name_or_path=/opt/app-root/src/qwen3-0.6b \n", + " --data_path=/dev/shm/data.jsonl \n", + " --output_dir=checkpoints/sft_comprehensive_example_20260323_072149 \n", + " --num_epochs=3 --effective_batch_size=1 --learning_rate=1e-05 \n", + " --num_warmup_steps=100 --save_samples=0 --log_level=INFO --max_batch_len=2048 \n", + " --seed=42 --adamw_weight_decay=0.0 --adamw_beta1=0.9 --adamw_beta2=0.95 \n", + " --adamw_eps=1e-08 --checkpoint_at_epoch --accelerate_full_state_at_epoch \n", + " --disable_flash_attn --distributed_training_framework=fsdp \n", + " --fsdp_sharding_strategy=HYBRID_SHARD \n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m[07:22:09]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Running training command as subprocess: torchrun --nproc-per-\u001b[33mnode\u001b[0m=\u001b[1;36m1\u001b[0m 
--\u001b[33mnnodes\u001b[0m=\u001b[1;36m1\u001b[0m \u001b]8;id=725937;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=274171;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#794\u001b\\\u001b[2m794\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m --node-\u001b[33mrank\u001b[0m=\u001b[1;36m0\u001b[0m --rdzv-\u001b[33mid\u001b[0m=\u001b[1;36m1\u001b[0m --rdzv-\u001b[33mendpoint\u001b[0m=\u001b[1;92m127\u001b[0m\u001b[1;92m.0.0.1\u001b[0m:\u001b[1;36m29500\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[35m/opt/app-root/lib64/python3.12/site-packages/instructlab/training/\u001b[0m\u001b[95mmain_ds.py\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mmodel_name_or_path\u001b[0m=\u001b[35m/opt/app-root/src/\u001b[0m\u001b[95mqwen3-0.6b\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mdata_path\u001b[0m=\u001b[35m/dev/shm/\u001b[0m\u001b[95mdata.jsonl\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33moutput_dir\u001b[0m=\u001b[35mcheckpoints\u001b[0m/sft_comprehensive_example_20260323_072149 \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mnum_epochs\u001b[0m=\u001b[1;36m3\u001b[0m --\u001b[33meffective_batch_size\u001b[0m=\u001b[1;36m1\u001b[0m --\u001b[33mlearning_rate\u001b[0m=\u001b[1;36m1e\u001b[0m\u001b[1;36m-05\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mnum_warmup_steps\u001b[0m=\u001b[1;36m100\u001b[0m --\u001b[33msave_samples\u001b[0m=\u001b[1;36m0\u001b[0m --\u001b[33mlog_level\u001b[0m=\u001b[35mINFO\u001b[0m --\u001b[33mmax_batch_len\u001b[0m=\u001b[1;36m2048\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mseed\u001b[0m=\u001b[1;36m42\u001b[0m --\u001b[33madamw_weight_decay\u001b[0m=\u001b[1;36m0\u001b[0m\u001b[1;36m.0\u001b[0m 
--\u001b[33madamw_beta1\u001b[0m=\u001b[1;36m0\u001b[0m\u001b[1;36m.9\u001b[0m --\u001b[33madamw_beta2\u001b[0m=\u001b[1;36m0\u001b[0m\u001b[1;36m.95\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33madamw_eps\u001b[0m=\u001b[1;36m1e\u001b[0m\u001b[1;36m-08\u001b[0m --checkpoint_at_epoch --accelerate_full_state_at_epoch \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --disable_flash_attn --\u001b[33mdistributed_training_framework\u001b[0m=\u001b[35mfsdp\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m --\u001b[33mfsdp_sharding_strategy\u001b[0m=\u001b[35mHYBRID_SHARD\u001b[0m \u001b[2m \u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py:22: UserWarning: DeepSpeed CPU Optimizer is not available. Some features may be unavailable.\n", + " warnings.warn(\n", + "/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py:34: UserWarning: DeepSpeed is not available. Some features may be unavailable.\n", + " warnings.warn(\n", + "Loading weights: 100%|| 311/311 [00:04<00:00, 64.15it/s]]]]]]]] \n", + "The tied weights mapping and config for this model specifies to tie model.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. 
You should update the config with `tie_word_embeddings=False` to silence this warning\n", + "Generating train split: 3 examples [00:00, 565.78 examples/s]\n", + "\u001b[2;36m[07:22:34]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m \u001b[33mnum_gpus\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mavg_sample_len\u001b[0m=\u001b[1;36m50\u001b[0m\u001b[1;36m.000\u001b[0m, \u001b]8;id=234053;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=146316;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#499\u001b\\\u001b[2m499\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m \u001b[33meffective_batch_size\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mmax_batch_len_per_gpu\u001b[0m=\u001b[1;36m2048\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mpacking_max_batch_len\u001b[0m=\u001b[1;36m2048\u001b[0m, \u001b[33mnum_batches\u001b[0m=\u001b[1;36m3\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mavg_samples_per_batch\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1;36m.000\u001b[0m, \u001b[33mtotal_samples\u001b[0m=\u001b[1;36m3\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m sharding_strategy is deprecated in favor \u001b]8;id=91161;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/dataclasses.py\u001b\\\u001b[2mdataclasses.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=619176;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/dataclasses.py#1962\u001b\\\u001b[2m1962\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m of reshard_after_forward. This will be \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m removed in a future version of \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m Accelerate. 
\u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m Detected kernel version \u001b[1;36m3.10\u001b[0m.\u001b[1;36m0\u001b[0m, which is below \u001b]8;id=529903;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/other.py\u001b\\\u001b[2mother.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=631262;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/other.py#513\u001b\\\u001b[2m513\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m the recommended minimum of \u001b[1;36m5.5\u001b[0m.\u001b[1;36m0\u001b[0m; this can \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m cause the process to hang. It is recommended to \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m upgrade the kernel to the minimum version or \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m higher. \u001b[2m \u001b[0m\n", + "/opt/app-root/lib64/python3.12/site-packages/torch/distributed/fsdp/_init_utils.py:444: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.HYBRID_SHARD since the world size is 1.\n", + " warnings.warn(\n", + "/opt/app-root/lib64/python3.12/site-packages/accelerate/accelerator.py:1992: UserWarning: Upcasted low precision parameters in Qwen3ForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight.\n", + " warnings.warn(\n", + "/opt/app-root/lib64/python3.12/site-packages/accelerate/accelerator.py:1992: UserWarning: Upcasted low precision parameters in Qwen3DecoderLayer because mixed precision turned on in FSDP. 
Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, self_attn.q_norm.weight, self_attn.k_norm.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight.\n", + " warnings.warn(\n", + "/opt/app-root/lib64/python3.12/site-packages/accelerate/accelerator.py:1998: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.\n", + " warnings.warn(\n", + "Epoch 0: 0%| | 0/3 [00:00\n", + "[rank0]: main(args)\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\", line 558, in main\n", + "[rank0]: train(\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\", line 236, in train\n", + "[rank0]: accelerator.take_optimizer_step()\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/instructlab/training/accelerator.py\", line 295, in take_optimizer_step\n", + "[rank0]: self.optimizer.step()\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/accelerate/optimizer.py\", line 179, in step\n", + "[rank0]: self.optimizer.step(closure)\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/lr_scheduler.py\", line 140, in wrapper\n", + "[rank0]: return func.__get__(opt, opt.__class__)(*args, **kwargs)\n", + "[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/optimizer.py\", line 493, in wrapper\n", + "[rank0]: out = func(*args, **kwargs)\n", + "[rank0]: ^^^^^^^^^^^^^^^^^^^^^\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/optimizer.py\", line 91, in _use_grad\n", + "[rank0]: ret = func(self, *args, **kwargs)\n", + "[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/adamw.py\", line 232, in 
step\n", + "[rank0]: has_complex = self._init_group(\n", + "[rank0]: ^^^^^^^^^^^^^^^^^\n", + "[rank0]: File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/adamw.py\", line 175, in _init_group\n", + "[rank0]: state[\"exp_avg_sq\"] = torch.zeros_like(\n", + "[rank0]: ^^^^^^^^^^^^^^^^^\n", + "[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.16 GiB. GPU 0 has a total capacity of 8.00 " + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "IOPub message rate exceeded.\n", + "The Jupyter server will temporarily stop sending output\n", + "to the client in order to avoid crashing it.\n", + "To change this limit, set the config variable\n", + "`--ServerApp.iopub_msg_rate_limit`.\n", + "\n", + "Current values:\n", + "ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n", + "ServerApp.rate_limit_window=3.0 (secs)\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "22:50.846035625 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())\n", + "E0323 07:22:53.501000 4822 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 4849) of binary: /opt/app-root/bin/python3\n", + "Traceback (most recent call last):\n", + " File \"/opt/app-root/bin/torchrun\", line 10, in \n", + " sys.exit(main())\n", + " ^^^^^^\n", + " File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n", + " return f(*args, **kwargs)\n", + " ^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/run.py\", line 918, in main\n", + " run(args)\n", + " File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/run.py\", line 909, in run\n", + " elastic_launch(\n", + " File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/launcher/api.py\", line 138, in __call__\n", + " return launch_agent(self._config, self._entrypoint, list(args))\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/launcher/api.py\", line 269, in launch_agent\n", + " raise ChildFailedError(\n", + "torch.distributed.elastic.multiprocessing.errors.ChildFailedError: \n", + "============================================================\n", + "/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py FAILED\n", + "------------------------------------------------------------\n", + "Failures:\n", + " \n", + "------------------------------------------------------------\n", + "Root Cause (first observed failure):\n", + "[0]:\n", + " time : 2026-03-23_07:22:53\n", + " host : ws-wy-training-hub-ldnjr-0\n", + " rank : 0 (local_rank: 0)\n", + " exitcode : 1 (pid: 4849)\n", + " error_file: \n", + " traceback : To enable traceback see: 
https://pytorch.org/docs/stable/elastic/errors.html\n", + "============================================================\n" + ] + }, + { + "data": { + "text/html": [ + "
[07:22:55] ERROR    Training subprocess has not exited yet. Sending SIGTERM. Process code: 1         main_ds.py:824\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m[07:22:55]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31mERROR \u001b[0m Training subprocess has not exited yet. Sending SIGTERM. Process code: \u001b[1;36m1\u001b[0m \u001b]8;id=361092;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=987123;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#824\u001b\\\u001b[2m824\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
           INFO     Waiting for process to exit, 60s...                                              main_ds.py:830\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Waiting for process to exit, 60s\u001b[33m...\u001b[0m \u001b]8;id=771553;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=303331;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#830\u001b\\\u001b[2m830\u001b[0m\u001b]8;;\u001b\\\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "============================================================\n", + "\u274c Training failed after 1.1 minutes\n", + "Error: Suffered a failure during distributed training. Please see the training logs for more context.\n", + "============================================================\n", + "\n", + "\ud83d\udd0d Quick Troubleshooting Checklist:\n", + " \u25a1 Check that model_path exists or is a valid HuggingFace model name\n", + " \u25a1 Verify data_path points to valid JSONL file\n", + " \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\n", + " \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\n", + " \u25a1 For multi-node: verify network connectivity and endpoints\n", + " \u25a1 Check that all file paths are accessible from the training process\n" + ] + }, + { + "ename": "RuntimeError", + "evalue": "Suffered a failure during distributed training. 
Please see the training logs for more context.", + "output_type": "error", + "traceback": [ + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", + "\u001b[31mRuntimeError\u001b[39m Traceback (most recent call last)", + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[12]\u001b[39m\u001b[32m, line 59\u001b[39m\n\u001b[32m 56\u001b[39m start_time = time.time()\n\u001b[32m 58\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m---> \u001b[39m\u001b[32m59\u001b[39m result = \u001b[43msft\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mtraining_params\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 61\u001b[39m end_time = time.time()\n\u001b[32m 62\u001b[39m duration = end_time - start_time\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/training_hub/algorithms/sft.py:355\u001b[39m, in \u001b[36msft\u001b[39m\u001b[34m(model_path, data_path, ckpt_output_dir, backend, num_epochs, effective_batch_size, learning_rate, max_seq_len, max_tokens_per_gpu, data_output_dir, save_samples, warmup_steps, accelerate_full_state_at_epoch, checkpoint_at_epoch, is_pretraining, block_size, document_column_name, beta1, beta2, eps, weight_decay, nproc_per_node, nnodes, node_rank, rdzv_id, rdzv_endpoint, master_addr, master_port, wandb_project, wandb_entity, wandb_run_name, tensorboard_log_dir, mlflow_tracking_uri, mlflow_experiment_name, mlflow_run_name, **kwargs)\u001b[39m\n\u001b[32m 352\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01m.\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m create_algorithm\n\u001b[32m 354\u001b[39m algorithm = create_algorithm(\u001b[33m'\u001b[39m\u001b[33msft\u001b[39m\u001b[33m'\u001b[39m, backend)\n\u001b[32m--> \u001b[39m\u001b[32m355\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[43malgorithm\u001b[49m\u001b[43m.\u001b[49m\u001b[43mtrain\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 356\u001b[39m \u001b[43m \u001b[49m\u001b[43mmodel_path\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmodel_path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 357\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_path\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 358\u001b[39m \u001b[43m \u001b[49m\u001b[43mckpt_output_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mckpt_output_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 359\u001b[39m \u001b[43m \u001b[49m\u001b[43mnum_epochs\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnum_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 360\u001b[39m \u001b[43m \u001b[49m\u001b[43meffective_batch_size\u001b[49m\u001b[43m=\u001b[49m\u001b[43meffective_batch_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 361\u001b[39m \u001b[43m \u001b[49m\u001b[43mlearning_rate\u001b[49m\u001b[43m=\u001b[49m\u001b[43mlearning_rate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 362\u001b[39m \u001b[43m \u001b[49m\u001b[43mmax_seq_len\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmax_seq_len\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 363\u001b[39m \u001b[43m \u001b[49m\u001b[43mmax_tokens_per_gpu\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmax_tokens_per_gpu\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 364\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_output_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_output_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 365\u001b[39m \u001b[43m \u001b[49m\u001b[43msave_samples\u001b[49m\u001b[43m=\u001b[49m\u001b[43msave_samples\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 366\u001b[39m \u001b[43m \u001b[49m\u001b[43mwarmup_steps\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwarmup_steps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 367\u001b[39m \u001b[43m 
\u001b[49m\u001b[43maccelerate_full_state_at_epoch\u001b[49m\u001b[43m=\u001b[49m\u001b[43maccelerate_full_state_at_epoch\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 368\u001b[39m \u001b[43m \u001b[49m\u001b[43mcheckpoint_at_epoch\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcheckpoint_at_epoch\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 369\u001b[39m \u001b[43m \u001b[49m\u001b[43mis_pretraining\u001b[49m\u001b[43m=\u001b[49m\u001b[43mis_pretraining\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 370\u001b[39m \u001b[43m \u001b[49m\u001b[43mblock_size\u001b[49m\u001b[43m=\u001b[49m\u001b[43mblock_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 371\u001b[39m \u001b[43m \u001b[49m\u001b[43mdocument_column_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdocument_column_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 372\u001b[39m \u001b[43m \u001b[49m\u001b[43mbeta1\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbeta1\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 373\u001b[39m \u001b[43m \u001b[49m\u001b[43mbeta2\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbeta2\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 374\u001b[39m \u001b[43m \u001b[49m\u001b[43meps\u001b[49m\u001b[43m=\u001b[49m\u001b[43meps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 375\u001b[39m \u001b[43m \u001b[49m\u001b[43mweight_decay\u001b[49m\u001b[43m=\u001b[49m\u001b[43mweight_decay\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 376\u001b[39m \u001b[43m \u001b[49m\u001b[43mnproc_per_node\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnproc_per_node\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 377\u001b[39m \u001b[43m \u001b[49m\u001b[43mnnodes\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnnodes\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 378\u001b[39m \u001b[43m \u001b[49m\u001b[43mnode_rank\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnode_rank\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 379\u001b[39m \u001b[43m \u001b[49m\u001b[43mrdzv_id\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrdzv_id\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 380\u001b[39m \u001b[43m 
\u001b[49m\u001b[43mrdzv_endpoint\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrdzv_endpoint\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 381\u001b[39m \u001b[43m \u001b[49m\u001b[43mmaster_addr\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmaster_addr\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 382\u001b[39m \u001b[43m \u001b[49m\u001b[43mmaster_port\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmaster_port\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 383\u001b[39m \u001b[43m \u001b[49m\u001b[43mwandb_project\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwandb_project\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 384\u001b[39m \u001b[43m \u001b[49m\u001b[43mwandb_entity\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwandb_entity\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 385\u001b[39m \u001b[43m \u001b[49m\u001b[43mwandb_run_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwandb_run_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 386\u001b[39m \u001b[43m \u001b[49m\u001b[43mtensorboard_log_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtensorboard_log_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 387\u001b[39m \u001b[43m \u001b[49m\u001b[43mmlflow_tracking_uri\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmlflow_tracking_uri\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 388\u001b[39m \u001b[43m \u001b[49m\u001b[43mmlflow_experiment_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmlflow_experiment_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 389\u001b[39m \u001b[43m \u001b[49m\u001b[43mmlflow_run_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmlflow_run_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 390\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\n\u001b[32m 391\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/training_hub/algorithms/sft.py:213\u001b[39m, in \u001b[36mSFTAlgorithm.train\u001b[39m\u001b[34m(self, model_path, data_path, ckpt_output_dir, num_epochs, 
effective_batch_size, learning_rate, max_seq_len, max_tokens_per_gpu, data_output_dir, save_samples, warmup_steps, accelerate_full_state_at_epoch, checkpoint_at_epoch, is_pretraining, block_size, document_column_name, beta1, beta2, eps, weight_decay, nproc_per_node, nnodes, node_rank, rdzv_id, rdzv_endpoint, master_addr, master_port, wandb_project, wandb_entity, wandb_run_name, tensorboard_log_dir, mlflow_tracking_uri, mlflow_experiment_name, mlflow_run_name, **kwargs)\u001b[39m\n\u001b[32m 209\u001b[39m params[key] = value\n\u001b[32m 211\u001b[39m params.update(kwargs)\n\u001b[32m--> \u001b[39m\u001b[32m213\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mbackend\u001b[49m\u001b[43m.\u001b[49m\u001b[43mexecute_training\u001b[49m\u001b[43m(\u001b[49m\u001b[43mparams\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/training_hub/algorithms/sft.py:70\u001b[39m, in \u001b[36mInstructLabTrainingSFTBackend.execute_training\u001b[39m\u001b[34m(self, algorithm_params)\u001b[39m\n\u001b[32m 67\u001b[39m torchrun_args = TorchrunArgs(**final_torchrun_params)\n\u001b[32m 69\u001b[39m \u001b[38;5;66;03m# Execute training\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m70\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mrun_training\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 71\u001b[39m \u001b[43m \u001b[49m\u001b[43mtorch_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtorchrun_args\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 72\u001b[39m \u001b[43m \u001b[49m\u001b[43mtrain_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtraining_args\u001b[49m\n\u001b[32m 73\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/instructlab/training/__init__.py:41\u001b[39m, in \u001b[36mrun_training\u001b[39m\u001b[34m(torch_args, train_args)\u001b[39m\n\u001b[32m 
38\u001b[39m \u001b[38;5;66;03m# Local\u001b[39;00m\n\u001b[32m 39\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01m.\u001b[39;00m\u001b[34;01mmain_ds\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m run_training\n\u001b[32m---> \u001b[39m\u001b[32m41\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mrun_training\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtorch_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtorch_args\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtrain_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtrain_args\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py:841\u001b[39m, in \u001b[36mrun_training\u001b[39m\u001b[34m(torch_args, train_args)\u001b[39m\n\u001b[32m 839\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m interrupt\n\u001b[32m 840\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m failure:\n\u001b[32m--> \u001b[39m\u001b[32m841\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\n\u001b[32m 842\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mSuffered a failure during distributed training. Please see the training logs for more context.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 843\u001b[39m )\n", + "\u001b[31mRuntimeError\u001b[39m: Suffered a failure during distributed training. Please see the training logs for more context." 
+ ] + } + ], + "source": [ + "# =============================================================================\n", + "# TRAINING EXECUTION\n", + "# =============================================================================\n", + "\n", + "print(\"\ud83d\ude80 Starting SFT Training\")\n", + "print(\"=\" * 60)\n", + "print(f\"Experiment: {full_experiment_name}\")\n", + "print(f\"Model: {selected_example['model_name']}\")\n", + "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node \u00d7 {nnodes} nodes)\")\n", + "print(f\"Configuration: {dist_config['description']}\")\n", + "print()\n", + "\n", + "# Prepare all training parameters\n", + "training_params = {\n", + " # Required parameters\n", + " 'model_path': model_path,\n", + " 'data_path': data_path,\n", + " 'ckpt_output_dir': ckpt_output_dir,\n", + " \n", + " # Core training parameters\n", + " 'num_epochs': num_epochs,\n", + " 'effective_batch_size': effective_batch_size,\n", + " 'learning_rate': learning_rate,\n", + " 'max_seq_len': max_seq_len,\n", + " 'max_tokens_per_gpu': max_tokens_per_gpu,\n", + " \n", + " # Data and processing parameters\n", + " 'data_output_dir': data_output_dir,\n", + " 'warmup_steps': warmup_steps,\n", + " 'save_samples': save_samples,\n", + " \n", + " # Checkpointing parameters\n", + " 'checkpoint_at_epoch': checkpoint_at_epoch,\n", + " 'accelerate_full_state_at_epoch': accelerate_full_state_at_epoch,\n", + " \n", + " # Distributed training parameters\n", + " 'nproc_per_node': nproc_per_node,\n", + " 'nnodes': nnodes,\n", + " 'node_rank': node_rank,\n", + " 'rdzv_id': rdzv_id,\n", + " 'rdzv_endpoint': rdzv_endpoint,\n", + "\n", + " 'disable_flash_attn': True\n", + "}\n", + "\n", + "# Display final configuration summary\n", + "print(\"\ud83d\udccb Final Training Configuration:\")\n", + "for key, value in training_params.items():\n", + " print(f\" {key}: {value}\")\n", + "\n", + "print(\"\\n\" + \"=\"*60)\n", + "print(\"\u23f3 Training starting...\")\n", + "print(\"=\"*60)\n", + 
"\n", + "# Execute training\n", + "start_time = time.time()\n", + "\n", + "try:\n", + " result = sft(**training_params)\n", + " \n", + " end_time = time.time()\n", + " duration = end_time - start_time\n", + " \n", + " print(\"\\n\" + \"=\"*60)\n", + " print(\"\u2705 Training completed successfully!\")\n", + " print(f\"\u23f1\ufe0f Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n", + " print(f\"\ud83d\udcc1 Checkpoints saved to: {ckpt_output_dir}\")\n", + " print(\"=\"*60)\n", + " \n", + "except Exception as e:\n", + " end_time = time.time()\n", + " duration = end_time - start_time\n", + " \n", + " print(\"\\n\" + \"=\"*60)\n", + " print(f\"\u274c Training failed after {duration/60:.1f} minutes\")\n", + " print(f\"Error: {e}\")\n", + " print(\"=\"*60)\n", + " \n", + " print(\"\\n\ud83d\udd0d Quick Troubleshooting Checklist:\")\n", + " print(\" \u25a1 Check that model_path exists or is a valid HuggingFace model name\")\n", + " print(\" \u25a1 Verify data_path points to valid JSONL file\")\n", + " print(\" \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\")\n", + " print(\" \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\")\n", + " print(\" \u25a1 For multi-node: verify network connectivity and endpoints\")\n", + " print(\" \u25a1 Check that all file paths are accessible from the training process\")\n", + " \n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Post-Training Analysis\n", + "\n", + "After training completes, let's analyze the results and provide guidance for next steps." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# POST-TRAINING ANALYSIS AND NEXT STEPS\n", + "# =============================================================================\n", + "\n", + "print(\"\ud83d\udcca Post-Training Analysis\")\n", + "print(\"=\" * 50)\n", + "\n", + "# Check for saved checkpoints\n", + "checkpoint_dir = f\"{ckpt_output_dir}/hf_format\"\n", + "\n", + "if os.path.exists(checkpoint_dir):\n", + " checkpoints = [d for d in os.listdir(checkpoint_dir) \n", + " if os.path.isdir(os.path.join(checkpoint_dir, d))]\n", + " \n", + " if checkpoints:\n", + " print(f\"\u2705 Found {len(checkpoints)} checkpoint(s):\")\n", + " for ckpt in sorted(checkpoints):\n", + " ckpt_path = os.path.join(checkpoint_dir, ckpt)\n", + " print(f\" \ud83d\udcc1 {ckpt}\")\n", + " \n", + " # Identify the final checkpoint\n", + " final_checkpoint = sorted(checkpoints)[-1]\n", + " final_checkpoint_path = os.path.join(checkpoint_dir, final_checkpoint)\n", + " \n", + " print(f\"\\n\ud83c\udfaf Final model checkpoint: {final_checkpoint_path}\")\n", + " \n", + " # Provide model loading example\n", + " print(f\"\\n\ud83d\udcbb Model Loading Example:\")\n", + " print(f\"```python\")\n", + " print(f\"from transformers import AutoModelForCausalLM, AutoTokenizer\")\n", + " print(f\"\")\n", + " print(f\"# Load your fine-tuned model\")\n", + " print(f\"model = AutoModelForCausalLM.from_pretrained('{final_checkpoint_path}')\")\n", + " print(f\"tokenizer = AutoTokenizer.from_pretrained('{final_checkpoint_path}')\")\n", + " print(f\"\")\n", + " print(f\"# Generate text\")\n", + " print(f\"inputs = tokenizer('Your prompt here:', return_tensors='pt')\")\n", + " print(f\"outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)\")\n", + " print(f\"response = tokenizer.decode(outputs[0], skip_special_tokens=True)\")\n", + " 
print(f\"print(response)\")\n", + " print(f\"```\")\n", + " else:\n", + " print(f\"\u274c No checkpoints found in {checkpoint_dir}\")\n", + "else:\n", + " print(f\"\u274c Checkpoint directory not found: {checkpoint_dir}\")\n", + "\n", + "# Training summary\n", + "print(f\"\\n\ud83d\udcc8 Training Summary:\")\n", + "print(f\" Model: {selected_example['model_name']}\")\n", + "print(f\" Epochs: {num_epochs}\")\n", + "print(f\" Global Batch Size: {effective_batch_size}\")\n", + "print(f\" Learning Rate: {learning_rate}\")\n", + "print(f\" Max Tokens per GPU: {max_tokens_per_gpu:,}\")\n", + "print(f\" Max Sequence Length: {max_seq_len:,}\")\n", + "print(f\" Total GPUs: {total_gpus}\")\n", + "print(f\" Distributed Config: {dist_config['description']}\")\n", + "\n", + "# Next steps recommendations\n", + "print(f\"\\n\ud83d\ude80 Recommended Next Steps:\")\n", + "print(f\" 1. \ud83e\uddea Test your model with sample inputs to verify it's working\")\n", + "print(f\" 2. \ud83d\udcca Evaluate performance on your validation/test datasets\")\n", + "print(f\" 3. \ud83d\udd04 Compare outputs with the original base model\")\n", + "print(f\" 4. \ud83c\udfaf Fine-tune hyperparameters if needed (learning rate, batch size)\")\n", + "print(f\" 5. \ud83d\udcdd Document your configuration and results for reproducibility\")\n", + "print(f\" 6. 
\ud83d\udea2 Deploy for inference using your preferred serving framework\")\n", + "\n", + "# Performance optimization tips\n", + "print(f\"\\n\u26a1 Performance Optimization Tips:\")\n", + "print(f\" \u2022 If training was slow: increase max_tokens_per_gpu or effective_batch_size\")\n", + "print(f\" \u2022 If you hit OOM errors: reduce max_tokens_per_gpu or effective_batch_size\")\n", + "print(f\" \u2022 For better convergence: try different learning rates or warmup_steps\")\n", + "print(f\" \u2022 For production training: consider using the script version for better logging\")\n", + "\n", + "print(f\"\\n\u2728 SFT Training Complete!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parameter Reference Summary\n", + "\n", + "Quick reference for all SFT parameters and their purposes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Core Parameters\n", + "\n", + "| Parameter | Required | Description | Example Values |\n", + "|-----------|----------|-------------|----------------|\n", + "| `model_path` | \u2705 | Path to the model to fine-tune | `\"Qwen/Qwen2.5-7B\"`, `\"/path/to/model\"` |\n", + "| `data_path` | \u2705 | Path to the training data | `\"/path/to/train.jsonl\"` |\n", + "| `ckpt_output_dir` | \u2705 | Directory to save checkpoints | `\"/path/to/checkpoints\"` |\n", + "| `num_epochs` | \u274c | Number of training epochs | `1`, `3`, `5` |\n", + "| `effective_batch_size` | \u274c | Effective batch size for training | `64`, `128`, `256` |\n", + "| `learning_rate` | \u274c | Learning rate for training | `1e-5`, `2e-5`, `5e-6` |\n", + "| `max_seq_len` | \u274c | Maximum sequence length | `2048`, `8192`, `16384` |\n", + "| `max_tokens_per_gpu` | \u274c | Maximum tokens per GPU in a mini-batch (hard-cap for memory) | `15000`, `25000`, `40000` |\n", + "\n", + "### Data Processing Parameters\n", + "\n", + "| Parameter | Description | Default/Example |\n", + 
"|-----------|-------------|------------------|\n", + "| `data_output_dir` | Directory to save processed data | `\"/dev/shm\"` (RAM disk) |\n", + "| `warmup_steps` | Number of warmup steps | `100`, `500` |\n", + "\n", + "### Checkpointing Parameters\n", + "\n", + "| Parameter | Description | Recommended |\n", + "|-----------|-------------|-------------|\n", + "| `checkpoint_at_epoch` | Whether to checkpoint at each epoch | `True` |\n", + "| `accelerate_full_state_at_epoch` | Whether to save full state at epoch for automatic checkpoint resumption | `True` |\n", + "| `save_samples` | Number of samples to save after training (0 disables) | `1000`, `0` (disabled) |\n", + "\n", + "### Distributed Training Parameters\n", + "\n", + "| Parameter | Description | Example Values |\n", + "|-----------|-------------|----------------|\n", + "| `nproc_per_node` | Number of processes (GPUs) per node | `1`, `4`, `8` |\n", + "| `nnodes` | Total number of nodes | `1`, `2`, `4` |\n", + "| `node_rank` | Rank of this node (0 to nnodes-1) | `0` (master), `1`, `2`... |\n", + "| `rdzv_id` | Unique job ID for rendezvous | `42`, `100` |\n", + "| `rdzv_endpoint` | Master node endpoint for multi-node training | `\"127.0.0.1:29500\"` |\n", + "\n", + "### Memory Optimization Guidelines\n", + "\n", + "- **Start conservative**: Begin with lower `max_tokens_per_gpu` values and increase gradually\n", + "- **Monitor usage**: Watch GPU memory during training and adjust accordingly\n", + "- **Balance batch size**: Larger `effective_batch_size` can improve training stability\n", + "- **Use RAM disk**: Set `data_output_dir=\"/dev/shm\"` for faster data loading\n", + "\n", + "### Multi-Node Setup Checklist\n", + "\n", + "1. \u2705 Ensure network connectivity between all nodes\n", + "2. \u2705 Use the same `rdzv_id` and `rdzv_endpoint` on all nodes\n", + "3. \u2705 Set unique `node_rank` for each node (0, 1, 2, ...)\n", + "4. \u2705 Verify all nodes can access model and data paths\n", + "5. 
\u2705 Start training simultaneously on all nodes\n", + "\n", + "### Popular Model Examples\n", + "\n", + "| Model | HuggingFace Path | Example Config |\n", + "|-------|------------------|----------------|\n", + "| Qwen 2.5 7B | `Qwen/Qwen2.5-7B-Instruct` | `max_tokens_per_gpu=20000` |\n", + "| Llama 3.1 8B | `meta-llama/Meta-Llama-3.1-8B-Instruct` | `max_tokens_per_gpu=18000` |\n", + "| Phi 4 Mini | `microsoft/Phi-4-mini-instruct` | `max_tokens_per_gpu=25000` |\n", + "\n", + "### Script Alternative\n", + "\n", + "For production workloads or long-running training, use the script version:\n", + "\n", + "```bash\n", + "python scripts/sft_qwen_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints\n", + "\n", + "python scripts/sft_llama_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints\n", + "\n", + "python scripts/sft_phi_example.py \\\n", + " --data-path /path/to/data.jsonl \\\n", + " --ckpt-output-dir /path/to/checkpoints\n", + "```" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.12", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx new file mode 100644 index 0000000..76b1599 --- /dev/null +++ b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx @@ -0,0 +1,198 @@ +--- +weight: 25 +--- + +# Fine-tuning LLMs with Training Hub + +## Background + +`training_hub` is a Python library that provides a unified, high-level API for running Supervised Fine-Tuning (SFT) and Orthogonal Subspace 
Fine-Tuning (OSFT) on large language models. It abstracts away the complexity of distributed training configuration, memory management, and backend orchestration, letting you focus on experiment parameters. + +**Key benefits:** + +- **Unified API**: A single function call (`sft(...)` or `osft(...)`) handles single-GPU, multi-GPU, and multi-node training without changing your code. +- **Automatic memory management**: The `max_tokens_per_gpu` parameter caps GPU memory usage and automatically computes micro-batch size and gradient accumulation to maintain your target `effective_batch_size`. +- **OSFT for continual learning**: The `osft` function implements [Nayak et al. (2025), arXiv:2504.07097](https://arxiv.org/abs/2504.07097), which restricts weight updates to orthogonal subspaces — preventing catastrophic forgetting without replay buffers or supplementary datasets. +- **Production-ready**: Built-in checkpointing, experiment tracking, and Liger kernel support for throughput efficiency. + +### SFT vs OSFT + +| Aspect | SFT | OSFT | +|--------|-----|------| +| **Use case** | Initial instruction tuning, base model fine-tuning | Continual domain adaptation of already-tuned models | +| **Catastrophic forgetting** | Requires mixed/replay data to mitigate | Prevented algorithmically | +| **Key parameter** | Standard hyperparameters | `unfreeze_rank_ratio` (0.0–1.0) | +| **Backend** | instructlab-training | mini-trainer | + +## Requirements + +- **Alauda AI** and **Alauda AI Workbench** must be installed in your cluster. +- A Workbench (Notebook) instance with: + - Access to install Python packages from the internet (or a configured internal PyPI mirror). + - GPU resources attached (at least one NVIDIA GPU). + - Sufficient shared storage for model checkpoints. +- A HuggingFace model (local path or model name resolvable from the instance). +- Training data in **JSONL format** (see [Data Format](#data-format) below). 
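A pre-flight check on the JSONL requirement can save a failed training launch. The following is a minimal sketch, assuming the conversation layout and the role names listed in the Data Format section; `validate_jsonl_lines` and `ALLOWED_ROLES` are hypothetical names used for illustration and are not part of `training_hub`. It does not cover pre-processed datasets that carry `input_ids` and `labels` instead of `messages`.

```python
import json

# Role names as listed in the Data Format section of this guide.
ALLOWED_ROLES = {"system", "user", "assistant", "pretraining"}


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems.

    Hypothetical helper for illustration only; not a training_hub API.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        for message in messages:
            if not isinstance(message, dict):
                problems.append((number, "message is not an object"))
                continue
            role = message.get("role")
            if not isinstance(role, str) or role not in ALLOWED_ROLES:
                problems.append((number, f"unsupported role: {role!r}"))
            if not isinstance(message.get("content"), str):
                problems.append((number, "message 'content' must be a string"))
    return problems
```

Passing an open file handle (for example the result of `open("/data/train.jsonl", encoding="utf-8")`) works, because the function only iterates over lines.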
+ +## Data Format + +Training data must be a JSON Lines (`.jsonl`) file where each line is a conversation: + +```json +{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a subset of AI..."}]} +``` + +Supported `role` values: `system`, `user`, `assistant`, `pretraining`. + +**Masking behavior:** + +- **SFT (default)** — only assistant responses contribute to the training loss. Add `"unmask": true` to a sample to include all non-system content in the loss (pretraining style). +- **OSFT** — controlled via the `unmask_messages` parameter (`False` by default; set `True` for pretraining style). + +Pre-processed datasets with `input_ids` and `labels` fields are also supported via `use_processed_dataset=True`. + +## Download Notebooks and Run Examples + +Two comprehensive tutorial notebooks are provided. Download them to your Workbench instance and execute them cell by cell. + +| Notebook | Algorithm | Download | +|----------|-----------|----------| +| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [sft_comprehensive_tutorial.ipynb](./sft_comprehensive_tutorial.ipynb) | +| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [osft_comprehensive_tutorial.ipynb](./osft_comprehensive_tutorial.ipynb) | + +### Step 1 — Install Dependencies + +Open a terminal in your Workbench instance and install `training-hub`: + +```bash +pip install training-hub +``` + +### Step 2 — Upload or Prepare Data + +Place your `.jsonl` training file in a path accessible to the notebook, for example `/data/train.jsonl`. + +### Step 3 — Open and Configure the Notebook + +Open the downloaded notebook in your Workbench instance. 
The key cells to configure are: + +**Select your model** (both notebooks): + +```python +# Change to your model's HuggingFace name or local path +model_path = "Qwen/Qwen2.5-7B-Instruct" +``` + +Bundled model presets cover Qwen 2.5 7B, Llama 3.1 8B, Phi 4 Mini, and generic 7B/small models. + +**Set required paths** (both notebooks): + +```python +data_path = "/path/to/your/training_data.jsonl" +ckpt_output_dir = "/path/to/checkpoints/my_experiment" +``` + +**OSFT only — set the orthogonality ratio:** + +```python +unfreeze_rank_ratio = 0.25 # 0.1–0.3 conservative, 0.3–0.5 balanced +``` + +**Select distributed configuration:** + +```python +selected_distributed = "single_node_8gpu" # or "single_gpu_dev", "multi_node_master", etc. +``` + +### Step 4 — Execute Training + +Run all cells in sequence. The final training cell calls either: + +```python +# SFT +from training_hub import sft +result = sft( + model_path=model_path, + data_path=data_path, + ckpt_output_dir=ckpt_output_dir, + effective_batch_size=128, + max_tokens_per_gpu=20000, + max_seq_len=16384, + learning_rate=1e-5, + num_epochs=3, + nproc_per_node=8, + ... +) + +# OSFT +from training_hub import osft +result = osft( + model_path=model_path, + data_path=data_path, + ckpt_output_dir=ckpt_output_dir, + unfreeze_rank_ratio=0.25, + effective_batch_size=128, + max_tokens_per_gpu=10000, + max_seq_len=8196, + learning_rate=5e-6, + num_epochs=1, + nproc_per_node=8, + ... +) +``` + +Checkpoints are written to `ckpt_output_dir` at the end of each epoch (configurable via `checkpoint_at_epoch`). 
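After training, the tutorial notebooks look for saved models under an `hf_format` subdirectory of `ckpt_output_dir`. The sketch below locates the newest checkpoint under that assumption; `latest_checkpoint` is a hypothetical helper, and the directory layout may differ between backends or `training_hub` versions. It selects by modification time rather than by sorted name, since directory names with numeric suffixes do not sort numerically as strings.

```python
import os


def latest_checkpoint(ckpt_output_dir, subdir="hf_format"):
    """Return the most recently modified checkpoint directory, or None if absent.

    Hypothetical helper; the 'hf_format' layout is taken from the tutorial
    notebooks and is an assumption, not a documented guarantee.
    """
    checkpoint_dir = os.path.join(ckpt_output_dir, subdir)
    if not os.path.isdir(checkpoint_dir):
        return None
    # Keep only subdirectories; each one is treated as a saved checkpoint.
    candidates = [
        os.path.join(checkpoint_dir, name)
        for name in os.listdir(checkpoint_dir)
        if os.path.isdir(os.path.join(checkpoint_dir, name))
    ]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)
```

The returned path can then be passed to `AutoModelForCausalLM.from_pretrained(...)` and `AutoTokenizer.from_pretrained(...)`, as the notebooks' post-training cell suggests.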
+ +## Key Parameters + +### Common Parameters (SFT and OSFT) + +| Parameter | Required | Description | +|-----------|----------|-------------| +| `model_path` | Yes | HuggingFace model name or local path | +| `data_path` | Yes | Path to JSONL training data | +| `ckpt_output_dir` | Yes | Directory to save checkpoints | +| `effective_batch_size` | Yes | Global effective batch size | +| `max_tokens_per_gpu` | Yes | Per-GPU token budget; controls memory and auto-computes micro-batch size | +| `max_seq_len` | Yes | Maximum sequence length | +| `learning_rate` | Yes | Optimizer learning rate | +| `num_epochs` | No | Training epochs (default: `1`) | +| `lr_scheduler` | No | Scheduler type, e.g. `"cosine"` | +| `warmup_steps` | No | Linear warmup steps (default: `0`) | +| `use_liger` | No | Enable Liger kernels for efficiency (default: `True` for OSFT) | +| `seed` | No | Random seed (default: `42`) | +| `data_output_dir` | No | Processed data cache dir; use `"/dev/shm"` for RAM-disk speed | +| `use_processed_dataset` | No | Skip tokenization if data has `input_ids`/`labels` | +| `checkpoint_at_epoch` | No | Save checkpoint each epoch (default: `True`) | +| `save_final_checkpoint` | No | Save a final checkpoint after training (default: `True`) | +| `nproc_per_node` | No | GPUs per node | +| `nnodes` | No | Total nodes (default: `1`) | +| `node_rank` | No | This node's rank (default: `0`) | +| `rdzv_id` | No | Rendezvous job ID | +| `rdzv_endpoint` | No | Master node `host:port` for multi-node | + +### OSFT-specific Parameters + +| Parameter | Required | Description | +|-----------|----------|-------------| +| `unfreeze_rank_ratio` | Yes | Fraction of each weight matrix that can be updated (0.0–1.0). Lower = more preservation. 
| +| `unmask_messages` | No | If `True`, trains on all non-system content (pretraining style) | +| `target_patterns` | No | Substring patterns to restrict OSFT to specific layers (default: `None`, all layers) | + +### Multi-node Training + +For multi-node jobs, run the notebook (or equivalent script) on every node simultaneously with matching `rdzv_id` and `rdzv_endpoint`, varying only `node_rank` per node: + +```python +# Master node (node_rank=0) +nproc_per_node = 8 +nnodes = 2 +node_rank = 0 +rdzv_id = 42 +rdzv_endpoint = "10.0.0.1:29500" + +# Worker node (node_rank=1) +node_rank = 1 # all other params identical +``` + +All nodes must have network connectivity to the `rdzv_endpoint` before training begins. From 96f8d37d3b21ff748775c8e1c209318c3a3be8b2 Mon Sep 17 00:00:00 2001 From: Wu Yi Date: Tue, 24 Mar 2026 09:54:11 +0800 Subject: [PATCH 2/3] fix link --- .../how_to/osft_comprehensive_tutorial.ipynb | 162 +++++++++--------- .../how_to/training_hub_fine_tuning.mdx | 6 +- 2 files changed, 84 insertions(+), 84 deletions(-) diff --git a/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb index ebd6edd..2313fd7 100644 --- a/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb +++ b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb @@ -142,11 +142,11 @@ "target_patterns = None # Default: applies OSFT to all appropriate layers (RECOMMENDED)\n", "```\n", "\n", - "**\u26a0\ufe0f Important:** This is an expert-level parameter. Unless you have deep knowledge of model architecture and a specific reason to limit OSFT to certain layers, **leave it as `None`**.\n", + "**⚠️ Important:** This is an expert-level parameter. 
Unless you have deep knowledge of model architecture and a specific reason to limit OSFT to certain layers, **leave it as `None`**.\n", "\n", "If you do need to use it, it performs simple substring matching on module names:\n", - "- `target_patterns = [\"attention\"]` \u2192 Targets modules with \"attention\" in the name\n", - "- `target_patterns = [\"mlp\"]` \u2192 Targets modules with \"mlp\" in the name\n", + "- `target_patterns = [\"attention\"]` → Targets modules with \"attention\" in the name\n", + "- `target_patterns = [\"mlp\"]` → Targets modules with \"mlp\" in the name\n", "\n", "**For 99% of users:** Just use the default (`None`) and let OSFT handle layer selection automatically. The algorithm knows what it's doing.\n" ] @@ -292,10 +292,10 @@ "# These are example configurations - adjust based on your hardware and requirements\n", "# =============================================================================\n", "\n", - "# Example 1: Qwen 2.5 7B Instruct\n", + "# Example 1: Qwen 3 0.6B\n", "qwen_example = {\n", - " \"model_name\": \"Qwen 2.5 7B Instruct\",\n", - " \"model_path\": \"/opt/app-root/src/Qwen3-0.6B\", # HuggingFace model name or local path\n", + " \"model_name\": \"Qwen 3 0.6B Instruct\",\n", + " \"model_path\": \"Qwen/Qwen3-0.6B\", # HuggingFace model name or local path\n", " \"example_unfreeze_rank_ratio\": 0.25, # Conservative for preserving multilingual capabilities\n", " \"example_max_tokens_per_gpu\": 2048,\n", " \"example_max_seq_len\": 2048, # Qwen 2.5 supports long context\n", @@ -367,7 +367,7 @@ "print(f\"Example Batch Size: {selected_example['example_batch_size']:,}\")\n", "print(f\"Example Learning Rate: {selected_example['example_learning_rate']}\")\n", "print(f\"Notes: {selected_example['notes']}\")\n", - "print(\"\\n\ud83d\udca1 Remember: OSFT preserves original capabilities without needing replay buffers!\")\n", + "print(\"\\n💡 Remember: OSFT preserves original capabilities without needing replay buffers!\")\n", "print(\" 
Adjust unfreeze_rank_ratio based on preservation vs adaptation needs.\")" ] }, @@ -409,15 +409,15 @@ "max_seq_len = selected_example[\"example_max_seq_len\"] # Maximum sequence length\n", "learning_rate = selected_example[\"example_learning_rate\"] # Learning rate for training\n", "\n", - "print(\"\ud83d\udccb Required Parameters (all must be specified):\")\n", - "print(f\" \u2022 model_path: {model_path}\")\n", - "print(f\" \u2022 data_path: {data_path}\")\n", - "print(f\" \u2022 ckpt_output_dir: {ckpt_output_dir}\")\n", - "print(f\" \u2022 unfreeze_rank_ratio: {unfreeze_rank_ratio}\")\n", - "print(f\" \u2022 effective_batch_size: {effective_batch_size}\")\n", - "print(f\" \u2022 max_tokens_per_gpu: {max_tokens_per_gpu:,}\")\n", - "print(f\" \u2022 max_seq_len: {max_seq_len:,}\")\n", - "print(f\" \u2022 learning_rate: {learning_rate}\")\n", + "print(\"📋 Required Parameters (all must be specified):\")\n", + "print(f\" • model_path: {model_path}\")\n", + "print(f\" • data_path: {data_path}\")\n", + "print(f\" • ckpt_output_dir: {ckpt_output_dir}\")\n", + "print(f\" • unfreeze_rank_ratio: {unfreeze_rank_ratio}\")\n", + "print(f\" • effective_batch_size: {effective_batch_size}\")\n", + "print(f\" • max_tokens_per_gpu: {max_tokens_per_gpu:,}\")\n", + "print(f\" • max_seq_len: {max_seq_len:,}\")\n", + "print(f\" • learning_rate: {learning_rate}\")\n", "print()\n", "\n", "# =============================================================================\n", @@ -427,11 +427,11 @@ "target_patterns = None # Optional: Patterns to match specific modules for OSFT\n", "# Example: [\"*attention*\", \"*mlp*\"] to target attention and MLP layers\n", "\n", - "print(\"\ud83d\udd27 OSFT-Specific Parameters:\")\n", + "print(\"🔧 OSFT-Specific Parameters:\")\n", "print(f\" unfreeze_rank_ratio: {unfreeze_rank_ratio} - Controls how much of each matrix is unfrozen\")\n", - "print(f\" \u2022 0.1-0.3: Conservative, maximum preservation\")\n", - "print(f\" \u2022 0.3-0.5: Balanced 
adaptation\")\n", - "print(f\" \u2022 >0.5: Rarely needed for typical use cases\")\n", + "print(f\" • 0.1-0.3: Conservative, maximum preservation\")\n", + "print(f\" • 0.3-0.5: Balanced adaptation\")\n", + "print(f\" • >0.5: Rarely needed for typical use cases\")\n", "print(f\" target_patterns: {target_patterns} - Optional patterns for selecting specific modules\")\n", "print()\n", "\n", @@ -446,7 +446,7 @@ "lr_scheduler_kwargs = {} # Scheduler parameters\n", "warmup_steps = 0 # Number of warmup steps\n", "\n", - "print(\"\ud83c\udfaf Training Hyperparameters:\")\n", + "print(\"🎯 Training Hyperparameters:\")\n", "print(f\" effective_batch_size: {effective_batch_size} - Effective batch size for training\")\n", "print(f\" learning_rate: {learning_rate} - Learning rate for model updates\")\n", "print(f\" num_epochs: {num_epochs} - Number of training epochs\")\n", @@ -462,7 +462,7 @@ "\n", "use_liger = True # Use Liger kernels for efficiency\n", "\n", - "print(\"\u26a1 Memory and Performance Parameters:\")\n", + "print(\"⚡ Memory and Performance Parameters:\")\n", "print(f\" max_tokens_per_gpu: {max_tokens_per_gpu:,} - Maximum tokens per GPU (hard-cap for memory)\")\n", "print(f\" max_seq_len: {max_seq_len:,} - Maximum sequence length\")\n", "print(f\" use_liger: {use_liger} - Use Liger kernels for efficiency\")\n", @@ -476,7 +476,7 @@ "use_processed_dataset = False # Whether data is pre-processed\n", "unmask_messages = False # Whether to unmask all messages for pretraining-style learning\n", "\n", - "print(\"\ud83d\udcbe Data Processing Parameters:\")\n", + "print(\"💾 Data Processing Parameters:\")\n", "print(f\" data_path: '{data_path}' - Path to training data (JSONL format)\")\n", "print(f\" data_output_dir: '{data_output_dir}' - Directory to save processed data\")\n", "print(f\" use_processed_dataset: {use_processed_dataset} - Whether to use pre-processed data\")\n", @@ -490,7 +490,7 @@ "checkpoint_at_epoch = True # Whether to checkpoint at each epoch\n", 
"save_final_checkpoint = True # Whether to save final checkpoint\n", "\n", - "print(\"\ud83d\udcbe Checkpointing Parameters:\")\n", + "print(\"💾 Checkpointing Parameters:\")\n", "print(f\" ckpt_output_dir: '{ckpt_output_dir}' - Directory to save checkpoints\")\n", "print(f\" checkpoint_at_epoch: {checkpoint_at_epoch} - Whether to checkpoint at each epoch\")\n", "print(f\" save_final_checkpoint: {save_final_checkpoint} - Whether to save final checkpoint\")\n", @@ -568,7 +568,7 @@ "total_gpus = nproc_per_node * nnodes\n", "per_gpu_batch_size = effective_batch_size // total_gpus\n", "\n", - "print(\"\ud83d\udda5\ufe0f Distributed Training Parameters:\")\n", + "print(\"🖥️ Distributed Training Parameters:\")\n", "print(f\" Configuration: {dist_config['description']}\")\n", "print(f\" nproc_per_node: {nproc_per_node} - Number of processes (GPUs) per node\")\n", "print(f\" nnodes: {nnodes} - Total number of nodes\")\n", @@ -576,8 +576,8 @@ "print(f\" rdzv_id: {rdzv_id} - Unique job ID for rendezvous\")\n", "print(f\" rdzv_endpoint: '{rdzv_endpoint}' - Master node endpoint for multi-node training\")\n", "print()\n", - "print(f\"\ud83d\udcca Resource Calculation:\")\n", - "print(f\" Total GPUs: {total_gpus} ({nproc_per_node} \u00d7 {nnodes})\")\n", + "print(f\"📊 Resource Calculation:\")\n", + "print(f\" Total GPUs: {total_gpus} ({nproc_per_node} × {nnodes})\")\n", "print(f\" Effective batch size: {effective_batch_size}\")\n", "print(f\" Approximate per-GPU batch size: {per_gpu_batch_size}\")\n", "print(f\" (Actual micro-batch size determined automatically by gradient accumulation)\")\n", @@ -585,7 +585,7 @@ "\n", "# Multi-node setup instructions\n", "if nnodes > 1:\n", - " print(\"\ud83d\udd27 Multi-Node Setup Instructions:\")\n", + " print(\"🔧 Multi-Node Setup Instructions:\")\n", " print(f\" 1. Ensure all nodes can reach the master at {rdzv_endpoint}\")\n", " print(f\" 2. Use the same rdzv_id ({rdzv_id}) on all nodes\")\n", " print(f\" 3. 
Set node_rank to 0 for master, 1,2,3... for workers\")\n", @@ -593,11 +593,11 @@ " print()\n", "\n", "# OSFT-specific multi-node considerations\n", - "print(\"\ud83d\udcdd OSFT Multi-Node Considerations:\")\n", - "print(\" \u2022 OSFT works seamlessly across multiple nodes\")\n", - "print(\" \u2022 No special replay buffer coordination needed (unlike SFT)\")\n", - "print(\" \u2022 Each node processes its data portion with the same unfreeze_rank_ratio\")\n", - "print(\" \u2022 Gradients are synchronized automatically across all nodes\")\n", + "print(\"📝 OSFT Multi-Node Considerations:\")\n", + "print(\" • OSFT works seamlessly across multiple nodes\")\n", + "print(\" • No special replay buffer coordination needed (unlike SFT)\")\n", + "print(\" • Each node processes its data portion with the same unfreeze_rank_ratio\")\n", + "print(\" • Gradients are synchronized automatically across all nodes\")\n", "print()" ] }, @@ -620,18 +620,18 @@ "# TRAINING EXECUTION\n", "# =============================================================================\n", "\n", - "print(\"\ud83d\ude80 Starting OSFT Training\")\n", + "print(\"🚀 Starting OSFT Training\")\n", "print(\"=\" * 60)\n", "print(f\"Experiment: {full_experiment_name}\")\n", "print(f\"Model: {selected_example['model_name']}\")\n", - "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node \u00d7 {nnodes} nodes)\")\n", + "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node × {nnodes} nodes)\")\n", "print(f\"Configuration: {dist_config['description']}\")\n", "print(f\"Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n", "print()\n", - "print(\"\u2728 OSFT Advantages:\")\n", - "print(\" \u2022 No catastrophic forgetting\")\n", - "print(\" \u2022 No replay buffer needed\")\n", - "print(\" \u2022 Preserves original model capabilities\")\n", + "print(\"✨ OSFT Advantages:\")\n", + "print(\" • No catastrophic forgetting\")\n", + "print(\" • No replay buffer needed\")\n", + "print(\" • Preserves original model 
capabilities\")\n", "print()\n", "\n", "# Prepare all training parameters\n", @@ -677,13 +677,13 @@ "}\n", "\n", "# Display final configuration summary\n", - "print(\"\ud83d\udccb Final Training Configuration:\")\n", + "print(\"📋 Final Training Configuration:\")\n", "for key, value in training_params.items():\n", " if value is not None: # Only show non-None values\n", " print(f\" {key}: {value}\")\n", "\n", "print(\"\\n\" + \"=\"*60)\n", - "print(\"\u23f3 Training starting...\")\n", + "print(\"⏳ Training starting...\")\n", "print(\"=\"*60)\n", "\n", "# Execute training\n", @@ -696,34 +696,34 @@ " duration = end_time - start_time\n", " \n", " print(\"\\n\" + \"=\"*60)\n", - " print(\"\u2705 OSFT Training completed successfully!\")\n", - " print(f\"\u23f1\ufe0f Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n", - " print(f\"\ud83d\udcc1 Checkpoints saved to: {ckpt_output_dir}\")\n", + " print(\"✅ OSFT Training completed successfully!\")\n", + " print(f\"⏱️ Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n", + " print(f\"📁 Checkpoints saved to: {ckpt_output_dir}\")\n", " print(\"=\"*60)\n", " print()\n", - " print(\"\ud83c\udfaf What you've achieved with OSFT:\")\n", - " print(\" \u2022 Model adapted to new domain/task\")\n", - " print(\" \u2022 Original capabilities preserved\")\n", - " print(\" \u2022 No catastrophic forgetting occurred\")\n", - " print(\" \u2022 Ready for deployment without regression testing!\")\n", + " print(\"🎯 What you've achieved with OSFT:\")\n", + " print(\" • Model adapted to new domain/task\")\n", + " print(\" • Original capabilities preserved\")\n", + " print(\" • No catastrophic forgetting occurred\")\n", + " print(\" • Ready for deployment without regression testing!\")\n", " \n", "except Exception as e:\n", " end_time = time.time()\n", " duration = end_time - start_time\n", " \n", " print(\"\\n\" + \"=\"*60)\n", - " print(f\"\u274c Training failed after {duration/60:.1f} minutes\")\n", 
+ " print(f\"❌ Training failed after {duration/60:.1f} minutes\")\n", " print(f\"Error: {e}\")\n", " print(\"=\"*60)\n", " \n", - " print(\"\\n\ud83d\udd0d Quick Troubleshooting Checklist:\")\n", - " print(\" \u25a1 Check that model_path exists or is a valid HuggingFace model name\")\n", - " print(\" \u25a1 Verify data_path points to valid JSONL file\")\n", - " print(\" \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\")\n", - " print(\" \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\")\n", - " print(\" \u25a1 Try adjusting unfreeze_rank_ratio (lower = more preservation)\")\n", - " print(\" \u25a1 For multi-node: verify network connectivity and endpoints\")\n", - " print(\" \u25a1 Check that mini-trainer backend dependencies are installed\")\n", + " print(\"\\n🔍 Quick Troubleshooting Checklist:\")\n", + " print(\" □ Check that model_path exists or is a valid HuggingFace model name\")\n", + " print(\" □ Verify data_path points to valid JSONL file\")\n", + " print(\" □ Ensure ckpt_output_dir parent directory exists and is writable\")\n", + " print(\" □ Try reducing max_tokens_per_gpu if you see OOM errors\")\n", + " print(\" □ Try adjusting unfreeze_rank_ratio (lower = more preservation)\")\n", + " print(\" □ For multi-node: verify network connectivity and endpoints\")\n", + " print(\" □ Check that mini-trainer backend dependencies are installed\")\n", " \n", " raise\n" ] @@ -747,7 +747,7 @@ "# POST-TRAINING ANALYSIS AND NEXT STEPS\n", "# =============================================================================\n", "\n", - "print(\"\ud83d\udcca Post-Training Analysis\")\n", + "print(\"📊 Post-Training Analysis\")\n", "print(\"=\" * 50)\n", "\n", "# Check for saved checkpoints\n", @@ -758,19 +758,19 @@ " if os.path.isdir(os.path.join(checkpoint_dir, d))]\n", " \n", " if checkpoints:\n", - " print(f\"\u2705 Found {len(checkpoints)} checkpoint(s):\")\n", + " print(f\"✅ Found {len(checkpoints)} checkpoint(s):\")\n", " for ckpt in 
sorted(checkpoints):\n", " ckpt_path = os.path.join(checkpoint_dir, ckpt)\n", - " print(f\" \ud83d\udcc1 {ckpt}\")\n", + " print(f\" 📁 {ckpt}\")\n", " \n", " # Identify the final checkpoint\n", " final_checkpoint = sorted(checkpoints)[-1]\n", " final_checkpoint_path = os.path.join(checkpoint_dir, final_checkpoint)\n", " \n", - " print(f\"\\n\ud83c\udfaf Final model checkpoint: {final_checkpoint_path}\")\n", + " print(f\"\\n🎯 Final model checkpoint: {final_checkpoint_path}\")\n", " \n", " # Provide model loading example\n", - " print(f\"\\n\ud83d\udcbb Model Loading Example:\")\n", + " print(f\"\\n💻 Model Loading Example:\")\n", " print(f\"```python\")\n", " print(f\"from transformers import AutoModelForCausalLM, AutoTokenizer\")\n", " print(f\"\")\n", @@ -786,12 +786,12 @@ " print(f\"print(response)\")\n", " print(f\"```\")\n", " else:\n", - " print(f\"\u274c No checkpoints found in {checkpoint_dir}\")\n", + " print(f\"❌ No checkpoints found in {checkpoint_dir}\")\n", "else:\n", - " print(f\"\u274c Checkpoint directory not found: {checkpoint_dir}\")\n", + " print(f\"❌ Checkpoint directory not found: {checkpoint_dir}\")\n", "\n", "# Training summary\n", - "print(f\"\\n\ud83d\udcc8 Training Summary:\")\n", + "print(f\"\\n📈 Training Summary:\")\n", "print(f\" Model: {selected_example['model_name']}\")\n", "print(f\" Algorithm: OSFT (Orthogonal Subspace Fine-Tuning)\")\n", "print(f\" Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n", @@ -804,7 +804,7 @@ "print(f\" Distributed Config: {dist_config['description']}\")\n", "\n", "# OSFT-specific validation recommendations\n", - "print(f\"\\n\ud83e\uddea OSFT-Specific Validation Steps:\")\n", + "print(f\"\\n🧪 OSFT-Specific Validation Steps:\")\n", "print(f\" 1. **Test Original Capabilities**: Verify the model still performs well on\")\n", "print(f\" general tasks it was originally trained for\")\n", "print(f\" 2. 
**Test New Domain**: Confirm improved performance on your target domain\")\n",
@@ -814,17 +814,17 @@
 "print(f\" improvements without degradation\")\n",
 "\n",
 "# Next steps recommendations\n",
-"print(f\"\\n\ud83d\ude80 Recommended Next Steps:\")\n",
-"print(f\" 1. \ud83c\udfaf Test on domain-specific evaluation sets\")\n",
-"print(f\" 2. \ud83d\udcca Compare performance with base model on both general and domain tasks\")\n",
-"print(f\" 3. \ud83d\udd04 If more adaptation needed, slightly increase unfreeze_rank_ratio\")\n",
-"print(f\" 4. \ud83d\udca1 If too much change occurred, reduce unfreeze_rank_ratio\")\n",
-"print(f\" 5. \ud83d\udcdd Document the unfreeze_rank_ratio that works best for your use case\")\n",
-"print(f\" 6. \ud83d\udea2 Deploy with confidence - no catastrophic forgetting!\")\n",
+"print(f\"\\n🚀 Recommended Next Steps:\")\n",
+"print(f\" 1. 🎯 Test on domain-specific evaluation sets\")\n",
+"print(f\" 2. 📊 Compare performance with base model on both general and domain tasks\")\n",
+"print(f\" 3. 🔄 If more adaptation needed, slightly increase unfreeze_rank_ratio\")\n",
+"print(f\" 4. 💡 If too much change occurred, reduce unfreeze_rank_ratio\")\n",
+"print(f\" 5. 📝 Document the unfreeze_rank_ratio that works best for your use case\")\n",
+"print(f\" 6. 🚢 Deploy with confidence - no catastrophic forgetting!\")\n",
 "\n",
 "# Performance optimization tips\n",
-"print(f\"\\n\u26a1 OSFT-Specific Optimization Tips:\")\n",
-"print(f\" \u2022 Current unfreeze_rank_ratio ({unfreeze_rank_ratio}):\")\n",
+"print(f\"\\n⚡ OSFT-Specific Optimization Tips:\")\n",
+"print(f\" • Current unfreeze_rank_ratio ({unfreeze_rank_ratio}):\")\n",
 "if unfreeze_rank_ratio < 0.2:\n",
 " print(f\" Very conservative - great preservation, slower adaptation\")\n",
 " print(f\" Consider increasing to 0.25-0.3 if need more adaptation\")\n",
@@ -835,10 +835,10 @@
 " print(f\" Aggressive - faster adaptation, slightly less preservation\")\n",
 " print(f\" Consider reducing if seeing any capability degradation\")\n",
 "\n",
-"print(f\" \u2022 Memory usage is similar to SFT - adjust max_tokens_per_gpu as needed\")\n",
-"print(f\" \u2022 For production: use the script version for better logging and resumption\")\n",
+"print(f\" • Memory usage is similar to SFT - adjust max_tokens_per_gpu as needed\")\n",
+"print(f\" • For production: use the script version for better logging and resumption\")\n",
 "\n",
-"print(f\"\\n\u2728 OSFT Training Complete!\")\n",
+"print(f\"\\n✨ OSFT Training Complete!\")\n",
 "print(f\"Your model has been successfully adapted without forgetting!\")\n"
 ]
 },
@@ -883,7 +883,7 @@
 "|-----------|-------------|-----------------|\n",
 "| `num_epochs` | Number of training epochs | `1` |\n",
 "| `seed` | Random seed for reproducibility | `42` |\n",
-"| `use_liger` | Enable Liger kernels for efficiency | `False` |\n",
+"| `use_liger` | Enable Liger kernels for efficiency | `True` |\n",
 "| `warmup_steps` | Number of warmup steps | `0` |\n",
 "| `lr_scheduler` | Learning rate scheduler | `\"cosine\"` |\n",
 "| `lr_scheduler_kwargs` | Additional scheduler parameters | `{\"eta_min\": 1e-6}` |\n",
@@ -1004,4 +1004,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
-}
\ No newline at end of file
+}
diff --git a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
index 76b1599..aa7cd66 100644
--- a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
+++ b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
@@ -57,8 +57,8 @@ Two comprehensive tutorial notebooks are provided. Download them to your Workben
 | Notebook | Algorithm | Download |
 |----------|-----------|----------|
-| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [sft_comprehensive_tutorial.ipynb](./sft_comprehensive_tutorial.ipynb) |
-| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [osft_comprehensive_tutorial.ipynb](./osft_comprehensive_tutorial.ipynb) |
+| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [sft_comprehensive_tutorial.ipynb](/workbench/how_to/sft_comprehensive_tutorial.ipynb) |
+| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [osft_comprehensive_tutorial.ipynb](/workbench/how_to/osft_comprehensive_tutorial.ipynb) |
 ### Step 1 — Install Dependencies
@@ -133,7 +133,7 @@ result = osft(
 unfreeze_rank_ratio=0.25,
 effective_batch_size=128,
 max_tokens_per_gpu=10000,
- max_seq_len=8196,
+ max_seq_len=8192,
 learning_rate=5e-6,
 num_epochs=1,
 nproc_per_node=8,
From ce9309d11fd662cbb3515bd85ecd5de8f1ccd34f Mon Sep 17 00:00:00 2001
From: Wu Yi
Date: Tue, 24 Mar 2026 11:15:10 +0800
Subject: [PATCH 3/3] fix link
---
 docs/en/workbench/how_to/training_hub_fine_tuning.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
index aa7cd66..ac63978 100644
--- a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
+++ b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
@@ -57,8 +57,8 @@ Two comprehensive tutorial notebooks are provided. Download them to your Workben
 | Notebook | Algorithm | Download |
 |----------|-----------|----------|
-| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [sft_comprehensive_tutorial.ipynb](/workbench/how_to/sft_comprehensive_tutorial.ipynb) |
-| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [osft_comprehensive_tutorial.ipynb](/workbench/how_to/osft_comprehensive_tutorial.ipynb) |
+| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [Download sft_comprehensive_tutorial.ipynb](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) |
+| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [Download osft_comprehensive_tutorial.ipynb](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) |
 ### Step 1 — Install Dependencies