From 5edfb2915afd0fbe274ee30af7ed8a469fbc4524 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 23 Mar 2026 21:20:07 +0800
Subject: [PATCH 1/3] add docs for training hub

---
 .../how_to/osft_comprehensive_tutorial.ipynb  | 1007 ++++++++++
 .../how_to/sft_comprehensive_tutorial.ipynb   | 1679 +++++++++++++++++
 .../how_to/training_hub_fine_tuning.mdx       |  198 ++
 3 files changed, 2884 insertions(+)
 create mode 100644 docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb
 create mode 100644 docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb
 create mode 100644 docs/en/workbench/how_to/training_hub_fine_tuning.mdx

diff --git a/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb
new file mode 100644
index 0000000..ebd6edd
--- /dev/null
+++ b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb
@@ -0,0 +1,1007 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Comprehensive OSFT Training Tutorial\n",
+    "\n",
+    "This notebook provides a comprehensive guide to Orthogonal Subspace Fine-Tuning (OSFT) using the training_hub library. We'll cover:\n",
+    "\n",
+    "- **All available parameters** and their detailed explanations\n",
+    "- **Single-node and multi-node training** configurations\n",
+    "- **Popular model examples** (Qwen 2.5 7B Instruct, Llama 3.1 8B Instruct, Phi 4 Mini, etc.)\n",
+    "- **Best practices and troubleshooting**\n",
+    "\n",
+    "OSFT (Orthogonal Subspace Fine-Tuning) is an algorithm based on [Nayak et al. (2025), arXiv:2504.07097](https://arxiv.org/abs/2504.07097) that enables continual training of pre-trained or instruction-tuned models **without** catastrophic forgetting and **without** needing replay buffers or supplementary datasets.\n",
+    "\n",
+    "This tutorial serves as both a learning resource and a template you can adapt for your specific continual learning needs.\n",
+    "\n",
+    "**Note:** For production workflows, we also provide focused example scripts for popular models, which offer more consistent logging: `scripts/osft_qwen_example.py`, `scripts/osft_llama_example.py`, and `scripts/osft_phi_example.py`.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## What is OSFT?\n",
+    "\n",
+    "OSFT (Orthogonal Subspace Fine-Tuning) is a continual learning algorithm that allows you to adapt pre-trained or instruction-tuned models to new domains **without catastrophic forgetting**. Based on [Nayak et al. (2025), arXiv:2504.07097](https://arxiv.org/abs/2504.07097), OSFT fundamentally changes how we approach model adaptation.\n",
+    "\n",
+    "### Key Innovation\n",
+    "\n",
+    "Traditional fine-tuning updates all model parameters, which can overwrite previously learned knowledge. OSFT instead:\n",
+    "1. **Identifies orthogonal subspaces** in the model's weight matrices\n",
+    "2. **Restricts updates to these subspaces**, preserving existing knowledge\n",
+    "3. **Eliminates the need for replay buffers** or supplementary datasets\n",
+    "\n",
+    "### OSFT vs Traditional Fine-Tuning\n",
+    "\n",
+    "| Aspect | Traditional SFT | OSFT |\n",
+    "|--------|----------------|------|\n",
+    "| **Catastrophic Forgetting** | Common problem | Prevented by design |\n",
+    "| **Data Requirements** | Needs replay/mixed data | Only new domain data |\n",
+    "| **Preservation Method** | Data mixing ratios | Built into the algorithm (orthogonal subspace updates) |\n",
+    "| **Memory Usage** | Similar | Similar |\n",
+    "| **Complexity** | Complex data pipelines | Simple, direct |\n",
+    "\n",
+    "### When to Use OSFT\n",
+    "\n",
+    "**Perfect for:**\n",
+    "- Adding domain-specific knowledge (medical, legal, technical)\n",
+    "- Adapting to new languages or dialects\n",
+    "- Customizing instruction formats\n",
+    "- Continual learning across multiple domains\n",
+    "- Any scenario where you need to preserve existing capabilities\n",
+    "\n",
+    "**Not needed for:**\n",
+    "- Training from scratch\n",
+    "- Base model pre-training\n",
+    "- When you want to completely replace model behavior\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Understanding the Key Parameter: `unfreeze_rank_ratio`\n",
+    "\n",
+    "The `unfreeze_rank_ratio` is the most important OSFT-specific parameter. It controls the balance between preservation and adaptation.\n",
+    "\n",
+    "### What Does It Do?\n",
+    "\n",
+    "- Controls **how much of each weight matrix** can be updated during training\n",
+    "- Range: `0.0` to `1.0`\n",
+    "- Lower values = more preservation, slower adaptation\n",
+    "- Higher values = more adaptation, slightly less preservation\n",
+    "\n",
+    "### Visual Intuition\n",
+    "\n",
+    "Think of a weight matrix as a building:\n",
+    "- `unfreeze_rank_ratio = 0.1`: You can only renovate 10% of the rooms\n",
+    "- `unfreeze_rank_ratio = 0.3`: You can renovate 30% of the rooms\n",
+    "- `unfreeze_rank_ratio = 1.0`: You can renovate the entire building (standard fine-tuning)\n",
+    "\n",
+    "The \"rooms\" you renovate are carefully chosen to be orthogonal to existing knowledge, preventing damage to what's already there.\n",
+    "\n",
+    "### Recommended Settings by Use Case\n",
+    "\n",
+    "| Use Case | Recommended Ratio | Why? |\n",
+    "|----------|-------------------|------|\n",
+    "| **Minor format adjustments** | 0.1-0.15 | Minimal changes needed |\n",
+    "| **Domain vocabulary addition** | 0.15-0.25 | Add terms without losing general knowledge |\n",
+    "| **Domain specialization** | 0.25-0.35 | Balance preservation and new expertise |\n",
+    "| **Major capability expansion** | 0.35-0.5 | Significant new learning required |\n",
+    "| **Complete repurposing** | >0.5 | Rarely needed, approaching standard fine-tuning |\n",
+    "\n",
+    "### Practical Guidelines\n",
+    "\n",
+    "```python\n",
+    "# Conservative: Maximum preservation\n",
+    "unfreeze_rank_ratio = 0.2  # Great for adding specialized knowledge\n",
+    "\n",
+    "# Balanced: Good for most use cases  \n",
+    "unfreeze_rank_ratio = 0.3  # Ideal default for domain adaptation\n",
+    "\n",
+    "# Aggressive: When you need significant changes\n",
+    "unfreeze_rank_ratio = 0.4  # Use when preservation is less critical\n",
+    "```\n",
+    "\n",
+    "**Pro tip:** Start conservative (0.2-0.3) and increase only if needed. It's easier to retrain with a higher ratio than to recover lost capabilities!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import os\n",
+    "\n",
+    "data_path = \"./test_osft_data.jsonl\"\n",
+    "if not os.path.exists(data_path):\n",
+    "    print(f\"Creating dummy dataset at {data_path}\")\n",
+    "    dummy_data = [\n",
+    "        {\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I am doing well, thank you! How can I help you today?\"}]}\n",
+    "    ] * 10\n",
+    "    with open(data_path, \"w\") as f:\n",
+    "        for d in dummy_data:\n",
+    "            f.write(json.dumps(d) + \"\\n\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The `target_patterns` Parameter (Advanced Users Only)\n",
+    "\n",
+    "There's an optional `target_patterns` parameter that allows targeting specific model layers for OSFT:\n",
+    "\n",
+    "```python\n",
+    "target_patterns = None  # Default: applies OSFT to all appropriate layers (RECOMMENDED)\n",
+    "```\n",
+    "\n",
+    "**\u26a0\ufe0f Important:** This is an expert-level parameter. Unless you have deep knowledge of model architecture and a specific reason to limit OSFT to certain layers, **leave it as `None`**.\n",
+    "\n",
+    "If you do need to use it, it performs simple substring matching on module names:\n",
+    "- `target_patterns = [\"attention\"]` \u2192 Targets modules with \"attention\" in the name\n",
+    "- `target_patterns = [\"mlp\"]` \u2192 Targets modules with \"mlp\" in the name\n",
+    "\n",
+    "**For 99% of users:** Just use the default (`None`) and let OSFT handle layer selection automatically. The algorithm knows what it's doing.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "Before running this notebook, install the required dependencies:\n",
+    "\n",
+    "```bash\n",
+    "# Install training-hub (brings in mini-trainer and other deps)\n",
+    "pip install training-hub\n",
+    "\n",
+    "# Install PyTorch 2.9+ with CUDA 12.9 support (required by mini-trainer 0.7+)\n",
+    "pip install torch==2.9.0+cu129 torchvision==0.24.0+cu129 --index-url https://download.pytorch.org/whl/cu129\n",
+    "\n",
+    "# Install a compatible NCCL build (needed if system NCCL mismatches your CUDA driver)\n",
+    "pip install nvidia-nccl-cu12 nvidia-nvshmem-cu12\n",
+    "```\n",
+    "\n",
+    "> **Note:** If `flash-attn` is not available on your platform, the notebook sets `TESTING=true` to fall back to PyTorch SDPA attention. If `liger-kernel` is not installed, set `use_liger = False` in the parameters cell."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup and Imports\n",
+    "\n",
+    "Let's start by importing the necessary libraries and setting up our environment.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import training_hub for OSFT training\n",
+    "from training_hub import osft\n",
+    "\n",
+    "# Standard library imports\n",
+    "import os\n",
+    "import time\n",
+    "from datetime import datetime\n",
+    "from pathlib import Path\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data Format Requirements\n",
+    "\n",
+    "Before configuring your training, ensure your data is in the correct format. OSFT uses the mini-trainer backend, which supports both the standard messages format and pre-processed datasets.\n",
+    "\n",
+    "### Required Format: JSONL with Messages\n",
+    "\n",
+    "Your training data must be a **JSON Lines (.jsonl)** file where each line contains a conversation sample:\n",
+    "\n",
+    "```json\n",
+    "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I'm doing well, thank you! How can I help you today?\"}]}\n",
+    "{\"messages\": [{\"role\": \"user\", \"content\": \"What is machine learning?\"}, {\"role\": \"assistant\", \"content\": \"Machine learning is a subset of artificial intelligence...\"}]}\n",
+    "```\n",
+    "\n",
+    "### Message Structure\n",
+    "\n",
+    "Each conversation contains a `messages` array with message objects having:\n",
+    "- **`role`**: One of `\"system\"`, `\"user\"`, `\"assistant\"`, or `\"pretraining\"`\n",
+    "- **`content`**: The text content of the message\n",
+    "- **`reasoning_content`** (optional): Additional reasoning traces\n",
+    "\n",
+    "### Masking Control with `unmask_messages` Parameter\n",
+    "\n",
+    "Control which parts of the conversation are used for training loss:\n",
+    "\n",
+    "#### Standard Instruction Tuning (default)\n",
+    "```python\n",
+    "osft(..., unmask_messages=False)  # Only assistant responses used for loss\n",
+    "```\n",
+    "- **Trains only on assistant responses** (standard instruction-following)\n",
+    "- System messages are always masked (ignored for loss)\n",
+    "- User messages are masked\n",
+    "- Assistant messages are unmasked (used for loss calculation)\n",
+    "\n",
+    "#### Pretraining Mode\n",
+    "```python\n",
+    "osft(..., unmask_messages=True)   # All content except system messages used for loss\n",
+    "```\n",
+    "- **Trains on all content except system messages**\n",
+    "- System messages are always masked\n",
+    "- User and assistant messages are both unmasked\n",
+    "- Useful for pretraining-style data where the model should learn from all text\n",
+    "\n",
+    "### Pre-processed Dataset Option\n",
+    "\n",
+    "If you have pre-processed data with `input_ids` and `labels` fields:\n",
+    "\n",
+    "```json\n",
+    "{\"input_ids\": [1, 2, 3, ...], \"labels\": [1, 2, 3, ...]}\n",
+    "```\n",
+    "\n",
+    "Use with:\n",
+    "```python\n",
+    "osft(..., use_processed_dataset=True)\n",
+    "```\n",
+    "\n",
+    "### Data Path Configuration\n",
+    "\n",
+    "When configuring your training, point to your JSONL file:\n",
+    "\n",
+    "```python\n",
+    "data_path = \"/path/to/your/training_data.jsonl\"  # Your messages-format JSONL file\n",
+    "```\n",
+    "\n",
+    "The training pipeline will automatically:\n",
+    "1. Load and validate your JSONL data\n",
+    "2. Apply chat templates based on your model\n",
+    "3. Handle masking according to the `unmask_messages` setting\n",
+    "4. Process the data for efficient training\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Model Configuration Examples\n",
+    "\n",
+    "Here are configuration examples for popular models. These serve as starting points - adjust based on your specific hardware and continual learning requirements.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# =============================================================================\n",
+    "# MODEL CONFIGURATION EXAMPLES FOR OSFT\n",
+    "# These are example configurations - adjust based on your hardware and requirements\n",
+    "# =============================================================================\n",
+    "\n",
+    "# Example 1: Qwen 2.5 7B Instruct\n",
+    "qwen_example = {\n",
+    "    \"model_name\": \"Qwen 2.5 7B Instruct\",\n",
+    "    \"model_path\": \"/opt/app-root/src/Qwen3-0.6B\",  # Local test override (small Qwen3 0.6B checkpoint); for the model named above use e.g. \"Qwen/Qwen2.5-7B-Instruct\"\n",
+    "    \"example_unfreeze_rank_ratio\": 0.25,  # Conservative for preserving multilingual capabilities\n",
+    "    \"example_max_tokens_per_gpu\": 2048,\n",
+    "    \"example_max_seq_len\": 2048,  # Qwen 2.5 supports longer context; 2048 keeps the example light\n",
+    "    \"example_batch_size\": 1,\n",
+    "    \"example_learning_rate\": 5e-6, \n",
+    "    \"notes\": \"Excellent for domain adaptation while preserving multilingual capabilities\"\n",
+    "}\n",
+    "\n",
+    "# Example 2: Llama 3.1 8B Instruct\n",
+    "llama_example = {\n",
+    "    \"model_name\": \"Llama 3.1 8B Instruct\",\n",
+    "    \"model_path\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",  # HuggingFace model name or local path\n",
+    "    \"example_unfreeze_rank_ratio\": 0.3,  # Slightly higher for more adaptation freedom\n",
+    "    \"example_max_tokens_per_gpu\": 10000,\n",
+    "    \"example_max_seq_len\": 8192,  # Supports up to 128K but 8K is common\n",
+    "    \"example_batch_size\": 128,\n",
+    "    \"example_learning_rate\": 5e-6,\n",
+    "    \"notes\": \"Ideal for adding specialized knowledge without losing general capabilities\"\n",
+    "}\n",
+    "\n",
+    "# Example 3: Phi 4 Mini\n",
+    "phi_example = {\n",
+    "    \"model_name\": \"Phi 4 Mini\",\n",
+    "    \"model_path\": \"microsoft/Phi-4-mini-instruct\",  # HuggingFace model name or local path\n",
+    "    \"example_unfreeze_rank_ratio\": 0.25,  # Conservative for smaller model\n",
+    "    \"example_max_tokens_per_gpu\": 8192,\n",
+    "    \"example_max_seq_len\": 4096,\n",
+    "    \"example_batch_size\": 64,\n",
+    "    \"example_learning_rate\": 5e-6,\n",
+    "    \"notes\": \"Efficient for edge deployment with continual adaptation\"\n",
+    "}\n",
+    "\n",
+    "# Example 4: Generic 7B Base Model\n",
+    "generic_7b_example = {\n",
+    "    \"model_name\": \"Generic 7B Base\",\n",
+    "    \"model_path\": \"/path/to/your-7b-model\",  # Local path to model directory\n",
+    "    \"example_unfreeze_rank_ratio\": 0.3,  # Balanced preservation vs adaptation\n",
+    "    \"example_max_tokens_per_gpu\": 10000,\n",
+    "    \"example_max_seq_len\": 4096,\n",
+    "    \"example_batch_size\": 128,\n",
+    "    \"example_learning_rate\": 5e-6,\n",
+    "    \"notes\": \"Good baseline for most 7B instruction-tuned models\"\n",
+    "}\n",
+    "\n",
+    "# Example 5: Smaller Model (1B-3B)\n",
+    "small_model_example = {\n",
+    "    \"model_name\": \"Small Model (1B-3B)\",\n",
+    "    \"model_path\": \"/path/to/small-model\",  # Local path or HuggingFace name\n",
+    "    \"example_unfreeze_rank_ratio\": 0.4,  # Higher ratio for smaller models\n",
+    "    \"example_max_tokens_per_gpu\": 16_000,\n",
+    "    \"example_max_seq_len\": 4096,\n",
+    "    \"example_batch_size\": 128,\n",
+    "    \"example_learning_rate\": 3e-5,\n",
+    "    \"notes\": \"Smaller models can handle more aggressive adaptation\"\n",
+    "}\n",
+    "\n",
+    "# =============================================================================\n",
+    "# SELECT YOUR CONFIGURATION\n",
+    "# =============================================================================\n",
+    "\n",
+    "# Choose one of the examples above as a starting point\n",
+    "selected_example = qwen_example  # Change this to your preferred example\n",
+    "\n",
+    "print(f\"Selected Example: {selected_example['model_name']}\")\n",
+    "print(f\"Model Path: {selected_example['model_path']}\")\n",
+    "print(f\"OSFT Unfreeze Rank Ratio: {selected_example['example_unfreeze_rank_ratio']}\")\n",
+    "print(f\"Example Max Tokens per GPU: {selected_example['example_max_tokens_per_gpu']:,}\")\n",
+    "print(f\"Example Max Sequence Length: {selected_example['example_max_seq_len']:,}\")\n",
+    "print(f\"Example Batch Size: {selected_example['example_batch_size']:,}\")\n",
+    "print(f\"Example Learning Rate: {selected_example['example_learning_rate']}\")\n",
+    "print(f\"Notes: {selected_example['notes']}\")\n",
+    "print(\"\\n\ud83d\udca1 Remember: OSFT preserves original capabilities without needing replay buffers!\")\n",
+    "print(\"   Adjust unfreeze_rank_ratio based on preservation vs adaptation needs.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Complete Parameter Reference\n",
+    "\n",
+    "Let's configure all available OSFT parameters with detailed explanations.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# =============================================================================\n",
+    "# COMPLETE OSFT PARAMETER CONFIGURATION\n",
+    "# =============================================================================\n",
+    "\n",
+    "# Experiment identification\n",
+    "experiment_name = \"osft_comprehensive_example\"\n",
+    "timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
+    "full_experiment_name = f\"{experiment_name}_{timestamp}\"\n",
+    "\n",
+    "# =============================================================================\n",
+    "# REQUIRED PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "# NOTE: model_path and data_path currently point at a small local test setup; replace them for real training runs\n",
+    "model_path = selected_example[\"model_path\"]  # HuggingFace model name or local path\n",
+    "data_path = \"./test_osft_data.jsonl\"  # Path to training data in JSONL format\n",
+    "ckpt_output_dir = f\"checkpoints/{full_experiment_name}\"  # Where to save checkpoints\n",
+    "unfreeze_rank_ratio = selected_example[\"example_unfreeze_rank_ratio\"]  # OSFT-specific parameter\n",
+    "effective_batch_size = selected_example[\"example_batch_size\"]  # Effective batch size for training\n",
+    "max_tokens_per_gpu = selected_example[\"example_max_tokens_per_gpu\"]  # Maximum tokens per GPU (memory limit)\n",
+    "max_seq_len = selected_example[\"example_max_seq_len\"]  # Maximum sequence length\n",
+    "learning_rate = selected_example[\"example_learning_rate\"]  # Learning rate for training\n",
+    "\n",
+    "print(\"\ud83d\udccb Required Parameters (all must be specified):\")\n",
+    "print(f\"  \u2022 model_path: {model_path}\")\n",
+    "print(f\"  \u2022 data_path: {data_path}\")\n",
+    "print(f\"  \u2022 ckpt_output_dir: {ckpt_output_dir}\")\n",
+    "print(f\"  \u2022 unfreeze_rank_ratio: {unfreeze_rank_ratio}\")\n",
+    "print(f\"  \u2022 effective_batch_size: {effective_batch_size}\")\n",
+    "print(f\"  \u2022 max_tokens_per_gpu: {max_tokens_per_gpu:,}\")\n",
+    "print(f\"  \u2022 max_seq_len: {max_seq_len:,}\")\n",
+    "print(f\"  \u2022 learning_rate: {learning_rate}\")\n",
+    "print()\n",
+    "\n",
+    "# =============================================================================\n",
+    "# OSFT-SPECIFIC PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "target_patterns = None  # Optional: Patterns to match specific modules for OSFT\n",
+    "# Example: [\"attention\", \"mlp\"] to target attention and MLP layers (substring match on module names)\n",
+    "\n",
+    "print(\"\ud83d\udd27 OSFT-Specific Parameters:\")\n",
+    "print(f\"  unfreeze_rank_ratio: {unfreeze_rank_ratio} - Controls how much of each matrix is unfrozen\")\n",
+    "print(f\"    \u2022 0.1-0.3: Conservative, maximum preservation\")\n",
+    "print(f\"    \u2022 0.3-0.5: Balanced adaptation\")\n",
+    "print(f\"    \u2022 >0.5: Rarely needed for typical use cases\")\n",
+    "print(f\"  target_patterns: {target_patterns} - Optional patterns for selecting specific modules\")\n",
+    "print()\n",
+    "\n",
+    "# =============================================================================\n",
+    "# TRAINING HYPERPARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "# A single epoch keeps this example run short; increase (e.g. to 3) for real training\n",
+    "num_epochs = 1  # Number of training epochs\n",
+    "seed = 42  # Random seed for reproducibility\n",
+    "lr_scheduler = \"cosine\"  # Learning rate scheduler\n",
+    "lr_scheduler_kwargs = {}  # Scheduler parameters\n",
+    "warmup_steps = 0  # Number of warmup steps\n",
+    "\n",
+    "print(\"\ud83c\udfaf Training Hyperparameters:\")\n",
+    "print(f\"  effective_batch_size: {effective_batch_size} - Effective batch size for training\")\n",
+    "print(f\"  learning_rate: {learning_rate} - Learning rate for model updates\")\n",
+    "print(f\"  num_epochs: {num_epochs} - Number of training epochs\")\n",
+    "print(f\"  lr_scheduler: '{lr_scheduler}' - Learning rate scheduler type\")\n",
+    "print(f\"  lr_scheduler_kwargs: {lr_scheduler_kwargs} - Scheduler parameters\")\n",
+    "print(f\"  warmup_steps: {warmup_steps} - Number of warmup steps\")\n",
+    "print(f\"  seed: {seed} - Random seed for reproducibility\")\n",
+    "print()\n",
+    "\n",
+    "# =============================================================================\n",
+    "# MEMORY AND PERFORMANCE PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "use_liger = True  # Use Liger kernels for efficiency\n",
+    "\n",
+    "print(\"\u26a1 Memory and Performance Parameters:\")\n",
+    "print(f\"  max_tokens_per_gpu: {max_tokens_per_gpu:,} - Maximum tokens per GPU (hard-cap for memory)\")\n",
+    "print(f\"  max_seq_len: {max_seq_len:,} - Maximum sequence length\")\n",
+    "print(f\"  use_liger: {use_liger} - Use Liger kernels for efficiency\")\n",
+    "print()\n",
+    "\n",
+    "# =============================================================================\n",
+    "# DATA PROCESSING PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "data_output_dir = \"/dev/shm/osft_data\"  # Directory for processed data (RAM disk for speed)\n",
+    "use_processed_dataset = False  # Whether data is pre-processed\n",
+    "unmask_messages = False  # Whether to unmask all messages for pretraining-style learning\n",
+    "\n",
+    "print(\"\ud83d\udcbe Data Processing Parameters:\")\n",
+    "print(f\"  data_path: '{data_path}' - Path to training data (JSONL format)\")\n",
+    "print(f\"  data_output_dir: '{data_output_dir}' - Directory to save processed data\")\n",
+    "print(f\"  use_processed_dataset: {use_processed_dataset} - Whether to use pre-processed data\")\n",
+    "print(f\"  unmask_messages: {unmask_messages} - Whether to unmask all messages\")\n",
+    "print()\n",
+    "\n",
+    "# =============================================================================\n",
+    "# CHECKPOINTING PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "checkpoint_at_epoch = True  # Whether to checkpoint at each epoch\n",
+    "save_final_checkpoint = True  # Whether to save final checkpoint\n",
+    "\n",
+    "print(\"\ud83d\udcbe Checkpointing Parameters:\")\n",
+    "print(f\"  ckpt_output_dir: '{ckpt_output_dir}' - Directory to save checkpoints\")\n",
+    "print(f\"  checkpoint_at_epoch: {checkpoint_at_epoch} - Whether to checkpoint at each epoch\")\n",
+    "print(f\"  save_final_checkpoint: {save_final_checkpoint} - Whether to save final checkpoint\")\n",
+    "print()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Distributed Training Configuration\n",
+    "\n",
+    "Configure distributed training for both single-node and multi-node setups.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# =============================================================================\n",
+    "# DISTRIBUTED TRAINING PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "# Configuration options for different setups\n",
+    "distributed_configs = {\n",
+    "    \"single_gpu_dev\": {\n",
+    "        \"nproc_per_node\": 1,\n",
+    "        \"nnodes\": 1,\n",
+    "        \"node_rank\": 0,\n",
+    "        \"rdzv_id\": 1,\n",
+    "        \"rdzv_endpoint\": \"127.0.0.1:29500\",\n",
+    "        \"description\": \"Development setup with single GPU\"\n",
+    "    },\n",
+    "    \"single_node_8gpu\": {\n",
+    "        \"nproc_per_node\": 8,\n",
+    "        \"nnodes\": 1,\n",
+    "        \"node_rank\": 0,\n",
+    "        \"rdzv_id\": 100,\n",
+    "        \"rdzv_endpoint\": \"127.0.0.1:29500\",\n",
+    "        \"description\": \"Single node with 8 GPUs\"\n",
+    "    },\n",
+    "    \"multi_node_master\": {\n",
+    "        \"nproc_per_node\": 8,\n",
+    "        \"nnodes\": 2,  # 2 nodes\n",
+    "        \"node_rank\": 0,\n",
+    "        \"rdzv_id\": 42,\n",
+    "        # master node IP\n",
+    "        \"rdzv_endpoint\": \"10.241.128.23:1738\",  # Replace with actual master IP\n",
+    "        \"description\": \"Multi-node master (rank 0) - 2 nodes total\"\n",
+    "    },\n",
+    "    \"multi_node_worker\": {\n",
+    "        \"nproc_per_node\": 8,\n",
+    "        \"nnodes\": 2,  # 2 nodes\n",
+    "        \"node_rank\": 1,  # Change this for each worker node (1, 2, 3, ...)\n",
+    "        \"rdzv_id\": 42,\n",
+    "        \"rdzv_endpoint\": \"10.241.128.23:1738\",  # Same as master\n",
+    "        \"description\": \"Multi-node worker (rank 1) - change rank for each worker\"\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "# Select your distributed configuration\n",
+    "selected_distributed = \"single_gpu_dev\"  # Change this to match your setup\n",
+    "dist_config = distributed_configs[selected_distributed]\n",
+    "\n",
+    "# Extract distributed training parameters\n",
+    "nproc_per_node = dist_config[\"nproc_per_node\"]  # Number of processes (GPUs) per node\n",
+    "nnodes = dist_config[\"nnodes\"]  # Total number of nodes\n",
+    "node_rank = dist_config[\"node_rank\"]  # Rank of this node (0 to nnodes-1)\n",
+    "rdzv_id = dist_config[\"rdzv_id\"]  # Unique job ID for rendezvous\n",
+    "rdzv_endpoint = dist_config[\"rdzv_endpoint\"]  # Master node endpoint for multi-node training\n",
+    "\n",
+    "# Calculate total resources\n",
+    "total_gpus = nproc_per_node * nnodes\n",
+    "per_gpu_batch_size = max(1, effective_batch_size // total_gpus)  # Avoid reporting 0 when GPUs outnumber samples per batch\n",
+    "\n",
+    "print(\"\ud83d\udda5\ufe0f  Distributed Training Parameters:\")\n",
+    "print(f\"  Configuration: {dist_config['description']}\")\n",
+    "print(f\"  nproc_per_node: {nproc_per_node} - Number of processes (GPUs) per node\")\n",
+    "print(f\"  nnodes: {nnodes} - Total number of nodes\")\n",
+    "print(f\"  node_rank: {node_rank} - Rank of this node (0 to nnodes-1)\")\n",
+    "print(f\"  rdzv_id: {rdzv_id} - Unique job ID for rendezvous\")\n",
+    "print(f\"  rdzv_endpoint: '{rdzv_endpoint}' - Master node endpoint for multi-node training\")\n",
+    "print()\n",
+    "print(f\"\ud83d\udcca Resource Calculation:\")\n",
+    "print(f\"  Total GPUs: {total_gpus} ({nproc_per_node} \u00d7 {nnodes})\")\n",
+    "print(f\"  Effective batch size: {effective_batch_size}\")\n",
+    "print(f\"  Approximate per-GPU batch size: {per_gpu_batch_size}\")\n",
+    "print(f\"  (Actual micro-batch size determined automatically by gradient accumulation)\")\n",
+    "print()\n",
+    "\n",
+    "# Multi-node setup instructions\n",
+    "if nnodes > 1:\n",
+    "    print(\"\ud83d\udd27 Multi-Node Setup Instructions:\")\n",
+    "    print(f\"  1. Ensure all nodes can reach the master at {rdzv_endpoint}\")\n",
+    "    print(f\"  2. Use the same rdzv_id ({rdzv_id}) on all nodes\")\n",
+    "    print(f\"  3. Set node_rank to 0 for master, 1,2,3... for workers\")\n",
+    "    print(f\"  4. Start training on ALL nodes simultaneously\")\n",
+    "    print()\n",
+    "\n",
+    "# OSFT-specific multi-node considerations\n",
+    "print(\"\ud83d\udcdd OSFT Multi-Node Considerations:\")\n",
+    "print(\"  \u2022 OSFT works seamlessly across multiple nodes\")\n",
+    "print(\"  \u2022 No replay buffer coordination needed (unlike replay-based SFT)\")\n",
+    "print(\"  \u2022 Each node processes its data portion with the same unfreeze_rank_ratio\")\n",
+    "print(\"  \u2022 Gradients are synchronized automatically across all nodes\")\n",
+    "print()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Execute Training\n",
+    "\n",
+    "Now let's run the actual OSFT training with all our configured parameters.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# =============================================================================\n",
+    "# TRAINING EXECUTION\n",
+    "# =============================================================================\n",
+    "\n",
+    "print(\"\ud83d\ude80 Starting OSFT Training\")\n",
+    "print(\"=\" * 60)\n",
+    "print(f\"Experiment: {full_experiment_name}\")\n",
+    "print(f\"Model: {selected_example['model_name']}\")\n",
+    "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node \u00d7 {nnodes} nodes)\")\n",
+    "print(f\"Configuration: {dist_config['description']}\")\n",
+    "print(f\"Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n",
+    "print()\n",
+    "print(\"\u2728 OSFT Advantages:\")\n",
+    "print(\"  \u2022 No catastrophic forgetting\")\n",
+    "print(\"  \u2022 No replay buffer needed\")\n",
+    "print(\"  \u2022 Preserves original model capabilities\")\n",
+    "print()\n",
+    "\n",
+    "# Prepare all training parameters\n",
+    "training_params = {\n",
+    "    # Required parameters\n",
+    "    'model_path': model_path,\n",
+    "    'data_path': data_path,\n",
+    "    'ckpt_output_dir': ckpt_output_dir,\n",
+    "    'unfreeze_rank_ratio': unfreeze_rank_ratio,\n",
+    "    'effective_batch_size': effective_batch_size,\n",
+    "    'max_tokens_per_gpu': max_tokens_per_gpu,\n",
+    "    'max_seq_len': max_seq_len,\n",
+    "    'learning_rate': learning_rate,\n",
+    "    \n",
+    "    # Optional OSFT-specific parameters\n",
+    "    'target_patterns': target_patterns,\n",
+    "    \n",
+    "    # Training duration\n",
+    "    'num_epochs': num_epochs,\n",
+    "    \n",
+    "    # Data processing parameters\n",
+    "    'data_output_dir': data_output_dir,\n",
+    "    'use_processed_dataset': use_processed_dataset,\n",
+    "    'unmask_messages': unmask_messages,\n",
+    "    'warmup_steps': warmup_steps,\n",
+    "    \n",
+    "    # Optimization parameters\n",
+    "    'use_liger': use_liger,\n",
+    "    'seed': seed,\n",
+    "    'lr_scheduler': lr_scheduler,\n",
+    "    'lr_scheduler_kwargs': lr_scheduler_kwargs,\n",
+    "    \n",
+    "    # Checkpointing parameters\n",
+    "    'checkpoint_at_epoch': checkpoint_at_epoch,\n",
+    "    'save_final_checkpoint': save_final_checkpoint,\n",
+    "    \n",
+    "    # Distributed training parameters\n",
+    "    'nproc_per_node': nproc_per_node,\n",
+    "    'nnodes': nnodes,\n",
+    "    'node_rank': node_rank,\n",
+    "    'rdzv_id': rdzv_id,\n",
+    "    'rdzv_endpoint': rdzv_endpoint,\n",
+    "}\n",
+    "\n",
+    "# Display final configuration summary\n",
+    "print(\"\ud83d\udccb Final Training Configuration:\")\n",
+    "for key, value in training_params.items():\n",
+    "    if value is not None:  # Only show non-None values\n",
+    "        print(f\"  {key}: {value}\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\"*60)\n",
+    "print(\"\u23f3 Training starting...\")\n",
+    "print(\"=\"*60)\n",
+    "\n",
+    "# Execute training\n",
+    "start_time = time.time()\n",
+    "\n",
+    "try:\n",
+    "    result = osft(**training_params)\n",
+    "    \n",
+    "    end_time = time.time()\n",
+    "    duration = end_time - start_time\n",
+    "    \n",
+    "    print(\"\\n\" + \"=\"*60)\n",
+    "    print(\"\u2705 OSFT Training completed successfully!\")\n",
+    "    print(f\"\u23f1\ufe0f  Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n",
+    "    print(f\"\ud83d\udcc1 Checkpoints saved to: {ckpt_output_dir}\")\n",
+    "    print(\"=\"*60)\n",
+    "    print()\n",
+    "    print(\"\ud83c\udfaf What you've achieved with OSFT:\")\n",
+    "    print(\"  \u2022 Model adapted to new domain/task\")\n",
+    "    print(\"  \u2022 Original capabilities preserved\")\n",
+    "    print(\"  \u2022 No catastrophic forgetting occurred\")\n",
+    "    print(\"  \u2022 Ready for validation on your evaluation sets before deployment\")\n",
+    "    \n",
+    "except Exception as e:\n",
+    "    end_time = time.time()\n",
+    "    duration = end_time - start_time\n",
+    "    \n",
+    "    print(\"\\n\" + \"=\"*60)\n",
+    "    print(f\"\u274c Training failed after {duration/60:.1f} minutes\")\n",
+    "    print(f\"Error: {e}\")\n",
+    "    print(\"=\"*60)\n",
+    "    \n",
+    "    print(\"\\n\ud83d\udd0d Quick Troubleshooting Checklist:\")\n",
+    "    print(\"  \u25a1 Check that model_path exists or is a valid HuggingFace model name\")\n",
+    "    print(\"  \u25a1 Verify data_path points to a valid JSONL file\")\n",
+    "    print(\"  \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\")\n",
+    "    print(\"  \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\")\n",
+    "    print(\"  \u25a1 Try adjusting unfreeze_rank_ratio (lower = more preservation)\")\n",
+    "    print(\"  \u25a1 For multi-node: verify network connectivity and endpoints\")\n",
+    "    print(\"  \u25a1 Check that mini-trainer backend dependencies are installed\")\n",
+    "    \n",
+    "    raise\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Post-Training Analysis\n",
+    "\n",
+    "After training completes, let's analyze the results and provide guidance for next steps.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# =============================================================================\n",
+    "# POST-TRAINING ANALYSIS AND NEXT STEPS\n",
+    "# =============================================================================\n",
+    "\n",
+    "print(\"\ud83d\udcca Post-Training Analysis\")\n",
+    "print(\"=\" * 50)\n",
+    "\n",
+    "# Check for saved checkpoints\n",
+    "checkpoint_dir = ckpt_output_dir\n",
+    "\n",
+    "if os.path.exists(checkpoint_dir):\n",
+    "    checkpoints = [d for d in os.listdir(checkpoint_dir) \n",
+    "                  if os.path.isdir(os.path.join(checkpoint_dir, d))]\n",
+    "    \n",
+    "    if checkpoints:\n",
+    "        print(f\"\u2705 Found {len(checkpoints)} checkpoint(s):\")\n",
+    "        for ckpt in sorted(checkpoints):\n",
+    "            ckpt_path = os.path.join(checkpoint_dir, ckpt)\n",
+    "            print(f\"  \ud83d\udcc1 {ckpt}\")\n",
+    "        \n",
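+    "        # Identify the final checkpoint by modification time (name sorting can misorder numbered directories)\n",
+    "        final_checkpoint = max(checkpoints, key=lambda d: os.path.getmtime(os.path.join(checkpoint_dir, d)))\n",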
+    "        final_checkpoint_path = os.path.join(checkpoint_dir, final_checkpoint)\n",
+    "        \n",
+    "        print(f\"\\n\ud83c\udfaf Final model checkpoint: {final_checkpoint_path}\")\n",
+    "        \n",
+    "        # Provide model loading example\n",
+    "        print(f\"\\n\ud83d\udcbb Model Loading Example:\")\n",
+    "        print(f\"```python\")\n",
+    "        print(f\"from transformers import AutoModelForCausalLM, AutoTokenizer\")\n",
+    "        print(f\"\")\n",
+    "        print(f\"# Load your OSFT-adapted model\")\n",
+    "        print(f\"model = AutoModelForCausalLM.from_pretrained('{final_checkpoint_path}')\")\n",
+    "        print(f\"tokenizer = AutoTokenizer.from_pretrained('{final_checkpoint_path}')\")\n",
+    "        print(f\"\")\n",
+    "        print(f\"# Test the model - it should maintain original capabilities\")\n",
+    "        print(f\"# while excelling at your new domain/task\")\n",
+    "        print(f\"inputs = tokenizer('Your domain-specific prompt:', return_tensors='pt')\")\n",
+    "        print(f\"outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)\")\n",
+    "        print(f\"response = tokenizer.decode(outputs[0], skip_special_tokens=True)\")\n",
+    "        print(f\"print(response)\")\n",
+    "        print(f\"```\")\n",
+    "    else:\n",
+    "        print(f\"\u274c No checkpoints found in {checkpoint_dir}\")\n",
+    "else:\n",
+    "    print(f\"\u274c Checkpoint directory not found: {checkpoint_dir}\")\n",
+    "\n",
+    "# Training summary\n",
+    "print(f\"\\n\ud83d\udcc8 Training Summary:\")\n",
+    "print(f\"  Model: {selected_example['model_name']}\")\n",
+    "print(f\"  Algorithm: OSFT (Orthogonal Subspace Fine-Tuning)\")\n",
+    "print(f\"  Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n",
+    "print(f\"  Epochs: {num_epochs}\")\n",
+    "print(f\"  Global Batch Size: {effective_batch_size}\")\n",
+    "print(f\"  Learning Rate: {learning_rate}\")\n",
+    "print(f\"  Max Tokens per GPU: {max_tokens_per_gpu:,}\")\n",
+    "print(f\"  Max Sequence Length: {max_seq_len:,}\")\n",
+    "print(f\"  Total GPUs: {total_gpus}\")\n",
+    "print(f\"  Distributed Config: {dist_config['description']}\")\n",
+    "\n",
+    "# OSFT-specific validation recommendations\n",
+    "print(f\"\\n\ud83e\uddea OSFT-Specific Validation Steps:\")\n",
+    "print(f\"  1. **Test Original Capabilities**: Verify the model still performs well on\")\n",
+    "print(f\"     general tasks it was originally trained for\")\n",
+    "print(f\"  2. **Test New Domain**: Confirm improved performance on your target domain\")\n",
+    "print(f\"  3. **Focused Regression Testing**: Unlike SFT, OSFT is designed to preserve\")\n",
+    "print(f\"     capabilities, so validation can focus on confirming that preservation\")\n",
+    "print(f\"  4. **Compare with Base Model**: Run side-by-side comparisons to see\")\n",
+    "print(f\"     improvements without degradation\")\n",
+    "\n",
+    "# Next steps recommendations\n",
+    "print(f\"\\n\ud83d\ude80 Recommended Next Steps:\")\n",
+    "print(f\"  1. \ud83c\udfaf Test on domain-specific evaluation sets\")\n",
+    "print(f\"  2. \ud83d\udcca Compare performance with base model on both general and domain tasks\")\n",
+    "print(f\"  3. \ud83d\udd04 If more adaptation needed, slightly increase unfreeze_rank_ratio\")\n",
+    "print(f\"  4. \ud83d\udca1 If too much change occurred, reduce unfreeze_rank_ratio\")\n",
+    "print(f\"  5. \ud83d\udcdd Document the unfreeze_rank_ratio that works best for your use case\")\n",
+    "print(f\"  6. \ud83d\udea2 Deploy once validation confirms no capability regression\")\n",
+    "\n",
+    "# Performance optimization tips\n",
+    "print(f\"\\n\u26a1 OSFT-Specific Optimization Tips:\")\n",
+    "print(f\"  \u2022 Current unfreeze_rank_ratio ({unfreeze_rank_ratio}):\")\n",
+    "if unfreeze_rank_ratio < 0.2:\n",
+    "    print(f\"    Very conservative - great preservation, slower adaptation\")\n",
+    "    print(f\"    Consider increasing to 0.25-0.3 if you need more adaptation\")\n",
+    "elif unfreeze_rank_ratio < 0.35:\n",
+    "    print(f\"    Balanced - good preservation with reasonable adaptation\")\n",
+    "    print(f\"    This is ideal for most use cases\")\n",
+    "else:\n",
+    "    print(f\"    Aggressive - faster adaptation, slightly less preservation\")\n",
+    "    print(f\"    Consider reducing if you see any capability degradation\")\n",
+    "\n",
+    "print(f\"  \u2022 Memory usage is similar to SFT - adjust max_tokens_per_gpu as needed\")\n",
+    "print(f\"  \u2022 For production: use the script version for better logging and resumption\")\n",
+    "\n",
+    "print(f\"\\n\u2728 OSFT Training Complete!\")\n",
+    "print(f\"Your model has been successfully adapted without forgetting!\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Parameter Reference Summary\n",
+    "\n",
+    "Quick reference for all OSFT parameters and their purposes.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Required Parameters\n",
+    "\n",
+    "| Parameter | Description | Example Values |\n",
+    "|-----------|-------------|----------------|\n",
+    "| `model_path` | Path to the model to fine-tune | `\"Qwen/Qwen2.5-7B\"`, `\"/path/to/model\"` |\n",
+    "| `data_path` | Path to the training data | `\"/path/to/train.jsonl\"` |\n",
+    "| `ckpt_output_dir` | Directory to save checkpoints | `\"/path/to/checkpoints\"` |\n",
+    "| `unfreeze_rank_ratio` | **OSFT-specific**: Controls preservation vs adaptation | `0.25`, `0.3`, `0.4` |\n",
+    "| `effective_batch_size` | Effective batch size for training | `64`, `128`, `256` |\n",
+    "| `max_tokens_per_gpu` | Maximum tokens per GPU (memory limit) | `16384`, `25000`, `40000` |\n",
+    "| `max_seq_len` | Maximum sequence length | `2048`, `8192`, `32768` |\n",
+    "| `learning_rate` | Learning rate for training | `1e-5`, `2e-5`, `5e-6` |\n",
+    "\n",
+    "### OSFT-Specific Parameters\n",
+    "\n",
+    "| Parameter | Description | Recommended Values | Use Case |\n",
+    "|-----------|-------------|-------------------|----------|\n",
+    "| `unfreeze_rank_ratio` | Controls how much of each matrix is unfrozen | `0.1-0.3` | Conservative preservation |\n",
+    "|           |             | `0.3-0.5` | Balanced adaptation |\n",
+    "|           |             | `>0.5` | Rarely needed |\n",
+    "| `target_patterns` | Optional patterns to match specific modules | `None` | Default (all modules) |\n",
+    "\n",
+    "### Training Configuration Parameters\n",
+    "\n",
+    "| Parameter | Description | Default/Example |\n",
+    "|-----------|-------------|-----------------|\n",
+    "| `num_epochs` | Number of training epochs | `1` |\n",
+    "| `seed` | Random seed for reproducibility | `42` |\n",
+    "| `use_liger` | Enable Liger kernels for efficiency | `False` |\n",
+    "| `warmup_steps` | Number of warmup steps | `0` |\n",
+    "| `lr_scheduler` | Learning rate scheduler | `\"cosine\"` |\n",
+    "| `lr_scheduler_kwargs` | Additional scheduler parameters | `{\"eta_min\": 1e-6}` |\n",
+    "\n",
+    "### Data Processing Parameters\n",
+    "\n",
+    "| Parameter | Description | Default/Example |\n",
+    "|-----------|-------------|-----------------|\n",
+    "| `data_output_dir` | Directory to save processed data | Defaults to `f\"{ckpt_output_dir}/_internal_data_processing\"`; the recommended value is `\"/dev/shm\"` (shared memory) |\n",
+    "| `use_processed_dataset` | Use pre-processed data with input_ids/labels | `False` |\n",
+    "| `unmask_messages` | Unmask all messages for pretraining-style learning | `False` |\n",
+    "\n",
+    "### Checkpointing Parameters\n",
+    "\n",
+    "| Parameter | Description | Recommended |\n",
+    "|-----------|-------------|-------------|\n",
+    "| `checkpoint_at_epoch` | Whether to checkpoint at each epoch | `True` |\n",
+    "| `save_final_checkpoint` | Whether to save final checkpoint | `True` |\n",
+    "\n",
+    "### Distributed Training Parameters\n",
+    "\n",
+    "| Parameter | Description | Example Values |\n",
+    "|-----------|-------------|----------------|\n",
+    "| `nproc_per_node` | Number of processes (GPUs) per node | `1`, `4`, `8` |\n",
+    "| `nnodes` | Total number of nodes | `1`, `2`, `4` |\n",
+    "| `node_rank` | Rank of this node (0 to nnodes-1) | `0` (master), `1`, `2`... |\n",
+    "| `rdzv_id` | Unique job ID for rendezvous | `42`, `100` |\n",
+    "| `rdzv_endpoint` | Master node endpoint for multi-node training | `\"127.0.0.1:29500\"` |\n",
+    "\n",
+    "### Unfreeze Rank Ratio Guidelines\n",
+    "\n",
+    "| Use Case | Recommended Ratio | Rationale |\n",
+    "|----------|-------------------|-----------|\n",
+    "| **Minor format changes** | 0.1-0.15 | Maximum preservation, minimal changes |\n",
+    "| **Domain vocabulary addition** | 0.15-0.25 | Add specialized terms without losing general knowledge |\n",
+    "| **Domain specialization** | 0.25-0.35 | Balance between preservation and adaptation |\n",
+    "| **Major capability expansion** | 0.35-0.5 | More freedom for significant new capabilities |\n",
+    "| **Complete repurposing** | >0.5 | Rarely needed, approaching standard fine-tuning |\n",
+    "\n",
+    "### OSFT vs SFT Key Differences\n",
+    "\n",
+    "| Aspect | OSFT | SFT |\n",
+    "|--------|------|-----|\n",
+    "| **Catastrophic Forgetting** | Prevented by design | Requires replay buffers |\n",
+    "| **Data Requirements** | Only new domain data | Needs mixed/replay data |\n",
+    "| **Memory Usage** | Similar to SFT | Similar to OSFT |\n",
+    "| **Key Parameter** | `unfreeze_rank_ratio` | N/A |\n",
+    "| **Backend** | mini-trainer | instructlab-training |\n",
+    "| **Best For** | Continual learning, domain adaptation | Initial fine-tuning |\n",
+    "\n",
+    "### Popular Model Examples for OSFT\n",
+    "\n",
+    "| Model | HuggingFace Path | Recommended `unfreeze_rank_ratio` | `max_tokens_per_gpu` |\n",
+    "|-------|------------------|-----------------------------------|----------------------|\n",
+    "| Qwen 2.5 7B | `Qwen/Qwen2.5-7B-Instruct` | 0.25 | 10000 |\n",
+    "| Llama 3.1 8B | `meta-llama/Meta-Llama-3.1-8B-Instruct` | 0.3 | 10000 |\n",
+    "| Phi 4 Mini | `microsoft/Phi-4-mini-instruct` | 0.25 | 15000 |\n",
+    "\n",
+    "### Script Alternative\n",
+    "\n",
+    "For production workloads or long-running training, use the script version:\n",
+    "\n",
+    "```bash\n",
+    "# Qwen example\n",
+    "python scripts/osft_qwen_example.py \\\n",
+    "  --data-path /path/to/data.jsonl \\\n",
+    "  --ckpt-output-dir /path/to/checkpoints \\\n",
+    "  --unfreeze-rank-ratio 0.25\n",
+    "\n",
+    "# Llama example\n",
+    "python scripts/osft_llama_example.py \\\n",
+    "  --data-path /path/to/data.jsonl \\\n",
+    "  --ckpt-output-dir /path/to/checkpoints \\\n",
+    "  --unfreeze-rank-ratio 0.3\n",
+    "\n",
+    "# Phi example\n",
+    "python scripts/osft_phi_example.py \\\n",
+    "  --data-path /path/to/data.jsonl \\\n",
+    "  --ckpt-output-dir /path/to/checkpoints \\\n",
+    "  --unfreeze-rank-ratio 0.25\n",
+    "```\n",
+    "\n",
+    "### When to Use OSFT vs SFT\n",
+    "\n",
+    "**Use OSFT when:**\n",
+    "- You are adding domain-specific knowledge to an already-trained model\n",
+    "- You need to preserve original capabilities without regression\n",
+    "- You don't have access to the original training data for replay\n",
+    "- You want to avoid catastrophic forgetting\n",
+    "- You are performing continual learning across multiple domains\n",
+    "\n",
+    "**Use SFT when:**\n",
+    "- You are fine-tuning a base (not yet instruction-tuned) model\n",
+    "- You have comprehensive training data covering all desired capabilities\n",
+    "- You don't need to preserve specific prior behaviors\n",
+    "- You are performing initial instruction tuning\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
\ No newline at end of file
diff --git a/docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb b/docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb
new file mode 100644
index 0000000..cfb19a3
--- /dev/null
+++ b/docs/en/workbench/how_to/sft_comprehensive_tutorial.ipynb
@@ -0,0 +1,1679 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Comprehensive SFT Training Tutorial\n",
+    "\n",
+    "This notebook provides a comprehensive guide to Supervised Fine-Tuning (SFT) using the training_hub library. We'll cover:\n",
+    "\n",
+    "- **All available parameters** and their detailed explanations\n",
+    "- **Single-node and multi-node training** configurations\n",
+    "- **Popular model examples** (Qwen 2.5 7B Instruct, Llama 3.1 8B Instruct, Phi 4 Mini, etc.)\n",
+    "- **Best practices and troubleshooting**\n",
+    "\n",
+    "This tutorial serves as both a learning resource and a template you can adapt for your specific fine-tuning needs.\n",
+    "\n",
+    "**Note:** For production workflows, we also provide focused example scripts for popular models (`scripts/sft_qwen_example.py`, `scripts/sft_llama_example.py`, and `scripts/sft_phi_example.py`), which offer more consistent logging."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup and Imports\n",
+    "\n",
+    "Let's start by importing the necessary libraries and setting up our environment.\n",
+    "\n",
+    "Install `training-hub` if it is not already installed:\n",
+    "```\n",
+    "export UV_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple\n",
+    "export UV_HTTP_TIMEOUT=300\n",
+    "pip install uv -i https://pypi.tuna.tsinghua.edu.cn/simple\n",
+    "uv pip install -q training-hub -i https://pypi.tuna.tsinghua.edu.cn/simple\n",
+    "```\n",
+    "\n",
+    "Reinstall PyTorch to match your CUDA version (for example, for CUDA 12.4, install torch==2.6.0 with cu124 support):\n",
+    "\n",
+    "```\n",
+    "uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import training_hub for SFT training\n",
+    "from training_hub import sft\n",
+    "\n",
+    "# Standard library imports\n",
+    "import os\n",
+    "import time\n",
+    "from datetime import datetime\n",
+    "from pathlib import Path"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import os\n",
+    "\n",
+    "data_path = \"./test_sft_data.jsonl\"\n",
+    "if not os.path.exists(data_path):\n",
+    "    print(f\"Creating dummy dataset at {data_path}\")\n",
+    "    dummy_data = [\n",
+    "        {\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I am doing well, thank you! How can I help you today?\"}]}\n",
+    "    ] * 10\n",
+    "    with open(data_path, \"w\") as f:\n",
+    "        for d in dummy_data:\n",
+    "            f.write(json.dumps(d) + \"\\n\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data Format Requirements\n",
+    "\n",
+    "Before configuring your training, ensure your data is in the correct format. Training Hub uses the instructlab-training backend, which expects data in a specific **messages format**.\n",
+    "\n",
+    "### Required Format: JSONL with Messages\n",
+    "\n",
+    "Your training data must be a **JSON Lines (.jsonl)** file where each line contains a conversation sample:\n",
+    "\n",
+    "```json\n",
+    "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I'm doing well, thank you! How can I help you today?\"}]}\n",
+    "{\"messages\": [{\"role\": \"user\", \"content\": \"What is machine learning?\"}, {\"role\": \"assistant\", \"content\": \"Machine learning is a subset of artificial intelligence...\"}]}\n",
+    "```\n",
+    "\n",
+    "### Message Structure\n",
+    "\n",
+    "Each conversation contains a `messages` array whose message objects have the following fields:\n",
+    "- **`role`**: One of `\"system\"`, `\"user\"`, `\"assistant\"`, or `\"pretraining\"`\n",
+    "- **`content`**: The text content of the message\n",
+    "- **`reasoning_content`** (optional): Additional reasoning traces\n",
+    "\n",
+    "### Masking Behavior with `unmask` Field\n",
+    "\n",
+    "You can control which parts of the conversation are used for training loss by adding an `unmask` metadata field:\n",
+    "\n",
+    "#### Standard Instruction Tuning (default)\n",
+    "```json\n",
+    "{\"messages\": [...]}\n",
+    "```\n",
+    "or\n",
+    "```json\n",
+    "{\"messages\": [...], \"unmask\": false}\n",
+    "```\n",
+    "- **Trains only on assistant responses** (standard instruction-following)\n",
+    "- System messages are always masked (ignored for loss)\n",
+    "- User messages are masked\n",
+    "- Assistant messages are unmasked (used for loss calculation)\n",
+    "\n",
+    "#### Pretraining Mode\n",
+    "```json\n",
+    "{\"messages\": [...], \"unmask\": true}\n",
+    "```\n",
+    "- **Trains on all content except system messages**\n",
+    "- System messages are always masked\n",
+    "- User and assistant messages are both unmasked\n",
+    "- Useful for pretraining-style data where the model should learn from all text\n",
+    "\n",
+    "### Example Data Formats\n",
+    "\n",
+    "**Standard SFT (instruction-following):**\n",
+    "```json\n",
+    "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a coding assistant.\"}, {\"role\": \"user\", \"content\": \"Write a Python function to calculate factorial\"}, {\"role\": \"assistant\", \"content\": \"Here's a Python function to calculate factorial:\\n\\n```python\\ndef factorial(n):\\n    if n == 0 or n == 1:\\n        return 1\\n    return n * factorial(n - 1)\\n```\"}]}\n",
+    "```\n",
+    "\n",
+    "**Pretraining-style (learn from all content):**\n",
+    "```json\n",
+    "{\"messages\": [{\"role\": \"user\", \"content\": \"The capital of France is\"}, {\"role\": \"assistant\", \"content\": \"Paris.\"}], \"unmask\": true}\n",
+    "```\n",
+    "\n",
+    "### Data Path Configuration\n",
+    "\n",
+    "When configuring your training, point to your JSONL file:\n",
+    "\n",
+    "```python\n",
+    "data_path = \"/path/to/your/training_data.jsonl\"  # Your messages-format JSONL file\n",
+    "```\n",
+    "\n",
+    "The training pipeline will automatically:\n",
+    "1. Load and validate your JSONL data\n",
+    "2. Apply chat templates based on your model\n",
+    "3. Handle masking according to the `unmask` setting\n",
+    "4. Process the data for efficient training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Model Configuration Examples\n",
+    "\n",
+    "Here are configuration examples for popular models. These serve as starting points - adjust based on your specific hardware and requirements."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Selected Example: Qwen 3 0.6B\n",
+      "Model Path: /opt/app-root/src/qwen3-0.6b\n",
+      "Example Max Tokens per GPU: 2,048\n",
+      "Example Max Sequence Length: 2,048\n",
+      "Example Batch Size: 1\n",
+      "Example Learning Rate: 1e-05\n",
+      "\n",
+      "\ud83d\udca1 Remember: These are example configurations. Adjust based on your hardware and requirements.\n"
+     ]
+    }
+   ],
+   "source": [
+    "# =============================================================================\n",
+    "# MODEL CONFIGURATION EXAMPLES\n",
+    "# These are example configurations - adjust based on your hardware and requirements\n",
+    "# =============================================================================\n",
+    "\n",
+    "# Example 1: Qwen 3 0.6B (small model loaded from a local path)\n",
+    "qwen_example = {\n",
+    "    \"model_name\": \"Qwen 3 0.6B\",\n",
+    "    \"model_path\": \"/opt/app-root/src/Qwen3-0.6B\",  # HuggingFace model name or local path\n",
+    "    \"example_max_tokens_per_gpu\": 2048,\n",
+    "    \"example_max_seq_len\": 2048,\n",
+    "    \"example_batch_size\": 1,\n",
+    "    \"example_learning_rate\": 1e-5,\n",
+    "}\n",
+    "\n",
+    "# Example 2: Llama 3.1 8B Instruct\n",
+    "llama_example = {\n",
+    "    \"model_name\": \"Llama 3.1 8B Instruct\",\n",
+    "    \"model_path\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",  # HuggingFace model name or local path\n",
+    "    \"example_max_tokens_per_gpu\": 18000,\n",
+    "    \"example_max_seq_len\": 16384,\n",
+    "    \"example_batch_size\": 128,\n",
+    "    \"example_learning_rate\": 1e-5,\n",
+    "}\n",
+    "\n",
+    "# Example 3: Phi 4 Mini\n",
+    "phi_example = {\n",
+    "    \"model_name\": \"Phi 4 Mini\",\n",
+    "    \"model_path\": \"microsoft/Phi-4-mini-instruct\",  # HuggingFace model name or local path\n",
+    "    \"example_max_tokens_per_gpu\": 25000,\n",
+    "    \"example_max_seq_len\": 8192,\n",
+    "    \"example_batch_size\": 64,\n",
+    "    \"example_learning_rate\": 5e-6,\n",
+    "}\n",
+    "\n",
+    "# Example 4: Generic 7B Base Model\n",
+    "generic_7b_example = {\n",
+    "    \"model_name\": \"Generic 7B Base\",\n",
+    "    \"model_path\": \"/path/to/your-7b-model\",  # Local path to model directory\n",
+    "    \"example_max_tokens_per_gpu\": 25000,\n",
+    "    \"example_max_seq_len\": 20000,\n",
+    "    \"example_batch_size\": 256,\n",
+    "    \"example_learning_rate\": 2e-5,\n",
+    "}\n",
+    "\n",
+    "# Example 5: Smaller Model (1B-3B)\n",
+    "small_model_example = {\n",
+    "    \"model_name\": \"Small Model (1B-3B)\",\n",
+    "    \"model_path\": \"/path/to/small-model\",  # Local path or HuggingFace name\n",
+    "    \"example_max_tokens_per_gpu\": 40000,\n",
+    "    \"example_max_seq_len\": 32768,\n",
+    "    \"example_batch_size\": 512,\n",
+    "    \"example_learning_rate\": 3e-5,\n",
+    "}\n",
+    "\n",
+    "# =============================================================================\n",
+    "# SELECT YOUR CONFIGURATION\n",
+    "# =============================================================================\n",
+    "\n",
+    "# Choose one of the examples above as a starting point\n",
+    "selected_example = qwen_example  # Change this to your preferred example\n",
+    "\n",
+    "print(f\"Selected Example: {selected_example['model_name']}\")\n",
+    "print(f\"Model Path: {selected_example['model_path']}\")\n",
+    "print(f\"Example Max Tokens per GPU: {selected_example['example_max_tokens_per_gpu']:,}\")\n",
+    "print(f\"Example Max Sequence Length: {selected_example['example_max_seq_len']:,}\")\n",
+    "print(f\"Example Batch Size: {selected_example['example_batch_size']:,}\")\n",
+    "print(f\"Example Learning Rate: {selected_example['example_learning_rate']}\")\n",
+    "print(\"\\n\ud83d\udca1 Remember: These are example configurations. Adjust based on your hardware and requirements.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Complete Parameter Reference\n",
+    "\n",
+    "Let's configure all available SFT parameters with detailed explanations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\ud83d\udccb Required Parameters:\n",
+      "  model_path: Path to the model to fine-tune (HuggingFace name or local path)\n",
+      "  data_path: Path to the training data (JSONL format)\n",
+      "  ckpt_output_dir: Directory to save checkpoints\n",
+      "\n",
+      "\ud83c\udfaf Core Training Parameters:\n",
+      "  num_epochs: 3 - Number of training epochs\n",
+      "  effective_batch_size: 1 - Effective batch size for training\n",
+      "  learning_rate: 1e-05 - Learning rate for training\n",
+      "  max_seq_len: 2,048 - Maximum sequence length\n",
+      "  max_tokens_per_gpu: 2,048 - Maximum tokens per GPU in a mini-batch (hard-cap for memory to avoid OOMs). Used to automatically calculate mini-batch size and gradient accumulation to maintain the desired effective_batch_size while staying within memory limits.\n",
+      "\n",
+      "\ud83d\udcbe Data Processing Parameters:\n",
+      "  data_output_dir: '/dev/shm' - Directory to save processed data\n",
+      "  warmup_steps: 100 - Number of warmup steps\n",
+      "\n",
+      "\ud83d\udcbe Checkpointing Parameters:\n",
+      "  save_samples: 0 - Number of samples to save after training (0 disables saving based on sample count)\n",
+      "  checkpoint_at_epoch: True - Whether to checkpoint at each epoch\n",
+      "  accelerate_full_state_at_epoch: True - Whether to save full state at epoch for automatic checkpoint resumption\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# =============================================================================\n",
+    "# COMPLETE SFT PARAMETER CONFIGURATION\n",
+    "# =============================================================================\n",
+    "\n",
+    "# Experiment identification\n",
+    "experiment_name = \"sft_comprehensive_example\"\n",
+    "timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
+    "full_experiment_name = f\"{experiment_name}_{timestamp}\"\n",
+    "\n",
+    "# =============================================================================\n",
+    "# REQUIRED PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "model_path = selected_example[\"model_path\"]  # HuggingFace model name or local path\n",
+    "data_path = \"./test_sft_data.jsonl\"  # Path to training data in JSONL format\n",
+    "ckpt_output_dir = f\"checkpoints/{full_experiment_name}\"  # Where to save checkpoints\n",
+    "\n",
+    "print(\"\ud83d\udccb Required Parameters:\")\n",
+    "print(\"  model_path: Path to the model to fine-tune (HuggingFace name or local path)\")\n",
+    "print(\"  data_path: Path to the training data (JSONL format)\")\n",
+    "print(\"  ckpt_output_dir: Directory to save checkpoints\")\n",
+    "print()\n",
+    "\n",
+    "# =============================================================================\n",
+    "# CORE TRAINING PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "num_epochs = 3  # Number of training epochs\n",
+    "effective_batch_size = selected_example[\"example_batch_size\"]  # Effective batch size for training\n",
+    "learning_rate = selected_example[\"example_learning_rate\"]  # Learning rate for training\n",
+    "max_seq_len = selected_example[\"example_max_seq_len\"]  # Maximum sequence length\n",
+    "max_tokens_per_gpu = selected_example[\"example_max_tokens_per_gpu\"]  # Maximum tokens per GPU in a mini-batch (hard-cap for memory to avoid OOMs)\n",
+    "\n",
+    "print(\"\ud83c\udfaf Core Training Parameters:\")\n",
+    "print(f\"  num_epochs: {num_epochs} - Number of training epochs\")\n",
+    "print(f\"  effective_batch_size: {effective_batch_size} - Effective batch size for training\")\n",
+    "print(f\"  learning_rate: {learning_rate} - Learning rate for training\")\n",
+    "print(f\"  max_seq_len: {max_seq_len:,} - Maximum sequence length\")\n",
+    "print(f\"  max_tokens_per_gpu: {max_tokens_per_gpu:,} - Maximum tokens per GPU in a mini-batch (hard-cap for memory to avoid OOMs). Used to automatically calculate mini-batch size and gradient accumulation to maintain the desired effective_batch_size while staying within memory limits.\")\n",
+    "print()\n",
+    "\n",
+    "# =============================================================================\n",
+    "# DATA AND PROCESSING PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "data_output_dir = \"/dev/shm\"  # Directory to save processed data\n",
+    "warmup_steps = 100  # Number of warmup steps\n",
+    "\n",
+    "print(\"\ud83d\udcbe Data Processing Parameters:\")\n",
+    "print(f\"  data_output_dir: '{data_output_dir}' - Directory to save processed data\")\n",
+    "print(f\"  warmup_steps: {warmup_steps} - Number of warmup steps\")\n",
+    "print()\n",
+    "\n",
+    "# =============================================================================\n",
+    "# CHECKPOINTING PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "save_samples = 0  # Save a checkpoint after every N training samples (0 disables sample-count-based checkpointing)\n",
+    "checkpoint_at_epoch = True  # Whether to checkpoint at each epoch\n",
+    "accelerate_full_state_at_epoch = True  # Whether to save full state at epoch for automatic checkpoint resumption\n",
+    "\n",
+    "print(\"\ud83d\udcbe Checkpointing Parameters:\")\n",
+    "print(f\"  save_samples: {save_samples} - Number of samples to save after training (0 disables saving based on sample count)\")\n",
+    "print(f\"  checkpoint_at_epoch: {checkpoint_at_epoch} - Whether to checkpoint at each epoch\")\n",
+    "print(f\"  accelerate_full_state_at_epoch: {accelerate_full_state_at_epoch} - Whether to save full state at epoch for automatic checkpoint resumption\")\n",
+    "print()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Distributed Training Configuration\n",
+    "\n",
+    "Configure distributed training for both single-node and multi-node setups."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\ud83d\udda5\ufe0f  Distributed Training Parameters:\n",
+      "  Configuration: Development setup with single GPU\n",
+      "  nproc_per_node: 1 - Number of processes (GPUs) per node\n",
+      "  nnodes: 1 - Total number of nodes\n",
+      "  node_rank: 0 - Rank of this node (0 to nnodes-1)\n",
+      "  rdzv_id: 1 - Unique job ID for rendezvous\n",
+      "  rdzv_endpoint: '127.0.0.1:29500' - Master node endpoint for multi-node training\n",
+      "\n",
+      "\ud83d\udcca Resource Calculation:\n",
+      "  Total GPUs: 1 (1 \u00d7 1)\n",
+      "  Effective batch size: 1\n",
+      "  Approximate per-GPU batch size: 1\n",
+      "  (Actual micro-batch size determined automatically by gradient accumulation)\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# =============================================================================\n",
+    "# DISTRIBUTED TRAINING PARAMETERS\n",
+    "# =============================================================================\n",
+    "\n",
+    "# Configuration options for different setups\n",
+    "distributed_configs = {\n",
+    "    \"single_gpu_dev\": {\n",
+    "        \"nproc_per_node\": 1,\n",
+    "        \"nnodes\": 1,\n",
+    "        \"node_rank\": 0,\n",
+    "        \"rdzv_id\": 1,\n",
+    "        \"rdzv_endpoint\": \"127.0.0.1:29500\",\n",
+    "        \"description\": \"Development setup with single GPU\"\n",
+    "    },\n",
+    "    \"single_node_8gpu\": {\n",
+    "        \"nproc_per_node\": 8,\n",
+    "        \"nnodes\": 1,\n",
+    "        \"node_rank\": 0,\n",
+    "        \"rdzv_id\": 100,\n",
+    "        \"rdzv_endpoint\": \"127.0.0.1:29500\",\n",
+    "        \"description\": \"Single node with 8 GPUs\"\n",
+    "    },\n",
+    "    \"multi_node_master\": {\n",
+    "        \"nproc_per_node\": 8,\n",
+    "        \"nnodes\": 4,\n",
+    "        \"node_rank\": 0,\n",
+    "        \"rdzv_id\": 42,\n",
+    "        \"rdzv_endpoint\": \"10.0.0.1:29500\",  # Replace with actual master IP\n",
+    "        \"description\": \"Multi-node master (rank 0) - 4 nodes total\"\n",
+    "    },\n",
+    "    \"multi_node_worker\": {\n",
+    "        \"nproc_per_node\": 8,\n",
+    "        \"nnodes\": 4,\n",
+    "        \"node_rank\": 1,  # Change this for each worker node (1, 2, 3, ...)\n",
+    "        \"rdzv_id\": 42,\n",
+    "        \"rdzv_endpoint\": \"10.0.0.1:29500\",  # Same as master\n",
+    "        \"description\": \"Multi-node worker (rank 1) - change rank for each worker\"\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "# Select your distributed configuration\n",
+    "selected_distributed = \"single_gpu_dev\"  # Change this to match your setup\n",
+    "dist_config = distributed_configs[selected_distributed]\n",
+    "\n",
+    "# Extract distributed training parameters\n",
+    "nproc_per_node = dist_config[\"nproc_per_node\"]  # Number of processes (GPUs) per node\n",
+    "nnodes = dist_config[\"nnodes\"]  # Total number of nodes\n",
+    "node_rank = dist_config[\"node_rank\"]  # Rank of this node (0 to nnodes-1)\n",
+    "rdzv_id = dist_config[\"rdzv_id\"]  # Unique job ID for rendezvous\n",
+    "rdzv_endpoint = dist_config[\"rdzv_endpoint\"]  # Master node endpoint for multi-node training\n",
+    "\n",
+    "# Calculate total resources\n",
+    "total_gpus = nproc_per_node * nnodes\n",
+    "per_gpu_batch_size = max(1, effective_batch_size // total_gpus)  # Floor at 1 so the estimate never reports 0 when GPUs outnumber the batch\n",
+    "\n",
+    "print(\"\ud83d\udda5\ufe0f  Distributed Training Parameters:\")\n",
+    "print(f\"  Configuration: {dist_config['description']}\")\n",
+    "print(f\"  nproc_per_node: {nproc_per_node} - Number of processes (GPUs) per node\")\n",
+    "print(f\"  nnodes: {nnodes} - Total number of nodes\")\n",
+    "print(f\"  node_rank: {node_rank} - Rank of this node (0 to nnodes-1)\")\n",
+    "print(f\"  rdzv_id: {rdzv_id} - Unique job ID for rendezvous\")\n",
+    "print(f\"  rdzv_endpoint: '{rdzv_endpoint}' - Master node endpoint for multi-node training\")\n",
+    "print()\n",
+    "print(\"\ud83d\udcca Resource Calculation:\")\n",
+    "print(f\"  Total GPUs: {total_gpus} ({nproc_per_node} \u00d7 {nnodes})\")\n",
+    "print(f\"  Effective batch size: {effective_batch_size}\")\n",
+    "print(f\"  Approximate per-GPU batch size: {per_gpu_batch_size}\")\n",
+    "print(\"  (Actual micro-batch size determined automatically by gradient accumulation)\")\n",
+    "print()\n",
+    "\n",
+    "# Multi-node setup instructions\n",
+    "if nnodes > 1:\n",
+    "    print(\"\ud83d\udd27 Multi-Node Setup Instructions:\")\n",
+    "    print(f\"  1. Ensure all nodes can reach the master at {rdzv_endpoint}\")\n",
+    "    print(f\"  2. Use the same rdzv_id ({rdzv_id}) on all nodes\")\n",
+    "    print(\"  3. Set node_rank to 0 for master, 1,2,3... for workers\")\n",
+    "    print(\"  4. Start training on ALL nodes simultaneously\")\n",
+    "    print()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Execute Training\n",
+    "\n",
+    "Now let's run the actual SFT training with all our configured parameters."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\ud83d\ude80 Starting SFT Training\n",
+      "============================================================\n",
+      "Experiment: sft_comprehensive_example_20260323_072149\n",
+      "Model: Qwen 3 0.6B\n",
+      "Total GPUs: 1 (1 per node \u00d7 1 nodes)\n",
+      "Configuration: Development setup with single GPU\n",
+      "\n",
+      "\ud83d\udccb Final Training Configuration:\n",
+      "  model_path: /opt/app-root/src/qwen3-0.6b\n",
+      "  data_path: ./test_sft_data.jsonl\n",
+      "  ckpt_output_dir: checkpoints/sft_comprehensive_example_20260323_072149\n",
+      "  num_epochs: 3\n",
+      "  effective_batch_size: 1\n",
+      "  learning_rate: 1e-05\n",
+      "  max_seq_len: 2048\n",
+      "  max_tokens_per_gpu: 2048\n",
+      "  data_output_dir: /dev/shm\n",
+      "  warmup_steps: 100\n",
+      "  save_samples: 0\n",
+      "  checkpoint_at_epoch: True\n",
+      "  accelerate_full_state_at_epoch: True\n",
+      "  nproc_per_node: 1\n",
+      "  nnodes: 1\n",
+      "  node_rank: 0\n",
+      "  rdzv_id: 1\n",
+      "  rdzv_endpoint: 127.0.0.1:29500\n",
+      "  disable_flash_attn: True\n",
+      "\n",
+      "============================================================\n",
+      "\u23f3 Training starting...\n",
+      "============================================================\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[07:21:52] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> Starting training setup<span style=\"color: #808000; text-decoration-color: #808000\">...</span>                                                       <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">main_ds.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#591\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">591</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m[07:21:52]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m Starting training setup\u001b[33m...\u001b[0m                                                       \u001b]8;id=399409;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=389230;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#591\u001b\\\u001b[2m591\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #808000; text-decoration-color: #808000\">WARNING </span> num_proc must be &lt;= <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>. Reducing num_proc to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> for dataset of size <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>.      <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">arrow_dataset.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">3123</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m.      \u001b]8;id=592580;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=152011;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[07:21:55] </span><span style=\"color: #808000; text-decoration-color: #808000\">WARNING </span> num_proc must be &lt;= <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>. Reducing num_proc to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> for dataset of size <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>.      <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">arrow_dataset.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">3123</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m[07:21:55]\u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m.      \u001b]8;id=209439;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=307155;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "9ca4f757279149fdb5e5bbeb03e9bfd7",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Converting samples into input_ids and labels... (num_proc=3): 100%|##########| 3/3 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[07:22:07] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> ten largest length percentiles:                                            <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1355\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1355</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m[07:22:07]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m ten largest length percentiles:                                            \u001b]8;id=670115;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=495753;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1355\u001b\\\u001b[2m1355\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 90th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">70</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 90th: \u001b[1;36m70\u001b[0m                                                          \u001b]8;id=657600;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=118760;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 91th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">70</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 91th: \u001b[1;36m70\u001b[0m                                                          \u001b]8;id=996028;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=273551;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 92th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">71</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 92th: \u001b[1;36m71\u001b[0m                                                          \u001b]8;id=658621;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=94733;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 93th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">71</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 93th: \u001b[1;36m71\u001b[0m                                                          \u001b]8;id=246082;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=931645;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 94th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">72</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 94th: \u001b[1;36m72\u001b[0m                                                          \u001b]8;id=452894;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=347117;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 95th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">73</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 95th: \u001b[1;36m73\u001b[0m                                                          \u001b]8;id=626760;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=619372;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 96th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">73</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 96th: \u001b[1;36m73\u001b[0m                                                          \u001b]8;id=265428;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=452333;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 97th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">74</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 97th: \u001b[1;36m74\u001b[0m                                                          \u001b]8;id=841787;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=465368;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 98th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">74</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 98th: \u001b[1;36m74\u001b[0m                                                          \u001b]8;id=252732;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=895643;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 99th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">75</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 99th: \u001b[1;36m75\u001b[0m                                                          \u001b]8;id=731576;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=8096;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 100th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">76</span>                                                         <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1358</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 100th: \u001b[1;36m76\u001b[0m                                                         \u001b]8;id=559594;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=56715;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1358\u001b\\\u001b[2m1358\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> at <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2048</span> max sequence length, the number of samples to be dropped is <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0</span>      <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1362\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1362</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m at \u001b[1;36m2048\u001b[0m max sequence length, the number of samples to be dropped is \u001b[1;36m0\u001b[0m      \u001b]8;id=663979;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=152605;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1362\u001b\\\u001b[2m1362\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.00</span> of total<span style=\"font-weight: bold\">)</span>                                                            <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1367\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1367</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m \u001b[1m(\u001b[0m\u001b[1;36m0.00\u001b[0m of total\u001b[1m)\u001b[0m                                                            \u001b]8;id=915262;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=203311;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1367\u001b\\\u001b[2m1367\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 0th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">28</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 0th: \u001b[1;36m28\u001b[0m                                                           \u001b]8;id=674658;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=54929;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 1th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">28</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 1th: \u001b[1;36m28\u001b[0m                                                           \u001b]8;id=856039;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=914854;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 2th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">28</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 2th: \u001b[1;36m28\u001b[0m                                                           \u001b]8;id=556445;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=385141;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 3th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">29</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 3th: \u001b[1;36m29\u001b[0m                                                           \u001b]8;id=738851;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=147455;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 4th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">29</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 4th: \u001b[1;36m29\u001b[0m                                                           \u001b]8;id=628289;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=34700;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 5th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">29</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 5th: \u001b[1;36m29\u001b[0m                                                           \u001b]8;id=243709;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=947817;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 6th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">30</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 6th: \u001b[1;36m30\u001b[0m                                                           \u001b]8;id=318583;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=443338;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 7th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">30</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 7th: \u001b[1;36m30\u001b[0m                                                           \u001b]8;id=143291;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=65864;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 8th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">30</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 8th: \u001b[1;36m30\u001b[0m                                                           \u001b]8;id=411537;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=957882;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 9th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">31</span>                                                           <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 9th: \u001b[1;36m31\u001b[0m                                                           \u001b]8;id=831237;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=856583;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> quantile 10th: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">31</span>                                                          <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1378</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m quantile 10th: \u001b[1;36m31\u001b[0m                                                          \u001b]8;id=612081;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=731268;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1378\u001b\\\u001b[2m1378\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> at <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">20</span> min sequence length, the number of samples to be dropped is <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0</span>        <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1382\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1382</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m at \u001b[1;36m20\u001b[0m min sequence length, the number of samples to be dropped is \u001b[1;36m0\u001b[0m        \u001b]8;id=102879;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=675812;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1382\u001b\\\u001b[2m1382\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #808000; text-decoration-color: #808000\">WARNING </span> num_proc must be &lt;= <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>. Reducing num_proc to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> for dataset of size <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>.      <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">arrow_dataset.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">3123</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m.      \u001b]8;id=881091;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=320766;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "037b02187d454cd880a6d947d3353fb0",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Filter (num_proc=3):   0%|          | 0/3 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> Samples Previews<span style=\"color: #808000; text-decoration-color: #808000\">...</span>                                                        <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">data_process.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1392\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1392</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m Samples Previews\u001b[33m...\u001b[0m                                                        \u001b]8;id=64443;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py\u001b\\\u001b[2mdata_process.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=282143;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/data_process.py#1392\u001b\\\u001b[2m1392\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #808000; text-decoration-color: #808000\">WARNING </span> num_proc must be &lt;= <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>. Reducing num_proc to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> for dataset of size <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>.      <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">arrow_dataset.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">3123</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m.      \u001b]8;id=75781;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=607291;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "6a16e69d33d54d0da44ea627def0e9cd",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Filtering out pretraining samples (num_proc=3):   0%|          | 0/3 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #808000; text-decoration-color: #808000\">WARNING </span> num_proc must be &lt;= <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>. Reducing num_proc to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> for dataset of size <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>.      <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">arrow_dataset.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">3123</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m.      \u001b]8;id=397717;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=196121;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "f72a54e41a794200833ea3262ce90f54",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Filtering out pretraining samples (num_proc=3):   0%|          | 0/3 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\u001b[35mOriginal Input: <|im_start|>user\n",
+      "What is machine learning?<|im_end|>\n",
+      "<|im_start|>assistant\n",
+      "<think>\n",
+      "\n",
+      "</think>\n",
+      "\n",
+      "Machine learning is a subset of artificial intelligence...<|im_end|>\n",
+      "\n",
+      "\u001b[0m\n",
+      "\u001b[33mInstruction ex sample 1: <|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|>Machine learning is a subset of artificial intelligence...<|im_end|><|MASK|>\u001b[0m\n",
+      "\u001b[35mOriginal Input: <|im_start|>system\n",
+      "You are a coding assistant.<|im_end|>\n",
+      "<|im_start|>user\n",
+      "Write a Python function to calculate factorial<|im_end|>\n",
+      "<|im_start|>assistant\n",
+      "<think>\n",
+      "\n",
+      "</think>\n",
+      "\n",
+      "Here's a Python function to calculate factorial:\n",
+      "\n",
+      "```python\n",
+      "def factorial(n):\n",
+      "    if n == 0 or n == 1:\n",
+      "        return 1\n",
+      "    return n * factorial(n - 1)\n",
+      "```<|im_end|>\n",
+      "\n",
+      "\u001b[0m\n",
+      "\u001b[33mInstruction ex sample 2: <|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|>Here's a Python function to calculate factorial:\n",
+      "\n",
+      "```python\n",
+      "def factorial(n):\n",
+      "    if n == 0 or n == 1:\n",
+      "        return 1\n",
+      "    return n * factorial(n - 1)\n",
+      "```<|im_end|><|MASK|>\u001b[0m\n",
+      "\u001b[35mOriginal Input: <|im_start|>system\n",
+      "You are a helpful assistant.<|im_end|>\n",
+      "<|im_start|>user\n",
+      "Hello, how are you?<|im_end|>\n",
+      "<|im_start|>assistant\n",
+      "<think>\n",
+      "\n",
+      "</think>\n",
+      "\n",
+      "I'm doing well, thank you! How can I help you today?<|im_end|>\n",
+      "\n",
+      "\u001b[0m\n",
+      "\u001b[33mInstruction ex sample 3: <|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|><|MASK|>I'm doing well, thank you! How can I help you today?<|im_end|><|MASK|>\u001b[0m\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[07:22:08] </span><span style=\"color: #808000; text-decoration-color: #808000\">WARNING </span> num_proc must be &lt;= <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>. Reducing num_proc to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> for dataset of size <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>.      <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">arrow_dataset.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">3123</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m[07:22:08]\u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m num_proc must be <= \u001b[1;36m3\u001b[0m. Reducing num_proc to \u001b[1;36m3\u001b[0m for dataset of size \u001b[1;36m3\u001b[0m.      \u001b]8;id=60262;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py\u001b\\\u001b[2marrow_dataset.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=179809;file:///opt/app-root/lib64/python3.12/site-packages/datasets/arrow_dataset.py#3123\u001b\\\u001b[2m3123\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "52504f1d6ee449c6b4e2b8a216b127fe",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Validating unmask tokens not in data (num_proc=3):   0%|          | 0/3 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "95f860ee16624fa5a74029005869986b",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[07:22:09] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> Running training command as subprocess: torchrun --nproc-per-<span style=\"color: #808000; text-decoration-color: #808000\">node</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> --<span style=\"color: #808000; text-decoration-color: #808000\">nnodes</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>   <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">main_ds.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#794\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">794</span></a>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --node-<span style=\"color: #808000; text-decoration-color: #808000\">rank</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0</span> --rdzv-<span style=\"color: #808000; text-decoration-color: #808000\">id</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> --rdzv-<span style=\"color: #808000; text-decoration-color: #808000\">endpoint</span>=<span style=\"color: #00ff00; text-decoration-color: #00ff00; font-weight: bold\">127</span><span style=\"color: #00ff00; text-decoration-color: #00ff00; font-weight: bold\">.0.0.1</span>:<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">29500</span>                        <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         <span style=\"color: #800080; text-decoration-color: #800080\">/opt/app-root/lib64/python3.12/site-packages/instructlab/training/</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff\">main_ds.py</span>     <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --<span style=\"color: #808000; text-decoration-color: #808000\">model_name_or_path</span>=<span style=\"color: #800080; text-decoration-color: #800080\">/opt/app-root/src/</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff\">qwen3-0.6b</span>                                <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --<span style=\"color: #808000; text-decoration-color: #808000\">data_path</span>=<span style=\"color: #800080; text-decoration-color: #800080\">/dev/shm/</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff\">data.jsonl</span>                                                  <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --<span style=\"color: #808000; text-decoration-color: #808000\">output_dir</span>=<span style=\"color: #800080; text-decoration-color: #800080\">checkpoints</span>/sft_comprehensive_example_20260323_072149               <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --<span style=\"color: #808000; text-decoration-color: #808000\">num_epochs</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> --<span style=\"color: #808000; text-decoration-color: #808000\">effective_batch_size</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> --<span style=\"color: #808000; text-decoration-color: #808000\">learning_rate</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1e-05</span>                    <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --<span style=\"color: #808000; text-decoration-color: #808000\">num_warmup_steps</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">100</span> --<span style=\"color: #808000; text-decoration-color: #808000\">save_samples</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0</span> --<span style=\"color: #808000; text-decoration-color: #808000\">log_level</span>=<span style=\"color: #800080; text-decoration-color: #800080\">INFO</span> --<span style=\"color: #808000; text-decoration-color: #808000\">max_batch_len</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2048</span>    <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --<span style=\"color: #808000; text-decoration-color: #808000\">seed</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">42</span> --<span style=\"color: #808000; text-decoration-color: #808000\">adamw_weight_decay</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.0</span> --<span style=\"color: #808000; text-decoration-color: #808000\">adamw_beta1</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.9</span> --<span style=\"color: #808000; text-decoration-color: #808000\">adamw_beta2</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.95</span>          <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --<span style=\"color: #808000; text-decoration-color: #808000\">adamw_eps</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1e-08</span> --checkpoint_at_epoch --accelerate_full_state_at_epoch         <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --disable_flash_attn --<span style=\"color: #808000; text-decoration-color: #808000\">distributed_training_framework</span>=<span style=\"color: #800080; text-decoration-color: #800080\">fsdp</span>                       <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span>         --<span style=\"color: #808000; text-decoration-color: #808000\">fsdp_sharding_strategy</span>=<span style=\"color: #800080; text-decoration-color: #800080\">HYBRID_SHARD</span>                                            <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">              </span>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m[07:22:09]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m Running training command as subprocess: torchrun --nproc-per-\u001b[33mnode\u001b[0m=\u001b[1;36m1\u001b[0m --\u001b[33mnnodes\u001b[0m=\u001b[1;36m1\u001b[0m   \u001b]8;id=725937;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=274171;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#794\u001b\\\u001b[2m794\u001b[0m\u001b]8;;\u001b\\\n",
+       "\u001b[2;36m           \u001b[0m         --node-\u001b[33mrank\u001b[0m=\u001b[1;36m0\u001b[0m --rdzv-\u001b[33mid\u001b[0m=\u001b[1;36m1\u001b[0m --rdzv-\u001b[33mendpoint\u001b[0m=\u001b[1;92m127\u001b[0m\u001b[1;92m.0.0.1\u001b[0m:\u001b[1;36m29500\u001b[0m                        \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         \u001b[35m/opt/app-root/lib64/python3.12/site-packages/instructlab/training/\u001b[0m\u001b[95mmain_ds.py\u001b[0m     \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         --\u001b[33mmodel_name_or_path\u001b[0m=\u001b[35m/opt/app-root/src/\u001b[0m\u001b[95mqwen3-0.6b\u001b[0m                                \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         --\u001b[33mdata_path\u001b[0m=\u001b[35m/dev/shm/\u001b[0m\u001b[95mdata.jsonl\u001b[0m                                                  \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         --\u001b[33moutput_dir\u001b[0m=\u001b[35mcheckpoints\u001b[0m/sft_comprehensive_example_20260323_072149               \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         --\u001b[33mnum_epochs\u001b[0m=\u001b[1;36m3\u001b[0m --\u001b[33meffective_batch_size\u001b[0m=\u001b[1;36m1\u001b[0m --\u001b[33mlearning_rate\u001b[0m=\u001b[1;36m1e\u001b[0m\u001b[1;36m-05\u001b[0m                    \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         --\u001b[33mnum_warmup_steps\u001b[0m=\u001b[1;36m100\u001b[0m --\u001b[33msave_samples\u001b[0m=\u001b[1;36m0\u001b[0m --\u001b[33mlog_level\u001b[0m=\u001b[35mINFO\u001b[0m --\u001b[33mmax_batch_len\u001b[0m=\u001b[1;36m2048\u001b[0m    \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         --\u001b[33mseed\u001b[0m=\u001b[1;36m42\u001b[0m --\u001b[33madamw_weight_decay\u001b[0m=\u001b[1;36m0\u001b[0m\u001b[1;36m.0\u001b[0m --\u001b[33madamw_beta1\u001b[0m=\u001b[1;36m0\u001b[0m\u001b[1;36m.9\u001b[0m --\u001b[33madamw_beta2\u001b[0m=\u001b[1;36m0\u001b[0m\u001b[1;36m.95\u001b[0m          \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         --\u001b[33madamw_eps\u001b[0m=\u001b[1;36m1e\u001b[0m\u001b[1;36m-08\u001b[0m --checkpoint_at_epoch --accelerate_full_state_at_epoch         \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         --disable_flash_attn --\u001b[33mdistributed_training_framework\u001b[0m=\u001b[35mfsdp\u001b[0m                       \u001b[2m              \u001b[0m\n",
+       "\u001b[2;36m           \u001b[0m         --\u001b[33mfsdp_sharding_strategy\u001b[0m=\u001b[35mHYBRID_SHARD\u001b[0m                                            \u001b[2m              \u001b[0m\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py:22: UserWarning: DeepSpeed CPU Optimizer is not available. Some features may be unavailable.\n",
+      "  warnings.warn(\n",
+      "/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py:34: UserWarning: DeepSpeed is not available. Some features may be unavailable.\n",
+      "  warnings.warn(\n",
+      "Loading weights: 100%|| 311/311 [00:04<00:00, 64.15it/s]\n",
+      "The tied weights mapping and config for this model specifies to tie model.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning\n",
+      "Generating train split: 3 examples [00:00, 565.78 examples/s]\n",
+      "\u001b[2;36m[07:22:34]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m \u001b[33mnum_gpus\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mavg_sample_len\u001b[0m=\u001b[1;36m50\u001b[0m\u001b[1;36m.000\u001b[0m,            \u001b]8;id=234053;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=146316;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#499\u001b\\\u001b[2m499\u001b[0m\u001b]8;;\u001b\\\n",
+      "\u001b[2;36m           \u001b[0m         \u001b[33meffective_batch_size\u001b[0m=\u001b[1;36m1\u001b[0m,                       \u001b[2m              \u001b[0m\n",
+      "\u001b[2;36m           \u001b[0m         \u001b[33mmax_batch_len_per_gpu\u001b[0m=\u001b[1;36m2048\u001b[0m,                   \u001b[2m              \u001b[0m\n",
+      "\u001b[2;36m           \u001b[0m         \u001b[33mpacking_max_batch_len\u001b[0m=\u001b[1;36m2048\u001b[0m, \u001b[33mnum_batches\u001b[0m=\u001b[1;36m3\u001b[0m,    \u001b[2m              \u001b[0m\n",
+      "\u001b[2;36m           \u001b[0m         \u001b[33mavg_samples_per_batch\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1;36m.000\u001b[0m, \u001b[33mtotal_samples\u001b[0m=\u001b[1;36m3\u001b[0m  \u001b[2m              \u001b[0m\n",
+      "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m sharding_strategy is deprecated in favor \u001b]8;id=91161;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/dataclasses.py\u001b\\\u001b[2mdataclasses.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=619176;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/dataclasses.py#1962\u001b\\\u001b[2m1962\u001b[0m\u001b]8;;\u001b\\\n",
+      "\u001b[2;36m           \u001b[0m         of reshard_after_forward. This will be   \u001b[2m                   \u001b[0m\n",
+      "\u001b[2;36m           \u001b[0m         removed in a future version of           \u001b[2m                   \u001b[0m\n",
+      "\u001b[2;36m           \u001b[0m         Accelerate.                              \u001b[2m                   \u001b[0m\n",
+      "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m Detected kernel version \u001b[1;36m3.10\u001b[0m.\u001b[1;36m0\u001b[0m, which is below  \u001b]8;id=529903;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/other.py\u001b\\\u001b[2mother.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=631262;file:///opt/app-root/lib64/python3.12/site-packages/accelerate/utils/other.py#513\u001b\\\u001b[2m513\u001b[0m\u001b]8;;\u001b\\\n",
+      "\u001b[2;36m           \u001b[0m         the recommended minimum of \u001b[1;36m5.5\u001b[0m.\u001b[1;36m0\u001b[0m; this can      \u001b[2m            \u001b[0m\n",
+      "\u001b[2;36m           \u001b[0m         cause the process to hang. It is recommended to \u001b[2m            \u001b[0m\n",
+      "\u001b[2;36m           \u001b[0m         upgrade the kernel to the minimum version or    \u001b[2m            \u001b[0m\n",
+      "\u001b[2;36m           \u001b[0m         higher.                                         \u001b[2m            \u001b[0m\n",
+      "/opt/app-root/lib64/python3.12/site-packages/torch/distributed/fsdp/_init_utils.py:444: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.HYBRID_SHARD since the world size is 1.\n",
+      "  warnings.warn(\n",
+      "/opt/app-root/lib64/python3.12/site-packages/accelerate/accelerator.py:1992: UserWarning: Upcasted low precision parameters in Qwen3ForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight.\n",
+      "  warnings.warn(\n",
+      "/opt/app-root/lib64/python3.12/site-packages/accelerate/accelerator.py:1992: UserWarning: Upcasted low precision parameters in Qwen3DecoderLayer because mixed precision turned on in FSDP. Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, self_attn.q_norm.weight, self_attn.k_norm.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight.\n",
+      "  warnings.warn(\n",
+      "/opt/app-root/lib64/python3.12/site-packages/accelerate/accelerator.py:1998: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.\n",
+      "  warnings.warn(\n",
+      "Epoch 0:   0%|          | 0/3 [00:00<?, ?it/s][HAMI-core ERROR (pid:4849 thread=139739523683904 allocator.c:54)]: Device 0 OOM 8619558912 / 8589934592\n",
+      "[HAMI-core ERROR (pid:4849 thread=139739523683904 allocator.c:54)]: Device 0 OOM 8860731392 / 8589934592\n",
+      "\u001b[2;36m[07:22:48]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m Epoch: \u001b[1;36m0\u001b[0m, Step: \u001b[1;36m1\u001b[0m, Rank: \u001b[1;36m0\u001b[0m, loss = \u001b[1;36m0.601831\u001b[0m,  \u001b]8;id=735392;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=571412;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#231\u001b\\\u001b[2m231\u001b[0m\u001b]8;;\u001b\\\n",
+      "\u001b[2;36m           \u001b[0m         grad_accum_steps = \u001b[1;36m1\u001b[0m                          \u001b[2m              \u001b[0m\n",
+      "[HAMI-core ERROR (pid:4849 thread=139741838102784 allocator.c:54)]: Device 0 OOM 8825079808 / 8589934592\n",
+      "[HAMI-core ERROR (pid:4849 thread=139741838102784 allocator.c:54)]: Device 0 OOM 9447933952 / 8589934592\n",
+      "[HAMI-core ERROR (pid:4849 thread=139741838102784 allocator.c:54)]: Device 0 OOM 9447933952 / 8589934592\n",
+      "[rank0]: Traceback (most recent call last):\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\", line 1111, in <module>\n",
+      "[rank0]:     main(args)\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\", line 558, in main\n",
+      "[rank0]:     train(\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\", line 236, in train\n",
+      "[rank0]:     accelerator.take_optimizer_step()\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/instructlab/training/accelerator.py\", line 295, in take_optimizer_step\n",
+      "[rank0]:     self.optimizer.step()\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/accelerate/optimizer.py\", line 179, in step\n",
+      "[rank0]:     self.optimizer.step(closure)\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/lr_scheduler.py\", line 140, in wrapper\n",
+      "[rank0]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)\n",
+      "[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/optimizer.py\", line 493, in wrapper\n",
+      "[rank0]:     out = func(*args, **kwargs)\n",
+      "[rank0]:           ^^^^^^^^^^^^^^^^^^^^^\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/optimizer.py\", line 91, in _use_grad\n",
+      "[rank0]:     ret = func(self, *args, **kwargs)\n",
+      "[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/adamw.py\", line 232, in step\n",
+      "[rank0]:     has_complex = self._init_group(\n",
+      "[rank0]:                   ^^^^^^^^^^^^^^^^^\n",
+      "[rank0]:   File \"/opt/app-root/lib64/python3.12/site-packages/torch/optim/adamw.py\", line 175, in _init_group\n",
+      "[rank0]:     state[\"exp_avg_sq\"] = torch.zeros_like(\n",
+      "[rank0]:                           ^^^^^^^^^^^^^^^^^\n",
+      "[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.16 GiB. GPU 0 has a total capacity of 8.00 "
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "IOPub message rate exceeded.\n",
+      "The Jupyter server will temporarily stop sending output\n",
+      "to the client in order to avoid crashing it.\n",
+      "To change this limit, set the config variable\n",
+      "`--ServerApp.iopub_msg_rate_limit`.\n",
+      "\n",
+      "Current values:\n",
+      "ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+      "ServerApp.rate_limit_window=3.0 (secs)\n",
+      "\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "22:50.846035625 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())\n",
+      "E0323 07:22:53.501000 4822 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 4849) of binary: /opt/app-root/bin/python3\n",
+      "Traceback (most recent call last):\n",
+      "  File \"/opt/app-root/bin/torchrun\", line 10, in <module>\n",
+      "    sys.exit(main())\n",
+      "             ^^^^^^\n",
+      "  File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n",
+      "    return f(*args, **kwargs)\n",
+      "           ^^^^^^^^^^^^^^^^^^\n",
+      "  File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/run.py\", line 918, in main\n",
+      "    run(args)\n",
+      "  File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/run.py\", line 909, in run\n",
+      "    elastic_launch(\n",
+      "  File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/launcher/api.py\", line 138, in __call__\n",
+      "    return launch_agent(self._config, self._entrypoint, list(args))\n",
+      "           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
+      "  File \"/opt/app-root/lib64/python3.12/site-packages/torch/distributed/launcher/api.py\", line 269, in launch_agent\n",
+      "    raise ChildFailedError(\n",
+      "torch.distributed.elastic.multiprocessing.errors.ChildFailedError: \n",
+      "============================================================\n",
+      "/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py FAILED\n",
+      "------------------------------------------------------------\n",
+      "Failures:\n",
+      "  <NO_OTHER_FAILURES>\n",
+      "------------------------------------------------------------\n",
+      "Root Cause (first observed failure):\n",
+      "[0]:\n",
+      "  time      : 2026-03-23_07:22:53\n",
+      "  host      : ws-wy-training-hub-ldnjr-0\n",
+      "  rank      : 0 (local_rank: 0)\n",
+      "  exitcode  : 1 (pid: 4849)\n",
+      "  error_file: <N/A>\n",
+      "  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html\n",
+      "============================================================\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[07:22:55] </span><span style=\"color: #800000; text-decoration-color: #800000; font-weight: bold\">ERROR   </span> Training subprocess has not exited yet. Sending SIGTERM. Process code: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>         <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">main_ds.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#824\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">824</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m[07:22:55]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31mERROR   \u001b[0m Training subprocess has not exited yet. Sending SIGTERM. Process code: \u001b[1;36m1\u001b[0m         \u001b]8;id=361092;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=987123;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#824\u001b\\\u001b[2m824\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">           </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO    </span> Waiting for process to exit, 60s<span style=\"color: #808000; text-decoration-color: #808000\">...</span>                                              <a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">main_ds.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#830\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">830</span></a>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[2;36m          \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m Waiting for process to exit, 60s\u001b[33m...\u001b[0m                                              \u001b]8;id=771553;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py\u001b\\\u001b[2mmain_ds.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=303331;file:///opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py#830\u001b\\\u001b[2m830\u001b[0m\u001b]8;;\u001b\\\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "============================================================\n",
+      "\u274c Training failed after 1.1 minutes\n",
+      "Error: Suffered a failure during distributed training. Please see the training logs for more context.\n",
+      "============================================================\n",
+      "\n",
+      "\ud83d\udd0d Quick Troubleshooting Checklist:\n",
+      "  \u25a1 Check that model_path exists or is a valid HuggingFace model name\n",
+      "  \u25a1 Verify data_path points to valid JSONL file\n",
+      "  \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\n",
+      "  \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\n",
+      "  \u25a1 For multi-node: verify network connectivity and endpoints\n",
+      "  \u25a1 Check that all file paths are accessible from the training process\n"
+     ]
+    },
+    {
+     "ename": "RuntimeError",
+     "evalue": "Suffered a failure during distributed training. Please see the training logs for more context.",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+      "\u001b[31mRuntimeError\u001b[39m                              Traceback (most recent call last)",
+      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[12]\u001b[39m\u001b[32m, line 59\u001b[39m\n\u001b[32m     56\u001b[39m start_time = time.time()\n\u001b[32m     58\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m---> \u001b[39m\u001b[32m59\u001b[39m     result = \u001b[43msft\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mtraining_params\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m     61\u001b[39m     end_time = time.time()\n\u001b[32m     62\u001b[39m     duration = end_time - start_time\n",
+      "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/training_hub/algorithms/sft.py:355\u001b[39m, in \u001b[36msft\u001b[39m\u001b[34m(model_path, data_path, ckpt_output_dir, backend, num_epochs, effective_batch_size, learning_rate, max_seq_len, max_tokens_per_gpu, data_output_dir, save_samples, warmup_steps, accelerate_full_state_at_epoch, checkpoint_at_epoch, is_pretraining, block_size, document_column_name, beta1, beta2, eps, weight_decay, nproc_per_node, nnodes, node_rank, rdzv_id, rdzv_endpoint, master_addr, master_port, wandb_project, wandb_entity, wandb_run_name, tensorboard_log_dir, mlflow_tracking_uri, mlflow_experiment_name, mlflow_run_name, **kwargs)\u001b[39m\n\u001b[32m    352\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01m.\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m create_algorithm\n\u001b[32m    354\u001b[39m algorithm = create_algorithm(\u001b[33m'\u001b[39m\u001b[33msft\u001b[39m\u001b[33m'\u001b[39m, backend)\n\u001b[32m--> \u001b[39m\u001b[32m355\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43malgorithm\u001b[49m\u001b[43m.\u001b[49m\u001b[43mtrain\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m    356\u001b[39m \u001b[43m    \u001b[49m\u001b[43mmodel_path\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmodel_path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    357\u001b[39m \u001b[43m    \u001b[49m\u001b[43mdata_path\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    358\u001b[39m \u001b[43m    \u001b[49m\u001b[43mckpt_output_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mckpt_output_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    359\u001b[39m \u001b[43m    \u001b[49m\u001b[43mnum_epochs\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnum_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    360\u001b[39m \u001b[43m    
\u001b[49m\u001b[43meffective_batch_size\u001b[49m\u001b[43m=\u001b[49m\u001b[43meffective_batch_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    361\u001b[39m \u001b[43m    \u001b[49m\u001b[43mlearning_rate\u001b[49m\u001b[43m=\u001b[49m\u001b[43mlearning_rate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    362\u001b[39m \u001b[43m    \u001b[49m\u001b[43mmax_seq_len\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmax_seq_len\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    363\u001b[39m \u001b[43m    \u001b[49m\u001b[43mmax_tokens_per_gpu\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmax_tokens_per_gpu\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    364\u001b[39m \u001b[43m    \u001b[49m\u001b[43mdata_output_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_output_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    365\u001b[39m \u001b[43m    \u001b[49m\u001b[43msave_samples\u001b[49m\u001b[43m=\u001b[49m\u001b[43msave_samples\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    366\u001b[39m \u001b[43m    \u001b[49m\u001b[43mwarmup_steps\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwarmup_steps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    367\u001b[39m \u001b[43m    \u001b[49m\u001b[43maccelerate_full_state_at_epoch\u001b[49m\u001b[43m=\u001b[49m\u001b[43maccelerate_full_state_at_epoch\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    368\u001b[39m \u001b[43m    \u001b[49m\u001b[43mcheckpoint_at_epoch\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcheckpoint_at_epoch\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    369\u001b[39m \u001b[43m    \u001b[49m\u001b[43mis_pretraining\u001b[49m\u001b[43m=\u001b[49m\u001b[43mis_pretraining\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    370\u001b[39m \u001b[43m    \u001b[49m\u001b[43mblock_size\u001b[49m\u001b[43m=\u001b[49m\u001b[43mblock_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    371\u001b[39m \u001b[43m    \u001b[49m\u001b[43mdocument_column_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdocument_column_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    
372\u001b[39m \u001b[43m    \u001b[49m\u001b[43mbeta1\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbeta1\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    373\u001b[39m \u001b[43m    \u001b[49m\u001b[43mbeta2\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbeta2\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    374\u001b[39m \u001b[43m    \u001b[49m\u001b[43meps\u001b[49m\u001b[43m=\u001b[49m\u001b[43meps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    375\u001b[39m \u001b[43m    \u001b[49m\u001b[43mweight_decay\u001b[49m\u001b[43m=\u001b[49m\u001b[43mweight_decay\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    376\u001b[39m \u001b[43m    \u001b[49m\u001b[43mnproc_per_node\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnproc_per_node\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    377\u001b[39m \u001b[43m    \u001b[49m\u001b[43mnnodes\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnnodes\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    378\u001b[39m \u001b[43m    \u001b[49m\u001b[43mnode_rank\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnode_rank\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    379\u001b[39m \u001b[43m    \u001b[49m\u001b[43mrdzv_id\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrdzv_id\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    380\u001b[39m \u001b[43m    \u001b[49m\u001b[43mrdzv_endpoint\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrdzv_endpoint\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    381\u001b[39m \u001b[43m    \u001b[49m\u001b[43mmaster_addr\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmaster_addr\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    382\u001b[39m \u001b[43m    \u001b[49m\u001b[43mmaster_port\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmaster_port\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    383\u001b[39m \u001b[43m    \u001b[49m\u001b[43mwandb_project\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwandb_project\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    384\u001b[39m \u001b[43m    
\u001b[49m\u001b[43mwandb_entity\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwandb_entity\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    385\u001b[39m \u001b[43m    \u001b[49m\u001b[43mwandb_run_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mwandb_run_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    386\u001b[39m \u001b[43m    \u001b[49m\u001b[43mtensorboard_log_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtensorboard_log_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    387\u001b[39m \u001b[43m    \u001b[49m\u001b[43mmlflow_tracking_uri\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmlflow_tracking_uri\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    388\u001b[39m \u001b[43m    \u001b[49m\u001b[43mmlflow_experiment_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmlflow_experiment_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    389\u001b[39m \u001b[43m    \u001b[49m\u001b[43mmlflow_run_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmlflow_run_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    390\u001b[39m \u001b[43m    \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\n\u001b[32m    391\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
+      "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/training_hub/algorithms/sft.py:213\u001b[39m, in \u001b[36mSFTAlgorithm.train\u001b[39m\u001b[34m(self, model_path, data_path, ckpt_output_dir, num_epochs, effective_batch_size, learning_rate, max_seq_len, max_tokens_per_gpu, data_output_dir, save_samples, warmup_steps, accelerate_full_state_at_epoch, checkpoint_at_epoch, is_pretraining, block_size, document_column_name, beta1, beta2, eps, weight_decay, nproc_per_node, nnodes, node_rank, rdzv_id, rdzv_endpoint, master_addr, master_port, wandb_project, wandb_entity, wandb_run_name, tensorboard_log_dir, mlflow_tracking_uri, mlflow_experiment_name, mlflow_run_name, **kwargs)\u001b[39m\n\u001b[32m    209\u001b[39m         params[key] = value\n\u001b[32m    211\u001b[39m params.update(kwargs)\n\u001b[32m--> \u001b[39m\u001b[32m213\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mbackend\u001b[49m\u001b[43m.\u001b[49m\u001b[43mexecute_training\u001b[49m\u001b[43m(\u001b[49m\u001b[43mparams\u001b[49m\u001b[43m)\u001b[49m\n",
+      "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/training_hub/algorithms/sft.py:70\u001b[39m, in \u001b[36mInstructLabTrainingSFTBackend.execute_training\u001b[39m\u001b[34m(self, algorithm_params)\u001b[39m\n\u001b[32m     67\u001b[39m torchrun_args = TorchrunArgs(**final_torchrun_params)\n\u001b[32m     69\u001b[39m \u001b[38;5;66;03m# Execute training\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m70\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mrun_training\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m     71\u001b[39m \u001b[43m    \u001b[49m\u001b[43mtorch_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtorchrun_args\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m     72\u001b[39m \u001b[43m    \u001b[49m\u001b[43mtrain_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtraining_args\u001b[49m\n\u001b[32m     73\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
+      "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/instructlab/training/__init__.py:41\u001b[39m, in \u001b[36mrun_training\u001b[39m\u001b[34m(torch_args, train_args)\u001b[39m\n\u001b[32m     38\u001b[39m \u001b[38;5;66;03m# Local\u001b[39;00m\n\u001b[32m     39\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01m.\u001b[39;00m\u001b[34;01mmain_ds\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m run_training\n\u001b[32m---> \u001b[39m\u001b[32m41\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mrun_training\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtorch_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtorch_args\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtrain_args\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtrain_args\u001b[49m\u001b[43m)\u001b[49m\n",
+      "\u001b[36mFile \u001b[39m\u001b[32m/opt/app-root/lib64/python3.12/site-packages/instructlab/training/main_ds.py:841\u001b[39m, in \u001b[36mrun_training\u001b[39m\u001b[34m(torch_args, train_args)\u001b[39m\n\u001b[32m    839\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m interrupt\n\u001b[32m    840\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m failure:\n\u001b[32m--> \u001b[39m\u001b[32m841\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\n\u001b[32m    842\u001b[39m         \u001b[33m\"\u001b[39m\u001b[33mSuffered a failure during distributed training. Please see the training logs for more context.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m    843\u001b[39m     )\n",
+      "\u001b[31mRuntimeError\u001b[39m: Suffered a failure during distributed training. Please see the training logs for more context."
+     ]
+    }
+   ],
+   "source": [
+    "# =============================================================================\n",
+    "# TRAINING EXECUTION\n",
+    "# =============================================================================\n",
+    "\n",
+    "print(\"\ud83d\ude80 Starting SFT Training\")\n",
+    "print(\"=\" * 60)\n",
+    "print(f\"Experiment: {full_experiment_name}\")\n",
+    "print(f\"Model: {selected_example['model_name']}\")\n",
+    "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node \u00d7 {nnodes} nodes)\")\n",
+    "print(f\"Configuration: {dist_config['description']}\")\n",
+    "print()\n",
+    "\n",
+    "# Prepare all training parameters\n",
+    "training_params = {\n",
+    "    # Required parameters\n",
+    "    'model_path': model_path,\n",
+    "    'data_path': data_path,\n",
+    "    'ckpt_output_dir': ckpt_output_dir,\n",
+    "    \n",
+    "    # Core training parameters\n",
+    "    'num_epochs': num_epochs,\n",
+    "    'effective_batch_size': effective_batch_size,\n",
+    "    'learning_rate': learning_rate,\n",
+    "    'max_seq_len': max_seq_len,\n",
+    "    'max_tokens_per_gpu': max_tokens_per_gpu,\n",
+    "    \n",
+    "    # Data and processing parameters\n",
+    "    'data_output_dir': data_output_dir,\n",
+    "    'warmup_steps': warmup_steps,\n",
+    "    'save_samples': save_samples,\n",
+    "    \n",
+    "    # Checkpointing parameters\n",
+    "    'checkpoint_at_epoch': checkpoint_at_epoch,\n",
+    "    'accelerate_full_state_at_epoch': accelerate_full_state_at_epoch,\n",
+    "    \n",
+    "    # Distributed training parameters\n",
+    "    'nproc_per_node': nproc_per_node,\n",
+    "    'nnodes': nnodes,\n",
+    "    'node_rank': node_rank,\n",
+    "    'rdzv_id': rdzv_id,\n",
+    "    'rdzv_endpoint': rdzv_endpoint,\n",
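+    "    # Extra backend option forwarded via **kwargs: disable FlashAttention (e.g. on GPUs without support for it)\n",
+    "    'disable_flash_attn': True\n",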
+    "}\n",
+    "\n",
+    "# Display final configuration summary\n",
+    "print(\"\ud83d\udccb Final Training Configuration:\")\n",
+    "for key, value in training_params.items():\n",
+    "    print(f\"  {key}: {value}\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\"*60)\n",
+    "print(\"\u23f3 Training starting...\")\n",
+    "print(\"=\"*60)\n",
+    "\n",
+    "# Execute training\n",
+    "start_time = time.time()\n",
+    "\n",
+    "try:\n",
+    "    result = sft(**training_params)\n",
+    "    \n",
+    "    end_time = time.time()\n",
+    "    duration = end_time - start_time\n",
+    "    \n",
+    "    print(\"\\n\" + \"=\"*60)\n",
+    "    print(\"\u2705 Training completed successfully!\")\n",
+    "    print(f\"\u23f1\ufe0f  Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n",
+    "    print(f\"\ud83d\udcc1 Checkpoints saved to: {ckpt_output_dir}\")\n",
+    "    print(\"=\"*60)\n",
+    "    \n",
+    "except Exception as e:\n",
+    "    end_time = time.time()\n",
+    "    duration = end_time - start_time\n",
+    "    \n",
+    "    print(\"\\n\" + \"=\"*60)\n",
+    "    print(f\"\u274c Training failed after {duration/60:.1f} minutes\")\n",
+    "    print(f\"Error: {e}\")\n",
+    "    print(\"=\"*60)\n",
+    "    \n",
+    "    print(\"\\n\ud83d\udd0d Quick Troubleshooting Checklist:\")\n",
+    "    print(\"  \u25a1 Check that model_path exists or is a valid HuggingFace model name\")\n",
+    "    print(\"  \u25a1 Verify data_path points to valid JSONL file\")\n",
+    "    print(\"  \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\")\n",
+    "    print(\"  \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\")\n",
+    "    print(\"  \u25a1 For multi-node: verify network connectivity and endpoints\")\n",
+    "    print(\"  \u25a1 Check that all file paths are accessible from the training process\")\n",
+    "    \n",
+    "    raise"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Post-Training Analysis\n",
+    "\n",
+    "After training completes, let's analyze the results and provide guidance for next steps."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# =============================================================================\n",
+    "# POST-TRAINING ANALYSIS AND NEXT STEPS\n",
+    "# =============================================================================\n",
+    "\n",
+    "print(\"\ud83d\udcca Post-Training Analysis\")\n",
+    "print(\"=\" * 50)\n",
+    "\n",
+    "# Check for saved checkpoints\n",
+    "checkpoint_dir = f\"{ckpt_output_dir}/hf_format\"\n",
+    "\n",
+    "if os.path.exists(checkpoint_dir):\n",
+    "    checkpoints = [d for d in os.listdir(checkpoint_dir) \n",
+    "                  if os.path.isdir(os.path.join(checkpoint_dir, d))]\n",
+    "    \n",
+    "    if checkpoints:\n",
+    "        print(f\"\u2705 Found {len(checkpoints)} checkpoint(s):\")\n",
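+    "        # Order by modification time: directory names with numeric suffixes do not sort chronologically as strings\n",
+    "        checkpoints.sort(key=lambda d: os.path.getmtime(os.path.join(checkpoint_dir, d)))\n",
+    "        for ckpt in checkpoints:\n",
+    "            print(f\"  \ud83d\udcc1 {ckpt}\")\n",
+    "        \n",
+    "        # The most recently written checkpoint is treated as the final one\n",
+    "        final_checkpoint = checkpoints[-1]\n",
+    "        final_checkpoint_path = os.path.join(checkpoint_dir, final_checkpoint)\n",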
+    "        \n",
+    "        print(f\"\\n\ud83c\udfaf Final model checkpoint: {final_checkpoint_path}\")\n",
+    "        \n",
+    "        # Provide model loading example\n",
+    "        print(f\"\\n\ud83d\udcbb Model Loading Example:\")\n",
+    "        print(f\"```python\")\n",
+    "        print(f\"from transformers import AutoModelForCausalLM, AutoTokenizer\")\n",
+    "        print(f\"\")\n",
+    "        print(f\"# Load your fine-tuned model\")\n",
+    "        print(f\"model = AutoModelForCausalLM.from_pretrained('{final_checkpoint_path}')\")\n",
+    "        print(f\"tokenizer = AutoTokenizer.from_pretrained('{final_checkpoint_path}')\")\n",
+    "        print(f\"\")\n",
+    "        print(f\"# Generate text\")\n",
+    "        print(f\"inputs = tokenizer('Your prompt here:', return_tensors='pt')\")\n",
+    "        print(f\"outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)\")\n",
+    "        print(f\"response = tokenizer.decode(outputs[0], skip_special_tokens=True)\")\n",
+    "        print(f\"print(response)\")\n",
+    "        print(f\"```\")\n",
+    "    else:\n",
+    "        print(f\"\u274c No checkpoints found in {checkpoint_dir}\")\n",
+    "else:\n",
+    "    print(f\"\u274c Checkpoint directory not found: {checkpoint_dir}\")\n",
+    "\n",
+    "# Training summary\n",
+    "print(f\"\\n\ud83d\udcc8 Training Summary:\")\n",
+    "print(f\"  Model: {selected_example['model_name']}\")\n",
+    "print(f\"  Epochs: {num_epochs}\")\n",
+    "print(f\"  Global Batch Size: {effective_batch_size}\")\n",
+    "print(f\"  Learning Rate: {learning_rate}\")\n",
+    "print(f\"  Max Tokens per GPU: {max_tokens_per_gpu:,}\")\n",
+    "print(f\"  Max Sequence Length: {max_seq_len:,}\")\n",
+    "print(f\"  Total GPUs: {total_gpus}\")\n",
+    "print(f\"  Distributed Config: {dist_config['description']}\")\n",
+    "\n",
+    "# Next steps recommendations\n",
+    "print(f\"\\n\ud83d\ude80 Recommended Next Steps:\")\n",
+    "print(f\"  1. \ud83e\uddea Test your model with sample inputs to verify it's working\")\n",
+    "print(f\"  2. \ud83d\udcca Evaluate performance on your validation/test datasets\")\n",
+    "print(f\"  3. \ud83d\udd04 Compare outputs with the original base model\")\n",
+    "print(f\"  4. \ud83c\udfaf Fine-tune hyperparameters if needed (learning rate, batch size)\")\n",
+    "print(f\"  5. \ud83d\udcdd Document your configuration and results for reproducibility\")\n",
+    "print(f\"  6. \ud83d\udea2 Deploy for inference using your preferred serving framework\")\n",
+    "\n",
+    "# Performance optimization tips\n",
+    "print(f\"\\n\u26a1 Performance Optimization Tips:\")\n",
+    "print(f\"  \u2022 If training was slow: increase max_tokens_per_gpu as far as GPU memory allows\")\n",
+    "print(f\"  \u2022 If you hit OOM errors: reduce max_tokens_per_gpu (the per-GPU memory cap)\")\n",
+    "print(f\"  \u2022 For better convergence: try different learning rates or warmup_steps\")\n",
+    "print(f\"  \u2022 For production training: consider using the script version for better logging\")\n",
+    "\n",
+    "print(f\"\\n\u2728 SFT Training Complete!\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Parameter Reference Summary\n",
+    "\n",
+    "Quick reference for the SFT parameters used in this tutorial."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Core Parameters\n",
+    "\n",
+    "| Parameter | Required | Description | Example Values |\n",
+    "|-----------|----------|-------------|----------------|\n",
+    "| `model_path` | \u2705 | Path to the model to fine-tune | `\"Qwen/Qwen2.5-7B\"`, `\"/path/to/model\"` |\n",
+    "| `data_path` | \u2705 | Path to the training data | `\"/path/to/train.jsonl\"` |\n",
+    "| `ckpt_output_dir` | \u2705 | Directory to save checkpoints | `\"/path/to/checkpoints\"` |\n",
+    "| `num_epochs` | \u274c | Number of training epochs | `1`, `3`, `5` |\n",
+    "| `effective_batch_size` | \u274c | Global batch size per optimizer step, summed across all GPUs and gradient-accumulation steps | `64`, `128`, `256` |\n",
+    "| `learning_rate` | \u274c | Learning rate for training | `1e-5`, `2e-5`, `5e-6` |\n",
+    "| `max_seq_len` | \u274c | Maximum sequence length | `2048`, `8192`, `16384` |\n",
+    "| `max_tokens_per_gpu` | \u274c | Maximum tokens per GPU in a mini-batch (hard-cap for memory) | `15000`, `25000`, `40000` |\n",
+    "\n",
+    "### Data Processing Parameters\n",
+    "\n",
+    "| Parameter | Description | Default/Example |\n",
+    "|-----------|-------------|------------------|\n",
+    "| `data_output_dir` | Directory to save processed data | `\"/dev/shm\"` (RAM disk) |\n",
+    "| `warmup_steps` | Number of warmup steps | `100`, `500` |\n",
+    "\n",
+    "### Checkpointing Parameters\n",
+    "\n",
+    "| Parameter | Description | Recommended |\n",
+    "|-----------|-------------|-------------|\n",
+    "| `checkpoint_at_epoch` | Whether to checkpoint at each epoch | `True` |\n",
+    "| `accelerate_full_state_at_epoch` | Whether to save full state at epoch for automatic checkpoint resumption | `True` |\n",
+    "| `save_samples` | Save a checkpoint after this many training samples (`0` disables sample-based checkpointing) | `1000`, `0` (disabled) |\n",
+    "\n",
+    "### Distributed Training Parameters\n",
+    "\n",
+    "| Parameter | Description | Example Values |\n",
+    "|-----------|-------------|----------------|\n",
+    "| `nproc_per_node` | Number of processes (GPUs) per node | `1`, `4`, `8` |\n",
+    "| `nnodes` | Total number of nodes | `1`, `2`, `4` |\n",
+    "| `node_rank` | Rank of this node (0 to nnodes-1) | `0` (master), `1`, `2`... |\n",
+    "| `rdzv_id` | Unique job ID for rendezvous | `42`, `100` |\n",
+    "| `rdzv_endpoint` | Rendezvous endpoint (`host:port`) of the master node; must be reachable from every node | `\"127.0.0.1:29500\"` (single node) |\n",
+    "\n",
+    "### Memory Optimization Guidelines\n",
+    "\n",
+    "- **Start conservative**: Begin with lower `max_tokens_per_gpu` values and increase gradually\n",
+    "- **Monitor usage**: Watch GPU memory during training and adjust accordingly\n",
+    "- **Balance batch size**: Larger `effective_batch_size` can improve training stability\n",
+    "- **Use RAM disk**: Set `data_output_dir=\"/dev/shm\"` for faster data loading; this consumes host RAM, not GPU memory\n",
+    "\n",
+    "### Multi-Node Setup Checklist\n",
+    "\n",
+    "1. \u2705 Ensure network connectivity between all nodes\n",
+    "2. \u2705 Use the same `rdzv_id` and `rdzv_endpoint` on all nodes\n",
+    "3. \u2705 Set unique `node_rank` for each node (0, 1, 2, ...)\n",
+    "4. \u2705 Verify all nodes can access model and data paths\n",
+    "5. \u2705 Start training on all nodes at about the same time so that every node joins the rendezvous before it times out\n",
+    "\n",
+    "### Popular Model Examples\n",
+    "\n",
+    "| Model | HuggingFace Path | Example Config |\n",
+    "|-------|------------------|----------------|\n",
+    "| Qwen 2.5 7B | `Qwen/Qwen2.5-7B-Instruct` | `max_tokens_per_gpu=20000` |\n",
+    "| Llama 3.1 8B | `meta-llama/Meta-Llama-3.1-8B-Instruct` | `max_tokens_per_gpu=18000` |\n",
+    "| Phi 4 Mini | `microsoft/Phi-4-mini-instruct` | `max_tokens_per_gpu=25000` |\n",
+    "\n",
+    "### Script Alternative\n",
+    "\n",
+    "For production workloads or long-running training, use the script version:\n",
+    "\n",
+    "```bash\n",
+    "python scripts/sft_qwen_example.py \\\n",
+    "  --data-path /path/to/data.jsonl \\\n",
+    "  --ckpt-output-dir /path/to/checkpoints\n",
+    "\n",
+    "python scripts/sft_llama_example.py \\\n",
+    "  --data-path /path/to/data.jsonl \\\n",
+    "  --ckpt-output-dir /path/to/checkpoints\n",
+    "\n",
+    "python scripts/sft_phi_example.py \\\n",
+    "  --data-path /path/to/data.jsonl \\\n",
+    "  --ckpt-output-dir /path/to/checkpoints\n",
+    "```"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.12",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
new file mode 100644
index 0000000..76b1599
--- /dev/null
+++ b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
@@ -0,0 +1,198 @@
+---
+weight: 25
+---
+
+# Fine-tuning LLMs with Training Hub
+
+## Background
+
+`training_hub` is a Python library that provides a unified, high-level API for running Supervised Fine-Tuning (SFT) and Orthogonal Subspace Fine-Tuning (OSFT) on large language models. It abstracts away the complexity of distributed training configuration, memory management, and backend orchestration, letting you focus on experiment parameters.
+
+**Key benefits:**
+
+- **Unified API**: A single function call (`sft(...)` or `osft(...)`) handles single-GPU, multi-GPU, and multi-node training without changing your code.
+- **Automatic memory management**: The `max_tokens_per_gpu` parameter sets a per-GPU token budget that caps memory usage; the library then computes micro-batch size and gradient accumulation automatically to maintain your target `effective_batch_size`.
+- **OSFT for continual learning**: The `osft` function implements [Nayak et al. (2025), arXiv:2504.07097](https://arxiv.org/abs/2504.07097), which restricts weight updates to orthogonal subspaces — preventing catastrophic forgetting without replay buffers or supplementary datasets.
+- **Production-ready**: Built-in checkpointing, experiment tracking, and Liger kernel support for throughput efficiency.
+
+### SFT vs OSFT
+
+| Aspect | SFT | OSFT |
+|--------|-----|------|
+| **Use case** | Initial instruction tuning, base model fine-tuning | Continual domain adaptation of already-tuned models |
+| **Catastrophic forgetting** | Requires mixed/replay data to mitigate | Prevented algorithmically |
+| **Key parameter** | Standard hyperparameters | `unfreeze_rank_ratio` (0.0–1.0) |
+| **Backend** | instructlab-training | mini-trainer |
+
+## Requirements
+
+- **Alauda AI** and **Alauda AI Workbench** must be installed in your cluster.
+- A Workbench (Notebook) instance with:
+  - Access to install Python packages from the internet (or a configured internal PyPI mirror).
+  - GPU resources attached (at least one NVIDIA GPU).
+  - Sufficient shared storage for model checkpoints.
+- A HuggingFace model (local path or model name resolvable from the instance).
+- Training data in **JSONL format** (see [Data Format](#data-format) below).
+
+## Data Format
+
+Training data must be a JSON Lines (`.jsonl`) file where each line is a conversation:
+
+```json
+{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a subset of AI..."}]}
+```
+
+Supported `role` values: `system`, `user`, `assistant`, `pretraining`.
+
+**Masking behavior:**
+
+- **SFT (default)** — only assistant responses contribute to the training loss. Add `"unmask": true` to a sample to include all non-system content in the loss (pretraining style).
+- **OSFT** — controlled via the `unmask_messages` parameter (`False` by default; set `True` for pretraining style).
+
+Pre-processed datasets with `input_ids` and `labels` fields are also supported via `use_processed_dataset=True`.
+
+## Download Notebooks and Run Examples
+
+Two comprehensive tutorial notebooks are provided. Download them to your Workbench instance and execute them cell by cell.
+
+| Notebook | Algorithm | Download |
+|----------|-----------|----------|
+| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [sft_comprehensive_tutorial.ipynb](./sft_comprehensive_tutorial.ipynb) |
+| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [osft_comprehensive_tutorial.ipynb](./osft_comprehensive_tutorial.ipynb) |
+
+### Step 1 — Install Dependencies
+
+Open a terminal in your Workbench instance and install `training-hub`:
+
+```bash
+pip install training-hub
+```
+
+### Step 2 — Upload or Prepare Data
+
+Place your `.jsonl` training file in a path accessible to the notebook, for example `/data/train.jsonl`.
+
+### Step 3 — Open and Configure the Notebook
+
+Open the downloaded notebook in your Workbench instance. The key cells to configure are:
+
+**Select your model** (both notebooks):
+
+```python
+# Change to your model's HuggingFace name or local path
+model_path = "Qwen/Qwen2.5-7B-Instruct"
+```
+
+Bundled model presets cover Qwen 2.5 7B, Llama 3.1 8B, Phi 4 Mini, and generic 7B/small models.
+
+**Set required paths** (both notebooks):
+
+```python
+data_path       = "/path/to/your/training_data.jsonl"
+ckpt_output_dir = "/path/to/checkpoints/my_experiment"
+```
+
+**OSFT only — set the orthogonality ratio:**
+
+```python
+unfreeze_rank_ratio = 0.25  # 0.1–0.3 conservative, 0.3–0.5 balanced
+```
+
+**Select distributed configuration:**
+
+```python
+selected_distributed = "single_node_8gpu"  # or "single_gpu_dev", "multi_node_master", etc.
+```
+
+### Step 4 — Execute Training
+
+Run all cells in sequence. The final training cell calls `sft(...)` in the SFT notebook or `osft(...)` in the OSFT notebook:
+
+```python
+# SFT
+from training_hub import sft
+result = sft(
+    model_path=model_path,
+    data_path=data_path,
+    ckpt_output_dir=ckpt_output_dir,
+    effective_batch_size=128,
+    max_tokens_per_gpu=20000,
+    max_seq_len=16384,
+    learning_rate=1e-5,
+    num_epochs=3,
+    nproc_per_node=8,
+    ...
+)
+
+# OSFT
+from training_hub import osft
+result = osft(
+    model_path=model_path,
+    data_path=data_path,
+    ckpt_output_dir=ckpt_output_dir,
+    unfreeze_rank_ratio=0.25,
+    effective_batch_size=128,
+    max_tokens_per_gpu=10000,
+    max_seq_len=8196,
+    learning_rate=5e-6,
+    num_epochs=1,
+    nproc_per_node=8,
+    ...
+)
+```
+
+Checkpoints are written to `ckpt_output_dir` at the end of each epoch (configurable via `checkpoint_at_epoch`).
+
+## Key Parameters
+
+### Common Parameters (SFT and OSFT)
+
+| Parameter | Required | Description |
+|-----------|----------|-------------|
+| `model_path` | Yes | HuggingFace model name or local path |
+| `data_path` | Yes | Path to JSONL training data |
+| `ckpt_output_dir` | Yes | Directory to save checkpoints |
+| `effective_batch_size` | Yes | Global effective batch size |
+| `max_tokens_per_gpu` | Yes | Per-GPU token budget; controls memory and auto-computes micro-batch size |
+| `max_seq_len` | Yes | Maximum sequence length |
+| `learning_rate` | Yes | Optimizer learning rate |
+| `num_epochs` | No | Training epochs (default: `1`) |
+| `lr_scheduler` | No | Scheduler type, e.g. `"cosine"` |
+| `warmup_steps` | No | Linear warmup steps (default: `0`) |
+| `use_liger` | No | Enable Liger kernels for efficiency (default: `True` for OSFT) |
+| `seed` | No | Random seed (default: `42`) |
+| `data_output_dir` | No | Processed data cache dir; use `"/dev/shm"` for RAM-disk speed |
+| `use_processed_dataset` | No | Skip tokenization if data has `input_ids`/`labels` |
+| `checkpoint_at_epoch` | No | Save checkpoint each epoch (default: `True`) |
+| `save_final_checkpoint` | No | Save a final checkpoint after training (default: `True`) |
+| `nproc_per_node` | No | GPUs per node |
+| `nnodes` | No | Total nodes (default: `1`) |
+| `node_rank` | No | This node's rank (default: `0`) |
+| `rdzv_id` | No | Rendezvous job ID |
+| `rdzv_endpoint` | No | Master node `host:port` for multi-node |
+
+### OSFT-specific Parameters
+
+| Parameter | Required | Description |
+|-----------|----------|-------------|
+| `unfreeze_rank_ratio` | Yes | Fraction of each weight matrix that can be updated (0.0–1.0). Lower = more preservation. |
+| `unmask_messages` | No | If `True`, trains on all non-system content (pretraining style) |
+| `target_patterns` | No | Substring patterns to restrict OSFT to specific layers (default: `None`, all layers) |
+
+### Multi-node Training
+
+For multi-node jobs, run the notebook (or equivalent script) on every node simultaneously with matching `rdzv_id` and `rdzv_endpoint`, varying only `node_rank` per node:
+
+```python
+# Master node (node_rank=0)
+nproc_per_node = 8
+nnodes         = 2
+node_rank      = 0
+rdzv_id        = 42
+rdzv_endpoint  = "10.0.0.1:29500"
+
+# Worker node (node_rank=1)
+node_rank = 1  # all other params identical
+```
+
+All nodes must have network connectivity to the `rdzv_endpoint` before training begins.
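The connectivity requirement above can be checked from each worker before launching training. The following is a minimal sketch using only the Python standard library; `can_reach` is a hypothetical helper name, not part of `training_hub`, and it assumes the endpoint is an IPv4 address or hostname in `host:port` form. The rendezvous port usually accepts connections only once the master process has started, so a `False` result before that point does not necessarily mean the network is misconfigured.

```python
import socket


def can_reach(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to 'host:port' succeeds within `timeout` seconds."""
    # Split on the last colon so hostnames containing dots are handled correctly.
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {endpoint!r}")
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Example endpoint taken from the multi-node snippet above.
    print(can_reach("10.0.0.1:29500"))
```

Running this on every worker with the same `rdzv_endpoint` value gives a quick signal about routing or firewall problems without starting a full training job.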

From 96f8d37d3b21ff748775c8e1c209318c3a3be8b2 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 24 Mar 2026 09:54:11 +0800
Subject: [PATCH 2/3] fix link

---
 .../how_to/osft_comprehensive_tutorial.ipynb  | 162 +++++++++---------
 .../how_to/training_hub_fine_tuning.mdx       |   6 +-
 2 files changed, 84 insertions(+), 84 deletions(-)

diff --git a/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb
index ebd6edd..2313fd7 100644
--- a/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb
+++ b/docs/en/workbench/how_to/osft_comprehensive_tutorial.ipynb
@@ -142,11 +142,11 @@
     "target_patterns = None  # Default: applies OSFT to all appropriate layers (RECOMMENDED)\n",
     "```\n",
     "\n",
-    "**\u26a0\ufe0f Important:** This is an expert-level parameter. Unless you have deep knowledge of model architecture and a specific reason to limit OSFT to certain layers, **leave it as `None`**.\n",
+    "**⚠️ Important:** This is an expert-level parameter. Unless you have deep knowledge of model architecture and a specific reason to limit OSFT to certain layers, **leave it as `None`**.\n",
     "\n",
     "If you do need to use it, it performs simple substring matching on module names:\n",
-    "- `target_patterns = [\"attention\"]` \u2192 Targets modules with \"attention\" in the name\n",
-    "- `target_patterns = [\"mlp\"]` \u2192 Targets modules with \"mlp\" in the name\n",
+    "- `target_patterns = [\"attention\"]` → Targets modules with \"attention\" in the name\n",
+    "- `target_patterns = [\"mlp\"]` → Targets modules with \"mlp\" in the name\n",
     "\n",
     "**For 99% of users:** Just use the default (`None`) and let OSFT handle layer selection automatically. The algorithm knows what it's doing.\n"
    ]
@@ -292,10 +292,10 @@
     "# These are example configurations - adjust based on your hardware and requirements\n",
     "# =============================================================================\n",
     "\n",
-    "# Example 1: Qwen 2.5 7B Instruct\n",
+    "# Example 1: Qwen 3 0.6B\n",
     "qwen_example = {\n",
-    "    \"model_name\": \"Qwen 2.5 7B Instruct\",\n",
-    "    \"model_path\": \"/opt/app-root/src/Qwen3-0.6B\",  # HuggingFace model name or local path\n",
+    "    \"model_name\": \"Qwen 3 0.6B\",\n",
+    "    \"model_path\": \"Qwen/Qwen3-0.6B\",  # HuggingFace model name or local path\n",
     "    \"example_unfreeze_rank_ratio\": 0.25,  # Conservative for preserving multilingual capabilities\n",
     "    \"example_max_tokens_per_gpu\": 2048,\n",
     "    \"example_max_seq_len\": 2048,  # Qwen 2.5 supports long context\n",
@@ -367,7 +367,7 @@
     "print(f\"Example Batch Size: {selected_example['example_batch_size']:,}\")\n",
     "print(f\"Example Learning Rate: {selected_example['example_learning_rate']}\")\n",
     "print(f\"Notes: {selected_example['notes']}\")\n",
-    "print(\"\\n\ud83d\udca1 Remember: OSFT preserves original capabilities without needing replay buffers!\")\n",
+    "print(\"\\n💡 Remember: OSFT preserves original capabilities without needing replay buffers!\")\n",
     "print(\"   Adjust unfreeze_rank_ratio based on preservation vs adaptation needs.\")"
    ]
   },
@@ -409,15 +409,15 @@
     "max_seq_len = selected_example[\"example_max_seq_len\"]  # Maximum sequence length\n",
     "learning_rate = selected_example[\"example_learning_rate\"]  # Learning rate for training\n",
     "\n",
-    "print(\"\ud83d\udccb Required Parameters (all must be specified):\")\n",
-    "print(f\"  \u2022 model_path: {model_path}\")\n",
-    "print(f\"  \u2022 data_path: {data_path}\")\n",
-    "print(f\"  \u2022 ckpt_output_dir: {ckpt_output_dir}\")\n",
-    "print(f\"  \u2022 unfreeze_rank_ratio: {unfreeze_rank_ratio}\")\n",
-    "print(f\"  \u2022 effective_batch_size: {effective_batch_size}\")\n",
-    "print(f\"  \u2022 max_tokens_per_gpu: {max_tokens_per_gpu:,}\")\n",
-    "print(f\"  \u2022 max_seq_len: {max_seq_len:,}\")\n",
-    "print(f\"  \u2022 learning_rate: {learning_rate}\")\n",
+    "print(\"📋 Required Parameters (all must be specified):\")\n",
+    "print(f\"  • model_path: {model_path}\")\n",
+    "print(f\"  • data_path: {data_path}\")\n",
+    "print(f\"  • ckpt_output_dir: {ckpt_output_dir}\")\n",
+    "print(f\"  • unfreeze_rank_ratio: {unfreeze_rank_ratio}\")\n",
+    "print(f\"  • effective_batch_size: {effective_batch_size}\")\n",
+    "print(f\"  • max_tokens_per_gpu: {max_tokens_per_gpu:,}\")\n",
+    "print(f\"  • max_seq_len: {max_seq_len:,}\")\n",
+    "print(f\"  • learning_rate: {learning_rate}\")\n",
     "print()\n",
     "\n",
     "# =============================================================================\n",
@@ -427,11 +427,11 @@
     "target_patterns = None  # Optional: Patterns to match specific modules for OSFT\n",
     "# Example: [\"*attention*\", \"*mlp*\"] to target attention and MLP layers\n",
     "\n",
-    "print(\"\ud83d\udd27 OSFT-Specific Parameters:\")\n",
+    "print(\"🔧 OSFT-Specific Parameters:\")\n",
     "print(f\"  unfreeze_rank_ratio: {unfreeze_rank_ratio} - Controls how much of each matrix is unfrozen\")\n",
-    "print(f\"    \u2022 0.1-0.3: Conservative, maximum preservation\")\n",
-    "print(f\"    \u2022 0.3-0.5: Balanced adaptation\")\n",
-    "print(f\"    \u2022 >0.5: Rarely needed for typical use cases\")\n",
+    "print(f\"    • 0.1-0.3: Conservative, maximum preservation\")\n",
+    "print(f\"    • 0.3-0.5: Balanced adaptation\")\n",
+    "print(f\"    • >0.5: Rarely needed for typical use cases\")\n",
     "print(f\"  target_patterns: {target_patterns} - Optional patterns for selecting specific modules\")\n",
     "print()\n",
     "\n",
@@ -446,7 +446,7 @@
     "lr_scheduler_kwargs = {}  # Scheduler parameters\n",
     "warmup_steps = 0  # Number of warmup steps\n",
     "\n",
-    "print(\"\ud83c\udfaf Training Hyperparameters:\")\n",
+    "print(\"🎯 Training Hyperparameters:\")\n",
     "print(f\"  effective_batch_size: {effective_batch_size} - Effective batch size for training\")\n",
     "print(f\"  learning_rate: {learning_rate} - Learning rate for model updates\")\n",
     "print(f\"  num_epochs: {num_epochs} - Number of training epochs\")\n",
@@ -462,7 +462,7 @@
     "\n",
     "use_liger = True  # Use Liger kernels for efficiency\n",
     "\n",
-    "print(\"\u26a1 Memory and Performance Parameters:\")\n",
+    "print(\"⚡ Memory and Performance Parameters:\")\n",
     "print(f\"  max_tokens_per_gpu: {max_tokens_per_gpu:,} - Maximum tokens per GPU (hard-cap for memory)\")\n",
     "print(f\"  max_seq_len: {max_seq_len:,} - Maximum sequence length\")\n",
     "print(f\"  use_liger: {use_liger} - Use Liger kernels for efficiency\")\n",
@@ -476,7 +476,7 @@
     "use_processed_dataset = False  # Whether data is pre-processed\n",
     "unmask_messages = False  # Whether to unmask all messages for pretraining-style learning\n",
     "\n",
-    "print(\"\ud83d\udcbe Data Processing Parameters:\")\n",
+    "print(\"💾 Data Processing Parameters:\")\n",
     "print(f\"  data_path: '{data_path}' - Path to training data (JSONL format)\")\n",
     "print(f\"  data_output_dir: '{data_output_dir}' - Directory to save processed data\")\n",
     "print(f\"  use_processed_dataset: {use_processed_dataset} - Whether to use pre-processed data\")\n",
@@ -490,7 +490,7 @@
     "checkpoint_at_epoch = True  # Whether to checkpoint at each epoch\n",
     "save_final_checkpoint = True  # Whether to save final checkpoint\n",
     "\n",
-    "print(\"\ud83d\udcbe Checkpointing Parameters:\")\n",
+    "print(\"💾 Checkpointing Parameters:\")\n",
     "print(f\"  ckpt_output_dir: '{ckpt_output_dir}' - Directory to save checkpoints\")\n",
     "print(f\"  checkpoint_at_epoch: {checkpoint_at_epoch} - Whether to checkpoint at each epoch\")\n",
     "print(f\"  save_final_checkpoint: {save_final_checkpoint} - Whether to save final checkpoint\")\n",
@@ -568,7 +568,7 @@
     "total_gpus = nproc_per_node * nnodes\n",
     "per_gpu_batch_size = effective_batch_size // total_gpus\n",
     "\n",
-    "print(\"\ud83d\udda5\ufe0f  Distributed Training Parameters:\")\n",
+    "print(\"🖥️  Distributed Training Parameters:\")\n",
     "print(f\"  Configuration: {dist_config['description']}\")\n",
     "print(f\"  nproc_per_node: {nproc_per_node} - Number of processes (GPUs) per node\")\n",
     "print(f\"  nnodes: {nnodes} - Total number of nodes\")\n",
@@ -576,8 +576,8 @@
     "print(f\"  rdzv_id: {rdzv_id} - Unique job ID for rendezvous\")\n",
     "print(f\"  rdzv_endpoint: '{rdzv_endpoint}' - Master node endpoint for multi-node training\")\n",
     "print()\n",
-    "print(f\"\ud83d\udcca Resource Calculation:\")\n",
-    "print(f\"  Total GPUs: {total_gpus} ({nproc_per_node} \u00d7 {nnodes})\")\n",
+    "print(f\"📊 Resource Calculation:\")\n",
+    "print(f\"  Total GPUs: {total_gpus} ({nproc_per_node} × {nnodes})\")\n",
     "print(f\"  Effective batch size: {effective_batch_size}\")\n",
     "print(f\"  Approximate per-GPU batch size: {per_gpu_batch_size}\")\n",
     "print(f\"  (Actual micro-batch size determined automatically by gradient accumulation)\")\n",
@@ -585,7 +585,7 @@
     "\n",
     "# Multi-node setup instructions\n",
     "if nnodes > 1:\n",
-    "    print(\"\ud83d\udd27 Multi-Node Setup Instructions:\")\n",
+    "    print(\"🔧 Multi-Node Setup Instructions:\")\n",
     "    print(f\"  1. Ensure all nodes can reach the master at {rdzv_endpoint}\")\n",
     "    print(f\"  2. Use the same rdzv_id ({rdzv_id}) on all nodes\")\n",
     "    print(f\"  3. Set node_rank to 0 for master, 1,2,3... for workers\")\n",
@@ -593,11 +593,11 @@
     "    print()\n",
     "\n",
     "# OSFT-specific multi-node considerations\n",
-    "print(\"\ud83d\udcdd OSFT Multi-Node Considerations:\")\n",
-    "print(\"  \u2022 OSFT works seamlessly across multiple nodes\")\n",
-    "print(\"  \u2022 No special replay buffer coordination needed (unlike SFT)\")\n",
-    "print(\"  \u2022 Each node processes its data portion with the same unfreeze_rank_ratio\")\n",
-    "print(\"  \u2022 Gradients are synchronized automatically across all nodes\")\n",
+    "print(\"📝 OSFT Multi-Node Considerations:\")\n",
+    "print(\"  • OSFT works seamlessly across multiple nodes\")\n",
+    "print(\"  • No special replay buffer coordination needed (unlike SFT)\")\n",
+    "print(\"  • Each node processes its data portion with the same unfreeze_rank_ratio\")\n",
+    "print(\"  • Gradients are synchronized automatically across all nodes\")\n",
     "print()"
    ]
   },
@@ -620,18 +620,18 @@
     "# TRAINING EXECUTION\n",
     "# =============================================================================\n",
     "\n",
-    "print(\"\ud83d\ude80 Starting OSFT Training\")\n",
+    "print(\"🚀 Starting OSFT Training\")\n",
     "print(\"=\" * 60)\n",
     "print(f\"Experiment: {full_experiment_name}\")\n",
     "print(f\"Model: {selected_example['model_name']}\")\n",
-    "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node \u00d7 {nnodes} nodes)\")\n",
+    "print(f\"Total GPUs: {total_gpus} ({nproc_per_node} per node × {nnodes} nodes)\")\n",
     "print(f\"Configuration: {dist_config['description']}\")\n",
     "print(f\"Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n",
     "print()\n",
-    "print(\"\u2728 OSFT Advantages:\")\n",
-    "print(\"  \u2022 No catastrophic forgetting\")\n",
-    "print(\"  \u2022 No replay buffer needed\")\n",
-    "print(\"  \u2022 Preserves original model capabilities\")\n",
+    "print(\"✨ OSFT Advantages:\")\n",
+    "print(\"  • No catastrophic forgetting\")\n",
+    "print(\"  • No replay buffer needed\")\n",
+    "print(\"  • Preserves original model capabilities\")\n",
     "print()\n",
     "\n",
     "# Prepare all training parameters\n",
@@ -677,13 +677,13 @@
     "}\n",
     "\n",
     "# Display final configuration summary\n",
-    "print(\"\ud83d\udccb Final Training Configuration:\")\n",
+    "print(\"📋 Final Training Configuration:\")\n",
     "for key, value in training_params.items():\n",
     "    if value is not None:  # Only show non-None values\n",
     "        print(f\"  {key}: {value}\")\n",
     "\n",
     "print(\"\\n\" + \"=\"*60)\n",
-    "print(\"\u23f3 Training starting...\")\n",
+    "print(\"⏳ Training starting...\")\n",
     "print(\"=\"*60)\n",
     "\n",
     "# Execute training\n",
@@ -696,34 +696,34 @@
     "    duration = end_time - start_time\n",
     "    \n",
     "    print(\"\\n\" + \"=\"*60)\n",
-    "    print(\"\u2705 OSFT Training completed successfully!\")\n",
-    "    print(f\"\u23f1\ufe0f  Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n",
-    "    print(f\"\ud83d\udcc1 Checkpoints saved to: {ckpt_output_dir}\")\n",
+    "    print(\"✅ OSFT Training completed successfully!\")\n",
+    "    print(f\"⏱️  Total duration: {duration/3600:.2f} hours ({duration/60:.1f} minutes)\")\n",
+    "    print(f\"📁 Checkpoints saved to: {ckpt_output_dir}\")\n",
     "    print(\"=\"*60)\n",
     "    print()\n",
-    "    print(\"\ud83c\udfaf What you've achieved with OSFT:\")\n",
-    "    print(\"  \u2022 Model adapted to new domain/task\")\n",
-    "    print(\"  \u2022 Original capabilities preserved\")\n",
-    "    print(\"  \u2022 No catastrophic forgetting occurred\")\n",
-    "    print(\"  \u2022 Ready for deployment without regression testing!\")\n",
+    "    print(\"🎯 What you've achieved with OSFT:\")\n",
+    "    print(\"  • Model adapted to new domain/task\")\n",
+    "    print(\"  • Original capabilities preserved\")\n",
+    "    print(\"  • No catastrophic forgetting occurred\")\n",
+    "    print(\"  • Ready for validation on general and domain tasks before deployment\")\n",
     "    \n",
     "except Exception as e:\n",
     "    end_time = time.time()\n",
     "    duration = end_time - start_time\n",
     "    \n",
     "    print(\"\\n\" + \"=\"*60)\n",
-    "    print(f\"\u274c Training failed after {duration/60:.1f} minutes\")\n",
+    "    print(f\"❌ Training failed after {duration/60:.1f} minutes\")\n",
     "    print(f\"Error: {e}\")\n",
     "    print(\"=\"*60)\n",
     "    \n",
-    "    print(\"\\n\ud83d\udd0d Quick Troubleshooting Checklist:\")\n",
-    "    print(\"  \u25a1 Check that model_path exists or is a valid HuggingFace model name\")\n",
-    "    print(\"  \u25a1 Verify data_path points to valid JSONL file\")\n",
-    "    print(\"  \u25a1 Ensure ckpt_output_dir parent directory exists and is writable\")\n",
-    "    print(\"  \u25a1 Try reducing max_tokens_per_gpu if you see OOM errors\")\n",
-    "    print(\"  \u25a1 Try adjusting unfreeze_rank_ratio (lower = more preservation)\")\n",
-    "    print(\"  \u25a1 For multi-node: verify network connectivity and endpoints\")\n",
-    "    print(\"  \u25a1 Check that mini-trainer backend dependencies are installed\")\n",
+    "    print(\"\\n🔍 Quick Troubleshooting Checklist:\")\n",
+    "    print(\"  □ Check that model_path exists or is a valid HuggingFace model name\")\n",
+    "    print(\"  □ Verify data_path points to valid JSONL file\")\n",
+    "    print(\"  □ Ensure ckpt_output_dir parent directory exists and is writable\")\n",
+    "    print(\"  □ Try reducing max_tokens_per_gpu if you see OOM errors\")\n",
+    "    print(\"  □ Try adjusting unfreeze_rank_ratio (lower = more preservation)\")\n",
+    "    print(\"  □ For multi-node: verify network connectivity and endpoints\")\n",
+    "    print(\"  □ Check that mini-trainer backend dependencies are installed\")\n",
     "    \n",
     "    raise\n"
    ]
@@ -747,7 +747,7 @@
     "# POST-TRAINING ANALYSIS AND NEXT STEPS\n",
     "# =============================================================================\n",
     "\n",
-    "print(\"\ud83d\udcca Post-Training Analysis\")\n",
+    "print(\"📊 Post-Training Analysis\")\n",
     "print(\"=\" * 50)\n",
     "\n",
     "# Check for saved checkpoints\n",
@@ -758,19 +758,19 @@
     "                  if os.path.isdir(os.path.join(checkpoint_dir, d))]\n",
     "    \n",
     "    if checkpoints:\n",
-    "        print(f\"\u2705 Found {len(checkpoints)} checkpoint(s):\")\n",
+    "        print(f\"✅ Found {len(checkpoints)} checkpoint(s):\")\n",
     "        for ckpt in sorted(checkpoints):\n",
     "            ckpt_path = os.path.join(checkpoint_dir, ckpt)\n",
-    "            print(f\"  \ud83d\udcc1 {ckpt}\")\n",
+    "            print(f\"  📁 {ckpt}\")\n",
     "        \n",
     "        # Identify the final checkpoint\n",
     "        final_checkpoint = sorted(checkpoints)[-1]\n",
     "        final_checkpoint_path = os.path.join(checkpoint_dir, final_checkpoint)\n",
     "        \n",
-    "        print(f\"\\n\ud83c\udfaf Final model checkpoint: {final_checkpoint_path}\")\n",
+    "        print(f\"\\n🎯 Final model checkpoint: {final_checkpoint_path}\")\n",
     "        \n",
     "        # Provide model loading example\n",
-    "        print(f\"\\n\ud83d\udcbb Model Loading Example:\")\n",
+    "        print(f\"\\n💻 Model Loading Example:\")\n",
     "        print(f\"```python\")\n",
     "        print(f\"from transformers import AutoModelForCausalLM, AutoTokenizer\")\n",
     "        print(f\"\")\n",
@@ -786,12 +786,12 @@
     "        print(f\"print(response)\")\n",
     "        print(f\"```\")\n",
     "    else:\n",
-    "        print(f\"\u274c No checkpoints found in {checkpoint_dir}\")\n",
+    "        print(f\"❌ No checkpoints found in {checkpoint_dir}\")\n",
     "else:\n",
-    "    print(f\"\u274c Checkpoint directory not found: {checkpoint_dir}\")\n",
+    "    print(f\"❌ Checkpoint directory not found: {checkpoint_dir}\")\n",
     "\n",
     "# Training summary\n",
-    "print(f\"\\n\ud83d\udcc8 Training Summary:\")\n",
+    "print(f\"\\n📈 Training Summary:\")\n",
     "print(f\"  Model: {selected_example['model_name']}\")\n",
     "print(f\"  Algorithm: OSFT (Orthogonal Subspace Fine-Tuning)\")\n",
     "print(f\"  Unfreeze Rank Ratio: {unfreeze_rank_ratio}\")\n",
@@ -804,7 +804,7 @@
     "print(f\"  Distributed Config: {dist_config['description']}\")\n",
     "\n",
     "# OSFT-specific validation recommendations\n",
-    "print(f\"\\n\ud83e\uddea OSFT-Specific Validation Steps:\")\n",
+    "print(f\"\\n🧪 OSFT-Specific Validation Steps:\")\n",
     "print(f\"  1. **Test Original Capabilities**: Verify the model still performs well on\")\n",
     "print(f\"     general tasks it was originally trained for\")\n",
     "print(f\"  2. **Test New Domain**: Confirm improved performance on your target domain\")\n",
@@ -814,17 +814,17 @@
     "print(f\"     improvements without degradation\")\n",
     "\n",
     "# Next steps recommendations\n",
-    "print(f\"\\n\ud83d\ude80 Recommended Next Steps:\")\n",
-    "print(f\"  1. \ud83c\udfaf Test on domain-specific evaluation sets\")\n",
-    "print(f\"  2. \ud83d\udcca Compare performance with base model on both general and domain tasks\")\n",
-    "print(f\"  3. \ud83d\udd04 If more adaptation needed, slightly increase unfreeze_rank_ratio\")\n",
-    "print(f\"  4. \ud83d\udca1 If too much change occurred, reduce unfreeze_rank_ratio\")\n",
-    "print(f\"  5. \ud83d\udcdd Document the unfreeze_rank_ratio that works best for your use case\")\n",
-    "print(f\"  6. \ud83d\udea2 Deploy with confidence - no catastrophic forgetting!\")\n",
+    "print(f\"\\n🚀 Recommended Next Steps:\")\n",
+    "print(f\"  1. 🎯 Test on domain-specific evaluation sets\")\n",
+    "print(f\"  2. 📊 Compare performance with base model on both general and domain tasks\")\n",
+    "print(f\"  3. 🔄 If more adaptation needed, slightly increase unfreeze_rank_ratio\")\n",
+    "print(f\"  4. 💡 If too much change occurred, reduce unfreeze_rank_ratio\")\n",
+    "print(f\"  5. 📝 Document the unfreeze_rank_ratio that works best for your use case\")\n",
+    "print(f\"  6. 🚢 Deploy with confidence - no catastrophic forgetting!\")\n",
     "\n",
     "# Performance optimization tips\n",
-    "print(f\"\\n\u26a1 OSFT-Specific Optimization Tips:\")\n",
-    "print(f\"  \u2022 Current unfreeze_rank_ratio ({unfreeze_rank_ratio}):\")\n",
+    "print(f\"\\n⚡ OSFT-Specific Optimization Tips:\")\n",
+    "print(f\"  • Current unfreeze_rank_ratio ({unfreeze_rank_ratio}):\")\n",
     "if unfreeze_rank_ratio < 0.2:\n",
     "    print(f\"    Very conservative - great preservation, slower adaptation\")\n",
     "    print(f\"    Consider increasing to 0.25-0.3 if need more adaptation\")\n",
@@ -835,10 +835,10 @@
     "    print(f\"    Aggressive - faster adaptation, slightly less preservation\")\n",
     "    print(f\"    Consider reducing if seeing any capability degradation\")\n",
     "\n",
-    "print(f\"  \u2022 Memory usage is similar to SFT - adjust max_tokens_per_gpu as needed\")\n",
-    "print(f\"  \u2022 For production: use the script version for better logging and resumption\")\n",
+    "print(f\"  • Memory usage is similar to SFT - adjust max_tokens_per_gpu as needed\")\n",
+    "print(f\"  • For production: use the script version for better logging and resumption\")\n",
     "\n",
-    "print(f\"\\n\u2728 OSFT Training Complete!\")\n",
+    "print(f\"\\n✨ OSFT Training Complete!\")\n",
     "print(f\"Your model has been successfully adapted without forgetting!\")\n"
    ]
   },
@@ -883,7 +883,7 @@
     "|-----------|-------------|-----------------|\n",
     "| `num_epochs` | Number of training epochs | `1` |\n",
     "| `seed` | Random seed for reproducibility | `42` |\n",
-    "| `use_liger` | Enable Liger kernels for efficiency | `False` |\n",
+    "| `use_liger` | Enable Liger kernels for efficiency | `True` |\n",
     "| `warmup_steps` | Number of warmup steps | `0` |\n",
     "| `lr_scheduler` | Learning rate scheduler | `\"cosine\"` |\n",
     "| `lr_scheduler_kwargs` | Additional scheduler parameters | `{\"eta_min\": 1e-6}` |\n",
@@ -1004,4 +1004,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 2
-}
\ No newline at end of file
+}
diff --git a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
index 76b1599..aa7cd66 100644
--- a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
+++ b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
@@ -57,8 +57,8 @@ Two comprehensive tutorial notebooks are provided. Download them to your Workben
 
 | Notebook | Algorithm | Download |
 |----------|-----------|----------|
-| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [sft_comprehensive_tutorial.ipynb](./sft_comprehensive_tutorial.ipynb) |
-| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [osft_comprehensive_tutorial.ipynb](./osft_comprehensive_tutorial.ipynb) |
+| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [sft_comprehensive_tutorial.ipynb](/workbench/how_to/sft_comprehensive_tutorial.ipynb) |
+| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [osft_comprehensive_tutorial.ipynb](/workbench/how_to/osft_comprehensive_tutorial.ipynb) |
 
 ### Step 1 — Install Dependencies
 
@@ -133,7 +133,7 @@ result = osft(
     unfreeze_rank_ratio=0.25,
     effective_batch_size=128,
     max_tokens_per_gpu=10000,
-    max_seq_len=8196,
+    max_seq_len=8192,
     learning_rate=5e-6,
     num_epochs=1,
     nproc_per_node=8,

From ce9309d11fd662cbb3515bd85ecd5de8f1ccd34f Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 24 Mar 2026 11:15:10 +0800
Subject: [PATCH 3/3] fix link

---
 docs/en/workbench/how_to/training_hub_fine_tuning.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
index aa7cd66..ac63978 100644
--- a/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
+++ b/docs/en/workbench/how_to/training_hub_fine_tuning.mdx
@@ -57,8 +57,8 @@ Two comprehensive tutorial notebooks are provided. Download them to your Workben
 
 | Notebook | Algorithm | Download |
 |----------|-----------|----------|
-| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [sft_comprehensive_tutorial.ipynb](/workbench/how_to/sft_comprehensive_tutorial.ipynb) |
-| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [osft_comprehensive_tutorial.ipynb](/workbench/how_to/osft_comprehensive_tutorial.ipynb) |
+| SFT Comprehensive Tutorial | Supervised Fine-Tuning | [Download sft_comprehensive_tutorial.ipynb](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) |
+| OSFT Comprehensive Tutorial | Orthogonal Subspace Fine-Tuning | [Download osft_comprehensive_tutorial.ipynb](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) |
 
 ### Step 1 — Install Dependencies