Commit 1a9fd83

fix: update notebook for packing training + HuggingFace data + base model
1 parent 345629a commit 1a9fd83

1 file changed

Lines changed: 73 additions & 41 deletions

File tree

notebooks/colab_128k_training.ipynb

@@ -5,36 +5,41 @@
 "metadata": {},
 "source": [
 "# 🎯 Stack 2.9 — 128K Context Fine-tuning\n",
-"Fine-tune Qwen2.5-Coder-1.5B from 32K → 128K context\n",
 "\n",
-"**Runtime:** GPU (T4 16GB recommended) | **Time:** ~2-3 hours"
+"Fine-tune **Qwen2.5-Coder-1.5B** with **packed 128K context windows**.\n",
+"\n",
+"**Key innovation:** Instead of training on short ~500-token examples, we **pack 200+ examples** into each 128K window. This multiplies training signal and teaches the model to track tool state across long, multi-turn interactions.\n",
+"\n",
+"**Runtime:** Runtime → Change runtime type → **GPU (T4 16GB recommended)**\n",
+"**Time:** ~6-8 hours on free Colab T4"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": [
-"## Step 1: Clone Stack 2.9 & Install Dependencies"
-]
+"source": ["## Step 1: Clone Stack 2.9 & Install Dependencies"]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
+"# Clone the repo (gets the fixed training script)\n",
 "!git clone https://github.com/my-ai-stack/stack-2.9.git\n",
 "cd stack-2.9\n",
-"!pip install -q transformers peft datasets bitsandbytes accelerate huggingface_hub\n",
-"!pip install -q scipy torch --upgrade"
+"\n",
+"# Install all dependencies\n",
+"!pip install -q transformers peft datasets bitsandbytes>=0.46.1 accelerate huggingface_hub scipy\n",
+"!pip install -q torch --upgrade\n",
+"\n",
+"print('✅ Dependencies installed')"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": [
-"## Step 2: Login to HuggingFace (push weights later)"
-]
+"source": ["## Step 2: Login to HuggingFace\n\nGet your token at: https://huggingface.co/settings/tokens"]
 },
 {
 "cell_type": "code",
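The new notebook header claims 200+ short examples packed into each 128K window. As a back-of-envelope check, here is a minimal, hypothetical sketch of greedy sequence packing; the actual logic is whatever `--use-packing` enables in `training/train_extended_context.py`, and the `pack_examples` helper and EOS separator are assumptions, not the script's API:

```python
def pack_examples(examples, window_size, eos_id=0):
    """Concatenate tokenized examples (each followed by an EOS separator)
    into fixed-size windows, starting a new window whenever the next
    example would overflow the current one."""
    windows, current = [], []
    for toks in examples:
        if current and len(current) + len(toks) + 1 > window_size:
            windows.append(current)
            current = []
        current.extend(toks)
        current.append(eos_id)
    if current:
        windows.append(current)
    return windows

# Toy check: ~500-token examples packed into a 131072-token window
examples = [[1] * 500 for _ in range(300)]
windows = pack_examples(examples, window_size=131072)
print(len(windows), len(windows[0]))  # → 2 130761
```

Each 131072-token window holds 261 of these 501-token (500 + EOS) examples, which is consistent with the "200+ examples per window" claim in the diff above.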
@@ -43,15 +48,16 @@
 "outputs": [],
 "source": [
 "from huggingface_hub import login\n",
-"# Get your token at: https://huggingface.co/settings/tokens\n",
-"login(token=\"YOUR_HF_TOKEN\") # ← Replace with your token"
+"# 👇 Replace with YOUR HuggingFace token\n",
+"login(token=\"YOUR_HF_TOKEN_HERE\") # ← 🔴 PUT YOUR HF TOKEN HERE\n",
+"print('✅ Logged into HuggingFace')"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Step 3: Mount Google Drive (optional — for saving checkpoints)"
+"## Step 3: Mount Google Drive\n\nTraining checkpoints and the final adapter will be saved here."
 ]
 },
 {
@@ -62,14 +68,16 @@
 "source": [
 "from google.colab import drive\n",
 "drive.mount('/content/drive')\n",
-"OUTPUT_DIR = \"/content/drive/MyDrive/stack-2.9-128k-output\""
+"OUTPUT_DIR = '/content/drive/MyDrive/stack-2.9-128k-output'\n",
+"import os; os.makedirs(OUTPUT_DIR, exist_ok=True)\n",
+"print(f'📁 Output directory: {OUTPUT_DIR}')"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Step 4: Run 128K Context Fine-tuning"
+"## Step 4: Download Training Data\n\nWe use the dataset uploaded to HuggingFace Hub — 1500 tool-calling examples, packed into 128K sequences."
 ]
 },
 {
@@ -78,31 +86,39 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"import subprocess\n",
-"result = subprocess.run([\n",
-" \"python3\", \"training/train_extended_context.py\",\n",
-" \"--model-path\", \"my-ai-stack/Stack-2-9-finetuned\",\n",
-" \"--data-path\", \"training/training-data/tool_examples_combined.jsonl\",\n",
-" \"--output-dir\", OUTPUT_DIR,\n",
-" \"--context-length\", \"131072\",\n",
-" \"--lora-rank\", \"64\",\n",
-" \"--epochs\", \"3\",\n",
-" \"--push-to-hub\",\n",
-" \"--hub-model-id\", \"YOUR_USERNAME/stack-2.9-128k\"\n",
-"], cwd=\"/content/stack-2.9\")\n",
-"print(result.stdout)\n",
-"print(result.stderr)"
+"import huggingface_hub\n",
+"\n",
+"DATA_FILE = '/content/tool_examples.jsonl'\n",
+"\n",
+"print('Downloading training data from HuggingFace...')\n",
+"hf_id = 'walidsobhie/stack-2-9-tool-examples'\n",
+"path = huggingface_hub.hf_hub_download(\n",
+" repo_id=hf_id,\n",
+" filename='tool_examples_combined.jsonl',\n",
+" repo_type='dataset',\n",
+" local_dir='/content/',\n",
+" local_dir_use_symlinks=False,\n",
+")\n",
+"import shutil\n",
+"shutil.move(path, DATA_FILE)\n",
+"print(f'✅ Dataset ready: {DATA_FILE}')\n",
+"\n",
+"# Quick sanity check\n",
+"import json\n",
+"with open(DATA_FILE) as f:\n",
+" lines = f.readlines()\n",
+"print(f' Total examples: {len(lines)}')\n",
+"ex = json.loads(lines[0])\n",
+"print(f' Keys: {list(ex.keys())}')"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"---\n",
+"## Step 5: Run 128K Packed Context Fine-tuning\n\n**This cell runs the full training. On free Colab T4 it takes ~6-8 hours.**\n",
 "\n",
-"## Alternative: Run on Base Qwen Model (if HF model not loaded)\n",
-"\n",
-"If the fine-tuned model isn't available, use the base model:"
+"If Colab disconnects, your checkpoints are safe in Google Drive. Reconnect and re-run this cell — it will resume from the last checkpoint."
 ]
 },
 {
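The Step 5 markdown promises that a re-run resumes from the last checkpoint. A hedged sketch of how the newest checkpoint can be located under OUTPUT_DIR, assuming the training script writes HF-Trainer-style `checkpoint-N` directories (the `latest_checkpoint` helper is hypothetical; `transformers.trainer_utils.get_last_checkpoint` serves the same purpose in practice):

```python
import os
import re

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N directory,
    or None if no checkpoint exists yet (fresh run)."""
    if not os.path.isdir(output_dir):
        return None
    ckpts = [d for d in os.listdir(output_dir)
             if re.fullmatch(r"checkpoint-\d+", d)]
    if not ckpts:
        return None
    newest = max(ckpts, key=lambda d: int(d.rsplit("-", 1)[1]))
    return os.path.join(output_dir, newest)
```

A training script would typically pass this as `trainer.train(resume_from_checkpoint=latest_checkpoint(OUTPUT_DIR))`, so a Colab reconnect picks up where the run left off.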
@@ -111,22 +127,38 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Change --model-path to:\n",
-"# \"Qwen/Qwen2.5-Coder-1.5B\"\n",
-"# And add --push-to-hub with your own model ID"
+"import subprocess\n",
+"\n",
+"# Run the fixed training script with packing enabled\n",
+"result = subprocess.run([\n",
+" \"python3\", \"training/train_extended_context.py\",\n",
+" \"--model-path\", \"Qwen/Qwen2.5-Coder-1.5B\",\n",
+" \"--data-path\", \"/content/tool_examples.jsonl\",\n",
+" \"--output-dir\", OUTPUT_DIR,\n",
+" \"--context-length\", \"131072\",\n",
+" \"--lora-rank\", \"32\",\n",
+" \"--epochs\", \"3\",\n",
+" \"--batch-size\", \"1\",\n",
+" \"--grad-accum\", \"16\",\n",
+" \"--lr\", \"2e-4\",\n",
+" \"--use-packing\",\n",
+" \"--push-to-hub\",\n",
+" \"--hub-model-id\", \"walidsobhie/stack-2.9-128k-context\"\n",
+"], cwd=\"/content/stack-2.9\")\n",
+"\n",
+"print('STDOUT:', result.stdout)\n",
+"print('STDERR:', result.stderr[-3000:] if result.stderr else '(none)')"
 ]
 }
 ],
 "metadata": {
-"accelerator": "GPU",
 "colab": {
 "provenance": [],
-"machine_shape": "hm"
+"name": "stack-2.9-128k-packed-training"
 },
 "kernelspec": {
-"display_name": "Python 3",
-"language": "python",
-"name": "python3"
+"name": "python3",
+"display_name": "Python 3"
 },
 "language_info": {
 "name": "python",
