A Scalable PyTorch Training Framework: SOTA-level capabilities with 100% YAML-driven configuration.
TrainScale isn't just another training script. It's a comprehensive, modular architecture designed to solve the "last mile" problem in LLM training: Data Engineering.
Most frameworks treat data loading as an afterthought. TrainScale makes it a first-class citizen with SOTA preprocessing features usually found only in proprietary codebases (like flexible packing, token-aware distribution, and thorough dataset introspection).
- Zero Hardcoding: Every aspect of the pipeline is controlled via YAML.
- SOTA Data Pipeline: Smart truncation, content-aware token distribution, and dynamic packing.
- Rust-Inspired Reliability: Uses `Result<T, E>` patterns for robust error handling.
- Hardware Optimized: Built-in support for Flash Attention 2, Triton kernels, and 8-bit optimizers.
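The `Result` pattern referenced above can be sketched in a few lines. This is a minimal illustration of the idea, not the actual `core/types.py` implementation, and `load_split` is a hypothetical helper:

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")

@dataclass
class Ok(Generic[T]):
    value: T

@dataclass
class Err(Generic[E]):
    error: E

# A Result is either an Ok carrying a value or an Err carrying a diagnostic.
Result = Union[Ok[T], Err[E]]

def load_split(available: list[str], requested: str) -> "Result[str, str]":
    # Callers must handle both branches explicitly (isinstance checks or
    # pattern matching) instead of catching exceptions after the fact.
    if requested in available:
        return Ok(requested)
    return Err(f"split '{requested}' not found; available: {available}")
```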
The TrainScale pipeline operates in distinct, modular stages to ensure scalability and reproducibility.
```mermaid
graph LR
    A[YAML Configuration] --> B[Dataset Introspector]
    B --> C[Dataset Loader]
    C --> D[Prompt Engine]
    D --> E[Length Manager]
    E --> F[Tokenizer Wrapper]
    F --> G[DataLoader Builder]
    G --> H[SOTATrainer]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:2px
    style H fill:#bfb,stroke:#333,stroke-width:2px
```
- Problem: Hardcoding split names (`train`, `validation`) and columns (`text`, `input`) makes code brittle.
- Solution: Automatically inspects HuggingFace datasets to discover available splits and columns, mapping them to a standardized schema defined in your YAML.
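As a rough sketch of the idea, using plain dicts to stand in for a loaded HuggingFace `DatasetDict` (`introspect` and `schema_map` are hypothetical names, not TrainScale's real API):

```python
def introspect(dataset: dict[str, list[dict]], schema_map: dict[str, str]) -> dict:
    """Discover splits/columns and map them onto a standard schema.

    `dataset` stands in for a loaded DatasetDict: split name -> rows.
    `schema_map` comes from YAML, e.g. {"text": "content"} maps the
    raw column 'content' onto the standardized role 'text'.
    """
    splits = list(dataset.keys())
    columns = sorted({k for rows in dataset.values() for row in rows for k in row})
    resolved = {}
    for role, raw in schema_map.items():
        if raw not in columns:
            raise KeyError(f"column '{raw}' not in dataset columns {columns}")
        resolved[role] = raw
    return {"splits": splits, "columns": columns, "schema": resolved}
```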
- Problem: Simple truncation cuts off important context; "max_length" is a blunt instrument.
- Solution:
- Smart Truncation: Respects sentence and word boundaries.
- Content Distribution: Allocates token budgets intelligently (e.g., "Give 60% to context, 40% to history").
- Priority Trimming: Drops least important columns first when context window is exceeded.
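The content-distribution idea can be sketched as proportional budget allocation with slack redistribution. This is illustrative only; `allocate_budget` is a hypothetical helper, not TrainScale's actual implementation:

```python
def allocate_budget(max_total: int, ratios: dict[str, float],
                    lengths: dict[str, int]) -> dict[str, int]:
    """Split a token budget across columns by ratio, capped at each
    column's actual length. Unused budget from short columns becomes
    slack that long columns absorb, largest deficit first."""
    budgets = {c: int(max_total * r) for c, r in ratios.items()}
    slack = sum(max(0, budgets[c] - lengths[c]) for c in budgets)
    limits = {c: min(lengths[c], budgets[c]) for c in budgets}
    for col in sorted(budgets, key=lambda c: lengths[c] - budgets[c], reverse=True):
        if slack <= 0:
            break
        take = min(lengths[col] - limits[col], slack)
        limits[col] += take
        slack -= take
    return limits
```

With a 100-token budget split 60/40, a 30-token context hands its unused 30 tokens to a 100-token history column.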
- Problem: Training scripts are often monolithic and hard to extend.
- Solution: A modular trainer supporting multiple backends (FSDP, DDP, QLoRA) and advanced features like:
- Optimizers: Adam8bit, Lion, SophiaG, Prodigy.
- Schedulers: Cosine, WSD (Warmup-Stable-Decay), REX.
- Loss Functions: Fused CrossEntropy, DPO, SimPO.
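The WSD schedule listed above is straightforward to sketch: linear warmup, a flat plateau, then a linear decay over the final fraction of training. The fractions below are illustrative defaults, not TrainScale's:

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.1, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: ramp to peak_lr, hold, then decay to zero."""
    warmup = int(total_steps * warmup_frac)
    decay_start = total_steps - int(total_steps * decay_frac)
    if step < warmup:
        return peak_lr * (step + 1) / warmup       # linear warmup
    if step < decay_start:
        return peak_lr                              # stable plateau
    decay_len = total_steps - decay_start
    return peak_lr * (total_steps - step) / decay_len  # linear decay
```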
```bash
# Clone repository
git clone https://github.com/generalaimodels/TrainScale.git
cd TrainScale

# Optional: Flash Attention 2 (Recommended for Ampere+)
pip install flash-attn --no-build-isolation
```

We provide a production-ready example in `examples/`. This script auto-detects your GPU setup (ROCm/CUDA) and launches a DDP training run.
```bash
# Single GPU
python examples/rocm_sota_demo_ddp.py --config examples/rocm_sota_config.yaml

# Multi-GPU (e.g., 4 GPUs)
torchrun --nproc_per_node=4 examples/rocm_sota_demo_ddp.py --config examples/rocm_sota_config.yaml
```

TrainScale is optimized for a wide range of hardware, from consumer GPUs to H100 clusters.
| GPU | VRAM | Mode | Max Context | Batch Size | Technique |
|---|---|---|---|---|---|
| RTX 3090 | 24GB | QLoRA | 2048 | 4 | 4-bit NF4 + Gradient Checkpointing |
| RTX 4090 | 24GB | QLoRA | 4096 | 4 | 4-bit NF4 + Flash Attn 2 |
| A100 40GB | 40GB | LoRA | 8192 | 8 | BF16 + Flash Attn 2 |
| A100 80GB | 80GB | Full | 8192 | 16 | BF16 + FSDP |
| H100 | 80GB | Full | 16384 | 32 | FP8 + Transformer Engine |
| Mac M1/M2 | Unified | MPS | 2048 | 1-2 | FP16 (Experimental) |
Let's be honest: training LLMs efficiently is hard. If you don't optimize, you are burning money and time. Here is the technical reality of high-performance training with TrainScale:
Stop using fp32. It consumes 2x memory and 2x bandwidth for zero perceptible gain in SFT.
- Mandatory: Use `bf16` (Brain Float 16) on Ampere/MI300+ hardware. It prevents the overflow/underflow issues common in `fp16` without the cost of `fp32`.
- Next-Gen: If you have H100s or MI300X, use `fp8` via Transformer Engine (supported in TrainScale).
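Why `bf16` sidesteps `fp16`'s overflow problems is easy to demonstrate: bfloat16 is just the top 16 bits of an fp32 value, so it keeps the full 8-bit exponent (same dynamic range) while dropping mantissa precision. A pure-Python round-trip, for illustration only:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to bfloat16 by keeping the top 16 bits of its
    fp32 bit pattern (round-to-nearest-even on the dropped bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# 1e38 survives in bf16 (max ~3.4e38); fp16 saturates at 65504.
# Precision is coarse: only ~3 decimal digits of mantissa remain.
```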
Native PyTorch layers have overhead. We stripped them out.
- Flash Attention 2: Non-negotiable for sequences > 2048. We enforce this by default.
- Triton Kernels: We implemented custom fused kernels for RMSNorm, RoPE, and CrossEntropy. If you disable these, your throughput will drop by 30-40%.
- DDP (Distributed Data Parallel): Perfect for LoRA/QLoRA on < 8 GPUs. Fast, simple, robust.
- FSDP (Fully Sharded Data Parallel): The only way to do Full Fine-Tuning on huge models (70B+). If you try DDP for full fine-tuning a 70B model on 24GB cards, you will OOM instantly.
- ZeRO-3: We support it, but it adds communication overhead. Use only if FSDP doesn't fit.
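A back-of-envelope estimate shows why full fine-tuning under DDP OOMs instantly: assuming bf16 weights and gradients plus fp32 AdamW master weights and moments (16 bytes/param total), and ignoring activations, the numbers below follow. `full_ft_mem_gb` is an illustrative helper, not part of TrainScale:

```python
def full_ft_mem_gb(n_params_b: float, n_gpus: int, sharded: bool) -> float:
    """Rough per-GPU memory (GB) for full fine-tuning with AdamW:
    bf16 params (2 B) + bf16 grads (2 B) + fp32 master weights and
    two Adam moments (12 B) = 16 bytes/param, activations excluded.
    FSDP shards all of it across GPUs; DDP replicates it per GPU."""
    total_gb = n_params_b * 16  # 1B params -> 16 GB of state
    return total_gb / n_gpus if sharded else total_gb

# 70B model on 8 GPUs:
#   DDP:  70 * 16     = 1120 GB per GPU -> impossible on any single card
#   FSDP: 70 * 16 / 8 =  140 GB per GPU -> needs multi-node or offload
```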
Most training runs are bottlenecked by CPU data processing, not GPU compute.
- TrainScale Solution: We pre-tokenize and "pack" datasets. We don't just truncate; we fill context windows (e.g., 4096) completely with multiple samples. This increases effective throughput by 2x-3x compared to naive padding.
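The packing idea can be sketched as a greedy concatenator over pre-tokenized samples. This is a simplified illustration, not TrainScale's actual packer:

```python
def pack_sequences(samples: list[list[int]], window: int,
                   eos: int = 0) -> list[list[int]]:
    """Greedily concatenate tokenized samples (separated by an EOS
    token) into fixed-size context windows, instead of padding each
    sample individually. Oversized samples are truncated to fit."""
    packed, current = [], []
    for tokens in samples:
        tokens = tokens[: window - 1]  # leave room for the EOS token
        if len(current) + len(tokens) + 1 > window:
            packed.append(current)
            current = []
        current.extend(tokens)
        current.append(eos)
    if current:
        packed.append(current)
    return packed
```

Three 5-token samples fit into two 16-token windows here, where naive padding would burn three full windows.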
- Standard AdamW: Memory hog. Avoid for models > 7B unless you have 80GB VRAM.
- AdamW-8bit: Recommended. Same convergence as 32-bit but uses 75% less memory for optimizer states.
- Lion: Great for throughput (simpler math than Adam), but requires careful hyper-parameter tuning.
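The 75% figure is simple arithmetic over AdamW's two per-parameter state tensors (`exp_avg` and `exp_avg_sq`). An illustrative helper, not bitsandbytes' API:

```python
def adam_state_bytes(n_params: int, bits_per_state: int) -> int:
    """AdamW keeps two state tensors per parameter. 32-bit states
    cost 8 bytes/param; 8-bit quantized states cost 2 bytes/param,
    which is the 75% saving cited above."""
    return n_params * 2 * (bits_per_state // 8)

n = 7_000_000_000                          # 7B-parameter model
fp32_gb = adam_state_bytes(n, 32) / 1e9    # 56 GB of optimizer state
int8_gb = adam_state_bytes(n, 8) / 1e9     # 14 GB
```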
TL;DR: Use BF16 + FlashAttn-2 + Packed Data + 8-bit Optimizer. Anything else is suboptimal.
Control how your data is processed with granular detail:
```yaml
preprocessing:
  length_manager:
    enabled: true
    max_total_length: 4096
    truncation_strategy: "smart"  # smart, sentence_boundary, word_boundary

    # Precise control over character limits per column
    per_column_limits:
      instruction: 500
      input: 2000
      output: 1500

    content_distribution:
      enabled: true
      mode: "proportional"  # or 'priority', 'ratio'
      column_ratios:
        instruction: 0.2
        input: 0.3
        output: 0.5
```

Switch between training modes and hardware optimizations instantly:
```yaml
training:
  mode: "qlora"  # full, lora, qlora
  hardware:
    precision: "bf16"
    compile_model: true  # torch.compile
  optimizer:
    type: "adamw_8bit"  # 75% memory saving over standard AdamW
    learning_rate: 2e-4
  scheduler:
    type: "wsd"  # Warmup-Stable-Decay (LLaMA-3 style)
```

We welcome contributions! Whether you're fixing a bug, adding a new feature, or improving documentation, here's how you can help:
- New Data Connectors: Support for SQL, S3, or Arrow datasets.
- Additional Kernels: Implement optimized Triton kernels for new attention mechanisms.
- Model Support: Add configs for new architectures (Mistral, Gemma, Phi).
- Benchmarks: Run hardware benchmarks and update the README table.
- Type Hints: All code must be fully type-hinted (`mypy` compliant).
- Error Handling: Use the `Result` type from `core/types.py` instead of raising raw exceptions where possible.
- Config-First: Avoid hardcoding. If a value might change, put it in the YAML schema.
- Tests: Add unit tests for new modules. Run existing tests before pushing.
- Fork the repo.
- Create a branch: `git checkout -b feature/my-cool-feature`.
- Commit your changes.
- Push to your fork and submit a Pull Request.
- **Phase 1: Foundation** (Complete) ✅
  - End-to-end YAML pipeline
  - SOTA preprocessing module
  - QLoRA/LoRA support
- **Phase 2: Scale** (In Progress) 🚧
  - Multi-node FSDP training
  - DeepSpeed integration
  - Streaming dataset support for infinite datasets
- **Phase 3: Multimodal** (Planned) 🔮
  - Image/Video tokenization support
  - Audio processing pipeline
MIT License. See LICENSE for details.
TrainScale: Train Smarter, Scale Faster 🚀