Discussion: testing strategy for instructlab/training as a backend library #705

Picking up from the abandoned #404 per @RobotSail's invitation in that thread. The original PR drafted a four-tier testing methodology when this repo was the testing surface for the `ilab` CLI; that framing is stale now that `training_hub` is the user-facing front door and `instructlab/training` is positioned as one of several training backends.

This issue is to open a fresh discussion of what testing in this repo should actually look like given the current scope. Sketching a starting position — please push back, this is not a proposal yet.

### What changed for testing

- **User-facing tests should live in `training_hub`,** not here. Anything that exercises the `run_training()` interface contract from a user perspective belongs at the algorithm-level interface, not the SFT backend.
- **This repo's job is now narrower:** correctness of SFT mechanics on a given backend matrix (FSDP/DeepSpeed × accelerator variant × feature flag set). The integration story is `training_hub`'s problem.
- **Convergence is the harder problem.** Smoke tests can verify that training doesn't crash; what `training_hub` can't verify for us is that this backend actually produces a model whose loss curve looks right on a fixed-seed run.

### Sketched tiers

This is the part most likely to need revision — calling out the shape, not the details.

1. **Linting / type checks.** Pre-test gate. Run on every PR.
2. **Unit tests.** No GPU. Verify isolated mechanics — config serialization, batch metric accumulation, the `BatchLossManager` class contracts, checkpoint round-trip on a CPU-faked model. Always run.
3. **Smoke tests.** GPU-required, per matrix entry. Verify that training runs end-to-end without crashing on a verified configuration list (a handful of named configs that cover the unique code paths — NVIDIA FSDP, NVIDIA FSDP+LoRA, AMD DeepSpeed, etc.). Hard cap somewhere around 30 minutes per entry, target under 10. Block PRs.
4. **Convergence checks.** Per matrix entry, short fixed-seed runs on a fixed dataset, loss curve exported to a stable artifact store, compared against a baseline. The baseline refreshes deliberately on new arch, major dep bump, or training-loop change. Catches the "garbage gradients on accelerator X" case that smoke tests miss.
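
To make tier 4 concrete, here is a minimal sketch of the baseline comparison, assuming both curves are plain lists of per-step loss values from runs with the same seed and dataset. Every name here (`CurveCheck`, `compare_loss_curves`, `rel_tol`, `warmup_steps`) is hypothetical, and the 5% tolerance is a placeholder, not a measured threshold.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CurveCheck:
    passed: bool
    max_rel_deviation: float  # largest per-step relative deviation seen
    worst_step: int           # step index where that deviation occurred


def compare_loss_curves(
    baseline: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 0.05,   # assumed tolerance; would need tuning per matrix entry
    warmup_steps: int = 0,   # optionally ignore the noisiest early steps
) -> CurveCheck:
    """Compare a candidate loss curve against a stored baseline, step by step."""
    if len(baseline) != len(candidate):
        raise ValueError("curves must cover the same number of steps")
    if not 0 <= warmup_steps < len(baseline):
        raise ValueError("warmup_steps must leave at least one step to compare")

    worst, worst_step = 0.0, warmup_steps
    for step in range(warmup_steps, len(baseline)):
        ref, cand = baseline[step], candidate[step]
        # NaN/inf loss is an immediate failure: the "garbage gradients" case.
        if not math.isfinite(cand):
            return CurveCheck(False, math.inf, step)
        deviation = abs(cand - ref) / max(abs(ref), 1e-12)
        if deviation > worst:
            worst, worst_step = deviation, step
    return CurveCheck(worst <= rel_tol, worst, worst_step)


# Made-up numbers, purely for illustration.
baseline = [2.31, 1.87, 1.52, 1.30]
healthy = [2.30, 1.90, 1.50, 1.31]
broken = [2.31, 2.40, 2.55, float("nan")]

print(compare_loss_curves(baseline, healthy))
print(compare_loss_curves(baseline, broken))
```

A per-step relative check is the simplest option; a smoothed or final-window comparison may turn out to be less flaky across accelerators, which is part of what a prototype would need to find out.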

Downstream model benchmarks (MMLU, MT-Bench) live in `training_hub` or release qualification, not here.

### Open questions

1. **`training_hub` integration tests:** what does that side already cover, and where does it currently break / need this repo's tests to catch failures before integration?
2. **Verified configurations list:** who maintains it? Living in this repo seems right but it has to be reviewable by anyone adding a new accelerator or feature.
3. **Convergence baseline storage:** where does the baseline live? GitHub Actions artifact cache (with retention rules) is one option; an external S3 bucket is another. Different cost / reproducibility / sharing tradeoffs.
4. **Existing e2e job:** #328 ("e2e test takes over an hour") has prior discussion about caching SDG-CI's pre-generated dataset. That predates the `training_hub` split. Is it still the right path, or does `training_hub`'s user-level e2e make the in-repo long-running e2e job mostly redundant?
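
On question 2, one possible shape for a reviewable verified-configurations list is a small typed registry checked into the repo, plus a validation function that a unit test can call. This is a sketch only: `VerifiedConfig`, `VERIFIED_CONFIGS`, and `validate` are hypothetical names, the three entries are the examples from the smoke-test tier, and the budgets reuse the 30-minute cap and 10-minute target sketched there.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable

HARD_CAP_MINUTES = 30  # per-entry cap from the smoke-test tier


@dataclass(frozen=True)
class VerifiedConfig:
    name: str                          # stable identifier used in CI job names
    accelerator: str                   # e.g. "nvidia", "amd"
    backend: str                       # e.g. "fsdp", "deepspeed"
    features: frozenset[str] = frozenset()
    time_budget_minutes: int = 10      # target runtime for the smoke run


# Hypothetical registry; adding an accelerator or feature means adding a row here.
VERIFIED_CONFIGS = (
    VerifiedConfig("nvidia-fsdp", "nvidia", "fsdp"),
    VerifiedConfig("nvidia-fsdp-lora", "nvidia", "fsdp", frozenset({"lora"})),
    VerifiedConfig("amd-deepspeed", "amd", "deepspeed"),
)


def validate(configs: Iterable[VerifiedConfig]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the registry is sane."""
    problems: list[str] = []
    seen_names: set[str] = set()
    seen_shapes: dict[tuple, str] = {}
    for cfg in configs:
        if cfg.name in seen_names:
            problems.append(f"duplicate name: {cfg.name}")
        seen_names.add(cfg.name)

        # Two entries with the same shape exercise the same code path twice.
        shape = (cfg.accelerator, cfg.backend, cfg.features)
        if shape in seen_shapes:
            problems.append(f"{cfg.name} duplicates the shape of {seen_shapes[shape]}")
        else:
            seen_shapes[shape] = cfg.name

        if cfg.time_budget_minutes > HARD_CAP_MINUTES:
            problems.append(f"{cfg.name} exceeds the {HARD_CAP_MINUTES}-minute cap")
    return problems


print(validate(VERIFIED_CONFIGS))
```

Keeping the list as data means a new accelerator shows up as a one-line diff, and the validation can run in the no-GPU unit tier.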

Happy to draft a more concrete proposal once there's directional agreement, or to pick up a single piece (e.g. convergence-check prototype, smoke matrix definition) if that's more useful than a big-doc-first approach.
