Picking up from the abandoned #404 per @RobotSail's invitation in that thread. The original PR drafted a four-tier testing methodology when this repo was the testing surface for the ilab CLI; that framing is stale now that training_hub is the user-facing front door and instructlab/training is positioned as one of several training backends.
This issue is to open a fresh discussion of what testing in this repo should actually look like given the current scope. Sketching a starting position — please push back, this is not a proposal yet.
What changed for testing
User-facing tests should live in training_hub, not here. Anything that exercises the run_training() interface contract from a user perspective belongs at the algorithm-level interface, not the SFT backend.
This repo's job is now narrower: correctness of SFT mechanics on a given backend matrix (FSDP/DeepSpeed × accelerator variant × feature flag set). The integration story is training_hub's problem.
Convergence is the harder problem. Smoke tests can verify "it doesn't crash"; what training_hub can't do for us is "this backend actually produces a model whose loss curve looks right on a fixed-seed run."
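To make the "backend matrix" in the second point concrete, here is a rough sketch of how entries could be enumerated. Everything in it is an assumption for discussion: the `MatrixEntry` type, the axis values (taken only from names mentioned in this issue), and the naming scheme are hypothetical and do not reflect anything that exists in this repo today.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical axis values, limited to names mentioned in this issue.
BACKENDS = ("fsdp", "deepspeed")
ACCELERATORS = ("nvidia", "amd")
FEATURE_SETS = (frozenset(), frozenset({"lora"}))


@dataclass(frozen=True)
class MatrixEntry:
    """One cell of the backend x accelerator x feature-flag matrix."""

    backend: str
    accelerator: str
    features: frozenset

    @property
    def name(self) -> str:
        # Stable, human-readable identifier, e.g. "nvidia-fsdp-lora".
        return "-".join([self.accelerator, self.backend, *sorted(self.features)])


def full_matrix() -> list[MatrixEntry]:
    """Enumerate every combination of the three axes."""
    return [
        MatrixEntry(backend, accelerator, features)
        for backend, accelerator, features in product(
            BACKENDS, ACCELERATORS, FEATURE_SETS
        )
    ]


if __name__ == "__main__":
    for entry in full_matrix():
        print(entry.name)
```

Even with two values per axis this is already eight entries, so the full cross product probably grows faster than GPU CI time can keep up with.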
Sketched tiers
This is the part most likely to need revision — calling out the shape, not the details.
Linting / type checks. Pre-test gate. Run on every PR.
Unit tests. No GPU. Verify isolated mechanics — config serialization, batch metric accumulation, the BatchLossManager class contracts, checkpoint round-trip on a CPU-faked model. Always run.
Smoke tests. GPU-required, per matrix entry. Verify that training runs end-to-end without crashing on a verified configuration list (a handful of named configs that cover the unique code paths — NVIDIA FSDP, NVIDIA FSDP+LoRA, AMD DeepSpeed, etc.). Hard cap somewhere around 30 minutes per entry, target under 10. Block PRs.
Convergence checks. Per matrix entry, short fixed-seed runs on a fixed dataset, loss curve exported to a stable artifact store, compared against a baseline. The baseline refreshes deliberately on new arch, major dep bump, or training-loop change. Catches the "garbage gradients on accelerator X" case that smoke tests miss.
Downstream model benchmarks (MMLU, MT-Bench) live in training_hub or release qualification, not here.
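For the convergence-check tier, the comparison step could be as small as the sketch below. It is a starting point under stated assumptions, not a design: the function name, the per-step list format, and the tolerance defaults are all hypothetical, and it says nothing about how curves are exported or stored.

```python
import math


def compare_loss_curves(baseline, candidate, rel_tol=0.05, abs_tol=1e-3):
    """Return the step indices where `candidate` diverges from `baseline`.

    Both arguments are sequences of per-step loss values from fixed-seed
    runs on the same dataset. The tolerance defaults are placeholders.
    """
    if len(baseline) != len(candidate):
        raise ValueError("loss curves must cover the same number of steps")

    diverging = []
    for step, (expected, actual) in enumerate(zip(baseline, candidate)):
        # A NaN or infinite loss is always a failure, whatever the baseline says.
        if not math.isfinite(actual):
            diverging.append(step)
            continue
        # Otherwise allow a relative band around the baseline value.
        if not math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol):
            diverging.append(step)
    return diverging


if __name__ == "__main__":
    baseline = [2.0, 1.5, 1.2, 1.0]
    print(compare_loss_curves(baseline, [2.02, 1.49, 1.21, 1.0]))
    print(compare_loss_curves(baseline, [2.0, 1.5, float("nan"), 3.0]))
```

A per-step band like this is likely too strict for accelerators with nondeterministic kernels, so the tolerances would probably need tuning per matrix entry, or the comparison could move to a smoothed curve or final-window average.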
Open questions
training_hub integration tests: what does that side already cover, and where does it currently break / need this repo's tests to catch failures before integration?
Verified configurations list: who maintains it? Living in this repo seems right but it has to be reviewable by anyone adding a new accelerator or feature.
Convergence baseline storage: where does the baseline live? GitHub Actions artifact cache (with retention rules) is one option; an external S3 bucket is another. Different cost / reproducibility / sharing tradeoffs.
Existing e2e job: #328 ("e2e test takes over an hour") has prior discussion about caching SDG-CI's pre-generated dataset. That predates the training_hub split. Is it still the right path, or does training_hub's user-level e2e make the in-repo long-running e2e job mostly redundant?
Happy to draft a more concrete proposal once there's directional agreement, or to pick up a single piece (e.g. convergence-check prototype, smoke matrix definition) if that's more useful than a big-doc-first approach.