
Discussion: testing strategy for instructlab/training as a backend library #705

@gkneighb


Picking up from the abandoned #404 per @RobotSail's invitation in that thread. The original PR drafted a four-tier testing methodology when this repo was the testing surface for the ilab CLI; that framing is stale now that training_hub is the user-facing front door and instructlab/training is positioned as one of several training backends.

This issue is to open a fresh discussion of what testing in this repo should actually look like given the current scope. Sketching a starting position — please push back, this is not a proposal yet.

What changed for testing

  • User-facing tests should live in training_hub, not here. Anything that exercises the run_training() interface contract from a user perspective belongs at the algorithm-level interface, not the SFT backend.
  • This repo's job is now narrower: correctness of SFT mechanics on a given backend matrix (FSDP/DeepSpeed × accelerator variant × feature flag set). The integration story is training_hub's problem.
  • Convergence is the harder problem. Smoke tests can verify "it doesn't crash"; what training_hub can't do for us is verify that this backend actually produces a model whose loss curve looks right on a fixed-seed run.

Sketched tiers

This is the part most likely to need revision — calling out the shape, not the details.

  1. Linting / type checks. Pre-test gate. Run on every PR.
  2. Unit tests. No GPU. Verify isolated mechanics — config serialization, batch metric accumulation, the BatchLossManager class contracts, checkpoint round-trip on a CPU-faked model. Always run.
  3. Smoke tests. GPU-required, per matrix entry. Verify that training runs end-to-end without crashing on a verified configuration list (a handful of named configs that cover the unique code paths: NVIDIA FSDP, NVIDIA FSDP+LoRA, AMD DeepSpeed, etc.). Hard cap somewhere around 30 minutes per entry, target under 10. Failures block PRs.
  4. Convergence checks. Per matrix entry, short fixed-seed runs on a fixed dataset, loss curve exported to a stable artifact store, compared against a baseline. The baseline refreshes deliberately on new arch, major dep bump, or training-loop change. Catches the "garbage gradients on accelerator X" case that smoke tests miss.
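
To make tier 4 a bit more concrete, here is a minimal sketch of what the baseline comparison could look like. It is not a proposal for the actual implementation: `check_loss_curve`, `CurveCheck`, the 5% tolerance, and the smoothing window are all placeholder names and values, and it assumes the loss curve has already been exported as a plain list of per-step losses.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CurveCheck:
    """Result of comparing one candidate run against the baseline."""
    ok: bool
    reason: str


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def check_loss_curve(
    baseline: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 0.05,  # placeholder tolerance, not a tuned value
    window: int = 5,        # placeholder smoothing window, in steps
) -> CurveCheck:
    """Hypothetical check: does a fixed-seed run still track the baseline?"""
    if len(candidate) != len(baseline):
        return CurveCheck(False, "step count differs from baseline")
    if window < 1 or len(candidate) < 2 * window:
        return CurveCheck(False, "curve too short for the smoothing window")
    # NaN or inf losses are the clearest sign of garbage gradients.
    if not all(math.isfinite(x) for x in candidate):
        return CurveCheck(False, "non-finite loss value")

    # Average over a window so single-step noise does not decide the result.
    start = _mean(candidate[:window])
    end = _mean(candidate[-window:])
    if end >= start:
        return CurveCheck(False, "loss did not decrease")

    expected = _mean(baseline[-window:])
    if abs(end - expected) > rel_tol * abs(expected):
        return CurveCheck(False, "final loss outside tolerance of baseline")
    return CurveCheck(True, "ok")


baseline = [2.0, 1.5, 1.2, 1.0, 0.9, 0.85]
candidate = [2.0, 1.52, 1.19, 1.01, 0.9, 0.86]
print(check_loss_curve(baseline, candidate, window=2))
```

Comparing windowed means rather than individual steps is a deliberate choice in this sketch: fixed-seed runs are not always bit-identical across accelerators, so a per-step equality check would likely be too strict.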

Downstream model benchmarks (MMLU, MT-Bench) live in training_hub or release qualification, not here.

Open questions

  1. training_hub integration tests: what does that side already cover, and where does it currently break / need this repo's tests to catch failures before integration?
  2. Verified configurations list: who maintains it? Living in this repo seems right but it has to be reviewable by anyone adding a new accelerator or feature.
  3. Convergence baseline storage: where does the baseline live? GitHub Actions artifact cache (with retention rules) is one option; an external S3 bucket is another. Different cost / reproducibility / sharing tradeoffs.
  4. Existing e2e job: #328 ("e2e test takes over an hour") has prior discussion about caching SDG-CI's pre-generated dataset. That predates the training_hub split. Is it still the right path, or does training_hub's user-level e2e make the in-repo long-running e2e job mostly redundant?
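
On open question 2, one way to keep the verified configurations list reviewable is to hold it as plain data in the repo with a small validator that CI runs. The sketch below is only an illustration under that assumption: `VerifiedConfig`, `VERIFIED_CONFIGS`, `validate`, and the field names are all hypothetical. The three entries and the 30/10 minute limits come from the tier 3 sketch above.

```python
from dataclasses import dataclass

HARD_CAP_MINUTES = 30  # hard cap per entry, from the tier 3 sketch
TARGET_MINUTES = 10    # target per entry, from the tier 3 sketch
KNOWN_BACKENDS = {"fsdp", "deepspeed"}


@dataclass(frozen=True)
class VerifiedConfig:
    """One named smoke-test configuration (hypothetical schema)."""
    name: str
    accelerator: str  # e.g. "nvidia" or "amd"
    backend: str      # "fsdp" or "deepspeed"
    features: frozenset[str] = frozenset()
    timeout_minutes: int = TARGET_MINUTES


VERIFIED_CONFIGS = (
    VerifiedConfig("nvidia-fsdp", "nvidia", "fsdp"),
    VerifiedConfig("nvidia-fsdp-lora", "nvidia", "fsdp", frozenset({"lora"})),
    VerifiedConfig("amd-deepspeed", "amd", "deepspeed"),
)


def validate(configs) -> list[str]:
    """Return human-readable problems; an empty list means the list is valid."""
    errors = []
    seen_names = set()
    seen_paths = set()
    for cfg in configs:
        if cfg.name in seen_names:
            errors.append(f"{cfg.name}: duplicate name")
        seen_names.add(cfg.name)

        # Two entries with the same tuple exercise the same code path.
        path = (cfg.accelerator, cfg.backend, cfg.features)
        if path in seen_paths:
            errors.append(f"{cfg.name}: duplicates an existing code path")
        seen_paths.add(path)

        if cfg.backend not in KNOWN_BACKENDS:
            errors.append(f"{cfg.name}: unknown backend {cfg.backend!r}")
        if cfg.timeout_minutes > HARD_CAP_MINUTES:
            errors.append(f"{cfg.name}: timeout exceeds the hard cap")
    return errors


print(validate(VERIFIED_CONFIGS))
```

With something like this, adding a new accelerator or feature becomes a one-line diff to the list, and the validator flags entries that duplicate an existing path or exceed the time cap before a reviewer has to.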

Happy to draft a more concrete proposal once there's directional agreement, or to pick up a single piece (e.g. convergence-check prototype, smoke matrix definition) if that's more useful than a big-doc-first approach.
