Clean up repo by PatrickRMiles · Pull Request #81 · LBANN/ScaFFold

PatrickRMiles · 2026-06-18T15:49:20Z

This branch removes deprecated execution paths and tightens several pieces of benchmark plumbing before publication. The changes reduce unsupported configuration surface area, make logging and failure behavior more consistent across ranks, and remove special cases that no longer represent supported benchmark modes.

Key Changes

Distributed execution is now required

Removed the user-facing dist option from the default config and README examples.
Updated benchmark, worker, and trainer paths to assume distributed execution.
Kept one-rank runs supported as distributed singleton jobs rather than a separate non-distributed mode.
Added a clear config error if an old config explicitly sets dist: 0.
Simplified trainer sampler setup, warmup barriers, checkpoint coordination, validation reduction, and DistConv/DDP setup now that those paths no longer branch on config.dist.
Improved the DistConv world-size check by raising a ValueError when world_size is not divisible by the total DistConv shard count.

Deprecated zero-fractal-class behavior was removed

n_categories must now be a positive integer at config load and CLI override time.
Removed the BCEWithLogitsLoss fallback used only by the deprecated zero-category case.
Cross entropy is now always used for the benchmark segmentation task.
Foreground Dice calculations now consistently exclude the background channel with [:, 1:].
Volume generation now sizes fractal colors from n_categories directly instead of guarding for the old zero-category case.

CLI and config cleanup

Added a benchmark CLI override for --dc-shard-dims, matching the existing dc_shard_dims config option.
Cleaned up DistConv option names and help text so config, CLI, and runtime logging use the same terminology.
Made generate_fractals --datagen-batch-size drive category search batch size instead of relying on a hard-coded value.
Removed the stale ScaFFold/configs/benchmark_testing.yml config.
Removed unused benchmark sweep helper functions that were no longer called.
Fixed small documentation and help-text issues in the README and CLI.

Logging and diagnostics are more consistent

Standardized many benchmark, CLI, data-generation, trainer, evaluation, worker, and checkpoint messages on setup_mpi_logger.
Replaced high-traffic print statements with rank-aware logger calls.
Made MPI logger setup idempotent so repeated setup calls do not attach duplicate handlers.
Kept logger imports lighter by moving heavy imports such as Torch, NumPy, Matplotlib, and Torch distributed into the functions that actually need them.
Converted routine multi-rank diagnostic noise to debug logging while keeping important lifecycle messages at info or warning level.

Dataset generation and error handling are more robust

Fixed the dataset-generation failure path so all ranks participate in the error gather when volume generation fails. This avoids one rank raising while others hang at MPI collectives.
Missing fractal point-cloud files now raise FileNotFoundError with an actionable message instead of printing and calling sys.exit(1).
Dataset reuse messages now go through the MPI logger.
Git commit lookup failures during dataset metadata creation now emit logger warnings rather than raw prints.

Runtime and packaging polish

DistConvDDP device IDs now come from the selected CUDA device index, which is more robust when each rank sees a masked single-GPU view.
The benchmark worker now destroys the process group only if it was initialized.
psutil>=5.9.0 is now declared in both pyproject.toml and requirements.txt because ScaFFold imports it directly.
README examples now describe distributed-only execution, positive n_categories, per-rank batch size, DistConv shard settings, warmup batches, and convergence threshold.

Behavior Changes

Old configs with dist: 1 continue to work because the config loader ignores the deprecated key when it is truthy.
Old configs with dist: 0 now fail fast with a clear message.
Running with one process is still supported, but it should be launched as a one-rank distributed job.
n_categories: 0 is no longer accepted.
Category search now honors the configured or CLI-provided datagen_batch_size.
Logs should be less noisy and more consistently rank-qualified, especially in multi-rank jobs.

… and edge case handling; lazier imports

Patrick Miles added 9 commits June 11, 2026 08:52

remove unneeded extra config

51765cf

get datagen batch size from config rather than hard coding

2834d51

CLI updates: add dc-shard-dims override, lazy imports where possible

184a0bb

fix typo and cleanup deprecated function

b7ddca5

add psutil as explicit proj dep

970491a

remove option for running in non-distributed mode

71ec053

Remove deprecated 0-fractal-class use case

5d78c0e

standardize on MPI logger, clean up benchmark.py, better volgen error…

a22b9be

… and edge case handling; lazier imports

ruff

093a01a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up repo#81

Clean up repo#81
PatrickRMiles wants to merge 9 commits into
LBANN:mainfrom
PatrickRMiles:miles30/cleanup

PatrickRMiles commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PatrickRMiles commented Jun 18, 2026

Key Changes

Distributed execution is now required

Deprecated zero-fractal-class behavior was removed

CLI and config cleanup

Logging and diagnostics are more consistent

Dataset generation and error handling are more robust

Runtime and packaging polish

Behavior Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant