Skip to content

Clean up repo#81

Open
PatrickRMiles wants to merge 9 commits into
LBANN:mainfrom
PatrickRMiles:miles30/cleanup
Open

Clean up repo#81
PatrickRMiles wants to merge 9 commits into
LBANN:mainfrom
PatrickRMiles:miles30/cleanup

Conversation

@PatrickRMiles

Copy link
Copy Markdown
Collaborator

This branch removes deprecated execution paths and tightens several pieces of benchmark plumbing before publication. The changes reduce unsupported configuration surface area, make logging and failure behavior more consistent across ranks, and remove special cases that no longer represent supported benchmark modes.

Key Changes

Distributed execution is now required

  • Removed the user-facing dist option from the default config and README examples.
  • Updated benchmark, worker, and trainer paths to assume distributed execution.
  • Kept one-rank runs supported as distributed singleton jobs rather than a separate non-distributed mode.
  • Added a clear config error if an old config explicitly sets dist: 0.
  • Simplified trainer sampler setup, warmup barriers, checkpoint coordination, validation reduction, and DistConv/DDP setup now that those paths no longer branch on config.dist.
  • Improved the DistConv world-size check by raising a ValueError when world_size is not divisible by the total DistConv shard count.

Deprecated zero-fractal-class behavior was removed

  • n_categories must now be a positive integer at config load and CLI override time.
  • Removed the BCEWithLogitsLoss fallback used only by the deprecated zero-category case.
  • Cross entropy is now always used for the benchmark segmentation task.
  • Foreground Dice calculations now consistently exclude the background channel with [:, 1:].
  • Volume generation now sizes fractal colors from n_categories directly instead of guarding for the old zero-category case.

CLI and config cleanup

  • Added a benchmark CLI override for --dc-shard-dims, matching the existing dc_shard_dims config option.
  • Cleaned up DistConv option names and help text so config, CLI, and runtime logging use the same terminology.
  • Made generate_fractals --datagen-batch-size drive category search batch size instead of relying on a hard-coded value.
  • Removed the stale ScaFFold/configs/benchmark_testing.yml config.
  • Removed unused benchmark sweep helper functions that were no longer called.
  • Fixed small documentation and help-text issues in the README and CLI.

Logging and diagnostics are more consistent

  • Standardized many benchmark, CLI, data-generation, trainer, evaluation, worker, and checkpoint messages on setup_mpi_logger.
  • Replaced high-traffic print statements with rank-aware logger calls.
  • Made MPI logger setup idempotent so repeated setup calls do not attach duplicate handlers.
  • Kept logger imports lighter by moving heavy imports such as Torch, NumPy, Matplotlib, and Torch distributed into the functions that actually need them.
  • Converted routine multi-rank diagnostic noise to debug logging while keeping important lifecycle messages at info or warning level.

Dataset generation and error handling are more robust

  • Fixed the dataset-generation failure path so all ranks participate in the error gather when volume generation fails. This avoids one rank raising while others hang at MPI collectives.
  • Missing fractal point-cloud files now raise FileNotFoundError with an actionable message instead of printing and calling sys.exit(1).
  • Dataset reuse messages now go through the MPI logger.
  • Git commit lookup failures during dataset metadata creation now emit logger warnings rather than raw prints.

Runtime and packaging polish

  • DistConvDDP device IDs now come from the selected CUDA device index, which is more robust when each rank sees a masked single-GPU view.
  • The benchmark worker now destroys the process group only if it was initialized.
  • psutil>=5.9.0 is now declared in both pyproject.toml and requirements.txt because ScaFFold imports it directly.
  • README examples now describe distributed-only execution, positive n_categories, per-rank batch size, DistConv shard settings, warmup batches, and convergence threshold.

Behavior Changes

  • Old configs with dist: 1 continue to work because the config loader ignores the deprecated key when it is truthy.
  • Old configs with dist: 0 now fail fast with a clear message.
  • Running with one process is still supported, but it should be launched as a one-rank distributed job.
  • n_categories: 0 is no longer accepted.
  • Category search now honors the configured or CLI-provided datagen_batch_size.
  • Logs should be less noisy and more consistently rank-qualified, especially in multi-rank jobs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant