Use throughput data for Conformer RMSD benchmark #187

Open
scal444 wants to merge 7 commits into NVIDIA-BioNeMo:main from scal444:split/conformer-rmsd-batch

scal444 (Collaborator) commented Jun 1, 2026

Previously, this benchmark ran on single, hardcoded SMILES. It now follows the structure of our other benchmarks and adds an early exit for the RDKit path. The early exit is a bit awkward right now because we can't use time_it; I've filed a tracking bug.

scal444 added 7 commits May 29, 2026 14:24
Replace the per-mol, hardcoded-SMILES benchmark with a batch-mode bench
that:

* Loads a slice of SMILES from a file (via bench_utils.load_smiles).
* Embeds one base conformer per mol in parallel and jitters via the
  shared bench_utils.embed_and_jitter (with add_hs so ETKDG sees a
  chemically reasonable graph).
* Times a single GetConformerRMSMatrixBatch call vs a serial RDKit loop.
* Sweeps confs_per_mol with a single embed run plus _slice_to_confs
  reuse, so every row sees the same molecule selection.
* Validates GPU output against RDKit (per-pair, with a tolerance) before
  timing and aborts on mismatch.
* Honors --rdkit_max_seconds for the RDKit comparison and --no-rdkit /
  --no-nvmolkit for mode selection.
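
The time-budgeted RDKit comparison in the last bullet can be sketched as a serial loop that checks the clock after each item. This is a minimal illustration, not the code in this PR: `run_with_budget`, `work`, and `max_seconds` are hypothetical names, and the actual `bench_rdkit_batch` may be structured differently. Returning the completed count together with the elapsed time is what lets a caller normalize a partial run.

```python
import time
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def run_with_budget(
    items: Sequence[T],
    work: Callable[[T], object],
    max_seconds: float,
) -> Tuple[float, int]:
    """Run work(item) serially and stop once max_seconds has elapsed.

    Hypothetical helper for illustration only.
    Returns (elapsed_seconds, items_completed).
    """
    start = time.perf_counter()
    done = 0
    for item in items:
        work(item)
        done += 1
        # Check the budget only after finishing an item, so every counted
        # item was fully processed.
        if time.perf_counter() - start >= max_seconds:
            break
    return time.perf_counter() - start, done
```

Because the budget is checked after each item, at least one item is always processed, and the run can overshoot the budget by the cost of one item.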
Cosmetic only; matches the formatting style used in adjacent benches.
scal444 requested a review from evasnow1992 on June 1, 2026, 13:14
samples = [bench_rdkit_batch(payloads, rdkit_max_seconds) for _ in range(3)]
samples.sort(key=lambda pair: pair[0] / max(pair[1], 1))
rdkit_time_s, rdkit_done = samples[len(samples) // 2]
rdkit_std_s = statistics.stdev([elapsed for elapsed, _ in samples]) if len(samples) > 1 else 0.0
Collaborator


There may be an inconsistency between the median and standard-deviation calculations. The median sample (rdkit_time_s) is selected by sorting on pair[0] normalized by pair[1], so that value appears to represent per-molecule time, but rdkit_std_s is still computed directly from the raw elapsed times. Is that intended?
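
One way to put both statistics in the same units is sketched below, assuming each sample is an `(elapsed_seconds, molecules_completed)` pair as the snippet under review suggests. `summarize_samples` is a hypothetical helper, not part of the PR, and reporting per-molecule time is only one possible resolution.

```python
import statistics
from typing import List, Tuple


def summarize_samples(samples: List[Tuple[float, int]]) -> Tuple[float, float]:
    """Return (median, stdev) of per-molecule time in seconds.

    Hypothetical helper: both statistics are computed from the same
    normalized values, so their units agree.
    """
    # Normalize each run by the number of molecules it completed; guard
    # against a run that was cut off before finishing any molecule.
    per_mol = [elapsed / max(done, 1) for elapsed, done in samples]
    median_s = statistics.median(per_mol)
    # statistics.stdev needs at least two data points.
    std_s = statistics.stdev(per_mol) if len(per_mol) > 1 else 0.0
    return median_s, std_s
```

The alternative is to keep raw elapsed times for both values, which is only meaningful when every run completes the same number of molecules.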
