Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers#729
Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers#729cvolkcvolk wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Code Review: MVP-1 Per-Episode Sensitivity Recording + NPE Analyzer
This is a substantial and well-structured PR that lays solid data-layer and analysis-layer foundations for the sensitivity analysis workstream. The architecture is clean, with good separation of concerns between the eval-side writer, dataset loader, analyzer hierarchy, and plotting module.
✅ Strengths
Architecture & Design:
- The analyzer hierarchy (
BaseAnalyzer→PosteriorAnalyzer→NPE/MNPE+EmpiricalAnalyzer) withmake_analyzerfactory is well-designed and extensible - Clean separation between inference (
analyzer.py) and rendering (plotting.py) follows good software engineering principles - The decision to log the full
arena_env_argsper episode (rather than just--factor_keys) is a smart choice that eliminates the drift bug class where CLI flags andfactors.yamlcould disagree - Deferred imports for pxr-touching modules follow the established
policy_runner.py:107pattern correctly
Implementation Quality:
- Comprehensive docstrings with clear "what this module does in plain English" sections
- The JSONL wire format with
arena_env_args+outcomesis honest about provenance and leaves room for future extension - Continuous-first column ordering in theta matches sbi's MNPE convention
- The empirical rug in plots provides valuable sanity-checking for humans
🔍 Issues & Suggestions
1. Silent Failure: Empty arena_env_args_dict on Non-Dict Jobs
File: job_manager.py (lines 46-47, 110)
self.arena_env_args_dict = arena_env_args_dict if arena_env_args_dict is not None else {}The from_dict class method passes data["arena_env_args"] which could be None or missing entirely. If a job config omits arena_env_args, this silently produces an empty dict, and the episode writer will log rows with empty arena_env_args blocks. Consider adding validation:
arena_env_args_dict=data.get("arena_env_args") or {}Or raise explicitly if the key is required for sensitivity workflows.
2. Potential Division by Zero in Empirical Analyzer
File: analyzer.py (line 309)
if total_rate > 0:
return empirical_rates / total_rate
return np.full(num_choices, 1.0 / num_choices)This correctly handles the total_rate == 0 case, but there's a subtle issue: if category_mask.any() is False for all categories (i.e., the data has no rows matching any declared choice), empirical_rates stays all zeros, and you return a uniform distribution. This is reasonable but worth documenting in the docstring.
3. Missing Validation for factor.choices Consistency
File: dataset.py (lines 247-253)
The categorical encoding loop validates that each row's value is in choice_to_code, but doesn't validate that factor.choices is exhaustive relative to the data. If the data contains a category not declared in factors.yaml, you get a clear assertion error — good. But the reverse (declared choices not appearing in data) silently passes and could lead to misleading posteriors. Consider logging a warning:
observed_choices = set(row["arena_env_args"][factor.name] for row in self.rows)
missing_in_data = set(factor.choices) - observed_choices
if missing_in_data:
print(f"[WARN] Factor {factor.name!r}: declared choices {missing_in_data} never appear in data")4. Hardcoded Threshold >= 0.5 for Binary Outcome Classification
File: analyzer.py (line 302), plotting.py (lines 76, 95, 172)
The threshold >= 0.5 for classifying outcomes as "success" is hardcoded in multiple places. This works for binary 0/1 outcomes but could be misleading for metrics that return continuous values in [0, 1]. Consider:
- Documenting this assumption clearly in the module docstrings
- Or exposing it as a configurable parameter in the CLI
5. Range Inference Could Produce Degenerate Priors
File: dataset.py (lines 188-200)
When factor.range is inferred from observed data:
factor.range = [[min(observed_values), max(observed_values)]]If all rows have the same value (e.g., a misconfigured sweep with only one intensity), low == high and BoxUniform will produce a degenerate prior. Consider adding a guard:
if min(observed_values) == max(observed_values):
print(f"[WARN] Factor {factor.name!r} has constant value {min(observed_values)} in data; "
"posterior will be degenerate. Check your sweep config.")6. Thread Safety for JSONL Appends
File: episode_writer.py (line 76)
with open(output_path, "a", encoding="utf-8") as jsonl_output:If multiple eval jobs run in parallel writing to the same --episode_summary file, appends could interleave. While Isaac Sim eval typically runs serially, the code doesn't document this assumption. Consider either:
- Adding a note that parallel execution requires separate output files
- Or using file locking (e.g.,
fcntl.flock)
7. NotImplementedError for Vector Factors Lacks Migration Path
Files: dataset.py (lines 193, 226)
The NotImplementedError for dim > 1 factors is good for reserving the extension point, but the error message could be more actionable:
raise NotImplementedError(
f"Vector continuous factors (dim > 1) are reserved for future implementation; "
f"factor {factor.name!r} has dim={factor.dim}. For now, decompose into separate 1D factors."
)📝 Minor Suggestions
-
analyze_sensitivity.py: The script exits silently on success. Consider printing the figure path at the end:print(f"[INFO] Figure saved → {args.figure_path}") -
Type hints:
episode_writer.pyusesTYPE_CHECKINGforJobbut the function signature lacks the actual type hint forenv. Consider addingenv: Anyor the appropriate gym wrapper type. -
YAML schema validation: The
FactorSchema.from_yamlvalidates structure but doesn't catch typos in factor types. Achoices: [...]on atype: continuousfactor would be silently ignored.
🧪 Test Coverage Assessment
The PR includes synthetic data generators (synthetic_data_continuous.py, synthetic_data_categorical.py) that serve as smoke tests. These are valuable for manual validation but:
- No automated unit tests are included for
SensitivityDataset,make_analyzer, or the plotting module - Edge cases (empty JSONL, single-row data, all-same-outcome data) aren't covered
- Consider adding pytest tests in a future PR, particularly for:
FactorSchema.from_yamlwith malformed YAMLSensitivityDatasetwith missing/extra keysEmpiricalAnalyzer.categorical_marginal_probswith zero counts
🎯 Verdict
Overall: This is a solid MVP-1 implementation with good architecture and thoughtful design. The concerns raised are mostly defensive hardening and documentation improvements rather than blockers.
Recommendation: Approve with suggestions. The issues identified are quality-of-life improvements that could be addressed in follow-up PRs.
Files Reviewed
isaaclab_arena/analysis/__init__.pyisaaclab_arena/analysis/sensitivity/__init__.pyisaaclab_arena/analysis/sensitivity/analyzer.pyisaaclab_arena/analysis/sensitivity/dataset.pyisaaclab_arena/analysis/sensitivity/episode_writer.pyisaaclab_arena/analysis/sensitivity/plotting.pyisaaclab_arena/analysis/sensitivity/synthetic_data_categorical.pyisaaclab_arena/analysis/sensitivity/synthetic_data_continuous.pyisaaclab_arena/evaluation/eval_runner.pyisaaclab_arena/evaluation/eval_runner_cli.pyisaaclab_arena/evaluation/job_manager.pyisaaclab_arena/scripts/analyze_sensitivity.pyisaaclab_arena_environments/eval_jobs_configs/*.yamlisaaclab_arena_environments/eval_jobs_configs/*.jsonsetup.py
Update (38baa56): Reviewed incremental changes since 671e0c4. The new commits contain only style/formatting improvements:
plotting.py: Import reordering (movednumpybeforepathlib), removed string quotes from type hints (using directBaseAnalyzer/FactorSpectypes), minor line-length adjustmentssynthetic_data_categorical.py: Code reformatting for line length compliance, dict literal style changespick_up_object_sweep_factors.yaml: Added standard Apache-2.0 copyright header
✅ No functional changes. All prior review comments remain applicable. The formatting changes are consistent and improve readability.
Update (74585f1): Reviewed incremental changes since 38baa56.
🆕 New Feature: CameraObsVideoRecorder Integration
The new commits add --camera_video flag integration (eval_runner.py lines 191-199):
if args_cli.camera_video:
print(f"[INFO] Recording per-camera videos for job '{job.name}' -> {job_video_folder}")
env = CameraObsVideoRecorder(
env,
video_folder=job_video_folder,
step_trigger=lambda step: step == 0,
video_length=video_length,
)This properly imports and uses CameraObsVideoRecorder from isaaclab_arena.evaluation.camera_video (imported at line 17). The implementation:
- ✅ Correctly reuses
job_video_folder(computed when either--videoor--camera_videois set) - ✅ Applies the wrapper independently of RecordVideo (can use either or both)
- ✅ Follows the same
step_trigger/video_lengthpattern as RecordVideo
🔍 Observations
-
Import added correctly:
CameraObsVideoRecorderimport at line 17 is clean and follows the existing import style. -
Video folder logic refactored: The
job_video_foldercomputation was hoisted out of theif args_cli.video:block to be shared by both video recording methods - good refactoring. -
RTX ambient light fix in
isaaclab_arena_manager_based_env.py: The change disables RTX's built-in ambient light:
render=RenderCfg(
carb_settings={
"/rtx/sceneDb/ambientLightIntensity": 0.0,
"/rtx/sceneDb/ambientLightColor": [0.0, 0.0, 0.0],
},
),This ensures USD light prims fully control scene illumination, fixing a confound for vision-policy evals. The inline comment explains the rationale well.
✅ No issues with the incremental changes. The camera video integration follows existing patterns and the ambient light fix is a legitimate rendering correction for sensitivity analysis validity.
The carb setting /rtx/sceneDb/ambientLightIntensity defaults to 1.0 with color [0.1, 0.1, 0.1] in the IsaacLab kit experience. This adds a hidden ~10%-gray ambient floor to every rendered frame, completely independent of any USD UsdLuxLight prim — so policy cameras see lit scenes even when all USD lights are zeroed. The leak silently confounds vision-policy evals: a "low intensity" sweep on the dome light shows no change in scene brightness until the dome exceeds the renderer's ambient term. Override the carb on every Arena eval via RenderCfg.carb_settings, so USD lights are the sole source of illumination. Envs that previously relied on the ambient floor should add an explicit DomeLight asset. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Opt-in writer (--factor_keys + --episode_summary) records the values of the listed arena_env_args keys plus per-episode outcomes (from registered task metrics) to a JSONL during eval_runner. Existing behavior is unchanged when either flag is absent. - Job.arena_env_args_dict preserves the original dict form alongside the existing CLI-args list so the writer can look up factor values by name without re-parsing the args. - The writer's import is deferred inside the per-job try block, matching the policy_runner.py:107 pattern for pxr-touching modules (the writer pulls isaaclab_arena.metrics.metrics, which loads pxr at module top). - Hand-authored factors.yaml + jobs configs check in alongside; --factor_keys on the CLI must match the factors.yaml the analyzer consumes (the analyzer validates the pairing on load). Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Reads paired factors.yaml + episode_summary.jsonl into the (theta, x, prior, factor_columns) quadruple sbi consumes, trains NPE on a chosen outcome, plots the 1D posterior marginal for a continuous factor. CLI driver at isaaclab_arena/scripts/analyze_sensitivity.py. - MVP-1 scope: one continuous 1D factor; categorical and vector (dim > 1) branches raise NotImplementedError so the extension point is reserved. - Runtime [WARN] when fitting on a binary outcome surfaces sbi's 1D-Gaussian fallback caveat: the recovered peak reflects the empirical mean of successful theta values, not the true mode of the success curve. - synthetic_data.py generates a paired JSONL + factors.yaml from a known competence band, letting the analyzer smoke-test end-to-end without sim. - sbi added to DEV_DEPS so the docker dev install picks it up on rebuild. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Builds on the MVP-1 foundation (#729) with categorical factor support, a cleaner analyzer/plotting separation, and a tighter eval-side / analysis-side contract that drops a class of drift bugs. - Analyzer hierarchy (BaseAnalyzer / PosteriorAnalyzer / NPEAnalyzer / MNPEAnalyzer / EmpiricalAnalyzer) dispatched via make_analyzer. Pure- categorical schemas use empirical frequency analysis directly (under uniform prior the posterior is exactly the normalized per-category success rate); sbi MNPE 0.26 also requires at least one continuous theta column, which this dispatch handles automatically. - Split inference (analyzer.py) from rendering (plotting.py). Analyzers expose continuous_marginal_density and categorical_marginal_probs queries; plotting consumes them via plot_marginal. New plot types become additive (free functions) without touching the analyzer. - Drop --factor_keys CLI flag on eval_runner. The writer now logs the full arena_env_args per episode; the analyzer-side factors.yaml picks what to study. Removes the drift bug class where --factor_keys and factors.yaml could disagree. - Rename JSONL field "factors" -> "arena_env_args". Honest about provenance and leaves room for sibling source fields (future "sim_state" for MVP-3 reset-time snapshots, "variation_draws" for the variation system) without further wire-format changes. - Add synthetic_data_categorical.py smoke-test generator and rename synthetic_data.py -> synthetic_data_continuous.py for symmetry. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
CI's pre-commit was stricter than the local run had been: - isort + pyupgrade reformat tweaks in plotting.py and synthetic_data_categorical.py. - insert-license added the standard Apache-2.0 header to pick_up_object_sweep_factors.yaml. Local pre-commit run now passes cleanly across all files. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Mirrors the existing flag on policy_runner: when set, eval_runner wraps each job's env with CameraObsVideoRecorder so one mp4 per camera in obs["camera_obs"] is written into <video_dir>/<job_name>/. Independent of --video (which records the kit viewport). Useful for diagnosing what a policy actually sees during an eval sweep. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
38baa56 to
74585f1
Compare
Summary
sensitivity analysis for Arena evals: per-episode JSONL writer in eval_runner + offline analyzer that infers posteriors over input factors.
Detailed description
eval_runner(--episode_summary). Logs the fullarena_env_argsand the task's registered outcomes per recorded demo as a JSONL row. Existingeval workflows are unchanged when the flag is absent.
make_analyzer: NPE for all-continuous, MNPE for mixed continuous+categorical, Empirical (frequency table) for pure-categorical schemas. Classhierarchy is
BaseAnalyzer->PosteriorAnalyzer->NPEAnalyzer/MNPEAnalyzer, withEmpiricalAnalyzerseparate.factors.yamlschema declares slice + factor types + outcome names; the analyzer is the sole consumer. The eval-sidearena_env_argsfield in the JSONL isprovenance-honest, leaving room for sibling fields (
sim_state,variation_draws) as MVP-3 and the variation system land.analyzer.py) and rendering (plotting.py) are decoupled: analyzers exposecontinuous_marginal_density/categorical_marginal_probsqueries;plot_marginaldispatches by factor type and renders matplotlib.
synthetic_data_continuous.pyandsynthetic_data_categorical.pygenerate paired JSONL + factors.yaml for offline smoke testing; both validated end-to-end.