Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers by cvolkcvolk · Pull Request #729 · isaac-sim/IsaacLab-Arena

cvolkcvolk · 2026-05-27T13:07:33Z

Summary

sensitivity analysis for Arena evals: per-episode JSONL writer in eval_runner + offline analyzer that infers posteriors over input factors.

Detailed description

Opt-in per-episode writer in eval_runner (--episode_summary). Logs the full arena_env_args and the task's registered outcomes per recorded demo as a JSONL row. Existing
eval workflows are unchanged when the flag is absent.
Three-way analyzer dispatch via make_analyzer: NPE for all-continuous, MNPE for mixed continuous+categorical, Empirical (frequency table) for pure-categorical schemas. Class
hierarchy is BaseAnalyzer -> PosteriorAnalyzer -> NPEAnalyzer / MNPEAnalyzer, with EmpiricalAnalyzer separate.
Hand-authored factors.yaml schema declares slice + factor types + outcome names; the analyzer is the sole consumer. The eval-side arena_env_args field in the JSONL is
provenance-honest, leaving room for sibling fields (sim_state, variation_draws) as MVP-3 and the variation system land.
Inference (analyzer.py) and rendering (plotting.py) are decoupled: analyzers expose continuous_marginal_density / categorical_marginal_probs queries; plot_marginal
dispatches by factor type and renders matplotlib.
synthetic_data_continuous.py and synthetic_data_categorical.py generate paired JSONL + factors.yaml for offline smoke testing; both validated end-to-end.

isaaclab-review-bot

Code Review: MVP-1 Per-Episode Sensitivity Recording + NPE Analyzer

This is a substantial and well-structured PR that lays solid data-layer and analysis-layer foundations for the sensitivity analysis workstream. The architecture is clean, with good separation of concerns between the eval-side writer, dataset loader, analyzer hierarchy, and plotting module.

✅ Strengths

Architecture & Design:

The analyzer hierarchy (BaseAnalyzer → PosteriorAnalyzer → NPE/MNPE + EmpiricalAnalyzer) with make_analyzer factory is well-designed and extensible
Clean separation between inference (analyzer.py) and rendering (plotting.py) follows good software engineering principles
The decision to log the full arena_env_args per episode (rather than just --factor_keys) is a smart choice that eliminates the drift bug class where CLI flags and factors.yaml could disagree
Deferred imports for pxr-touching modules follow the established policy_runner.py:107 pattern correctly

Implementation Quality:

Comprehensive docstrings with clear "what this module does in plain English" sections
The JSONL wire format with arena_env_args + outcomes is honest about provenance and leaves room for future extension
Continuous-first column ordering in theta matches sbi's MNPE convention
The empirical rug in plots provides valuable sanity-checking for humans

🔍 Issues & Suggestions

1. Silent Failure: Empty `arena_env_args_dict` on Non-Dict Jobs

File: job_manager.py (lines 46-47, 110)

self.arena_env_args_dict = arena_env_args_dict if arena_env_args_dict is not None else {}

The from_dict class method passes data["arena_env_args"] which could be None or missing entirely. If a job config omits arena_env_args, this silently produces an empty dict, and the episode writer will log rows with empty arena_env_args blocks. Consider adding validation:

arena_env_args_dict=data.get("arena_env_args") or {}

Or raise explicitly if the key is required for sensitivity workflows.

2. Potential Division by Zero in Empirical Analyzer

File: analyzer.py (line 309)

if total_rate > 0:
    return empirical_rates / total_rate
return np.full(num_choices, 1.0 / num_choices)

This correctly handles the total_rate == 0 case, but there's a subtle issue: if category_mask.any() is False for all categories (i.e., the data has no rows matching any declared choice), empirical_rates stays all zeros, and you return a uniform distribution. This is reasonable but worth documenting in the docstring.

3. Missing Validation for `factor.choices` Consistency

File: dataset.py (lines 247-253)

The categorical encoding loop validates that each row's value is in choice_to_code, but doesn't validate that factor.choices is exhaustive relative to the data. If the data contains a category not declared in factors.yaml, you get a clear assertion error — good. But the reverse (declared choices not appearing in data) silently passes and could lead to misleading posteriors. Consider logging a warning:

observed_choices = set(row["arena_env_args"][factor.name] for row in self.rows)
missing_in_data = set(factor.choices) - observed_choices
if missing_in_data:
    print(f"[WARN] Factor {factor.name!r}: declared choices {missing_in_data} never appear in data")

4. Hardcoded Threshold `>= 0.5` for Binary Outcome Classification

File: analyzer.py (line 302), plotting.py (lines 76, 95, 172)

The threshold >= 0.5 for classifying outcomes as "success" is hardcoded in multiple places. This works for binary 0/1 outcomes but could be misleading for metrics that return continuous values in [0, 1]. Consider:

Documenting this assumption clearly in the module docstrings
Or exposing it as a configurable parameter in the CLI

5. Range Inference Could Produce Degenerate Priors

File: dataset.py (lines 188-200)

When factor.range is inferred from observed data:

factor.range = [[min(observed_values), max(observed_values)]]

If all rows have the same value (e.g., a misconfigured sweep with only one intensity), low == high and BoxUniform will produce a degenerate prior. Consider adding a guard:

if min(observed_values) == max(observed_values):
    print(f"[WARN] Factor {factor.name!r} has constant value {min(observed_values)} in data; "
          "posterior will be degenerate. Check your sweep config.")

6. Thread Safety for JSONL Appends

File: episode_writer.py (line 76)

with open(output_path, "a", encoding="utf-8") as jsonl_output:

If multiple eval jobs run in parallel writing to the same --episode_summary file, appends could interleave. While Isaac Sim eval typically runs serially, the code doesn't document this assumption. Consider either:

Adding a note that parallel execution requires separate output files
Or using file locking (e.g., fcntl.flock)

7. `NotImplementedError` for Vector Factors Lacks Migration Path

Files: dataset.py (lines 193, 226)

The NotImplementedError for dim > 1 factors is good for reserving the extension point, but the error message could be more actionable:

raise NotImplementedError(
    f"Vector continuous factors (dim > 1) are reserved for future implementation; "
    f"factor {factor.name!r} has dim={factor.dim}. For now, decompose into separate 1D factors."
)

📝 Minor Suggestions

analyze_sensitivity.py: The script exits silently on success. Consider printing the figure path at the end: print(f"[INFO] Figure saved → {args.figure_path}")
Type hints: episode_writer.py uses TYPE_CHECKING for Job but the function signature lacks the actual type hint for env. Consider adding env: Any or the appropriate gym wrapper type.
YAML schema validation: The FactorSchema.from_yaml validates structure but doesn't catch typos in factor types. A choices: [...] on a type: continuous factor would be silently ignored.

🧪 Test Coverage Assessment

The PR includes synthetic data generators (synthetic_data_continuous.py, synthetic_data_categorical.py) that serve as smoke tests. These are valuable for manual validation but:

No automated unit tests are included for SensitivityDataset, make_analyzer, or the plotting module
Edge cases (empty JSONL, single-row data, all-same-outcome data) aren't covered
Consider adding pytest tests in a future PR, particularly for:
- FactorSchema.from_yaml with malformed YAML
- SensitivityDataset with missing/extra keys
- EmpiricalAnalyzer.categorical_marginal_probs with zero counts

🎯 Verdict

Overall: This is a solid MVP-1 implementation with good architecture and thoughtful design. The concerns raised are mostly defensive hardening and documentation improvements rather than blockers.

Recommendation: Approve with suggestions. The issues identified are quality-of-life improvements that could be addressed in follow-up PRs.

Files Reviewed

isaaclab_arena/analysis/__init__.py
isaaclab_arena/analysis/sensitivity/__init__.py
isaaclab_arena/analysis/sensitivity/analyzer.py
isaaclab_arena/analysis/sensitivity/dataset.py
isaaclab_arena/analysis/sensitivity/episode_writer.py
isaaclab_arena/analysis/sensitivity/plotting.py
isaaclab_arena/analysis/sensitivity/synthetic_data_categorical.py
isaaclab_arena/analysis/sensitivity/synthetic_data_continuous.py
isaaclab_arena/evaluation/eval_runner.py
isaaclab_arena/evaluation/eval_runner_cli.py
isaaclab_arena/evaluation/job_manager.py
isaaclab_arena/scripts/analyze_sensitivity.py
isaaclab_arena_environments/eval_jobs_configs/*.yaml
isaaclab_arena_environments/eval_jobs_configs/*.json
setup.py

Update (38baa56): Reviewed incremental changes since 671e0c4. The new commits contain only style/formatting improvements:

plotting.py: Import reordering (moved numpy before pathlib), removed string quotes from type hints (using direct BaseAnalyzer/FactorSpec types), minor line-length adjustments
synthetic_data_categorical.py: Code reformatting for line length compliance, dict literal style changes
pick_up_object_sweep_factors.yaml: Added standard Apache-2.0 copyright header

✅ No functional changes. All prior review comments remain applicable. The formatting changes are consistent and improve readability.

Update (74585f1): Reviewed incremental changes since 38baa56.

🆕 New Feature: CameraObsVideoRecorder Integration

The new commits add --camera_video flag integration (eval_runner.py lines 191-199):

if args_cli.camera_video:
    print(f"[INFO] Recording per-camera videos for job '{job.name}' -> {job_video_folder}")
    env = CameraObsVideoRecorder(
        env,
        video_folder=job_video_folder,
        step_trigger=lambda step: step == 0,
        video_length=video_length,
    )

This properly imports and uses CameraObsVideoRecorder from isaaclab_arena.evaluation.camera_video (imported at line 17). The implementation:

✅ Correctly reuses job_video_folder (computed when either --video or --camera_video is set)
✅ Applies the wrapper independently of RecordVideo (can use either or both)
✅ Follows the same step_trigger / video_length pattern as RecordVideo

🔍 Observations

Import added correctly: CameraObsVideoRecorder import at line 17 is clean and follows the existing import style.
Video folder logic refactored: The job_video_folder computation was hoisted out of the if args_cli.video: block to be shared by both video recording methods - good refactoring.
RTX ambient light fix in isaaclab_arena_manager_based_env.py: The change disables RTX's built-in ambient light:

render=RenderCfg(
    carb_settings={
        "/rtx/sceneDb/ambientLightIntensity": 0.0,
        "/rtx/sceneDb/ambientLightColor": [0.0, 0.0, 0.0],
    },
),

This ensures USD light prims fully control scene illumination, fixing a confound for vision-policy evals. The inline comment explains the rationale well.

✅ No issues with the incremental changes. The camera video integration follows existing patterns and the ambient light fix is a legitimate rendering correction for sensitivity analysis validity.

The carb setting /rtx/sceneDb/ambientLightIntensity defaults to 1.0 with color [0.1, 0.1, 0.1] in the IsaacLab kit experience. This adds a hidden ~10%-gray ambient floor to every rendered frame, completely independent of any USD UsdLuxLight prim — so policy cameras see lit scenes even when all USD lights are zeroed. The leak silently confounds vision-policy evals: a "low intensity" sweep on the dome light shows no change in scene brightness until the dome exceeds the renderer's ambient term. Override the carb on every Arena eval via RenderCfg.carb_settings, so USD lights are the sole source of illumination. Envs that previously relied on the ambient floor should add an explicit DomeLight asset. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Opt-in writer (--factor_keys + --episode_summary) records the values of the listed arena_env_args keys plus per-episode outcomes (from registered task metrics) to a JSONL during eval_runner. Existing behavior is unchanged when either flag is absent. - Job.arena_env_args_dict preserves the original dict form alongside the existing CLI-args list so the writer can look up factor values by name without re-parsing the args. - The writer's import is deferred inside the per-job try block, matching the policy_runner.py:107 pattern for pxr-touching modules (the writer pulls isaaclab_arena.metrics.metrics, which loads pxr at module top). - Hand-authored factors.yaml + jobs configs check in alongside; --factor_keys on the CLI must match the factors.yaml the analyzer consumes (the analyzer validates the pairing on load). Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Reads paired factors.yaml + episode_summary.jsonl into the (theta, x, prior, factor_columns) quadruple sbi consumes, trains NPE on a chosen outcome, plots the 1D posterior marginal for a continuous factor. CLI driver at isaaclab_arena/scripts/analyze_sensitivity.py. - MVP-1 scope: one continuous 1D factor; categorical and vector (dim > 1) branches raise NotImplementedError so the extension point is reserved. - Runtime [WARN] when fitting on a binary outcome surfaces sbi's 1D-Gaussian fallback caveat: the recovered peak reflects the empirical mean of successful theta values, not the true mode of the success curve. - synthetic_data.py generates a paired JSONL + factors.yaml from a known competence band, letting the analyzer smoke-test end-to-end without sim. - sbi added to DEV_DEPS so the docker dev install picks it up on rebuild. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Builds on the MVP-1 foundation (#729) with categorical factor support, a cleaner analyzer/plotting separation, and a tighter eval-side / analysis-side contract that drops a class of drift bugs. - Analyzer hierarchy (BaseAnalyzer / PosteriorAnalyzer / NPEAnalyzer / MNPEAnalyzer / EmpiricalAnalyzer) dispatched via make_analyzer. Pure- categorical schemas use empirical frequency analysis directly (under uniform prior the posterior is exactly the normalized per-category success rate); sbi MNPE 0.26 also requires at least one continuous theta column, which this dispatch handles automatically. - Split inference (analyzer.py) from rendering (plotting.py). Analyzers expose continuous_marginal_density and categorical_marginal_probs queries; plotting consumes them via plot_marginal. New plot types become additive (free functions) without touching the analyzer. - Drop --factor_keys CLI flag on eval_runner. The writer now logs the full arena_env_args per episode; the analyzer-side factors.yaml picks what to study. Removes the drift bug class where --factor_keys and factors.yaml could disagree. - Rename JSONL field "factors" -> "arena_env_args". Honest about provenance and leaves room for sibling source fields (future "sim_state" for MVP-3 reset-time snapshots, "variation_draws" for the variation system) without further wire-format changes. - Add synthetic_data_categorical.py smoke-test generator and rename synthetic_data.py -> synthetic_data_continuous.py for symmetry. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

CI's pre-commit was stricter than the local run had been: - isort + pyupgrade reformat tweaks in plotting.py and synthetic_data_categorical.py. - insert-license added the standard Apache-2.0 header to pick_up_object_sweep_factors.yaml. Local pre-commit run now passes cleanly across all files. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Mirrors the existing flag on policy_runner: when set, eval_runner wraps each job's env with CameraObsVideoRecorder so one mp4 per camera in obs["camera_obs"] is written into <video_dir>/<job_name>/. Independent of --video (which records the kit viewport). Useful for diagnosing what a policy actually sees during an eval sweep. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

isaaclab-review-bot Bot reviewed May 28, 2026

View reviewed changes

cvolkcvolk changed the title ~~MVP-1: per-episode sensitivity recording + NPE analyzer~~ Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers May 28, 2026

cvolkcvolk added 6 commits May 28, 2026 16:53

cvolkcvolk force-pushed the cvolk/feature/sensitivity_analysis_mvp1 branch from 38baa56 to 74585f1 Compare May 28, 2026 15:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers#729

Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers#729
cvolkcvolk wants to merge 6 commits into
mainfrom
cvolk/feature/sensitivity_analysis_mvp1

cvolkcvolk commented May 27, 2026 •

edited

Loading

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cvolkcvolk commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Detailed description

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Code Review: MVP-1 Per-Episode Sensitivity Recording + NPE Analyzer

✅ Strengths

🔍 Issues & Suggestions

1. Silent Failure: Empty arena_env_args_dict on Non-Dict Jobs

2. Potential Division by Zero in Empirical Analyzer

3. Missing Validation for factor.choices Consistency

4. Hardcoded Threshold >= 0.5 for Binary Outcome Classification

5. Range Inference Could Produce Degenerate Priors

6. Thread Safety for JSONL Appends

7. NotImplementedError for Vector Factors Lacks Migration Path

📝 Minor Suggestions

🧪 Test Coverage Assessment

🎯 Verdict

🆕 New Feature: CameraObsVideoRecorder Integration

🔍 Observations

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cvolkcvolk commented May 27, 2026 •

edited

Loading

isaaclab-review-bot Bot left a comment •

edited

Loading

1. Silent Failure: Empty `arena_env_args_dict` on Non-Dict Jobs

3. Missing Validation for `factor.choices` Consistency

4. Hardcoded Threshold `>= 0.5` for Binary Outcome Classification

7. `NotImplementedError` for Vector Factors Lacks Migration Path