Skip to content

Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers#729

Draft
cvolkcvolk wants to merge 6 commits into
mainfrom
cvolk/feature/sensitivity_analysis_mvp1
Draft

Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers#729
cvolkcvolk wants to merge 6 commits into
mainfrom
cvolk/feature/sensitivity_analysis_mvp1

Conversation

@cvolkcvolk
Copy link
Copy Markdown
Collaborator

@cvolkcvolk cvolkcvolk commented May 27, 2026

Summary

sensitivity analysis for Arena evals: per-episode JSONL writer in eval_runner + offline analyzer that infers posteriors over input factors.

Detailed description

  • Opt-in per-episode writer in eval_runner (--episode_summary). Logs the full arena_env_args and the task's registered outcomes per recorded demo as a JSONL row. Existing
    eval workflows are unchanged when the flag is absent.
  • Three-way analyzer dispatch via make_analyzer: NPE for all-continuous, MNPE for mixed continuous+categorical, Empirical (frequency table) for pure-categorical schemas. Class
    hierarchy is BaseAnalyzer -> PosteriorAnalyzer -> NPEAnalyzer / MNPEAnalyzer, with EmpiricalAnalyzer separate.
  • Hand-authored factors.yaml schema declares slice + factor types + outcome names; the analyzer is the sole consumer. The eval-side arena_env_args field in the JSONL is
    provenance-honest, leaving room for sibling fields (sim_state, variation_draws) as MVP-3 and the variation system land.
  • Inference (analyzer.py) and rendering (plotting.py) are decoupled: analyzers expose continuous_marginal_density / categorical_marginal_probs queries; plot_marginal
    dispatches by factor type and renders matplotlib.
  • synthetic_data_continuous.py and synthetic_data_categorical.py generate paired JSONL + factors.yaml for offline smoke testing; both validated end-to-end.

Copy link
Copy Markdown
Contributor

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: MVP-1 Per-Episode Sensitivity Recording + NPE Analyzer

This is a substantial and well-structured PR that lays solid data-layer and analysis-layer foundations for the sensitivity analysis workstream. The architecture is clean, with good separation of concerns between the eval-side writer, dataset loader, analyzer hierarchy, and plotting module.


✅ Strengths

Architecture & Design:

  • The analyzer hierarchy (BaseAnalyzerPosteriorAnalyzerNPE/MNPE + EmpiricalAnalyzer) with make_analyzer factory is well-designed and extensible
  • Clean separation between inference (analyzer.py) and rendering (plotting.py) follows good software engineering principles
  • The decision to log the full arena_env_args per episode (rather than just --factor_keys) is a smart choice that eliminates the drift bug class where CLI flags and factors.yaml could disagree
  • Deferred imports for pxr-touching modules follow the established policy_runner.py:107 pattern correctly

Implementation Quality:

  • Comprehensive docstrings with clear "what this module does in plain English" sections
  • The JSONL wire format with arena_env_args + outcomes is honest about provenance and leaves room for future extension
  • Continuous-first column ordering in theta matches sbi's MNPE convention
  • The empirical rug in plots provides valuable sanity-checking for humans

🔍 Issues & Suggestions

1. Silent Failure: Empty arena_env_args_dict on Non-Dict Jobs

File: job_manager.py (lines 46-47, 110)

self.arena_env_args_dict = arena_env_args_dict if arena_env_args_dict is not None else {}

The from_dict class method passes data["arena_env_args"] which could be None or missing entirely. If a job config omits arena_env_args, this silently produces an empty dict, and the episode writer will log rows with empty arena_env_args blocks. Consider adding validation:

arena_env_args_dict=data.get("arena_env_args") or {}

Or raise explicitly if the key is required for sensitivity workflows.


2. Potential Division by Zero in Empirical Analyzer

File: analyzer.py (line 309)

if total_rate > 0:
    return empirical_rates / total_rate
return np.full(num_choices, 1.0 / num_choices)

This correctly handles the total_rate == 0 case, but there's a subtle issue: if category_mask.any() is False for all categories (i.e., the data has no rows matching any declared choice), empirical_rates stays all zeros, and you return a uniform distribution. This is reasonable but worth documenting in the docstring.


3. Missing Validation for factor.choices Consistency

File: dataset.py (lines 247-253)

The categorical encoding loop validates that each row's value is in choice_to_code, but doesn't validate that factor.choices is exhaustive relative to the data. If the data contains a category not declared in factors.yaml, you get a clear assertion error — good. But the reverse (declared choices not appearing in data) silently passes and could lead to misleading posteriors. Consider logging a warning:

observed_choices = set(row["arena_env_args"][factor.name] for row in self.rows)
missing_in_data = set(factor.choices) - observed_choices
if missing_in_data:
    print(f"[WARN] Factor {factor.name!r}: declared choices {missing_in_data} never appear in data")

4. Hardcoded Threshold >= 0.5 for Binary Outcome Classification

File: analyzer.py (line 302), plotting.py (lines 76, 95, 172)

The threshold >= 0.5 for classifying outcomes as "success" is hardcoded in multiple places. This works for binary 0/1 outcomes but could be misleading for metrics that return continuous values in [0, 1]. Consider:

  • Documenting this assumption clearly in the module docstrings
  • Or exposing it as a configurable parameter in the CLI

5. Range Inference Could Produce Degenerate Priors

File: dataset.py (lines 188-200)

When factor.range is inferred from observed data:

factor.range = [[min(observed_values), max(observed_values)]]

If all rows have the same value (e.g., a misconfigured sweep with only one intensity), low == high and BoxUniform will produce a degenerate prior. Consider adding a guard:

if min(observed_values) == max(observed_values):
    print(f"[WARN] Factor {factor.name!r} has constant value {min(observed_values)} in data; "
          "posterior will be degenerate. Check your sweep config.")

6. Thread Safety for JSONL Appends

File: episode_writer.py (line 76)

with open(output_path, "a", encoding="utf-8") as jsonl_output:

If multiple eval jobs run in parallel writing to the same --episode_summary file, appends could interleave. While Isaac Sim eval typically runs serially, the code doesn't document this assumption. Consider either:

  • Adding a note that parallel execution requires separate output files
  • Or using file locking (e.g., fcntl.flock)

7. NotImplementedError for Vector Factors Lacks Migration Path

Files: dataset.py (lines 193, 226)

The NotImplementedError for dim > 1 factors is good for reserving the extension point, but the error message could be more actionable:

raise NotImplementedError(
    f"Vector continuous factors (dim > 1) are reserved for future implementation; "
    f"factor {factor.name!r} has dim={factor.dim}. For now, decompose into separate 1D factors."
)

📝 Minor Suggestions

  1. analyze_sensitivity.py: The script exits silently on success. Consider printing the figure path at the end: print(f"[INFO] Figure saved → {args.figure_path}")

  2. Type hints: episode_writer.py uses TYPE_CHECKING for Job but the function signature lacks the actual type hint for env. Consider adding env: Any or the appropriate gym wrapper type.

  3. YAML schema validation: The FactorSchema.from_yaml validates structure but doesn't catch typos in factor types. A choices: [...] on a type: continuous factor would be silently ignored.


🧪 Test Coverage Assessment

The PR includes synthetic data generators (synthetic_data_continuous.py, synthetic_data_categorical.py) that serve as smoke tests. These are valuable for manual validation but:

  • No automated unit tests are included for SensitivityDataset, make_analyzer, or the plotting module
  • Edge cases (empty JSONL, single-row data, all-same-outcome data) aren't covered
  • Consider adding pytest tests in a future PR, particularly for:
    • FactorSchema.from_yaml with malformed YAML
    • SensitivityDataset with missing/extra keys
    • EmpiricalAnalyzer.categorical_marginal_probs with zero counts

🎯 Verdict

Overall: This is a solid MVP-1 implementation with good architecture and thoughtful design. The concerns raised are mostly defensive hardening and documentation improvements rather than blockers.

Recommendation: Approve with suggestions. The issues identified are quality-of-life improvements that could be addressed in follow-up PRs.


Files Reviewed
  • isaaclab_arena/analysis/__init__.py
  • isaaclab_arena/analysis/sensitivity/__init__.py
  • isaaclab_arena/analysis/sensitivity/analyzer.py
  • isaaclab_arena/analysis/sensitivity/dataset.py
  • isaaclab_arena/analysis/sensitivity/episode_writer.py
  • isaaclab_arena/analysis/sensitivity/plotting.py
  • isaaclab_arena/analysis/sensitivity/synthetic_data_categorical.py
  • isaaclab_arena/analysis/sensitivity/synthetic_data_continuous.py
  • isaaclab_arena/evaluation/eval_runner.py
  • isaaclab_arena/evaluation/eval_runner_cli.py
  • isaaclab_arena/evaluation/job_manager.py
  • isaaclab_arena/scripts/analyze_sensitivity.py
  • isaaclab_arena_environments/eval_jobs_configs/*.yaml
  • isaaclab_arena_environments/eval_jobs_configs/*.json
  • setup.py

Update (38baa56): Reviewed incremental changes since 671e0c4. The new commits contain only style/formatting improvements:

  • plotting.py: Import reordering (moved numpy before pathlib), removed string quotes from type hints (using direct BaseAnalyzer/FactorSpec types), minor line-length adjustments
  • synthetic_data_categorical.py: Code reformatting for line length compliance, dict literal style changes
  • pick_up_object_sweep_factors.yaml: Added standard Apache-2.0 copyright header

No functional changes. All prior review comments remain applicable. The formatting changes are consistent and improve readability.


Update (74585f1): Reviewed incremental changes since 38baa56.

🆕 New Feature: CameraObsVideoRecorder Integration

The new commits add --camera_video flag integration (eval_runner.py lines 191-199):

if args_cli.camera_video:
    print(f"[INFO] Recording per-camera videos for job '{job.name}' -> {job_video_folder}")
    env = CameraObsVideoRecorder(
        env,
        video_folder=job_video_folder,
        step_trigger=lambda step: step == 0,
        video_length=video_length,
    )

This properly imports and uses CameraObsVideoRecorder from isaaclab_arena.evaluation.camera_video (imported at line 17). The implementation:

  • ✅ Correctly reuses job_video_folder (computed when either --video or --camera_video is set)
  • ✅ Applies the wrapper independently of RecordVideo (can use either or both)
  • ✅ Follows the same step_trigger / video_length pattern as RecordVideo

🔍 Observations

  1. Import added correctly: CameraObsVideoRecorder import at line 17 is clean and follows the existing import style.

  2. Video folder logic refactored: The job_video_folder computation was hoisted out of the if args_cli.video: block to be shared by both video recording methods - good refactoring.

  3. RTX ambient light fix in isaaclab_arena_manager_based_env.py: The change disables RTX's built-in ambient light:

render=RenderCfg(
    carb_settings={
        "/rtx/sceneDb/ambientLightIntensity": 0.0,
        "/rtx/sceneDb/ambientLightColor": [0.0, 0.0, 0.0],
    },
),

This ensures USD light prims fully control scene illumination, fixing a confound for vision-policy evals. The inline comment explains the rationale well.

No issues with the incremental changes. The camera video integration follows existing patterns and the ambient light fix is a legitimate rendering correction for sensitivity analysis validity.

@cvolkcvolk cvolkcvolk changed the title MVP-1: per-episode sensitivity recording + NPE analyzer Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers May 28, 2026
The carb setting /rtx/sceneDb/ambientLightIntensity defaults to 1.0
with color [0.1, 0.1, 0.1] in the IsaacLab kit experience. This adds
a hidden ~10%-gray ambient floor to every rendered frame, completely
independent of any USD UsdLuxLight prim — so policy cameras see lit
scenes even when all USD lights are zeroed.

The leak silently confounds vision-policy evals: a "low intensity"
sweep on the dome light shows no change in scene brightness until the
dome exceeds the renderer's ambient term.

Override the carb on every Arena eval via RenderCfg.carb_settings, so
USD lights are the sole source of illumination. Envs that previously
relied on the ambient floor should add an explicit DomeLight asset.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Opt-in writer (--factor_keys + --episode_summary) records the values of
the listed arena_env_args keys plus per-episode outcomes (from registered
task metrics) to a JSONL during eval_runner. Existing behavior is unchanged
when either flag is absent.

- Job.arena_env_args_dict preserves the original dict form alongside the
  existing CLI-args list so the writer can look up factor values by name
  without re-parsing the args.
- The writer's import is deferred inside the per-job try block, matching
  the policy_runner.py:107 pattern for pxr-touching modules (the writer
  pulls isaaclab_arena.metrics.metrics, which loads pxr at module top).
- Hand-authored factors.yaml + jobs configs check in alongside; --factor_keys
  on the CLI must match the factors.yaml the analyzer consumes (the analyzer
  validates the pairing on load).

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Reads paired factors.yaml + episode_summary.jsonl into the (theta, x, prior,
factor_columns) quadruple sbi consumes, trains NPE on a chosen outcome,
plots the 1D posterior marginal for a continuous factor. CLI driver at
isaaclab_arena/scripts/analyze_sensitivity.py.

- MVP-1 scope: one continuous 1D factor; categorical and vector (dim > 1)
  branches raise NotImplementedError so the extension point is reserved.
- Runtime [WARN] when fitting on a binary outcome surfaces sbi's 1D-Gaussian
  fallback caveat: the recovered peak reflects the empirical mean of
  successful theta values, not the true mode of the success curve.
- synthetic_data.py generates a paired JSONL + factors.yaml from a known
  competence band, letting the analyzer smoke-test end-to-end without sim.
- sbi added to DEV_DEPS so the docker dev install picks it up on rebuild.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Builds on the MVP-1 foundation (#729) with categorical factor support, a
cleaner analyzer/plotting separation, and a tighter eval-side / analysis-side
contract that drops a class of drift bugs.

- Analyzer hierarchy (BaseAnalyzer / PosteriorAnalyzer / NPEAnalyzer /
  MNPEAnalyzer / EmpiricalAnalyzer) dispatched via make_analyzer. Pure-
  categorical schemas use empirical frequency analysis directly (under
  uniform prior the posterior is exactly the normalized per-category
  success rate); sbi MNPE 0.26 also requires at least one continuous theta
  column, which this dispatch handles automatically.
- Split inference (analyzer.py) from rendering (plotting.py). Analyzers
  expose continuous_marginal_density and categorical_marginal_probs
  queries; plotting consumes them via plot_marginal. New plot types
  become additive (free functions) without touching the analyzer.
- Drop --factor_keys CLI flag on eval_runner. The writer now logs the
  full arena_env_args per episode; the analyzer-side factors.yaml picks
  what to study. Removes the drift bug class where --factor_keys and
  factors.yaml could disagree.
- Rename JSONL field "factors" -> "arena_env_args". Honest about
  provenance and leaves room for sibling source fields (future "sim_state"
  for MVP-3 reset-time snapshots, "variation_draws" for the variation
  system) without further wire-format changes.
- Add synthetic_data_categorical.py smoke-test generator and rename
  synthetic_data.py -> synthetic_data_continuous.py for symmetry.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
CI's pre-commit was stricter than the local run had been:

- isort + pyupgrade reformat tweaks in plotting.py and
  synthetic_data_categorical.py.
- insert-license added the standard Apache-2.0 header to
  pick_up_object_sweep_factors.yaml.

Local pre-commit run now passes cleanly across all files.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Mirrors the existing flag on policy_runner: when set, eval_runner wraps
each job's env with CameraObsVideoRecorder so one mp4 per camera in
obs["camera_obs"] is written into <video_dir>/<job_name>/. Independent
of --video (which records the kit viewport). Useful for diagnosing what
a policy actually sees during an eval sweep.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
@cvolkcvolk cvolkcvolk force-pushed the cvolk/feature/sensitivity_analysis_mvp1 branch from 38baa56 to 74585f1 Compare May 28, 2026 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant