Skip to content

Geotransolver 2d 3d#1676

Open
coreyjadams wants to merge 26 commits into
NVIDIA:mainfrom
coreyjadams:geotransolver-2d-3d
Open

Geotransolver 2d 3d#1676
coreyjadams wants to merge 26 commits into
NVIDIA:mainfrom
coreyjadams:geotransolver-2d-3d

Conversation

@coreyjadams
Copy link
Copy Markdown
Collaborator

@coreyjadams coreyjadams commented May 27, 2026

PhysicsNeMo Pull Request

Reopening this Pull Request.

This refactors GeoTransolver, Flare, and some components of transolver to be enabled fro 2D and 3D cases. The goal is to make these more suitable for structured datasets, and enable domain parallelism in these cases.

I also enabled them in the Darcy transolver model, just so we have an example for users to test these.

In order to enable ShardTensor for GeoTransolver and Flare, and move them out of experimental, we should get this or a similar refactor in.

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coreyjadams coreyjadams requested a review from ktangsali May 27, 2026 15:49
@coreyjadams
Copy link
Copy Markdown
Collaborator Author

As part of this PR, we will need to do model checkpoint surgery on our existing chekpoints. Tagging @ktangsali so we can be sure to validate checkpoints against this.

@coreyjadams
Copy link
Copy Markdown
Collaborator Author

/ok to test 931d139

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 27, 2026

Greptile Summary

This PR extends GeoTransolver, FLARE, and related components to support structured 2D and 3D grids (in addition to the existing unstructured mesh path), and wires the new variants into the Darcy example. The refactoring also extracts shared helpers (_project_input, _compute_slices_from_projections, _flare_self_attention) to reduce duplication across attention backends, and consolidates GALE_FA into gale.py.

  • New structured-grid support: GALEStructuredMesh2D/3D, StructuredContextProjector, and a structured_shape parameter on GeoTransolver enable Conv2d/Conv3d slice projections with inputs accepted as flat (B,N,C) or spatial (B,H,W,C) / (B,H,W,D,C), with output reshaped to match.
  • Darcy example refactored: model instantiation switched to hydra.utils.instantiate, Muon optimizer support added, separate train/val TensorBoard writers, and JSONL metrics logging added; however, the dataset paths in config_fix.yaml were replaced with a user-specific internal cluster path that will not work for anyone outside of that environment.
  • Tensor layout unified: all attention code migrated from (B,H,N,D) to (B,N,H,D) token-first layout with temperature shape updated accordingly; the change appears consistent across all affected modules.

Important Files Changed

Filename Overview
examples/cfd/darcy_transolver/config_fix.yaml Config refactored to use Hydra defaults for model selection; dataset paths replaced with user-specific internal cluster paths that will fail for all other users.
examples/cfd/darcy_transolver/train_transolver_darcy_fix.py Training script significantly extended: model instantiation via Hydra, Muon optimizer support, separate TensorBoard writers, JSONL metrics logging. metrics_file not protected by try/finally.
physicsnemo/experimental/models/geotransolver/gale.py GALE_FA moved from its own module into gale.py; GALEStructuredMesh2D/3D added via mixin pattern; shared helpers extracted. concrete_dropout silently ignored for structured variants.
physicsnemo/experimental/models/geotransolver/context_projector.py ContextProjector refactored to use _SliceToContextMixin; StructuredContextProjector added for Conv2d/Conv3d geometry encoding; tensor layout changed from (B,H,N,D) to (B,N,H,D) consistently.
physicsnemo/nn/module/physics_attention.py _project_input and _compute_slices_from_projections extracted as free functions; temperature shape updated to (1,1,H,1) for new (B,N,H,S) layout; project_input_onto_slices simplified in all concrete subclasses.
physicsnemo/experimental/models/geotransolver/geotransolver.py structured_shape parameter added; _flatten_for_structured handles 2D/3D spatial or flat inputs; output unflattened to match input layout when structured.
physicsnemo/experimental/nn/flare_attention.py _flare_self_attention extracted as a reusable free function; FLARE.forward simplified to call it.
test/models/geotransolver/test_geotransolver.py Four new tests added covering structured 2D/3D forward pass, global context with structured grids, and rejection of incompatible options.

Comments Outside Diff (1)

  1. examples/cfd/darcy_transolver/train_transolver_darcy_fix.py, line 493 (link)

    P2 metrics_file not closed on exception

    metrics_file is opened with open(metrics_path, "a") and only closed in a guarded block at the very end of darcy_trainer. Any exception raised during training (e.g., a CUDA OOM or a shape mismatch) will skip the closing block, leaving the file handle open until the process exits. Wrapping the training loop in a try/finally — or using a context manager — would ensure the file is always flushed and closed.

Reviews (1): Last reviewed commit: "Merge branch 'NVIDIA:main' into geotrans..." | Re-trigger Greptile

Comment on lines +38 to +39
train_path: //lustre/fsw/portfolios/coreai/users/coreya/datasets/darcy_fix/example_data/piececonst_r421_N1024_smooth1.npz
test_path: //lustre/fsw/portfolios/coreai/users/coreya/datasets/darcy_fix/example_data/piececonst_r421_N1024_smooth2.npz
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Hardcoded internal cluster paths

train_path and test_path are set to paths on NVIDIA's internal Lustre filesystem (//lustre/fsw/portfolios/coreai/users/coreya/...). Anyone cloning the repo and running this example will immediately hit a file-not-found error. These should be replaced with placeholder paths such as /path/to/piececonst_r421_N1024_smooth1.npz — matching the documented README — so that users know they need to point these at their own downloaded copies of the dataset.

dir: .

output_dir: ./output/
run_id: ${hydra:runtime.choices.model}-muon_${precision}_r${resolution}_b${data.batch_size}_s${model.slice_num}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 run_id has the string "muon" hard-coded, so runs using optimizer.type: adamw will still produce a run_id that says muon. Consider interpolating from the config so the recorded name always matches the actual optimizer in use.

Suggested change
run_id: ${hydra:runtime.choices.model}-muon_${precision}_r${resolution}_b${data.batch_size}_s${model.slice_num}
run_id: ${hydra:runtime.choices.model}-${optimizer.type}_${precision}_r${resolution}_b${data.batch_size}_s${model.slice_num}

Comment on lines 821 to 882
else:
self.ln_1 = nn.LayerNorm(hidden_dim)

# Attention layer
dim_head = hidden_dim // num_heads
# First match on attention backend, then on spatial shape
match attention_type:
case 'GALE':
self.Attn = GALE(
hidden_dim,
heads=num_heads,
dim_head=hidden_dim // num_heads,
dropout=dropout,
slice_num=slice_num,
use_te=use_te,
plus=plus,
context_dim=context_dim,
concrete_dropout=concrete_dropout,
state_mixing_mode=state_mixing_mode,
)
if spatial_shape is None:
self.Attn = GALE(
hidden_dim,
heads=num_heads,
dim_head=dim_head,
dropout=dropout,
slice_num=slice_num,
use_te=use_te,
plus=plus,
context_dim=context_dim,
concrete_dropout=concrete_dropout,
state_mixing_mode=state_mixing_mode,
)
elif len(spatial_shape) == 2:
self.Attn = GALEStructuredMesh2D(
hidden_dim,
spatial_shape=(int(spatial_shape[0]), int(spatial_shape[1])),
heads=num_heads,
dim_head=dim_head,
dropout=dropout,
slice_num=slice_num,
use_te=use_te,
plus=plus,
context_dim=context_dim,
state_mixing_mode=state_mixing_mode,
)
elif len(spatial_shape) == 3:
self.Attn = GALEStructuredMesh3D(
hidden_dim,
spatial_shape=(
int(spatial_shape[0]),
int(spatial_shape[1]),
int(spatial_shape[2]),
),
heads=num_heads,
dim_head=dim_head,
dropout=dropout,
slice_num=slice_num,
use_te=use_te,
plus=plus,
context_dim=context_dim,
state_mixing_mode=state_mixing_mode,
)
else:
raise ValueError(
f"spatial_shape must be None, length-2, or length-3; got {spatial_shape!r}"
)
case 'GALE_FA':
self.Attn = GALE_FA(
hidden_dim,
heads=num_heads,
dim_head=hidden_dim // num_heads,
dim_head=dim_head,
dropout=dropout,
n_global_queries=slice_num,
use_te=use_te,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 concrete_dropout silently ignored for structured GALE variants

When spatial_shape is not None, GALE_block selects GALEStructuredMesh2D or GALEStructuredMesh3D, neither of which accepts a concrete_dropout argument. If a caller sets concrete_dropout=True on a structured GALE_block, the option is silently dropped — standard nn.Dropout is used instead and no warning is emitted. At minimum, a warnings.warn when concrete_dropout=True and spatial_shape is not None would make this limitation discoverable.

@coreyjadams
Copy link
Copy Markdown
Collaborator Author

/blossom-ci

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant