GPU-accelerated offline atmospheric transport model in Julia. Solves the tracer continuity equation on latitude-longitude (ERA5) and cubed-sphere (GEOS-FP C720, GEOS-IT C180) grids with mass-conserving advection, Tiedtke/RAS convection, and boundary-layer diffusion. Oceananigans-inspired modular architecture; KernelAbstractions.jl for GPU portability.
When investigating ANY transport bug, numerical instability, or physics mismatch:
- Read the reference code FIRST. The TM5 F90 source (`deps/tm5/base/src/`) and the old working Julia code (git history, legacy scripts) are the ground truth. Read them LINE BY LINE before forming any hypothesis.
- Diff old vs new. If something used to work and now doesn't, the answer is in the diff. Find the EXACT semantic change, not an approximate summary.
- Red team / blue team for every significant change. Before implementing any fix to advection, preprocessing, or flux scaling:
  - Launch a PROPOSE agent to design the change with TM5 F90 evidence
  - Launch an EVALUATE agent to challenge it, find risks, and verify against the reference
  - Only implement after both agents agree (or the disagreement is explicitly resolved)
  - NEVER skip this for "obvious" fixes: the ERA5 LL NaN bug (2026-04-03) was a one-line change (m_dev reset inside vs outside the substep loop) that was missed for 12+ hours because this protocol was not followed.
- One change at a time. Make one change, test, verify. Never stack multiple untested hypotheses. Each change must have evidence (a code reference, math derivation, or diagnostic output) BEFORE implementation.
- Never guess at scale factors. If a flux needs scaling, derive the factor from first principles with the exact TM5 formula, verify the units, and confirm with a diagnostic. The 4× scaling attempt (2026-04-02) wasted hours because the math was done without reading how TM5 actually applies the fluxes.
```
julia --project=. scripts/run.jl config/runs/<config>.toml
```

All simulation parameters live in TOML configs under `config/runs/`. Preprocessing configs are in `config/preprocessing/`. See `docs/QUICKSTART.md` for the full reference.
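For orientation, a skeletal config using only keys that appear elsewhere in this guide (not a complete, working config; copy a real one from `config/runs/`):

```toml
# Skeletal sketch; the section/key names are quoted from the invariants
# later in this guide, but real configs in config/runs/ are authoritative.
[met_data]
mass_flux_dt = 450    # GEOS-FP/GEOS-IT only: dynamics timestep (invariant 3)

[diffusion]           # must be present, or surface emissions look
                      # untransported in column means (invariant 7)
```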
```
Architectures → Parameters → Communications
        ↓
Grids → Fields
        ↓
Advection → Convection → Diffusion → Chemistry
        ↓
TimeSteppers → Adjoint → Callbacks
        ↓
Regridding → Diagnostics → Sources
        ↓
IO → Visualization → Models
```

`src/AtmosTransport.jl` includes all modules in this strict order. Each module may import from earlier modules but never from later ones.
All physics and infrastructure are organized via abstract type hierarchies. Physics code dispatches on concrete types; there is no if/else on grid or scheme (see the sketch below the hierarchy).
```
AbstractArchitecture      → CPU, GPU
AbstractGrid{FT, Arch}    → LatitudeLongitudeGrid, CubedSphereGrid
AbstractAdvectionScheme   → SlopesAdvection, PPMAdvection{ORD}
AbstractConvection        → NoConvection, TiedtkeConvection, RASConvection
AbstractDiffusion         → NoDiffusion, BoundaryLayerDiffusion, PBLDiffusion, NonLocalPBLDiffusion
AbstractChemistry         → NoChemistry, RadioactiveDecay, CompositeChemistry
AbstractMetDriver{FT}     → ERA5MetDriver, PreprocessedLatLonMetDriver, GEOSFPCubedSphereMetDriver
AbstractSurfaceFlux       → SurfaceFlux, TimeVaryingSurfaceFlux, EdgarSource, ...
AbstractBufferingStrategy → SingleBuffer, DoubleBuffer
AbstractOutputWriter      → BinaryOutputWriter, NetCDFOutputWriter
```
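As a minimal illustration of the dispatch-over-branching rule (toy stand-ins, not the real types or methods):

```julia
# Physics selects behavior by method dispatch on the concrete grid type,
# never by if/else on a grid flag.
abstract type AbstractGrid end
struct ToyLatLonGrid      <: AbstractGrid end
struct ToyCubedSphereGrid <: AbstractGrid end

halo_exchange!(f, ::ToyLatLonGrid)      = (f[:mode] = "periodic wrap"; f)
halo_exchange!(f, ::ToyCubedSphereGrid) = (f[:mode] = "rotated panel edges"; f)

# Generic code is written once against the abstract type:
step!(f, grid::AbstractGrid) = halo_exchange!(f, grid)

step!(Dict{Symbol,String}(), ToyCubedSphereGrid())   # → mode = "rotated panel edges"
```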
- TOML-driven: all parameters live in config files; no hardcoded paths in source.
- Canonical variables: physics code uses abstract names (e.g. `surface_pressure`); met-source TOMLs map native variable names to these canonicals.
- GPU via extension: `using CUDA` or `using Metal` before `using AtmosTransport` loads the appropriate extension (`AtmosTransportCUDAExt` or `AtmosTransportMetalExt`). `scripts/run.jl` auto-detects: Metal on macOS, CUDA on Linux/Windows.
- KernelAbstractions: all GPU kernels use `@kernel`/`@index` for CUDA/Metal/CPU portability (see the sketch below).
- Mass conservation first: advection via mass fluxes (not winds); vertical flux from the continuity equation; Strang splitting X→Y→Z→Z→Y→X.
- TransportPolicy: single source of truth for transport configuration (vertical method, pressure basis, mass balance). See `src/Models/transport_policy.jl`.
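For illustration, the portable kernel pattern looks like this (a hypothetical kernel, not one from the source tree):

```julia
using KernelAbstractions

# One kernel definition runs on CPU, CUDA, and Metal backends.
@kernel function scale_tracer!(rm, factor)
    i, j, k = @index(Global, NTuple)
    @inbounds rm[i, j, k] *= factor
end

# The backend is derived from the array type, so the caller never
# branches on architecture.
function scale!(rm, factor)
    backend = get_backend(rm)
    scale_tracer!(backend)(rm, factor; ndrange = size(rm))
    KernelAbstractions.synchronize(backend)
    return rm
end
```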
Key directories:
- `src/`: all Julia source code, organized by module
- `scripts/`: download, preprocess, run, diagnostic, and visualization scripts
- `config/runs/`: simulation configs
- `config/preprocessing/`: preprocessor configs
- `config/met_sources/`: per-source variable mappings to canonical names
- `config/canonical_variables.toml`: the contract between met sources and physics code
- `test/`: test suite (15 files, 4570+ tests)
- `docs/`: Documenter.jl + Literate.jl documentation
- `ext/`: GPU extensions (CUDA, Metal)
download (`scripts/download_*.jl`) → preprocess (`scripts/preprocess_*.jl`, optional) → run

- ERA5: spectral VO/D/LNSP → `preprocess_spectral_massflux.jl` → NetCDF mass fluxes (default); gridpoint u/v/lnsp → `preprocess_mass_fluxes.jl` (stopgap, ~0.9% mass drift)
- GEOS-FP C720: NetCDF mass fluxes from WashU → optional binary staging via `preprocess_geosfp_cs.jl`
- GEOS-IT C180: same pipeline as GEOS-FP; data from the WashU archive
- `scheme = "slopes"` (default): van Leer slopes, 2nd-order
- `scheme = "ppm"` with `ppm_order ∈ {4,5,6,7}`: Putman & Lin (2007) PPM
- `linrood = true`: Lin-Rood cross-term advection (FV3's `fv_tp_2d` algorithm). Averages X-first and Y-first PPM from the original field. Eliminates directional splitting bias at CS panel boundaries. File: `src/Advection/cubed_sphere_fvtp2d.jl`.
- `vertical_remap = true`: replaces Strang-split Z-advection with an FV3-style conservative PPM remap (`map1_q2` + `ppm_profile`). Horizontal-only Lin-Rood transport on Lagrangian surfaces, then a single vertical remap per window. File: `src/Advection/vertical_remap.jl`.
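As an example, selecting the GCHP-matching path might look like this (the `[advection]` table name is an assumption; the keys are the ones listed above):

```toml
[advection]             # hypothetical table name; check a real config
scheme = "ppm"          # "slopes" (default) or "ppm"
ppm_order = 4           # ∈ {4, 5, 6, 7}, Putman & Lin (2007)
linrood = true          # FV3-style fv_tp_2d cross-term transport
vertical_remap = true   # Lagrangian surfaces + conservative PPM remap per window
```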
The vertical remap path is designed to match GCHP's offline_tracer_advection:
- Identical: Lin-Rood `fv_tp_2d` per level, dry basis, PPM remap.
- PE computation (fixed 2026-03-13): source PE from the air-mass cumsum (inside the remap kernel); target PE from `cumsum(next_delp × (1-qv))`. Both use a direct cumsum, NOT the hybrid formula (ak + bk × PS). The hybrid approach was tested but caused massive vertical pumping (uniform 400 ppm → 535 ppm at the surface, 44 ppm std after 2 days) because GEOS DELP deviates from the hybrid formula by 0.1-1% per level (up to 250 Pa). GCHP compensates with `calcScalingFactor`; we avoid the issue entirely by using the direct cumsum. See `fv_tracer2d.F90:988-1070`.
- Validated (2026-03-13): 48-window (2-day) uniform tracer test: surface std 2.0 ppm (vs 93 ppm with hybrid PE), mass drift 0.02% (vs 0.75%), no panel boundary artifacts (edge/interior ratio 0.93).
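A schematic of the direct-cumsum target-PE construction described above (hypothetical helper; argument order, `ptop` handling, and level orientation are assumptions to check against `src/Advection/vertical_remap.jl`):

```julia
# Target pressure edges from the next window's DELP on the dry basis, by
# direct cumsum. Deliberately NOT the hybrid formula pe = ak + bk*ps, which
# deviates from GEOS DELP by 0.1-1% per level and caused vertical pumping.
function target_pressure_edges(next_delp::AbstractVector, qv::AbstractVector, ptop::Real)
    nz = length(next_delp)
    pe = similar(next_delp, nz + 1)
    pe[1] = ptop
    for k in 1:nz
        pe[k + 1] = pe[k] + next_delp[k] * (1 - qv[k])   # dry pressure thickness
    end
    return pe
end
```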
Unified `_run_loop!` via multiple dispatch (replaced 4 duplicated methods, ~1940 lines). Files in `src/Models/`:
- `run_loop.jl`: single entry point `run!(model)` + `_run_loop!`
- `physics_phases.jl`: grid-dispatched phase functions (IO, compute, advection, output)
- `simulation_state.jl`: allocation factories for tracers, air mass, workspaces
- `io_scheduler.jl`: `IOScheduler{B}` abstracts single/double buffering
- `mass_diagnostics.jl`: `MassDiagnostics` + the global mass fixer
- `run_helpers.jl`: physics helpers, advection dispatch, emission wrappers
- `transport_policy.jl`: the `TransportPolicy` type for transport configuration
These are hard-won correctness constraints. Violating any of them causes silently wrong results.
| Symptom | Likely cause | Where to look |
|---|---|---|
| Transport 8x too slow | Missing `mass_flux_dt = 450` | Invariant 3 below |
| Extreme CFL, NaN | Wrong vertical level ordering | Invariant 2 |
| ~10% mass loss per step | In-place Z-advection (no double buffer) | Invariant 4 |
| Panel boundary waves (CS) | Wrong flux rotation or missing `linrood` | Invariant 1 |
| Surface emissions invisible in column mean | Missing `[diffusion]` section | Invariant 7 |
| 5x slower CPU loops | Wrong loop nesting order | Invariant 8 |
| CO2 surface 535 ppm from 400 ppm uniform | Hybrid PE in vertical remap | Use direct cumsum PE |
| Polar mass drainage / Y nloop max_nloop hit | Stale binary from old preprocessor OR `mass_fixer=false` | Invariants 10, 11 |
| STALE BINARY WARNING at startup | Binary written by an older preprocessor than the current source | Invariant 10 |
| cm-continuity check FAILED at startup | Binary's cm doesn't match its am/bm divergence; regenerate | Invariant 10 |
| Tracer mass drifts ~10⁻⁴/day with uniform IC (ERA5 LL) | Mass fix disabled in preprocessor; raw ERA5 ⟨ps⟩ drift not absorbed | Invariant 12 |
1. CS panel convention: GEOS-FP file panels (nf=1..6) differ from gnomonic numbering. The code uses the FILE convention. `panel_connectivity.jl` and `cgrid_to_staggered_panels` handle rotated boundary fluxes (MFXC↔MFYC at panel edges).
2. Vertical level ordering: GEOS-IT is bottom-to-top (k=1 = surface); GEOS-FP is top-to-bottom (k=1 = TOA). Auto-detected in the reader; wrong ordering → extreme CFL, NaN.
3. GEOS mass_flux_dt (GEOS-IT AND GEOS-FP): MFXC is accumulated over the dynamics timestep (~450 s), NOT the 1-hour met interval. Config: `mass_flux_dt = 450` in `[met_data]`. Without this, transport is 8x too slow. Verified for both GEOS-IT C180 and GEOS-FP C720.
4. Z-advection double buffering: the kernel reads from `ws.rm_buf`/`ws.m_buf` (copies of the originals) and writes to `rm`/`m`. An in-place update breaks flux telescoping (~10% mass loss).
5. Tiedtke convection: explicit upwind, conditionally stable. Adaptive subcycling in `_max_conv_cfl_cs` keeps CFL < 0.9. No positivity clamp needed.
6. Spectral SHT: FFTW's `bfft` is the unnormalized backward FFT, i.e. direct Fourier synthesis. Do NOT divide by N. Negative frequencies must be filled for real-valued fields (see the first sketch after this list).
7. Diffusion for column-mean output: without vertical mixing, surface emissions stay in the bottom level. The column mean dilutes them by ~72x, which looks like zero transport. Always enable `[diffusion]` for realistic visualization.
8. Column-major loop ordering: Julia is column-major (like Fortran). In nested loops over arrays, the leftmost array index must be the innermost loop: for a 3D array `A[i, j, k]`, loop as `for k, for j, for i` (i innermost). The wrong order causes catastrophic cache misses (5x slowdown); see the second sketch after this list.
9. Moist vs dry mass fluxes in GEOS met data: FV3's dynamical core operates on dry air mass, but the exported variables have DIFFERENT moisture conventions:

   | Variable | Basis | Notes |
   |---|---|---|
   | MFXC, MFYC | DRY | Horizontal mass fluxes from FV3 dynamics |
   | DELP | MOIST (total) | Exported as total pressure thickness |
   | CMFMC | MOIST (total) | Convective mass flux |
   | DTRAIN | MOIST (total) | Detraining mass flux |
   | QV | — | Specific humidity; convert wet→dry: `x_dry = x_wet × (1 - qv)` |

   GCHP path (`gchp=true`): runs on the MOIST basis (matches GCHP `Use_Total_Air_Pressure > 0`). MFXC is converted to moist: `MFX = MFXC / (1-QV)` (GCHP `ctmEnv:1029`). Tracers are converted dry→wet before advection: `q_wet = q × (1-QV)` (`AdvCore:1070`). dp = moist DELP. The back-conversion wet→dry is implicit via `rm / m_dry`. This is required because ak/bk in the vertical-remap target PE expect moist surface pressure.

   Non-GCHP paths: transport and diffusion run on the DRY basis; convection on the MOIST basis.
10. ERA5 LL binaries MUST be made by `preprocess_spectral_v4_binary.jl` or `preprocess_era5_daily.jl`; both call `recompute_cm_from_divergence!` to rebuild cm from continuity using the merged am/bm. The OBSOLETE `convert_merged_massflux_to_binary.jl` PICKS native cm at merged interfaces and smears the residual via `correct_cm_residual!`, producing a cm that does NOT satisfy local continuity (89.7% of cells violated in the broken Dec 2021 binary). Symptoms: polar Y nloop hits max_nloop, polar cells go negative, cumulative drainage. Foolproof check (added 2026-04-06): `MassFluxBinaryReader` runs a `_verify_cm_continuity` pass on window 1 at driver construction time and errors LOUDLY if the binary's cm is inconsistent; disable with `ENV["ATMOSTR_NO_CM_CHECK"]="1"`. Provenance fields (script_path, script_mtime, git_commit, git_dirty) in v4 headers also trigger a stale-binary warning if the source has moved on; disable with `ENV["ATMOSTR_NO_STALE_CHECK"]="1"`.
11. `mass_fixer = true` is required for the ERA5 LL F64 debug test to run to completion. Pole-adjacent stratospheric cells (j ∈ {1, 2, Ny-1, Ny}) have |bm|/m ≈ 0.30 per face from the spectral preprocessor. With `mass_fixer = false`, cumulative drainage over a window of 4 substeps × 6 sweeps (X-Y-Z-Z-Y-X) exceeds the cell mass at those cells; the local nloop refinement hits its max and the run aborts (regression test: `config/runs/era5_f64_debug_moist_v4_nofix.toml`). This is a LOCAL cell-stability issue and is independent of the global mass-drift fix in invariant 12 below; both are needed for ERA5 LL. OPEN: whether this matches TM5 r1112 behavior is NOT verified. A previous CLAUDE.md draft asserted "TM5 r1112 effectively does mass-fixing via m = (at + bt × ps) × area / g each substep"; that was a hypothesis, not a checked claim, so do not propagate it as fact. Verifying it requires reading TM5 r1112's actual m-evolution path (advect_tools.F90 / dynam0 / Setup_MassFlow) and possibly running TM5 on the same data. Until that's done, the honest position is "we need mass_fixer=true to avoid polar drainage; whether TM5 needs it too is unknown".
12. Global mean ps is pinned in the v4 spectral preprocessor (TM5 `Match('area-aver')` equivalent). ERA5's 4DVar analysis is not mass-conserving; raw ⟨ps⟩ drifts ~10⁻⁴/day, which translates directly into a Σm drift in the binary because `Σm = const + (A_Earth/g)·⟨sp⟩_area` (with `Σ_k b_k = 1`). To eliminate this, `preprocess_spectral_v4_binary.jl` applies a uniform additive shift to the gridded `sp` immediately after `sp = exp(LNSP_grid)` so that ⟨sp⟩_area corresponds to a fixed dry-air-mass target (Trenberth & Smith 2005: M_dry = 5.1352e18 kg → ⟨ps_dry⟩ ≈ 98726 Pa, converted to total via `target_total = target_dry / (1 - ⟨qv⟩_global ≈ 0.00247)`). The implementation mirrors TM5 cy3-4dvar `meteo.F90:1361-1374`, which calls `Match('area-aver', sp_region0 = p_global = 98500 Pa)` via `grid_type_ll.F90:1147-1155`. Verified 2026-04-07: in a 24h F64 LL test with non-uniform ps drift in raw ERA5, the fix drops the Σm drift from -8.31e-04% to +5.88e-09% (a ~140,000× improvement, at the F32 quantization noise floor). Per-window offsets are logged in the binary header (`ps_offsets_pa_per_window`). Disable via `[mass_fix] enable=false` in the preprocessor TOML for diagnostic purposes. What this does NOT fix: local polar drainage (invariant 11). The shift is globally uniform and only changes local m by `b_k · Δps` (~0.4% at the surface, less aloft); it does not change |bm|/m enough at the troubled polar cells to remove the need for `mass_fixer = true` at runtime.
1. Define the type in `src/Advection/`:

   ```julia
   struct MyScheme <: AbstractAdvectionScheme
       param1::Float64
   end
   ```

2. Implement the interface (see `src/Advection/abstract_advection.jl` for the contract):

   ```julia
   advect!(tracers, vel, grid, scheme::MyScheme, Δt) = ...
   adjoint_advect!(adj_tracers, vel, grid, scheme::MyScheme, Δt) = ...
   ```

   For operator splitting, also implement `advect_x!`, `advect_y!`, `advect_z!`.

3. Wire into config (`src/IO/configuration.jl`): add a branch in `_build_advection_scheme(config)` to parse your TOML params.

4. Export from `src/Advection/Advection.jl`.

5. Add tests in `test/test_advection.jl`:
   - Uniform field invariance
   - Mass conservation
   - Adjoint identity (dot-product test; see the sketch below)
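The dot-product test in step 5, as a generic sketch (the `apply!`/`adjoint_apply!` stand-ins are hypothetical; wire in `advect!`/`adjoint_advect!` with a fixed grid and velocity):

```julia
using Test, Random, LinearAlgebra

# Adjoint identity for a linear operator A: ⟨Ax, y⟩ == ⟨x, Aᵀy⟩ to machine precision.
function dot_product_test(apply!, adjoint_apply!, n; rtol = 1e-12)
    rng = MersenneTwister(42)
    x, y = rand(rng, n), rand(rng, n)
    Ax  = apply!(copy(x))          # forward operator acting on x
    Aᵀy = adjoint_apply!(copy(y))  # adjoint operator acting on y
    @test isapprox(dot(Ax, y), dot(x, Aᵀy); rtol)
end

# Toy usage with a self-adjoint scaling, which trivially passes:
dot_product_test(v -> 2.0 .* v, v -> 2.0 .* v, 100)
```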
1. Subtype `AbstractMetDriver{FT}` in `src/IO/`:

   ```julia
   struct MyDriver{FT} <: AbstractMetDriver{FT}
       ...
   end
   ```

2. Implement the required methods (see `src/IO/abstract_met_driver.jl`):
   - `total_windows(driver)` → Int
   - `window_dt(driver)` → seconds per window
   - `steps_per_window(driver)` → sub-steps per window
   - `load_met_window!(cpu_buf, driver, grid, win_index)` → fill the buffer with am, bm, cm, delp

3. Add a TOML variable mapping in `config/met_sources/your_source.toml`.

4. Register in `src/IO/configuration.jl` driver construction.
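A minimal driver satisfying those four methods, as a standalone sketch (the abstract type is stubbed locally; the field layout of `cpu_buf` is an assumption to check against `src/IO/abstract_met_driver.jl`):

```julia
abstract type AbstractMetDriver{FT} end   # local stub; the real one is in src/IO/

# Hypothetical driver serving a fixed number of 3-hour windows.
struct ToyDriver{FT} <: AbstractMetDriver{FT}
    nwindows::Int
end

total_windows(d::ToyDriver)    = d.nwindows
window_dt(d::ToyDriver)        = 3 * 3600.0   # seconds per window
steps_per_window(d::ToyDriver) = 4            # sub-steps per window

function load_met_window!(cpu_buf, d::ToyDriver, grid, win_index)
    # Fill am, bm, cm, delp for this window; zeroed here as a placeholder.
    for fld in (:am, :bm, :cm, :delp)
        fill!(getproperty(cpu_buf, fld), 0)
    end
    return cpu_buf
end
```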
Follow the pattern used by existing operators (convection, diffusion, chemistry):
- Define abstract + concrete types with a keyword constructor and validation
- Implement forward + adjoint methods dispatching on your type
- Add GPU kernels using `@kernel`/`@index` from KernelAbstractions
- Wire into `OperatorSplittingTimeStepper` if needed
- Add tests for mass conservation and adjoint identity

See `src/Diffusion/Diffusion.jl` for a clean example with 4 implementations; a skeletal sketch follows.
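A skeleton in that style (all names besides `AbstractChemistry` are hypothetical; the real interface should be read from the existing operator modules):

```julia
abstract type AbstractChemistry end   # local stub; the real hierarchy is in src/

# Hypothetical first-order decay operator with keyword constructor + validation.
struct LinearDecay <: AbstractChemistry
    rate::Float64   # decay rate, 1/s
    function LinearDecay(; rate)
        rate >= 0 || throw(ArgumentError("rate must be non-negative"))
        new(rate)
    end
end

# Forward: rm ← rm·exp(-rate·Δt). Exact exponential decay, unconditionally stable.
apply_operator!(rm, c::LinearDecay, Δt) = (rm .*= exp(-c.rate * Δt); rm)

# The operator is linear and diagonal, so its adjoint is the same scaling.
adjoint_operator!(adj_rm, c::LinearDecay, Δt) = (adj_rm .*= exp(-c.rate * Δt); adj_rm)
```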
- Create matched configs with identical physics settings
- Use `TransportPolicy` to ensure the same transport configuration
- Run both: `julia --project=. scripts/run.jl config/runs/comparison/<era5>.toml`
- Convert output: `julia --project=. scripts/postprocessing/convert_output_to_netcdf.jl`
- Compare: use `scripts/visualization/animate_comparison.jl` or custom analysis
- GPU vs CPU: GPU kernels via KernelAbstractions; the same code runs on both. All array allocation dispatches on `array_type(arch)` → `Array` (CPU) or `CuArray` (GPU).
- IO dominates for NetCDF: a preprocessed binary takes < 1 s/win vs 15+ s/win for on-the-fly NetCDF.
- Double buffering: overlaps IO with GPU compute. Use `buffering = "double"` in the config.
- Column-major loops: see invariant 8. Always verify loop nesting matches the memory layout.
- Avoid GC pressure: large intermediate allocations trigger GC (was 31 s/win, fixed to 0.9 s/win by reading directly into output panels with a single buffer).
- Kernel launch overhead: fuse small sequential kernels where memory traffic dominates.
- Coalesced access: ensure the fastest-varying thread dimension matches the innermost array index.
```
julia --project=. -e 'using Pkg; Pkg.test()'
```

15 test files, 4570+ tests. Key test categories:
- Per-module physics tests (advection, diffusion, convection)
- Mass conservation (total tracer mass preserved)
- Adjoint identity (dot-product test: machine precision for linear operators)
- TM5 reference validation
- Integration tests (full run loop)
See `docs/TRANSPORT_COMPARISON.md` for a detailed comparison of AtmosTransport vs TM5 vs GEOS-Chem/GCHP algorithms, including the phased implementation blueprint for unified ERA5/GEOS transport. Parts may be obsolete by now.

Follow the data guide in `docs/DATA_LAYOUT.md`.