Pipeline workflow killed by GitHub Actions timeout — modal run --detach does not survive runner cancellation #636

@baogorek

Problem

The Run Pipeline GitHub Actions workflow (.github/workflows/pipeline.yaml) fails every time it runs on main. The job is cancelled about 10 minutes into execution, when the workflow's timeout-minutes: 10 limit fires, and the Modal pipeline is killed along with it.

Root cause

The workflow uses modal run --detach to launch the pipeline:

timeout-minutes: 10
...
run: |
  modal run --detach modal_app/pipeline.py::main $ARGS

--detach does not work the way we assumed. Specifically:

  • --detach does not make the CLI return immediately after launching the app
  • The CLI stays connected and streams logs to stdout (we observed Phase 1, Phase 2, and Phase 3 output in the GitHub Actions log)
  • --detach only means "if the CLI process dies or disconnects, keep the Modal app running"
  • However, when GitHub Actions fires its timeout, it sends an active cancellation signal to the step — this is different from the CLI simply disconnecting
  • The Modal CLI interprets this cancellation and stops the remote app, despite --detach
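
To illustrate the distinction in the last two bullets, here is a minimal Python sketch of a CLI-like loop that reacts to a delivered signal. It is not Modal's actual implementation, which we have not inspected. FakeRemoteApp and run_cli_session are hypothetical names, and using SIGINT as the cancellation signal is an assumption about how the runner cancels a step.

```python
import signal


class FakeRemoteApp:
    """Hypothetical stand-in for a remote app; not a Modal class."""

    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False


def run_cli_session(app, cancel_at_step=None, steps=3):
    """Stream fake log lines; stop the app if a SIGINT arrives mid-stream."""
    # Ensure SIGINT surfaces as KeyboardInterrupt (must run in the main thread).
    signal.signal(signal.SIGINT, signal.default_int_handler)
    try:
        for step in range(steps):
            if step == cancel_at_step:
                # Assumption: the CI runner cancels a step by signalling it.
                signal.raise_signal(signal.SIGINT)
            print(f"=== Phase {step + 1} ===")
        return "completed"
    except KeyboardInterrupt:
        # A catchable signal lets the client run cleanup, including
        # (in this sketch) an explicit stop of the remote app.
        app.stop()
        return "cancelled"


if __name__ == "__main__":
    app = FakeRemoteApp()
    outcome = run_cli_session(app, cancel_at_step=1)
    print(outcome, "app still running:", app.running)
```

A process that is killed outright never reaches the except branch, which is the case --detach is meant to cover. A catchable signal does reach it, so the client gets a chance to tear the app down.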

The pipeline takes 10+ hours end-to-end (dataset build, calibration, regional H5s, national H5s, staging, diagnostics). GitHub-hosted runners cap each job at 6 hours, so even raising timeout-minutes to the maximum cannot solve this.

What we observed

From the GitHub Actions log of run #367 (workflow run 23567657966):

[Step 1/5] Building datasets...
=== Phase 1: Building independent datasets (parallel) ===
...
=== Phase 2: Building CPS and PUF (parallel) ===
...
=== Phase 3: Building extended CPS ===
Starting policyengine_us_data/datasets/cps/extended_cps.py...
Target variable 'estate_income_would_be_qualified' has constant value True...
Error: The operation was canceled.

The runner timed out at 10m14s. The Modal app was actively running and doing real work, but was killed by the cancellation.

Why other approaches don't work

| Approach | Problem |
| --- | --- |
| timeout-minutes: 360 (6h max) | Pipeline takes 10+ hours |
| nohup modal run --detach ... & | Race condition: if the runner VM is torn down before the image build completes, the Modal app never launches. Also untested whether --detach actually survives process cleanup in GHA. |
| modal run --detach (current) | CLI streams logs and dies on timeout cancellation |

Fix: modal deploy + .spawn()

PR #635 implements this fix. The approach:

  1. modal deploy modal_app/pipeline.py — Registers the app as a persistent deployment on Modal. This builds the image, uploads code, and makes all @app.function() functions callable remotely. The deploy step takes 1-3 minutes.

  2. run_pipeline.spawn(branch='main', ...): Invokes the deployed run_pipeline function in fire-and-forget fashion. .spawn() submits the job to Modal and returns immediately with a FunctionCall handle, so the GitHub Actions step exits within seconds of the call.

  3. The pipeline runs on Modal infrastructure for as long as it needs, up to run_pipeline's timeout=86400 seconds (24 hours), completely untethered from the GitHub runner.

- name: Deploy and launch pipeline on Modal
  run: |
    modal deploy modal_app/pipeline.py

    python -c "
    import modal
    run_pipeline = modal.Function.from_name('policyengine-us-data-pipeline', 'run_pipeline')
    run_pipeline.spawn(branch='main', ...)
    print('Pipeline spawned. Monitor on the Modal dashboard.')
    "
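
The inline python -c call above could also live in a small script, which makes it easier to log the call ID for later monitoring. This is a sketch under assumptions, not what PR #635 does: the launch and main helpers are hypothetical, and only branch='main' is passed because the remaining arguments are elided in this issue. The lookup is injected so the helper can be exercised without Modal credentials.

```python
APP_NAME = "policyengine-us-data-pipeline"  # app name as used in this issue
FUNCTION_NAME = "run_pipeline"


def launch(lookup, **kwargs):
    """Spawn the deployed function and return its call ID without waiting.

    `lookup` is any callable taking (app_name, function_name) and returning
    an object with a .spawn() method, e.g. modal.Function.from_name.
    """
    fn = lookup(APP_NAME, FUNCTION_NAME)
    call = fn.spawn(**kwargs)  # returns a handle immediately
    return call.object_id


def main():
    # Imported lazily so launch() can be tested without Modal installed.
    import modal

    # Only `branch` is shown; other arguments are elided in the issue.
    call_id = launch(modal.Function.from_name, branch="main")
    print(f"Pipeline spawned as {call_id}. Monitor on the Modal dashboard.")
```

The workflow step would call main() after modal deploy. The printed ID can later be passed to modal.FunctionCall.from_id to re-attach to the call from another process.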

Key Modal concepts for future reference

  • modal run creates an ephemeral app — it exists only while the CLI is connected. --detach makes it survive a disconnect but not an active cancellation.
  • modal deploy creates a persistent app — it stays registered on Modal until explicitly stopped. Functions can be invoked from anywhere via .spawn(), .remote(), or web endpoints.
  • .spawn() is Modal's fire-and-forget mechanism — it submits a function call and returns a FunctionCall handle immediately without waiting for completion.
  • .remote() calls a function and blocks until it returns, which is essentially how modal run behaves while the CLI stays attached.
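
The .spawn() versus .remote() distinction can be sketched with the standard library alone. This is an analogy, not Modal code: spawn_like and remote_like are hypothetical helpers built on concurrent.futures, where submit() plays the role of .spawn() and waiting on the result plays the role of .remote().

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_job(seconds):
    """Stand-in for a long-running pipeline function."""
    time.sleep(seconds)
    return "done"


def spawn_like(executor, seconds):
    """Analogue of .spawn(): submit the job and return a handle at once."""
    return executor.submit(slow_job, seconds)


def remote_like(executor, seconds):
    """Analogue of .remote(): submit the job and block until it returns."""
    return executor.submit(slow_job, seconds).result()


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        start = time.monotonic()
        handle = spawn_like(pool, 0.2)
        print(f"spawn-like returned after {time.monotonic() - start:.3f}s")
        print("result fetched later:", handle.result())

        start = time.monotonic()
        print("remote-like result:", remote_like(pool, 0.2))
        print(f"remote-like returned after {time.monotonic() - start:.3f}s")
```

The analogy breaks in one important way: a thread pool dies with its process, whereas a call spawned on a deployed Modal app keeps running after the caller exits, which is the property the fix relies on.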
