deleteWorkflow deadlocks the workflow API when a running execution of the same workflow holds child-row locks #5140

@carloea2

What happened?

WorkflowResource.deleteWorkflow opens a JDBC transaction and CASCADE-deletes the workflow row without first stopping any in-flight executions that target the same workflow. While the ComputingUnitWorker keeps writing to FK-child tables (workflow_view_count, workflow_executions, workflow_user_likes, …), the CASCADE check blocks on a row-level lock and never returns. Every subsequent createWorkflow / deleteWorkflow / view-count POST piles up behind it on the same lock.

From the user's perspective the Workflows page becomes fully unresponsive: uploads hang with no error, deletes hang, and the webpack-dev-server proxy eventually emits ECONNRESET then ECONNREFUSED. Recovery requires restarting the JVMs.

Problematic code at WorkflowResource.scala:631:

context.transaction { _ =>
  for (wid <- workflowIDs.wids) {
    if (workflowOfUserExists(wid, user.getUid)) {
      workflowDao.deleteById(wid)
    } else {
      throw new BadRequestException("The workflow does not exist.")
    }
  }
}

There is no active-execution check, no lock_timeout or statement_timeout, and no error path, so the request thread sits in executeQuery indefinitely.

Suggested fixes (in order of preference)

  1. Cancel running executions before deleting. In deleteWorkflow, look up active executions of the workflow via ExecutionResultService / WorkflowExecutionsResource and abort them before opening the delete transaction. Deleting a workflow should imply "stop everything that depends on it".
  2. Bound the delete transaction. Issue SET LOCAL lock_timeout = '10s'; SET LOCAL statement_timeout = '30s'; at the start of the transaction so that a hung child-table lock surfaces as a 5xx instead of freezing the entire workflow API.
  3. Independently, harden HubResource.postView. It blindly upserts into workflow_view_count for whatever wid the dashboard sends; if that wid was just deleted in another tab, the FK violation surfaces as a 500 and stale tabs keep retrying, which exacerbates the contention. An existence check (context.fetchExists(BaseEntityTable(entityType).table, idColumn.eq(entityID))) before the upsert turns those requests into a no-op that returns 0.
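
A minimal sketch of fix 2, written against plain JDBC rather than the jOOQ context that deleteWorkflow actually uses, so it stays self-contained. The object name BoundedDelete, the helper names, the Int type for wid, and the assumption that the workflow table lives in the texera_db schema (inferred from the log output) are all illustrative and not taken from the Texera code base. The two SQLSTATE values are the ones PostgreSQL reports for a lock timeout (55P03) and a statement timeout (57014).

```scala
import java.sql.{Connection, SQLException}

object BoundedDelete {
  // PostgreSQL SQLSTATEs: 55P03 = lock_not_available (lock_timeout fired),
  // 57014 = query_canceled (statement_timeout fired).
  val LockNotAvailable = "55P03"
  val QueryCanceled = "57014"

  /** SET LOCAL statements; they only apply to the enclosing transaction. */
  def timeoutStatements(lockTimeoutSec: Int, statementTimeoutSec: Int): Seq[String] =
    Seq(
      s"SET LOCAL lock_timeout = '${lockTimeoutSec}s'",
      s"SET LOCAL statement_timeout = '${statementTimeoutSec}s'"
    )

  /** True when the exception was caused by one of the two timeouts. */
  def isTimeout(e: SQLException): Boolean =
    e.getSQLState == LockNotAvailable || e.getSQLState == QueryCanceled

  /**
   * Hypothetical bounded delete: returns Right(rowsDeleted) on success, or
   * Left(message) on timeout so the caller can map it to a 5xx response.
   * Table and column names are assumptions based on the log output.
   */
  def deleteWorkflowBounded(conn: Connection, wid: Int): Either[String, Int] = {
    val previousAutoCommit = conn.getAutoCommit
    conn.setAutoCommit(false) // SET LOCAL has no effect outside a transaction
    try {
      val st = conn.createStatement()
      try timeoutStatements(10, 30).foreach(sql => st.execute(sql))
      finally st.close()

      val ps = conn.prepareStatement("DELETE FROM texera_db.workflow WHERE wid = ?")
      val deleted =
        try { ps.setInt(1, wid); ps.executeUpdate() }
        finally ps.close()

      conn.commit()
      Right(deleted)
    } catch {
      case e: SQLException if isTimeout(e) =>
        conn.rollback() // release any locks taken before the timeout fired
        Left(s"delete of workflow $wid timed out (SQLSTATE ${e.getSQLState})")
      case e: SQLException =>
        conn.rollback()
        throw e
    } finally conn.setAutoCommit(previousAutoCommit)
  }
}
```

This only bounds the wait; it does not remove the contention, which is why canceling running executions first (fix 1) is still the preferred change.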

Workaround

Kill the Texera JVMs (TexeraWebApplication, ComputingUnitWorker, ComputingUnitMaster), restart them, then reload the Workflows page to clear any cached stale wids being POSTed for view-count.

How to reproduce?

  1. Open a workflow and start an execution that keeps the worker busy for >10 s (e.g. an iris ML pipeline).
  2. While the execution is still running, navigate to /dashboard/user/workflow and delete that workflow from the row's delete action.
  3. Try to upload another workflow (or delete a second one) from the same page.

Expected: upload completes; delete completes once the execution is canceled or finishes.
Observed: delete hangs forever, upload hangs forever, every subsequent workflow-table write piles up behind the same lock. After enough pileup the JVM closes connections under socket pressure and the dev-server proxy starts emitting ECONNRESET → ECONNREFUSED.

Branch

main

Commit Hash (Optional)

No response

What browsers are you seeing the problem on?

Not browser-specific — reproduces on any client; the freeze is server-side.

Relevant log output

# Thread dump of TexeraWebApplication while the API is frozen
# Problem: one open delete transaction holding the row lock,
# every other workflow-table write queued behind it.

"dw-NN" #N daemon (waiting on Postgres response)
   at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:137)
   at org.jooq.tools.jdbc.DefaultPreparedStatement.executeQuery(DefaultPreparedStatement.java:104)
   at org.jooq.impl.AbstractDMLQuery.executeReturningQuery(AbstractDMLQuery.java:1249)
   at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:428)
   at org.jooq.impl.AbstractDMLQuery.execute(AbstractDMLQuery.java:961)
   at org.jooq.impl.DAOImpl.deleteById(DAOImpl.java:284)
   at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource.$anonfun$deleteWorkflow$3(WorkflowResource.scala:634)
   at org.jooq.impl.DefaultDSLContext.lambda$transaction$5(DefaultDSLContext.java:612)
   at org.jooq.impl.DefaultDSLContext.transaction(DefaultDSLContext.java:611)
   at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource.deleteWorkflow(WorkflowResource.scala:631)

"dw-MM" / "dw-OO" / "dw-PP" ... (queued behind the open transaction)
   at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:137)
   at org.jooq.impl.AbstractDMLQuery.execute(AbstractDMLQuery.java:1074)
   at org.jooq.impl.TableRecordImpl.storeInsert0(TableRecordImpl.java:193)
   at org.jooq.impl.TableRecordImpl.insert(TableRecordImpl.java:140)
   at org.jooq.impl.DAOImpl.insert(DAOImpl.java:156)
   at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource$.insertWorkflow(WorkflowResource.scala:89)
   at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource.createWorkflow(WorkflowResource.scala:573)

# Frontend webpack-dev-server proxy view of the same incident:
[HPM] Error occurred while proxying request localhost:4200/api/workflow/create to http://localhost:8080/ [ECONNRESET]
[HPM] Error occurred while proxying request localhost:4200/api/workflow/delete to http://localhost:8080/ [ECONNRESET]
... (many lines later, after enough socket exhaustion)
[HPM] Error occurred while proxying request localhost:4200/api/workflow/create to http://localhost:8080/ [ECONNREFUSED]

# Foreign-key violation that compounds the contention while the lock is held —
# fired by stale dashboard tabs POSTing /api/hub/view for the deleted wid:
org.jooq.exception.DataAccessException: SQL [insert into "texera_db"."workflow_view_count" ("wid", "view_count") values (?, ?)
  on conflict ("wid") do update set "view_count" = ("texera_db"."workflow_view_count"."view_count" + ?)
  returning "texera_db"."workflow_view_count"."view_count"];
ERROR: insert or update on table "workflow_view_count" violates foreign key constraint "workflow_view_count_wid_fkey"
Detail: Key (wid)=(173) is not present in table "workflow".
    at org.apache.texera.web.resource.dashboard.hub.HubResource.postView(HubResource.scala:401)
