Skip to content

Workflow upload hangs indefinitely: pgroonga TokenMecab index on workflow.content blocks INSERTs #5142

@carloea2

Description

@carloea2

What happened?

The idx_workflow_pgroonga full-text index on texera_db.workflow makes workflow uploads hang indefinitely once a workflow's content field is non-trivial in size. The current definition:

CREATE INDEX idx_workflow_pgroonga ON texera_db.workflow USING pgroonga (
  ((((((COALESCE(name, '')::text || ' ') ||
        COALESCE(description, ''))     || ' ') ||
        COALESCE(content,  ''))))
) WITH (tokenizer='TokenMecab');

Two compounding problems make this catastrophic for everyday usage:

  1. content is in the indexed expression. workflow.content holds the serialized workflow graph; for any realistically sized workflow it is many KB to several MB of JSON-encoded data (operator properties, schemas, code, embedded resources). Every INSERT INTO workflow therefore runs the tokenizer over the entire blob.
  2. Tokenizer is TokenMecab — a Japanese morphological tokenizer (it depends on local config of user). Running MeCab over JSON / serialized data is dramatically slower than the natural-language input it was designed for, and produces useless index entries (no human will full-text-search the inside of a workflow's JSON content as Japanese prose).

Once a single insert is in flight inside pgroonga's C code, pg_cancel_backend and pg_terminate_backend return t but the backend ignores them — pgroonga's tokenization loop doesn't call CHECK_FOR_INTERRUPTS(). The session keeps its RowExclusiveLock on the table; every subsequent insert / update / delete queues behind it and hangs forever. Restarting the Texera JVMs doesn't help — the stuck Postgres backend survives. Recovery requires bouncing the Postgres service itself (Stop-Service postgresql / taskkill /F of the postmaster).

Visible symptom: workflow upload silently hangs in the dashboard with no error toast; the dev-server proxy eventually emits ECONNRESET → ECONNREFUSED. Workflow create / delete is broken across JVM restarts until Postgres is bounced.

Suggested fixes

In order of preference:

  1. Drop content from the indexed expression. Users search workflows by name and description, not by the internal JSON of the graph. The index becomes:
    USING pgroonga ((COALESCE(name, '') || ' ' || COALESCE(description, '')))
    That cuts the indexed text from MB to bytes per row and removes the worst-case tokenization load.
  2. Replace TokenMecab with TokenBigramSplitSymbolAlphaDigit (pgroonga's default for mixed-language / non-natural-language input). TokenMecab is appropriate only when the indexed text is Japanese prose.
  3. Make it a partial index excluding rows where octet_length(content) exceeds some threshold, as a defensive backstop.
  4. Independently, add lock_timeout=5s, statement_timeout=30s to the JDBC connection string used by WorkflowResource so a hung insert returns a 5xx in seconds rather than pinning a Postgres backend forever.

How to reproduce?

  1. With the pgroonga extension and the idx_workflow_pgroonga index in place (current schema), upload any workflow whose content is ≥ ~1 MB (most non-trivial workflows qualify). Also include any binary payload.
  2. pg_stat_activity shows the corresponding INSERT INTO workflow at state='active', wait_event=NULL, age climbing into minutes, holding RowExclusiveLock on the table.
  3. Try to upload anything else — the second INSERT queues behind it forever.
  4. pg_terminate_backend(<stuck_pid>) returns t but the backend keeps running.

Branch

main

Commit Hash (Optional)

No response

What browsers are you seeing the problem on?

Not browser-specific; reproduces from any client.

Relevant log output

-- pg_stat_activity while the upload hangs
  pid  | state  | wait_event_type | wait_event | age_s |                                          q
-------+--------+-----------------+------------+-------+-----------------------------------------------------------
 41144 | active |                 |            |   558 | insert into "texera_db"."workflow" ("name", "description", "content") values ($1, $2, $3) returning ...

-- The granted-but-stuck lock that pins the table
 locktype |       mode       | granted |  pid  | state  | age_s
----------+------------------+---------+-------+--------+-------
 relation | RowExclusiveLock | t       | 41144 | active |    63

-- pg_cancel_backend / pg_terminate_backend report success; the session ignores them
SELECT pg_terminate_backend(41144);   -- returns 't', PID 41144 still active 5 minutes later

-- JVM side: RUNNABLE inside socketRead0, waiting for a Postgres response that never comes
"dw-50 - POST /api/workflow/create" #50 elapsed=127s runnable
  java.lang.Thread.State: RUNNABLE
   at java.net.SocketInputStream.socketRead0(Native Method)
   at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2174)
   at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:137)
   at org.jooq.impl.DAOImpl.insert(DAOImpl.java:156)
   at WorkflowResource.insertWorkflow(WorkflowResource.scala:89)
   at WorkflowResource.createWorkflow(WorkflowResource.scala:573)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions