compute_worker re-executes submissions on broker redelivery, causing duplicate scores, hostname overwrite, and status flipping #2433

Description

@AybH26

Issue

When a compute_worker task is interrupted after the work has started but before the message is acked (worker crash, OOM kill, soft time-out, container restart, network partition with the broker), RabbitMQ requeues the same compute_worker_run payload. A subsequent worker pulls it from the broker and runs the full lifecycle a second time end-to-end.
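The redelivery path described above can be reproduced with a toy at-least-once queue. This is only a sketch: `FakeBroker`, `make_worker`, and the hostnames are hypothetical stand-ins, not the project's actual worker code or RabbitMQ's API. It assumes the message is acked only after the handler returns, so a crash after the work has run puts the same payload back on the queue.

```python
from collections import deque


class FakeBroker:
    """Toy at-least-once queue: a message is dropped only once it is acked."""

    def __init__(self):
        self.queue = deque()

    def publish(self, payload):
        self.queue.append(payload)

    def consume(self, handler):
        payload = self.queue.popleft()
        try:
            handler(payload)
        except Exception:
            # The handler died before acking, so the payload is requeued.
            self.queue.append(payload)
            return False
        return True  # handler returned, message is acked


runs = []  # (hostname, submission_id) for every execution of the job


def make_worker(hostname, crash_after_work):
    def handler(payload):
        # The work itself (and all its side effects) happens here.
        runs.append((hostname, payload["submission_id"]))
        if crash_after_work:
            raise RuntimeError("worker died before ack")

    return handler


broker = FakeBroker()
broker.publish({"submission_id": 42})

# First worker does the work, then dies before the ack.
broker.consume(make_worker("worker-a", crash_after_work=True))
# The same payload is redelivered and a second worker runs it again.
broker.consume(make_worker("worker-b", crash_after_work=False))

print(runs)  # [('worker-a', 42), ('worker-b', 42)]
```

The submission is executed twice even though the broker behaved correctly, which is why the side effects of the run need to be safe to repeat.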

Observed concrete symptoms:

  • The submission row's scoring_worker_hostname gets overwritten with the second worker's hostname, destroying the audit trail of who actually ran the job.
  • The submission status cycles backwards: a row already in Finished is PATCHed back through Running → Scoring → Finished, breaking downstream signals (notifications, leaderboard recomputation).
  • upload_submission_scores is called a second time and inserts a duplicate SubmissionScore row per leaderboard column. The aggregator Submission.calculate_scores() then crashes with MultipleObjectsReturned on submission.scores.get(column=col), and any subsequent score read can return either of the duplicates non-deterministically.
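The duplicate-score symptom can be illustrated without Django. The sketch below uses a hypothetical in-memory `ScoreTable` (not the project's `SubmissionScore` model) to show why a plain insert on each run breaks a single-row lookup, and how keying the write on `(submission_id, column)` keeps a replayed upload harmless. In Django the comparable pattern would be `QuerySet.update_or_create`, ideally backed by a unique constraint; whether that fits this codebase is an assumption.

```python
class MultipleObjectsReturned(Exception):
    """Stand-in for the Django exception of the same name."""


class ScoreTable:
    """Hypothetical in-memory table of score rows."""

    def __init__(self):
        self.rows = []

    def _matches(self, submission_id, column):
        return [
            row for row in self.rows
            if row["submission_id"] == submission_id and row["column"] == column
        ]

    def insert(self, submission_id, column, value):
        # Naive write: every call adds a row, so a replayed upload duplicates it.
        self.rows.append(
            {"submission_id": submission_id, "column": column, "value": value}
        )

    def upsert(self, submission_id, column, value):
        # Idempotent write: update the existing row for this key if there is one.
        matches = self._matches(submission_id, column)
        if matches:
            matches[0]["value"] = value
            return matches[0]
        self.insert(submission_id, column, value)
        return self.rows[-1]

    def get(self, submission_id, column):
        # Mirrors a single-object lookup: exactly one row must match.
        matches = self._matches(submission_id, column)
        if not matches:
            raise LookupError("no score for this column")
        if len(matches) > 1:
            raise MultipleObjectsReturned(f"{len(matches)} rows for {column!r}")
        return matches[0]


# Naive path: the redelivered run inserts a second row for the same column.
naive = ScoreTable()
naive.insert(42, "accuracy", 0.91)
naive.insert(42, "accuracy", 0.91)
try:
    naive.get(42, "accuracy")
    naive_failed = False
except MultipleObjectsReturned:
    naive_failed = True

# Keyed path: the second upload overwrites instead of duplicating.
safe = ScoreTable()
safe.upsert(42, "accuracy", 0.91)
safe.upsert(42, "accuracy", 0.91)
safe_row = safe.get(42, "accuracy")
```

This only addresses the score rows. The hostname overwrite and the backwards status transitions would need their own guard, for example refusing to restart a submission that is already in a terminal state.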
