
fix(infra): Task processor tasks killed during startup by health check#7887

Open
khvn26 wants to merge 1 commit into main from fix/task-processor-healthcheck-start-period


@khvn26 khvn26 commented Jun 26, 2026

Member

Thanks for submitting a PR! Please check the boxes below:

  • I have read the Contributing Guide.
  • I have added information to docs/ if required so people know about the feature.
  • I have filled in the "Changes" section below.
  • I have filled in the "How did you test this code" section below.

Changes

Raises the task-processor container health check startPeriod from 5s to 120s, in both the production and staging ECS task definitions.

With startPeriod: 5, interval 10 and retries 5, ECS tolerated only ~55s (the 5s grace window plus five failed checks 10s apart) before marking the container unhealthy, often killing tasks just as they were about to come up. This has been recycling task-processor ECS tasks several times a day and making deploys flap.

startPeriod: 120 covers the observed cold start and its variance, while interval 10 / timeout 2 / retries 5 still flags a genuinely hung task ~50s after the grace window closes.
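
The arithmetic behind the ~55s and ~50s figures can be sketched as follows. This is a simplified model, not ECS's exact scheduling: it assumes every check fails, that failures inside the start period are not counted towards retries, and it ignores the per-check timeout. The function name is hypothetical.

```python
def seconds_until_unhealthy(start_period: int, interval: int, retries: int) -> int:
    """Approximate time from container start until it is marked unhealthy.

    Simplified model (an assumption): failures during the start period are
    ignored, then `retries` consecutive failures spaced `interval` seconds
    apart flip the container to unhealthy.
    """
    return start_period + interval * retries


# Values taken from the PR description.
old_budget = seconds_until_unhealthy(start_period=5, interval=10, retries=5)
new_budget = seconds_until_unhealthy(start_period=120, interval=10, retries=5)

# Time to flag a hung task once the grace window has closed.
hung_detection = new_budget - 120

print(old_budget, new_budget, hung_detection)  # 55 170 50
```

Under this model the change buys roughly 115s of extra boot headroom while leaving the post-grace detection time at ~50s.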

This is a temporary change until we figure out boot time improvements.

How did you test this code?

N/A

The task-processor health check (`flagsmith healthcheck tcp`) passes only
once the container binds port 8000, but the entrypoint runs full bootstrap
(migrations, createcachetable, ClickHouse migrate, waitfordb) first, taking
~80s to bind. With startPeriod 5s the container was marked unhealthy and
SIGTERMed after ~55s — often just before it came up — recycling tasks
several times a day and flapping deploys.

Raise startPeriod to 120s in the prod and staging ECS task definitions to
cover the observed cold-start and its variance.
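
For orientation, the resulting health check settings might look roughly like the dictionary below, which mirrors the `healthCheck` field names of an ECS container definition. This is a sketch, not the actual task definition from this PR: the `command` form is an assumption (only `flagsmith healthcheck tcp` is stated above), and the helper `out_of_range` is hypothetical. The bounds are the ones documented for ECS health checks.

```python
# Documented ECS bounds: seconds for interval/timeout/startPeriod, a count for retries.
ECS_LIMITS = {
    "interval": (5, 300),
    "timeout": (2, 60),
    "retries": (1, 10),
    "startPeriod": (0, 300),
}

# Values from the PR description; the exact command array is an assumption.
health_check = {
    "command": ["CMD", "flagsmith", "healthcheck", "tcp"],
    "interval": 10,
    "timeout": 2,
    "retries": 5,
    "startPeriod": 120,  # raised from 5
}


def out_of_range(check: dict) -> list[str]:
    """Return the names of settings that fall outside the ECS bounds."""
    return [
        name
        for name, (low, high) in ECS_LIMITS.items()
        if not low <= check[name] <= high
    ]


print(out_of_range(health_check))  # []
```

Note that 120s sits well inside the 300s ceiling for `startPeriod`, so there is room to raise it further if cold starts get slower before the boot time work lands.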

beep boop
@khvn26 khvn26 requested a review from a team as a code owner June 26, 2026 12:01


@khvn26 khvn26 requested review from germangarces and removed request for a team June 26, 2026 12:01

@germangarces germangarces left a comment

Member


LGTM
