fix(infra): Task processor tasks killed during startup by health check#7887
Open
khvn26 wants to merge 1 commit into
The task-processor health check (`flagsmith healthcheck tcp`) passes only once the container binds port 8000, but the entrypoint runs full bootstrap (migrations, createcachetable, ClickHouse migrate, waitfordb) first, taking ~80s to bind. With startPeriod 5s the container was marked unhealthy and SIGTERMed after ~55s — often just before it came up — recycling tasks several times a day and flapping deploys. Raise startPeriod to 120s in the prod and staging ECS task definitions to cover the observed cold-start and its variance. beep boop
This was referenced Jun 26, 2026
Thanks for submitting a PR! Please check the boxes below:
docs/ if required so people know about the feature.
Changes
Raises the task-processor container health check `startPeriod` from 5s to 120s, in both the production and staging ECS task definitions.
With `startPeriod: 5` along with `interval 10 / retries 5`, ECS only tolerated ~55s before marking the container unhealthy, often killing tasks just as they were about to come up. This has been recycling task-processor ECS tasks several times a day, and makes deploys flap.
`startPeriod: 120` covers the observed cold start and its variance, while `interval 10 / timeout 2 / retries 5` still flags a genuinely hung task ~50s after the grace window closes.
This is a temporary change until we figure out boot time improvements.
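For reviewers who want to check the ~55s and ~50s figures, here is a minimal sketch of the arithmetic. It assumes the simplified model used in this description: failures during the start period are not counted, and the container is marked unhealthy after `retries` consecutive failed checks spaced `interval` seconds apart. The `HealthCheck` class and its field names are hypothetical and are not taken from the task definitions; the 80s cold start is the figure quoted in the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    """Hypothetical model of the health check timing settings (seconds)."""
    start_period: int
    interval: int
    retries: int

    def failure_window(self) -> int:
        # Time after the grace window for `retries` consecutive failures.
        return self.interval * self.retries

    def tolerated_seconds(self) -> int:
        # Approximate time a non-responsive container survives after launch.
        return self.start_period + self.failure_window()


OBSERVED_COLD_START = 80  # seconds, as reported in the commit message

before = HealthCheck(start_period=5, interval=10, retries=5)
after = HealthCheck(start_period=120, interval=10, retries=5)

for label, check in (("before", before), ("after", after)):
    survives = check.tolerated_seconds() >= OBSERVED_COLD_START
    print(f"{label}: tolerated={check.tolerated_seconds()}s, "
          f"covers cold start={survives}")
```

Under this model the old settings tolerate 55s, which is below the ~80s cold start, while the new settings give 170s in total and still flag a hung task 50s after the grace window ends.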
How did you test this code?
N/A