Skip to content

fix: needs_recreate behavior#424

Open
jason-lynch wants to merge 1 commit into
mainfrom
fix/PLAT-668/needs-recreate-logic
Open

fix: needs_recreate behavior#424
jason-lynch wants to merge 1 commit into
mainfrom
fix/PLAT-668/needs-recreate-logic

Conversation

@jason-lynch

@jason-lynch jason-lynch commented Jul 2, 2026

Copy link
Copy Markdown
Member

Summary

This commit modifies an existing process that handled pending deletions to remove resources that are pending recreation from the start state. This enables us to plan their creation at the appropriate phase.

I've included a golden test that demonstrates this behavior. I've also modified some existing PostgREST E2Es to test this behavior, since that's where we first noticed the issue.

Background

Resources can sometimes fail partway through the creation process, leaving files or other traces behind on the host. An example of this scenario is when we can create a Swarm service, but it fails to start. Rather than roll back the creation with a delete, which could also be broken or dangerous, we record the resource into the state with the needs_recreate flag set to true. This way, if the user deletes the database or makes another update that would delete the resource, we know to run that resource's Delete method to ensure that it's removed from the host.

Conversely, we know that we should run this resource's Create method again, rather than its Update method, if the user reattempts the operation. In this case, the intent was to call Create in whichever phase it would normally be recreated if it didn't exist. However, since our planning process only looks at differences between states, it sees resources that need to be recreated in the initial state, even if there aren't any diffs. This led to problems when the resource's normal phase was later in the update plan, such as when you fix a misconfigured supporting service resource.

Changes

The core change is in state.go: modifying the existing MarkPendingDeletion to also handle pending recreations. Most of the rest of this PR is test code.

Testing

# using a Swarm cluster
# create a database with a PostgREST instance with a misconfigured db_anon_role
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      },
      {
        "username": "anon",
        "attributes": ["NOLOGIN"]
      }
    ],
    "port": 0,
    "nodes": [{ "name": "n1", "host_ids": ["host-1"] }],
    "services": [
      {
        "service_id": "api",
        "service_type": "postgrest",
        "version": "latest",
        "host_ids": ["host-1"],
        "port": 3100,
        "connect_as": "admin",
        "config": {
          "db_anon_role": "nonexistent"
        }
      }
    ]
  }
}
EOF

# this should fail while creating the swarm.postgrest_preflight and
# swarm.postgrest_authenticator resources

# now, update the database with a corrected db_anon_role
cp1-req update-database storefront <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      },
      {
        "username": "anon",
        "attributes": ["NOLOGIN"]
      }
    ],
    "port": 0,
    "nodes": [{ "name": "n1", "host_ids": ["host-1"] }],
    "services": [
      {
        "service_id": "api",
        "service_type": "postgrest",
        "version": "latest",
        "host_ids": ["host-1"],
        "port": 3100,
        "connect_as": "admin",
        "config": {
          "db_anon_role": "anon"
        }
      }
    ]
  }
}
EOF

# this update would fail on main, but should succeed on this branch

PLAT-668

Resources can sometimes fail partway through the creation process,
leaving files or other traces behind on the host. An example of this
scenario is when we're able to create a Swarm service, but it fails to
start. Rather than rollback the create with a delete, which could also
be broken or dangerous, we record the resource into the state with the
`needs_recreate` flag set to `true`. This way, if the user deletes the
database or makes another update that would delete the resource, we know
to run that resource's `Delete` method to ensure that it's removed from
the host. Conversely, we know that we should run this resource's `Create`
method again rather than its `Update` method if the user re-attempts the
operation. In this case, the intent was to call `Create` in whichever
phase it would normally be recreated if it didn't exist. However, since
our planning process only looks at differences between states, it sees
resources that need to be recreated in the initial state, even if there
aren't any diffs. This lead to problems when the resource's normal phase
was later in the update plan, such as when you fix a misconfigured
supporting service resource.

This commit modifies an existing process that handled pending deletions
so that it removes resources that are pending recreation from the start
state. This enables us to plan their creation at the appropriate phase.

I've included a golden test that demonstrates this behavior. I've also
modified some existing PostgREST E2Es to test this behavior because
because that's where we first noticed this issue.

PLAT-668
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Changes State's MarkPendingDeletion method into PreparePendingDeletesAndRecreates, which additionally removes resources from the start-state map when they need recreation. UpdateDatabase now clones its input state and calls the new method. Tests cover this scenario with a golden fixture. Also, PostgREST e2e tests now verify successful recovery updates after initial failures.

Changes

Recreate handling and failed deployment recovery

Layer / File(s) Summary
PreparePendingDeletesAndRecreates method
server/internal/resource/state.go
Renames MarkPendingDeletion to PreparePendingDeletesAndRecreates, adding logic to delete recreate-flagged resources from the start-state map.
UpdateDatabase cloning and wiring
server/internal/database/operations/update_database.go
UpdateDatabase clones the input state before mutating it and calls the renamed method instead of MarkPendingDeletion.
Failed service deployment tests and golden output
server/internal/database/operations/update_database_test.go, server/internal/database/operations/golden_test/TestUpdateDatabase/update_after_failed_service_deployment.json
Adds a test scenario and golden JSON fixture for update planning after a partially failed service deployment involving recreate/delete resources.

PostgREST e2e recovery tests

Layer / File(s) Summary
Preflight failure and recovery assertions
e2e/postgrest_test.go
Adds follow-up db.Update calls with corrected schema/anon-role configuration after initial failures, asserting successful recovery.

Compact metadata: 5 files changed, +125/-4 lines across e2e tests, database operations, and resource state.

Suggested labels: go, tests, bug-fix

Suggested reviewers: pgEdge control-plane maintainers familiar with database operations and resource state reconciliation

Poem

A rabbit hops through state so neat,
Cloning starts before each feat,
Recreates cleared, deletions planned,
PostgREST configs fixed by hand,
Golden tests now light the way—
Hop hop hooray! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title is concise and accurately summarizes the main change to needs_recreate handling.
Description check ✅ Passed The description covers summary, changes, and testing with enough detail, though the checklist and notes sections are omitted.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/PLAT-668/needs-recreate-logic

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@codacy-production

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity · 0 duplication

Metric Results
Complexity 0
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@jason-lynch jason-lynch marked this pull request as ready for review July 2, 2026 20:59
@jason-lynch jason-lynch requested a review from moizpgedge July 2, 2026 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants