Skip to content

feat: improve status on team operator resources#94

Open
ian-flores wants to merge 28 commits intomainfrom
improve-status-fields
Open

feat: improve status on team operator resources#94
ian-flores wants to merge 28 commits intomainfrom
improve-status-fields

Conversation

@ian-flores
Copy link
Collaborator

@ian-flores ian-flores commented Feb 19, 2026

Summary

  • Add Kubernetes-native conditions-based status reporting across all product CRDs
  • Add CommonProductStatus type with Conditions, ObservedGeneration, and Version fields, embedded in all status structs including PostgresDatabase
  • Add internal/status helper package with SetReady(), SetProgressing(), IsReady(), ExtractVersion() and reason constants (ReasonReconcileComplete, ReasonDeploymentReady/NotReady, ReasonStatefulSetReady/NotReady, ReasonAllComponentsReady, ReasonComponentsNotReady, ReasonDatabaseReady)
  • Add deployment health monitoring to Connect, Workbench, PackageManager, Flightdeck controllers and statefulset health monitoring to Chronicle
  • Progressing condition correctly reflects rollout state: True during active rollouts, False with ReasonReconcileComplete when done
  • Add conditions to PostgresDatabase for database provisioning success/failure tracking
  • Add aggregate Site status with per-component readiness bools (ConnectReady, WorkbenchReady, etc.) and Owns() watches on child CRs
  • Add Ready, Version, and Age printcolumns to all CRDs for kubectl get output
  • Migrate status updates from Status().Update() to Status().Patch() with MergeFrom to avoid conflicts
  • Mark existing Ready bool field as +optional for patch compatibility; value now derived from the Ready condition
  • Update fake test client with WithStatusSubresource for all CRD types
  • ExtractVersion handles registry ports (e.g., localhost:5000/image:v1) and digest references (image:tag@sha256:...)

Closes #25

Test plan

  • just build passes
  • just mgenerate regenerates CRDs and deepcopy without errors
  • just helm-generate keeps Helm chart in sync
  • go fmt produces no changes
  • Flightdeck controller tests pass with status patching
  • Deploy to test cluster and verify kubectl get connects shows READY/VERSION/AGE columns
  • Verify conditions populate correctly on reconciliation
  • Verify Site aggregate status reflects child component health
  • Verify status reverts to not-ready when deployments become unhealthy

@claude
Copy link

claude bot commented Feb 19, 2026

Claude finished @ian-flores's task —— View job


Review: feat: improve status on team operator resources

Overall this is a well-structured change that follows Kubernetes conventions for conditions-based status reporting. The CommonProductStatus type, the internal/status helper package, and the migration from Status().Update() to Status().Patch() with MergeFrom are all solid patterns. Here are findings organized by priority:


Critical: ExtractVersion breaks on registry ports

internal/status/status.go:58-71

ExtractVersion uses strings.LastIndex(image, ":") which will incorrectly parse image references that include a registry port. For example, localhost:5000/myimage:v1.0 would extract 5000/myimage after the first :, not v1.0.

Fix by isolating the last path segment before checking for the tag:

func ExtractVersion(image string) string {
    if idx := strings.LastIndex(image, "@"); idx != -1 {
        image = image[:idx]
    }
    lastSlash := strings.LastIndex(image, "/")
    nameTag := image
    if lastSlash != -1 {
        nameTag = image[lastSlash+1:]
    }
    if idx := strings.LastIndex(nameTag, ":"); idx != -1 {
        tag := nameTag[idx+1:]
        if tag == "latest" {
            return ""
        }
        return tag
    }
    return ""
}

Important: Site aggregate readiness assumes all 5 components must exist

internal/controller/core/site_controller.go:97

allReady := s.Status.ConnectReady && s.Status.WorkbenchReady && s.Status.PackageManagerReady && s.Status.ChronicleReady && s.Status.FlightdeckReady

If a Site doesn't configure all five components, the missing child CRs won't be found, defaulting to false, so the Site can never become Ready. Consider making the aggregate check aware of which components are actually enabled in the Site spec.


Important: Repeated error-path status boilerplate across all controllers

Every controller has 5+ instances of this pattern:

status.SetReady(&c.Status.Conditions, c.Generation, metav1.ConditionFalse, status.ReasonReconcileError, err.Error())
status.SetProgressing(&c.Status.Conditions, c.Generation, metav1.ConditionFalse, status.ReasonReconcileError, err.Error())
_ = r.Status().Patch(ctx, c, patchBase)

Consider extracting a helper to reduce the boilerplate, e.g.:

func PatchErrorStatus(ctx context.Context, statusWriter client.StatusWriter, obj client.Object, patchBase client.Patch, conditions *[]metav1.Condition, generation int64, err error) {
    SetReady(conditions, generation, metav1.ConditionFalse, ReasonReconcileError, err.Error())
    SetProgressing(conditions, generation, metav1.ConditionFalse, ReasonReconcileError, err.Error())
    _ = statusWriter.Patch(ctx, obj, patchBase)
}

Suggestion: Inconsistent JSONPath quoting in printcolumn annotations

Some CRDs use single quotes ('Ready') and some use double quotes ("Ready") in the JSONPath filter expression. Both work, but pick one style for consistency across all CRDs. Connect, PackageManager, PostgresDatabase, and Site use 'Ready'; Chronicle, Flightdeck, and Workbench use "Ready".


Suggestion: DeploymentReady reason used for StatefulSet

internal/controller/core/chronicle_controller.go:139-140

Chronicle uses a StatefulSet but the condition reason is ReasonDeploymentReady / ReasonDeploymentNotReady. Consider adding ReasonStatefulSetReady or generalizing to ReasonReplicasReady.


Suggestion: Inline reason string for PostgresDatabase

internal/controller/core/postgresdatabase_controller.go:99

"DatabaseReady" is used as an inline string while all other reason values are centralized as constants in internal/status/status.go. Add it as a constant there for consistency.


Suggestion: Stale TODO in PackageManager

internal/controller/core/package_manager.go:221-222

The comment // TODO: Verify if PackageManager has an Image field is stale -- pm.Spec.Image is used unconditionally on the happy path. Remove the TODO.


Suggestion: PostgresDatabase duplicates Conditions/ObservedGeneration

api/core/v1beta1/postgresdatabase_types.go:51-62

PostgresDatabaseStatus defines its own Conditions and ObservedGeneration fields instead of embedding CommonProductStatus. This works but could drift over time. Consider embedding CommonProductStatus and simply not populating Version.


Suggestion: No tests for internal/status package

The internal/status package has no test file. Given the parsing logic in ExtractVersion (and the registry-port bug above), unit tests are warranted.


Suggestion: Document child-name-lookup assumption

internal/controller/core/site_controller.go:433

aggregateChildStatus uses site.Name as the object key for every child CR. A brief comment documenting this naming-convention assumption would help future maintainers.


Checklist (per review guidelines)

API Changes:

  • Kubebuilder annotations are correct
  • New fields have sensible defaults (all +optional)
  • Backward compatible (existing Ready bool field kept)
  • JSONPath quoting consistency (minor)

Controller Changes:

  • Reconciliation is idempotent
  • Error handling reports status correctly
  • Migration from Update to Patch with MergeFrom is correct
  • Unit/integration tests exist (no tests for internal/status)

CRDs:

  • CRDs are up to date (regenerated)
  • PrintColumns added correctly

|

@ian-flores
Copy link
Collaborator Author

Tested on ganso01-staging

Deployed adhoc image ghcr.io/posit-dev/team-operator:adhoc-improve-status-fields-v1.12.0-16-g7c8d4b5 to ganso01-staging and verified end-to-end.

kubectl get printcolumns

=== Connects ===
NAME   READY   VERSION                AGE
alt    True    ubuntu2204-2026.01.1   16d
main   True    ubuntu2204-2026.01.1   16d

=== Workbenches ===
NAME   READY   VERSION                AGE
alt    False   ubuntu2204-2026.01.0   16d
main   True    ubuntu2204-2026.01.0   16d

=== PackageManagers ===
NAME   READY   VERSION   AGE
alt    False             16d
main                     16d

=== Chronicles ===
NAME   READY   VERSION     AGE
alt    False               16d
main   True    2026.01.0   16d

=== Flightdecks ===
NAME   READY   VERSION   AGE
alt    True              16d
main   True              16d

=== Sites ===
NAME   READY   AGE
alt    False   16d
main   False   16d

=== PostgresDatabases ===
NAME                  READY   AGE
alt-connect                   16d
main-connect                  16d

Condition detail (Connect main)

status:
  conditions:
  - lastTransitionTime: "2026-02-19T22:09:21Z"
    message: Reconciliation complete
    observedGeneration: 2
    reason: ReconcileComplete
    status: "False"
    type: Progressing
  - lastTransitionTime: "2026-02-19T22:09:21Z"
    message: Deployment has minimum availability
    observedGeneration: 2
    reason: DeploymentReady
    status: "True"
    type: Ready
  keySecretRef: {}
  observedGeneration: 2
  ready: true
  version: ubuntu2204-2026.01.1

Observations

  • Printcolumns (READY, VERSION, AGE) working across all CRDs
  • Conditions correctly reflect deployment/statefulset health
  • Progressing=False with ReasonReconcileComplete on healthy resources
  • Site Ready=False is correct — alt site has unhealthy children (workbench, packagemanager, chronicle)
  • Version extracted from image tags where available
  • Backward-compatible ready bool derived from Ready condition
  • PostgresDatabases haven't been re-reconciled yet so conditions are empty — they'll populate on next spec change

@ian-flores ian-flores marked this pull request as ready for review February 19, 2026 22:18
@timtalbot
Copy link
Contributor

Nice! It looks like the comment Site aggregate readiness assumes all 5 components must exist is still true? Will we address that after #93?

…ut data loss

Extend the enable/disable/teardown pattern from Connect (PR #93) to
Workbench, Package Manager, and Chronicle.

- Add spec.{workbench,packageManager,chronicle}.enabled and .teardown
  fields to the Site CRD
- Add Suspended field to Workbench, PackageManager, and Chronicle CRDs
- Site controller uses three-way branching: enabled → reconcile,
  disabled → suspend (preserve data), teardown → delete CR
- Product controllers skip serving resources when Suspended=true
- Network policies cleaned up when products are disabled
- Fix Workbench cleanupDeployedService (was a no-op)
- Fix Chronicle CleanupChronicle (was a TODO no-op)
- Add 9 tests (3 per product: disable-never-enabled, suspend, teardown)

Closes #95
- Remove unused getChronicle test helper
- Correct "CRD" to "CR" in cleanup function godoc comments
- Fix double blank line in product-team-site-management.md
@ian-flores
Copy link
Collaborator Author

Let me wait to finish #99 and I'll address comprehensively for all

# Conflicts:
#	internal/controller/core/connect.go
…into improve-status-fields

# Conflicts:
#	internal/controller/core/chronicle_controller.go
#	internal/controller/core/package_manager.go
#	internal/controller/core/workbench.go
A disabled product (enabled: false) has no child CR, which caused
aggregateChildStatus to set *Ready = false and block the site from
reaching Ready. Check Enabled first for each product and treat
disabled products as ready so they don't factor into the aggregate.

Adds TestSiteReadyWithDisabledProducts to cover this case.
Changes:
- Add `Enabled` check for Flightdeck in `aggregateChildStatus` (mirrors the existing Chronicle/Connect/Workbench/PackageManager pattern) so a disabled Flightdeck is treated as ready and doesn't permanently block `allReady`
- Change `aggregateChildStatus` signature to return `error`; on non-`IsNotFound` API errors, log and return the error instead of silently setting the ready flag to false
- Update `Reconcile` to capture the returned error from `aggregateChildStatus` and propagate it as the return value after status patch, triggering reconciler requeue with backoff on transient API/RBAC failures
Build succeeds. The test failures are pre-existing environment issues (kubebuilder's etcd binary not present in this sandbox), confirmed by the same failures occurring before my changes.
Changes:
- Fix `allReady` permanently false for Sites with optional Chronicle/Flightdeck: change guard condition from `Enabled != nil && !*Enabled` (only explicit false) to `Enabled == nil || !*Enabled` (nil or false) so these optional components don't block site readiness unless explicitly opted in via `Enabled: true`
- Normalize `jsonPath` filter quoting from single to double quotes in Helm chart CRDs for connects, postgresdatabases, and sites
- Normalize `jsonPath` filter quoting from single to double quotes in base CRDs for connects, postgresdatabases, and sites
JSONPath single-quote style fix regenerated by controller-gen.
Build passes. The only failing test (`TestSiteReconcileWithExperimental`) is pre-existing and unrelated to these changes.
---
Changes:
- Add comment to `WithStatusSubresource` block in `api/localtest/fake.go` explaining the purpose of the list and reminding future authors to add new `+kubebuilder:subresource:status` types
- Add `api/localtest/fake_test.go` with `TestStatusUpdateOnlyMutatesStatus` that calls `Status().Update()` on a `Site`, confirms the status field is persisted, and confirms the spec was not mutated — documenting intent and guarding against regression if the builder pattern is accidentally reverted
- Add PatchErrorStatus helper to reduce error-path boilerplate across all controllers
- Fix inconsistent JSONPath quoting in CRD printcolumn annotations (use single quotes)
- Add comprehensive tests for internal/status package
- Regenerate CRDs and Helm charts with updated annotations
The build succeeds. The test failures (`TestConnectReconciler_SAML*`) are due to missing `etcd` binary in this environment (`fork/exec /usr/local/kubebuilder/bin/etcd: no such file or directory`) — these are infrastructure-level integration tests that require a running control plane and are unrelated to my changes. All other tests pass.
Changes:
- Move `patchBase` creation and status mutations (`ObservedGeneration`, `SetProgressing`) to after the suspension guard so they are not silently discarded when `suspendDeployedService` is called (which does not apply the patch)
- Remove extra blank line between suspension block and `secretKey` declaration (already resolved in the working tree; confirmed single blank line present)
The build passes cleanly. The test failures are all infrastructure-related (`/usr/local/kubebuilder/bin/etcd: no such file or directory`) — these are integration tests requiring a Kubernetes control plane that isn't available in this environment. They are pre-existing failures unrelated to the changes made.
---
**Changes:**
- Move suspension check before `patchBase` and status mutations in `ReconcileChronicle` so suspended resources don't have unpersisted status writes
- Move suspension check before `patchBase` and status mutations in `ReconcilePackageManager` for the same reason
- Move suspension check before `patchBase` and status mutations in `ReconcileWorkbench` for the same reason
- Suspended resources no longer receive a misleading "Reconciling" progressing condition
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve Status on the Team Operator Resources

2 participants