Add automated backup and restore via dedicated CRDs #2015

Open: discostur wants to merge 1 commit into Altinity:0.27.1 from discostur:feature/clickhouse-backup

@discostur

Contributor

Add automated backup & restore via dedicated CRDs

Closes the long-standing request for operator-managed backups (#1795, #862). This
supersedes the gRPC-plugin approach of #1798 with a lighter, CRD-driven design that
wraps clickhouse-backup and mirrors the
existing ClickHouseKeeper controller-runtime pattern.

What it adds

Three new custom resources in the clickhouse.altinity.com/v1 group:

| Kind | Short | Operator action |
| --- | --- | --- |
| ClickHouseBackup | chb | one-off backup → Kubernetes Job |
| ClickHouseBackupSchedule | chbs | recurring backup → managed CronJob |
| ClickHouseRestore | chr | one-off restore → Kubernetes Job |

Design

  • No reinvented backup logic. clickhouse-backup runs as a sidecar (documented
    prerequisite, API_CREATE_INTEGRATION_TABLES=true); the operator-generated jobs trigger
    it remotely through the system.backup_actions integration table. The new controllers
    stay fully decoupled from CHI reconciliation.
  • controller-runtime, mirroring pkg/controller/chk. Jobs/CronJobs are owned by the
    CR (automatic GC + status tracking via status.conditions).
  • Schedule → native CronJob: schedule/suspend/concurrency/history come from the
    CronJob; remote retention is delegated to clickhouse-backup (BACKUPS_TO_KEEP_REMOTE).

Cluster awareness

  • Backup defaults to one replica per shard (FirstPerShard) — correct and
    storage-efficient for Replicated* tables. AllReplicas is available for clusters with
    non-replicated/local tables.
  • Restore applies schema on all replicas and data on the first replica of each
    shard, letting native replication synchronize the rest.
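
The replica selection above can be sketched like this. The mode names `FirstPerShard` and `AllReplicas` come from the description; the `Host` type, the function name, and the zero-based indices are assumptions made for the example and may not match the PR's types.

```go
package main

import "fmt"

// Host identifies one ClickHouse host by zero-based shard and replica index.
// Hypothetical type for illustration only.
type Host struct {
	Shard   int
	Replica int
}

const (
	FirstPerShard = "FirstPerShard" // one replica per shard (the default)
	AllReplicas   = "AllReplicas"   // every replica, for non-replicated tables
)

// backupTargets lists the hosts a backup should run on for a cluster with
// the given layout counts.
func backupTargets(shards, replicas int, mode string) []Host {
	var targets []Host
	for s := 0; s < shards; s++ {
		for r := 0; r < replicas; r++ {
			if mode == FirstPerShard && r > 0 {
				break // replication keeps the other replicas identical
			}
			targets = append(targets, Host{Shard: s, Replica: r})
		}
	}
	return targets
}

func main() {
	fmt.Println(backupTargets(2, 3, FirstPerShard))
	fmt.Println(len(backupTargets(2, 3, AllReplicas)))
}
```

For a 2-shard, 3-replica cluster this yields two targets under `FirstPerShard` and six under `AllReplicas`. The data step of a restore would use the same first-replica-per-shard selection.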

Restore safety (per mature DB-operator conventions, e.g. CloudNativePG)

  • Preflight validation (target CHI Completed, topology reachable) surfaced via conditions.
  • Overwrite guard: refuses a non-empty target unless overwrite: true.
  • Restoring into a fresh, empty CHI is the recommended path; one-shot jobs
    (backoffLimit: 0, restartPolicy: Never).
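
A rough sketch of the preflight and overwrite guard, under stated assumptions: the description does not say how the operator decides that a target is non-empty, so the `TableCount` field, the struct, and the error messages below are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// restoreRequest captures the inputs the guard needs. All fields are
// hypothetical stand-ins for whatever the controller actually inspects.
type restoreRequest struct {
	TargetStatus string // status of the target CHI, e.g. "Completed"
	Reachable    bool   // whether every host in the topology answered
	TableCount   int    // user tables found in the target (assumed emptiness check)
	Overwrite    bool   // spec.overwrite
}

// preflight returns nil when the restore may proceed, or an error that the
// controller could surface as a status condition.
func preflight(r restoreRequest) error {
	if r.TargetStatus != "Completed" {
		return fmt.Errorf("target CHI is %q, want Completed", r.TargetStatus)
	}
	if !r.Reachable {
		return errors.New("target topology is not reachable")
	}
	if r.TableCount > 0 && !r.Overwrite {
		return errors.New("target is not empty; set overwrite: true to proceed")
	}
	return nil
}

func main() {
	err := preflight(restoreRequest{TargetStatus: "Completed", Reachable: true, TableCount: 4})
	fmt.Println(err)
}
```

Failing fast here matters because the restore Job is one-shot (`backoffLimit: 0`), so a rejected request never starts a Job that would need cleanup.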

Included

  • API types + hand-written deepcopy (regenerable via dev/run_code_generator.sh).
  • Controllers + minimal Job/CronJob builders + operator wiring (thread_backup.go).
  • Hand-written CRD section + RBAC (incl. batch jobs/cronjobs); regenerated install
    bundles and Helm chart.
  • docs/backup.md + docs/chb-examples/.
  • Go unit tests (builders + controller reconcile) and a TestFlows e2e test
    (tests/e2e/test_backup_restore.py) doing a backup→restore round-trip with replica-sync
    verification.

Notes / follow-ups

  • Host service names are resolved from cluster layout counts using the default naming
    scheme; explicit shard/replica lists / custom host names are a planned follow-up.
  • For sharded clusters the sidecar remote path should include the {shard} macro
    (e.g. S3_PATH: backup/shard-{shard}).
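
Resolving host service names from layout counts might look like the sketch below. It assumes the operator's default per-host service pattern is `chi-{chi}-{cluster}-{shard}-{replica}` with zero-based indices; that pattern, the function name, and the example names are assumptions, and custom naming templates are exactly the case this approach does not cover.

```go
package main

import "fmt"

// hostServiceNames derives per-host service names from shard and replica
// counts, assuming the default naming scheme
// chi-{chi}-{cluster}-{shard}-{replica}. Hypothetical helper.
func hostServiceNames(chi, cluster string, shards, replicas int) []string {
	names := make([]string, 0, shards*replicas)
	for s := 0; s < shards; s++ {
		for r := 0; r < replicas; r++ {
			names = append(names, fmt.Sprintf("chi-%s-%s-%d-%d", chi, cluster, s, r))
		}
	}
	return names
}

func main() {
	// "demo" and "main" are made-up installation and cluster names.
	for _, n := range hostServiceNames("demo", "main", 2, 2) {
		fmt.Println(n)
	}
}
```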

🤖 Generated with Claude Code

@discostur force-pushed the feature/clickhouse-backup branch from 3ce23ba to 69632ec on June 29, 2026 18:32
@discostur

discostur commented Jun 30, 2026

Contributor Author

Pushed an update extending the feature with several enhancements:

  • Selective backups: spec.tables (clickhouse-backup --tables pattern) and spec.partitions.
  • Incremental backups: spec.diffFromRemote maps to create_remote --diff-from-remote=<base>.
  • Retention: spec.keepLastRemote keeps only the N most recent remote backups (best-effort prune via system.backup_list + delete remote).
  • Verification: spec.verify runs a post-backup job that downloads the backup and checks its integrity (no cluster data touched); the result is surfaced as the Verified condition and a clickhouse_operator_backup_verifications_failed metric.
  • Observability: Prometheus metrics on the operator's existing :9999/metrics endpoint (clickhouse_operator_backups_* / restores_*, a duration histogram, last-success timestamp), Kubernetes Events on the backup/restore lifecycle, and the duration in status plus a duration printer column.
  • Bootstrap-from-backup: annotate a fresh ClickHouseInstallation with clickhouse.altinity.com/recover-from-backup: <name> and the operator auto-creates a one-time ClickHouseRestore once the cluster is up.
  • Compression & encryption: documented as clickhouse-backup sidecar settings (COMPRESSION_FORMAT, S3 S3_SSE/SSE_KMS_KEY_ID, etc.).
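
To make the selective, incremental, and retention options concrete, here is a hedged sketch of how spec fields could map onto a `create_remote` command and how a `keepLastRemote` prune list could be computed. The struct fields mirror the spec names mentioned above, but the types, function names, and the use of a Unix timestamp for ordering are assumptions, not the PR's code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// BackupSpec is a hypothetical subset of the ClickHouseBackup spec.
type BackupSpec struct {
	Name           string
	Tables         string // passed to --tables
	Partitions     string // passed to --partitions
	DiffFromRemote string // passed to --diff-from-remote
}

// createRemoteCommand assembles the clickhouse-backup command line.
func createRemoteCommand(s BackupSpec) string {
	args := []string{"create_remote"}
	if s.Tables != "" {
		args = append(args, "--tables="+s.Tables)
	}
	if s.Partitions != "" {
		args = append(args, "--partitions="+s.Partitions)
	}
	if s.DiffFromRemote != "" {
		args = append(args, "--diff-from-remote="+s.DiffFromRemote)
	}
	return strings.Join(append(args, s.Name), " ")
}

// RemoteBackup is one row as it might be read from system.backup_list.
type RemoteBackup struct {
	Name    string
	Created int64 // creation time as Unix seconds (assumed representation)
}

// backupsToPrune returns the names of all but the `keep` newest backups.
// keep <= 0 is treated as "retention disabled" in this sketch.
func backupsToPrune(all []RemoteBackup, keep int) []string {
	if keep <= 0 || len(all) <= keep {
		return nil
	}
	sorted := append([]RemoteBackup(nil), all...) // do not mutate the input
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Created > sorted[j].Created })
	var names []string
	for _, b := range sorted[keep:] {
		names = append(names, b.Name)
	}
	return names
}

func main() {
	fmt.Println(createRemoteCommand(BackupSpec{Name: "b2", Tables: "db.*", DiffFromRemote: "b1"}))
	fmt.Println(backupsToPrune([]RemoteBackup{{"a", 1}, {"b", 3}, {"c", 2}}, 2))
}
```

Each returned name would then be removed with a `delete remote <name>` action; because the prune is best-effort, a failed delete should not fail the backup itself.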

I also corrected the replicated-restore path to issue the schema ON CLUSTER (requires the sidecar's restore_schema_on_cluster) instead of per-replica.

@discostur force-pushed the feature/clickhouse-backup branch from 69632ec to 56b62c0 on June 30, 2026 09:12
Introduces operator-managed backup and restore for ClickHouse using
clickhouse-backup, exposed through three new custom resources in the
clickhouse.altinity.com/v1 API group:

- ClickHouseBackup (chb): one-off backup -> Kubernetes Job
- ClickHouseBackupSchedule (chbs): recurring backup -> managed CronJob
- ClickHouseRestore (chr): one-off restore -> Kubernetes Job

The controllers follow the existing ClickHouseKeeper controller-runtime
pattern. clickhouse-backup runs as a sidecar (a documented prerequisite);
the generated jobs trigger it remotely through the system.backup_actions
integration table, so no backup logic is reimplemented in the operator.

Cluster-aware: backs up one replica per shard for Replicated* tables
(AllReplicas opt-in for non-replicated data); on restore it applies the
schema on the first replica per shard via ON CLUSTER (requires the
sidecar's restore_schema_on_cluster) and the data on the first replica,
letting native replication synchronize the remaining replicas.

Restore safety follows the conventions of mature DB operators: preflight
validation (target CHI Completed, topology reachable) and an overwrite
guard that refuses a non-empty target unless overwrite=true.

Also adds: selective (tables/partitions) and incremental
(--diff-from-remote) backups; remote-backup retention (keepLastRemote);
optional post-backup verification; Prometheus metrics on the operator's
existing :9999 endpoint plus Kubernetes Events; and annotation-driven
bootstrap-from-backup for new installations. Compression and encryption
are documented as clickhouse-backup sidecar settings.

Includes the CRDs, RBAC (incl. batch jobs/cronjobs), regenerated install
bundles and Helm chart, documentation and examples, Go unit tests and a
TestFlows e2e test.

Refs Altinity#1795, Altinity#862. Supersedes the gRPC-plugin approach of Altinity#1798.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Kilian Ries <mail@kilian-ries.de>
@discostur force-pushed the feature/clickhouse-backup branch from 56b62c0 to d5f2eb2 on June 30, 2026 09:41
@discostur

Contributor Author

Tested locally in dev k8s cluster ... happy to get some feedback from the maintainers ;)
