Add automated backup and restore via dedicated CRDs by discostur · Pull Request #2015 · Altinity/clickhouse-operator

discostur · 2026-06-29T16:07:32Z

Add automated backup & restore via dedicated CRDs

Closes the long-standing request for operator-managed backups (#1795, #862). This
supersedes the gRPC-plugin approach of #1798 with a lighter, CRD-driven design that
wraps clickhouse-backup and mirrors the
existing ClickHouseKeeper controller-runtime pattern.

What it adds

Three new custom resources in the clickhouse.altinity.com/v1 group:

| Kind | Short | Operator action |
|---|---|---|
| `ClickHouseBackup` | `chb` | one-off backup → Kubernetes `Job` |
| `ClickHouseBackupSchedule` | `chbs` | recurring backup → managed `CronJob` |
| `ClickHouseRestore` | `chr` | one-off restore → Kubernetes `Job` |

Design

No reinvented backup logic. clickhouse-backup runs as a sidecar (documented
prerequisite, API_CREATE_INTEGRATION_TABLES=true); the operator-generated jobs trigger
it remotely through the system.backup_actions integration table. The new controllers
stay fully decoupled from CHI reconciliation.
Built on controller-runtime, mirroring pkg/controller/chk. Jobs/CronJobs are owned by the
CR (automatic GC + status tracking via status.conditions).
Schedule → native CronJob: schedule/suspend/concurrency/history come from the
CronJob; remote retention is delegated to clickhouse-backup (BACKUPS_TO_KEEP_REMOTE).
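
For reviewers unfamiliar with the integration table, here is a minimal Go sketch of the kind of SQL a generated Job could send to the sidecar. It is not taken from the PR: the helper names (`quoteSQL`, `triggerSQL`, `statusSQL`) are hypothetical, and the `status` and `error` columns are assumed from clickhouse-backup's documented `system.backup_actions` table.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteSQL wraps s in single quotes and escapes backslashes and quotes so the
// value cannot terminate the ClickHouse string literal early.
func quoteSQL(s string) string {
	s = strings.ReplaceAll(s, `\`, `\\`)
	s = strings.ReplaceAll(s, `'`, `\'`)
	return "'" + s + "'"
}

// triggerSQL returns the statement that asks the sidecar to run a
// clickhouse-backup command such as "create_remote <name>".
func triggerSQL(command string) string {
	return fmt.Sprintf("INSERT INTO system.backup_actions(command) VALUES(%s)", quoteSQL(command))
}

// statusSQL returns a query for polling the outcome of that command.
// The status/error column names are an assumption, see the note above.
func statusSQL(command string) string {
	return fmt.Sprintf("SELECT status, error FROM system.backup_actions WHERE command = %s", quoteSQL(command))
}

func main() {
	cmd := "create_remote nightly-2026-06-30" // hypothetical backup name
	fmt.Println(triggerSQL(cmd))
	fmt.Println(statusSQL(cmd))
}
```

Keeping the trigger as plain SQL is what lets the Job stay a thin client: it only needs clickhouse-client access to the target host, not the backup tooling itself.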

Cluster awareness

Backup defaults to one replica per shard (FirstPerShard) — correct and
storage-efficient for Replicated* tables. AllReplicas is available for clusters with
non-replicated/local tables.
Restore applies schema on all replicas and data on the first replica of each
shard, letting native replication synchronize the rest.
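
The replica selection described above can be sketched in Go as follows. This is an illustration under assumptions, not the PR's code: the `Target` type and `backupTargets` function are hypothetical, and only the two mode names come from the description.

```go
package main

import "fmt"

// Target identifies one ClickHouse host by zero-based shard and replica index.
type Target struct {
	Shard, Replica int
}

// Mode names as used in the PR description.
const (
	FirstPerShard = "FirstPerShard"
	AllReplicas   = "AllReplicas"
)

// backupTargets lists the hosts that should run a backup for a cluster with
// the given number of shards and replicas per shard.
func backupTargets(strategy string, shards, replicas int) []Target {
	var out []Target
	for s := 0; s < shards; s++ {
		for r := 0; r < replicas; r++ {
			if strategy == FirstPerShard && r > 0 {
				break // one replica per shard is enough for Replicated* tables
			}
			out = append(out, Target{Shard: s, Replica: r})
		}
	}
	return out
}

func main() {
	fmt.Println(backupTargets(FirstPerShard, 2, 3))
	fmt.Println(backupTargets(AllReplicas, 2, 2))
}
```

The same `FirstPerShard` selection would also describe where restore applies data, since the remaining replicas are expected to catch up through native replication.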

Restore safety (per mature DB-operator conventions, e.g. CloudNativePG)

Preflight validation (target CHI Completed, topology reachable) surfaced via conditions.
Overwrite guard: refuses a non-empty target unless overwrite: true.
Restoring into a fresh, empty CHI is the recommended path. Restores run as one-shot Jobs
(backoffLimit: 0, restartPolicy: Never), so a failed restore is not retried automatically.
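
As a rough illustration of the checks listed above, the preflight decision could look like this in Go. The `TargetState` struct, its fields, and the error values are hypothetical names chosen for this sketch. How the controller actually observes emptiness or reachability is not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// TargetState is a hypothetical snapshot of what the controller observed
// about the restore target before creating the Job.
type TargetState struct {
	CHIStatus      string // e.g. "Completed"
	HostsReachable bool
	UserTables     int // number of non-system tables found on the target
}

var (
	errNotReady    = errors.New("target CHI is not Completed")
	errUnreachable = errors.New("target topology is not reachable")
	errNotEmpty    = errors.New("target is not empty and overwrite is false")
)

// preflight returns nil when a restore may proceed, otherwise the first
// failing check, which a controller could surface as a condition.
func preflight(st TargetState, overwrite bool) error {
	switch {
	case st.CHIStatus != "Completed":
		return errNotReady
	case !st.HostsReachable:
		return errUnreachable
	case st.UserTables > 0 && !overwrite:
		return errNotEmpty
	}
	return nil
}

func main() {
	st := TargetState{CHIStatus: "Completed", HostsReachable: true, UserTables: 3}
	fmt.Println(preflight(st, false)) // refused by the overwrite guard
	fmt.Println(preflight(st, true))  // allowed
}
```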

Included

API types + hand-written deepcopy (regenerable via dev/run_code_generator.sh).
Controllers + minimal Job/CronJob builders + operator wiring (thread_backup.go).
Hand-written CRD section + RBAC (incl. batch jobs/cronjobs); regenerated install
bundles and Helm chart.
docs/backup.md + docs/chb-examples/.
Go unit tests (builders + controller reconcile) and a TestFlows e2e test
(tests/e2e/test_backup_restore.py) doing a backup→restore round-trip with replica-sync
verification.

Notes / follow-ups

Host service names are resolved from cluster layout counts using the default naming
scheme; explicit shard/replica lists / custom host names are a planned follow-up.
For sharded clusters the sidecar remote path should include the {shard} macro
(e.g. S3_PATH: backup/shard-{shard}).
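
To make the two notes above concrete, here is a small Go sketch. It assumes the operator's default host service naming scheme is `chi-{chi}-{cluster}-{shard}-{replica}` with zero-based indexes. The function names are hypothetical. `expandShardMacro` only illustrates the effect of the `{shard}` macro; the real substitution is performed by clickhouse-backup, not by the operator.

```go
package main

import (
	"fmt"
	"strings"
)

// hostServiceName builds a per-host service name, assuming the default
// naming scheme chi-{chi}-{cluster}-{shard}-{replica}.
func hostServiceName(chi, cluster string, shard, replica int) string {
	return fmt.Sprintf("chi-%s-%s-%d-%d", chi, cluster, shard, replica)
}

// hostsFromCounts derives every host service name from layout counts alone,
// which is why custom host names are not covered by this approach.
func hostsFromCounts(chi, cluster string, shards, replicas int) []string {
	hosts := make([]string, 0, shards*replicas)
	for s := 0; s < shards; s++ {
		for r := 0; r < replicas; r++ {
			hosts = append(hosts, hostServiceName(chi, cluster, s, r))
		}
	}
	return hosts
}

// expandShardMacro shows what a path such as backup/shard-{shard} looks like
// once the macro is substituted (illustration only).
func expandShardMacro(path, shard string) string {
	return strings.ReplaceAll(path, "{shard}", shard)
}

func main() {
	fmt.Println(hostsFromCounts("demo", "main", 2, 2))
	fmt.Println(expandShardMacro("backup/shard-{shard}", "1"))
}
```

Without the macro, every shard's sidecar would write to the same remote prefix, so backups with the same name from different shards could collide.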

🤖 Generated with Claude Code

discostur · 2026-06-30T09:12:36Z

Pushed an update extending the feature with several enhancements:

Selective backups — spec.tables (clickhouse-backup --tables pattern) and spec.partitions.
Incremental backups — spec.diffFromRemote → create_remote --diff-from-remote=<base>.
Retention — spec.keepLastRemote keeps only the N most recent remote backups (best-effort prune via system.backup_list + delete remote).
Verification — spec.verify runs a post-backup job that downloads the backup and checks its integrity (no cluster data touched); result surfaced as the Verified condition + a clickhouse_operator_backup_verifications_failed metric.
Observability — Prometheus metrics on the operator's existing :9999/metrics endpoint (clickhouse_operator_backups_* / restores_*, a duration histogram, last-success timestamp), Kubernetes Events on backup/restore lifecycle, and duration in status + a duration printer column.
Bootstrap-from-backup — annotate a fresh ClickHouseInstallation with clickhouse.altinity.com/recover-from-backup: <name> and the operator auto-creates a one-time ClickHouseRestore once the cluster is up.
Compression & encryption — documented as clickhouse-backup sidecar settings (COMPRESSION_FORMAT, S3 S3_SSE/SSE_KMS_KEY_ID, etc.).
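
The mapping from spec fields to a clickhouse-backup command, as described for selective and incremental backups, might be sketched in Go like this. The `BackupSpec` struct and its field types are assumptions for illustration (the PR's actual API types may differ). The flags `--tables`, `--partitions` and `--diff-from-remote` are existing `create_remote` options.

```go
package main

import (
	"fmt"
	"strings"
)

// BackupSpec is a hypothetical, reduced view of the ClickHouseBackup spec.
type BackupSpec struct {
	Name           string   // backup name passed to clickhouse-backup
	Tables         string   // spec.tables, a --tables pattern such as "db.*"
	Partitions     []string // spec.partitions
	DiffFromRemote string   // spec.diffFromRemote, name of the base backup
}

// createRemoteCommand renders the create_remote command line for the spec,
// adding a flag only when the corresponding field is set.
func createRemoteCommand(s BackupSpec) string {
	parts := []string{"create_remote"}
	if s.Tables != "" {
		parts = append(parts, "--tables="+s.Tables)
	}
	if len(s.Partitions) > 0 {
		parts = append(parts, "--partitions="+strings.Join(s.Partitions, ","))
	}
	if s.DiffFromRemote != "" {
		parts = append(parts, "--diff-from-remote="+s.DiffFromRemote)
	}
	parts = append(parts, s.Name)
	return strings.Join(parts, " ")
}

func main() {
	full := BackupSpec{Name: "base-2026-06-29"}
	incr := BackupSpec{Name: "incr-2026-06-30", Tables: "analytics.*", DiffFromRemote: full.Name}
	fmt.Println(createRemoteCommand(full))
	fmt.Println(createRemoteCommand(incr))
}
```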

I also corrected the replicated-restore path to issue the schema ON CLUSTER (requires the sidecar's restore_schema_on_cluster) instead of per-replica.

Introduces operator-managed backup and restore for ClickHouse using clickhouse-backup, exposed through three new custom resources in the clickhouse.altinity.com/v1 API group:

- ClickHouseBackup (chb): one-off backup -> Kubernetes Job
- ClickHouseBackupSchedule (chbs): recurring backup -> managed CronJob
- ClickHouseRestore (chr): one-off restore -> Kubernetes Job

The controllers follow the existing ClickHouseKeeper controller-runtime pattern. clickhouse-backup runs as a sidecar (a documented prerequisite); the generated jobs trigger it remotely through the system.backup_actions integration table, so no backup logic is reimplemented in the operator.

Cluster-aware: backs up one replica per shard for Replicated* tables (AllReplicas opt-in for non-replicated data); on restore it applies the schema on the first replica per shard via ON CLUSTER (requires the sidecar's restore_schema_on_cluster) and the data on the first replica, letting native replication synchronize the remaining replicas.

Restore safety follows the conventions of mature DB operators: preflight validation (target CHI Completed, topology reachable) and an overwrite guard that refuses a non-empty target unless overwrite=true.

Also adds: selective (tables/partitions) and incremental (--diff-from-remote) backups; remote-backup retention (keepLastRemote); optional post-backup verification; Prometheus metrics on the operator's existing :9999 endpoint plus Kubernetes Events; and annotation-driven bootstrap-from-backup for new installations. Compression and encryption are documented as clickhouse-backup sidecar settings.

Includes the CRDs, RBAC (incl. batch jobs/cronjobs), regenerated install bundles and Helm chart, documentation and examples, Go unit tests and a TestFlows e2e test.

Refs Altinity#1795, Altinity#862. Supersedes the gRPC-plugin approach of Altinity#1798.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Kilian Ries <mail@kilian-ries.de>

discostur · 2026-06-30T12:29:13Z

Tested locally in a dev k8s cluster ... happy to get some feedback from the maintainers ;)

discostur force-pushed the feature/clickhouse-backup branch from 3ce23ba to 69632ec Compare June 29, 2026 18:32

discostur force-pushed the feature/clickhouse-backup branch from 69632ec to 56b62c0 Compare June 30, 2026 09:12

discostur force-pushed the feature/clickhouse-backup branch from 56b62c0 to d5f2eb2 Compare June 30, 2026 09:41

discostur wants to merge 1 commit into Altinity:0.27.1 from discostur:feature/clickhouse-backup
