Search before asking
Motivation
Today, enabling lakehouse for an existing table only works reliably if the table was created after the cluster had already enabled datalake support. Here, "enabled datalake support" means the cluster had already configured datalake.format; in the current behavior, setting datalake.format is treated as enabling datalake support. This causes a compatibility problem for the following user flow:
1. Create a Fluss table when the cluster has not explicitly enabled lakehouse.
2. Later configure the cluster to enable lakehouse.
3. Enable lakehouse for the existing table.
At the moment, step 3 fails for tables created before cluster-level lakehouse was enabled. The root issue is that datalake.format currently serves two roles at the same time:
- selecting the lake-format-specific bucketing / key-encoding behavior; and
- indicating that the cluster is ready to create and manage lake tables.
This makes the semantics unclear for new deployments that want to pre-bind the future lake format (for example Paimon, so that bucketing stays consistent) but do not want users to enable lakehouse for tables until the cluster is explicitly switched on.
We need a backward-compatible way to separate "legacy cluster behavior" from "new cluster behavior", while still allowing tables created before table.datalake.enabled=true to be enabled later if their bucketing format is already predetermined.
Solution
Introduce a new cluster config datalake.enabled with compatibility semantics:
- datalake.enabled is unset: treat the cluster as a legacy cluster and keep the current behavior unchanged.
- datalake.enabled=false: treat the cluster as a new-style cluster in "pre-bind only" mode.
- datalake.enabled=true: treat the cluster as a new-style cluster with lakehouse fully enabled.
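The three states above could be resolved from the raw cluster config roughly as follows. This is only a sketch to make the tri-state semantics concrete: the `DatalakeMode` enum and its `resolve` method are hypothetical names, not existing Fluss code, and rejecting values other than `true`/`false` is an assumption.

```java
import java.util.Map;

/** Hypothetical tri-state view of the proposed datalake.enabled config (not Fluss code). */
enum DatalakeMode {
    LEGACY,        // datalake.enabled is unset: keep current behavior
    PRE_BIND_ONLY, // datalake.enabled=false: bind the format, no lake tables yet
    ENABLED;       // datalake.enabled=true: lakehouse fully enabled

    static DatalakeMode resolve(Map<String, String> clusterConf) {
        String raw = clusterConf.get("datalake.enabled");
        if (raw == null) {
            return LEGACY;
        }
        String value = raw.trim();
        if (value.equalsIgnoreCase("true")) {
            return ENABLED;
        }
        if (value.equalsIgnoreCase("false")) {
            return PRE_BIND_ONLY;
        }
        // Assumption: any other value is a configuration error.
        throw new IllegalArgumentException("Invalid datalake.enabled: " + raw);
    }
}
```

The key point is that "unset" is a distinct state from "false", which is what lets existing deployments keep their current behavior without any config change.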
For clusters where datalake.enabled is explicitly configured (either true or false):
- require datalake.format to be configured;
- automatically persist table.datalake.format=<cluster datalake.format> into newly created tables;
- when datalake.enabled=false, do not allow creating/enabling lake tables yet;
- when datalake.enabled=true, allow ALTER TABLE ... SET ('table.datalake.enabled'='true') for tables whose table.datalake.format already matches the cluster datalake.format.
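As an illustration of the first two rules, table creation on a new-style cluster might apply defaults along these lines. The class `TableCreationDefaults` and its `apply` method are hypothetical and do not mirror the real `CoordinatorService.applySystemDefaults(...)` signature. Keeping a user-supplied table.datalake.format instead of overwriting it is an assumption, and the legacy path is deliberately left unmodeled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of pre-binding the lake format at table creation (not Fluss code). */
final class TableCreationDefaults {
    private TableCreationDefaults() {}

    static Map<String, String> apply(Map<String, String> clusterConf,
                                     Map<String, String> tableProps) {
        Map<String, String> result = new HashMap<>(tableProps);
        if (!clusterConf.containsKey("datalake.enabled")) {
            // Legacy cluster: existing behavior applies and is not modeled in this sketch.
            return result;
        }
        String format = clusterConf.get("datalake.format");
        if (format == null || format.trim().isEmpty()) {
            // New-style clusters must configure datalake.format.
            throw new IllegalStateException(
                    "datalake.enabled is set but datalake.format is missing");
        }
        // Assumption: an explicitly provided table-level format is left as-is.
        result.putIfAbsent("table.datalake.format", format);
        return result;
    }
}
```

With datalake.enabled=false and datalake.format=paimon, a table created with no lake properties would come back with table.datalake.format=paimon already persisted, which is what allows it to be enabled later.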
This keeps old clusters fully compatible while enabling the desired flow for new clusters:
1. Create cluster with datalake.enabled=false and datalake.format=paimon.
2. Create table; Fluss auto-persists table.datalake.format=paimon, so writes already follow Paimon bucketing.
3. Later switch cluster to datalake.enabled=true.
4. Enable lakehouse for the existing table successfully.
Suggested validation rules:
- If datalake.enabled is explicitly set but datalake.format is missing, fail fast.
- If a table has no persisted table.datalake.format, keep rejecting later lakehouse enablement to avoid bucket inconsistency.
- If a table's table.datalake.format differs from the cluster datalake.format, reject enablement.
- In new-style clusters, datalake.format should be treated as immutable (or at least strongly restricted) once tables have been created with the pre-bound format.
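The enablement checks above could be expressed as a single validation step for ALTER TABLE ... SET ('table.datalake.enabled'='true') on a new-style cluster. The sketch below uses a hypothetical `LakeEnablementCheck` class. It assumes exact string comparison of format names and covers only clusters where datalake.enabled is explicitly set. The immutability rule for datalake.format is not covered.

```java
import java.util.Optional;

/** Hypothetical validation for enabling lakehouse on an existing table (not Fluss code). */
final class LakeEnablementCheck {
    private LakeEnablementCheck() {}

    /** Returns an empty Optional if enablement is allowed, otherwise the rejection reason. */
    static Optional<String> rejectionReason(boolean clusterEnabled,
                                            String clusterFormat,
                                            String tableFormat) {
        if (clusterFormat == null || clusterFormat.trim().isEmpty()) {
            // Fail fast: this is a cluster misconfiguration, not a table-level problem.
            throw new IllegalStateException(
                    "datalake.enabled is set but datalake.format is missing");
        }
        if (!clusterEnabled) {
            return Optional.of("cluster is in pre-bind only mode (datalake.enabled=false)");
        }
        if (tableFormat == null) {
            return Optional.of("table has no persisted table.datalake.format");
        }
        if (!tableFormat.equals(clusterFormat)) {
            return Optional.of("table.datalake.format=" + tableFormat
                    + " differs from cluster datalake.format=" + clusterFormat);
        }
        return Optional.empty();
    }
}
```

A table pre-bound to paimon on a cluster with datalake.enabled=true and datalake.format=paimon passes this check, while a table with no persisted format is still rejected, as in the rules above.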
Affected areas likely include:
- cluster config parsing / compatibility checks;
- CoordinatorService.applySystemDefaults(...);
- LakeCatalogDynamicLoader load conditions;
- alter-table validation for table.datalake.enabled.
Anything else?
This issue is mainly about compatibility and semantic clarity:
- old clusters should continue to behave exactly as they do today;
- new clusters should be able to pre-bind lake-format bucketing without exposing lakehouse functionality too early;
- users should be able to create a table first, enable cluster lakehouse later, and then enable lakehouse on that table successfully.
Willingness to contribute