Feature Summary
A warehouse here is a top-level entity in the catalog hierarchy (Project → Warehouse → Namespace → Table) that owns a set of namespaces (results, runtime_stats, console_logs) and the storage configuration (S3 bucket + credentials) backing their tables. This follows the Lakekeeper warehouse concept.
Today Texera writes all execution outputs (results, runtime_stats, console_logs) into a single global Iceberg warehouse. All users share that one warehouse, and the platform absorbs the storage cost.
This issue proposes a per-user warehouse model: each user registers one or more warehouses, each backed by their own S3 bucket (Bring-Your-Own-S3). Storage cost follows the data owner; users get tenant-isolated namespaces and tables.
Background / Motivation
Billing. S3 cost should be attributed to the user who owns the data, not the platform.
Isolation. Per-tenant namespaces/tables, no shared blast radius.
Scope
Per-user warehouses are scoped to the Kubernetes deployment. Local / single-node Docker Compose deployments continue to work as today: PsqlCatalog remains supported and unchanged, and RestCatalog mode keeps its current single global Lakekeeper warehouse (no per-user split).
Catalog hierarchy
Texera already has two Catalog implementations:
Catalog (interface)
├── PsqlCatalog — backed by PostgreSQL
└── RestCatalog — backed by any Iceberg REST Catalog service (Lakekeeper is one implementation of this)
This design uses RestCatalog with Lakekeeper as the REST Catalog service to deliver per-user warehouses. Lakekeeper owns S3 credentials in its own encrypted DB (Postgres); Texera never persists raw S3 creds, only the Lakekeeper warehouse UUID and non-secret metadata.
Proposed Solution or Design
Design 1:
User ─1:N→ Warehouse (new)
User ─1:N→ ComputingUnit (existing)
ComputingUnit ─1:N→ Execution (existing)
Warehouse ─1:N→ ComputingUnit (new association)
Flow A — Registering a warehouse (same for both designs)
User fills the new Dashboard "Warehouse" tab with S3 bucket / endpoint / region / credentials.
Backend posts the credentials directly to Lakekeeper to create the warehouse. Creds never touch the Texera DB.
Lakekeeper returns the warehouse UUID; Texera stores the reference plus non-secret metadata.
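The split between what is forwarded to Lakekeeper and what Texera keeps could look roughly like the sketch below. It is only an illustration under stated assumptions: `WarehouseRequest`, `WarehouseRecord`, `LakekeeperClient`, and `WarehouseRegistrar` are hypothetical names, not Texera or Lakekeeper APIs, and the fields simply mirror the form inputs listed above.

```java
import java.util.UUID;

// Hypothetical form payload: carries the secret S3 credentials.
final class WarehouseRequest {
    final String name, bucket, endpoint, region, accessKey, secretKey;

    WarehouseRequest(String name, String bucket, String endpoint, String region,
                     String accessKey, String secretKey) {
        this.name = name;
        this.bucket = bucket;
        this.endpoint = endpoint;
        this.region = region;
        this.accessKey = accessKey;
        this.secretKey = secretKey;
    }
}

// Hypothetical row kept in the Texera DB: the Lakekeeper reference plus
// non-secret metadata. There is deliberately no credential field.
final class WarehouseRecord {
    final int ownerUid;
    final UUID lakekeeperWarehouseId;
    final String name, bucket, endpoint, region;

    WarehouseRecord(int ownerUid, UUID lakekeeperWarehouseId, String name,
                    String bucket, String endpoint, String region) {
        this.ownerUid = ownerUid;
        this.lakekeeperWarehouseId = lakekeeperWarehouseId;
        this.name = name;
        this.bucket = bucket;
        this.endpoint = endpoint;
        this.region = region;
    }
}

// Hypothetical stand-in for the backend call that creates a warehouse in Lakekeeper.
interface LakekeeperClient {
    UUID createWarehouse(WarehouseRequest request);
}

final class WarehouseRegistrar {
    private final LakekeeperClient lakekeeper;

    WarehouseRegistrar(LakekeeperClient lakekeeper) {
        this.lakekeeper = lakekeeper;
    }

    WarehouseRecord register(int ownerUid, WarehouseRequest req) {
        // The credentials are passed on to Lakekeeper and never copied into the record.
        UUID id = lakekeeper.createWarehouse(req);
        return new WarehouseRecord(ownerUid, id, req.name, req.bucket, req.endpoint, req.region);
    }
}
```

Keeping the credential fields out of the record type means that code persisting a `WarehouseRecord` has no credential to write by accident.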
Sequence diagram:
Flow B — Binding a warehouse to a CU (For Design 1)
When the user creates a computing unit (CU), they pick which warehouse it uses.
At execution time, Texera instantiates a RestCatalog for that CU using the warehouse's Lakekeeper UUID — no global singleton on the hot path.
Two-layer split at runtime:
Catalog path — RestCatalog talks to Lakekeeper for metadata operations (resolve table, create / commit snapshots, schema changes). Lakekeeper owns the warehouse → S3 path mapping.
Data path — the Iceberg client reads/writes Parquet directly to the user's S3 bucket, using short-lived credentials vended by Lakekeeper per request. Lakekeeper does not proxy S3 traffic.
Files land in the user's S3 bucket under the warehouse's root prefix, organized by namespace (results / runtime_stats / console_logs) and per-execution table.
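As a rough sketch of the per-CU instantiation, the helper below builds the property map an Iceberg REST catalog client would be initialized with. The assumptions are these: `WarehouseCatalogProperties` and its parameters are hypothetical names; `uri` and `warehouse` are the standard Iceberg catalog property keys; the access-delegation header is the usual way a REST client asks for vended credentials; and whether Lakekeeper expects the warehouse UUID or its name in `warehouse` is left open here, as it is in this proposal.

```java
import java.util.HashMap;
import java.util.Map;

final class WarehouseCatalogProperties {
    private WarehouseCatalogProperties() {}

    // Builds the properties for one warehouse-scoped REST catalog instance.
    static Map<String, String> forWarehouse(String lakekeeperUri, String warehouseRef) {
        Map<String, String> props = new HashMap<String, String>();
        // Catalog path: metadata operations go to the REST catalog service.
        props.put("uri", lakekeeperUri);
        // Selects which warehouse this catalog instance is bound to.
        props.put("warehouse", warehouseRef);
        // Data path: ask the service to vend short-lived storage credentials
        // instead of configuring static S3 keys on the client.
        props.put("header.X-Iceberg-Access-Delegation", "vended-credentials");
        return props;
    }
}
```

With Iceberg on the classpath, a map like this would typically be passed to `org.apache.iceberg.rest.RESTCatalog#initialize(name, properties)`. No S3 key appears in the client configuration, which matches the two-layer split described above.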
Sequence diagram (CU creation + RestCatalog instantiation):
For the execution diagram, see #4126.
Design 2:
Data model
User ─1:N→ Warehouse (new)
User ─1:N→ ComputingUnit (existing)
ComputingUnit ─1:N→ Execution (existing)
Warehouse ─1:N→ Execution (new association)
ER diagram:
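The Design 2 associations can be sketched as plain types in which the execution, not the CU, carries the warehouse reference. This is an assumption-laden illustration: the class and field names (`Warehouse`, `Execution`, `ExecutionBinding`, `eid`, `cuid`, `ownerUid`) are hypothetical, and only `whid` is taken from this proposal. The ownership check reflects the rule that an execution may target any warehouse the submitting user owns.

```java
// Hypothetical warehouse row: owned by exactly one user.
final class Warehouse {
    final int whid;
    final int ownerUid;
    final String name;

    Warehouse(int whid, int ownerUid, String name) {
        this.whid = whid;
        this.ownerUid = ownerUid;
        this.name = name;
    }
}

// Hypothetical execution row: runs on a CU and writes to one warehouse.
final class Execution {
    final long eid;
    final int cuid;
    final int whid;

    Execution(long eid, int cuid, int whid) {
        this.eid = eid;
        this.cuid = cuid;
        this.whid = whid;
    }
}

final class ExecutionBinding {
    private ExecutionBinding() {}

    // Binds a new execution to the chosen warehouse, rejecting warehouses
    // that the requesting user does not own.
    static Execution bind(long eid, int cuid, int requestingUid, Warehouse target) {
        if (target.ownerUid != requestingUid) {
            throw new IllegalArgumentException(
                "user " + requestingUid + " does not own warehouse " + target.whid);
        }
        return new Execution(eid, cuid, target.whid);
    }
}
```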
Flow A — Registering a warehouse (same for both designs)
User fills the new Dashboard "Warehouse" tab with S3 bucket / endpoint / region / credentials.
Backend posts the credentials directly to Lakekeeper to create the warehouse. Creds never touch the Texera DB.
Lakekeeper returns the warehouse UUID; Texera stores the reference plus non-secret metadata.
Sequence diagram:
Flow B — Binding a warehouse to an Execution (For Design 2)
CU creation does not ask for a warehouse. CUs are warehouse-agnostic, and one CU can serve executions targeting any warehouse the user owns.
The user picks the warehouse this execution will write to (from a warehouse selector in the workflow toolbar, similar to selecting a CU).
The submit-execution RPC to the CU carries the resolved whid/Lakekeeper warehouse name.
The CU JVM maintains a per-warehouse RestCatalog cache (Map[warehouseName, RestCatalog]). For each arriving execution:
- Cache hit → reuses the existing instance
- Cache miss → lazily initializes a new RestCatalog for that warehouse. Adding a new entry is atomic and does not touch other entries; in-flight executions on other warehouses are unaffected.
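One way to get the hit/miss behavior described above is `ConcurrentHashMap.computeIfAbsent`, which creates the entry for a key at most once and leaves other keys alone. The sketch below is a minimal illustration, not the proposed implementation. `WarehouseCatalogCache` is a hypothetical name, and the catalog type is left generic so the example does not depend on Iceberg. In practice the factory would build a RestCatalog for the given warehouse.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Per-warehouse catalog cache, keyed by warehouse name.
final class WarehouseCatalogCache<C> {
    private final ConcurrentMap<String, C> catalogs = new ConcurrentHashMap<String, C>();
    private final Function<String, C> factory;

    WarehouseCatalogCache(Function<String, C> factory) {
        this.factory = factory;
    }

    // Cache hit: returns the existing instance.
    // Cache miss: runs the factory once for this key and stores the result.
    C catalogFor(String warehouseName) {
        return catalogs.computeIfAbsent(warehouseName, factory);
    }

    int size() {
        return catalogs.size();
    }
}
```

One caveat with this choice: while the factory runs, `ConcurrentHashMap` may briefly block other updates to keys in the same hash bin, so a slow catalog initialization could delay a concurrent first use of another warehouse. Lookups of entries that already exist are not affected.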
Two-layer split at runtime (same as Design 1):
- Catalog path — the per-warehouse RestCatalog talks to Lakekeeper for metadata operations.
- Data path — Iceberg reads/writes Parquet directly to the user's S3 bucket via Lakekeeper-vended short-lived credentials. Lakekeeper does not proxy S3 traffic.
Result reading on the Amber side looks up workflow_executions.whid first, then routes the IcebergDocument read through the corresponding RestCatalog.
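The two-step read routing could be sketched as follows. This assumes an in-memory map standing in for the `workflow_executions.whid` lookup and a function standing in for the per-warehouse catalog lookup. `ResultReadRouter` and its members are hypothetical names, and the catalog type is generic so the example runs without Iceberg.

```java
import java.util.Map;
import java.util.function.Function;

final class ResultReadRouter<C> {
    // Stand-in for the workflow_executions.whid column, keyed by execution id.
    private final Map<Long, Integer> whidByExecution;
    // Stand-in for resolving a warehouse id to its catalog instance.
    private final Function<Integer, C> catalogByWhid;

    ResultReadRouter(Map<Long, Integer> whidByExecution, Function<Integer, C> catalogByWhid) {
        this.whidByExecution = whidByExecution;
        this.catalogByWhid = catalogByWhid;
    }

    // Step 1: find the warehouse the execution wrote to.
    // Step 2: return the catalog through which the read should go.
    C catalogForExecution(long eid) {
        Integer whid = whidByExecution.get(eid);
        if (whid == null) {
            throw new IllegalStateException("no warehouse recorded for execution " + eid);
        }
        return catalogByWhid.apply(whid);
    }
}
```

Failing loudly when no warehouse is recorded is one possible choice. How executions created before this change should be handled is not covered by this sketch.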
Sequence diagram (Execution start + RestCatalog instantiation):
Note that the CU currently communicates directly with Postgres. Issue #5011 tracks this, but it is out of scope for this issue.
For the execution diagram, see #4126.
Open questions
Shared CU (Design 1): when User A runs a workflow on a CU owned by User B, whose warehouse stores the results? In other words, should User A be allowed to store results in User B's warehouse?
Warehouse deletion semantics: hard-delete the Lakekeeper catalog and leave S3 data orphaned in the user's bucket (Texera has no write access to user buckets), or soft-archive the catalog so existing executions stay readable until the user explicitly purges?