Per-user Iceberg warehouse with bring-your-own S3 storage

### Feature Summary

A **warehouse** here is a top-level entity in the catalog hierarchy (`Project → Warehouse → Namespace → Table`) that owns a set of namespaces (`results`, `runtime_stats`, `console_logs`) and the storage configuration (S3 bucket + credentials) backing their tables. This follows the [Lakekeeper warehouse concept](https://docs.lakekeeper.io/docs/nightly/concepts/).

Today Texera writes all execution outputs (`results`, `runtime_stats`, `console_logs`) into a single **global Iceberg warehouse**. All users share that one warehouse, and the platform absorbs the storage cost.

This issue proposes a **per-user warehouse** model: each user registers one or more warehouses, each backed by **their own S3 bucket** (Bring-Your-Own-S3). Storage cost follows the data owner; users get tenant-isolated namespaces and tables.

### Background / Motivation

- **Billing.** S3 cost should be attributed to the user who owns the data, not the platform.
- **Isolation.** Per-tenant namespaces/tables, no shared blast radius.
- **Builds on #4126.** That issue introduced the REST Catalog Service (Lakekeeper) layer. This issue is the next step: making Lakekeeper multi-tenant.

#### Scope

Per-user warehouses are scoped to the **Kubernetes deployment**. Local / single-node Docker Compose deployments continue to work as today: `PsqlCatalog` remains supported and unchanged, and `RestCatalog` mode keeps its current single global Lakekeeper warehouse (no per-user split).

#### Catalog hierarchy

Texera already has two `Catalog` implementations:

```
Catalog (interface)
├── PsqlCatalog          — backed by PostgreSQL
└── RestCatalog          — backed by any Iceberg REST Catalog service (Lakekeeper is one implementation of this)
```

This design uses **`RestCatalog` with Lakekeeper** as the REST Catalog service to deliver per-user warehouses. Lakekeeper owns S3 credentials in its own encrypted DB (Postgres); **Texera never persists raw S3 creds**, only the Lakekeeper warehouse UUID and non-secret metadata.

### Proposed Solution or Design

#### Design 1:

```
User ─1:N→ Warehouse                (new)
User ─1:N→ ComputingUnit            (existing)
ComputingUnit ─1:N→ Execution       (existing)
Warehouse ─1:N→ ComputingUnit       (new)
```

<img width="603" height="396" alt="Image" src="https://github.com/user-attachments/assets/9ae1e74c-88bd-4a0e-b66f-43783a768fd9" />

#### Flow A — Registering a warehouse (same for both designs)

1. User fills the new Dashboard "Warehouse" tab with S3 bucket / endpoint / region / credentials.
2. Backend posts the credentials directly to Lakekeeper to create the warehouse. **Creds never touch the Texera DB.**
3. Lakekeeper returns the warehouse UUID; Texera stores the reference plus non-secret metadata.
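
The three steps above can be sketched as follows. This is a minimal illustration, not Texera's or Lakekeeper's actual API: `WarehouseForm`, `WarehouseRecord`, `register_warehouse`, the field names, and the injected `create_in_lakekeeper` callable are all hypothetical. The point is the split between what is forwarded to Lakekeeper (secrets included) and what Texera persists (the reference plus non-secret metadata).

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict


@dataclass(frozen=True)
class WarehouseForm:
    """What the user submits in the Dashboard "Warehouse" tab (hypothetical shape)."""
    name: str
    bucket: str
    endpoint: str
    region: str
    access_key_id: str       # secret
    secret_access_key: str   # secret


@dataclass(frozen=True)
class WarehouseRecord:
    """What Texera persists: the Lakekeeper reference plus non-secret metadata."""
    owner_uid: int
    lakekeeper_warehouse_id: str
    name: str
    bucket: str
    endpoint: str
    region: str


def register_warehouse(
    owner_uid: int,
    form: WarehouseForm,
    create_in_lakekeeper: Callable[[Dict[str, str]], str],
) -> WarehouseRecord:
    # Step 2: forward the full form, credentials included, to Lakekeeper.
    # `create_in_lakekeeper` stands in for the real HTTP call (an assumption)
    # and returns the new warehouse UUID.
    warehouse_id = create_in_lakekeeper(asdict(form))
    # Step 3: keep only the reference and the non-secret fields.
    return WarehouseRecord(
        owner_uid=owner_uid,
        lakekeeper_warehouse_id=warehouse_id,
        name=form.name,
        bucket=form.bucket,
        endpoint=form.endpoint,
        region=form.region,
    )
```

Because `WarehouseRecord` has no credential fields at all, the "creds never touch the Texera DB" rule is enforced by the type rather than by convention.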

**Sequence diagram:** 

<img width="946" height="433" alt="Image" src="https://github.com/user-attachments/assets/7306bb14-055f-459e-b20f-0800a24650e9" />

#### Flow B — Binding a warehouse to a CU (For Design 1)

1. When the user creates a CU they pick which warehouse to use.
2. At execution time, Texera instantiates a `RestCatalog` for that CU using the warehouse's Lakekeeper UUID — no global singleton on the hot path.
3. Two-layer split at runtime:
   - **Catalog path** — `RestCatalog` talks to Lakekeeper for metadata operations (resolve table, create / commit snapshots, schema changes). Lakekeeper owns the warehouse → S3 path mapping.
   - **Data path** — the Iceberg client reads/writes Parquet **directly to the user's S3 bucket**, using short-lived credentials vended by Lakekeeper per request. Lakekeeper does not proxy S3 traffic.

Files land in the user's S3 bucket under the warehouse's root prefix, organized by namespace (`results` / `runtime_stats` / `console_logs`) and per-execution table.
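
As a rough illustration of step 2, the client-side configuration for a warehouse-scoped `RestCatalog` could look like the property map below. The keys `type`, `uri`, `warehouse`, and the `header.`-prefixed entry follow the Iceberg REST client's configuration conventions as far as we understand them; the helper name, the example URL, and the choice of which warehouse identifier to pass (name or UUID) are assumptions for illustration.

```python
from typing import Dict


def rest_catalog_properties(lakekeeper_uri: str, warehouse_ref: str) -> Dict[str, str]:
    """Build the property map for one warehouse-scoped REST catalog client."""
    return {
        "type": "rest",
        "uri": lakekeeper_uri,
        # Which identifier Lakekeeper expects here (name vs. UUID) is an assumption.
        "warehouse": warehouse_ref,
        # Ask the catalog to vend short-lived storage credentials,
        # so the data path can go directly to the user's bucket.
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
    }


# Hypothetical endpoint and warehouse reference, one property map per CU.
props = rest_catalog_properties("http://lakekeeper.example:8181/catalog", "user-42-warehouse")
```

Only the `warehouse` value differs between CUs, which is what lets each CU get its own catalog instance instead of sharing a global singleton.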

**Sequence diagram (CU creation + RestCatalog instantiation):** 

<img width="954" height="391" alt="Image" src="https://github.com/user-attachments/assets/220be779-7f9d-4892-b9a1-811f41cd1353" />

<img width="945" height="379" alt="Image" src="https://github.com/user-attachments/assets/d6b46553-9a20-4d7e-a18f-0fc2c3d582da" />

For the execution diagram, see #4126.

#### Design 2:

#### Data model

```
User ─1:N→ Warehouse                (new)
User ─1:N→ ComputingUnit            (existing)
ComputingUnit ─1:N→ Execution       (existing)
Warehouse ─1:N→ Execution           (new association)
```
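
To make the new association concrete, here is a small sketch of the Design 2 relationships as plain records. The class names, `bind_execution`, and most field names are illustrative; only `whid` mirrors the `workflow_executions.whid` column mentioned under Flow B. The ownership check reflects the "any warehouse the user owns" rule and would need revisiting if warehouse sharing is allowed.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Warehouse:
    whid: int
    owner_uid: int                 # User 1:N Warehouse
    lakekeeper_warehouse_id: str   # reference only, no credentials


@dataclass(frozen=True)
class Execution:
    eid: int
    cuid: int   # existing: ComputingUnit 1:N Execution
    whid: int   # new: Warehouse 1:N Execution


def bind_execution(
    eid: int, cuid: int, uid: int, whid: int, warehouses: Dict[int, Warehouse]
) -> Execution:
    """Create an execution bound to a warehouse owned by the submitting user."""
    warehouse = warehouses.get(whid)
    if warehouse is None or warehouse.owner_uid != uid:
        raise PermissionError(f"user {uid} does not own warehouse {whid}")
    return Execution(eid=eid, cuid=cuid, whid=whid)
```

Note that `ComputingUnit` carries no warehouse reference here; under Design 1 the `whid` field would live on the computing unit instead of the execution.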

**ER diagram:** 

<img width="582" height="346" alt="Image" src="https://github.com/user-attachments/assets/504241a8-ed50-40d9-8117-a3b41b6654d7" />


#### Flow A — Registering a warehouse (same for both designs)

1. User fills the new Dashboard "Warehouse" tab with S3 bucket / endpoint / region / credentials.
2. Backend posts the credentials directly to Lakekeeper to create the warehouse. **Creds never touch the Texera DB.**
3. Lakekeeper returns the warehouse UUID; Texera stores the reference plus non-secret metadata.

**Sequence diagram:** 

<img width="946" height="433" alt="Image" src="https://github.com/user-attachments/assets/7306bb14-055f-459e-b20f-0800a24650e9" />

#### Flow B — Binding a warehouse to an Execution (For Design 2)
  
1. CU creation does not ask for a warehouse. CUs are warehouse-agnostic, and one CU can serve executions targeting any warehouse the user owns.
2. The user picks the warehouse this execution will write to, from a warehouse selector in the workflow toolbar (similar to selecting a CU).
3. The submit-execution RPC to the CU carries the resolved `whid` / Lakekeeper warehouse name.
4. The CU JVM maintains a per-warehouse `RestCatalog` cache (`Map[warehouseName, RestCatalog]`). For each arriving execution:
   - **Cache hit** → reuses the existing instance.
   - **Cache miss** → lazily initializes a new `RestCatalog` for that warehouse. Adding a new entry is atomic and does not touch other entries; in-flight executions on other warehouses are unaffected.
5. Two-layer split at runtime (same as Design 1):
   - **Catalog path**: the per-warehouse `RestCatalog` talks to Lakekeeper for metadata operations.
   - **Data path**: Iceberg reads/writes Parquet directly to the user's S3 bucket via Lakekeeper-vended short-lived credentials. Lakekeeper does not proxy S3 traffic.
6. Result reading on the Amber side looks up `workflow_executions.whid` first, then routes the `IcebergDocument` read through the corresponding `RestCatalog`.
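
The cache behavior in step 4 can be sketched as below. This is written in Python for brevity and is only an illustration; on the CU JVM the natural equivalent would be `ConcurrentHashMap.computeIfAbsent`. The class name `CatalogCache` and the injected `factory` (which stands in for constructing a real `RestCatalog`) are hypothetical.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

C = TypeVar("C")


class CatalogCache(Generic[C]):
    """Per-warehouse cache: at most one catalog client per warehouse name."""

    def __init__(self, factory: Callable[[str], C]) -> None:
        self._factory = factory            # builds a client for one warehouse
        self._catalogs: Dict[str, C] = {}
        self._lock = threading.Lock()

    def get(self, warehouse_name: str) -> C:
        # Cache hit: return the existing instance without taking the lock.
        catalog = self._catalogs.get(warehouse_name)
        if catalog is not None:
            return catalog
        # Cache miss: the lock makes check-then-insert atomic, so concurrent
        # executions on the same warehouse end up sharing one client.
        with self._lock:
            catalog = self._catalogs.get(warehouse_name)
            if catalog is None:
                catalog = self._factory(warehouse_name)
                self._catalogs[warehouse_name] = catalog
            return catalog
```

One simplification to be aware of: a single lock serializes concurrent cache misses, including misses for different warehouses. Executions that already hold a catalog, and cache hits, are unaffected, which matches the intent of step 4; a per-key scheme would remove the remaining contention.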

**Sequence diagram (Execution start + RestCatalog instantiation):** 

<img width="890" height="346" alt="Image" src="https://github.com/user-attachments/assets/49037020-1054-4536-bb51-a180259ffea3" />

Note that the CU currently communicates directly with Postgres. This is tracked in https://github.com/apache/texera/issues/5011 and is out of scope for this issue.

<img width="945" height="379" alt="Image" src="https://github.com/user-attachments/assets/3a2b3a64-db9c-494e-b905-db03a8f74755" />

For the execution diagram, see #4126.

### Open questions


- Should we allow a warehouse to be shared between users?
- Shared CU (Design 1): when User A runs a workflow on a CU owned by User B, whose warehouse stores the results? In other words, should User A be allowed to store results in User B's warehouse?
- Warehouse deletion semantics: hard-delete the Lakekeeper catalog and leave S3 data orphaned in the user's bucket (Texera has no write access to user buckets), or soft-archive the catalog so existing executions stay readable until the user explicitly purges?



