Commit 6584427

Merge branch 'main' into add-databricks-query-tags-session-properties
2 parents 041844c + bfda1b9 commit 6584427

97 files changed

Lines changed: 12938 additions & 224 deletions

.github/scripts/wait-for-db.sh

Lines changed: 31 additions & 0 deletions
@@ -50,6 +50,37 @@ spark_ready() {
     probe_port 15002
 }
 
+starrocks_ready() {
+    probe_port 9030
+
+    echo "Checking for 1 alive StarRocks backends..."
+    sleep 5
+
+    while true; do
+        echo "Checking StarRocks backends..."
+        ALIVE_BACKENDS=$(docker exec -i starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e "show backends \G" | grep -c "^ *Alive: true *$")
+
+        # fallback value if failed to get number
+        if ! [[ "$ALIVE_BACKENDS" =~ ^[0-9]+$ ]]; then
+            echo "WARN: Unable to parse number of alive backends, got: '$ALIVE_BACKENDS'"
+            ALIVE_BACKENDS=0
+        fi
+
+        echo "Found $ALIVE_BACKENDS alive backends"
+
+        if [ "$ALIVE_BACKENDS" -ge 1 ]; then
+            echo "StarRocks has 1 or more alive backends"
+            break
+        fi
+
+        echo "Waiting for more backends to become alive..."
+        sleep 5
+    done
+
+    # set default replication num to 1 (there is only one be in the docker compose file)
+    docker exec -i starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e "ADMIN SET frontend config ('default_replication_num' = '1');"
+}
+
 trino_ready() {
     # Trino has a built-in healthcheck script, just call that
     docker compose -f tests/core/engine_adapter/integration/docker/compose.trino.yaml exec trino /bin/bash -c '/usr/lib/trino/bin/health-check'
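
The backend-counting step in `starrocks_ready()` above can be sketched in Python to make the parsing logic easier to test in isolation. This is only an illustration, not part of the commit: the function names `count_alive_backends` and `parse_count` are hypothetical, and the sample text is a made-up, abridged stand-in for `show backends \G` output.

```python
import re

# Same pattern the shell script hands to `grep -c`:
# a line that contains only "Alive: true", with optional surrounding spaces.
ALIVE_LINE = re.compile(r"^ *Alive: true *$")


def count_alive_backends(show_backends_output: str) -> int:
    """Count lines matching the 'Alive: true' pattern (what `grep -c` reports)."""
    return sum(
        1 for line in show_backends_output.splitlines() if ALIVE_LINE.match(line)
    )


def parse_count(raw: str) -> int:
    """Mirror the script's fallback: anything that is not a plain number becomes 0."""
    return int(raw) if re.fullmatch(r"[0-9]+", raw) else 0


# Hypothetical, abridged output; real `show backends \G` output has many more fields.
sample = """\
*************************** 1. row ***************************
            BackendId: 10001
                Alive: true
*************************** 2. row ***************************
            BackendId: 10002
                Alive: false
"""

alive = count_alive_backends(sample)
ready = alive >= 1  # the script breaks out of its loop on this condition
```

The fallback in `parse_count` corresponds to the `WARN: Unable to parse number of alive backends` branch: if the `docker exec` call produces no numeric output, the loop treats it as zero alive backends and retries.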

.github/workflows/pr.yaml

Lines changed: 1 addition & 1 deletion
@@ -252,7 +252,7 @@ jobs:
       fail-fast: false
       matrix:
         engine:
-          [duckdb, postgres, mysql, mssql, trino, spark, clickhouse, risingwave]
+          [duckdb, postgres, mysql, mssql, trino, spark, clickhouse, risingwave, starrocks]
     env:
       PYTEST_XDIST_AUTO_NUM_WORKERS: 2
       SQLMESH__DISABLE_ANONYMIZED_ANALYTICS: '1'

.readthedocs.yaml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ build:
     python: "3.10"
   jobs:
     pre_build:
-      - pip install -e ".[athena,azuresql,bigframes,bigquery,clickhouse,databricks,dbt,dlt,gcppostgres,github,llm,mssql,mysql,mwaa,postgres,redshift,slack,snowflake,trino,web,risingwave]"
+      - pip install -e ".[athena,azuresql,bigframes,bigquery,clickhouse,databricks,dbt,dlt,gcppostgres,github,llm,mssql,mysql,mwaa,postgres,redshift,slack,snowflake,starrocks,trino,web,risingwave]"
       - make api-docs
 
 mkdocs:

Makefile

Lines changed: 7 additions & 2 deletions
@@ -49,11 +49,13 @@ install-dev-dbt-%:
 	$(MAKE) install-dev; \
 	if [ "$$version" = "1.6.0" ]; then \
 		echo "Applying overrides for dbt 1.6.0"; \
-		$(PIP) install 'pydantic>=2.0.0' 'google-cloud-bigquery==3.30.0' 'databricks-sdk==0.28.0' --reinstall; \
+		$(PIP) install 'pydantic>=2.0.0' 'google-cloud-bigquery==3.30.0' 'databricks-sdk==0.28.0' \
+			'pyOpenSSL>=24.0.0' --reinstall; \
 	fi; \
 	if [ "$$version" = "1.7.0" ]; then \
 		echo "Applying overrides for dbt 1.7.0"; \
-		$(PIP) install 'databricks-sdk==0.28.0' --reinstall; \
+		$(PIP) install 'databricks-sdk==0.28.0' \
+			'pyOpenSSL>=24.0.0' --reinstall; \
 	fi; \
 	if [ "$$version" = "1.5.0" ]; then \
 		echo "Applying overrides for dbt 1.5.0"; \
@@ -212,6 +214,9 @@ trino-test: engine-trino-up
 risingwave-test: engine-risingwave-up
 	pytest -n auto -m "risingwave" --reruns 3 --junitxml=test-results/junit-risingwave.xml
 
+starrocks-test: engine-starrocks-up
+	pytest -n auto -m "starrocks" --reruns 3 --junitxml=test-results/junit-starrocks.xml
+
 #################
 # Cloud Engines #
 #################

docs/concepts/models/python_models.md

Lines changed: 27 additions & 0 deletions
@@ -369,6 +369,33 @@ def entrypoint(
     )
 ```
 
+Blueprint variables can also be used as **column names and column types** in the `columns` dictionary. For example, if each blueprint produces a model with a different set of column names and types, both can be parameterized using the same `@{variable}` syntax:
+
+```python linenums="1"
+import pandas as pd
+from sqlmesh import ExecutionContext, model
+
+@model(
+    "@{customer}.metrics",
+    kind="FULL",
+    blueprints=[
+        {"customer": "customer1", "primary_metric": "revenue", "primary_type": "int", "secondary_metric": "cost", "secondary_type": "double"},
+        {"customer": "customer2", "primary_metric": "sales", "primary_type": "text", "secondary_metric": "profit", "secondary_type": "double"},
+    ],
+    columns={
+        "@{primary_metric}": "@{primary_type}",
+        "@{secondary_metric}": "@{secondary_type}",
+    },
+)
+def entrypoint(context: ExecutionContext, **kwargs) -> pd.DataFrame:
+    return pd.DataFrame({
+        context.blueprint_var("primary_metric"): [1],
+        context.blueprint_var("secondary_metric"): [1.5],
+    })
+```
+
+Global variables (defined in the project config) can also be used as column names and types in the same way.
+
 Note the use of curly brace syntax `@{customer}` in the model name above. It is used to ensure SQLMesh can combine the macro variable into the model name identifier correctly - learn more [here](../../concepts/macros/sqlmesh_macros.md#embedding-variables-in-strings).
 
 Blueprint variable mappings can also be constructed dynamically, e.g., by using a macro: `blueprints="@gen_blueprints()"`. This is useful in cases where the `blueprints` list needs to be sourced from external sources, such as CSV files.
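
To make the effect of `@{variable}` in the `columns` dictionary concrete, here is a minimal sketch of the substitution, using the two blueprints from the documentation example above. It is not SQLMesh's implementation: `render` and `render_columns` are hypothetical helpers, and the regular expression only handles the simple `@{name}` form.

```python
import re

# Matches the curly-brace form shown in the docs, e.g. "@{primary_metric}".
VAR = re.compile(r"@\{(\w+)\}")


def render(template: str, variables: dict[str, str]) -> str:
    """Replace every @{name} in the template with its blueprint value."""
    return VAR.sub(lambda m: variables[m.group(1)], template)


def render_columns(columns: dict[str, str], blueprint: dict[str, str]) -> dict[str, str]:
    """Expand both column names and column types for a single blueprint."""
    return {render(name, blueprint): render(type_, blueprint) for name, type_ in columns.items()}


blueprints = [
    {"customer": "customer1", "primary_metric": "revenue", "primary_type": "int",
     "secondary_metric": "cost", "secondary_type": "double"},
    {"customer": "customer2", "primary_metric": "sales", "primary_type": "text",
     "secondary_metric": "profit", "secondary_type": "double"},
]
columns = {
    "@{primary_metric}": "@{primary_type}",
    "@{secondary_metric}": "@{secondary_type}",
}

# One (model name, columns) pair per blueprint.
expanded = [(render("@{customer}.metrics", bp), render_columns(columns, bp)) for bp in blueprints]
```

Under these assumptions, each blueprint yields its own model name and its own column schema, which is the behavior the added documentation paragraph describes.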

docs/guides/configuration.md

Lines changed: 2 additions & 0 deletions
@@ -920,6 +920,7 @@ These pages describe the connection configuration options for each execution eng
 * [GCP Postgres](../integrations/engines/gcp-postgres.md)
 * [Redshift](../integrations/engines/redshift.md)
 * [Snowflake](../integrations/engines/snowflake.md)
+* [StarRocks](../integrations/engines/starrocks.md)
 * [Spark](../integrations/engines/spark.md)
 * [Trino](../integrations/engines/trino.md)
 
@@ -952,6 +953,7 @@ Unsupported state engines, even for development:
 
 * [ClickHouse](../integrations/engines/clickhouse.md)
 * [Spark](../integrations/engines/spark.md)
+* [StarRocks](../integrations/engines/starrocks.md)
 * [Trino](../integrations/engines/trino.md)
 
 This example gateway configuration uses Snowflake for the data warehouse connection and Postgres for the state backend connection:

docs/guides/connections.md

Lines changed: 1 addition & 0 deletions
@@ -90,4 +90,5 @@ default_gateway: local_db
 * [Redshift](../integrations/engines/redshift.md)
 * [Snowflake](../integrations/engines/snowflake.md)
 * [Spark](../integrations/engines/spark.md)
+* [StarRocks](../integrations/engines/starrocks.md)
 * [Trino](../integrations/engines/trino.md)

docs/integrations/dlt.md

Lines changed: 5 additions & 5 deletions
@@ -28,12 +28,12 @@ This will create the configuration file and directories, which are found in all
 
 SQLMesh will also automatically generate models to ingest data from the pipeline incrementally. Incremental loading is ideal for large datasets where recomputing entire tables is resource-intensive. In this case utilizing the [`INCREMENTAL_BY_TIME_RANGE` model kind](../concepts/models/model_kinds.md#incremental_by_time_range). However, these model definitions can be customized to meet your specific project needs.
 
-#### Specify the path to the pipelines directory
+#### Specify the path to the pipelines working directory
 
-The default location for dlt pipelines is `~/.dlt/pipelines/<pipeline_name>`. If your pipelines are in a [different directory](https://dlthub.com/docs/general-usage/pipeline#separate-working-environments-with-pipelines_dir), use the `--dlt-path` argument to specify the path explicitly:
+The default location for dlt pipeline working state is `~/.dlt/pipelines/<pipeline_name>`. If dlt stores your pipeline state in a [different pipelines working directory](https://dlthub.com/docs/general-usage/pipeline#separate-working-environments-with-pipelines_dir), use the `--dlt-path` argument to specify that directory explicitly. This should be the directory where dlt stores pipeline state, not the directory containing your pipeline scripts:
 
 ```bash
-sqlmesh init -t dlt --dlt-pipeline <pipeline-name> --dlt-path <pipelines-directory> dialect
+sqlmesh init -t dlt --dlt-pipeline <pipeline-name> --dlt-path <pipelines-working-directory> dialect
 ```
 
 ### Generating models on demand
@@ -58,10 +58,10 @@ sqlmesh dlt_refresh <pipeline-name> --force
 sqlmesh dlt_refresh <pipeline-name> --table <dlt-table>
 ```
 
-- **Provide the explicit path to the pipelines directory** (using `--dlt-path`):
+- **Provide the explicit path to the pipelines working directory** (using `--dlt-path`):
 
 ```bash
-sqlmesh dlt_refresh <pipeline-name> --dlt-path <pipelines-directory>
+sqlmesh dlt_refresh <pipeline-name> --dlt-path <pipelines-working-directory>
 ```
 
 #### Configuration
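
The distinction the revised dlt wording draws, between the pipelines working directory and the directory holding pipeline scripts, can be sketched as a path lookup. This is a hedged illustration only: `pipeline_working_dir` is a hypothetical helper, and it assumes that the state for a pipeline lives in a `<pipeline_name>` subdirectory of whatever directory `--dlt-path` points at, which matches the default `~/.dlt/pipelines/<pipeline_name>` layout stated in the docs.

```python
from pathlib import Path
from typing import Optional


def pipeline_working_dir(pipeline_name: str, dlt_path: Optional[str] = None) -> Path:
    """Return the directory assumed to hold dlt's working state for one pipeline.

    dlt_path plays the role of the --dlt-path argument (the pipelines working
    directory). When it is omitted, fall back to the documented default location.
    """
    pipelines_dir = Path(dlt_path) if dlt_path else Path.home() / ".dlt" / "pipelines"
    return pipelines_dir / pipeline_name


default_dir = pipeline_working_dir("sushi")                     # ~/.dlt/pipelines/sushi
custom_dir = pipeline_working_dir("sushi", "/data/dlt_state")   # /data/dlt_state/sushi
```

The pipeline name `sushi` and the path `/data/dlt_state` are placeholders chosen for the example.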

docs/integrations/engines/clickhouse.md

Lines changed: 50 additions & 1 deletion
@@ -420,6 +420,54 @@ If a model has many records in each partition, you may see additional performanc
 
 Choose a model's time partitioning granularity based on the characteristics of the data it will process, making sure the total number of partitions is 1000 or fewer.
 
+## Multi-gateway setup
+
+ClickHouse does not have a catalog concept — its fully-qualified table names are two-level (`database.table`), not three-level (`catalog.database.table`).
+
+When a SQLMesh project uses ClickHouse alongside a catalog-aware gateway such as Trino or BigQuery, the two gateway types produce FQNs with different nesting depths. SQLMesh's internal schema tracking requires uniform nesting, so it assigns a **virtual catalog** to ClickHouse models at load time.
+
+### How the virtual catalog works
+
+- SQLMesh automatically detects the nesting mismatch and injects a virtual catalog into each ClickHouse adapter when a catalog-aware gateway is also present.
+- ClickHouse models will appear with three-level FQNs in `sqlmesh plan` output and logs — for example, `__ch_prod__.mydb.mytable` for a gateway named `ch_prod`.
+- The virtual catalog prefix is **never sent to ClickHouse**. It is stripped from every DDL and DML statement before execution.
+- When ClickHouse is the only gateway in a project, no virtual catalog is assigned and models remain two-level.
+
+### Adding a second gateway to an existing ClickHouse-only project
+
+!!! warning "Re-materialization required"
+    Adding a catalog-aware gateway (such as Trino or BigQuery) to a project that previously used ClickHouse as the only gateway triggers a **full re-materialization of every ClickHouse model** on the next `sqlmesh apply`. Plan for this before making the change.
+
+If your project previously used ClickHouse as the only gateway, your models were fingerprinted with 2-level FQNs (`db.table`). Adding a catalog-aware gateway causes all ClickHouse models to be treated as new versions (their FQNs change to `__{gateway_name}__.db.table`):
+
+- `FULL` models are recreated once — cost is proportional to the size of each table.
+- `INCREMENTAL_BY_TIME_RANGE` models require a **full historical backfill** from the model's configured start date.
+- The old 2-level model names appear as **Removed** in the plan and will be cleaned up after the environment TTL expires.
+
+This is a one-time cost at the transition point and does not recur. There is no way to skip it — `--forward-only` does not apply because SQLMesh treats the 3-level names as new models, not modified ones.
+
+### Virtual catalog naming
+
+By default, the virtual catalog name is derived from **the gateway name you chose in your config**, wrapped in double underscores — for example, a gateway named `clickhouse` produces `__clickhouse__`, and a gateway named `ch_prod` produces `__ch_prod__`. The double-underscore wrapping makes it visually clear that this is an internal SQLMesh concept, not a real ClickHouse object.
+
+You can override the default name by setting `virtual_catalog` in your ClickHouse connection configuration:
+
+```yaml
+gateways:
+  clickhouse:
+    connection:
+      type: clickhouse
+      host: my-clickhouse-host
+      username: default
+      virtual_catalog: ch_virtual  # optional; defaults to __{gateway_name}__ (e.g. __clickhouse__)
+  trino:
+    connection:
+      type: trino
+      ...
+```
+
+With this configuration, ClickHouse models will appear as `ch_virtual.mydb.mytable` in plan output instead of `__clickhouse__.mydb.mytable`.
+
 ## Local/Built-in Scheduler
 
 **Engine Adapter Type**: `clickhouse`
@@ -446,4 +494,5 @@ If a model has many records in each partition, you may see additional performanc
 | `server_host_name` | The ClickHouse server hostname as identified by the CN or SNI of its TLS certificate. Set this to avoid SSL errors when connecting through a proxy or tunnel with a different hostname. | string | N |
 | `tls_mode` | Controls advanced TLS behavior. proxy and strict do not invoke ClickHouse mutual TLS connection, but do send client cert and key. mutual assumes ClickHouse mutual TLS auth with a client certificate. | string | N |
 | `connection_settings` | Additional [connection settings](https://clickhouse.com/docs/integrations/python#settings-argument) | dict | N |
-| `connection_pool_options` | Additional [options](https://clickhouse.com/docs/integrations/python#customizing-the-http-connection-pool) for the HTTP connection pool | dict | N |
+| `connection_pool_options` | Additional [options](https://clickhouse.com/docs/integrations/python#customizing-the-http-connection-pool) for the HTTP connection pool | dict | N |
+| `virtual_catalog` | Override the virtual catalog name used when ClickHouse runs alongside a catalog-aware gateway (e.g. Trino). Defaults to `__{gateway_name}__`. See [Multi-gateway setup](#multi-gateway-setup) for details. | string | N |
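
The naming and stripping rules described in the ClickHouse "Multi-gateway setup" section above can be illustrated with a small sketch. It is not the adapter code from this commit: `virtual_catalog_name` and `strip_virtual_catalog` are hypothetical names, and the stripping here works on a plain dotted name with a naive split, whereas the documentation says the prefix is removed from DDL and DML statements before execution.

```python
from typing import Optional


def virtual_catalog_name(gateway_name: str, override: Optional[str] = None) -> str:
    """Default to __{gateway_name}__ unless `virtual_catalog` is set in the connection config."""
    return override if override else f"__{gateway_name}__"


def strip_virtual_catalog(fqn: str, virtual_catalog: str) -> str:
    """Drop the virtual catalog from a three-level name; leave other names untouched.

    Assumption: identifiers contain no quoted dots, so splitting on "." is safe.
    """
    parts = fqn.split(".")
    if len(parts) == 3 and parts[0] == virtual_catalog:
        return ".".join(parts[1:])
    return fqn


catalog = virtual_catalog_name("ch_prod")                # "__ch_prod__"
plan_name = f"{catalog}.mydb.mytable"                    # name as shown in plan output
engine_name = strip_virtual_catalog(plan_name, catalog)  # name ClickHouse would see
```

With the `virtual_catalog: ch_virtual` override from the YAML example, `virtual_catalog_name("clickhouse", "ch_virtual")` returns `ch_virtual`, matching the `ch_virtual.mydb.mytable` name the docs say appears in plan output.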
