44 changes: 9 additions & 35 deletions README.md
@@ -67,12 +67,11 @@ API key priority (lowest to highest): config file → `HOTDATA_API_KEY` env var
| `connections` | `list`, `create`, `refresh`, `new` | Manage connections |
| `databases` | `list`, `create`, `delete`, `tables` | Managed databases (create and load tables via parquet) |
| `tables` | `list` | List tables and columns |
| `datasets` | `list`, `create`, `update` | Manage uploaded datasets |
| `context` | `list`, `show`, `pull`, `push` | Workspace Markdown context (e.g. data model `DATAMODEL`) via the context API |
| `query` | | Execute a SQL query |
| `queries` | `list` | Inspect query run history |
| `search` | | Full-text search across a table column |
| `indexes` | `list`, `create`, `delete` | Manage indexes on a table or dataset |
| `indexes` | `list`, `create`, `delete` | Manage indexes on a table |
| `embedding-providers` | `list`, `get`, `create`, `update`, `delete` | Manage embedding providers used by vector indexes |
| `results` | `list` | Retrieve stored query results |
| `jobs` | `list` | Manage background jobs |
Expand Down Expand Up @@ -155,7 +154,7 @@ hotdata databases tables delete <table> [--database <id_or_name>] [--schema publ
- `load` (top-level shorthand) — loads a parquet file into `--catalog.--schema.--table`. If the table was not declared at create time, the CLI automatically deletes and recreates the database with the table declared, then retries the load.
- `tables load` uploads a **parquet** file (or uses a staged `upload_id` from `POST /v1/files`) and publishes it as the table generation (`replace` mode).
- `run` mints a database-scoped JWT and execs `<cmd>` with `HOTDATA_DATABASE_TOKEN`, `HOTDATA_DATABASE_REFRESH_TOKEN`, `HOTDATA_DATABASE`, `HOTDATA_WORKSPACE`, and `HOTDATA_API_URL` injected into its environment.
- For CSV/JSON uploads without a managed database, use `hotdata datasets create` instead (`datasets.main.*`).
- Managed table loads accept **parquet** only — convert CSV/JSON to parquet first.
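
A script launched through `run` can confirm that the injected variables are present before doing any work. The following is a minimal sketch, assuming a POSIX-style shell; the function name `require_hotdata_env` is hypothetical and not part of the CLI.

```bash
# Hypothetical helper (not part of the hotdata CLI): verify that the
# variables `run` is documented to inject are set and non-empty.
require_hotdata_env() {
  local name value missing=0
  for name in HOTDATA_DATABASE_TOKEN HOTDATA_DATABASE_REFRESH_TOKEN \
              HOTDATA_DATABASE HOTDATA_WORKSPACE HOTDATA_API_URL; do
    # Look up the variable whose name is stored in $name.
    # The names are fixed constants, so eval is safe here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_hotdata_env || exit 1` at the top of the script passed to `run` makes it fail fast when it is started outside of `run`.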

Example:

@@ -176,26 +175,6 @@ hotdata tables list [--workspace-id <id>] [--connection-id <id>] [--schema <patt
- `--schema` and `--table` support SQL `%` wildcard patterns.
- Tables are displayed as `<connection>.<schema>.<table>` — use this format in SQL queries.

## Datasets

```sh
hotdata datasets list [--workspace-id <id>] [--limit <n>] [--offset <n>] [--format table|json|yaml]
hotdata datasets <dataset_id> [--workspace-id <id>] [--format table|json|yaml]
hotdata datasets create --file data.csv [--label "My Dataset"] [--table-name my_dataset]
hotdata datasets create --sql "SELECT ..." --label "My Dataset"
hotdata datasets create --url "https://example.com/data.parquet" --label "My Dataset"
hotdata datasets update <dataset_id> [--label "New Label"] [--table-name new_table]
hotdata datasets refresh <dataset_id> [--workspace-id <id>] [--async]
```

- Datasets are queryable as `datasets.main.<table_name>`.
- `--file`, `--sql`, `--query-id`, and `--url` are mutually exclusive.
- `--url` imports data directly from a URL (supports csv, json, parquet).
- Format is auto-detected from file extension or content.
- Piped stdin is supported: `cat data.csv | hotdata datasets create --label "My Dataset"`
- `refresh` re-runs the dataset's source (URL fetch or saved query) and creates a new version. Not supported for upload-source datasets.
- `--async` submits the refresh as a background job and returns a job ID; poll with `hotdata jobs <job_id>`.

## Workspace context

Named Markdown documents for a workspace (data model, glossary, etc.) are stored in the **context API**. The CLI treats the server as the **source of truth**; local files are only used where the tool requires a path on disk.
@@ -260,25 +239,20 @@ hotdata search "<query>" --table <table> [--type vector] [--column <source_text_

## Indexes

Indexes attach to either a connection-table (`--connection-id` + `--schema` + `--table`) or a dataset (`--dataset-id`). The two scopes are mutually exclusive.
`create` attaches an index to a table via its `--catalog` alias (a managed-database catalog or a connection name). `list` and `delete` accept `--connection-id` (+ `--schema` + `--table`) for connection-scoped operations.

```sh
# Managed database scope (catalog alias resolves via active database)
# Create — by catalog alias (resolves a managed-database catalog or a connection name)
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column <cols> --type bm25|vector|sorted \
[--name <name>] [--metric l2|cosine|dot] [--async] \
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]

# Connection-table scope (for non-managed connections)
hotdata indexes list --connection-id <id> --schema <schema> --table <table> [-o table|json|yaml]
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--column <cols> --type sorted|bm25|vector [--name <name>] ...
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>
# List — workspace scan, optionally filtered by connection / schema / table
hotdata indexes list [--connection-id <id>] [--schema <schema>] [--table <table>] [-o table|json|yaml]

# Dataset scope
hotdata indexes list --dataset-id <id> [-o table|json|yaml]
hotdata indexes create --dataset-id <id> --column <cols> --type sorted|bm25|vector [--name <name>] ...
hotdata indexes delete --dataset-id <id> --name <name>
# Delete — connection scope (--connection-id + --schema + --table)
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>
```

- `--type` is **required** — choose `sorted` (B-tree-like), `bm25` (full-text), or `vector` (similarity).
@@ -319,7 +293,7 @@ hotdata jobs <job_id> [--workspace-id <id>] [--format table|json|yaml]
```

- `list` shows only active jobs (`pending` and `running`) by default. Use `--all` to see all jobs.
- `--job-type` accepts: `data_refresh_table`, `data_refresh_connection`, `dataset_refresh`, `create_index`, `create_dataset_index`.
- `--job-type` accepts: `data_refresh_table`, `data_refresh_connection`, `create_index`.
- `--status` accepts: `pending`, `running`, `succeeded`, `partially_succeeded`, `failed`.

## Configuration
20 changes: 6 additions & 14 deletions skills/hotdata-analytics/SKILL.md
@@ -1,14 +1,14 @@
---
name: hotdata-analytics
description: Use this skill when the user wants OLAP-style SQL analytics in Hotdata — aggregations, GROUP BY, JOINs, reporting, exploratory queries, query run history, stored results, or materialized follow-up tables (Chain via datasets or managed databases). Activate for "analyze", "aggregate", "rollup", "pivot", "report", "metrics", "GROUP BY", "query history", "past queries", "query runs", "stored results", "materialize", "chain", "intermediate table", or sorted indexes for filters/range scans. Do not load for BM25/vector search or geospatial SQL — use hotdata-search or hotdata-geospatial. Requires the core hotdata skill for connections, tables, datasets, and auth.
description: Use this skill when the user wants OLAP-style SQL analytics in Hotdata — aggregations, GROUP BY, JOINs, reporting, exploratory queries, query run history, stored results, or materialized follow-up tables (Chain into managed databases). Activate for "analyze", "aggregate", "rollup", "pivot", "report", "metrics", "GROUP BY", "query history", "past queries", "query runs", "stored results", "materialize", "chain", "intermediate table", or sorted indexes for filters/range scans. Do not load for BM25/vector search or geospatial SQL — use hotdata-search or hotdata-geospatial. Requires the core hotdata skill for connections, tables, and auth.
version: 0.5.0
---

# Hotdata Analytics Skill

**OLAP-style analytics** in Hotdata: PostgreSQL-dialect SQL, query execution, run history, stored results, **Chain** materializations, and **sorted** indexes for filters and joins.

**Prerequisites:** Authenticate, workspace, and catalog discovery via the **`hotdata`** skill (`connections`, `tables`, `datasets`, `databases`).
**Prerequisites:** Authenticate, workspace, and catalog discovery via the **`hotdata`** skill (`connections`, `tables`, `databases`).

**Related skills:** **`hotdata-search`** (BM25, vector, retrieval indexes), **`hotdata-geospatial`** (spatial SQL).

@@ -23,7 +23,7 @@ hotdata query status <query_run_id> [--output table|json|csv]

- **PostgreSQL dialect.** Quote mixed-case identifiers: `"CustomerName"`.
- Use **`hotdata tables list`** for schema discovery — not `information_schema` via `query`.
- Fully qualified names: `<connection>.<schema>.<table>`, `datasets.<schema>.<table>`, `<database>.<schema>.<table>`.
- Fully qualified names: `<connection>.<schema>.<table>`, `<database>.<schema>.<table>`.
- Long-running queries may return `query_run_id` → poll with **`query status`** (exit `2` = still running). Do not re-run identical heavy SQL while polling.
- For **workspace-wide** joins and naming, load **context:DATAMODEL** when listed (`hotdata context list` → `show DATAMODEL`) — see **`hotdata`** skill.
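
The polling rule above (exit `2` means still running) can be wrapped in a small retry loop. This is a sketch only: `poll_until_done` is a hypothetical helper, not a CLI feature, and the attempt count and delay are arbitrary assumptions.

```bash
# Hypothetical helper: re-run a status command while it exits with 2
# ("still running" per the note above), up to a maximum number of attempts.
# Usage: poll_until_done <max_attempts> <delay_seconds> <command...>
poll_until_done() {
  local max_attempts="$1" delay="$2" attempt=0 rc
  shift 2
  while [ "$attempt" -lt "$max_attempts" ]; do
    rc=0
    "$@" || rc=$?
    # Any exit code other than 2 is final: 0 on success, otherwise an error.
    if [ "$rc" -ne 2 ]; then
      return "$rc"
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  # Still running after the last attempt.
  return 2
}
```

For example, `poll_until_done 30 5 hotdata query status <query_run_id>` would check every five seconds for up to thirty attempts, polling the existing run instead of re-submitting the SQL.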

@@ -79,24 +79,16 @@ hotdata results <result_id> [--workspace-id <workspace_id>] [--output table|json
hotdata query status <query_run_id> # if async
```

2. **Materialize** (pick one)

```bash
hotdata datasets create --name chain_slice [--description "chain slice"] --sql "SELECT ..."
hotdata datasets create --name chain_from_saved [--description "from saved"] --query-id <query_id>
```

Or managed parquet:
2. **Materialize** into a managed database (parquet)

```bash
hotdata databases create --catalog analytics
hotdata databases load --catalog analytics --table slice --file ./slice.parquet
```

3. **Chain query** — use printed **`full_name`** or `datasets list` **FULL NAME** column:
3. **Chain query** — use the catalog-qualified name `<catalog>.public.<table>`:

```bash
hotdata query "SELECT * FROM datasets.main.chain_slice WHERE ..."
hotdata query "SELECT * FROM analytics.public.slice WHERE ..."
```

@@ -111,7 +103,7 @@ Full procedure: [references/WORKFLOWS.md](references/WORKFLOWS.md).
For equality, range, and sort-heavy OLAP — not full-text or vector (see **`hotdata-search`**):

```bash
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
hotdata indexes create --catalog <connection-name-or-id> --schema <schema> --table <table> \
--name idx_orders_created --column created_at --type sorted [--async]
```

27 changes: 8 additions & 19 deletions skills/hotdata-analytics/references/WORKFLOWS.md
@@ -2,7 +2,7 @@

OLAP-style SQL, **History** (query runs and stored results), and **Chain** (materialized follow-ups). Requires **`hotdata`** for auth, workspaces, and catalog commands.

**Related:** **`hotdata-search`** for BM25/vector indexes and `hotdata search`; **`hotdata`** [WORKFLOWS.md](../../hotdata/references/WORKFLOWS.md) for datasets vs managed databases.
**Related:** **`hotdata-search`** for BM25/vector indexes and `hotdata search`; **`hotdata`** [WORKFLOWS.md](../../hotdata/references/WORKFLOWS.md) for managed databases.

---

@@ -64,43 +64,32 @@ hotdata query "SELECT ..."

### 2. Materialize

Land a smaller table — pick one:

**Datasets** (SQL query or saved query → `datasets.<schema>.<table>`):

```bash
hotdata datasets create --name chain_revenue_slice [--description "chain revenue slice"] --sql "SELECT ..."
hotdata datasets create --name chain_from_saved [--description "from saved"] --query-id <query_id>
```

**Managed database** (parquet → `<database>.<schema>.<table>`):
Land a smaller table in a **managed database** (parquet → `<database>.<schema>.<table>`):

```bash
hotdata databases create --catalog chain_db
hotdata databases load --catalog chain_db --table revenue_slice --file ./revenue_slice.parquet
```

Note the printed **`full_name`** (e.g. `datasets.main.chain_revenue_slice` or `chain_db.public.revenue_slice`). For datasets, **`FULL NAME`** from `datasets list` is authoritative.
The table is then addressable as `chain_db.public.revenue_slice`. Confirm with `hotdata databases tables list`.

### 3. Chain query

Query using the actual `full_name` from create or list — do not hardcode `datasets.main`; use whatever qualified name was printed:
Query using the catalog-qualified name `<catalog>.public.<table>`:

```bash
hotdata datasets list
hotdata query "SELECT * FROM datasets.main.chain_revenue_slice WHERE ..."
# Managed database:
# hotdata query "SELECT * FROM chain_db.public.revenue_slice WHERE ..."
hotdata databases tables list
hotdata query "SELECT * FROM chain_db.public.revenue_slice WHERE ..."
```

### Naming and documentation

- Prefer predictable `--name` values: `chain_<topic>_<YYYYMMDD>`.
- Record long-lived chains in **context:DATAMODEL → Derived tables (Chain)** with the **full** SQL name you use (`datasets.…` or `database.schema.table`).
- Record long-lived chains in **context:DATAMODEL → Derived tables (Chain)** with the **full** SQL name you use (`database.schema.table`).
- Promote join/grain findings to **context:DATAMODEL** when they should be shared or persisted (**`hotdata`** skill).
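
The `chain_<topic>_<YYYYMMDD>` convention above can be generated rather than typed by hand. A minimal sketch, assuming a shell with the standard `date` utility; `chain_name` is a hypothetical helper, not a CLI command.

```bash
# Hypothetical helper: build a chain_<topic>_<YYYYMMDD> name using today's date.
chain_name() {
  local topic="$1"
  printf 'chain_%s_%s\n' "$topic" "$(date +%Y%m%d)"
}
```

The output of `chain_name revenue` could then be passed wherever a table name is expected, for instance as the `--table` value of `hotdata databases load`, so that repeated materializations of the same topic stay distinguishable by date.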

### Guardrails

- Materialize when the base scan is large and the follow-up runs many times.
- Keep Chain tables focused; avoid wide `SELECT *` materializations when a narrow projection suffices.
- For upload format choice (datasets vs databases), see **`hotdata`** WORKFLOWS — [Datasets vs managed databases](../../hotdata/references/WORKFLOWS.md#datasets-vs-managed-databases).
- For managed-database uploads, see **`hotdata`** WORKFLOWS — [Managed databases](../../hotdata/references/WORKFLOWS.md#managed-databases).
13 changes: 4 additions & 9 deletions skills/hotdata-search/SKILL.md
@@ -42,25 +42,20 @@ hotdata search "<query>" --table <connection.schema.table> [--type vector] [--co

## Indexes (BM25 and vector)

Indexes attach to a **managed database table** (`--catalog`) or a **dataset** (`--dataset-id`). Create is not supported on raw connection tables via CLI. `list` and `delete` accept `--connection-id` for connection-scoped operations.
Create attaches to a table via its `--catalog` alias (a managed-database catalog or a connection name). `list` and `delete` accept `--connection-id` (+ `--schema` + `--table`) for connection-scoped operations.

```bash
# List — workspace scan (filter by connection, schema, table, or dataset)
# List — workspace scan (filter by connection, schema, or table)
hotdata indexes list [--connection-id <id>] [--schema <schema>] [--table <table>] [--workspace-id <ws>] [--output table|json|yaml]
hotdata indexes list --dataset-id <dataset_id> [--workspace-id <ws>] [--output table|json|yaml]

# Create — managed database table (catalog alias)
# Create — by catalog alias (resolves a managed-database catalog or a connection name)
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column <col> --type bm25|vector \
[--name <name>] [--metric l2|cosine|dot] [--async] \
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]

# Create — dataset
hotdata indexes create --dataset-id <dataset_id> --column <col> --type bm25|vector [--name <name>] ...

# Delete — connection table or dataset
# Delete — connection table (--connection-id + --schema + --table)
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>
hotdata indexes delete --dataset-id <dataset_id> --name <name>
```

- **`--type` is required** on create: `bm25` (one text column) or `vector` (exactly one column; often embeddings or auto-embedded text).
7 changes: 3 additions & 4 deletions skills/hotdata-search/references/INDEXES.md
@@ -23,7 +23,6 @@ High-cardinality **text** (`title`, `body`, …) → **bm25**. **Embedding** / f

```bash
hotdata indexes list [--connection-id <id>] [--schema <schema>] [--table <table>]
hotdata indexes list --dataset-id <dataset_id>
```

Skip duplicates (same table, column, and purpose).
@@ -40,13 +39,13 @@ hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column embedding --type vector --metric cosine
```

For regular connections (explicit connection ID):
For a regular connection, pass its name or ID to `--catalog`:

```bash
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
hotdata indexes create --catalog <connection-name-or-id> --schema <schema> --table <table> \
--name idx_posts_body_bm25 --column body --type bm25

hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
hotdata indexes create --catalog <connection-name-or-id> --schema <schema> --table <table> \
--name idx_chunks_embedding --column embedding --type vector --metric cosine
```
