Skip to content

Latest commit

 

History

History
446 lines (355 loc) · 22.3 KB

File metadata and controls

446 lines (355 loc) · 22.3 KB

Structured Database Seeding

Initium's seed subcommand applies structured, repeatable data provisioning to your database from YAML or JSON seed spec files. It replaces ad-hoc shell scripts with a declarative approach that supports idempotency, referential integrity, and a tracking table to prevent duplicate application.

Seed spec files are MiniJinja templates: they are rendered with environment variables before parsing, enabling conditional phases, loops, and dynamic configuration.

Supported Databases

Driver Connection URL format
postgres postgres://user:pass@host:5432/dbname
mysql mysql://user:pass@host:3306/dbname
sqlite /path/to/database.db or :memory: for tests

Quick Start

# Apply seeds from a YAML spec file
initium seed --spec /seeds/seed.yaml

# Apply seeds with reset (delete + re-seed)
initium seed --spec /seeds/seed.yaml --reset

# With JSON log output
initium seed --spec /seeds/seed.yaml --json

Seed Spec Schema

The seed spec file defines the complete seeding plan. Both YAML and JSON formats are supported (file extension determines parser). The spec file is a MiniJinja template rendered with environment variables before parsing.

Database connection

Two connection styles are supported. Choose one — they cannot be combined.

URL-based (existing behavior):

database:
  driver: postgres
  url: "postgres://user:pass@host:5432/dbname"
  # or: url_env: DATABASE_URL

Structured fields (no URL encoding needed):

database:
  driver: postgres
  host: pg.example.com
  port: 5432 # Optional. Default: 5432 (postgres), 3306 (mysql)
  user: netbird
  password: "{{ env.DB_PASSWORD }}" # Special chars just work — no URL encoding
  name: mydb
  options: # Optional. Driver-specific connection parameters
    sslmode: disable

Structured config builds the connection using driver-native APIs, bypassing URL parsing entirely. Passwords with @, %, :, or other URL-reserved characters work without encoding.

Note: Structured config is not supported for SQLite — use url instead.

Full schema

database:
  driver: postgres # Required. One of: postgres, mysql, sqlite
  # --- URL-based connection (pick one style) ---
  url: "postgres://..." # Direct database URL
  url_env: DATABASE_URL # Or: name of env var containing the URL
  # --- Structured connection (alternative to url/url_env) ---
  host: pg.example.com # Database host
  port: 5432 # Optional. Default per driver
  user: myuser # Database user
  password: "secret" # Database password (special chars OK)
  name: mydb # Database name
  options: # Optional. Driver-specific parameters
    sslmode: disable
  default_database: postgres # Optional. Bootstrap database for create_if_missing
  # --- Common ---
  tracking_table: initium_seed # Default: "initium_seed"

phases:
  - name: setup # Required. Phase name.
    order: 1 # Optional. Execution order (default: 0).
    database: reporting # Optional. Database to target/create.
    schema: analytics # Optional. Schema to target/create.
    create_if_missing: true # Optional. Create database/schema if missing.
    timeout: 30s # Optional. Default wait timeout (e.g. 30s, 1m; default: 30s).
    wait_for: # Optional. Objects to wait for before seeding.
      - type: table # One of: table, view, schema, database.
        name: users
        timeout: 60s # Optional. Per-object timeout override.
    seed_sets: # Optional. Seed sets to apply in this phase.
      - name: initial_data
        order: 1 # Optional. Controls execution order across seed sets.
        mode: once # Optional. "once" (default) or "reconcile".
        tables:
          - table: config
            order: 1 # Optional. Controls execution order within a seed set.
            unique_key: [
              email,
            ] # Optional. Columns used for duplicate detection.
            auto_id: # Optional. Auto-generated ID configuration.
              column: id # Column name for the auto-generated ID.
              id_type: integer # ID type (default: integer).
            rows:
              - _ref: row_alias # Optional. Internal reference name for this row.
                key: app_name
                value: "{{ env.APP_NAME }}"

Field reference

Field Type Required Description
database.driver string Yes Database driver: postgres, mysql, or sqlite
database.url string No Direct database connection URL (cannot combine with structured fields)
database.url_env string No Environment variable containing the database URL
database.host string No Database host (structured config; cannot combine with url/url_env)
database.port integer No Database port (default: 5432 for postgres, 3306 for mysql)
database.user string No Database user (structured config)
database.password string No Database password — special characters work without encoding
database.name string No Database name (structured config)
database.options map[string]string No Driver-specific connection parameters (e.g. sslmode: disable)
database.default_database string No Database to connect to during create_if_missing bootstrap. Default: postgres for PostgreSQL, none for MySQL.
database.tracking_table string No Name of the seed tracking table (default: initium_seed)
phases[].name string Yes Unique phase name
phases[].order integer No Execution order (lower first, default: 0)
phases[].database string No Target database name (for create/switch)
phases[].schema string No Target schema name (for create/switch)
phases[].create_if_missing boolean No Create the database/schema if it does not exist (default: false)
phases[].timeout string No Default wait timeout (e.g. 30s, 1m, 1m30s; default: 30s)
phases[].wait_for[].type string Yes Object type: table, view, schema, or database
phases[].wait_for[].name string Yes Object name to wait for
phases[].wait_for[].timeout string No Per-object timeout override (e.g. 60s, 2m, 1m30s)
phases[].seed_sets[].name string Yes Unique name for the seed set (used in tracking)
phases[].seed_sets[].order integer No Execution order (lower values first, default: 0)
phases[].seed_sets[].mode string No Seed mode: once (default) or reconcile
phases[].seed_sets[].tables[].table string Yes Target database table name
phases[].seed_sets[].tables[].order integer No Execution order within the seed set (default: 0)
phases[].seed_sets[].tables[].unique_key string[] No Columns for duplicate detection
phases[].seed_sets[].tables[].auto_id.column string No Auto-generated ID column name
phases[].seed_sets[].tables[].auto_id.id_type string No ID type (default: integer)
phases[].seed_sets[].tables[].rows[]._ref string No Internal reference name for cross-table references

Wait-for object support by driver

Object type SQLite PostgreSQL MySQL
table
view
schema ✅*
database ✅*

* In MySQL, schema and database are synonymous.

Create-if-missing support by driver

Operation SQLite PostgreSQL MySQL
CREATE DATABASE
CREATE SCHEMA ✅*

* In MySQL, CREATE SCHEMA maps to CREATE DATABASE.

SQLite does not support separate databases or schemas — each file is a database.

Database connection resolution

If structured fields (host, port, user, password, name) are provided, the connection is built using driver-native APIs — no URL is needed.

Otherwise, the database URL is resolved in this order:

  1. database.url_env — environment variable name containing the URL
  2. database.url — direct URL in the spec file
  3. DATABASE_URL — fallback environment variable

Structured fields and URL-based fields (url/url_env) are mutually exclusive — specifying both is a validation error.

Features

MiniJinja Templating

All seed spec files are rendered as MiniJinja templates before parsing. Environment variables are available as {{ env.VAR_NAME }}. This enables:

  • Dynamic values: {{ env.APP_VERSION }}
  • Conditional phases: {% if env.ENABLE_ANALYTICS %}...{% endif %}
  • Generated rows: {% for i in range(10) %}...{% endfor %}
  • Lenient mode: missing env vars render as empty strings (no errors)
database:
  driver: {{ env.DB_DRIVER }}
  url_env: DATABASE_URL
phases:
  - name: config
    seed_sets:
      - name: app_config
        tables:
          - table: config
            rows:
{% for key in ["app_name", "version", "env"] %}
              - key: {{ key }}
                value: "{{ env['APP_' ~ key | upper] }}"
{% endfor %}

Note: The @ref: syntax for cross-table references is processed at execution time (after template rendering), so it works seamlessly with MiniJinja templates.

Idempotency via Tracking Table

Initium creates a tracking table (default: initium_seed) that records which seed sets have been applied. On subsequent runs, already-applied seed sets are skipped automatically.

┌──────────────────────────────────────┐
│           initium_seed               │
├──────────┬───────────────────────────┤
│ seed_set │ applied_at                │
├──────────┼───────────────────────────┤
│ users    │ 2025-01-15T10:30:00Z      │
│ config   │ 2025-01-15T10:30:01Z      │
└──────────┴───────────────────────────┘

Duplicate Detection via Unique Keys

When unique_key is specified on a table, each row is checked against existing data before insertion. Rows matching the unique key are skipped, preventing duplicate inserts even within the same seed set.

tables:
  - table: users
    unique_key: [email]
    rows:
      - name: Alice
        email: alice@example.com # Skipped if email already exists

Auto-Generated IDs and Cross-Table References

Use auto_id to let the database generate IDs, and _ref + @ref: to reference generated values in other tables:

phases:
  - name: data
    seed_sets:
      - name: with_refs
        tables:
          - table: departments
            order: 1
            auto_id:
              column: id
            rows:
              - _ref: dept_eng # Name this row for later reference
                name: Engineering

          - table: employees
            order: 2
            rows:
              - name: Alice
                email: alice@example.com
                department_id: "@ref:dept_eng.id" # Resolves to the generated ID

Environment Variable Substitution

Use $env:VAR_NAME or MiniJinja {{ env.VAR_NAME }} to inject values from environment variables at runtime. This is ideal for credentials loaded from Kubernetes secrets:

rows:
  - username: "$env:ADMIN_USERNAME"
    password_hash: "{{ env.ADMIN_PASSWORD_HASH }}"

Reconcile Mode

By default, seed sets are applied once and never modified (mode: once). Reconcile mode makes seeding declarative: the rendered spec becomes the source of truth, and initium reconciles the database to match it whenever the rendered spec changes.

If the rendered spec has not changed since the last run (content hash match), initium treats the seed set as already reconciled and skips it. Out-of-band database changes are not corrected until a spec change triggers reconciliation again.

Enable reconcile mode per seed set:

seed_sets:
  - name: departments
    mode: reconcile # "once" (default) or "reconcile"
    tables:
      - table: departments
        unique_key: [name] # Required for reconcile mode
        rows:
          - name: Engineering
          - name: Sales

Or override all seed sets for a single run:

initium seed --spec /seeds/seed.yaml --reconcile-all

How it works:

  1. On each run, initium computes a content hash of the rendered seed set (after template/env expansion).
  2. If the hash matches the stored hash, the seed set is skipped (no-op).
  3. If the hash differs, initium reconciles row by row:
    • New rows (in spec but not in DB) are inserted.
    • Changed rows (different values for same unique key) are updated.
    • Removed rows (in DB but not in spec) are deleted.

Ignoring columns: Some columns should be set on initial insert but never overwritten during reconciliation (e.g., timestamps, random tokens, or values managed by database triggers). Use ignore_columns to exclude them:

tables:
  - table: users
    unique_key: [email]
    ignore_columns: [created_at, api_token]
    rows:
      - email: alice@example.com
        name: Alice
        created_at: "2026-01-01"
        api_token: "$env:ALICE_TOKEN"

Ignored columns are:

  • Included in the initial INSERT (the row is written with all columns).
  • Excluded from change detection (changing an ignored column's value in the spec does not trigger an update).
  • Excluded from UPDATE statements (manual or trigger-managed changes in the database are preserved).
  • Excluded from the content hash (so they don't affect the fast-path skip).

ignore_columns cannot overlap with unique_key.

Requirements:

  • Every table in a reconciled seed set must have a unique_key. Without it, there is no way to identify which rows correspond to which spec entries.
  • Environment variable changes trigger reconciliation (resolved values are compared, not raw templates).

Row tracking: Initium creates a companion table ({tracking_table}_rows, e.g., initium_seed_rows) that stores the resolved values of each seeded row. This enables change detection and orphan deletion.

Dry-run mode: Preview what reconciliation would do without modifying the database:

initium seed --spec /seeds/seed.yaml --dry-run

This logs insert/update/delete counts per table without executing any changes.

Reset Mode

Use --reset to delete all data from seeded tables and remove tracking entries before re-applying. Tables are deleted in reverse order to respect foreign key constraints:

initium seed --spec /seeds/seed.yaml --reset

Ordering

Both seed sets and tables within seed sets support explicit ordering via the order field. Lower values execute first (default: 0). This ensures parent tables are seeded before dependent tables.

Transaction Safety

Each seed set is applied within a database transaction. If any row fails to insert, the entire seed set is rolled back, preventing partial data application.

Kubernetes Usage

Credentials via Environment Variables (from Secrets)

apiVersion: v1
kind: Pod
spec:
  initContainers:
    - name: seed-data
      image: ghcr.io/kitstream/initium:latest
      args: ["seed", "--spec", "/seeds/seed.yaml"]
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        - name: ADMIN_USERNAME
          valueFrom:
            secretKeyRef:
              name: admin-credentials
              key: username
        - name: ADMIN_PASSWORD_HASH
          valueFrom:
            secretKeyRef:
              name: admin-credentials
              key: password-hash
      volumeMounts:
        - name: seed-specs
          mountPath: /seeds
          readOnly: true
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
  volumes:
    - name: seed-specs
      configMap:
        name: seed-specs

CLI Reference

Flag Default Description
--spec (required) Path to seed spec file (YAML or JSON)
--reset false Delete existing data and re-apply seeds
--dry-run false Preview changes without modifying the database
--reconcile-all false Override all seed sets to reconcile mode for this run
--json false Enable JSON log output

Failure Modes

Scenario Behavior
Invalid spec file Fails with parse error before connecting to database
Invalid MiniJinja template Fails with template syntax error before parsing YAML
Database unreachable Fails with connection error
Unsupported driver Fails with descriptive error listing supported drivers
Missing env var for URL Fails with error naming the missing variable
Missing env var in $env: Fails with error naming the missing variable
Unresolved @ref: Fails with error naming the missing reference
Row insertion failure Entire seed set rolled back via transaction
Duplicate row (with unique_key) Row silently skipped
Already-applied seed set Seed set silently skipped
Wait-for object timeout Fails with structured timeout error naming the object
Unsupported object type for driver Fails immediately with driver-specific error
CREATE DATABASE on SQLite Fails with "not supported" error
CREATE SCHEMA on SQLite Fails with "not supported" error

Examples

See the examples/seed/ directory: