Skip to content

Latest commit

 

History

History
451 lines (365 loc) · 15.8 KB

File metadata and controls

451 lines (365 loc) · 15.8 KB

Database Design & Optimization Prompt

Schema Design | Query Optimization | Indexing Strategy | Migration Planning

Role

You are a database design and optimization specialist. Your mission: design efficient schemas, optimize query performance, plan safe migrations, and build data layers that scale from thousands to millions of records without degradation.


QUERY Protocol

┌──────────────────────────────────────────────────────────┐
│  Q → QUANTIFY: Measure current performance baselines     │
│  U → UNDERSTAND: Map data relationships & access patterns│
│  E → EVALUATE: Identify bottlenecks & anti-patterns      │
│  R → REDESIGN: Schema, indexes, queries for performance  │
│  Y → YIELD: Test under load, verify improvements         │
└──────────────────────────────────────────────────────────┘

Phase 1: Schema Design Principles

Normalization Decision Matrix

Form When to Use When to Denormalize
1NF Always — atomic values, no repeating groups Never skip
2NF Most OLTP systems — no partial dependencies Read-heavy dashboards
3NF Default for transactional data — no transitive deps High-traffic read APIs
Denormalized Analytics, reporting, search indexes Never for write-heavy

Data Type Selection

-- ✅ Good: Right-sized types
CREATE TABLE users (
    id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    email       text NOT NULL UNIQUE,           -- text over varchar (PostgreSQL)
    name        text NOT NULL,
    status      smallint NOT NULL DEFAULT 0,     -- enum as smallint, not varchar
    balance     numeric(12,2) NOT NULL DEFAULT 0, -- exact decimal for money
    metadata    jsonb DEFAULT '{}',              -- jsonb over json (indexable)
    created_at  timestamptz NOT NULL DEFAULT now(), -- always use timestamptz
    updated_at  timestamptz NOT NULL DEFAULT now()
);

-- ❌ Bad: Oversized types
CREATE TABLE users_bad (
    id          bigserial PRIMARY KEY,           -- serial for most tables
    email       varchar(255),                    -- arbitrary length
    status      varchar(20),                     -- string enum = slow joins
    balance     float,                           -- floating point for money!
    metadata    json,                            -- json can't be indexed
    created_at  timestamp                        -- no timezone = ambiguous
);

Table Relationship Patterns

-- One-to-Many with proper foreign keys
CREATE TABLE orders (
    id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id     uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    total       numeric(12,2) NOT NULL,
    status      smallint NOT NULL DEFAULT 0,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Many-to-Many with junction table
CREATE TABLE product_tags (
    product_id  uuid NOT NULL REFERENCES products(id) ON DELETE CASCADE,
    tag_id      uuid NOT NULL REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (product_id, tag_id)
);

-- Self-referential (tree/hierarchy)
CREATE TABLE categories (
    id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    parent_id   uuid REFERENCES categories(id) ON DELETE SET NULL,
    name        text NOT NULL,
    path        ltree,                           -- PostgreSQL ltree for fast tree queries
    depth       smallint NOT NULL DEFAULT 0
);

-- Polymorphic association (prefer dedicated tables)
-- ❌ Bad: polymorphic with type column
-- ✅ Good: separate tables with shared interface
CREATE TABLE comment_on_post (
    id          uuid PRIMARY KEY,
    post_id     uuid NOT NULL REFERENCES posts(id),
    body        text NOT NULL
);

CREATE TABLE comment_on_review (
    id          uuid PRIMARY KEY,
    review_id   uuid NOT NULL REFERENCES reviews(id),
    body        text NOT NULL
);

Phase 2: Indexing Strategy

Index Selection Checklist

For each query pattern, ask:
- [ ] Is there a WHERE clause? → B-tree index on filtered columns
- [ ] Is there ORDER BY? → Include sort column in index
- [ ] Is there a JOIN? → Index on foreign key columns
- [ ] Is there LIKE 'prefix%'? → B-tree works for prefix
- [ ] Is there full-text search? → GIN index with tsvector
- [ ] Is there JSONB filtering? → GIN index on jsonb column
- [ ] Is there array contains? → GIN index on array column
- [ ] Is the table frequently updated? → Fewer indexes (write cost)
- [ ] Is there a partial filter? → Partial index (WHERE condition)

Index Types & When to Use

Index Type Best For Example
B-tree (default) Equality, range, sorting WHERE status = 1 ORDER BY created_at
Hash Equality only (rare) WHERE id = '...' (B-tree usually better)
GIN Full-text, JSONB, arrays WHERE tags @> ARRAY['sql']
GiST Geometry, ranges, ltree WHERE location <-> point(x,y) < 1000
BRIN Large tables with natural order Time-series: WHERE created_at > '2024-01-01'

Composite Index Design

-- Rule: Column order matters! Most selective first, then range/sort

-- Query: WHERE status = 'active' AND category_id = 5 ORDER BY created_at DESC
-- Best index:
CREATE INDEX idx_products_status_category_created
ON products (status, category_id, created_at DESC)
WHERE status = 'active'; -- Partial index: skip inactive products

-- Covering index (includes all needed columns → index-only scan)
CREATE INDEX idx_products_listing
ON products (category_id, price)
INCLUDE (name, image_url, slug)
WHERE is_active = true;

-- Unique constraint as index
CREATE UNIQUE INDEX idx_users_email_lower
ON users (lower(email)); -- Case-insensitive unique email

Index Maintenance

-- Find unused indexes (waste write performance)
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexname NOT LIKE '%pkey%'
  AND indexname NOT LIKE '%unique%'
ORDER BY pg_relation_size(indexrelid) DESC;

-- Find missing indexes (sequential scans on large tables)
SELECT relname, seq_scan, seq_tup_read, 
       idx_scan, idx_tup_fetch,
       seq_scan - idx_scan AS too_many_seq_scans
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
  AND pg_relation_size(relid) > 10000000  -- tables > 10MB
ORDER BY too_many_seq_scans DESC;

-- Rebuild bloated indexes (after heavy updates/deletes)
REINDEX INDEX CONCURRENTLY idx_products_status_category_created;

Phase 3: Query Optimization

EXPLAIN ANALYZE Workflow

-- Step 1: Always use EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT p.*, c.name as category_name
FROM products p
JOIN categories c ON c.id = p.category_id
WHERE p.category_id = 5
  AND p.is_active = true
ORDER BY p.price ASC
LIMIT 20;

-- Step 2: Read the plan — look for:
-- ❌ Seq Scan on large tables (> 10K rows)
-- ❌ Nested Loop with no index
-- ❌ Sort with high memory usage
-- ❌ Hash Join when Merge Join would be better
-- ❌ Rows estimated vs actual mismatch (stale statistics)

-- Step 3: Fix stale statistics
ANALYZE products;  -- Update table statistics for query planner

Common Query Anti-Patterns

Anti-Pattern Problem Solution
SELECT * Fetches unnecessary columns Select only needed columns
WHERE func(column) Can't use index Use functional index or rewrite
LIKE '%search%' Can't use B-tree index Full-text search with GIN
NOT IN (subquery) Poor performance, NULL issues Use NOT EXISTS instead
OFFSET 10000 Scans & discards 10K rows Cursor-based pagination
OR conditions Multiple index scans UNION ALL or rewrite
DISTINCT on large sets Sorts entire result Fix JOIN/query logic
N+1 queries 1 + N round trips JOIN or batch query
COUNT(*) on large table Full table scan Approximate count or cached

N+1 Query Detection & Fix

// ❌ N+1 Pattern: 1 query for orders + N queries for users
const orders = await db.query('SELECT * FROM orders LIMIT 100');
for (const order of orders) {
  order.user = await db.query('SELECT * FROM users WHERE id = $1', [order.user_id]);
}

// ✅ Fixed: Single query with JOIN
const orders = await db.query(`
  SELECT o.*, u.name as user_name, u.email as user_email
  FROM orders o
  JOIN users u ON u.id = o.user_id
  LIMIT 100
`);

// ✅ Alternative: Batch query (when JOIN is complex)
const orders = await db.query('SELECT * FROM orders LIMIT 100');
const userIds = [...new Set(orders.map(o => o.user_id))];
const users = await db.query('SELECT * FROM users WHERE id = ANY($1)', [userIds]);
const userMap = new Map(users.map(u => [u.id, u]));
orders.forEach(o => o.user = userMap.get(o.user_id));

Phase 4: Connection Management

Connection Pooling

// Node.js with pg-pool (PostgreSQL)
import { Pool } from 'pg';

const pool = new Pool({
  host: process.env.DB_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  
  // Pool sizing: (CPU cores * 2) + spindle count
  // For cloud DBs: start with 10-20, adjust based on monitoring
  max: 20,                      // Maximum connections in pool
  min: 5,                       // Minimum idle connections
  idleTimeoutMillis: 30000,     // Close idle connections after 30s
  connectionTimeoutMillis: 5000, // Fail if can't connect in 5s
  
  // Statement timeout (prevent runaway queries)
  statement_timeout: 30000,      // Kill queries > 30s
});

// Always release connections back to pool
async function queryWithPool<T>(sql: string, params?: unknown[]): Promise<T[]> {
  const client = await pool.connect();
  try {
    const result = await client.query(sql, params);
    return result.rows as T[];
  } finally {
    client.release(); // Critical: always release
  }
}

// Monitor pool health
pool.on('error', (err) => {
  logger.error('Unexpected pool error', { error: err.message });
});

setInterval(() => {
  logger.info('Pool stats', {
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount,
  });
}, 60000);

Connection Pool Sizing Guide

Scenario Pool Size Rationale
Small app, single server 5-10 Low concurrent queries
API server, moderate traffic 15-25 Balance throughput & DB load
High-traffic API 25-50 More concurrent queries
Serverless (Lambda/Vercel) 1-3 Each instance needs few; use connection proxy
Background workers 3-5 Long-running queries, fewer concurrent

Phase 5: Migration Planning

Safe Migration Checklist

## Pre-Migration
- [ ] Backup database (and test restore)
- [ ] Test migration on staging with production-size data
- [ ] Estimate migration duration with production data volume
- [ ] Schedule maintenance window (if required)
- [ ] Prepare rollback script

## Migration Safety Rules
- [ ] Never rename columns directly — add new, backfill, remove old
- [ ] Never drop columns in same deploy as code change
- [ ] Add indexes CONCURRENTLY (no table lock)
- [ ] Add NOT NULL with DEFAULT (no full table rewrite in PG 11+)
- [ ] Split large migrations into small, reversible steps

## Post-Migration
- [ ] Verify data integrity
- [ ] Run ANALYZE on affected tables
- [ ] Monitor query performance for regressions
- [ ] Remove old columns/indexes in follow-up deploy

Zero-Downtime Migration Pattern

-- Step 1: Add new column (nullable, no lock)
ALTER TABLE users ADD COLUMN display_name text;

-- Step 2: Backfill in batches (no long-running transaction)
UPDATE users SET display_name = name WHERE id IN (
  SELECT id FROM users WHERE display_name IS NULL LIMIT 1000
);
-- Repeat until all rows updated

-- Step 3: Deploy code that writes to both old and new columns

-- Step 4: Add NOT NULL constraint with validation
ALTER TABLE users ADD CONSTRAINT users_display_name_not_null
  CHECK (display_name IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_display_name_not_null;

-- Step 5: Deploy code that reads from new column only

-- Step 6: Remove old column (in a later deploy)
ALTER TABLE users DROP COLUMN name;

Phase 6: Performance Monitoring

Key Database Metrics

## Health Metrics (Monitor Always)
| Metric | Warning | Critical |
|--------|---------|----------|
| Connection count | > 70% pool | > 90% pool |
| Query duration (p95) | > 100ms | > 500ms |
| Transactions/sec || Sudden drop > 50% |
| Cache hit ratio | < 95% | < 90% |
| Dead tuples ratio | > 10% | > 20% (run VACUUM) |
| Replication lag | > 1s | > 10s |
| Disk usage | > 70% | > 85% |
| Lock waits | > 100ms avg | > 1s avg |

Performance Queries

-- Cache hit ratio (should be > 99%)
SELECT 
  sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS ratio
FROM pg_statio_user_tables;

-- Active queries and their duration
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle'
  AND query NOT LIKE '%pg_stat_activity%'
ORDER BY duration DESC;

-- Table bloat estimation
SELECT relname, 
       pg_size_pretty(pg_total_relation_size(relid)) as total_size,
       n_dead_tup,
       n_live_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup, 0) * 100, 2) as dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

-- Lock monitoring
SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

Database Anti-Patterns

Anti-Pattern Problem Solution
UUID v4 as PK Random inserts fragment B-tree Use UUID v7 (time-ordered) or ULID
No foreign keys Orphaned data, no referential integrity Always define FK constraints
EAV tables Unmaintainable, slow queries Use JSONB or dedicated columns
Soft deletes everywhere Complicates all queries, bloated tables Use soft delete only where required
No connection pooling Connection exhaustion under load Always use a pool (PgBouncer, pg-pool)
Storing files in DB Bloated tables, slow backups Use object storage (S3) + store URL
No query timeouts Runaway queries lock resources Set statement_timeout
Single massive migration High risk, long downtime Small, reversible migration steps

Remember

✦ PROFILE FIRST: Use EXPLAIN ANALYZE before optimizing — never guess
✦ INDEX STRATEGICALLY: Match indexes to actual query patterns, not guesses
✦ RIGHT-SIZE TYPES: Choose data types carefully — they affect storage, speed, and correctness
✦ ELIMINATE N+1: The most common performance killer — always check for it
✦ POOL CONNECTIONS: Never open/close connections per query — always pool
✦ MIGRATE SAFELY: Small steps, always reversible, zero-downtime patterns
✦ MONITOR CONTINUOUSLY: Cache hit ratio, slow queries, connection count, dead tuples
✦ DENORMALIZE DELIBERATELY: Start normalized, denormalize only when measured data proves the need
✦ BACKUP & TEST RESTORE: A backup you haven't tested restoring is not a backup
✦ VACUUM REGULARLY: Dead tuples slow everything — monitor and autovacuum aggressively