Skip to content

repfly/hawk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Hawk

A distribution-native analytics engine. Instead of storing raw rows, Hawk digests data into compact probability distributions and lets you query them directly -- compare, explain, track drift, and discover correlations, all through an information-theoretic lens.

209,527 rows of news articles compress to 5KB. Queries return in microseconds.

hawk> COMPARE category BETWEEN time:2013 AND time:2022

Metric              Value
──────────────────  ──────────────────────────────────────────
JSD                 0.684139
PSI                 36.357643
Hellinger           0.782895
Entropy(A)          3.6248 bits
Entropy(B)          3.1460 bits
Samples             34583 vs 1398

--- Top Movers ---
POLITICS            +0.2854  (0.000 → 0.285)  contrib=0.1427
WELLNESS            -0.2150  (0.232 → 0.017)  contrib=0.0796
U.S. NEWS           +0.1724  (0.000 → 0.172)  contrib=0.0862

How it works

  1. Define variables (categorical or continuous) and dimensions (e.g., time)
  2. Ingest data from CSV, JSON, or Parquet -- Hawk builds histograms and contingency tables
  3. Query the distributions directly using a SQL-like language or web UI

The database stores only the distribution summaries, not the raw data. Everything is built on entropy and information theory: JSD for comparison, mutual information for association, KL divergence for directionality.

Quick start

# Build
cargo build --release

# Start the web UI
cargo run --release --bin hawk-server -- my_database.db 3000

# Or use the CLI
cargo run --release --bin hawk -- my_database.db

Query language

-- Compare two distribution slices
COMPARE category BETWEEN time:2013 AND time:2022

-- With dimension filters
COMPARE category BETWEEN time:2013 AND time:2022 WHERE region:US

-- Compare all pairs across a dimension
COMPARE category ACROSS time

-- What drives the divergence?
EXPLAIN time:2013 VS time:2022

-- Track drift over time
TRACK category FROM time:2012 GRANULARITY yearly

-- Show a distribution (top 5 categories)
SHOW category AT time:2022 TOP 5

-- Entropy ranking
RANK category BY ENTROPY OVER time

-- Mutual information between variables
MI author, category AT time:2016

-- Conditional MI (controlling for time)
CMI author, category GIVEN time

-- Find strongest associations
CORRELATIONS OVER time LIMIT 10

-- Pairwise distance matrix
PAIRWISE time ON category USING jsd

-- Nearest distributions
NEAREST time:2022 ON time LIMIT 3 USING hellinger

-- Export results
EXPORT STATS AS JSON
EXPORT COMPARE category ACROSS time AS CSV

-- Metadata
STATS
SCHEMA
DIMENSIONS time

Example outputs

Drift tracking:

hawk> TRACK category FROM time:2012 GRANULARITY yearly

Time  Entropy  Drift (JSD)
────  ───────  ────────────────
2012  3.6310   0.0314
2013  3.6248   0.3571 <- shift
2014  4.8237   0.1656 <- shift
2015  4.4118   0.0561 <- shift
2018  3.3050   0.1775 <- shift
2020  3.0430   0.0372
2021  2.9053   0.0286
2022  3.1460   0.0000

Explain divergence:

hawk> EXPLAIN time:2013 VS time:2022

Variable          JSD       Fraction
────────────────  ────────  ──────────────
TOTAL             0.830323  100.0%
category          0.684139  82.4%
  POLITICS        +0.2854   contrib=0.1427
  WELLNESS        -0.2150   contrib=0.0796
  U.S. NEWS       +0.1724   contrib=0.0862
author            0.146184  17.6%
  Mary Papenfuss  +0.0715   contrib=0.0358

Association strength:

hawk> MI author, category AT time:2016

Metric       Value
───────────  ───────────
MI           1.7794 bits
NMI          0.5537
Cramer's V   0.5186
Samples      5688
Strength     strong

Metrics

All metrics are rooted in information theory:

Metric Formula Range What it measures
Entropy H(X) = -Σ p_i log p_i [0, log k] Distribution uncertainty
JSD H(M) - ½H(P) - ½H(Q) [0, 1] Symmetric divergence
KL divergence Σ p_i log(p_i/q_i) [0, ∞) Directional divergence
PSI KL(P||Q) + KL(Q||P) [0, ∞) Population stability (<0.1 stable, >0.2 significant)
Hellinger (1/√2)√(Σ(√p-√q)²) [0, 1] Bounded symmetric distance
MI H(X)+H(Y)-H(X,Y) [0, ∞) Shared information between variables
NMI MI / min(H(X),H(Y)) [0, 1] Normalized association strength
Cramer's V √(χ²/(n·min(r-1,c-1))) [0, 1] Effect size for categorical association
Wasserstein Σ|CDF_P - CDF_Q|·Δx [0, ∞) Earth mover's distance (histograms only)

Web UI

cargo run --release --bin hawk-server -- my_database.db 3000
# Open http://localhost:3000

Features:

  • Interactive query input with htmx (no page reloads)
  • SVG charts: diverging bar charts for COMPARE, entropy timelines for TRACK, distribution bars for SHOW, heatmaps for PAIRWISE
  • Clickable schema sidebar
  • Query history (persisted in localStorage)
  • Streaming ingestion endpoint: POST /ingest with JSON body

Streaming ingestion

The web server accepts live data via HTTP:

# Single record
curl -X POST http://localhost:3000/ingest \
  -H 'Content-Type: application/json' \
  -d '{"category": "TECH", "date": "2024-01-15"}'

# Batch
curl -X POST http://localhost:3000/ingest \
  -H 'Content-Type: application/json' \
  -d '[{"category": "TECH", "date": "2024-01-15"}, {"category": "SPORTS", "date": "2024-01-16"}]'

Storage format

Hawk uses a custom binary format with zstd compression:

[4 bytes] "HAWK" magic
[4 bytes] format version (u32 LE)
[rest]    zstd-compressed bincode payload

A database that digests 209K news articles (42 categories, 20 authors, 11 years) occupies ~6KB on disk.

File Contents
meta.edb Schema, counters, config
distributions.edb All marginal distributions + joint contingency tables
dist_index.edb Lookup index for (variable, dimension_key) -> distribution
snapshots.edb Historical distribution snapshots

Architecture

Single library crate (hawk-engine) with modules:

hawk_engine::core       Types: Distribution, Joint, Schema, DimensionKey
hawk_engine::math       Entropy, JSD, KL, PSI, Hellinger, MI, NMI, Cramer's V, Wasserstein
hawk_engine::storage    Binary file storage, zstd compression, mmap reads, locking
hawk_engine::ingest     CSV/JSON/Parquet ingestion, rayon parallelism, schema inference
hawk_engine::query      Query engine: compare, explain, track, pairwise, correlations
hawk_engine::sql        SQL-like DSL: tokenizer, recursive descent parser, executor

Plus a separate binary crate (hawk-server) for the web UI: axum + htmx, SVG charts, streaming ingestion endpoint.

Using as a library

[dependencies]
hawk-engine = "0.1"
use hawk_engine::storage::{Database, OpenMode};
use hawk_engine::query::QueryEngine;

let db = Database::open("my.db", OpenMode::ReadOnly).unwrap();
let engine = QueryEngine::default();
let result = engine.compare(&db, "time:2023", "time:2024", None).unwrap();
println!("JSD = {:.6}", result.jsd);

Published to crates.io.

Building

cargo build --release
cargo test

Requirements: Rust 1.75+

License

MIT

About

A distribution-native analytics engine.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors