Hawk

A distribution-native analytics engine. Instead of storing raw rows, Hawk digests data into compact probability distributions and lets you query them directly -- compare, explain, track drift, and discover correlations, all through an information-theoretic lens.

209,527 rows of news articles compress to 5KB. Queries return in microseconds.

hawk> COMPARE category BETWEEN time:2013 AND time:2022

Metric              Value
──────────────────  ──────────────────────────────────────────
JSD                 0.684139
PSI                 36.357643
Hellinger           0.782895
Entropy(A)          3.6248 bits
Entropy(B)          3.1460 bits
Samples             34583 vs 1398

--- Top Movers ---
POLITICS            +0.2854  (0.000 → 0.285)  contrib=0.1427
WELLNESS            -0.2150  (0.232 → 0.017)  contrib=0.0796
U.S. NEWS           +0.1724  (0.000 → 0.172)  contrib=0.0862

How it works

Define variables (categorical or continuous) and dimensions (e.g., time)
Ingest data from CSV, JSON, or Parquet -- Hawk builds histograms and contingency tables
Query the distributions directly using a SQL-like language or web UI

The database stores only the distribution summaries, not the raw data. Everything is built on entropy and information theory: JSD for comparison, mutual information for association, KL divergence for directionality.

Quick start

# Build
cargo build --release

# Start the web UI
cargo run --release --bin hawk-server -- my_database.db 3000

# Or use the CLI
cargo run --release --bin hawk -- my_database.db

Query language

-- Compare two distribution slices
COMPARE category BETWEEN time:2013 AND time:2022

-- With dimension filters
COMPARE category BETWEEN time:2013 AND time:2022 WHERE region:US

-- Compare all pairs across a dimension
COMPARE category ACROSS time

-- What drives the divergence?
EXPLAIN time:2013 VS time:2022

-- Track drift over time
TRACK category FROM time:2012 GRANULARITY yearly

-- Show a distribution (top 5 categories)
SHOW category AT time:2022 TOP 5

-- Entropy ranking
RANK category BY ENTROPY OVER time

-- Mutual information between variables
MI author, category AT time:2016

-- Conditional MI (controlling for time)
CMI author, category GIVEN time

-- Find strongest associations
CORRELATIONS OVER time LIMIT 10

-- Pairwise distance matrix
PAIRWISE time ON category USING jsd

-- Nearest distributions
NEAREST time:2022 ON time LIMIT 3 USING hellinger

-- Export results
EXPORT STATS AS JSON
EXPORT COMPARE category ACROSS time AS CSV

-- Metadata
STATS
SCHEMA
DIMENSIONS time

Example outputs

Drift tracking:

hawk> TRACK category FROM time:2012 GRANULARITY yearly

Time  Entropy  Drift (JSD)
────  ───────  ────────────────
2012  3.6310   0.0314
2013  3.6248   0.3571 <- shift
2014  4.8237   0.1656 <- shift
2015  4.4118   0.0561 <- shift
2018  3.3050   0.1775 <- shift
2020  3.0430   0.0372
2021  2.9053   0.0286
2022  3.1460   0.0000

Explain divergence:

hawk> EXPLAIN time:2013 VS time:2022

Variable          JSD       Fraction
────────────────  ────────  ──────────────
TOTAL             0.830323  100.0%
category          0.684139  82.4%
  POLITICS        +0.2854   contrib=0.1427
  WELLNESS        -0.2150   contrib=0.0796
  U.S. NEWS       +0.1724   contrib=0.0862
author            0.146184  17.6%
  Mary Papenfuss  +0.0715   contrib=0.0358

Association strength:

hawk> MI author, category AT time:2016

Metric       Value
───────────  ───────────
MI           1.7794 bits
NMI          0.5537
Cramer's V   0.5186
Samples      5688
Strength     strong

Metrics

All metrics are rooted in information theory:

Metric	Formula	Range	What it measures
Entropy	H(X) = -Σ p_i log p_i	[0, log k]	Distribution uncertainty
JSD	H(M) - ½H(P) - ½H(Q)	[0, 1]	Symmetric divergence
KL divergence	Σ p_i log(p_i/q_i)	[0, ∞)	Directional divergence
PSI	KL(P\|\|Q) + KL(Q\|\|P)	[0, ∞)	Population stability (<0.1 stable, >0.2 significant)
Hellinger	(1/√2)√(Σ(√p-√q)²)	[0, 1]	Bounded symmetric distance
MI	H(X)+H(Y)-H(X,Y)	[0, ∞)	Shared information between variables
NMI	MI / min(H(X),H(Y))	[0, 1]	Normalized association strength
Cramer's V	√(χ²/(n·min(r-1,c-1)))	[0, 1]	Effect size for categorical association
Wasserstein	Σ\|CDF_P - CDF_Q\|·Δx	[0, ∞)	Earth mover's distance (histograms only)

Web UI

cargo run --release --bin hawk-server -- my_database.db 3000
# Open http://localhost:3000

Features:

Interactive query input with htmx (no page reloads)
SVG charts: diverging bar charts for COMPARE, entropy timelines for TRACK, distribution bars for SHOW, heatmaps for PAIRWISE
Clickable schema sidebar
Query history (persisted in localStorage)
Streaming ingestion endpoint: POST /ingest with JSON body

Streaming ingestion

The web server accepts live data via HTTP:

# Single record
curl -X POST http://localhost:3000/ingest \
  -H 'Content-Type: application/json' \
  -d '{"category": "TECH", "date": "2024-01-15"}'

# Batch
curl -X POST http://localhost:3000/ingest \
  -H 'Content-Type: application/json' \
  -d '[{"category": "TECH", "date": "2024-01-15"}, {"category": "SPORTS", "date": "2024-01-16"}]'

Storage format

Hawk uses a custom binary format with zstd compression:

[4 bytes] "HAWK" magic
[4 bytes] format version (u32 LE)
[rest]    zstd-compressed bincode payload

A database that digests 209K news articles (42 categories, 20 authors, 11 years) occupies ~6KB on disk.

File	Contents
`meta.edb`	Schema, counters, config
`distributions.edb`	All marginal distributions + joint contingency tables
`dist_index.edb`	Lookup index for (variable, dimension_key) -> distribution
`snapshots.edb`	Historical distribution snapshots

Architecture

Single library crate (hawk-engine) with modules:

hawk_engine::core       Types: Distribution, Joint, Schema, DimensionKey
hawk_engine::math       Entropy, JSD, KL, PSI, Hellinger, MI, NMI, Cramer's V, Wasserstein
hawk_engine::storage    Binary file storage, zstd compression, mmap reads, locking
hawk_engine::ingest     CSV/JSON/Parquet ingestion, rayon parallelism, schema inference
hawk_engine::query      Query engine: compare, explain, track, pairwise, correlations
hawk_engine::sql        SQL-like DSL: tokenizer, recursive descent parser, executor

Plus a separate binary crate (hawk-server) for the web UI: axum + htmx, SVG charts, streaming ingestion endpoint.

Using as a library

[dependencies]
hawk-engine = "0.1"

use hawk_engine::storage::{Database, OpenMode};
use hawk_engine::query::QueryEngine;

let db = Database::open("my.db", OpenMode::ReadOnly).unwrap();
let engine = QueryEngine::default();
let result = engine.compare(&db, "time:2023", "time:2024", None).unwrap();
println!("JSD = {:.6}", result.jsd);

Published to crates.io.

Building

cargo build --release
cargo test

Requirements: Rust 1.75+

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.github/workflows		.github/workflows
crates		crates
tests		tests
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hawk

How it works

Quick start

Query language

Example outputs

Metrics

Web UI

Streaming ingestion

Storage format

Architecture

Using as a library

Building

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Hawk

How it works

Quick start

Query language

Example outputs

Metrics

Web UI

Streaming ingestion

Storage format

Architecture

Using as a library

Building

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages