teadata is a high-performance, spatially-aware Python framework for analyzing and modeling Texas public education data from the Texas Education Agency (TEA). It provides an object-oriented interface for handling large-scale geographic and demographic datasets.
The framework employs a "snapshot-based" architecture: instead of repeatedly processing raw data, it loads a pre-built, optimized binary snapshot (pickle/gzip) containing the complete state of Districts, Campuses, and Geometries into memory. This facilitates near-instantaneous query execution and complex spatial operations.
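As a rough illustration of the snapshot idea, the sketch below writes a small state dictionary to a gzip-compressed pickle and reads it back in one pass. The helper names (`save_snapshot`, `load_snapshot`), the file name, and the sample records are hypothetical and are not taken from the teadata code base.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def save_snapshot(state: dict, path: Path) -> None:
    """Serialize the full in-memory state to a gzip-compressed pickle."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(state, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_snapshot(path: Path) -> dict:
    """Read the whole snapshot back in one pass; no raw-data parsing needed."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical stand-in for the real Districts/Campuses/Geometries state.
state = {
    "districts": {"101902": {"name": "Example ISD"}},
    "campuses": {"101902001": {"district": "101902"}},
}

with tempfile.TemporaryDirectory() as tmp:
    snapshot_path = Path(tmp) / "repo_example.pkl.gz"
    save_snapshot(state, snapshot_path)
    restored = load_snapshot(snapshot_path)
```

Because unpickling can execute arbitrary code, a snapshot in this style should only be loaded from a trusted source.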
Key Technologies:
- Language: Python 3.11+
- Data Manipulation: Pandas, NumPy, DuckDB, PyArrow
- Geospatial: Shapely 2.0+, GeoPandas, PyOgrio, SciPy (KDTree)
- Build System: `uv` (recommended), `setuptools`
- Testing/Quality: `pytest`, `ruff`, `ty`
`DataEngine` is the central hub of the library.
- Loading: Loads data from a snapshot file (defaulting to `.cache/repo_*.pkl`).
- Querying: Provides a fluent interface (`>>`) for filtering districts and campuses.
- Spatial: Manages spatial indexes (`cKDTree` for points, `STRtree` for geometries) to perform fast nearest-neighbor and point-in-polygon queries.
- Enrichment: Facilitates attaching external datasets (finance, accountability ratings) to entities.
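To make the nearest-neighbor part of the spatial bullet concrete, here is a minimal sketch using SciPy's `cKDTree` directly. The campus identifiers, coordinates, and the `nearest_campuses` helper are hypothetical, and this is not how the engine is necessarily wired internally. Distances are plain Euclidean distances in degrees, which is only an approximation of geographic distance.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical campus identifiers with (lon, lat) coordinates.
campus_ids = ["A", "B", "C", "D"]
coords = np.array(
    [
        [-97.74, 30.27],
        [-95.37, 29.76],
        [-96.80, 32.78],
        [-98.49, 29.42],
    ]
)

# Build the index once; each query is then much cheaper than a full scan.
tree = cKDTree(coords)


def nearest_campuses(lon: float, lat: float, k: int = 2) -> list[str]:
    """Return the ids of the k campuses closest to (lon, lat), nearest first."""
    _distances, indices = tree.query([lon, lat], k=k)
    # With k=1 SciPy returns a scalar index, so normalize to a 1-D array.
    return [campus_ids[i] for i in np.atleast_1d(indices)]


result = nearest_campuses(-97.70, 30.30, k=2)
```

Point-in-polygon lookups follow the same build-once, query-many pattern with Shapely's `STRtree` over the district geometries.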
- `District`: Represents a school district (ISD). Contains geometry (Polygon/MultiPolygon), enrollment, rating, and metadata.
- `Campus`: Represents a school. Contains location (Point), grade levels, school type, and metadata.
- `EntityMap`/`EntityList`: Specialized collections for holding these objects, providing Pandas-like methods (`.to_df()`, `.value_counts()`).
The >> operator is overloaded on DataEngine to support concise queries:
    # Get a district by number
    district = engine >> ("district", "101902")

    # Get all campuses in that district
    campuses = engine >> ("campuses_in", district)

    # Find the 5 nearest campuses to a point
    nearest = engine >> ("nearest", (lon, lat), 5)

`teadata_sources.yaml` defines the mapping between logical data keys (e.g., `tapr`, `peims`) and physical file paths or URLs.
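Returning to the `>>` queries shown earlier, the following toy class sketches one way such an operator can be built: `__rshift__` receives the tuple, splits off the verb, and dispatches to a matching handler. `MiniEngine`, the `_q_*` handler names, and the sample data are hypothetical. The real `DataEngine` may dispatch differently and, unlike this toy, accepts a district object in `campuses_in`.

```python
class MiniEngine:
    """Toy stand-in for DataEngine; data and handler names are hypothetical."""

    def __init__(self, districts: dict[str, str], campuses: dict[str, str]):
        self.districts = districts  # district number -> district name
        self.campuses = campuses    # campus number -> district number

    def __rshift__(self, query: tuple):
        # `engine >> ("verb", *args)` arrives here with the tuple as `query`.
        verb, *args = query
        handler = getattr(self, f"_q_{verb}", None)
        if handler is None:
            raise ValueError(f"unknown query: {verb!r}")
        return handler(*args)

    def _q_district(self, number: str) -> str:
        return self.districts[number]

    def _q_campuses_in(self, number: str) -> list[str]:
        return sorted(c for c, d in self.campuses.items() if d == number)


engine = MiniEngine(
    districts={"101902": "Example ISD"},
    campuses={
        "101902001": "101902",
        "101902002": "101902",
        "999999001": "999999",
    },
)

name = engine >> ("district", "101902")
in_district = engine >> ("campuses_in", "101902")
```

Keeping each verb in its own handler method means a new query type can be added without touching `__rshift__`.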
- `teadata/`: Main package source code.
  - `engine.py`: Core `DataEngine` logic.
  - `entities.py`: `District` and `Campus` dataclasses.
  - `teadata_sources.yaml`: Data source configuration.
  - `scripts/`: Utilities for downloading/processing raw data.
- `examples/`: Example scripts and Jupyter notebooks demonstrating usage.
- `tests/`: `pytest` test suite.
- `docs/`: MkDocs documentation source.
- `.cache/`: Snapshot and store artifacts that must be committed.
The project follows a modified Semantic Versioning (SemVer) approach:
- Major (X.0.0): Significant architectural changes, API-breaking updates, or major framework migrations.
- Minor (0.X.0): New features, significant new data source integrations, or changes to the DataEngine query DSL.
- Patch (0.0.X): Bug fixes, minor logic updates, or metadata improvements.
- Data Refresh (0.0.Xy): When a simple data refresh occurs (e.g., running `load_data.py` to pick up the latest TEA releases) without significant code changes, append an extra digit (e.g., `0.0.7` -> `0.0.71`).
Release tag policy:
- Tags always use the thousandths place (e.g., `v0.0.101`, `v0.0.102`). If no tags exist, start at `v0.0.101`.
- Keep only the three most recent tags/releases; delete older tags and their release assets everywhere (GitHub releases included).
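The retention rule above can be sketched as a small helper that sorts tags numerically and reports which ones fall outside the three most recent. The function names `parse_tag` and `tags_to_delete` and the sample tag list are hypothetical. The deletion itself (git tags, GitHub releases and their assets) is not shown.

```python
def parse_tag(tag: str) -> tuple[int, ...]:
    """Turn 'v0.0.101' into (0, 0, 101) so tags sort numerically."""
    return tuple(int(part) for part in tag.removeprefix("v").split("."))


def tags_to_delete(tags: list[str], keep: int = 3) -> list[str]:
    """Return every tag except the `keep` most recent, oldest first."""
    ordered = sorted(tags, key=parse_tag)
    # Everything before the last `keep` entries is stale.
    return ordered[: max(len(ordered) - keep, 0)]


# Hypothetical set of existing tags, deliberately out of order.
existing = ["v0.0.103", "v0.0.101", "v0.0.105", "v0.0.102", "v0.0.104"]
stale = tags_to_delete(existing)
```

Sorting on the parsed integer tuple rather than the raw string avoids misordering tags such as `v0.0.99` and `v0.0.101`.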
When to increment:
- Increment the version BEFORE running a release build or distributing a new snapshot.
- Always increment the version when `load_data.py` is executed for a production data refresh.
The project uses uv for dependency management and command execution; do not use bare python or pip.
    # Recommended
    uv sync
    uv run pytest
    uv run ruff check .
    uv run ty check .
    mkdocs serve

- Loading Data:

      from teadata import DataEngine

      # Tries to find the latest snapshot automatically
      engine = DataEngine.from_snapshot()

- Accessing Attributes: Entities use dynamic attribute access via `__getattr__` to expose data stored in their `.meta` dictionary, allowing for flexible schema evolution.
- Adding New Data:
  - Add the source definition to `teadata/teadata_sources.yaml`.
  - Use or modify scripts in `teadata/scripts/` to process the raw data into the snapshot format.
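The dynamic attribute access described under Accessing Attributes can be sketched as follows. This `Campus` is a simplified, hypothetical stand-in with made-up field values, not the real class: it only shows how `__getattr__` can fall back to a `meta` dictionary when normal attribute lookup fails.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Campus:
    """Simplified stand-in; the real Campus has more fields."""

    name: str
    meta: dict[str, Any] = field(default_factory=dict)

    def __getattr__(self, item: str) -> Any:
        # Only called when normal attribute lookup has already failed.
        # Read `meta` through __dict__ so a half-initialized instance
        # (e.g. during unpickling) cannot recurse back into __getattr__.
        meta = self.__dict__.get("meta", {})
        try:
            return meta[item]
        except KeyError:
            raise AttributeError(item) from None


campus = Campus(name="Example Elementary", meta={"enrollment": 512})
enrollment = campus.enrollment          # served from campus.meta
has_rating = hasattr(campus, "rating")  # missing keys behave like missing attributes
```

Raising `AttributeError` for unknown keys keeps `hasattr` and `getattr(obj, name, default)` working as usual, so new fields can appear in `meta` without changing the class definition.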