Semantic Index Types

When a language model consumes a type schema, names stop being addresses and become instructions.

# Same type.
# Different computation.

churn_risk_tier: RiskTier   # assesses voluntary customer departure risk
x7: RiskTier                # picks an enum member

That is the thesis.

A semantic index type is a type declaration whose natural-language tokens — field names, descriptions, enum member names — function as computational indices for a neural consumer.

They do not just label slots.

They tell the model what to compute.

What changed

Traditional types were compiled for consumers that erase names.

Schema-driven LLM systems are compiled for consumers that read names.

The compilation target changed.

So the old assumption broke:

Renaming is no longer guaranteed to be behavior-preserving.

Why this matters

Most teams still treat response schemas like output formats.

In schema-driven LLM systems, that mental model is no longer sufficient. The schema is part of the inference surface.

What you assumed	What's actually true
The schema defines output format	The schema participates in inference
Renaming is cosmetic	Renaming can change behavior
Descriptions are documentation	Descriptions are executable guidance
Validation catches bad outputs	Validation bounds semantic failure

Schema authorship is computational authorship.

One declaration, two interpreters

A semantic index type is a single declaration with split operational semantics.

flowchart LR
    A["Type Schema<br/>names, descriptions, enum labels, constraints"] --> B["Formal Interpreter<br/>runtime / validator"]
    A --> C["Neural Interpreter<br/>language model"]

    B --> D["Structural Channel<br/>what outputs are valid"]
    C --> E["Semantic Channel<br/>which valid outputs become likely"]

    D --> F["Structured Output"]
    E --> F

The formal interpreter sees

structure
arity
validators
constraints
construction invariants

The neural interpreter sees

task framing
domain meaning
semantic distinctions
implied reasoning
what kind of computation to perform

The schema is no longer just representation. It is part of the computation.

The core law: semantic influence is bounded by structural compression

$$I(N; Y_f) \leq H(Y_f) \leq \log_2 |V_f|$$

Where:

$N$ = schema naming variant
$Y_f$ = model output for field $f$
$V_f$ = structurally valid values for field $f$

In plain English

Structure defines what can be output
Semantics influences which valid output is chosen
The tighter the type, the less room semantics has to move behavior
The looser the type, the more semantic burden the names carry

Type constraint	Valid outputs	Max semantic influence
`bool`	2	1 bit
4-member enum	4	2 bits
unconstrained `str`	unbounded	unbounded

This one inequality governs both how much a good name can help and how much a poisoned one can hurt.

So:

tight types give you proof
precise names give you guidance
robust systems want both

This is not just prompt engineering

A prompt gives instruction.

A semantic index type gives:

instruction through names and descriptions
constraint through types and enums
proof through validation and construction

Artifact	Instructs	Constrains	Proves
Prompt	Yes	No	No
Schema text alone	Sometimes	Weakly	No
Semantic index type in a typed construction system	Yes	Yes	Yes

That is why this is a systems concept, not a prompting trick.

The five principles

1. Naming is programming

Choosing churn_risk_tier over attrition_risk_tier is choosing between analytical framings — voluntary departure versus passive loss. The field name is an instruction. Changing the name changes the computation.

2. Descriptions are program text

A description that says "Projected total revenue across the full customer relationship, not historical sum" narrows the model from a broad concept to a specific calculation. Removing it changes the output distribution.

3. Renaming is refactoring

In traditional programming, renaming is safe and mechanical. With language models consuming schemas, that invariant breaks. Renaming requires the same care as modifying function logic.

4. Types constrain, names guide

Structure defines admissibility. Semantics defines salience. A 4-member enum gives the name 2 bits of influence. A bare str gives it everything.

5. Names are also attack vectors

Every field name and description is a point where the data/instruction boundary collapses — the same vulnerability class as SQL injection, instantiated at the schema level.

The practical engineering pattern: progressive hardening

Start with semantic precision.

Then turn repeated semantic failure into structural guarantees.

flowchart LR
    A["Precise names<br/>clear descriptions<br/>good enum labels"] --> B["Observe where the model fails"]
    B --> C["Add structure<br/>validators, tighter types, constraints"]
    C --> D["Reduce semantic bandwidth<br/>increase guarantees"]

This is the development discipline for mixed formal/neural systems:

use language to steer
watch where it fails
harden those failures into structure

Companion experiment

The empirical phenomenon is already established — converging evidence from schema-guided dialogue, text-to-SQL, and code language model research independently demonstrates that neural consumers are not invariant under structure-preserving renaming (§4 of the paper). We are operationalizing the framework in our target domain: Pydantic structured output, with four structurally isomorphic schema variants:

Variant	Semantic content	What it tests
Baseline	precise names + descriptions	correct semantic indexing
Names-only	names kept, descriptions removed	identifiers vs prose
Vacuous	`field_1`, `OPTION_A`, generic text	semantic channel removed
Misleading	coherent wrong-domain naming	different computation, same structure

Why the misleading condition matters

The vacuous condition shows loss of guidance.

The misleading condition is stronger.

It tests whether the model computes a different function under the same structure when the semantic index points somewhere else.

If the output stays structurally valid while the computation shifts, the schema is not merely formatting — it is participating in the task.

See experiment.md.

Why this matters for builders

If you use:

Pydantic
Zod
JSON Schema in tool definitions
structured outputs / function calling
grammar-constrained decoding
typed agent tools

…then this is already part of your system whether you recognize it or not.

Semantic Index Types imply:

schema authorship is computational authorship
naming deserves the same care as function design
descriptions are a behavioral surface
validation alone is not enough
structure and semantics should be designed together

Why this matters for security

If names and descriptions influence computation, they also create an instruction surface.

So the same mechanism that makes good schemas useful makes poisoned schemas dangerous.

That means schema text deserves real security treatment:

provenance
sanitization
least privilege
structural containment

The engineering story and the security story are the same story.

They are both about control of the semantic channel.

What the paper contributes

The paper names the phenomenon, provides the PL-theoretic framing, formalizes the two-channel constraint system, derives the information-theoretic bound that governs both engineering utility and security risk, and unifies converging evidence from three independent research communities under a single abstraction. It is a theory-and-formalization paper grounded in already-established empirical reality.

This README is the front door. The paper is the formal treatment.

Repository contents

semantic-index-types.md — paper
experiment.md — experiment design
sit/ — experiment code
.agents/scripts/building_block.py — recursive Pydantic type classifier

License

Source code is MIT.

Written content is CC BY 4.0.

See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.agents/scripts		.agents/scripts
.vscode		.vscode
sit		sit
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
compose.yaml		compose.yaml
experiment.md		experiment.md
pyproject.toml		pyproject.toml
semantic-index-types.md		semantic-index-types.md
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Semantic Index Types

What changed

Why this matters

One declaration, two interpreters

The formal interpreter sees

The neural interpreter sees

The core law: semantic influence is bounded by structural compression

In plain English

This is not just prompt engineering

The five principles

1. Naming is programming

2. Descriptions are program text

3. Renaming is refactoring

4. Types constrain, names guide

5. Names are also attack vectors

The practical engineering pattern: progressive hardening

Companion experiment

Why the misleading condition matters

Why this matters for builders

Why this matters for security

What the paper contributes

Read next

Paper

Experiment design

Code

Repository contents

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Semantic Index Types

What changed

Why this matters

One declaration, two interpreters

The formal interpreter sees

The neural interpreter sees

The core law: semantic influence is bounded by structural compression

In plain English

This is not just prompt engineering

The five principles

1. Naming is programming

2. Descriptions are program text

3. Renaming is refactoring

4. Types constrain, names guide

5. Names are also attack vectors

The practical engineering pattern: progressive hardening

Companion experiment

Why the misleading condition matters

Why this matters for builders

Why this matters for security

What the paper contributes

Read next

Paper

Experiment design

Code

Repository contents

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages