Merged
4 changes: 1 addition & 3 deletions Dockerfile
@@ -20,9 +20,7 @@ COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

# Copy project files
RUN mkdir -p docs
COPY mkdocs.yml ./
COPY . ./docs/
COPY . .

# Build the MkDocs site
RUN uv run mkdocs build --strict --site-dir /app/site
50 changes: 31 additions & 19 deletions README.md
@@ -1,7 +1,7 @@
# The AI Reliability Engineering (AIRE) Standards

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Status: Draft](https://img.shields.io/badge/Status-Draft%20v0.1-orange)]()
[![Status: Live](https://img.shields.io/badge/Status-Live%20v0.1-green)](https://github.com/exospherehost/ai-reliability-standards)

> **An open implementation guide for building reliable AI Agents at scale. Defining the practices for AI Reliability Engineering (AIRE).**

@@ -134,16 +134,16 @@ Security for AI agents differs from traditional software-agents are autonomous d

### 5. Operational Excellence & Team Culture

*Establishing SLAs, error budgets, team structures, and operational practices that enable reliable AI systems to scale.*
*Establishing performance targets, quality budgets, team structures, and operational practices that enable reliable AI systems to scale.*

Operational Excellence bridges the gap between technical architecture and organizational culture. While the first four pillars define *what* to build, this pillar defines *how* teams operate, measure, and continuously improve AI systems at scale:

- **AI-Specific SLAs & Error Budgets** - Service Level Objectives for availability, latency, quality, safety, and efficiency; error budget policies for balancing reliability with innovation velocity
- **AI-Specific Performance Targets & Quality Budgets** - Performance targets for cognitive accuracy, safety integrity, autonomy level, response performance, and cost efficiency; quality budget policies for balancing reliability with innovation velocity
- **Team Structure & Shared Responsibility** - Product teams own agents end-to-end; embedded AI Reliability Engineers (AIREs) with 20% time allocation; central platform team provides infrastructure
- **Progressive Autonomy Maturity Model** - Five levels of agent autonomy (L0: Human-Driven → L4: Autonomous), reducing HITL rate from 100% to <5% over time
- **Reliability Reviews** - Weekly metric reviews, monthly postmortems, error budget tracking, SLO compliance monitoring
- **Reliability Reviews** - Weekly metric reviews, monthly postmortems, quality budget tracking, performance target compliance monitoring

**Key Metrics:** SLO Compliance >95%, Error Budget Remaining >25%, HITL Rate <10%, Autonomy Level L3+, Time to Autonomy <6 months
**Key Metrics:** Performance Target Compliance >95%, Quality Budget Remaining >50%, HITL Rate <10%, Autonomy Level L3+, Time to Autonomy <6 months

📖 **[Read the full Operational Excellence guide →](docs/pillars/operational-excellence.md)**
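
Reviewer note on the revised **Key Metrics** line: the thresholds it lists can be expressed as a simple gate for the weekly metric review. The following is a minimal sketch, not part of this PR. The `ReliabilitySnapshot` class, its field names, and `failed_key_metrics` are hypothetical; only the threshold values come from the new README text. "Time to Autonomy <6 months" is left out because the README does not say how it is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilitySnapshot:
    """Hypothetical container for one review period's measurements."""
    target_compliance: float         # fraction of performance targets met, 0.0-1.0
    quality_budget_remaining: float  # fraction of the quality budget left, 0.0-1.0
    hitl_rate: float                 # fraction of tasks needing human review, 0.0-1.0
    autonomy_level: int              # 0-4, matching L0..L4 in the README


def failed_key_metrics(s: ReliabilitySnapshot) -> list[str]:
    """Return the README Key Metrics thresholds that this snapshot misses."""
    checks = {
        "Performance Target Compliance >95%": s.target_compliance > 0.95,
        "Quality Budget Remaining >50%": s.quality_budget_remaining > 0.50,
        "HITL Rate <10%": s.hitl_rate < 0.10,
        "Autonomy Level L3+": s.autonomy_level >= 3,
    }
    return [name for name, ok in checks.items() if not ok]


if __name__ == "__main__":
    # Illustrative values only: 40% budget remaining misses the new >50% target.
    snapshot = ReliabilitySnapshot(0.97, 0.40, 0.08, 3)
    print(failed_key_metrics(snapshot))  # ['Quality Budget Remaining >50%']
```

Note that the same snapshot would have passed the previous "Error Budget Remaining >25%" threshold, so the change from 25% to 50% tightens the gate rather than only renaming it.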

@@ -187,19 +187,31 @@ You get to shape the future of AI reliability engineering and get recognized for

## Repository Structure

```
docs/
├── getting-started.md # Adoption roadmap for organizations
├── pillars/
│ ├── resilient-architecture.md # Pillar 1: Fault tolerance, scaling, recovery
│ ├── cognitive-reliability.md # Pillar 2: Accuracy, consistency, drift detection
│ ├── quality-lifecycle.md # Pillar 3: Testing, deployment, feedback loops
│ ├── security.md # Pillar 4: JIT access, guardrails, audit logs
│ └── operational-excellence.md # Pillar 5: SLAs, team structure, progressive autonomy
└── appendix/
├── principles.md # AIRE Principles (5 guiding tenets)
├── metrics-framework.md # Three-tier metrics framework
└── glossary.md # Key terms and definitions
This repository contains the source files for the AIRE Standards documentation and deployment infrastructure:

```text
.
├── docs/ # MkDocs documentation source
│ ├── index.md # Documentation homepage
│ ├── getting-started.md # Adoption roadmap for organizations
│ ├── principles.md # AIRE Principles (5 guiding tenets)
│ ├── pillars/ # Core reliability pillars
│ │ ├── resilient-architecture.md # Pillar 1: Fault tolerance, scaling, recovery
│ │ ├── cognitive-reliability.md # Pillar 2: Accuracy, consistency, drift detection
│ │ ├── quality-lifecycle.md # Pillar 3: Testing, deployment, feedback loops
│ │ ├── security.md # Pillar 4: JIT access, guardrails, audit logs
│ │ └── operational-excellence.md # Pillar 5: Performance targets, team structure, progressive autonomy
│ └── appendix/
│ ├── metrics-framework.md # Three-tier metrics framework
│ └── glossary.md # Key terms and definitions
├── assets/ # Static assets (sponsor logos, images)
├── k8s/ # Kubernetes deployment manifests
├── stylesheets/ # Custom CSS for documentation
├── mkdocs.yml # MkDocs configuration
├── Dockerfile # Container image for documentation site
├── pyproject.toml # Python project dependencies
├── README.md # GitHub repository homepage (this file)
├── CONTRIBUTORS.md # Contributors registry
```
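
Reviewer note on the expanded tree: a documented layout like this tends to drift from the real checkout, so it may be worth checking mechanically. The sketch below is an assumption, not something this PR adds. `DOCUMENTED_PATHS` is copied from the new README tree, and the `missing_paths` helper is hypothetical.

```python
from pathlib import Path

# Entries copied from the new README tree (directories and files alike).
DOCUMENTED_PATHS = [
    "docs/index.md",
    "docs/getting-started.md",
    "docs/principles.md",
    "docs/pillars/resilient-architecture.md",
    "docs/pillars/cognitive-reliability.md",
    "docs/pillars/quality-lifecycle.md",
    "docs/pillars/security.md",
    "docs/pillars/operational-excellence.md",
    "docs/appendix/metrics-framework.md",
    "docs/appendix/glossary.md",
    "assets",
    "k8s",
    "stylesheets",
    "mkdocs.yml",
    "Dockerfile",
    "pyproject.toml",
    "README.md",
    "CONTRIBUTORS.md",
]


def missing_paths(repo_root: str, documented=DOCUMENTED_PATHS) -> list[str]:
    """Return documented paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in documented if not (root / p).exists()]


if __name__ == "__main__":
    # Run from the repository root; an empty list means the tree matches.
    print(missing_paths("."))
```

Run against a checkout of the merged branch, any entry it prints is a line in the tree that no longer matches the repository.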

---
@@ -215,7 +227,7 @@ We welcome Pull Requests (PRs) from engineers who have solved specific reliabili

## Sponsors

<a href="https://exosphere.host"><img src="./assets/sponsors/exosphere.png" alt="ExosphereHost Inc." width="75"></a>
<a href="https://exosphere.host"><img src="./docs/assets/sponsors/exosphere.png" alt="ExosphereHost Inc." width="75"></a>

Contact nikita@exosphere.host to sponsor this work.
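
Reviewer note on the sponsor image path: the README now points at `./docs/assets/...` while `docs/index.md` keeps `./assets/...`, which is consistent if each `src` is resolved relative to the file that contains it. A minimal sketch for checking that in the source tree follows. The `broken_image_paths` helper and its regular expression are assumptions for illustration. It inspects only inline HTML `<img>` tags and says nothing about how the built MkDocs site serves the files.

```python
import re
from pathlib import Path

# Captures the src value of inline HTML <img> tags.
IMG_SRC = re.compile(r'<img[^>]*\bsrc="([^"]+)"')
# Anything starting with a URL scheme (http:, https:, data:, ...) is not a local path.
URL_SCHEME = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_image_paths(markdown_file: str) -> list[str]:
    """Return relative <img> src values that do not exist next to the file."""
    md = Path(markdown_file)
    broken = []
    for src in IMG_SRC.findall(md.read_text(encoding="utf-8")):
        if URL_SCHEME.match(src):
            continue  # remote or embedded image, nothing to check on disk
        if not (md.parent / src).exists():
            broken.append(src)
    return broken


if __name__ == "__main__":
    for name in ("README.md", "docs/index.md"):
        print(name, broken_image_paths(name))
```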

File renamed without changes
19 changes: 10 additions & 9 deletions docs/index.md
@@ -1,7 +1,7 @@
# The AI Reliability Engineering (AIRE) Standards

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Status: Draft](https://img.shields.io/badge/Status-Draft%20v0.1-orange)]()
[![Status: Live](https://img.shields.io/badge/Status-Live%20v0.1-green)](https://github.com/exospherehost/ai-reliability-standards)

> **An open implementation guide for building reliable AI Agents at scale. Defining the practices for AI Reliability Engineering (AIRE).**

@@ -109,7 +109,6 @@ Operational Excellence bridges the gap between technical architecture and organi

---


## AIRE Principles

*Guiding tenets inspired by SRE:*
@@ -150,7 +149,6 @@ Design for autonomous operation. Human escalation is a safety net for edge cases

---


## Getting Started

**New to AIRE?** Start with the **[Getting Started Guide →](getting-started.md)** for a step-by-step adoption roadmap:
@@ -189,17 +187,20 @@ You get to shape the future of AI reliability engineering and get recognized for

## Repository Structure

```
docs/
This documentation is built from the [ai-reliability-standards repository](https://github.com/exospherehost/ai-reliability-standards). The repository structure includes:

```text
docs/ # Documentation source files
├── index.md # This page (documentation homepage)
├── getting-started.md # Adoption roadmap for organizations
├── pillars/
├── principles.md # AIRE Principles (5 guiding tenets)
├── pillars/ # Core reliability pillars
│ ├── resilient-architecture.md # Pillar 1: Fault tolerance, scaling, recovery
│ ├── cognitive-reliability.md # Pillar 2: Accuracy, consistency, drift detection
│ ├── quality-lifecycle.md # Pillar 3: Testing, deployment, feedback loops
│ ├── security.md # Pillar 4: JIT access, guardrails, audit logs
│ └── operational-excellence.md # Pillar 5: SLAs, team structure, progressive autonomy
│ └── operational-excellence.md # Pillar 5: Performance targets, team structure, progressive autonomy
└── appendix/
├── principles.md # AIRE Principles (5 guiding tenets)
├── metrics-framework.md # Three-tier metrics framework
└── glossary.md # Key terms and definitions
```
@@ -219,7 +220,7 @@ We welcome Pull Requests (PRs) from engineers who have solved specific reliabili

<a href="https://exosphere.host"><img src="./assets/sponsors/exosphere.png" alt="ExosphereHost Inc." width="75"></a>

Contact nivedit@exosphere.host to sponsor this work.
Contact nikita@exosphere.host to sponsor this work.

## License
