
Lingua project guide for Claude

This guide helps AI assistants understand and work with the Lingua codebase effectively.

Project overview

Lingua is a universal message format that compiles to provider-specific formats with zero runtime overhead. It's designed to allow seamless interoperability between different LLM providers without runtime penalties.

Key principles

  • Universal compatibility: Supports 100% of provider-specific quirks and capabilities
  • Zero runtime overhead: Pure compile-time translation to native provider formats
  • Type safety: Full TypeScript and Rust type generation with bidirectional validation
  • No network calls: This is a message format library, not an API client
  • Explicit error handling: All errors must be properly handled, never silently swallowed
  • No hidden marker fields: Do not encode provider semantics via internal marker keys (for example in provider_options) to fake lossless roundtrips.
  • Ask when non-lossy mapping is unclear: If the universal type cannot represent a provider feature non-lossily, stop and ask for clarification on the intended canonical representation before implementing a workaround.
  • No unapproved fallback logic: Do not add ad-hoc fallback parsing/translation paths (for example fallback_* helpers) without checking with the programmer first.
  • Typed boundaries only: At provider boundaries, parse into well-defined typed structs/enums. Do not add lenient raw-JSON parsing that guesses defaults for required fields (for example defaulting missing role to user, lowercasing unknown roles, or inventing empty content).
  • Do not handwrite provider-format structs: Do not manually define Rust structs/enums that represent provider wire formats when generated or canonical provider types already exist. Fix generation or add typed adapters around canonical types instead.
  • Do not inspect serde_json::Value directly for provider semantics: Do not branch on provider-format fields via ad-hoc Value map access. Deserialize into typed provider or typed compatibility structs first, then convert.
  • Lenient import paths are typed boundaries too: Files like processing/import.rs are not exempt. For any role/content/tool_call_id compatibility handling, first deserialize into typed compatibility structs (with serde aliases as needed), then branch on typed enums/fields.
  • Pre-edit parser guardrail: Before finalizing parser/converter changes in typed-boundary code, scan your diff for new as_object(), .get("..."), Value::Object, or raw Map<String, Value> field-plucking used for semantics. If present, rewrite to typed deserialization or stop and ask.
  • Fix via types or explicit errors: If fuzzing finds unsupported/ambiguous shapes, either model them explicitly in types/converters or return a clear error. Do not silently coerce invalid input into a "best effort" shape.
  • Typed-boundary CI gate: CI enforces make typed-boundary-check-branch BASE=origin/<base-branch> on pull requests. Running make typed-boundary-check locally is recommended for faster feedback, but not required as a pre-commit hook.
  • Typed extras views over raw map access: If provider extras must be read, deserialize extras into a typed view struct first; do not pluck fields ad-hoc with map.get(...).
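The typed-extras-view principle can be sketched with a stdlib-only toy (all names here are hypothetical, not actual Lingua types; in the real codebase the view would be a serde `Deserialize` struct). The point is that any field access is confined to one typed-view constructor that fails loudly, so the rest of the code branches only on typed fields:

```rust
use std::collections::HashMap;

// Hypothetical typed view over provider extras. In real code this would
// derive Deserialize; here a TryFrom impl stands in. Raw key lookups are
// confined to this single boundary and never scattered through converters.
#[derive(Debug, PartialEq)]
struct CacheControlView {
    cache_type: String,
}

impl TryFrom<&HashMap<String, String>> for CacheControlView {
    type Error = String;

    fn try_from(extras: &HashMap<String, String>) -> Result<Self, Self::Error> {
        // Fail loudly instead of guessing a default for a required field.
        let cache_type = extras
            .get("cache_type")
            .ok_or_else(|| "missing required field 'cache_type'".to_string())?
            .clone();
        Ok(CacheControlView { cache_type })
    }
}

fn main() {
    let mut extras = HashMap::new();
    extras.insert("cache_type".to_string(), "ephemeral".to_string());
    let view = CacheControlView::try_from(&extras).expect("valid extras");
    assert_eq!(view.cache_type, "ephemeral");

    // A missing required field is an explicit error, not a silent default.
    assert!(CacheControlView::try_from(&HashMap::new()).is_err());
}
```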

Documentation style guide

Always use sentence case for all headings, not title case:

  • ✅ ## Pipeline overview
  • ❌ ## Pipeline Overview

Be concise and direct:

  • Focus on what, not why (unless specifically asked)
  • Avoid unnecessary explanations or summaries
  • Use bullet points and structured formats

Project structure

src/
├── universal/             # Core Lingua message types
├── providers/             # Provider-specific API type definitions
├── translators/           # Bidirectional format conversion logic
├── capabilities/          # Provider capability detection
└── lib.rs                 # Main entry point and re-exports

Working with providers

Each provider should have:

  • Separate request/response types: Don't conflate them into single structs
  • Complete type coverage: All fields from provider SDKs, even optional ones
  • Validation tests: TypeScript compatibility tests in tests/typescript/{provider}/

Type generation workflow

  1. Check for SDK updates in provider test directories
  2. Extract TypeScript types manually from provider SDKs
  3. Convert to Rust following consistent patterns (see pipelines/ docs)
  4. Validate compatibility through multi-layer testing
  5. Update translators to use new types

Common patterns

Rust type derivations:

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")] // when needed

TypeScript exports (for ts-rs):

#[derive(TS)]
#[ts(export, export_to = "bindings/typescript/")]

Optional fields: Always use Option<T> for optional provider fields

Union types: Convert TypeScript unions to Rust enums or separate structs
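A minimal stdlib-only sketch of the union-to-enum pattern (the enum name and method are illustrative; real types would carry Serialize/Deserialize derives, typically with #[serde(untagged)]):

```rust
// Sketch of mapping a TypeScript `string | number` union to a Rust enum.
// In the real codebase this would derive Serialize/Deserialize; this
// stdlib-only version shows the shape of the mapping.
#[derive(Debug, Clone, PartialEq)]
enum StringOrNumber {
    Text(String),
    Number(f64),
}

impl StringOrNumber {
    /// Render the value as it would appear in a JSON payload (illustration).
    fn as_payload_fragment(&self) -> String {
        match self {
            StringOrNumber::Text(s) => format!("\"{}\"", s),
            StringOrNumber::Number(n) => n.to_string(),
        }
    }
}

fn main() {
    let text = StringOrNumber::Text("auto".to_string());
    let num = StringOrNumber::Number(0.7);
    assert_eq!(text.as_payload_fragment(), "\"auto\"");
    assert_eq!(num.as_payload_fragment(), "0.7");
}
```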

Testing approach

  • Type compatibility: Verify Rust-generated TypeScript matches provider SDK types
  • Round-trip testing: Ensure lossless serialization/deserialization
  • Real API integration: Test with actual provider APIs when possible
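The round-trip property can be illustrated with a stdlib-only sketch (the real tests exercise serde serialization of the actual generated provider types; ToyMessage and its pipe-delimited encoding are invented for the example):

```rust
use std::str::FromStr;

// Toy message type standing in for a provider struct. The real round-trip
// tests serialize/deserialize the generated types with serde_json.
#[derive(Debug, Clone, PartialEq)]
struct ToyMessage {
    role: String,
    content: String,
}

impl ToyMessage {
    // Toy encoding; breaks if content contains '|', which is fine for a sketch.
    fn serialize(&self) -> String {
        format!("{}|{}", self.role, self.content)
    }
}

impl FromStr for ToyMessage {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (role, content) = s
            .split_once('|')
            .ok_or_else(|| format!("malformed message: {s}"))?;
        Ok(ToyMessage {
            role: role.to_string(),
            content: content.to_string(),
        })
    }
}

fn main() {
    let original = ToyMessage {
        role: "user".to_string(),
        content: "hi".to_string(),
    };
    // Lossless round trip: serialize, parse back, compare for equality.
    let round_tripped: ToyMessage = original.serialize().parse().expect("parses");
    assert_eq!(original, round_tripped);
}
```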

Required workflow for new import fixture cases

When adding or fixing a case under payloads/import-cases/, follow this order.

  1. Anonymize first (if needed).
    • If the span includes non-anonymized user/company/PII content, ask the user whether to anonymize now and confirm expected anonymization level.
    • Do not proceed with fixture assertions on unanonymized sensitive data unless the user explicitly approves.
  2. Generate baseline assertions from imported messages.
    • Run exactly:
    GENERATE_MISSING=1 cargo test -p lingua --test import_fixtures -- --nocapture
  3. Confirm the failing behavior intentionally.
    • Re-run the test without GENERATE_MISSING and verify the target case fails for the expected reason.
    • If expected behavior is unclear, stop and ask for clarification before implementing a fallback.
  4. Fix importer/converter logic.
    • Keep typed boundaries (typed structs/enums at provider boundaries and compatibility boundaries).
    • Add or update assertions only after the code fix, not as a substitute for the fix.
  5. Name the case by behavior, not by span ID.
    • Rename fixture pairs to descriptive kebab-case names that describe behavior (for example chat-completions-tool-role-string-content.json).
    • Avoid UUID or span_<id> filenames once behavior is understood.

Required workflow for provider behavior changes

When fixing provider transform behavior, follow this order. Do not skip steps.

  1. Add or update the payload case first.
    • For provider parameter behavior, start in payloads/cases/params.ts.
    • Use a case name that maps directly to the behavior being fixed.
  2. Capture and triage failures.
    • Run make capture FILTER=<case_name>.
    • If capture emits failed requests/transforms, treat this as unresolved logic in adapter/converter code.
    • Use transform path names to triage ownership:
      • payloads/transforms/chat-completions_to_anthropic/<case>.json → Anthropic adapter path
      • payloads/transforms/chat-completions_to_google/<case>.json → Google adapter path
      • same pattern for other providers.
  3. Write the fix plan before implementation.
    • Create plan.md in lingua/ before making code changes.
    • The plan must include:
      • root cause,
      • target files,
      • expected behavior after fix,
      • tests to add/update,
      • expected-diff impact (if any),
      • command sequence to validate.
  4. Fix adapter/converter logic first.
    • Do not use artifact regeneration as a substitute for code fixes.
    • Prefer provider adapter.rs for cross-field policy/orchestration fixes.
    • Keep typed boundaries: parse into typed structs/enums; avoid new ad-hoc raw Value access.
  5. Run targeted tests, then re-capture.
    • Run focused Rust tests for touched adapters first.
    • Re-run capture for the affected case/pair.
    • Run payload tests and sync checks.
  6. Only then update expected diffs (if intentional behavior loss remains).
    • Use narrow perTestCase entries in:
      • crates/coverage-report/src/requests_expected_differences.json
      • crates/coverage-report/src/streaming_expected_differences.json
      • crates/coverage-report/src/responses_expected_differences.json
    • Do not add broad global exceptions for case-specific behavior.

Validation command sequence

Use this exact flow after implementing a fix:

# 1) Capture and inspect this behavior
make capture FILTER=<case_name>

# 2) Run focused adapter tests
cargo test -p lingua <targeted_test_name_or_module>

# 3) Re-capture transforms for fixed behavior
make capture FILTER=<case_name>

# 4) Run payload transform checks
make test-payloads

# 5) If snapshots/transforms are stale after logic fix, regenerate failed artifacts
make regenerate-failed-transforms

# 6) Cross-provider guard
cargo test -p coverage-report --test cross_provider_test cross_provider_transformations_have_no_unexpected_failures

# 7) Typed-boundary checks
make typed-boundary-check
make typed-boundary-check-branch BASE=main

Anti-patterns

  • Do not run make regenerate-failed-transforms before fixing adapter/converter logic.
  • Do not patch transform/snapshot files manually to hide failing transforms.
  • Do not add new direct Value.get(...) assertions/logic in typed-boundary-protected paths.
  • Do not add new semantic branching in parse_lenient_* or import compatibility code using raw JSON map access; use typed compatibility structs/enums.

Development priorities

  1. Correctness over convenience: Match provider APIs exactly
  2. Type safety over flexibility: Strict typing prevents runtime errors
  3. Manual precision over automation: Control type design decisions
  4. Validation over assumptions: Test everything thoroughly

File naming conventions

  • Provider modules: src/providers/{provider}/ (e.g., openai/, anthropic/)
  • Request types: {provider}_request.rs or request.rs in provider directory
  • Response types: {provider}_response.rs or response.rs in provider directory
  • Tests: tests/typescript/{provider}/ with provider-specific validation

⚠️ CRITICAL: Never edit generated files directly

🚨 DO NOT EDIT generated.rs FILES DIRECTLY 🚨

⚠️ ABSOLUTELY FORBIDDEN - CHANGES WILL BE LOST ⚠️

Files named generated.rs are automatically generated and will be overwritten:

  • src/providers/google/generated.rs - Generated from protobuf files
  • src/providers/openai/generated.rs - Generated from OpenAPI specs
  • src/providers/anthropic/generated.rs - Generated from OpenAPI specs

ANY MANUAL CHANGES TO THESE FILES WILL BE PERMANENTLY LOST ON NEXT REGENERATION

If you need to fix issues in generated files:

  1. DO: Edit the generation logic in scripts/generate_types/main.rs
  2. DO: Add fixes to the fix_google_type_references() or similar functions
  3. DO: Regenerate using cargo run --bin generate-types <provider>
  4. DON'T: Edit the generated files directly - your changes will be lost!

⚠️ NEVER MANUALLY EDIT:

  • Any struct, enum, or type definitions in generated.rs files
  • Field types, names, or annotations in generated types
  • Serde attributes or derives in generated code

Claude Code AI Assistant: You must NEVER directly edit generated.rs files. Always use the generation pipeline and post-processing functions.

Example of proper fix approach:

// In scripts/generate_types/main.rs, in fix_google_type_references():
fn fix_google_type_references(content: String) -> String {
    let mut fixed = content;
    
    // Fix doctest JSON examples that fail to compile
    fixed = fixed.replace(
        "    /// ```\n    /// {\n    ///    \"type\": \"object\",",
        "    /// ```json\n    /// {\n    ///    \"type\": \"object\","
    );
    
    fixed
}

This ensures fixes are permanent and survive regeneration cycles.

Common gotchas

TypeScript → Rust conversions:

  • string | number unions need careful handling (usually separate enums)
  • Optional properties (field?: T) become Option<T>
  • Nested objects may need serde_json::Value for unknown structures
  • Array types become Vec<T>

Serde configuration:

  • Use rename_all = "snake_case" sparingly (only when provider uses snake_case)
  • Most providers use camelCase, so default serde behavior is correct
  • Add #[serde(skip_serializing_if = "Option::is_none")] for optional fields

Error handling guidelines

🚨 CRITICAL: Never silently swallow errors with unwrap_or_default() or unwrap_or() 🚨

Silent error handling makes debugging extremely difficult and can hide important issues. Always use explicit error propagation or logging.

❌ NEVER DO THIS:

// Dangerous - silently swallows serialization errors
serde_json::to_value(data).unwrap_or_default()
serde_json::to_string(data).unwrap_or(String::new())

✅ ALWAYS DO THIS INSTEAD:

For functions that return Result (most conversions):

// Propagate errors with proper context
serde_json::to_value(data).map_err(|e| ConvertError::JsonSerializationFailed {
    field: "field_name".to_string(),
    error: e.to_string(),
})?

// Or for String error types:
serde_json::to_string(data)
    .map_err(|e| format!("Failed to serialize field_name to JSON: {}", e))?

For filter_map closures that return Option:

// For invalid data that's already known to be invalid, use appropriate fallback
match tool_arguments {
    ToolCallArguments::Valid(map) => serde_json::Value::Object(map.clone()),
    ToolCallArguments::Invalid(s) => serde_json::Value::String(s.clone()), // Don't try to parse invalid data
}

Error types to use:

  • OpenAI conversions: Use ConvertError enum with specific variants
  • Anthropic conversions: Use descriptive String error messages
  • Always include context: field names, operation type, original data when safe

When adding new ConvertError variants:

pub enum ConvertError {
    // Existing variants...
    JsonSerializationFailed { field: String, error: String },
    // Add new specific variants as needed
}

// Update the Display impl:
ConvertError::JsonSerializationFailed { field, error } => {
    write!(f, "JSON serialization failed for field '{}': {}", field, error)
}

Testing error conditions:

  • Always test that error conditions produce meaningful error messages
  • Verify that errors propagate correctly through the call stack
  • Never ignore warnings from error handling during development
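A minimal sketch of such a test, reusing the JsonSerializationFailed variant and Display impl shown above (the assertion style and field values are illustrative):

```rust
use std::fmt;

#[derive(Debug)]
enum ConvertError {
    JsonSerializationFailed { field: String, error: String },
}

impl fmt::Display for ConvertError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConvertError::JsonSerializationFailed { field, error } => {
                write!(f, "JSON serialization failed for field '{}': {}", field, error)
            }
        }
    }
}

fn main() {
    let err = ConvertError::JsonSerializationFailed {
        field: "messages".to_string(),
        error: "unexpected EOF".to_string(),
    };
    let msg = err.to_string();
    // Error messages must name the offending field and carry the cause.
    assert!(msg.contains("field 'messages'"));
    assert!(msg.contains("unexpected EOF"));
}
```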

Pipeline maintenance

The pipelines/ directory contains automated tooling for:

  • Downloading latest OpenAPI specifications from providers
  • Generating Rust types automatically using typify
  • Building and validating generated code
  • Minimal type generation focused on chat completion APIs

Run the pipeline to update provider types:

./pipelines/generate-provider-types.sh openai

This process is fully automated and generates only essential types to minimize code size.

Development setup

Git hooks installation: After cloning the repository, install pre-commit hooks for consistent formatting:

./scripts/install-hooks.sh

This installs hooks that automatically run:

  • cargo fmt - ensures consistent formatting

Code quality checks: Clippy linting is handled by GitHub Actions CI and will run on pull requests.

  • cargo clippy - catches common issues and enforces best practices

Hooks run automatically before each commit. To bypass temporarily: git commit --no-verify

Adding new providers

Follow this step-by-step guide to add support for a new LLM provider:

1. Create provider directory structure

mkdir -p src/providers/{provider}
touch src/providers/{provider}/mod.rs
touch src/providers/{provider}/request.rs  
touch src/providers/{provider}/response.rs

2. Add feature flag to Cargo.toml

[features]
default = ["openai", "anthropic", "google", "bedrock", "{provider}"]
{provider} = ["dep:{provider-sdk}"]  # Only if external SDK needed

[dependencies]
{provider-sdk} = { version = "1.0", optional = true }  # If needed

3. Create provider types

src/providers/{provider}/request.rs:

use serde::{Deserialize, Serialize};
use ts_rs::TS;

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, TS)]
#[ts(export, export_to = "bindings/typescript/")]
pub struct {Provider}Request {
    pub messages: Vec<{Provider}Message>,
    pub model: String,
    // ... other required fields
}

// Define all necessary types following provider API exactly

src/providers/{provider}/response.rs:

use serde::{Deserialize, Serialize};
use ts_rs::TS;

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, TS)]
#[ts(export, export_to = "bindings/typescript/")]
pub struct {Provider}Response {
    pub choices: Vec<{Provider}Choice>,
    pub usage: {Provider}Usage,
    // ... other response fields
}

src/providers/{provider}/mod.rs:

/*!
{Provider} API provider types.
*/

pub mod request;
pub mod response;

pub use request::{Provider}Request;
pub use response::{Provider}Response;

4. Add conditional compilation

src/providers/mod.rs:

#[cfg(feature = "{provider}")]
pub mod {provider};

5. Create translator

src/translators/{provider}.rs:

use crate::providers::{provider}::{{Provider}Request, {Provider}Response};
use crate::translators::{TranslationResult, Translator};
use crate::universal::{SimpleMessage, SimpleRole};

pub struct {Provider}Translator;

impl Translator<{Provider}Request, {Provider}Response> for {Provider}Translator {
    fn to_provider_request(messages: Vec<SimpleMessage>) -> TranslationResult<{Provider}Request> {
        // Convert SimpleMessage to provider format
        todo!()
    }

    fn from_provider_response(response: {Provider}Response) -> TranslationResult<Vec<SimpleMessage>> {
        // Convert provider response back to SimpleMessage  
        todo!()
    }
}

// Convenience functions
pub fn to_{provider}_format(messages: Vec<SimpleMessage>) -> TranslationResult<{Provider}Request> {
    {Provider}Translator::to_provider_request(messages)
}

pub fn from_{provider}_response(response: {Provider}Response) -> TranslationResult<Vec<SimpleMessage>> {
    {Provider}Translator::from_provider_response(response)
}

6. Update translator module

src/translators/mod.rs:

#[cfg(feature = "{provider}")]
pub mod {provider};

// Re-export convenience functions
#[cfg(feature = "{provider}")]
pub use {provider}::{from_{provider}_response, to_{provider}_format};

7. Type design guidelines

Message structure:

  • Use Vec<ContentBlock> pattern for multi-modal content
  • Support text, images, tool calls as separate enum variants
  • Follow provider's exact field names and casing

Serde configuration:

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, TS)]
#[ts(export, export_to = "bindings/typescript/")]
#[serde(rename_all = "camelCase")]  // Match provider API casing
pub struct {Provider}Message {
    #[serde(skip_serializing_if = "Option::is_none")]
    pub optional_field: Option<String>,
}

Handle serde_json::Value for TypeScript:

// For unknown/flexible JSON structures
#[ts(type = "any")]
pub field: serde_json::Value,

// For fields that shouldn't appear in TypeScript
#[ts(skip)]
pub internal_field: InternalType,

8. Testing and validation

  1. Compile test: cargo check --features="{provider}"
  2. Isolation test: cargo check --no-default-features --features="{provider}"
  3. Integration test: Create simple translation examples
  4. TypeScript generation: Verify TS types are generated correctly

9. Documentation

Update README.md:

  • Add provider to feature flags section
  • Update architecture diagram
  • Add usage examples

10. Common patterns by provider type

OpenAPI-based providers (OpenAI, Anthropic):

  • Can use automated generation from specs
  • Usually have consistent REST API patterns
  • Focus on chat completion endpoints

SDK-based providers (Bedrock, Google):

  • May need to work with existing SDKs
  • Handle SDK type conversion carefully
  • Consider optional dependencies for large SDKs

Custom API providers:

  • Manual type extraction from documentation
  • Focus on core chat/completion functionality
  • Implement streaming support if available

11. Best practices

  • Start minimal: Implement basic text chat first, add features incrementally
  • Follow existing patterns: Study OpenAI and Bedrock implementations
  • Test thoroughly: Verify type compatibility and serialization
  • Document differences: Note any provider-specific quirks or limitations
  • Consider streaming: Many providers support streaming responses

This process ensures consistent provider integration while maintaining type safety and zero-runtime overhead.