
Language: English | Español

Statistical Validation Methodology

Using confidence intervals to measure LLM normalization quality and detect bugs

Overview

LLMs are probabilistic systems. Even with temperature=0, outputs can vary due to:

  • Model updates by the provider
  • Subtle prompt changes
  • Edge cases in input data
  • Systematic bugs in post-processing

Traditional software testing (unit tests, integration tests) is insufficient for LLM pipelines. You need statistical validation to:

  1. Measure normalization quality objectively
  2. Detect quality drift over time
  3. Catch systematic bugs through anomaly detection
  4. Provide confidence intervals for stakeholder reporting

Key Metrics

1. Coverage (Success Rate)

Definition: Percentage of fields successfully normalized without errors.

Formula:

Coverage = (Successful Normalizations / Total Fields) × 100

Example (Production):

Coverage = 4,246 / 4,280 = 99.2%

Interpretation:

  • >95%: Excellent - LLM handles almost all cases
  • 90-95%: Good - Minor edge cases failing
  • <90%: Poor - Investigate prompt/post-processing issues

Why it matters: Indicates reliability. Low coverage means manual review needed.

2. Improvement Rate

Definition: Percentage of fields that required changes (weren't already normalized).

Formula:

Improvement Rate = (Fields Changed / Total Fields) × 100

Example (Production):

Improvement Rate = 3,013 / 4,280 = 70.4%

Interpretation:

  • 60-80%: Expected for user-submitted data (lots of variation)
  • <30%: Data already clean OR normalization not aggressive enough
  • >90%: Suspicious - possible bug causing over-normalization

Why it matters: Validates that normalization is actually doing useful work.

3. Confidence Interval (95%)

Definition: Statistical range within which the true improvement rate lies.

Formula (Binomial proportion):

CI = p ± z * √(p(1-p)/n)

Where:
  p = sample proportion (improvement rate)
  z = 1.96 (for 95% confidence)
  n = sample size (total fields)

Example (Production):

p = 0.704 (70.4%)
n = 4,280
SE = √(0.704 × 0.296 / 4280) = 0.00697
Margin = 1.96 × 0.00697 = 0.0137 (1.37%)

CI = 70.4% ± 1.4%
   = [69.0%, 71.8%]

Interpretation: We can say with 95% confidence that the true improvement rate is between 69.0% and 71.8%.

Why it matters: Provides statistical rigor for reporting to stakeholders. "70% improvement" sounds vague; "70% ± 1.4% (95% CI)" sounds authoritative.
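The same arithmetic is easy to reproduce in JavaScript. Below is a minimal sketch of the normal-approximation interval using the production counts quoted above (a fuller calculateConfidenceInterval helper appears in the implementation section later in this document):

// Normal-approximation (Wald) confidence interval for a binomial proportion
function binomialCI(successes, total, z = 1.96) {
  const p = successes / total;
  const margin = z * Math.sqrt(p * (1 - p) / total);
  return { point: p, lower: p - margin, upper: p + margin };
}

// Production figures from above
console.log(binomialCI(3013, 4280)); // improvement rate ≈ { point: 0.704, lower: 0.690, upper: 0.718 }
console.log(binomialCI(4246, 4280)); // coverage ≈ { point: 0.992, lower: 0.989, upper: 0.995 }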

Production Results (Real Data)

Overall Statistics

| Metric | Value | 95% CI |
|--------|-------|--------|
| Total Leads | 652 | - |
| Total Fields | 4,280 (652 leads × 7 fields, minus missing values) | - |
| Successfully Normalized | 4,246 | - |
| Coverage | 99.2% | [98.8%, 99.5%] |
| Fields Requiring Changes | 3,013 | - |
| Improvement Rate | 70.4% | [69.0%, 71.8%] |
| Errors | 34 (0.8%) | - |

Per-Field Analysis

| Field | Total | Normalized | Improvement Rate | 95% CI | Interpretation |
|-------|-------|------------|------------------|--------|----------------|
| nombres | 652 | 647 | 3.8% (25/652) | [2.4%, 5.2%] | ✅ Mostly already formatted |
| apellidos | 652 | 648 | 5.2% (34/652) | [3.6%, 6.8%] | ✅ Mostly already formatted |
| ciudad | 652 | 650 | 55.8% (364/652) | [52.0%, 59.6%] | ✅ Expected (many variants) |
| direccion | 652 | 643 | 65.7% (428/652) | [62.0%, 69.4%] | ⚠️ Anomaly detected |
| nivelEducativo | 648 | 645 | 78.4% (507/648) | [75.2%, 81.6%] | ✅ Expected (free text) |
| ocupacionActual | 612 | 605 | 82.5% (505/612) | [79.5%, 85.5%] | ✅ Expected (abbreviations) |
| empresa | 612 | 608 | 72.9% (444/612) | [69.4%, 76.4%] | ✅ Expected (capitalization) |

Anomaly Detection: The Double-Dot Bug

Red flag detected: direccion improvement rate (65.7%) was unusually high for a structured field.

Expected behavior: Addresses should see roughly 15-25% improvement (fixing "CRA 15 NO 100 25" → "Cra. 15 # 100 - 25").

Actual behavior: 65.7% improvement suggests post-processing is changing already-formatted addresses.

Bug Detection Case Study

Discovery Process

Step 1: Calculate per-field improvement rates

// Per-field improvement rates (changed / total)
const stats = {
  nombres:         { total: 652, changed: 25,  rate: 0.038 },
  apellidos:       { total: 652, changed: 34,  rate: 0.052 },
  ciudad:          { total: 652, changed: 364, rate: 0.558 },
  direccion:       { total: 652, changed: 428, rate: 0.657 },  // ← Outlier!
  nivelEducativo:  { total: 648, changed: 507, rate: 0.784 },
  ocupacionActual: { total: 612, changed: 505, rate: 0.825 },
  empresa:         { total: 612, changed: 444, rate: 0.729 }
};

Step 2: Flag outliers

Expected improvement rates:
- Names: 0-10% (mostly already formatted)
- Cities: 50-60% (many variants: "bogota", "BOGOTÁ", etc.)
- Addresses: 15-25% (structured format variations)  ← Expected
- Education/Job: 70-85% (free text with abbreviations)

Actual:
- Addresses: 65.7%  ← 3x higher than expected!
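The same comparison can be scripted directly against the stats object from Step 1. A minimal sketch (the bounds below are drawn from the expectations above and the detectAnomalies helper shown later, so treat them as illustrative):

const expectedRanges = {
  nombres: [0.00, 0.10], apellidos: [0.00, 0.10],
  ciudad: [0.50, 0.60], direccion: [0.15, 0.25],
  nivelEducativo: [0.70, 0.85], ocupacionActual: [0.70, 0.85],
  empresa: [0.60, 0.85]
};

for (const [field, { rate }] of Object.entries(stats)) {
  const [min, max] = expectedRanges[field] ?? [0, 1];
  if (rate < min || rate > max) {
    console.warn(`${field}: ${(rate * 100).toFixed(1)}% outside expected ${min * 100}-${max * 100}%`);
  }
}
// → "direccion: 65.7% outside expected 15-25%"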

Step 3: Manual spot-checking

Randomly sample 20 addresses from normalizedData:

// Before (original):
"Cra. 15 # 100 - 25"

// After (normalized):
"Cra. . 15 # 100 - 25"  // ← Double dot!

Found the bug in 18 out of 20 samples (90% affected).
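To make this spot check repeatable, a small sampling helper can print before/after pairs for review. This is a sketch: it assumes normalizedData is an array of { original, normalized } objects per lead, so adapt the property access to however your pipeline stores both versions.

// Spot-check helper: print a random sample of before/after values for one field
function sampleForReview(normalizedData, field, sampleSize = 20) {
  const shuffled = [...normalizedData].sort(() => Math.random() - 0.5); // crude shuffle, fine for spot checks
  for (const entry of shuffled.slice(0, sampleSize)) {
    console.log(`BEFORE: ${entry.original[field]}`);
    console.log(`AFTER:  ${entry.normalized[field]}\n`);
  }
}

sampleForReview(normalizedData, 'direccion');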

Step 4: Root cause analysis

// In prompts.js - Original buggy code:
function normalizeAddress(address) {
  return address
    .replace(/\b(cra)\s*/gi, 'Cra. ')   // ← BUG: does not consume an existing "." after "Cra"
    .replace(/\bno\b\.?\s*/gi, '# ');
}

// Flow:
Input:   "Cra. 15 # 100 - 25"  (already formatted by the LLM)
Regex:   /\b(cra)\s*/gi matches only "Cra", leaving the original "." in place
Replace: "Cra" → "Cra. "  →  "Cra. . 15 # 100 - 25"  (double dot)

Step 5: Fix and validate

// Fixed code:
function normalizeAddress(address) {
  return address
    // Match "cra" OR "cra." but replace with "Cra. " only once
    .replace(/\b(carrera|cra|cr|kra)\.?\s*/gi, 'Cra. ')  // ← Fixed
    .replace(/\bno\b\.?\s*/gi, '# ')
    // Clean up any accidental double dots
    .replace(/\.\s*\./g, '.');  // Safety net
}
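As a quick regression test, the fixed helper should leave already-formatted addresses untouched while still expanding raw abbreviations. A minimal sanity check using Node's built-in assert (assuming normalizeAddress is exported from prompts.js):

import assert from 'node:assert';
// Assumes normalizeAddress is exported from prompts.js
import { normalizeAddress } from './prompts.js';

// Idempotence: an already-formatted address passes through unchanged (no double dot)
assert.strictEqual(
  normalizeAddress('Cra. 15 # 100 - 25'),
  'Cra. 15 # 100 - 25'
);

// Raw abbreviations are still expanded
assert.strictEqual(
  normalizeAddress('cra 15 # 100 - 25'),
  'Cra. 15 # 100 - 25'
);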

Step 6: Re-normalize and measure

# Force re-normalization of all 652 leads
aws lambda invoke \
  --function-name normalize-leads \
  --payload '{"forceAll": true}' \
  response.json

# New statistics:
# direccion improvement rate: 18.2% (119/652)
# ✅ Within expected 15-25% range

Statistical Confirmation

Before fix:

Improvement Rate = 65.7% ± 3.7%
Z-score = (65.7 - 20) / 3.7 = 12.3  (scaled by the CI half-width; highly significant!)

After fix:

Improvement Rate = 18.2% ± 3.0%
Z-score = (18.2 - 20) / 3.0 = -0.6  (within expected range)

The fix brought the improvement rate from more than 12 confidence-interval half-widths above the expected ~20% to well within one, statistically confirming that the bug was resolved.

How to Implement Statistical Validation

1. Track Normalization Metrics

Add to Lambda handler:

export const handler = async (event) => {
  // leads, extractFields, normalizeLead, hashFields, config and successResponse
  // are defined elsewhere in the handler module.
  const metrics = {
    totalFields: 0,   // fields seen
    normalized: 0,    // fields processed without error
    changed: 0,       // fields whose value actually changed
    errors: 0,        // fields that failed normalization
    leadsChanged: 0,  // leads where at least one field changed
    byField: {}
  };

  for (const lead of leads) {
    const fieldsData = extractFields(lead);

    let normalized;
    try {
      normalized = await normalizeLead(lead, config.fieldsToNormalize);
    } catch (err) {
      // Count every field of a failed lead as an error and move on
      metrics.totalFields += Object.keys(fieldsData).length;
      metrics.errors += Object.keys(fieldsData).length;
      continue;
    }

    // Lead-level check: did anything on this lead change at all?
    if (hashFields(fieldsData) !== hashFields(normalized)) {
      metrics.leadsChanged++;
    }

    // Track per-field changes
    for (const field of Object.keys(fieldsData)) {
      if (!metrics.byField[field]) {
        metrics.byField[field] = { total: 0, changed: 0 };
      }
      metrics.byField[field].total++;
      metrics.totalFields++;
      metrics.normalized++;

      if (fieldsData[field] !== normalized[field]) {
        metrics.byField[field].changed++;
        metrics.changed++;
      }
    }
  }

  // Calculate improvement rates with confidence intervals
  const report = generateStatisticalReport(metrics);
  console.log(JSON.stringify(report, null, 2));

  return successResponse(report);
};

2. Calculate Confidence Intervals

function calculateConfidenceInterval(successes, total, confidenceLevel = 0.95) {
  const p = successes / total;
  const z = confidenceLevel === 0.95 ? 1.96 : 2.576; // 1.96 for 95% CI, 2.576 for 99% CI
  const se = Math.sqrt(p * (1 - p) / total);
  const margin = z * se;

  return {
    point: p,
    lower: Math.max(0, p - margin),
    upper: Math.min(1, p + margin),
    margin: margin,
    confidenceLevel: confidenceLevel
  };
}

function generateStatisticalReport(metrics) {
  const report = {
    summary: {
      totalFields: metrics.totalFields,
      normalized: metrics.normalized,
      errors: metrics.errors,
      // Coverage: fields processed without error
      coverage: calculateConfidenceInterval(
        metrics.normalized,
        metrics.totalFields
      ),
      // Improvement rate: fields whose value actually changed
      improvementRate: calculateConfidenceInterval(
        metrics.changed,
        metrics.totalFields
      )
    },
    byField: {}
  };

  for (const [field, stats] of Object.entries(metrics.byField)) {
    report.byField[field] = {
      total: stats.total,
      changed: stats.changed,
      improvementRate: calculateConfidenceInterval(
        stats.changed,
        stats.total
      )
    };
  }

  return report;
}

3. Detect Anomalies

function detectAnomalies(report) {
  const anomalies = [];

  // Expected improvement rates per field type
  const expected = {
    nombres: { min: 0, max: 0.10 },
    apellidos: { min: 0, max: 0.10 },
    ciudad: { min: 0.40, max: 0.70 },
    direccion: { min: 0.10, max: 0.30 },  // ← Key constraint
    nivelEducativo: { min: 0.60, max: 0.90 },
    ocupacionActual: { min: 0.70, max: 0.90 },
    empresa: { min: 0.60, max: 0.85 }
  };

  for (const [field, stats] of Object.entries(report.byField)) {
    const rate = stats.improvementRate.point;
    const expectedRange = expected[field];

    if (expectedRange) {
      if (rate < expectedRange.min || rate > expectedRange.max) {
        anomalies.push({
          field,
          actualRate: rate,
          expectedRange,
          severity: Math.abs(rate - (expectedRange.min + expectedRange.max) / 2) > 0.2 ? 'HIGH' : 'MEDIUM'
        });
      }
    }
  }

  return anomalies;
}

4. Alert on Anomalies

// sns is an initialized AWS.SNS client; report comes from generateStatisticalReport()
const anomalies = detectAnomalies(report);

if (anomalies.length > 0) {
  console.error('Anomalies detected:', anomalies);

  // Send SNS notification
  await sns.publish({
    TopicArn: process.env.ALERT_TOPIC_ARN,
    Subject: 'Normalization Quality Anomaly Detected',
    Message: JSON.stringify({
      message: 'Statistical analysis detected unexpected improvement rates',
      anomalies,
      action: 'Manual review required'
    }, null, 2)
  }).promise();
}

Interpreting Confidence Intervals

Example 1: Wide Interval (Low Confidence)

Sample size: 20 fields
Improvement rate: 50% (10/20)
95% CI: [27.1%, 72.9%]

Interpretation: Not enough data to be confident. Could be anywhere from 27% to 73%.

Action: Collect more data before making conclusions.

Example 2: Narrow Interval (High Confidence)

Sample size: 650 fields
Improvement rate: 70.4% (458/650)
95% CI: [67.0%, 73.8%]

Interpretation: High confidence that true rate is around 70%.

Action: Reliable metric for reporting to stakeholders.

Example 3: Overlapping Intervals (No Significant Difference)

Before optimization: 70.4% ± 1.8% = [68.6%, 72.2%]
After optimization:  72.1% ± 1.9% = [70.2%, 74.0%]

Interpretation: Intervals overlap → difference is NOT statistically significant.

Action: Don't claim improvement unless more data shows separation.

Example 4: Non-Overlapping Intervals (Significant Difference)

Before bug fix: 65.7% ± 3.7% = [62.0%, 69.4%]
After bug fix:  18.2% ± 3.0% = [15.2%, 21.2%]

Interpretation: No overlap → difference is statistically significant.

Action: Bug fix confirmed effective with 95% confidence.
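The overlap check is simple to automate once both intervals come from calculateConfidenceInterval. A small sketch, using the direccion counts from the case study above:

// Two intervals that do not overlap imply a statistically significant difference.
// (The converse is not true: overlapping intervals do not prove the rates are equal.)
function isSignificantDifference(ciA, ciB) {
  return ciA.upper < ciB.lower || ciB.upper < ciA.lower;
}

const beforeFix = calculateConfidenceInterval(428, 652); // direccion changes before the fix
const afterFix  = calculateConfidenceInterval(119, 652); // direccion changes after the fix
console.log(isSignificantDifference(beforeFix, afterFix)); // true → fix confirmed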

When to Re-Normalize

Triggers for Re-Normalization

  1. Prompt changes: New examples, rule adjustments
  2. Model updates: Claude Haiku version update
  3. Bug fixes: Post-processing pipeline changes (like double-dot fix)
  4. Quality drift: Coverage drops below 95%
  5. Scheduled: Every 7 days (TTL) for fresh data

Re-Normalization Strategy

// Force re-normalization via API
POST /admin/normalize-leads
{
  "forceAll": true,
  "reason": "Bug fix: double-dot in addresses"
}

// Response:
{
  "message": "Re-normalization complete",
  "leadsProcessed": 652,
  "fieldsNormalized": 4280,
  "duration": 186000,
  "statistics": {
    "coverage": { point: 0.992, lower: 0.988, upper: 0.995 },
    "improvementRate": { point: 0.182, lower: 0.152, upper: 0.212 }
  }
}
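Trigger #4 (quality drift) can be automated in the same spirit: compare the latest run's coverage against the 95% threshold and call the endpoint above when it drops. A minimal sketch, assuming the per-run statistics record shown under Best Practices below and a hypothetical ADMIN_API_URL environment variable:

const COVERAGE_THRESHOLD = 0.95;

async function checkForQualityDrift(latestRun) {
  // Use the lower bound of the CI so one noisy run does not trigger re-normalization
  if (latestRun.coverageCI[0] >= COVERAGE_THRESHOLD) return;

  // Hypothetical admin endpoint base URL; adapt to your deployment
  await fetch(`${process.env.ADMIN_API_URL}/admin/normalize-leads`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      forceAll: true,
      reason: `Quality drift: coverage ${latestRun.coverage} below ${COVERAGE_THRESHOLD}`
    })
  });
}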

Best Practices

1. Always Report Confidence Intervals

❌ Bad: "Normalization improved 70% of fields"
✅ Good: "Normalization improved 70.4% ± 1.4% of fields (95% CI)"

2. Track Metrics Over Time

// Store in DynamoDB for trend analysis
{
  runId: "2026-01-24T07:00:00Z",
  coverage: 0.992,
  coverageCI: [0.988, 0.995],
  improvementRate: 0.704,
  improvementRateCI: [0.690, 0.718],
  byField: { ... }
}

3. Set Expected Ranges

Define acceptable ranges per field based on data characteristics:

const EXPECTED_RANGES = {
  nombres: { min: 0.00, max: 0.10 },  // Names already formatted
  ciudad: { min: 0.40, max: 0.70 },   // Many city variants
  direccion: { min: 0.10, max: 0.30 } // Structured format
};

4. Use Z-Scores for Outlier Detection

function calculateZScore(actual, expected, sampleSize) {
  const expectedMean = (expected.min + expected.max) / 2;
  const expectedSD = (expected.max - expected.min) / 4; // Rough estimate
  const SE = Math.sqrt(actual * (1 - actual) / sampleSize);

  return (actual - expectedMean) / Math.max(expectedSD, SE);
}

// Example:
// direccion: actual = 0.657, expected = [0.10, 0.30]
// z = (0.657 - 0.20) / 0.05 = 9.14 (highly unusual!)

5. Sample Size Guidelines

| Sample Size | Confidence Interval Width | Use Case |
|-------------|---------------------------|----------|
| 20-50 | ±10-15% | Pilot testing |
| 100-500 | ±5-10% | Development validation |
| 500-1000 | ±2-5% | Production monitoring |
| >1000 | <±2% | High-precision reporting |
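The table can also be read in reverse: given a target interval half-width, the required sample size follows from n = z² · p(1−p) / margin². A small helper (p = 0.5 is the worst case when the true rate is unknown):

// Required sample size for a target CI half-width, normal approximation
function requiredSampleSize(margin, p = 0.5, z = 1.96) {
  return Math.ceil((z * z * p * (1 - p)) / (margin * margin));
}

console.log(requiredSampleSize(0.05)); // ±5% → 385 fields
console.log(requiredSampleSize(0.02)); // ±2% → 2401 fields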

Reporting to Stakeholders

Executive Summary Template

## Data Normalization Quality Report
**Period**: January 1-24, 2026

### Overall Performance
- **Records Processed**: 652 leads
- **Success Rate**: 99.2% (95% CI: 98.8%-99.5%)
- **Improvement Rate**: 70.4% (95% CI: 69.0%-71.8%)

### Interpretation
With 95% confidence, we can state that:
- At least 98.8% of data is successfully normalized
- Between 69.0% and 71.8% of fields required normalization

### Quality Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Coverage | >95% | 99.2% ± 0.4% | ✅ Exceeds |
| Improvement Rate | 60-80% | 70.4% ± 1.4% | ✅ On Target |
| Error Rate | <5% | 0.8% ± 0.3% | ✅ Exceeds |

### Anomalies Detected
1. **Address normalization** (Jan 23): Improvement rate 3x expected
   - **Cause**: Double-dot bug in post-processing
   - **Resolution**: Regex pattern updated, all records re-normalized
   - **Validation**: New rate 18.2% ± 3.0% (within expected 15-25%)

Conclusion

Statistical validation transforms LLM normalization from a "black box" to a measurable, trustworthy process.

Key takeaways:

  1. Always use confidence intervals for reporting
  2. Track per-field metrics to detect anomalies
  3. Set expected ranges based on data characteristics
  4. Re-normalize when needed (bugs, model updates, prompt changes)
  5. Monitor trends over time to catch quality drift early

The double-dot bug discovery proves the value of this approach: without statistical analysis, we might never have noticed 65.7% of addresses were being corrupted.

Next Steps


Last Updated: January 24, 2026