Using confidence intervals to measure LLM normalization quality and detect bugs
LLMs are probabilistic systems. Even with temperature=0, outputs can vary due to:
- Model updates by the provider
- Subtle prompt changes
- Edge cases in input data
- Systematic bugs in post-processing
Traditional software testing (unit tests, integration tests) is insufficient for LLM pipelines. You need statistical validation to:
- Measure normalization quality objectively
- Detect quality drift over time
- Catch systematic bugs through anomaly detection
- Provide confidence intervals for stakeholder reporting
Definition: Percentage of fields successfully normalized without errors.
Formula:
Coverage = (Successful Normalizations / Total Fields) × 100
Example (Production):
Coverage = 4,246 / 4,280 = 99.2%
Interpretation:
- >95%: Excellent - LLM handles almost all cases
- 90-95%: Good - Minor edge cases failing
- <90%: Poor - Investigate prompt/post-processing issues
Why it matters: Coverage indicates reliability. Low coverage means more fields need manual review.
Definition: Percentage of fields that required changes (weren't already normalized).
Formula:
Improvement Rate = (Fields Changed / Total Fields) × 100
Example (Production):
Improvement Rate = 3,013 / 4,280 = 70.4%
Interpretation:
- 60-80%: Expected for user-submitted data (lots of variation)
- <30%: Data already clean OR normalization not aggressive enough
- >90%: Suspicious - possible bug causing over-normalization
Why it matters: Validates that normalization is actually doing useful work.
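The two formulas above and their interpretation bands can be sketched in a few lines. This is a minimal illustration, not the pipeline's code: the names `percentage`, `classifyCoverage`, and `classifyImprovement` are hypothetical, and improvement rates that fall between the listed bands (30-60% and 80-90%) are reported as unclassified because the lists above do not cover them.

```javascript
// Shared helper: a count expressed as a percentage of the total.
function percentage(count, total) {
  if (total <= 0) throw new RangeError('total must be positive');
  return (count / total) * 100;
}

// Coverage bands from the interpretation list above.
function classifyCoverage(pct) {
  if (pct > 95) return 'excellent';
  if (pct >= 90) return 'good';
  return 'poor';
}

// Improvement-rate bands; values outside the listed bands are 'unclassified'.
function classifyImprovement(pct) {
  if (pct > 90) return 'suspicious';
  if (pct < 30) return 'low';
  if (pct >= 60 && pct <= 80) return 'expected';
  return 'unclassified';
}

const coveragePct = percentage(4246, 4280);
const improvementPct = percentage(3013, 4280);
console.log(coveragePct.toFixed(1), classifyCoverage(coveragePct));          // 99.2 excellent
console.log(improvementPct.toFixed(1), classifyImprovement(improvementPct)); // 70.4 expected
```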
Definition: Statistical range within which the true improvement rate lies.
Formula (Binomial proportion):
CI = p ± z * √(p(1-p)/n)
Where:
p = sample proportion (improvement rate)
z = 1.96 (for 95% confidence)
n = sample size (total fields)
Example (Production):
p = 0.704 (70.4%)
n = 4,280
SE = √(0.704 × 0.296 / 4280) = 0.00697
Margin = 1.96 × 0.00697 = 0.0137 (1.37%)
CI = 70.4% ± 1.4%
= [69.0%, 71.8%]
Interpretation: We can say with 95% confidence that the true improvement rate is between 69.0% and 71.8%.
Why it matters: Provides statistical rigor for reporting to stakeholders. A bare "70% improvement" hides how precise the estimate is; "70.4% ± 1.4% (95% CI)" states both the estimate and its uncertainty.
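The formula above is the normal-approximation (Wald) interval. It tends to be less reliable when the proportion sits close to 0% or 100%, which is the case for a coverage figure near 99%. The Wilson score interval is a common alternative for that situation. The sketch below shows both side by side; `wilsonInterval` is an addition for comparison and is not something the original pipeline is stated to use.

```javascript
// Wald (normal-approximation) interval: the formula used in this document.
function waldInterval(successes, n, z = 1.96) {
  const p = successes / n;
  const margin = z * Math.sqrt((p * (1 - p)) / n);
  return { lower: p - margin, upper: p + margin };
}

// Wilson score interval: stays inside [0, 1] and behaves better
// for proportions near the boundaries.
function wilsonInterval(successes, n, z = 1.96) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lower: center - half, upper: center + half };
}

console.log(waldInterval(3013, 4280));   // improvement rate: roughly 0.690 to 0.718
console.log(wilsonInterval(4246, 4280)); // coverage, using the Wilson alternative
```

For mid-range proportions with thousands of observations the two intervals nearly coincide, so the choice mostly matters for coverage and error rates.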
| Metric | Value | 95% CI |
|---|---|---|
| Total Leads | 652 | - |
| Total Fields | 4,280 (652 × 7 fields - missing data) | - |
| Successfully Normalized | 4,246 | - |
| Coverage | 99.2% | [98.8%, 99.5%] |
| Fields Requiring Changes | 3,013 | - |
| Improvement Rate | 70.4% | [69.0%, 71.8%] |
| Errors | 34 (0.8%) | - |
| Field | Total | Normalized | Improvement Rate | 95% CI | Interpretation |
|---|---|---|---|---|---|
| nombres | 652 | 647 | 3.8% (25/652) | [2.4%, 5.2%] | ✅ Mostly already formatted |
| apellidos | 652 | 648 | 5.2% (34/652) | [3.6%, 6.8%] | ✅ Mostly already formatted |
| ciudad | 652 | 650 | 55.8% (364/652) | [52.0%, 59.6%] | ✅ Expected (many variants) |
| direccion | 652 | 643 | 65.7% (428/652) | [62.0%, 69.4%] | ⚠️ Unusually high for a structured field |
| nivelEducativo | 648 | 645 | 78.4% (507/648) | [75.2%, 81.6%] | ✅ Expected (free text) |
| ocupacionActual | 612 | 605 | 82.5% (505/612) | [79.5%, 85.5%] | ✅ Expected (abbreviations) |
| empresa | 612 | 608 | 72.9% (444/612) | [69.4%, 76.4%] | ✅ Expected (capitalization) |
Red flag detected: direccion improvement rate (65.7%) was unusually high for a structured field.
Expected behavior: Addresses should show an improvement rate of roughly 15-25% (for example, fixing "CRA 15 NO 100 25" → "Cra. 15 # 100 - 25").
Actual behavior: 65.7% improvement suggests post-processing is changing already-formatted addresses.
Step 1: Calculate per-field improvement rates
const stats = {
  // rate is stored as a proportion (0.038 = 3.8%)
  nombres: { total: 652, changed: 25, rate: 0.038 },
  apellidos: { total: 652, changed: 34, rate: 0.052 },
  ciudad: { total: 652, changed: 364, rate: 0.558 },
  direccion: { total: 652, changed: 428, rate: 0.657 }, // ← Outlier!
  nivelEducativo: { total: 648, changed: 507, rate: 0.784 },
  ocupacionActual: { total: 612, changed: 505, rate: 0.825 },
  empresa: { total: 612, changed: 444, rate: 0.729 }
};
Step 2: Flag outliers
Expected improvement rates:
- Names: 0-10% (mostly already formatted)
- Cities: 50-60% (many variants: "bogota", "BOGOTÁ", etc.)
- Addresses: 15-25% (structured format variations) ← Expected
- Education/Job: 70-85% (free text with abbreviations)
Actual:
- Addresses: 65.7% ← 3x higher than expected!
Step 3: Manual spot-checking
Randomly sample 20 addresses from normalizedData:
// Before (original):
"Cra. 15 # 100 - 25"
// After (normalized):
"Cra. . 15 # 100 - 25" // ← Double dot!
Found the bug in 18 out of 20 samples (90% affected).
Step 4: Root cause analysis
// In prompts.js - Original buggy code:
function normalizeAddress(address) {
  return address
    .replace(/\b(cra)\b\s*/gi, 'Cra. ') // ← BUG: never consumes an existing dot, so "Cra." → "Cra. ."
    .replace(/\bno\b\.?\s*/gi, '# ');
}
// Flow:
// Input: "Cra. 15 # 100 - 25" (already formatted from LLM)
// Regex: /\b(cra)\b\s*/gi matches only "Cra" and leaves the "." behind
// Replace: "Cra" with "Cra. " → "Cra. . 15 # 100 - 25"
Step 5: Fix and validate
// Fixed code:
function normalizeAddress(address) {
  return address
    // Match "cra" OR "cra." but replace with "Cra. " only once;
    // (?!\p{L}) keeps words such as "Cruz" from matching the "cr" prefix
    .replace(/\b(carrera|cra|cr|kra)(?!\p{L})\.?\s*/giu, 'Cra. ') // ← Fixed
    .replace(/\bno\b\.?\s*/gi, '# ')
    // Clean up any accidental double dots
    .replace(/\.\s*\./g, '.'); // Safety net
}
Step 6: Re-normalize and measure
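# Force re-normalization of all 652 leads
# (AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload)
aws lambda invoke \
  --function-name normalize-leads \
  --cli-binary-format raw-in-base64-out \
  --payload '{"forceAll": true}' \
  response.json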
# New statistics:
# direccion improvement rate: 18.2% (119/652)
# ✅ Within expected 15-25% range
Before fix:
Improvement Rate = 65.7% ± 3.7% (standard error 1.86%)
Z-score = (65.7 - 20) / 1.86 = 24.6 (highly significant!)
After fix:
Improvement Rate = 18.2% ± 3.0% (standard error 1.51%)
Z-score = (18.2 - 20) / 1.51 = -1.2 (within expected range)
The z-scores divide by the standard error (the ± margin divided by 1.96), not by the margin itself. The fix brought the improvement rate from roughly 25 standard errors above the expected 20% midpoint to about 1 standard error below it, statistically confirming the bug was resolved.
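The double-dot bug was at heart an idempotence failure: running post-processing on an already-normalized address changed it again. A cheap regression guard is to assert that a second pass is a no-op. The sketch below is illustrative only: `findNonIdempotent` and the sample inputs are hypothetical, `normalizeAddress` follows the Step 5 fix with a `(?!\p{L})` lookahead added as a precaution so that words beginning with "cr" are left alone, and `buggy` is a reconstruction of a pattern that never consumes an existing dot.

```javascript
// Post-processing in the spirit of the Step 5 fix.
function normalizeAddress(address) {
  return address
    .replace(/\b(carrera|cra|cr|kra)(?!\p{L})\.?\s*/giu, 'Cra. ')
    .replace(/\bno\b\.?\s*/gi, '# ')
    .replace(/\.\s*\./g, '.');
}

// A normalizer is idempotent when a second pass changes nothing.
// Returns the inputs for which that property fails.
function findNonIdempotent(normalize, inputs) {
  return inputs.filter((input) => {
    const once = normalize(input);
    return normalize(once) !== once;
  });
}

// Illustrative inputs: raw, spelled-out, and already-formatted addresses.
const addresses = ['CRA 15 NO 100 25', 'carrera 7 no. 45-10', 'Cra. 15 # 100 - 25'];
console.log(findNonIdempotent(normalizeAddress, addresses)); // []

// A pattern that does not consume an existing dot fails the same check.
const buggy = (a) => a.replace(/\b(cra)\b\s*/gi, 'Cra. ');
console.log(findNonIdempotent(buggy, addresses).length > 0); // true
```

A check like this can run in ordinary unit tests, so it complements the statistical monitoring rather than replacing it.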
Add to Lambda handler:
export const handler = async (event) => {
  // `leads`, `config`, and the helpers called below are assumed to be
  // defined elsewhere in the module.
  const metrics = {
    totalFields: 0,
    normalized: 0, // fields whose value changed
    unchanged: 0,  // fields left as-is
    errors: 0,     // failed fields; not incremented in this excerpt
    byField: {}
  };
  for (const lead of leads) {
    const fieldsData = extractFields(lead);
    const normalized = await normalizeLead(lead, config.fieldsToNormalize);
    // Count per field so that every counter uses the same unit (fields, not leads)
    for (const field of Object.keys(fieldsData)) {
      if (!metrics.byField[field]) {
        metrics.byField[field] = { total: 0, changed: 0 };
      }
      metrics.byField[field].total++;
      metrics.totalFields++;
      if (fieldsData[field] !== normalized[field]) {
        metrics.byField[field].changed++;
        metrics.normalized++;
      } else {
        metrics.unchanged++;
      }
    }
  }
  // Calculate improvement rates with confidence intervals
  const report = generateStatisticalReport(metrics);
  console.log(JSON.stringify(report, null, 2));
  return successResponse(report);
};
function calculateConfidenceInterval(successes, total, confidenceLevel = 0.95) {
  // z critical values for the supported two-sided confidence levels
  const zByLevel = { 0.9: 1.645, 0.95: 1.96, 0.99: 2.576 };
  const z = zByLevel[confidenceLevel];
  if (z === undefined) {
    throw new RangeError(`Unsupported confidence level: ${confidenceLevel}`);
  }
  const p = successes / total;
  const se = Math.sqrt(p * (1 - p) / total);
  const margin = z * se;
  return {
    point: p,
    lower: Math.max(0, p - margin),
    upper: Math.min(1, p + margin),
    margin: margin,
    confidenceLevel: confidenceLevel
  };
}
function generateStatisticalReport(metrics) {
  const report = {
    summary: {
      totalFields: metrics.totalFields,
      normalized: metrics.normalized,
      // Coverage: fields processed without error
      coverage: calculateConfidenceInterval(
        metrics.totalFields - metrics.errors,
        metrics.totalFields
      ),
      // Improvement rate: fields whose value changed
      improvementRate: calculateConfidenceInterval(
        metrics.normalized,
        metrics.totalFields
      )
    },
    byField: {}
  };
  for (const [field, stats] of Object.entries(metrics.byField)) {
    report.byField[field] = {
      total: stats.total,
      changed: stats.changed,
      improvementRate: calculateConfidenceInterval(
        stats.changed,
        stats.total
      )
    };
  }
  return report;
}
function detectAnomalies(report) {
  const anomalies = [];
  // Expected improvement rates per field type
  const expected = {
    nombres: { min: 0, max: 0.10 },
    apellidos: { min: 0, max: 0.10 },
    ciudad: { min: 0.40, max: 0.70 },
    direccion: { min: 0.10, max: 0.30 }, // ← Key constraint
    nivelEducativo: { min: 0.60, max: 0.90 },
    ocupacionActual: { min: 0.70, max: 0.90 },
    empresa: { min: 0.60, max: 0.85 }
  };
  for (const [field, stats] of Object.entries(report.byField)) {
    const rate = stats.improvementRate.point;
    const expectedRange = expected[field];
    if (expectedRange) {
      if (rate < expectedRange.min || rate > expectedRange.max) {
        anomalies.push({
          field,
          actualRate: rate,
          expectedRange,
          severity: Math.abs(rate - (expectedRange.min + expectedRange.max) / 2) > 0.2 ? 'HIGH' : 'MEDIUM'
        });
      }
    }
  }
  return anomalies;
}
if (anomalies.length > 0) {
  console.error('Anomalies detected:', anomalies);
  // Send SNS notification
  await sns.publish({
    TopicArn: process.env.ALERT_TOPIC_ARN,
    Subject: 'Normalization Quality Anomaly Detected',
    Message: JSON.stringify({
      message: 'Statistical analysis detected unexpected improvement rates',
      anomalies,
      action: 'Manual review required'
    }, null, 2)
  }).promise();
}
Sample size: 20 fields
Improvement rate: 50% (10/20)
95% CI: [28.1%, 71.9%]
Interpretation: Not enough data to be confident. Could be anywhere from 28% to 72%.
Action: Collect more data before making conclusions.
Sample size: 650 fields
Improvement rate: 70.4% (458/650)
95% CI: [67.0%, 74.0%]
Interpretation: High confidence that true rate is around 70%.
Action: Reliable metric for reporting to stakeholders.
Before optimization: 70.4% ± 1.8% = [68.6%, 72.2%]
After optimization: 72.1% ± 1.9% = [70.2%, 74.0%]
Interpretation: The intervals overlap, and a z-test on the difference gives z ≈ 1.27 (below the 1.96 threshold) → difference is NOT statistically significant.
Action: Don't claim improvement unless more data shows separation.
Before bug fix: 65.7% ± 3.7% = [62.0%, 69.4%]
After bug fix: 18.2% ± 3.0% = [15.2%, 21.2%]
Interpretation: No overlap → difference is statistically significant.
Action: Bug fix confirmed effective with 95% confidence.
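Checking whether two intervals overlap is a quick but conservative test: intervals can overlap slightly while the difference is still significant. A more direct check is a z statistic on the difference itself. The sketch below recovers each standard error from the reported 95% margin (margin = 1.96 × SE) and assumes the two samples are independent; the function name `zFromMargins` is hypothetical.

```javascript
// z statistic for the difference between two proportions, recovered from
// their reported margins of error (margin = zCrit * standard error).
// Inputs are in percentage points, so the result is unit-free.
function zFromMargins(p1, margin1, p2, margin2, zCrit = 1.96) {
  const se1 = margin1 / zCrit;
  const se2 = margin2 / zCrit;
  return (p2 - p1) / Math.sqrt(se1 * se1 + se2 * se2);
}

const optimization = zFromMargins(70.4, 1.8, 72.1, 1.9);
const bugFix = zFromMargins(65.7, 3.7, 18.2, 3.0);

console.log(optimization.toFixed(2), Math.abs(optimization) > 1.96); // 1.27 false
console.log(bugFix.toFixed(1), Math.abs(bugFix) > 1.96);             // -19.5 true
```

Both results agree with the overlap-based readings above: the optimization is not yet distinguishable from noise, while the bug fix is.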
- Prompt changes: New examples, rule adjustments
- Model updates: Claude Haiku version update
- Bug fixes: Post-processing pipeline changes (like double-dot fix)
- Quality drift: Coverage drops below 95%
- Scheduled: Every 7 days (TTL) for fresh data
// Force re-normalization via API
POST /admin/normalize-leads
{
  "forceAll": true,
  "reason": "Bug fix: double-dot in addresses"
}
// Response:
{
  "message": "Re-normalization complete",
  "leadsProcessed": 652,
  "fieldsNormalized": 4280,
  "duration": 186000,
  "statistics": {
    "coverage": { "point": 0.992, "lower": 0.988, "upper": 0.995 },
    "improvementRate": { "point": 0.182, "lower": 0.152, "upper": 0.212 }
  }
}
❌ Bad: "Normalization improved 70% of fields"
✅ Good: "Normalization improved 70.4% ± 1.4% of fields (95% CI)"
// Store in DynamoDB for trend analysis
{
  runId: "2026-01-24T07:00:00Z",
  coverage: 0.992,
  coverageCI: [0.988, 0.995],
  improvementRate: 0.704,
  improvementRateCI: [0.690, 0.718],
  byField: { ... }
}
Define acceptable ranges per field based on data characteristics:
const EXPECTED_RANGES = {
  nombres: { min: 0.00, max: 0.10 },  // Names already formatted
  ciudad: { min: 0.40, max: 0.70 },   // Many city variants
  direccion: { min: 0.10, max: 0.30 } // Structured format
};
function calculateZScore(actual, expected, sampleSize) {
  const expectedMean = (expected.min + expected.max) / 2;
  const expectedSD = (expected.max - expected.min) / 4; // Rough estimate
  const SE = Math.sqrt(actual * (1 - actual) / sampleSize);
  return (actual - expectedMean) / Math.max(expectedSD, SE);
}
// Example:
// direccion: actual = 0.657, expected = [0.10, 0.30]
// z = (0.657 - 0.20) / 0.05 = 9.14 (highly unusual!)

| Sample Size | 95% CI Margin of Error (worst case, p = 0.5) | Use Case |
|---|---|---|
| 20-50 | ±14-22% | Pilot testing |
| 100-500 | ±4-10% | Development validation |
| 500-1000 | ±3-4% | Production monitoring |
| >1000 | <±3% (below ±2% from about 2,400) | High-precision reporting |
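The relationship between sample size and interval width follows directly from the binomial formula: margin = z × √(p(1-p)/n), which is largest at p = 0.5. As a rough planning aid, the sketch below computes the worst-case margin for a given n and the sample size needed for a target margin. The function names are hypothetical.

```javascript
// Margin of error for a proportion at sample size n (worst case when p = 0.5).
function marginOfError(n, p = 0.5, z = 1.96) {
  return z * Math.sqrt((p * (1 - p)) / n);
}

// Smallest n whose margin of error does not exceed the target (as a proportion).
function requiredSampleSize(targetMargin, p = 0.5, z = 1.96) {
  return Math.ceil((z * z * p * (1 - p)) / (targetMargin * targetMargin));
}

for (const n of [20, 50, 100, 500, 1000]) {
  console.log(n, (marginOfError(n) * 100).toFixed(1) + '%');
}
console.log(requiredSampleSize(0.05)); // 385
```

When the true rate is far from 50%, as with the 3.8% rate for `nombres`, the same sample size gives a noticeably narrower interval, so the worst-case figures are upper bounds.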
## Data Normalization Quality Report
**Period**: January 1-24, 2026
### Overall Performance
- **Records Processed**: 652 leads
- **Success Rate**: 99.2% (95% CI: 98.8%-99.5%)
- **Improvement Rate**: 70.4% (95% CI: 69.0%-71.8%)
### Interpretation
With 95% confidence, we can state that:
- At least 98.8% of data is successfully normalized
- Between 69.0% and 71.8% of fields required normalization
### Quality Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Coverage | >95% | 99.2% ± 0.4% | ✅ Exceeds |
| Improvement Rate | 60-80% | 70.4% ± 1.4% | ✅ On Target |
| Error Rate | <5% | 0.8% ± 0.3% | ✅ Exceeds |
### Anomalies Detected
1. **Address normalization** (Jan 23): Improvement rate 3x expected
   - **Cause**: Double-dot bug in post-processing
   - **Resolution**: Regex pattern updated, all records re-normalized
   - **Validation**: New rate 18.2% ± 3.0% (within expected 15-25%)

Statistical validation transforms LLM normalization from a "black box" to a measurable, trustworthy process.
Key takeaways:
- Always use confidence intervals for reporting
- Track per-field metrics to detect anomalies
- Set expected ranges based on data characteristics
- Re-normalize when needed (bugs, model updates, prompt changes)
- Monitor trends over time to catch quality drift early
The double-dot bug discovery proves the value of this approach: without statistical analysis, we might never have noticed that post-processing was corrupting already-formatted addresses and inflating the direccion improvement rate to 65.7%.
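The last takeaway, monitoring trends, can be sketched as a simple check over stored run records. This is an assumption-laden illustration: the `runs` array is made-up data shaped like the DynamoDB item shown earlier, `detectDrift` is a hypothetical helper, and the rule (alert when coverage falls below the 95% floor, or when a run's interval lies entirely below the previous run's interval) is one reasonable choice among many.

```javascript
// Hypothetical run history, oldest first.
const runs = [
  { runId: '2026-01-10T07:00:00Z', coverage: 0.992, coverageCI: [0.989, 0.995] },
  { runId: '2026-01-17T07:00:00Z', coverage: 0.991, coverageCI: [0.988, 0.994] },
  { runId: '2026-01-24T07:00:00Z', coverage: 0.962, coverageCI: [0.956, 0.968] },
];

// Flag a run when coverage is below the floor, or when its whole interval
// sits below the previous run's interval (a drop unlikely to be noise).
function detectDrift(history, floor = 0.95) {
  const alerts = [];
  for (let i = 0; i < history.length; i++) {
    const run = history[i];
    if (run.coverage < floor) {
      alerts.push({ runId: run.runId, reason: 'coverage below floor' });
    } else if (i > 0 && run.coverageCI[1] < history[i - 1].coverageCI[0]) {
      alerts.push({ runId: run.runId, reason: 'coverage dropped vs previous run' });
    }
  }
  return alerts;
}

console.log(detectDrift(runs));
```

In this made-up history only the last run is flagged: its coverage is still above 95%, but its interval no longer overlaps the previous one.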
- LESSONS-LEARNED.md: Production insights and pitfalls
- COST-ANALYSIS.md: Cost optimization strategies
Last Updated: January 24, 2026