Real-world insights from implementing LLM-powered data normalization at scale
This pattern was implemented for an educational program registration system in Colombia, processing 652 leads with 4,280 fields normalized over a 2-week period. Here's what worked, what didn't, and what we'd do differently.
Decision: Use Haiku instead of Sonnet for cost savings.
Outcome:
- 12x cheaper than Sonnet ($0.043 vs $0.48 for 652 leads)
- 2-3s latency per batch (acceptable for overnight processing)
- Sufficient quality for structured normalization tasks
Lesson: For structured tasks (normalization, classification, extraction), Haiku is underrated. Save Sonnet for creative/reasoning tasks.
Evidence:
Cost per lead: $0.000066 (6.6 cents per 1000 leads)
Quality: 99.2% success rate
Stakeholder satisfaction: 10/10
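For context, here is a minimal sketch of what the `callClaude()` helper used throughout this post might look like with Haiku on Bedrock. The model ID and request shape follow the standard Bedrock Anthropic Messages API; the parameter values are assumptions, not our exact production code:

```javascript
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const bedrockClient = new BedrockRuntimeClient({ region: 'us-east-1' });

export async function callClaude(prompt) {
  const response = await bedrockClient.send(new InvokeModelCommand({
    modelId: 'anthropic.claude-3-haiku-20240307-v1:0', // Haiku on Bedrock
    contentType: 'application/json',
    accept: 'application/json',
    body: JSON.stringify({
      anthropic_version: 'bedrock-2023-05-31',
      max_tokens: 1024,     // assumed budget for a normalization batch
      temperature: 0,       // as deterministic as the model allows
      messages: [{ role: 'user', content: prompt }]
    })
  }));
  // Bedrock returns the body as bytes; decode and parse it
  const responseBody = JSON.parse(new TextDecoder().decode(response.body));
  return responseBody.content[0].text;
}
```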
Decision: Don't trust LLM outputs blindly - add regex post-processing.
Outcome:
- Surfaced one regression of its own: a post-processing regex (the double-dot bug, covered below) broke 65.7% of addresses the LLM had normalized correctly
- Fixed inconsistent capitalization (LLM returned "bogota" vs "Bogotá")
- Enforced exact formats (always "Cra." with dot, never "Cra")
Lesson: LLMs + deterministic rules = best of both worlds.
Code pattern:

```javascript
// Layer 1: LLM (context-aware)
const llmOutput = await callClaude(prompt);

// Layer 2: Post-processing (deterministic)
const finalOutput = postProcessField(fieldName, llmOutput);
```

Decision: Store normalized data in a separate normalizedData attribute; preserve the originals.
Outcome:
- Rolled back double-dot bug without data loss
- A/B tested different normalization rules
- Compared before/after for quality validation
Lesson: Always preserve original data. Storage is cheap, lost data is expensive.
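A minimal sketch of what such a non-destructive write might look like with the DynamoDB DocumentClient (table and key names are assumptions): the update expression touches only `normalizedData` and `normalizedAt`, never the original fields.

```javascript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, UpdateCommand } from '@aws-sdk/lib-dynamodb';

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function saveNormalizedData(leadId, normalizedData) {
  await docClient.send(new UpdateCommand({
    TableName: 'Leads', // assumed table name
    Key: { leadId },
    // Only the normalized copy is written; originals stay untouched
    UpdateExpression: 'SET normalizedData = :nd, normalizedAt = :ts',
    ExpressionAttributeValues: {
      ':nd': normalizedData,
      ':ts': Date.now(),
    },
  }));
}
```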
Schema:

```javascript
{
  leadId: "abc123",
  nombres: "JUAN CARLOS",      // Original - never modified
  normalizedData: {
    nombres: "Juan Carlos"     // Normalized - can be regenerated
  },
  normalizedAt: 1706000000000
}
```

Decision: Track improvement rates per field with confidence intervals.
Outcome: Detected the double-dot bug that affected 428 of 652 addresses (65.7%).
How it worked:
- Expected improvement rate for addresses: 15-25%
- Actual improvement rate: 65.7% ± 3.7%
- Z-score: 12.3 (a highly significant outlier)
- Action: manual spot-check → bug found → fixed → re-normalized
Lesson: LLMs are probabilistic. Treat quality as a statistical process, not a binary pass/fail.
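A sketch of this kind of statistical check (the anomaly threshold, expected rate, and exact z-score formula are assumptions; our production tooling differs in the details):

```javascript
// Flag a field when its improvement rate is a statistical outlier
// versus the expected rate. Uses a normal approximation throughout.
function checkImprovementRate(field, changedCount, totalCount, expectedRate) {
  const p = changedCount / totalCount;
  const margin = 1.96 * Math.sqrt((p * (1 - p)) / totalCount); // 95% CI half-width
  const zScore = (p - expectedRate) /
    Math.sqrt((expectedRate * (1 - expectedRate)) / totalCount);

  if (Math.abs(zScore) > 3) {
    console.warn(
      `Anomaly in ${field}: ${(p * 100).toFixed(1)}% ± ${(margin * 100).toFixed(1)}% (z=${zScore.toFixed(1)})`
    );
  }
  return { rate: p, margin, zScore };
}

// Example: the double-dot bug - 428 of 652 addresses changed, ~20% expected
checkImprovementRate('direccion', 428, 652, 0.20);
```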
Decision: Send 10 leads per Bedrock API call instead of 1 lead per call.
Outcome:
- Input tokens: 1,300 per batch vs 13,000 for 10 individual calls (10x reduction in prompt overhead)
- API calls: 65 calls vs 650 calls (10x reduction in API overhead)
- Latency: 130s vs 1,300s (10x faster)
Lesson: Batch aggressively. Prompts have fixed overhead - amortize it.
Evidence:
- Individual calls: 650 calls × $0.0004 = $0.26
- Batch calls: 65 calls × $0.0004 = $0.026 (10x cheaper - a batch call costs roughly the same as a single-lead call because the fixed prompt dominates the token count)
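A sketch of the batching loop (`buildBatchPrompt` is a hypothetical helper; assume the prompt asks for a JSON array with one object per lead):

```javascript
const BATCH_SIZE = 10;

// Split leads into groups of BATCH_SIZE
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// One API call per batch: the fixed instruction overhead is paid
// once per 10 leads instead of once per lead.
async function normalizeLeads(leads) {
  const results = [];
  for (const batch of chunk(leads, BATCH_SIZE)) {
    const prompt = buildBatchPrompt(batch); // hypothetical prompt builder
    const response = await callClaude(prompt);
    // In production, run this through the defensive parser shown later
    results.push(...JSON.parse(response));
  }
  return results;
}
```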
Problem: Claude sometimes returned:
- "Cra. 15" vs "Cra 15" (missing dot)
- "Bogotá" vs "Bogota" (missing accent)
- "Juan Carlos" vs "Juan Carlos" (double space)
Why it failed: Temperature=0 gives repeatable output for the same input, but formatting still varies across similar inputs.
Fix: Added post-processing pipeline to enforce exact formats.
Before:

```javascript
// LLM-only approach
const normalized = await callClaude(prompt);
return normalized; // Inconsistent outputs
```

After:

```javascript
// Dual-layer approach
const llmOutput = await callClaude(prompt);
const normalized = postProcessField(fieldName, llmOutput);
return normalized; // Consistent outputs
```

Problem: The regex `.replace(/\b(cra)\s*/gi, 'Cra. ')`, applied to the already-formatted "Cra.", produced "Cra. ."
Why it failed: The pattern matched "Cra" without consuming the existing dot, so the replacement "Cra. " was inserted in front of it, yielding "Cra. .".
Fix: Updated the regex to consume an optional existing dot, with a safety net:

```javascript
// Buggy: does not consume an existing dot
.replace(/\b(cra)\s*/gi, 'Cra. ')

// Fixed: matches common variants and consumes an optional existing dot
.replace(/\b(carrera|cra|cr|kra)\.?\s*/gi, 'Cra. ')
.replace(/\.\s*\./g, '.') // Safety net: collapse double dots
```

Lesson: Test post-processing with already-formatted data, not just raw data.
Test case added:

```javascript
test('handles address with existing dot (prevents double-dot bug)', () => {
  const response = '{"direccion": "Cra. 80 I # 51 - 09"}';
  const result = parseNormalizationResponse(response);
  expect(result.direccion).toBe('Cra. 80 I # 51 - 09');
  expect(result.direccion).not.toContain('. .'); // ← Critical assertion
});
```

Problem: The LLM returned "Santafe de Bogota" (a historical name), but our mapping only had "Bogota".
Why it failed: Didn't anticipate historical/alternative city names.
Fix: Expanded city mappings with variants:

```javascript
const CITY_MAPPINGS = {
  'bogota': 'Bogota D.C.',
  'bogotá': 'Bogota D.C.',
  'santafe de bogota': 'Bogota D.C.', // ← Added
  'bogota d.c': 'Bogota D.C.',
  'bogota dc': 'Bogota D.C.'
};
```

Lesson: Build mappings iteratively. Start with common variants, add edge cases as you find them.
Problem: Claude sometimes wrapped JSON in markdown code blocks:
```json
{"nombres": "Juan Carlos"}
```
Why it failed: JSON.parse() can't handle markdown.
Fix: Added markdown stripping in the parser:

````javascript
export function parseNormalizationResponse(responseText) {
  let jsonStr = responseText;

  // Remove markdown code blocks if present
  const jsonMatch = responseText.match(/```(?:json)?\s*([\s\S]*?)\s*```/);
  if (jsonMatch) {
    jsonStr = jsonMatch[1];
  } else {
    // Try to find a raw JSON object
    const objectMatch = responseText.match(/\{[\s\S]*\}/);
    if (objectMatch) {
      jsonStr = objectMatch[0];
    }
  }
  return JSON.parse(jsonStr);
}
````

Lesson: LLMs don't follow instructions perfectly. Make parsers defensive.
Problem: First invocation took 5-6 seconds (Bedrock SDK initialization).
Why it failed: Clients initialized inside handler function.
Fix: Moved client initialization outside handler:
```javascript
// ❌ Slow (initialized on every invocation)
export const handler = async (event) => {
  const bedrockClient = new BedrockRuntimeClient({ region: 'us-east-1' });
  // ...
};

// ✅ Fast (initialized once per Lambda container)
const bedrockClient = new BedrockRuntimeClient({ region: 'us-east-1' });
export const handler = async (event) => {
  // Client already initialized
};
```

Lesson: Always initialize SDK clients outside the handler for Lambda container reuse.

Impact:
- Cold start: 5s → 2s (60% reduction)
- Warm invocations: 3s → <1s (~70% reduction)
✅ Good:
```
Responde UNICAMENTE con un JSON valido:
{
  "campo1": "valor normalizado"
}
```
❌ Bad:
Normalize this data and return the results.
Lesson: Be explicit about output format. LLMs follow patterns, not vague instructions.
✅ Good:
Ejemplo: "CRA 15 NO 100 25" -> "Cra. 15 # 100 - 25"
❌ Bad:
Use standard Colombian address format with proper abbreviations.
Lesson: Examples are more effective than rules. Show, don't tell.
✅ Good: Prompt in Spanish (data is Spanish)
Normaliza los siguientes campos...
❌ Bad: Prompt in English (data is Spanish)
Normalize the following Spanish fields...
Lesson: Match prompt language to data language for better context.
❌ Bad:
If the field is empty or invalid, return an empty string.
Problem: LLM often returned null or omitted the field instead.
Fix: Handle empty/null in post-processing:

```javascript
if (typeof value === 'string' && value.trim() !== '') {
  normalized[key] = postProcessField(key, value);
}
```

❌ Bad:
If the city name is not a valid Colombian city, return the original value.
Problem: The LLM doesn't have a comprehensive database of Colombian cities - it made up invalid ones instead of returning the original.
Fix: Moved validation to post-processing with hardcoded list.
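A sketch of that post-processing check, reusing the `CITY_MAPPINGS` table shown earlier (the fallback behavior is an assumption):

```javascript
function normalizeCity(llmValue, originalValue) {
  const key = (llmValue || '').trim().toLowerCase();
  if (CITY_MAPPINGS[key]) {
    return CITY_MAPPINGS[key]; // known city: return the canonical form
  }
  return originalValue; // unknown city: keep the original, don't trust the LLM
}
```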
❌ Bad:
First, check if the value is already normalized. If yes, return it unchanged.
If no, apply the following rules: 1) Capitalize first letter, 2) Remove extra spaces...
Problem: LLM got confused with complex logic.
Fix: Simplified prompt, moved complex logic to post-processing.
What worked:
- Token usage logging: Caught cost spikes early
- Per-field success rates: Identified which fields needed prompt tuning
- CloudWatch metrics: Duration, errors, throttling
What we added later:
- Statistical reports: Improvement rates with confidence intervals
- Anomaly detection: Alerts when rates deviate from expected ranges
Code:

```javascript
console.log('Token usage:', {
  input: responseBody.usage?.input_tokens || 0,
  output: responseBody.usage?.output_tokens || 0
});

// Alert if tokens spike
if (responseBody.usage?.input_tokens > 2000) {
  console.warn('High token usage - investigate prompt size');
}
```

What worked:
- Batch processing: 10x cost reduction
- TTL-based re-normalization: Only re-normalize every 7 days vs daily (sketched below)
- Max leads per run: Safety limit of 50 leads per execution
What we wish we'd done:
- Cost alerts: Set CloudWatch billing alarm at $1/day
- Token budgets: Fail-fast if approaching budget limit
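A sketch of the TTL guard and per-run cap (the constants match the numbers above; field names follow the schema shown earlier):

```javascript
const RENORMALIZE_TTL_MS = 7 * 24 * 60 * 60 * 1000; // re-normalize at most weekly
const MAX_LEADS_PER_RUN = 50;                       // safety limit per execution

function selectLeadsToNormalize(leads) {
  const now = Date.now();
  return leads
    // Skip leads normalized within the TTL window
    .filter((lead) => !lead.normalizedAt || now - lead.normalizedAt > RENORMALIZE_TTL_MS)
    // Cap each run so a bad deploy can't burn the whole budget
    .slice(0, MAX_LEADS_PER_RUN);
}
```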
What worked:
- Unit tests for prompt generation and parsing
- Manual spot-checking of 20 random normalizations
- Statistical validation to catch systematic bugs
What we should have done better:
- Regression tests with real production data
- Edge case database: Maintain a list of known problematic inputs
Test example:

```javascript
test('handles already-formatted data without changes', () => {
  const input = {
    nombres: "Juan Carlos",
    direccion: "Cra. 15 # 100 - 25",
    ciudad: "Bogota D.C."
  };
  const result = normalizeLead(input);
  // Should NOT mark as changed if already formatted
  expect(result.changedFields).toEqual([]);
});
```

❌ Mistake: `return await callClaude(prompt);` - trusting raw LLM output.
✅ Fix: Add validation and post-processing layer.
❌ Mistake: Overwriting original fields with normalized values.
✅ Fix: Store normalized data separately (normalizedData attribute).
❌ Mistake: "Tests pass, ship it!"
✅ Fix: Track improvement rates and confidence intervals.
❌ Mistake: Default to most powerful model.
✅ Fix: Test with Haiku first, upgrade to Sonnet only if quality insufficient.
❌ Mistake: `const client = new BedrockRuntimeClient(...)` inside the handler.
✅ Fix: Initialize outside handler for container reuse.
❌ Mistake: Only test with messy data ("CRA 15 NO 100 25").
✅ Fix: Test with clean data ("Cra. 15 # 100 - 25") to catch over-normalization.
❌ Mistake: Hardcoding field lists and batch sizes in environment variables.
✅ Fix: Store in DynamoDB for no-deploy config changes.
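A sketch of the DynamoDB-backed config (the table name, key, and defaults are assumptions):

```javascript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand } from '@aws-sdk/lib-dynamodb';

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Read field lists and batch size at runtime - changing them is a
// DynamoDB write, not a redeploy.
async function loadConfig() {
  const { Item } = await docClient.send(new GetCommand({
    TableName: 'AppConfig',             // assumed table
    Key: { configId: 'normalization' }, // assumed key
  }));
  return {
    fieldsToNormalize: Item?.fieldsToNormalize ?? ['nombres', 'direccion', 'ciudad'],
    batchSize: Item?.batchSize ?? 10,
  };
}
```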
| Metric | Target | Actual | Action if Missed |
|---|---|---|---|
| Coverage | >95% | 99.2% | Investigate failures, improve error handling |
| Improvement Rate | 60-80% | 70.4% | Too low = prompt not aggressive enough; too high = over-normalizing |
| Error Rate | <5% | 0.8% | Debug failures, add edge cases to tests |
| Metric | Target | Actual | Action if Exceeded |
|---|---|---|---|
| Cost per lead | <$0.001 | $0.000066 | Switch to Haiku, increase batch size |
| Monthly cost | <$1 | $0.043 | Under target (scales to roughly $2/month at 30K leads) |
| Metric | Target | Actual | Action if Missed |
|---|---|---|---|
| Duration | <5 min | 3 min | Reduce batch size or max leads per run |
| Cold start | <3s | 2s | Acceptable for overnight batch job |
| Failures | 0 | 0 | Retry logic, error handling |
- Start small: Test with 50-100 records before full deployment
- Use Haiku: Upgrade to Sonnet only if quality insufficient
- Batch aggressively: 10-20 records per API call
- Preserve originals: Non-destructive storage is mandatory
- Add post-processing: Don't trust LLM outputs blindly
- Monitor costs: Set CloudWatch billing alarms
- Track statistics: Improvement rates with confidence intervals
- Detect anomalies: Alert when rates deviate from expected
- Version prompts: Track changes to prompts in git
- A/B test: Compare normalization quality before/after changes
- Add comprehensive logging: Token usage, field success rates, errors
- Set up alerts: Cost spikes, quality drift, failures
- Document edge cases: Maintain a list of known problematic inputs
- Plan for re-normalization: Have a strategy for prompt/model updates
- Communicate with stakeholders: Use confidence intervals in reports
- Haiku was better than expected: 99.2% success rate for $0.043
- Statistical validation caught bugs: Double-dot bug would have gone unnoticed
- Non-destructive storage paid off: Rolled back bug without data loss
- Batch processing was easy: 10x cost/speed improvement with minimal code
- Post-processing bugs were sneaky: Double-dot bug affected 65.7% of data silently
- LLM consistency wasn't perfect: Even at temp=0, needed validation
- Edge cases emerged slowly: City/institution variants appeared over weeks
What we'd keep:
- Claude 3 Haiku (perfect cost/quality balance)
- Post-processing pipeline (essential for consistency)
- Statistical validation (caught critical bugs)
- Non-destructive storage (enabled experimentation)
- Batch processing (10x efficiency gain)
What we'd change:
- Test with already-formatted data earlier (would have caught double-dot bug)
- Set cost alerts from day 1 (peace of mind)
- Build edge case database proactively (not reactively)
Bottom line: This pattern works exceptionally well for structured data normalization. The dual-layer approach (LLM + post-processing) achieves 99.2% quality at $0.000066 per record - a cost/quality ratio that's hard to beat.
- COST-ANALYSIS.md: Detailed cost breakdown and optimization strategies
- README.md: Pattern overview and quick start
Last Updated: January 24, 2026