JSON Validation
Why LLMs produce broken JSON
LLMs are probabilistic text generators. Even when instructed to output JSON, they occasionally:
- Wrap the JSON in markdown code fences (`` ```json ... ``` ``)
- Add explanatory text before or after the JSON
- Include trailing commas in objects or arrays
- Omit closing braces or brackets
- Use single quotes instead of double quotes
- Truncate the response mid-object
Without a repair layer, any of these failures crashes your pipeline.
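To make these failure modes concrete, the sketch below feeds one hand-written sample per failure mode into the built-in `JSON.parse` and records which ones it rejects. The sample strings and the `parsesCleanly` helper are illustrative assumptions, not captured model output or part of the library.

```typescript
// Illustrative samples only: one hand-written string per failure mode listed above.
const brokenSamples: Record<string, string> = {
  fenced: '```json\n{"name": "Ada"}\n```',
  leadingText: 'Here is the JSON you asked for: {"name": "Ada"}',
  trailingComma: '{"name": "Ada",}',
  missingBrace: '{"basics": {"name": "Ada"}',
  singleQuotes: "{'name': 'Ada'}",
  truncated: '{"name": "Ada", "email": "ada@exa',
};

// Returns true when the built-in parser accepts the string as-is.
function parsesCleanly(raw: string): boolean {
  try {
    JSON.parse(raw);
    return true;
  } catch {
    return false;
  }
}

// Names of the samples that JSON.parse rejects.
const rejected: string[] = Object.entries(brokenSamples)
  .filter(([, raw]) => !parsesCleanly(raw))
  .map(([name]) => name);

console.log(`${rejected.length} of ${Object.keys(brokenSamples).length} samples rejected`);
```

Strict parsers reject every one of these, which is why the raw output is repaired before it is parsed.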
The three-stage pipeline
Stage 1 — Structural repair
Before validation, the raw LLM output passes through a repair pipeline:
- Strip markdown fences: removes `` ```json ``, `` ``` ``, and similar wrappers
- Strip leading text: removes any text before the first `{` or `[`
- Strip trailing text: removes any text after the last `}` or `]`
- `jsonrepair`: fixes trailing commas, missing brackets, single quotes, unescaped characters, and 100+ other common patterns
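The three strip steps can be sketched without any dependencies. This is a minimal illustration, not the library's actual code: `stripFences`, `stripSurroundingText`, and `preClean` are hypothetical names, and the final `jsonrepair` step is deliberately left out.

```typescript
// Hypothetical helpers sketching the strip steps; not the library's actual code.

// Remove a leading ```json / ``` marker and a trailing ``` fence, if present.
function stripFences(raw: string): string {
  return raw
    .trim()
    .replace(/^```[a-zA-Z]*\s*/, "")
    .replace(/\s*```$/, "");
}

// Keep only the span from the first { or [ to the last } or ].
function stripSurroundingText(raw: string): string {
  const start = raw.search(/[{\[]/);
  if (start === -1) return raw; // no JSON-like opener found; leave untouched
  const end = Math.max(raw.lastIndexOf("}"), raw.lastIndexOf("]"));
  // If no closer follows the opener (truncated output), keep everything
  // after the opener so a later repair step can still try to close it.
  return end > start ? raw.slice(start, end + 1) : raw.slice(start);
}

// Apply the strip steps in the order described above.
function preClean(raw: string): string {
  return stripSurroundingText(stripFences(raw));
}

const chatty =
  'Sure! Here you go:\n```json\n{"name": "Ada"}\n```\nLet me know if you need changes.';
console.log(preClean(chatty)); // {"name": "Ada"}
```

In the pipeline described here, the cleaned string would then go through `jsonrepair`, which handles defects inside the JSON itself, such as trailing commas and single quotes.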
Stage 2 — Zod validation
The repaired string is parsed and validated against the JSON Resume v1 Zod schema. If it passes, the result is returned immediately.
Stage 3 — Self-correcting retry
If validation fails, the Zod issues (field path and message) are formatted into a correction prompt and sent back to the LLM:
CORRECTION REQUIRED (Attempt 1):
Validation failed:
- "work.0.startDate" : Expected string, received number
- "basics.email" : Invalid email
Return ONLY the corrected JSON. Stop immediately after the closing brace.
The LLM is asked to fix exactly the fields that failed. This loop runs up to maxRetries times (default: 3).
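A minimal sketch of such a loop follows. It is an assumption about the overall shape, not the library's implementation: `Issue`, `ValidationResult`, `formatCorrectionPrompt`, and `extractWithRetry` are hypothetical names, `callLLM` and `validate` are injected placeholders, and how the previous output is fed back into the prompt is a guess.

```typescript
// One validation issue; mirrors the path/message fields Zod puts on each issue.
interface Issue {
  path: (string | number)[];
  message: string;
}

type ValidationResult<T> =
  | { success: true; data: T }
  | { success: false; issues: Issue[] };

// Build a correction prompt in the format shown above.
function formatCorrectionPrompt(issues: Issue[], attempt: number): string {
  const lines = issues.map((i) => `- "${i.path.join(".")}" : ${i.message}`);
  return [
    `CORRECTION REQUIRED (Attempt ${attempt}):`,
    "Validation failed:",
    ...lines,
    "Return ONLY the corrected JSON. Stop immediately after the closing brace.",
  ].join("\n");
}

// Hypothetical loop: `callLLM` and `validate` are placeholders, not library APIs.
async function extractWithRetry<T>(
  initialPrompt: string,
  callLLM: (prompt: string) => Promise<string>,
  validate: (raw: string) => ValidationResult<T>,
  maxRetries = 3,
): Promise<T> {
  let raw = await callLLM(initialPrompt);
  let result = validate(raw);
  for (let attempt = 1; !result.success && attempt <= maxRetries; attempt++) {
    const correction = formatCorrectionPrompt(result.issues, attempt);
    // Assumption: the previous output is sent back alongside the correction.
    raw = await callLLM(`${initialPrompt}\n\nPrevious output:\n${raw}\n\n${correction}`);
    result = validate(raw);
  }
  if (!result.success) {
    throw new Error(`Validation still failing after ${maxRetries} retries`);
  }
  return result.data;
}
```

Injecting `callLLM` and `validate` keeps the loop independent of any particular model client or schema, so it can be exercised with stubs.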
Using the Zod schema directly
import { JsonResumeSchema } from '@edwinfom/resume-intel'

// Validate your own data without throwing
const result = JsonResumeSchema.safeParse(myData)
if (!result.success) {
  console.error(result.error.issues)
}

// Parse and throw on invalid data
const data = JsonResumeSchema.parse(myData)

Deduplication
After extraction, array fields are deduplicated by composite key to remove entries that appear multiple times (common in multi-page scanned CVs where headers repeat):
| Field | Deduplication key |
|---|---|
| `work` | `name` + `position` + `startDate` |
| `education` | `institution` + `studyType` + `startDate` |
| `skills` | `name` |
| `languages` | `language` (case-insensitive) |
| `projects` | `name` |
| `certificates` | `name` + `issuer` |
| `awards` | `title` + `awarder` |
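Composite-key deduplication of this kind can be sketched with a `Set` of already-seen keys. The helper names `dedupeBy` and `compositeKey`, the trimming of whitespace, and the choice to keep the first occurrence are assumptions made for illustration. Only the key fields come from the table above.

```typescript
// Generic helper: keep the first item seen for each key.
function dedupeBy<T>(items: T[], keyOf: (item: T) => string): T[] {
  const seen = new Set<string>();
  return items.filter((item) => {
    const key = keyOf(item);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Join key parts with a separator that should not occur in CV text.
// Missing parts become "" and surrounding whitespace is ignored (an assumption).
const compositeKey = (...parts: (string | undefined)[]): string =>
  parts.map((p) => (p ?? "").trim()).join("\u0000");

interface WorkEntry {
  name?: string;
  position?: string;
  startDate?: string;
}

interface LanguageEntry {
  language?: string;
  fluency?: string;
}

// work: name + position + startDate
const dedupeWork = (work: WorkEntry[]): WorkEntry[] =>
  dedupeBy(work, (w) => compositeKey(w.name, w.position, w.startDate));

// languages: language, compared case-insensitively
const dedupeLanguages = (langs: LanguageEntry[]): LanguageEntry[] =>
  dedupeBy(langs, (l) => compositeKey(l.language?.toLowerCase()));
```

For example, two `work` entries that share the same name, position, and start date collapse into one, while a later role at the same company with a different position is kept.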
Reliability
In production testing across text-native and scanned CVs:
- Stage 1 alone resolves ~70% of malformed outputs
- Stage 1 + 2 resolves ~85%
- Stage 1 + 2 + 3 (1 retry) resolves ~97%
- Stage 1 + 2 + 3 (3 retries) resolves ~99%+