@edwinfom/resume-intel
LLM-first resume parsing infrastructure. Model-agnostic · Spatial extraction · OCR fallback · PII redaction · Streaming · JSON Resume compatible
What's new in v0.2.1
- redactPii option: redacts email, phone, addresses, and URLs before sending to the LLM. Real values are reinjected after extraction. GDPR-friendly.
- confidenceScore in sectionResults: per-section reliability signal (0.0–1.0) based on retry count and field completeness.
- redactPii, reinjectPii, describePiiRedaction exports: utility functions for custom redaction pipelines.
See the full Changelog for details.
The Problem
Extracting structured data from resume PDFs is harder than it looks. Most tools either use brittle regex patterns that break on modern CV designs, or they wrap a single AI provider's API and lock you in forever.
Here's what actually goes wrong in practice:
- Multi-column layouts — column 1 and column 2 get interleaved. Dates mix with job descriptions. The LLM receives semantic chaos and hallucinates.
- DeepSeek and similar models can't read raw PDFs — they are text-only models. You must extract and clean the text first, or every call fails silently.
- LLMs produce broken JSON — missing closing braces, trailing commas, JSON wrapped in markdown fences. Without a repair layer, your pipeline crashes.
- Vendor lock-in — if your parser only works with Claude or GPT-4, you can't switch to a cheaper or local model without rewriting your integration.
- Scanned PDFs — no text layer at all. A basic extractor returns an empty string and you never know why.
- Data privacy — sending raw CVs with personal emails and phone numbers to third-party LLM APIs may violate GDPR or internal data policies.
resume-intel is a four-layer pipeline that solves all of this:
import { parseResume } from '@edwinfom/resume-intel'
import { createDeepSeek } from '@ai-sdk/deepseek'
import { readFileSync } from 'node:fs'
const result = await parseResume(readFileSync('./resume.pdf'), {
model: createDeepSeek({ apiKey: process.env.DEEPSEEK_API_KEY })('deepseek-chat'),
redactPii: true, // GDPR-friendly: LLM never sees raw personal data
})
console.log(result.data.basics?.name) // "Jane Doe"
console.log(result.data.work?.length) // 3
console.log(result.meta.ocrFallback) // true if scanned PDF
console.log(result.meta.sectionResults) // per-section diagnostics with confidence scores
How it works
1 — Scan Detection
Before any extraction, the package checks whether the PDF has an embedded text layer. If text density is below a threshold, the PDF is classified as scanned and the OCR pipeline triggers automatically.
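As a rough illustration of the idea, a text-density check can be sketched as below. The function name, the per-page averaging, and the threshold of 40 characters are assumptions made for this example; the package's actual heuristic and threshold are not documented in this section.

```typescript
// Hypothetical threshold: the real value used by the package may differ.
const MIN_CHARS_PER_PAGE = 40;

/**
 * Returns true when the average amount of non-whitespace text per page
 * is too low for the PDF to plausibly have an embedded text layer.
 */
function looksScanned(
  pageTexts: string[],
  minCharsPerPage: number = MIN_CHARS_PER_PAGE,
): boolean {
  if (pageTexts.length === 0) return true; // nothing was extracted at all
  const totalChars = pageTexts
    .map((text) => text.replace(/\s+/g, "").length) // ignore whitespace
    .reduce((sum, count) => sum + count, 0);
  return totalChars / pageTexts.length < minCharsPerPage;
}

// Two pages with only whitespace: treated as scanned.
console.log(looksScanned(["", " \n "]));
// One page with a normal line of resume text: treated as text-native.
console.log(
  looksScanned([
    "Jane Doe, Senior Engineer, 2018-2024, TypeScript, Node.js, PostgreSQL",
  ]),
);
```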
2a — Spatial PDF Extraction (text-native PDFs)
Standard extractors read text in rendering order, not reading order. For a two-column CV this produces:
2020 Senior Engineer TypeScript Node.js
2018 Junior Engineer React PostgreSQL
resume-intel extracts bounding box coordinates, detects column boundaries using gap analysis, sorts blocks within each column by vertical position, and concatenates columns left-to-right. The LLM receives clean, ordered text.
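The steps above can be sketched in a few lines. This is a simplified illustration, not the package's implementation: the LayoutBlock shape, the orderByColumns name, and the 30-unit minimum gap are all assumptions, and coordinates are taken to have a top-left origin.

```typescript
// Hypothetical block shape; smaller y means higher on the page.
interface LayoutBlock {
  text: string;
  x: number; // left edge
  y: number; // top edge
  width: number;
}

/** Splits blocks into columns at wide horizontal gaps, then reads each column top to bottom. */
function orderByColumns(blocks: LayoutBlock[], minGap: number = 30): string {
  const byLeftEdge = [...blocks].sort((a, b) => a.x - b.x);
  const columns: LayoutBlock[][] = [];
  let rightEdge = -Infinity; // furthest right edge seen in the current column

  for (const block of byLeftEdge) {
    const current = columns[columns.length - 1];
    if (current === undefined || block.x - rightEdge > minGap) {
      columns.push([block]); // gap is wide enough: start a new column
      rightEdge = block.x + block.width;
    } else {
      current.push(block);
      rightEdge = Math.max(rightEdge, block.x + block.width);
    }
  }

  // Columns are already left-to-right; sort each one by vertical position.
  return columns
    .map((column) =>
      [...column]
        .sort((a, b) => a.y - b.y)
        .map((block) => block.text)
        .join("\n"),
    )
    .join("\n\n");
}

// The two-column example from above, deliberately given out of order.
const sampleBlocks: LayoutBlock[] = [
  { text: "TypeScript Node.js", x: 320, y: 100, width: 180 },
  { text: "2018 Junior Engineer", x: 40, y: 160, width: 200 },
  { text: "2020 Senior Engineer", x: 40, y: 100, width: 200 },
  { text: "React PostgreSQL", x: 320, y: 160, width: 180 },
];
console.log(orderByColumns(sampleBlocks));
```

A sketch this small does not handle full-width elements such as a page header spanning both columns, which would bridge the gap and merge the columns.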
2b — OCR Fallback (scanned PDFs)
When a scanned PDF is detected:
- Rasterizes each page to PNG using pdfjs-dist + @napi-rs/canvas at 150 DPI
- Runs Tesseract.js (WASM) locally on each page image
- Cleans OCR artefacts (progress bars, separators, page numbers)
No external OCR service. No network call. Everything runs locally.
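To give a feel for the cleanup step, here is a minimal sketch of filtering OCR artefacts line by line. The cleanOcrText name and the three patterns are assumptions for illustration only; the rules the package actually applies are not listed in this section.

```typescript
/** Drops lines that look like layout residue rather than resume content. */
function cleanOcrText(raw: string): string {
  // Horizontal rules and dotted leaders, e.g. "----------" or "• • •".
  const isSeparator = (line: string): boolean =>
    /^[-_=~.*|•·\s]{3,}$/.test(line);
  // Bare page markers, e.g. "3", "1 / 2", "Page 2 of 3".
  const isPageNumber = (line: string): boolean =>
    /^(page\s*)?\d{1,3}(\s*(\/|of)\s*\d{1,3})?$/i.test(line);
  // Skill "progress bars" often come out of OCR as runs of block characters.
  const isProgressBar = (line: string): boolean =>
    /^[█▓▒░■□▪▫●○\s]{3,}$/.test(line);

  return raw
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(
      (line) =>
        !isSeparator(line) && !isPageNumber(line) && !isProgressBar(line),
    )
    .join("\n")
    .replace(/\n{3,}/g, "\n\n"); // collapse blank runs left by removed lines
}

const rawOcr = [
  "Jane Doe",
  "----------",
  "Skills",
  "TypeScript ████░░",
  "█████",
  "Page 1 of 2",
].join("\n");
console.log(cleanOcrText(rawOcr));
```

Lines that mix real text with bar characters are kept on purpose, since dropping them would lose content.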
2c — PII Redaction (optional)
When redactPii: true, personal data is replaced with deterministic placeholders before the LLM call:
john.doe@gmail.com → __PII_EMAIL_0__
+1 (555) 123-4567 → __PII_PHONE_1__
https://johndoe.dev → __PII_URL_2__
After extraction, the real values are reinjected. The final output is identical to a non-redacted run.
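The round trip can be sketched as follows. This is not the package's redactPii / reinjectPii implementation: the function names, the simplified regular expressions, and the Map-based vault are assumptions for illustration, and address redaction is left out entirely.

```typescript
interface RedactionSketch {
  text: string;
  vault: Map<string, string>; // placeholder -> original value
}

// Simplified patterns (email | URL | phone). Real PII detection needs broader coverage.
const PII_REGEX =
  /([\w.+-]+@[\w-]+(?:\.[\w-]+)+)|(https?:\/\/[^\s)]+)|(\+?\d[\d\s().-]{7,}\d)/g;

/** Replaces each match with a numbered placeholder, in order of appearance. */
function redactSketch(text: string): RedactionSketch {
  const vault = new Map<string, string>();
  let counter = 0;
  const redacted = text.replace(
    PII_REGEX,
    (match: string, email?: string, url?: string): string => {
      const kind =
        email !== undefined ? "EMAIL" : url !== undefined ? "URL" : "PHONE";
      const placeholder = `__PII_${kind}_${counter++}__`;
      vault.set(placeholder, match);
      return placeholder;
    },
  );
  return { text: redacted, vault };
}

/** Restores original values anywhere inside an extracted (JSON-serializable) object. */
function reinjectSketch<T>(value: T, vault: Map<string, string>): T {
  let json = JSON.stringify(value);
  vault.forEach((original, placeholder) => {
    // Escape the original as a JSON string fragment before splicing it in.
    const escaped = JSON.stringify(original).slice(1, -1);
    json = json.split(placeholder).join(escaped);
  });
  return JSON.parse(json) as T;
}

const redaction = redactSketch(
  "Contact: john.doe@gmail.com, +1 (555) 123-4567, https://johndoe.dev",
);
console.log(redaction.text);
// Contact: __PII_EMAIL_0__, __PII_PHONE_1__, __PII_URL_2__
```

A phone pattern this loose can also catch digit runs such as date ranges. Because every placeholder is reinjected afterwards, a false positive costs the LLM some context but does not corrupt the final output.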
3 — Task Decomposition
Instead of one monolithic LLM call, resume-intel runs parallel focused extractions per section — each with its own prompt, schema, maxTokens cap, and retry loop. This reduces hallucinations and isolates failures.
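The shape of this orchestration can be sketched as below, with the per-section LLM call replaced by a caller-supplied function. The SectionExtractor signature, the SectionOutcome fields, and both function names are hypothetical and only illustrate how independent retries keep one failing section from affecting the others.

```typescript
interface SectionOutcome {
  section: string;
  ok: boolean;
  attempts: number;
  data?: unknown;
  error?: string;
}

// Stand-in for a focused, per-section LLM call (hypothetical signature).
type SectionExtractor = (section: string, text: string) => Promise<unknown>;

/** Runs one section with its own retry loop and never throws. */
async function extractOneSection(
  section: string,
  text: string,
  extract: SectionExtractor,
  maxRetries: number,
): Promise<SectionOutcome> {
  let lastError = "";
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    try {
      const data = await extract(section, text);
      return { section, ok: true, attempts: attempt, data };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { section, ok: false, attempts: maxRetries + 1, error: lastError };
}

/** Starts all sections at once; each settles independently of the others. */
async function extractAllSections(
  sections: string[],
  text: string,
  extract: SectionExtractor,
  maxRetries: number = 2,
): Promise<SectionOutcome[]> {
  return Promise.all(
    sections.map((section) =>
      extractOneSection(section, text, extract, maxRetries),
    ),
  );
}
```

Because extractOneSection converts failures into a result object, Promise.all never rejects here, and a section that exhausts its retries simply shows up with ok set to false.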
4 — JSON Repair and Validation
- Repair: strips markdown fences, fixes trailing commas and missing brackets via jsonrepair
- Validate: Zod schema enforcement
- Self-correct: Zod errors are fed back to the LLM for targeted correction (up to maxRetries times)
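The repair, validate, and self-correct loop can be sketched without any dependencies. To keep the example self-contained, it swaps in a deliberately naive repair function instead of jsonrepair and a hand-written validator instead of a Zod schema; the names naiveRepair, SketchValidator, and extractWithCorrection are hypothetical.

```typescript
type SketchValidator<T> = (
  value: unknown,
) => { ok: true; data: T } | { ok: false; errors: string[] };

// Stand-in for the LLM call: takes a prompt, returns raw model output.
type AskModel = (prompt: string) => Promise<string>;

const FENCE = "`".repeat(3); // markdown code fence

/** Naive stand-in for jsonrepair: strips fences and trailing commas only. */
function naiveRepair(raw: string): string {
  return raw
    .replace(new RegExp("^\\s*" + FENCE + "(?:json)?\\s*", "i"), "")
    .replace(new RegExp("\\s*" + FENCE + "\\s*$"), "")
    .replace(/,\s*([}\]])/g, "$1");
}

/** Asks, repairs, validates, and re-asks with the concrete errors on failure. */
async function extractWithCorrection<T>(
  prompt: string,
  ask: AskModel,
  validate: SketchValidator<T>,
  maxRetries: number = 2,
): Promise<T> {
  let currentPrompt = prompt;
  let lastErrors: string[] = [];
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await ask(currentPrompt);
    try {
      const result = validate(JSON.parse(naiveRepair(raw)));
      if (result.ok) return result.data;
      lastErrors = result.errors;
    } catch (err) {
      lastErrors = [err instanceof Error ? err.message : String(err)];
    }
    // Feed the concrete errors back so the next attempt can target them.
    currentPrompt =
      `${prompt}\n\nYour previous answer was invalid:\n- ` +
      `${lastErrors.join("\n- ")}\nReturn corrected JSON only.`;
  }
  throw new Error(
    `Validation failed after ${maxRetries + 1} attempts: ${lastErrors.join("; ")}`,
  );
}
```

The regex-based trailing-comma fix would also touch commas inside string values, which is one reason a dedicated repair library is the better choice in a real pipeline.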
Features
| Feature | Description |
|---|---|
| Spatial Extraction | Bounding box algorithm reconstructs multi-column reading order |
| OCR Fallback | Tesseract.js + @napi-rs/canvas for scanned PDFs, fully local |
| PII Redaction | Redacts email, phone, addresses, URLs before LLM — reinjects after |
| Model Agnostic | Vercel AI SDK — works with DeepSeek, OpenAI, Anthropic, Gemini, Ollama |
| JSON Resume v1 | Output conforms to the open standard used by hundreds of tools |
| 15 Sections | Including basics, work, education, skills, languages, projects, awards, certificates, publications, volunteer, interests, references |
| Custom Sections | sections option — extract only what you need |
| Custom Schema | outputSchema option — replace JSON Resume with your own Zod schema |
| Zod Validation | Full schema enforcement with self-correcting retry loop |
| JSON Repair | jsonrepair handles markdown fences, trailing commas, truncated responses |
| Task Decomposition | Parallel per-section extraction with independent retry |
| Confidence Scores | Per-section reliability signal in sectionResults |
| Streaming | streamResume() — AsyncGenerator for progressive UI updates |
| CLI | resume-intel parse <file.pdf> — parse from the terminal |
| Serverless Ready | Worker via MessageChannel — works on Vercel and AWS Lambda |
| TypeScript First | Full type safety, ESM + CJS dual build |