@edwinfom/resume-intel
LLM-first resume parsing infrastructure. Model-agnostic · Spatial extraction · OCR fallback · JSON Resume compatible
What's new in v0.2.0
- streamResume(): AsyncGenerator that yields events as each section is extracted. Update your UI progressively instead of waiting for the full result.
- YYYY-01 date fix: "2025-01" → "2025" (month-only padding now stripped)
- Empty arrays removed: volunteer: [] and interests: [] are now omitted from output
- Empty skill categories removed: skills with no keywords are filtered out
- StreamResumeEvent type export
See the full Changelog for details.
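The progressive-update pattern that streamResume() enables can be sketched with a plain async generator. This is a minimal illustration only: `SectionEvent`, `fakeStreamResume`, and `collectSections` are hypothetical names, and the real `StreamResumeEvent` shape may differ.

```typescript
// Hypothetical event shape; the real StreamResumeEvent type may differ.
type SectionEvent = { section: string; data: unknown };

// Stand-in for streamResume(): yields one event per extracted section.
async function* fakeStreamResume(): AsyncGenerator<SectionEvent> {
  yield { section: 'basics', data: { name: 'Jane Doe' } };
  yield { section: 'work', data: [{ name: 'Acme', position: 'Engineer' }] };
}

// Merge each event into a partial result as soon as it arrives.
async function collectSections(
  events: AsyncIterable<SectionEvent>,
  onUpdate: (partial: Record<string, unknown>) => void,
): Promise<Record<string, unknown>> {
  const partial: Record<string, unknown> = {};
  for await (const event of events) {
    partial[event.section] = event.data;
    onUpdate({ ...partial }); // e.g. re-render the UI with what is known so far
  }
  return partial;
}
```

Calling `collectSections(fakeStreamResume(), render)` would invoke `render` once per section, so a UI can show the name before the work history has finished extracting.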
The Problem
Extracting structured data from resume PDFs is harder than it looks. Most tools either use brittle regex patterns that break on modern CV designs, or they wrap a single AI provider's API and lock you in forever.
Here's what actually goes wrong in practice:
- Multi-column layouts — column 1 and column 2 get interleaved. Dates mix with job descriptions. The LLM receives semantic chaos and hallucinates.
- DeepSeek and similar models can't read raw PDFs — they are text-only models. You must extract and clean the text first, or every call fails silently.
- LLMs produce broken JSON — missing closing braces, trailing commas, JSON wrapped in markdown fences. Without a repair layer, your pipeline crashes.
- Vendor lock-in — if your parser only works with Claude or GPT-4, you can't switch to a cheaper or local model without rewriting your integration.
- Scanned PDFs — no text layer at all. A basic extractor returns an empty string and you never know why.
resume-intel is a four-layer pipeline that solves all of this:
import { parseResume } from '@edwinfom/resume-intel'
import { createDeepSeek } from '@ai-sdk/deepseek'
import { readFileSync } from 'node:fs'
const result = await parseResume(readFileSync('./resume.pdf'), {
model: createDeepSeek({ apiKey: process.env.DEEPSEEK_API_KEY })('deepseek-chat'),
})
console.log(result.data.basics?.name) // "Jane Doe"
console.log(result.data.work?.length) // 3
console.log(result.meta.ocrFallback) // true if scanned PDF
console.log(result.meta.sectionResults) // per-section diagnostics
How it works
1 — Scan Detection
Before any extraction, the package checks whether the PDF has an embedded text layer. If text density is below a threshold, the PDF is classified as scanned and the OCR pipeline triggers automatically.
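A text-density check of this kind might look like the sketch below. The threshold value and the function name `isLikelyScanned` are assumptions for illustration; the package's actual heuristic and cutoff are not documented here.

```typescript
// Hypothetical threshold: average non-whitespace characters per page.
const MIN_CHARS_PER_PAGE = 100;

// Classify a PDF as scanned when its pages carry almost no extractable text.
function isLikelyScanned(
  pageTexts: string[],
  minCharsPerPage: number = MIN_CHARS_PER_PAGE,
): boolean {
  if (pageTexts.length === 0) return true; // nothing extractable at all
  let totalChars = 0;
  for (const text of pageTexts) {
    totalChars += text.replace(/\s/g, '').length; // ignore whitespace
  }
  return totalChars / pageTexts.length < minCharsPerPage;
}
```

Averaging over pages rather than testing the whole document keeps a long scanned PDF with a few stray characters per page from passing as text-native.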
2a — Spatial PDF Extraction (text-native PDFs)
Standard extractors read text in rendering order, not reading order. For a two-column CV this produces:
2020 Senior Engineer TypeScript Node.js
2018 Junior Engineer React PostgreSQL
resume-intel extracts bounding box coordinates, detects column boundaries using gap analysis, sorts blocks within each column by vertical position, and concatenates columns left-to-right. The LLM receives clean, ordered text.
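The column reconstruction described above can be sketched roughly as follows. This is a simplified illustration, not the package's implementation: `TextBlock`, `orderByColumns`, and the `minGap` default are hypothetical, and coordinates are assumed to be normalized to a top-left origin with y increasing downward.

```typescript
// Hypothetical block shape: left edge x, top edge y, width, all in points.
interface TextBlock {
  text: string;
  x: number;
  y: number;
  width: number;
}

// Group blocks into columns by horizontal gaps, then read each column top to bottom.
function orderByColumns(blocks: TextBlock[], minGap: number = 20): string[] {
  const byLeftEdge = [...blocks].sort((a, b) => a.x - b.x);
  const columns: TextBlock[][] = [];
  let rightEdge = -Infinity;

  for (const block of byLeftEdge) {
    const current = columns[columns.length - 1];
    if (current === undefined || block.x - rightEdge > minGap) {
      // A gap wider than minGap starts a new column.
      columns.push([block]);
      rightEdge = block.x + block.width;
    } else {
      current.push(block);
      rightEdge = Math.max(rightEdge, block.x + block.width);
    }
  }

  const ordered: string[] = [];
  for (const column of columns) {
    const topToBottom = [...column].sort((a, b) => a.y - b.y);
    for (const block of topToBottom) ordered.push(block.text);
  }
  return ordered; // columns concatenated left to right
}

// The interleaved example above, with made-up coordinates.
const ordered = orderByColumns([
  { text: '2020 Senior Engineer', x: 40, y: 100, width: 200 },
  { text: 'TypeScript Node.js', x: 320, y: 100, width: 200 },
  { text: '2018 Junior Engineer', x: 40, y: 140, width: 200 },
  { text: 'React PostgreSQL', x: 320, y: 140, width: 200 },
]);
console.log(ordered.join('\n'));
```

With these coordinates, both job lines come out before both skill lines instead of alternating row by row.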
2b — OCR Fallback (scanned PDFs)
When a scanned PDF is detected:
- Rasterizes each page to PNG using pdfjs-dist + @napi-rs/canvas at 150 DPI
- Runs Tesseract.js (WASM) locally on each page image
- Cleans OCR artefacts (progress bars, separators, page numbers)
No external OCR service. No network call. Everything runs locally.
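The artefact-cleaning step could be approximated with a few line-level filters, as in this sketch. The patterns and the name `cleanOcrText` are assumptions chosen for illustration; the package's real rules may be broader or stricter.

```typescript
// Lines made only of rule or bar characters (ASCII rules, box drawing, block elements).
const SEPARATOR_LINE = /^[-_=|.*~#\u2500-\u259F]{3,}$/;
// Bare page markers such as "3", "2/3", or "Page 2 of 3".
const PAGE_NUMBER_LINE = /^(page\s+)?\d{1,3}(\s*(\/|of)\s*\d{1,3})?$/i;

// Heuristic cleanup: drops blank lines, separators, and page numbers,
// and collapses runs of whitespace inside the remaining lines.
function cleanOcrText(raw: string): string {
  const kept: string[] = [];
  for (const line of raw.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === '') continue;
    if (SEPARATOR_LINE.test(trimmed)) continue;
    if (PAGE_NUMBER_LINE.test(trimmed)) continue;
    kept.push(trimmed.replace(/\s{2,}/g, ' '));
  }
  return kept.join('\n');
}
```

Filters like these are lossy by design: a legitimate line consisting of a single short number would also be dropped, so a production version would likely need layout context as well.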
3 — Task Decomposition
Instead of one monolithic LLM call, resume-intel runs parallel focused extractions per section — each with its own prompt, schema, maxTokens cap, and retry loop. This reduces hallucinations and isolates failures.
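A rough sketch of per-section extraction with independent retries is shown below. All names here (`SectionExtractor`, `extractSection`, `extractAll`) are hypothetical, and the extractor is passed in as a function so the example stays independent of any model SDK.

```typescript
type SectionName = 'basics' | 'work' | 'skills';
type SectionExtractor = (section: SectionName, text: string) => Promise<unknown>;

interface SectionResult {
  section: SectionName;
  ok: boolean;
  data?: unknown;
  error?: string;
  attempts: number;
}

// Retry a single section on its own; never throws, so one failure cannot sink the others.
async function extractSection(
  section: SectionName,
  text: string,
  extract: SectionExtractor,
  maxRetries: number,
): Promise<SectionResult> {
  let lastError = 'unknown error';
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    try {
      const data = await extract(section, text);
      return { section, ok: true, data, attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { section, ok: false, error: lastError, attempts: maxRetries + 1 };
}

// Run all sections concurrently and report a result for each one.
async function extractAll(
  sections: SectionName[],
  text: string,
  extract: SectionExtractor,
  maxRetries: number = 2,
): Promise<SectionResult[]> {
  return Promise.all(
    sections.map((section) => extractSection(section, text, extract, maxRetries)),
  );
}
```

Because each section resolves to a result object rather than rejecting, a failed `work` extraction still leaves `basics` and `skills` usable, which is the isolation property the paragraph above describes.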
4 — JSON Repair and Validation
- Repair: strips markdown fences and fixes trailing commas and missing brackets via jsonrepair
- Validate: Zod schema enforcement
- Self-correct: Zod errors are fed back to the LLM for targeted correction (up to maxRetries times)
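The repair, validate, and self-correct loop can be sketched without external dependencies as follows. This is a simplified stand-in: `naiveRepair` only covers fences and trailing commas (far less than jsonrepair), the `Validator` function type substitutes for a Zod schema, and `Ask` is a hypothetical wrapper around a model call.

```typescript
type Validator<T> = (value: unknown) => { data: T } | { errors: string[] };
type Ask = (prompt: string) => Promise<string>;

const FENCE = '`'.repeat(3); // markdown code fence marker
const FENCE_OPEN = new RegExp('^\\s*' + FENCE + '(?:json)?\\s*', 'i');
const FENCE_CLOSE = new RegExp('\\s*' + FENCE + '\\s*$');

// Naive repair: strips a wrapping fence and trailing commas.
// Unlike a real repair library, it can also touch commas inside string values.
function naiveRepair(raw: string): string {
  return raw
    .replace(FENCE_OPEN, '')
    .replace(FENCE_CLOSE, '')
    .replace(/,\s*([}\]])/g, '$1');
}

// Ask, repair, validate; on failure, feed the errors back and ask again.
async function parseWithCorrection<T>(
  prompt: string,
  ask: Ask,
  validate: Validator<T>,
  maxRetries: number = 2,
): Promise<T> {
  let currentPrompt = prompt;
  let lastErrors: string[] = [];
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await ask(currentPrompt);
    try {
      const result = validate(JSON.parse(naiveRepair(raw)));
      if ('data' in result) return result.data;
      lastErrors = result.errors;
    } catch (err) {
      lastErrors = [err instanceof Error ? err.message : String(err)];
    }
    currentPrompt =
      prompt +
      '\n\nYour previous answer was invalid:\n- ' +
      lastErrors.join('\n- ') +
      '\nReturn corrected JSON only.';
  }
  throw new Error(
    'Validation failed after ' + (maxRetries + 1) + ' attempts: ' + lastErrors.join('; '),
  );
}
```

Feeding back only the validation errors, rather than resending the whole task with a generic "try again", gives the model a targeted correction to make on the next attempt.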
Features
| Feature | Description |
|---|---|
| Spatial Extraction | Bounding box algorithm reconstructs multi-column reading order |
| OCR Fallback | Tesseract.js + @napi-rs/canvas for scanned PDFs, fully local |
| Model Agnostic | Vercel AI SDK — works with DeepSeek, OpenAI, Anthropic, Gemini, Ollama |
| JSON Resume v1 | Output conforms to the open standard used by hundreds of tools |
| 15 Sections | basics, work, education, skills, languages, projects, awards, certificates, publications, volunteer, interests, references |
| Custom Sections | sections option — extract only what you need |
| Custom Schema | outputSchema option — replace JSON Resume with your own Zod schema |
| Zod Validation | Full schema enforcement with self-correcting retry loop |
| JSON Repair | jsonrepair handles markdown fences, trailing commas, truncated responses |
| Task Decomposition | Parallel per-section extraction with independent retry |
| OCR Text Cleaning | Strips Tesseract artefacts before LLM submission |
| Deduplication | Removes duplicate array entries from multi-page scans |
| Observability | sectionResults + sectionsRequested in meta |
| CLI | resume-intel parse <file.pdf> — parse from the terminal |
| Serverless Ready | Worker via MessageChannel — works on Vercel and AWS Lambda |
| TypeScript First | Full type safety, ESM + CJS dual build |