@edwinfom/resume-intel is an LLM-first resume parsing infrastructure: model-agnostic, with spatial PDF extraction, automatic OCR fallback, JSON Resume v1 output, and production-grade JSON validation.

@edwinfom/resume-intel

LLM-first resume parsing infrastructure. Model-agnostic · Spatial extraction · OCR fallback · JSON Resume compatible

What's new in v0.2.0

streamResume() — AsyncGenerator that yields events as each section is extracted. Update your UI progressively instead of waiting for the full result.
YYYY-01 date fix — "2025-01" → "2025" (month-only padding now stripped)
Empty arrays removed — volunteer: [], interests: [] are now omitted from output
Empty skill categories removed — skills with no keywords are filtered out
StreamResumeEvent type export
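
The output clean-up rules listed above can be pictured with a small sketch. This is not the package's code: `stripPaddedJanuary`, `pruneEmpty`, and the `ResumeLike` shape are hypothetical names chosen for illustration, and the sketch only mirrors the behaviour described in the list.

```typescript
// Hypothetical helpers; the package's internal implementation is not shown in this README.
type Skill = { name: string; keywords?: string[] };
type ResumeLike = Record<string, unknown>;

/** "2025-01" becomes "2025"; any other value is returned unchanged. */
function stripPaddedJanuary(date: string): string {
  const match = /^(\d{4})-01$/.exec(date);
  return match ? match[1] : date;
}

/** Omits empty arrays and drops skill categories that have no keywords. */
function pruneEmpty(resume: ResumeLike): ResumeLike {
  const out: ResumeLike = {};
  for (const [key, value] of Object.entries(resume)) {
    if (key === "skills" && Array.isArray(value)) {
      const kept = (value as Skill[]).filter((s) => (s.keywords?.length ?? 0) > 0);
      if (kept.length > 0) out[key] = kept;
      continue;
    }
    // Empty arrays such as volunteer: [] are left out entirely.
    if (Array.isArray(value) && value.length === 0) continue;
    out[key] = value;
  }
  return out;
}
```

Note that a pattern like this cannot tell a padded year from a genuine January date, so a real implementation may need more context than this sketch uses.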

See the full Changelog for details.
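
To give a feel for progressive updates with an AsyncGenerator, here is a minimal sketch. The fields of `StreamResumeEvent` are not documented in this section, so the `{ section, data }` shape, `fakeStreamResume`, and `collect` below are assumptions used for illustration; consult the package's type definitions for the real event shape.

```typescript
// Assumed event shape; the package's real StreamResumeEvent may differ.
type SectionEvent = { section: string; data: unknown };

// Stand-in for streamResume(): yields one event per extracted section.
async function* fakeStreamResume(): AsyncGenerator<SectionEvent> {
  yield { section: "basics", data: { name: "Jane Doe" } };
  yield { section: "work", data: [{ name: "Acme", position: "Engineer" }] };
}

// Merges events into a partial result and notifies the caller after each one,
// so a UI can render sections as they arrive.
async function collect(
  events: AsyncIterable<SectionEvent>,
  onUpdate: (partial: Record<string, unknown>) => void,
): Promise<Record<string, unknown>> {
  const partial: Record<string, unknown> = {};
  for await (const event of events) {
    partial[event.section] = event.data;
    onUpdate({ ...partial }); // pass a copy so callers can store snapshots
  }
  return partial;
}
```

In an application, `onUpdate` would typically set component state; the generator passed to `collect` would be the one returned by `streamResume()` once its event shape is mapped accordingly.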

Extracting structured data from resume PDFs is harder than it looks. Most tools either use brittle regex patterns that break on modern CV designs, or they wrap a single AI provider's API and lock you in forever.

Here's what actually goes wrong in practice:

Multi-column layouts — column 1 and column 2 get interleaved. Dates mix with job descriptions. The LLM receives semantic chaos and hallucinates.
DeepSeek and similar models can't read raw PDFs — they are text-only models. You must extract and clean the text first, or every call fails silently.
LLMs produce broken JSON — missing closing braces, trailing commas, JSON wrapped in markdown fences. Without a repair layer, your pipeline crashes.
Vendor lock-in — if your parser only works with Claude or GPT-4, you can't switch to a cheaper or local model without rewriting your integration.
Scanned PDFs — no text layer at all. A basic extractor returns an empty string and you never know why.

resume-intel is a four-layer pipeline that addresses each of these problems. A minimal call looks like this:

import { parseResume } from '@edwinfom/resume-intel'
import { createDeepSeek } from '@ai-sdk/deepseek'
import { readFileSync } from 'node:fs'
 
const result = await parseResume(readFileSync('./resume.pdf'), {
  model: createDeepSeek({ apiKey: process.env.DEEPSEEK_API_KEY })('deepseek-chat'),
})
 
console.log(result.data.basics?.name)    // "Jane Doe"
console.log(result.data.work?.length)    // 3
console.log(result.meta.ocrFallback)     // true if scanned PDF
console.log(result.meta.sectionResults) // per-section diagnostics

How it works

1 — Scan Detection

Before any extraction, the package checks whether the PDF has an embedded text layer. If text density is below a threshold, the PDF is classified as scanned and the OCR pipeline triggers automatically.
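
A text-density check of this kind could look roughly like the following. The threshold value, the per-page averaging, and the names `looksScanned` and `MIN_CHARS_PER_PAGE` are assumptions for illustration; the README does not state the heuristic the package actually uses.

```typescript
// Hypothetical threshold: average non-whitespace characters per page.
const MIN_CHARS_PER_PAGE = 50;

/**
 * Returns true when the extracted text layer is too sparse to be trusted,
 * which suggests the PDF is a scan and should go through OCR.
 */
function looksScanned(pageTexts: string[], minCharsPerPage = MIN_CHARS_PER_PAGE): boolean {
  if (pageTexts.length === 0) return true; // nothing extractable at all
  const totalChars = pageTexts.reduce(
    (sum, text) => sum + text.replace(/\s+/g, "").length,
    0,
  );
  return totalChars / pageTexts.length < minCharsPerPage;
}
```

The caller would pass one string per page from the text extractor and route the document to OCR when the function returns true.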

2a — Spatial PDF Extraction (text-native PDFs)

Standard extractors read text in rendering order, not reading order. For a two-column CV this produces:

2020 Senior Engineer TypeScript Node.js
2018 Junior Engineer React PostgreSQL

resume-intel extracts bounding box coordinates, detects column boundaries using gap analysis, sorts blocks within each column by vertical position, and concatenates columns left-to-right. The LLM receives clean, ordered text.
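
The column reconstruction described above can be sketched as follows. This is a simplified illustration, not the package's algorithm: the `Block` type, the `minGap` value, and the assumption that `y` grows downward are all hypothetical.

```typescript
// A text block with its bounding box; y is assumed to increase down the page.
type Block = { x: number; y: number; width: number; text: string };

/**
 * Groups blocks into columns wherever the horizontal gap between the running
 * right edge and the next block's left edge exceeds minGap, then reads each
 * column top to bottom and concatenates the columns left to right.
 */
function toReadingOrder(blocks: Block[], minGap = 20): string {
  const byLeft = [...blocks].sort((a, b) => a.x - b.x);
  const columns: Block[][] = [];
  let current: Block[] | undefined;
  let rightEdge = -Infinity;

  for (const block of byLeft) {
    if (!current || block.x - rightEdge > minGap) {
      current = []; // a wide enough gap starts a new column
      columns.push(current);
    }
    current.push(block);
    rightEdge = Math.max(rightEdge, block.x + block.width);
  }

  return columns
    .map((col) => [...col].sort((a, b) => a.y - b.y).map((b) => b.text).join("\n"))
    .join("\n");
}
```

A block that spans the full page width, such as a header, would bridge the gap and merge the columns in this sketch, so a production version would need to handle such blocks separately.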

2b — OCR Fallback (scanned PDFs)

When a scanned PDF is detected:

Rasterizes each page to PNG using pdfjs-dist + @napi-rs/canvas at 150 DPI
Runs Tesseract.js (WASM) locally on each page image
Cleans OCR artefacts (progress bars, separators, page numbers)

No external OCR service. No network call. Everything runs locally.
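
The artefact clean-up step might be sketched like this. The patterns and the name `cleanOcrText` are assumptions for illustration; the README does not list the rules the package applies.

```typescript
// Hypothetical noise patterns; each one is tested against a trimmed line.
const NOISE_PATTERNS: RegExp[] = [
  /^[\s_=~.|*•■□▪█▓▒░-]{3,}$/, // separators and rendered skill or progress bars
  /^(page\s+)?\d{1,2}\s*(\/|of)\s*\d{1,2}$/i, // "2 / 3", "Page 2 of 3"
  /^page\s+\d{1,2}$/i, // "Page 2"
];

/** Removes lines that look like OCR noise and collapses runs of blank lines. */
function cleanOcrText(raw: string): string {
  return raw
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => !NOISE_PATTERNS.some((pattern) => pattern.test(line)))
    .join("\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

Page-number digits are capped at two so that a line such as "2018 / 2020" is kept as a date range rather than discarded.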

3 — Task Decomposition

Instead of one monolithic LLM call, resume-intel runs parallel focused extractions per section — each with its own prompt, schema, maxTokens cap, and retry loop. This reduces hallucinations and isolates failures.
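
The idea of parallel, independently retried section extractions can be sketched as below. The `Extractor` callback stands in for a per-section LLM call, and `extractSection`, `extractAll`, and `SectionResult` are hypothetical names, not the package's API.

```typescript
type SectionResult =
  | { section: string; ok: true; data: unknown; attempts: number }
  | { section: string; ok: false; error: string; attempts: number };

// Stand-in for a focused LLM call (own prompt, schema, and token cap per section).
type Extractor = (section: string, text: string) => Promise<unknown>;

/** Runs one section with its own retry loop; never rejects. */
async function extractSection(
  section: string,
  text: string,
  extract: Extractor,
  maxRetries = 2,
): Promise<SectionResult> {
  let lastError = "unknown error";
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    try {
      const data = await extract(section, text);
      return { section, ok: true, data, attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { section, ok: false, error: lastError, attempts: maxRetries + 1 };
}

/** Runs all sections concurrently; a failing section does not affect the others. */
async function extractAll(
  sections: string[],
  text: string,
  extract: Extractor,
  maxRetries = 2,
): Promise<SectionResult[]> {
  return Promise.all(sections.map((s) => extractSection(s, text, extract, maxRetries)));
}
```

Because each section resolves to a result object instead of throwing, one failed section leaves the rest of the output intact, and the per-section records are the kind of information a diagnostics field such as `sectionResults` could carry.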

4 — JSON Repair and Validation

Repair — strips markdown fences, fixes trailing commas, missing brackets via jsonrepair
Validate — Zod schema enforcement
Self-correct — Zod errors are fed back to the LLM for targeted correction (up to maxRetries times)
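
The repair, validate, and self-correct loop can be illustrated with a dependency-free sketch. The package uses jsonrepair and Zod; here a small hand-written `repairJson` and a plain `Validator` function stand in for them, and `generateValidated` and `callModel` are hypothetical names.

```typescript
type Validator<T> = (value: unknown) => { ok: true; data: T } | { ok: false; errors: string[] };

const FENCE = "`".repeat(3);
const FENCED = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

/** Naive stand-in for jsonrepair: unwraps fences, closes brackets, drops trailing commas. */
function repairJson(raw: string): string {
  const fenced = FENCED.exec(raw);
  let text = (fenced ? fenced[1] : raw).trim();

  // Track unclosed brackets, ignoring any that appear inside string literals.
  const stack: string[] = [];
  let inString = false;
  let escaped = false;
  for (const ch of text) {
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") stack.push("}");
    else if (ch === "[") stack.push("]");
    else if (ch === "}" || ch === "]") stack.pop();
  }
  text += stack.reverse().join("");

  // Naive: this would also alter a ",}" sequence inside a string value.
  return text.replace(/,\s*([}\]])/g, "$1");
}

/** Calls the model, repairs and validates its output, and feeds errors back on failure. */
async function generateValidated<T>(
  callModel: (feedback: string[]) => Promise<string>,
  validate: Validator<T>,
  maxRetries = 2,
): Promise<T> {
  let feedback: string[] = [];
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await callModel(feedback);
    try {
      const result = validate(JSON.parse(repairJson(raw)));
      if (result.ok) return result.data;
      feedback = result.errors; // targeted correction on the next attempt
    } catch (err) {
      feedback = [`Invalid JSON: ${err instanceof Error ? err.message : String(err)}`];
    }
  }
  throw new Error(`Validation failed after ${maxRetries + 1} attempts: ${feedback.join("; ")}`);
}
```

With Zod, the validator would typically wrap a schema's `safeParse` and map its issues to the error strings passed back to the model.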

Features

Feature	Description
Spatial Extraction	Bounding box algorithm reconstructs multi-column reading order
OCR Fallback	Tesseract.js + `@napi-rs/canvas` for scanned PDFs, fully local
Model Agnostic	Vercel AI SDK — works with DeepSeek, OpenAI, Anthropic, Gemini, Ollama
JSON Resume v1	Output conforms to the open standard used by hundreds of tools
12 Sections	basics, work, education, skills, languages, projects, awards, certificates, publications, volunteer, interests, references
Custom Sections	`sections` option — extract only what you need
Custom Schema	`outputSchema` option — replace JSON Resume with your own Zod schema
Zod Validation	Full schema enforcement with self-correcting retry loop
JSON Repair	`jsonrepair` handles markdown fences, trailing commas, truncated responses
Task Decomposition	Parallel per-section extraction with independent retry
OCR Text Cleaning	Strips Tesseract artefacts before LLM submission
Deduplication	Removes duplicate array entries from multi-page scans
Observability	`sectionResults` + `sectionsRequested` in meta
CLI	`resume-intel parse <file.pdf>` — parse from the terminal
Serverless Ready	Worker via `MessageChannel` — works on Vercel and AWS Lambda
TypeScript First	Full type safety, ESM + CJS dual build
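
As an illustration of the Deduplication row above, duplicate entries from multi-page scans could be removed along these lines. The normalisation rules and the names `dedupeEntries` and `canonicalKey` are assumptions; the README does not describe how the package compares entries.

```typescript
/** Builds an order-independent key; strings are compared case- and whitespace-insensitively. */
function canonicalKey(value: unknown): string {
  if (typeof value === "string") {
    return JSON.stringify(value.trim().toLowerCase().replace(/\s+/g, " "));
  }
  if (Array.isArray(value)) return `[${value.map(canonicalKey).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const fields = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalKey(obj[k])}`);
    return `{${fields.join(",")}}`;
  }
  return String(JSON.stringify(value));
}

/** Keeps the first occurrence of each entry and drops later duplicates. */
function dedupeEntries<T>(entries: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const entry of entries) {
    const key = canonicalKey(entry);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(entry);
    }
  }
  return unique;
}
```

Keeping the first occurrence preserves the original ordering of the section, which matters for chronologically sorted lists such as work history.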