"@edwinfom/resume-intel is LLM-first resume parsing infrastructure: model-agnostic, with spatial PDF extraction, automatic OCR fallback, PII redaction, streaming, and JSON Resume v1 output."

@edwinfom/resume-intel

LLM-first resume parsing infrastructure. Model-agnostic · Spatial extraction · OCR fallback · PII redaction · Streaming · JSON Resume compatible


What's new in v0.2.1

  • redactPii option — redacts email, phone, addresses, and URLs before sending to the LLM. Real values are reinjected after extraction. GDPR-friendly.
  • confidenceScore in sectionResults — per-section reliability signal (0.0–1.0) based on retry count and field completeness.
  • redactPii, reinjectPii, describePiiRedaction exports — utility functions for custom redaction pipelines.

See the full Changelog for details.

The Problem

Extracting structured data from resume PDFs is harder than it looks. Most tools either use brittle regex patterns that break on modern CV designs, or they wrap a single AI provider's API and lock you in forever.

Here's what actually goes wrong in practice:

  • Multi-column layouts — column 1 and column 2 get interleaved. Dates mix with job descriptions. The LLM receives semantic chaos and hallucinates.
  • DeepSeek and similar models can't read raw PDFs — they are text-only models. You must extract and clean the text first, or every call fails silently.
  • LLMs produce broken JSON — missing closing braces, trailing commas, JSON wrapped in markdown fences. Without a repair layer, your pipeline crashes.
  • Vendor lock-in — if your parser only works with Claude or GPT-4, you can't switch to a cheaper or local model without rewriting your integration.
  • Scanned PDFs — no text layer at all. A basic extractor returns an empty string and you never know why.
  • Data privacy — sending raw CVs with personal emails and phone numbers to third-party LLM APIs may violate GDPR or internal data policies.

resume-intel is a four-layer pipeline that solves all of this:

import { parseResume } from '@edwinfom/resume-intel'
import { createDeepSeek } from '@ai-sdk/deepseek'
import { readFileSync } from 'node:fs'
 
const result = await parseResume(readFileSync('./resume.pdf'), {
  model: createDeepSeek({ apiKey: process.env.DEEPSEEK_API_KEY })('deepseek-chat'),
  redactPii: true, // GDPR-friendly: LLM never sees raw personal data
})
 
console.log(result.data.basics?.name)    // "Jane Doe"
console.log(result.data.work?.length)    // 3
console.log(result.meta.ocrFallback)     // true if scanned PDF
console.log(result.meta.sectionResults) // per-section diagnostics with confidence scores

How it works

1 — Scan Detection

Before any extraction, the package checks whether the PDF has an embedded text layer. If text density is below a threshold, the PDF is classified as scanned and the OCR pipeline triggers automatically.
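
A minimal sketch of this kind of check, assuming the text of each page has already been pulled from the PDF. The helper name `isLikelyScanned` and the threshold of 50 characters per page are illustrative assumptions, not the package's actual internals or values.

```typescript
// Hypothetical helper: classify a PDF as scanned when its text layer is too sparse.
// The default of 50 non-whitespace characters per page is an assumed value.
function isLikelyScanned(pageTexts: string[], minCharsPerPage = 50): boolean {
  if (pageTexts.length === 0) return true; // no extractable pages at all
  const totalChars = pageTexts.reduce(
    (sum, text) => sum + text.replace(/\s+/g, "").length, // ignore whitespace
    0,
  );
  return totalChars / pageTexts.length < minCharsPerPage;
}

// Two pages with nothing but whitespace: treated as scanned.
const scanned = isLikelyScanned(["", "  \n "]);

// One page with a real text layer: treated as text-native.
const native = isLikelyScanned([
  "Jane Doe\nSenior Engineer at Example Corp, 2020 to present. TypeScript, Node.js.",
]);
```

Averaging over pages keeps a single blank cover page from triggering OCR on an otherwise text-native document.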

2a — Spatial PDF Extraction (text-native PDFs)

Standard extractors read text in rendering order, not reading order. For a two-column CV this produces:

2020 Senior Engineer TypeScript Node.js
2018 Junior Engineer React PostgreSQL

resume-intel extracts bounding box coordinates, detects column boundaries using gap analysis, sorts blocks within each column by vertical position, and concatenates columns left-to-right. The LLM receives clean, ordered text.
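
The steps above can be sketched roughly as follows. This is a simplified illustration, not the package's algorithm: the `TextBlock` shape, the `reorderByColumns` name, and the 40-unit gap threshold are assumptions, and `y` is assumed to grow downward from the top of the page.

```typescript
// A text block with its horizontal position, vertical position, and width.
interface TextBlock { text: string; x: number; y: number; width: number; }

// Hypothetical sketch: start a new column wherever the horizontal gap between
// a block and everything to its left exceeds minGap, then read each column
// top to bottom and join the columns left to right.
function reorderByColumns(blocks: TextBlock[], minGap = 40): string {
  const byX = [...blocks].sort((a, b) => a.x - b.x);
  const columns: TextBlock[][] = [];
  let current: TextBlock[] = [];
  let rightEdge = -Infinity; // right-most edge seen so far
  for (const block of byX) {
    if (columns.length === 0 || block.x - rightEdge > minGap) {
      current = []; // gap is wide enough: this block opens a new column
      columns.push(current);
    }
    current.push(block);
    rightEdge = Math.max(rightEdge, block.x + block.width);
  }
  return columns
    .map((column) => column.sort((a, b) => a.y - b.y).map((block) => block.text).join("\n"))
    .join("\n");
}

// Blocks in rendering order: left and right columns interleaved row by row.
const interleaved: TextBlock[] = [
  { text: "2020 Senior Engineer", x: 50, y: 100, width: 200 },
  { text: "TypeScript Node.js", x: 320, y: 100, width: 150 },
  { text: "2018 Junior Engineer", x: 50, y: 140, width: 200 },
  { text: "React PostgreSQL", x: 320, y: 140, width: 150 },
];

const ordered = reorderByColumns(interleaved);
// ordered lists both left-column lines first, then both right-column lines.
```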

2b — OCR Fallback (scanned PDFs)

When a scanned PDF is detected, resume-intel:

  1. Rasterizes each page to PNG using pdfjs-dist + @napi-rs/canvas at 150 DPI
  2. Runs Tesseract.js (WASM) locally on each page image
  3. Cleans OCR artefacts (progress bars, separators, page numbers)

No external OCR service. No network call. Everything runs locally.
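
As an illustration of the cleanup step only, here is a small sketch that drops lines made up entirely of separators, progress-bar glyphs, or page numbers. The function name `cleanOcrText` and every pattern in it are assumed heuristics, not the rules the package actually applies.

```typescript
// Hypothetical cleanup pass over raw OCR text; every pattern is an assumed heuristic.
function cleanOcrText(raw: string): string {
  // Lines made only of rule-like punctuation, e.g. "-----" or "=====".
  const isSeparator = (line: string): boolean => /^[-_=~•·.*|\s]{3,}$/.test(line);
  // Bare page markers such as "2", "Page 1", "1/2", or "Page 1 of 2".
  const isPageNumber = (line: string): boolean =>
    /^(page\s+)?\d{1,3}(\s*(\/|of)\s*\d{1,3})?$/i.test(line);
  // Skill bars that OCR renders as runs of block or circle glyphs.
  const isProgressBar = (line: string): boolean => /^[█▓▒░■□●○\s]{3,}$/.test(line);

  return raw
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(
      (line) =>
        line.length > 0 && !isSeparator(line) && !isPageNumber(line) && !isProgressBar(line),
    )
    .join("\n");
}

const cleaned = cleanOcrText("Jane Doe\n-----\nSkills\n█████░░░\nTypeScript\nPage 1 of 2\n2\n");
```

Filtering whole lines is deliberately conservative: a line that mixes real text with stray glyphs is kept rather than risk deleting content.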

2c — PII Redaction (optional)

When redactPii: true, personal data is replaced with deterministic placeholders before the LLM call:

john.doe@gmail.com  →  __PII_EMAIL_0__
+1 (555) 123-4567   →  __PII_PHONE_1__
https://johndoe.dev →  __PII_URL_2__

After extraction, the real values are reinjected. The final output is identical to a non-redacted run.
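
The round trip can be sketched as below. This is not the package's `redactPii` / `reinjectPii` implementation: the function names `redactSketch` and `reinjectSketch` are hypothetical, and the regular expressions are simplified assumptions that cover only emails, URLs, and phone-like digit runs.

```typescript
// Simplified, assumed patterns: (1) email, (2) URL, (3) phone-like digit run.
const PII_PATTERN =
  /([\w.+-]+@[\w-]+(?:\.[\w-]+)+)|(https?:\/\/[^\s)]+)|(\+?\d[\d\s().-]{7,}\d)/g;

// Replace each match with a placeholder and remember the original value.
function redactSketch(text: string): { redacted: string; mapping: Map<string, string> } {
  const mapping = new Map<string, string>(); // placeholder -> original
  const seen = new Map<string, string>(); // original -> placeholder
  const redacted = text.replace(PII_PATTERN, (match: string, email?: string, url?: string) => {
    const existing = seen.get(match);
    if (existing !== undefined) return existing; // same value, same placeholder
    const kind = email ? "EMAIL" : url ? "URL" : "PHONE";
    const placeholder = `__PII_${kind}_${mapping.size}__`;
    mapping.set(placeholder, match);
    seen.set(match, placeholder);
    return placeholder;
  });
  return { redacted, mapping };
}

// Put the original values back wherever a placeholder appears.
function reinjectSketch(text: string, mapping: Map<string, string>): string {
  let restored = text;
  mapping.forEach((original, placeholder) => {
    restored = restored.split(placeholder).join(original);
  });
  return restored;
}

const sample = "Contact: john.doe@gmail.com, +1 (555) 123-4567, https://johndoe.dev";
const { redacted, mapping } = redactSketch(sample);
const restoredSample = reinjectSketch(redacted, mapping);
```

Because the placeholder-to-value mapping never leaves the process, only the redacted text needs to be sent to the model; reinjection is a plain string substitution on whatever comes back.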

3 — Task Decomposition

Instead of one monolithic LLM call, resume-intel runs parallel focused extractions per section — each with its own prompt, schema, maxTokens cap, and retry loop. This reduces hallucinations and isolates failures.
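
A rough sketch of the pattern, with a fake extractor standing in for the per-section LLM call. The `Extractor` type, `extractSection`, and `extractAll` are hypothetical names for illustration; prompts, schemas, and the `maxTokens` cap are not modeled here.

```typescript
type SectionResult<T> =
  | { section: string; ok: true; data: T; retries: number }
  | { section: string; ok: false; error: string; retries: number };

// Hypothetical extractor signature: one focused call per section.
type Extractor<T> = (section: string, attempt: number) => Promise<T>;

// Run one section with its own retry loop; never rejects.
async function extractSection<T>(
  section: string,
  extract: Extractor<T>,
  maxRetries: number,
): Promise<SectionResult<T>> {
  let lastError = "unknown error";
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const data = await extract(section, attempt);
      return { section, ok: true, data, retries: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { section, ok: false, error: lastError, retries: maxRetries };
}

// Launch all sections in parallel; a failed section does not affect the others.
function extractAll<T>(
  sections: string[],
  extract: Extractor<T>,
  maxRetries = 2,
): Promise<SectionResult<T>[]> {
  return Promise.all(sections.map((section) => extractSection(section, extract, maxRetries)));
}

// Stand-in for an LLM call: "work" fails once, "awards" always fails.
const fakeExtract: Extractor<string> = async (section, attempt) => {
  if (section === "work" && attempt === 0) throw new Error("invalid JSON");
  if (section === "awards") throw new Error("timeout");
  return `${section} data`;
};
```

Calling `extractAll(["basics", "work", "awards"], fakeExtract, 1)` resolves with one result per section, so a caller can keep the sections that succeeded and report the one that did not.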

4 — JSON Repair and Validation

  1. Repair — strips markdown fences, fixes trailing commas, missing brackets via jsonrepair
  2. Validate — Zod schema enforcement
  3. Self-correct — Zod errors are fed back to the LLM for targeted correction (up to maxRetries times)
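
The three steps can be sketched as a single loop. To keep the example free of dependencies, `repairSketch` is a deliberately tiny stand-in for jsonrepair and `validateBasics` is a hand-written stand-in for a Zod schema; both, along with the `Llm` type and `extractWithCorrection`, are illustrative assumptions rather than the package's code.

```typescript
interface Basics { name: string; email?: string; }
type Validation = { ok: true; data: Basics } | { ok: false; issues: string[] };
type Llm = (prompt: string) => Promise<string>; // hypothetical model call

// Stand-in for jsonrepair: strips a surrounding markdown fence and trailing commas only.
function repairSketch(raw: string): string {
  return raw
    .replace(/^\s*`{3}(?:json)?\s*/i, "")
    .replace(/\s*`{3}\s*$/, "")
    .replace(/,\s*([}\]])/g, "$1");
}

// Stand-in for Zod: returns either typed data or a list of readable issues.
function validateBasics(value: unknown): Validation {
  if (typeof value !== "object" || value === null) {
    return { ok: false, issues: ["expected an object"] };
  }
  const record = value as Record<string, unknown>;
  const issues: string[] = [];
  if (typeof record["name"] !== "string") issues.push("name: expected string");
  if (record["email"] !== undefined && typeof record["email"] !== "string") {
    issues.push("email: expected string");
  }
  return issues.length === 0
    ? { ok: true, data: record as unknown as Basics }
    : { ok: false, issues };
}

// Repair, validate, and on failure feed the issues back to the model.
async function extractWithCorrection(llm: Llm, prompt: string, maxRetries = 2): Promise<Basics> {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await llm(currentPrompt);
    let issues: string[];
    try {
      const result = validateBasics(JSON.parse(repairSketch(raw)));
      if (result.ok) return result.data;
      issues = result.issues;
    } catch {
      issues = ["response was not valid JSON"];
    }
    currentPrompt = `${prompt}\n\nYour previous answer had these problems:\n${issues.join("\n")}`;
  }
  throw new Error(`validation failed after ${maxRetries} retries`);
}
```

Feeding back the specific issues, rather than simply asking again, gives the model a concrete target for the correction.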

Features

Feature            | Description
Spatial Extraction | Bounding box algorithm reconstructs multi-column reading order
OCR Fallback       | Tesseract.js + @napi-rs/canvas for scanned PDFs, fully local
PII Redaction      | Redacts email, phone, addresses, URLs before LLM; reinjects after
Model Agnostic     | Vercel AI SDK; works with DeepSeek, OpenAI, Anthropic, Gemini, Ollama
JSON Resume v1     | Output conforms to the open standard used by hundreds of tools
15 Sections        | basics, work, education, skills, languages, projects, awards, certificates, publications, volunteer, interests, references
Custom Sections    | sections option: extract only what you need
Custom Schema      | outputSchema option: replace JSON Resume with your own Zod schema
Zod Validation     | Full schema enforcement with self-correcting retry loop
JSON Repair        | jsonrepair handles markdown fences, trailing commas, truncated responses
Task Decomposition | Parallel per-section extraction with independent retry
Confidence Scores  | Per-section reliability signal in sectionResults
Streaming          | streamResume(): AsyncGenerator for progressive UI updates
CLI                | resume-intel parse <file.pdf>: parse from the terminal
Serverless Ready   | Worker via MessageChannel; works on Vercel and AWS Lambda
TypeScript First   | Full type safety, ESM + CJS dual build