OCR Fallback
What is a scanned PDF?
A scanned PDF is a PDF that contains images of pages rather than embedded text. It is created by scanning a physical document and saving it as a PDF. Unlike text-native PDFs (created by Word, LaTeX, or a PDF printer), scanned PDFs have no text layer — they are just images.
When pdfjs-dist extracts text from a scanned PDF, it returns an empty string. Without OCR, the LLM receives nothing to work with.
Automatic detection
resume-intel automatically detects scanned PDFs by measuring text density:
- If the average extracted text is < 50 characters per page → classified as scanned (95% confidence)
- If the average is < 150 characters per page → classified as likely scanned (70% confidence)
- Otherwise → classified as text-native
When a scanned PDF is detected, the OCR pipeline triggers automatically. No configuration needed.
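The thresholds above can be sketched as a small classifier. This is a minimal illustration, not resume-intel's actual implementation: the function name `classifyTextDensity`, the returned field names, and the choice to count trimmed characters are all assumptions.

```javascript
// Hypothetical sketch of text-density classification; names are illustrative,
// not resume-intel's real API.
function classifyTextDensity(pageTexts) {
  // Assumption: density is measured on trimmed per-page text.
  const totalChars = pageTexts.reduce((sum, text) => sum + text.trim().length, 0);
  // Guard against division by zero for a document with no pages.
  const avgCharsPerPage = totalChars / Math.max(pageTexts.length, 1);

  if (avgCharsPerPage < 50) {
    return { kind: 'scanned', confidence: 0.95, avgCharsPerPage };
  }
  if (avgCharsPerPage < 150) {
    return { kind: 'likely-scanned', confidence: 0.7, avgCharsPerPage };
  }
  // The thresholds above give no confidence figure for text-native PDFs.
  return { kind: 'text-native', confidence: null, avgCharsPerPage };
}
```

Because the rules are checked in order, the second branch effectively covers averages from 50 up to (but not including) 150 characters per page.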
The OCR pipeline
PDF buffer
↓
pdfjs-dist renders each page to PNG
↓
@napi-rs/canvas provides the canvas context
↓
Tesseract.js (WASM) runs OCR on each PNG
↓
OCR text cleaner strips artefacts
↓
Clean text → LLM extraction
Step 1 — Rasterization
Each PDF page is rendered to a PNG image at 150 DPI using pdfjs-dist with @napi-rs/canvas as the rendering backend. 150 DPI provides a good balance between OCR accuracy and processing speed.
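PDF page dimensions are expressed in points (1/72 inch by default), so rendering at 150 DPI corresponds to a scale factor of 150/72, roughly 2.08. The sketch below shows that arithmetic; the helper name `renderSizeAtDpi` is hypothetical and not part of resume-intel. In pdfjs-dist, a scale factor like this is what would be passed to `page.getViewport({ scale })`.

```javascript
const POINTS_PER_INCH = 72; // default PDF user-space units per inch

// Hypothetical helper: scale factor and pixel dimensions of a page rendered at a given DPI.
function renderSizeAtDpi(widthPt, heightPt, dpi = 150) {
  return {
    scale: dpi / POINTS_PER_INCH,
    widthPx: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    heightPx: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}

// US Letter (612 x 792 pt) at 150 DPI
const letter = renderSizeAtDpi(612, 792);
console.log(letter.widthPx, letter.heightPx); // 1275 1650
```

Pixel count grows with the square of the DPI, so doubling the resolution roughly quadruples the amount of image data the OCR step has to process.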
Step 2 — OCR
Tesseract.js runs locally in WASM mode. A single worker is created and reused across all pages to avoid re-initializing the WASM engine for each page.
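A minimal sketch of the single-worker pattern described above. To keep it independent of any one Tesseract.js version, the worker factory is passed in as a parameter; `ocrPages` and `createWorkerFn` are hypothetical names, not resume-intel internals. The only assumption about the worker is that it exposes `recognize(image)` resolving to `{ data: { text } }` and `terminate()`, which matches the Tesseract.js worker interface.

```javascript
// Hypothetical sketch: create one OCR worker, reuse it for every page,
// and always release it afterwards.
async function ocrPages(pageImages, createWorkerFn) {
  const worker = await createWorkerFn(); // the engine is initialized once here
  const texts = [];
  try {
    // Pages are processed sequentially on the same worker.
    for (const image of pageImages) {
      const { data } = await worker.recognize(image);
      texts.push(data.text);
    }
  } finally {
    // Free the worker even if recognition throws on some page.
    await worker.terminate();
  }
  return texts;
}
```

With Tesseract.js v5 or later, the factory could be something like `() => createWorker('eng')`; earlier versions need separate language-loading and initialization calls before `recognize` can be used.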
Step 3 — Text cleaning
OCR output contains artefacts that LLMs interpret literally. The cleaner removes:
- Block-fill characters from skill meters (████░░░, ■■■□□)
- Pipe/dash separator lines (| | | |, ----------)
- Isolated single special characters (|, •, ·)
- Page number patterns (Page 1 of 3, - 2 -)
- Repeated header/footer lines (same line appearing 2+ times)
This eliminates corrupted field values like "fluency": "|" that were common in v0.1.0.
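The cleaning rules listed above can be approximated with a few regular expressions. This is an illustrative sketch, not the library's actual cleaner: the function name `cleanOcrText`, the exact patterns, the decision to drop blank lines, and the decision to remove every occurrence of a repeated line are all assumptions.

```javascript
// Hypothetical patterns approximating the cleaning rules.
const BLOCK_FILL = /[█▓▒░■□]/g;               // skill-meter glyphs
const SEPARATOR_LINE = /^[\s|-]+$/;            // lines made only of pipes, dashes, spaces
const ISOLATED_SYMBOL = /^[|•·]$/;             // a single stray symbol on its own line
const PAGE_NUMBER = /^(page\s+\d+\s+of\s+\d+|-\s*\d+\s*-)$/i;

function cleanOcrText(raw) {
  const lines = raw
    .split(/\r?\n/)
    .map((line) => line.replace(BLOCK_FILL, '').trim())
    .filter((line) => line !== '')             // assumption: blank lines are dropped
    .filter((line) => !PAGE_NUMBER.test(line))
    .filter((line) => !SEPARATOR_LINE.test(line))
    .filter((line) => !ISOLATED_SYMBOL.test(line));

  // Count occurrences so that lines seen 2+ times (likely headers/footers) can be removed.
  const counts = new Map();
  for (const line of lines) {
    counts.set(line, (counts.get(line) ?? 0) + 1);
  }
  return lines.filter((line) => counts.get(line) === 1).join('\n');
}
```

A real cleaner would likely be more conservative about repeated lines, since short legitimate lines (for example a job title held twice) can also repeat.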
Checking if OCR was used
const result = await parseResume(pdfBuffer, { model })
if (result.meta.ocrFallback) {
console.log('Scanned PDF detected — OCR was used')
}
Performance
OCR is significantly slower than text extraction:
| Method | Typical time (3-page CV) |
|---|---|
| Text-native extraction | 200–500ms |
| OCR fallback | 8–20 seconds |
The extra time comes from Tesseract.js initialization and per-page image processing. This is a one-time cost per document.
Limitations
- OCR accuracy depends on scan quality. Low-resolution or blurry scans may produce poor results.
- Bounding box coordinates are not available from OCR output — spatial reconstruction is skipped.
- Languages other than English are not yet supported: the OCR language is currently fixed to English (see below).
Language support
// Default is English
const result = await parseResume(pdfBuffer, {
model,
// OCR language is configured internally — English by default
})
For multilingual CVs, the OCR language is currently fixed to English (eng). Support for additional Tesseract language packs is planned for a future release.