Ocr And Documents — Extract text from PDFs/scans (pymupdf, marker-pdf)
Ocr And Documents
Section titled “Ocr And Documents”Extract text from PDFs/scans (pymupdf, marker-pdf).
Skill metadata
Section titled “Skill metadata”| Source | Bundled (installed by default) |
| Path | skills/productivity/ocr-and-documents |
| Version | 2.3.0 |
| Author | Hermes Agent |
| License | MIT |
| Tags | PDF, Documents, Research, Arxiv, Text-Extraction, OCR |
| Related skills | powerpoint |
Reference: full SKILL.md
Section titled “Reference: full SKILL.md”Info The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
PDF & Document Extraction
Section titled “PDF & Document Extraction”For DOCX: use python-docx (parses actual document structure, far better than OCR).
For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support).
This skill covers PDFs and scanned documents.
Step 1: Remote URL Available?
Section titled “Step 1: Remote URL Available?”If the document has a URL, always try web_extract first:
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])web_extract(urls=["https://example.com/report.pdf"])This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.
Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
Step 2: Choose Local Extractor
Section titled “Step 2: Choose Local Extractor”| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
If the user needs marker capabilities but the system lacks ~5GB free disk:
“This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations.”
pymupdf (lightweight)
Section titled “pymupdf (lightweight)”pip install pymupdf pymupdf4llmVia helper script:
python scripts/extract_pymupdf.py document.pdf # Plain textpython scripts/extract_pymupdf.py document.pdf --markdown # Markdownpython scripts/extract_pymupdf.py document.pdf --tables # Tablespython scripts/extract_pymupdf.py document.pdf --images out/ # Extract imagespython scripts/extract_pymupdf.py document.pdf --metadata # Title, author, pagespython scripts/extract_pymupdf.py document.pdf --pages 0-4 # Specific pagesInline:
python3 -c "import pymupdfdoc = pymupdf.open('document.pdf')for page in doc: print(page.get_text())"marker-pdf (high-quality OCR)
Section titled “marker-pdf (high-quality OCR)”# Check disk space firstpython scripts/extract_marker.py --check
pip install marker-pdfVia helper script:
python scripts/extract_marker.py document.pdf # Markdownpython scripts/extract_marker.py document.pdf --json # JSON with metadatapython scripts/extract_marker.py document.pdf --output_dir out/ # Save imagespython scripts/extract_marker.py scanned.pdf # Scanned PDF (OCR)python scripts/extract_marker.py document.pdf --use_llm # LLM-boosted accuracyCLI (installed with marker-pdf):
marker_single document.pdf --output_dir ./outputmarker /path/to/folder --workers 4 # BatchArxiv Papers
Section titled “Arxiv Papers”# Abstract only (fast)web_extract(urls=["https://arxiv.org/abs/2402.03300"])
# Full paperweb_extract(urls=["https://arxiv.org/pdf/2402.03300"])
# Searchweb_search(query="arxiv GRPO reinforcement learning 2026")Split, Merge & Search
Section titled “Split, Merge & Search”pymupdf handles these natively — use execute_code or inline Python:
# Split: extract pages 1-5 to a new PDFimport pymupdfdoc = pymupdf.open("report.pdf")new = pymupdf.open()for i in range(5): new.insert_pdf(doc, from_page=i, to_page=i)new.save("pages_1-5.pdf")# Merge multiple PDFsimport pymupdfresult = pymupdf.open()for path in ["a.pdf", "b.pdf", "c.pdf"]: result.insert_pdf(pymupdf.open(path))result.save("merged.pdf")# Search for text across all pagesimport pymupdfdoc = pymupdf.open("report.pdf")for i, page in enumerate(doc): results = page.search_for("revenue") if results: print(f"Page {i+1}: {len(results)} match(es)") print(page.get_text("text"))No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.
web_extractis always first choice for URLs- pymupdf is the safe default — instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
- Both helper scripts accept
--helpfor full usage - marker-pdf downloads ~2.5GB of models to
~/.cache/huggingface/on first use - For Word docs:
pip install python-docx(better than OCR — parses actual structure) - For PowerPoint: see the
powerpointskill (uses python-pptx)