PDF to Excel/CSV/JSON with OCR and PDF light remediation. Per-page pricing, audited deliverables, milestone-based, fast turnaround. Clean tables, accurate totals, client-ready files.
I convert native or scanned PDFs into clean, structured data and publication-ready documents. My focus is accuracy, auditability, and pace: spreadsheets you can trust, text you can search, and results you can review quickly with clear acceptance criteria.
What I do best
• PDF → Excel/CSV/JSON (schema-aligned if you provide one), with normalized numbers, dates, and IDs.
• OCR for scans and image-heavy PDFs to ensure selectable, searchable text.
• Table extraction that preserves logical rows/columns (no broken cells), with light cleanup for merged headers.
• Light PDF remediation (on request): sensible reading order, headings (H1–H3), bookmarks/TOC, and basic alt text for key images.
• QA evidence: spot checks, checksum manifest on request, and simple cross-totals to prevent silent errors.
Where this helps
• Invoices/receipts/statements to Excel (totals aligned, taxes separated).
• Research papers and reports to editable Word/Markdown and analyzable CSVs.
• Forms and simple registers that need to become structured datasets.
How I work
Discovery: you share sample pages and the desired output (or a column schema).
Quote & milestones: I price per page and set SafePay milestones (e.g., 20 pages each).
Pilot: I deliver a small milestone first. You verify structure and correctness.
Scale-up: we proceed milestone by milestone until the batch is complete.
Final QA: I provide clean files and a concise note of what changed and why.
Light vs. full remediation
Light remediation keeps documents usable (reading order, headings, bookmarks/TOC, basic alt text). Full accessibility to PDF/UA or WCAG involves comprehensive tagging, complex tables, forms, and assistive-tech testing; I can scope it as a separate, custom engagement or refer you to a specialist partner.
What you can expect
• Fast turnarounds with realistic commitments.
• A per-page model that keeps budgets predictable.
• Honest red flags early (e.g., low-quality scans or handwritten content).
• Calm, clear communication and zero drama.
Toolchain
• OCR & layout: Tesseract/ABBYY, Ghostscript/PDF utilities.
• Parsing & data checks: Python (pdfplumber, pandas), Tabula/Camelot when appropriate.
• Formats: XLSX, CSV, JSON, Word/Markdown; ZIP packages with manifests on request.
Acceptance criteria (typical)
• Machine-readable text (after OCR).
• Tables align by rows/columns and totals add up.
• Dates and numbers are consistently formatted (locale can be agreed).
• For text deliverables, headings and reading order are logical, with a minimal TOC.
Boundaries
• Complex accessibility remediation (PDF/UA), interactive forms, mathematical typesetting, and heavy graphical layouts are not included by default. If you need them, I’ll quote them separately so we keep scope clean.
If you’re unsure where to start, send two pages: one “typical”, one “worst case”. I’ll return a short pilot so you can see quality, timing, and price before committing to a full batch.
Length: 2,962 characters • 448 words
Work Terms
Scope & pricing
• Per-page pricing, defined as one PDF page or ~250 words of extracted text (whichever is clearer in the brief).
• Includes OCR (where needed), table/text extraction, light cleanup, and basic QA checks.
• Optional add-ons: light remediation (reading order, headings, bookmarks/TOC), checksum manifest, and schema alignment.
Milestones & payments
• SafePay/escrow is required before work begins. Each milestone typically covers 10–25 pages.
• Delivery occurs per milestone; approval releases funds and starts the next milestone.
Turnaround & revisions
• Estimated delivery is provided for each milestone. Rush work may incur a surcharge.
• One focused revision per milestone is included to address extraction errors or formatting inconsistencies.
• Changes in scope (new pages, different schema, complex tables/forms) are quoted as change requests.
Quality & acceptance
• Acceptance criteria must be objective (e.g., totals match; columns present; dates normalized).
• If a page is unreadable (e.g., illegible scans), I’ll flag it and propose options (reruns, manual typing, exclusion).
Data protection & confidentiality
• Files are handled confidentially and retained only as long as needed to deliver and support the project.
• Upon completion, I can delete working copies on request and provide a brief handover note of methods used.
Intellectual property
• You own the final exported data and documents upon full payment. I retain generic methods and templates.
Communication
• Preferred channel: Guru messages. I respond within one business day (faster during active milestones).
Cancellations & refunds
• If a funded milestone is canceled before work starts, it can be fully refunded.
• If partially completed, I’ll deliver the completed portion and refund the remainder pro-rata.
Warranties
• I correct extraction or formatting defects reported within 7 days of delivery at no extra charge (like-for-like fixes).
Length: 1,942 characters • 290 words