OCR with a Mac: Automate PDF-processing via shell

OCR with a Mac is easy if you use an OCR program like Abbyy or Prizmo for OS X, put all your files into Evernote (and subscribe to the premium service), or spend 400 USD on a Fujitsu ScanSnap iX500. None of those solutions really fit my needs. I was looking for something (apparently) unusual.

My Requirements for OCR with OS X

The main goal is to produce searchable PDF files from all incoming snail mail, as easily as possible and in a way that suits my home office. That led to the following requirements:

  • Hardware
    • Small footprint (i.e. a small device)
    • Automatic document feeder
    • Duplex scanning
    • Ability to scan without the Mac running
  • Software
    • Fast OCR-processing
    • Small output-files
  • Hardware and software
    • Multi-user-friendly
    • Not too expensive
    • „Local-only“ solution / no cloud services involved

My OCR 0.1 solution: Epson XP-820, OCRKit for Mac & a shell script

  1. The Epson XP-820 [Homepage] is a really good printer-scanner-fax combination that retails for approx. 130 USD in the U.S. (in Germany, it’s 160 Euro / 180 USD…).
  2. OCRKit [Homepage] is a fast, neat OCR tool focused on text recognition (not layout preservation) and batch processing. On my mid-2009 MacBook Pro, OCRKit takes 50 seconds to process a 22-page PDF, i.e. a bit more than 2 seconds per page. The input file is 2.3 MB, the generated PDF with text 1.1 MB. The recognition rate is perfectly fine for me.
  3. The OCR shell script [source on GitHub] is the part I wrote myself. (Read: Beware, this is not quality software and you’re completely on your own: there’s absolutely no warranty or liability for defects.)

My OCR-workflow: „Scan and forget“

My main aim is to get rid of paper „to process“ (a.k.a. clutter) and establish a „postpone-friendly“ workflow.

  1. Scan incoming documents promptly, without a Mac, directly on the Epson XP-820 to an SD card.
    • Each user has their own SD card so that files from different users don’t get mixed up.
    • Paper is put away immediately after the scan (in an unsorted box…)
    • Scans may reside on the SD card for some time – there’s no need for immediate action (it’s a feature!)
  2. OCR on a Mac
    • A shell script runs hourly (or on demand) on the Mac that has OCRKit installed. (Access the source of the „smb ocr shell script“ on GitHub.)
    • The script connects to the scanner via SMB, processes the files and does other things depending on the card. Some files stay on the card so users without OCRKit can still retrieve them from the card.
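
A minimal sketch of what the processing step of such a script could look like. This is not the script from GitHub, and it does not use OCRKit’s real interface: OCR_CMD is a hypothetical placeholder for any command-line OCR tool that takes an input and an output file, and it defaults to cp only so that the sketch runs without an OCR tool installed. The function name process_scans and the directory layout are assumptions as well.

```shell
#!/bin/sh
# Sketch only. OCR_CMD is a hypothetical placeholder for a command-line OCR
# tool called as "tool input.pdf output.pdf"; this is NOT OCRKit's documented
# interface. It defaults to cp so the sketch can be tried without any OCR tool.
OCR_CMD="${OCR_CMD:-cp}"

# process_scans IN_DIR OUT_DIR
# Runs OCR_CMD on every PDF in IN_DIR that has no result in OUT_DIR yet.
# The originals are left untouched in IN_DIR.
process_scans() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1

    for src in "$in_dir"/*.pdf; do
        [ -e "$src" ] || continue        # the glob matched nothing
        dst="$out_dir/$(basename "$src")"
        [ -e "$dst" ] && continue        # already processed in an earlier run

        # Write to a temporary name first, so that a failed run does not
        # leave a half-written file that would be skipped next time.
        tmp="$dst.part"
        if "$OCR_CMD" "$src" "$tmp"; then
            mv "$tmp" "$dst"
        else
            rm -f "$tmp"
            echo "OCR failed for $src" >&2
        fi
    done
}
```

With the scanner’s SD card mounted somewhere on the Mac (on OS X, mount_smbfs can mount an SMB share; server and share name depend on your setup), a call such as process_scans ~/scans ~/Documents/ocr from an hourly cron or launchd job would cover the „runs hourly“ part. Because finished files are skipped, running it again on demand does no harm.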

OCR 0.2: What to implement next in the shell script

The script produces quite clumsy filenames right now. The next step would be to name the files according to their content, e.g. by putting the names of (known) senders into them.
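
One way this could work, as a rough sketch rather than a finished feature: search the recognized text for a list of known senders and build the filename from the first hit. The sender list, the function names guess_sender and scan_filename, and the date-plus-sender naming scheme are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical list of known senders, one per line; replace with your own.
KNOWN_SENDERS="Deutsche Telekom
Finanzamt
Stadtwerke"

# guess_sender: reads recognized text on stdin and prints the first known
# sender that occurs in it (case-insensitive), or "unknown" if none matches.
guess_sender() {
    text=$(cat)
    # The here-document keeps the loop in the current shell, so "return"
    # leaves the function as soon as a sender is found.
    while IFS= read -r sender; do
        if printf '%s\n' "$text" | grep -qiF -- "$sender"; then
            printf '%s\n' "$sender"
            return 0
        fi
    done <<EOF
$KNOWN_SENDERS
EOF
    echo "unknown"
}

# scan_filename DATE: reads recognized text on stdin and prints a filename
# made of DATE and the guessed sender (lower-case, spaces turned into dashes).
scan_filename() {
    sender=$(guess_sender | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
    printf '%s_%s.pdf\n' "$1" "$sender"
}
```

The text has to come from somewhere: assuming the pdftotext tool from Poppler is installed (it is not part of OS X), something like pdftotext scan.pdf - | scan_filename 2015-03-01 would print 2015-03-01_deutsche-telekom.pdf for a letter that mentions Deutsche Telekom, and 2015-03-01_unknown.pdf if no known sender is found.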

The XP-820 is a device for home users, so it has no native support for multi-user scenarios. However, one could use the scanner’s mail feature: mail the documents to a local mailbox (e.g. on a Raspberry Pi) and process the files there.

Not my cup of tea (several actually)

Prizmo OCR-Software for Mac

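A year ago, I got Prizmo as part of a software bundle. It’s quite slow on my MacBook Pro from 2009 when I work with PDFs: it takes 17 seconds to load a 2.3 MB PDF and 17 seconds to OCR a single page of a 22-page document. At that rate, the whole document would need more than six minutes, compared to the 50 seconds OCRKit takes.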

If you work with the proprietary Prizmo file format, file sizes are OK. What’s not OK for me is that exported files are significantly larger than the original PDF: the 2.3 MB test file, OCR’d and exported, becomes a 45 MB file. Another downside is that the Pro Pack (which is required for this automation) costs extra.

However, it outperforms other programs with its GUI: Prizmo displays the scan and the recognized text side by side and jumps to the matching spot in the scan when you move the cursor in the text. That makes it very comfortable to manually check or correct scanned text. Plus, you can define which text areas are OCR’d, and in which order.

Abbyy OCR Software for Mac

An impressive feature list with lots of features that I really like, but don’t need. Of course, the features translate into a substantial price. Abbyy brags that its OCR technology is up to 99.8% accurate in internal tests. I would love to see external ones…

Hardware from Brother

I’ve had a Brother MFC-820CW for several years; it worked like a charm. I would have bought another Brother, but the product lineup does not include a printer-scanner combination with a duplex unit right now.

Hardware from Doxie

The Doxie Go Wifi for 230 USD looks really good: hardware and software from one vendor (apart from the OCR, which is from Abbyy), good reputation, runs on battery / is totally mobile. In Germany, however, the Amazon seller wants 420 Euro / 480 USD. Nah.

Hardware from Fujitsu

I really like the ScanSnap iX500 from Fujitsu, but it’s too advanced for me. It excels at scanning and surely at OCR, but the main drawback for me, besides the price, is that I don’t want another fairly large computer device in my flat. I would have bought the ScanSnap if I’d required the superior speed or the ability to scan really thick paper (209 g/m²!).

Bonus

  • The German page druckerchannel.de lists (nearly?) all printers with their specifications and has filters so you can drill down easily (e.g. show all printers with ADF and duplex printing capabilities, sorted by total height of the printer). Using Google Translate to translate the drill-down feature into English works OK.
  • If you speak German and have an hour of spare time, watch this hilarious talk about printer/scanner products that alter the text while scanning/copying. There is also an English text about the incident.
