I have a genealogy of my Link family that was written on a typewriter and published at a copy center. To keep the page count low, they squished things together as much as possible. An amazing amount of research went into this document. Unfortunately, their layout and photo-management skills were a bit lacking.
I have started scanning the document into PDF files which, combined with Evernote, make it somewhat searchable. The punched holes for the comb binding and the poor layout make it a very ugly publication. It deserves more. So when I stumbled onto PDF Converter with OCR [Mac – $19.99], I had to check it out. Fortunately, they offer a free trial. I played with it for a few minutes and was convinced. This app is amazing!
Here you see PDF Converter’s work area. I’ve opened the first scanned PDF file and I’m looking at the first page. Over on the right you see the output format choices available. The boxes in green are areas designated as text. Other options are images (displayed in a red box) and tables (displayed in a purple box). Your first step is to go through the pages and make adjustments as necessary. I had to adjust box sizes on most of the images, and I deleted text boxes containing footers and page numbers. Here are the results from the first page . . .
Yes, there are some problems – most of them related to the typewriter that created the document. One page included a poem that had been printed using some kind of script font, and the OCR’d page was almost impossible to read. By contrast, when I experimented with a simple Pages document that I printed and then scanned, it was converted without any errors.
Attempts to OCR tables and columns often end up as a jumbled disaster, but PDF Converter did an amazing job of maintaining the columnar text format you see in the example below. The dark lines show where PDF Converter is defining rows and columns. Yes, this example will require some cleanup, but it’s mild compared to the usual results I’ve gotten from other apps.
There are a few quirks. Each text box you see in the app’s layout screen is saved as a text box in Pages. And, when I tried to apply styles to the text I got a list of styles in something that looks like Chinese lettering. My solution is to copy/paste the text out of the various boxes into a new document – one with appropriate styles for my project.
It will still take some time and effort to turn the entire publication into editable text. This delightful app means that an amazing piece of family history will get the layout and design attention it deserves.