PDF Converter With OCR

I have a genealogy of my Link family that was written on a typewriter and published at a copy center. To keep the page count down, they squeezed things together as much as possible. An amazing amount of research went into this document. Unfortunately, their layout and photo-management skills were a bit lacking.

I have started scanning the document into PDF files which, combined with Evernote, make it somewhat searchable. The punched holes for the comb binding and the poor layout make it a very ugly publication. It deserves more. So when I stumbled onto PDF Converter with OCR [Mac – $19.99], I had to check it out. Fortunately, they offer a free trial. I played with it for a few minutes and was convinced. This app is amazing!

PDF Converter

Here you see PDF Converter’s work area. I’ve opened the first scanned PDF file and I’m looking at the first page. Over on the right you see the output format choices available. The boxes in green are areas designated as text. Other options are images (displayed in a red box) and tables (displayed in a purple box). Your first step is to go through the pages and make adjustments as necessary. I had to adjust box sizes on most of the images and I deleted text boxes containing footers and page numbers. Here are the results from the first page . . .

[Image: WPtagcloud.png]

Yes, there are some problems – most related to the typewriter that created the document. One page included a poem that had been printed using some kind of script font. The OCR’d page was almost impossible to read. By contrast, when I printed and then scanned a simple Pages document as an experiment, it was converted without any errors.

Attempts to OCR tables and columns often end up as a jumbled disaster, but PDF Converter did an amazing job of maintaining the columnar text format you see in the example below. The dark lines demonstrate where PDF Converter is defining rows and columns. Yes, this example will require some cleanup but it’s mild compared to the usual results I’ve gotten from other apps.

PDF Converter

There are a few quirks. Each text box you see in the app’s layout screen is saved as a text box in Pages. And, when I tried to apply styles to the text I got a list of styles in something that looks like Chinese lettering. My solution is to copy/paste the text out of the various boxes into a new document – one with appropriate styles for my project.

It will still take some time and effort to turn the entire publication into editable text. This delightful app means that an amazing piece of family history will get the layout and design attention it deserves.

Making room for more storage

Years ago, an engineering firm I worked for increased the size of their office by half again the existing square footage. Most of that additional space would go to file storage. At the time I had just gotten my first scanner and was just beginning to learn the joys of digitizing photos and documents. Surprisingly, the scanner’s software even included an OCR function and it worked quite well. I was just beginning my fight with carpal tunnel syndrome and this OCR thing was a real blessing to me. The engineering firm was using computers for documents and drawings, and while they did appreciate the ability to grab an existing digital document and edit it rather than start each one from scratch, they hadn’t realized the potential of digital storage. In fact, we were often making multiple copies of paper documents and filing them in different places just so it would be easier to find them later. HUH?

One day I was given an old paper proposal – a rather large one – to be typed so they could edit it for a new project they wanted to bid. I drove home, scanned it, OCR’d it and drove back to the office in half the time it would have taken to type it. They were delighted I had it ready so quickly, but this scanning thing was just a flash in the pan to them. I even prepared a cost analysis comparing the cost of storing their paper files (office space, cabinets, paper, etc.) vs. digital ones (disk storage, software, scanners, etc.). Even back then the digital solution was significantly less expensive – and that didn’t include the amount of labor spent filing, managing and finding paper documents. I was told it would be too disruptive.

It was time to start looking for a new job.

This week I bought my second external hard drive. My existing 1TB WD My Book is almost full now that I have more time for scanning and other digital projects. I got a 3TB WD My Book for less than $140. The Windows version is about the same price. These new drives take advantage of the USB 3.0 protocol, which is significantly faster. If you have an older computer that only has USB 2.0 connections, you can still use these drives, but you won’t get the speed advantage. Once you upgrade your computer, the drives will perform at their top speed.

So now I have two drives – each the size of a good James Michener novel – sitting on my desk. A quick search can bring any document or photo to the screen in a matter of seconds. In addition to family ephemera, I’ve been working to take our household records paperless (or as close as possible) too. I should be in pretty good storage shape until I get ready to tackle my husband’s collection of slides. He’ll be buying that drive!

I still need to do some reorganization and remodeling to update my entire file system. Santa brought me a copy of Apple’s Aperture app for Christmas, which I haven’t really put to good use yet because my photo collection needs some serious spring cleaning. Now’s a good time to do that too. At least there won’t be any heavy lifting involved in this remodeling job.

I wonder if those engineers ever saw the digital light . . .

Evernote’s Photographic Memory

Did you know Evernote had a photographic memory?

Evernote has some pretty significant OCR (optical character recognition) capabilities. What this means to you is you can take pictures of words, save them as notes in Evernote, and Evernote can read (and search) those words. Not only that, but Evernote can even read some handwriting (print, not cursive, and keeping in mind that there’s some handwriting no one can read).

Okay, so how do you put this photographic memory to work in your research? The first step is to make sure you’ve got the Evernote app on your phone (iOS or Android). Now, use that app to capture photo notes. Some of the things to photograph include:

  • historical markers
  • headstones
  • documents
  • pages in books
  • handwritten notes
  • whiteboard information during meetings or classes

In addition to capturing the text of a photographed object, the Evernote app and your phone are also recording geolocation data. Add a few tags to better define each note’s contents and you have created a photographic memory of your research field trip.

Yes, there are limits to Evernote’s OCR capabilities. The quality of the photograph – and the object itself – will affect Evernote’s ability to recognize the text it contains. You can supplement Evernote’s OCR effort by including audio notes dictating the contents of the item you just captured in a photo note.

If you don’t have the Evernote app on your phone, get it now. It costs nothing but its value is priceless.