Mar 29, 2016

Converting Unsearchable PDF Files (aka PDF scans) to Raw Text Using Command Line Tools `convert` and `tesseract`

Often, one gets a PDF file that is a scan of a book or text, which cannot be searched (boo!). A good (but not perfect) solution is to use Optical Character Recognition (OCR) to convert the pdf to a txt file and search that instead.

Here is my solution.

Requirements

Command line tools
- convert
- tesseract
I installed both using Homebrew on Mac OS X 10.11.3. This matters because it determines where the tools are installed on my system (under /usr/local/).
You also need some knowledge of, and comfort with, the command line. It helps if you understand how to use the find command.
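
Before starting, it can help to confirm that both tools are actually on your PATH. Here is a minimal sketch of such a check; the helper name missing_tools is made up for this example and is not part of either package.

```shell
# Print the name of every listed command that cannot be found on PATH.
# (missing_tools is a hypothetical helper, not part of ImageMagick or tesseract.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Prints nothing when both tools are installed.
missing_tools convert tesseract
```

Note that on newer ImageMagick 7 installs the main command may be called magick rather than convert, so a missing convert does not always mean ImageMagick is absent.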

Workflow

1. Convert `pdf` to `tiff`

Say we have a PDF called Bookscan.pdf. Below, we create a new directory called tiffs/ in the same directory as Bookscan.pdf, then use the command line tool convert to turn the PDF into a TIFF (here, it's called bookscan.tiff, the name used in the later steps).

mkdir tiffs
convert -density 600 -depth 4 -monochrome -background white -blur '0x2' -shave '0x200' Bookscan.pdf tiffs/bookscan.tiff

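To learn more about the options, visit the ImageMagick site. In brief:

- -density sets the resolution (dpi) at which the PDF is rendered
- -depth is the number of bits per color sample
- -monochrome converts the image to black and white only
- -blur is useful for super sharp scans (thin letters are bad for OCR, thick ones are good)
- -shave strips pixels from the edges of the output image (so you need to figure out the size of the final image). Useful when books have chapter names or page numbers at the top. In '0x200', 0 is the width trimmed from the left and right edges and 200 is the height trimmed from the top and bottom.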

(Note that the options in the sample code above just happen to work for the set of documents I was converting.)
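
To get a feel for what the -density and -shave values mean in pixels, here is a small arithmetic sketch. It assumes a US Letter page (8.5 x 11 inches), which is purely an illustrative assumption; your scans may well have a different page size.

```shell
# Assumption: a US Letter page (8.5 x 11 inches); your scan may differ.
density=600                       # dots per inch, as passed to -density
width=$(( density * 85 / 10 ))    # 8.5 in at 600 dpi
height=$(( density * 11 ))        # 11 in at 600 dpi

# -shave WxH trims W pixels from the left and right edges
# and H pixels from the top and bottom edges.
shave_w=0
shave_h=200

echo "before: ${width}x${height}"
echo "after:  $(( width - 2 * shave_w ))x$(( height - 2 * shave_h ))"
```

Under these assumptions a page goes from 5100x6600 to 5100x6200 pixels, so '0x200' removes a third of an inch from the top and from the bottom. ImageMagick's identify tool can report the real pixel dimensions of an existing image if you would rather measure than guess.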

2. Make Sure We Ignore Annoying Characters Like ‘ligatures’

I found that tesseract consistently adds ligatures, which ruins the ability to search for some words. It is possible to keep tesseract from using them by creating a blacklist. I copied a list of ligatures from this page.

To set up the blacklist, add a file to /usr/local/share/tessdata/configs/ (this assumes a brew installation on Mac OS X) that contains the following:

tessedit_char_blacklist ꜲꜳÆæꜴꜵꜶꜷꜸꜹꜼꜽǱǲǳǄǅǆﬀﬃﬄﬁﬂĲĳǇǈǉǊǋǌŒœꝎꝏﬅᵫꝠꝡ

I named the file ligatures, so on my computer it lives at /usr/local/share/tessdata/configs/ligatures.
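
If you would rather create the config file from the command line than in an editor, something like the following sketch should work. The function name write_ligatures_config is hypothetical, and the target path in the commented-out call is the Homebrew location mentioned above; your tessdata directory may be somewhere else, and you may need write permission for it.

```shell
# write_ligatures_config DIR
# Writes a one-line tesseract config file named "ligatures" into DIR.
# (Hypothetical helper; the blacklist line is the one shown above.)
write_ligatures_config() {
    configs_dir=$1
    printf '%s\n' 'tessedit_char_blacklist ꜲꜳÆæꜴꜵꜶꜷꜸꜹꜼꜽǱǲǳǄǅǆﬀﬃﬄﬁﬂĲĳǇǈǉǊǋǌŒœꝎꝏﬅᵫꝠꝡ' \
        > "$configs_dir/ligatures"
}

# Path from a Homebrew install on OS X; adjust if your tessdata lives elsewhere.
# write_ligatures_config /usr/local/share/tessdata/configs
```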

3. Convert the tiff to text

This is pretty straightforward. cd into the folder with the tiffs, then run the command:

cd tiffs/
tesseract bookscan.tiff bookscan -l eng ligatures

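Here, we are assuming the text is in English (-l eng), and we load the ligatures config file (which sets the blacklist variable). The second argument, bookscan, is the output base name, so tesseract writes the text to bookscan.txt.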

For other tips, see the Tesseract FAQ. Sometimes, files are just too noisy or tilted to work. Most of the scans I’ve worked with are pretty clean, so I’ve not had to struggle with something too complicated.

Granted, if there are problems with the image, the fixes would all have to be done in the conversion stage!
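
Once tesseract has produced a txt file, searching it is the whole point. A minimal sketch of a search helper follows; search_text is a hypothetical name, and it is just a thin wrapper around grep.

```shell
# search_text PATTERN FILE
# Case-insensitive search that prints matching lines with their line numbers.
# (search_text is a hypothetical wrapper around grep.)
search_text() {
    grep -i -n -- "$1" "$2"
}

# Example (assumes bookscan.txt exists in the current directory):
# search_text 'chapter' bookscan.txt
```

Searching case-insensitively is a deliberate choice here, since OCR output does not always get capitalization right.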

4. (optional) Looping Through Tiffs

If you cd into the folder full of tiffs, you can loop through all of them and convert each one to text.

cd tiffs/
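# Only match .tiff files (so existing .txt output is skipped) and quote names in case they contain spaces
find . -type f -name '*.tiff' | while IFS= read -r F; do tesseract "${F}" "${F%.tiff}" -l eng ligatures; done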

Boom. Loops through and converts any and all tiffs in the directory (here, called tiffs/).
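
The ${F%.tiff} part of the loop may look cryptic if you have not seen shell parameter expansion before. Here is a tiny illustration; the file name page-01.tiff is made up for the example.

```shell
# ${F%.tiff} removes the shortest match of ".tiff" from the end of $F.
F=./page-01.tiff          # hypothetical file name
echo "${F%.tiff}"         # prints ./page-01

# tesseract appends .txt to the output base itself, so the text lands in:
echo "${F%.tiff}.txt"     # prints ./page-01.txt
```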

And that’s it!

An Example

If you want to try a real example, try it with the following pdf: bookscan.pdf

Let’s pretend you put bookscan.pdf in your downloads folder. We’ll make a new folder called tiffs/, convert the pdf, then use tesseract.

cd ~/Downloads
mkdir tiffs
convert -density 600 -depth 4 -monochrome -background white -blur '0x2' -shave '200x450' bookscan.pdf tiffs/bookscan.tiff
cd tiffs
tesseract bookscan.tiff bookscan -l eng ligatures

I get the resulting tiff and txt files.
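
If you run this often, the steps can be wrapped in a small function. The following is only a sketch: the function name pdf_to_text and the DRY_RUN variable are hypothetical, the ligatures config is assumed to be set up as described earlier, and the convert options are simply the ones from this example, which may not suit your scans.

```shell
# pdf_to_text PDF
# Runs the convert and tesseract steps for one PDF in the current directory.
# Set DRY_RUN=1 to print the commands instead of running them.
# (Function name and DRY_RUN variable are hypothetical.)
pdf_to_text() {
    pdf=$1
    base=$(basename "$pdf" .pdf)
    run=${DRY_RUN:+echo}   # "echo" when DRY_RUN is set and non-empty, otherwise empty

    $run mkdir -p tiffs
    $run convert -density 600 -depth 4 -monochrome -background white \
        -blur '0x2' -shave '200x450' "$pdf" "tiffs/$base.tiff"
    $run tesseract "tiffs/$base.tiff" "tiffs/$base" -l eng ligatures
}

# Preview the commands without touching any files:
DRY_RUN=1 pdf_to_text bookscan.pdf
```

Leaving DRY_RUN unset runs the commands for real. Unlike the walkthrough above, this version calls tesseract from the parent directory, so the paths carry a tiffs/ prefix instead of relying on cd.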