Converting Unsearchable PDF Files (aka PDF scans) to Raw Text Using Command Line Tools `convert` and `tesseract`
Often, one gets a PDF file that is a scan of a book or text, which cannot be searched (boo!). A good (but not perfect) solution is to use Optical Character Recognition (OCR) to convert the pdf to a txt file and search that instead.
Here is my solution.
Requirements
- Command line tools:
  - convert
  - tesseract
  I installed both using homebrew. I’m using Mac OS X 10.11.3. This is important because it affects where these are installed on my system (/usr/local/).
- Knowledge and comfort using the command line. It helps if you understand how to use the find command.
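Before starting, it can save some head-scratching to confirm both tools are actually on your PATH. Here is a minimal sketch; missing_tools is a helper name I made up for illustration, not part of either tool. (On newer ImageMagick installs the main command is magick, with convert kept as a legacy name, so your setup may differ from mine.)

```shell
# Hypothetical helper: print the name of every given command
# that cannot be found on PATH. Prints nothing if all are found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running missing_tools convert tesseract should print nothing if both are installed. Anything it prints still needs installing.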
Workflow
1. Convert pdf to tiff
Say we have a pdf called Bookscan.pdf. Below, we create a new directory called tiffs/ in the same directory as Bookscan.pdf, then use the command line tool convert to convert the pdf to a tiff (here, it's called bookdown.tiff).
mkdir tiffs
convert -density 600 -depth 4 -monochrome -background white -blur '0x2' -shave '0x200' Bookscan.pdf tiffs/bookdown.tiff
To learn more about the options, visit the imagemagick site. But in brief:
- density: adjusts the dpi
- depth: the number of bits
- monochrome: black and white only
- blur: useful for super sharp scans (thin letters are bad, thick good)
- shave: strips pixels from the edges of the output image (so you need to figure out the size of the final image). Useful when books have chapter names or numbers at the top. In '0x200', 0 is the width (nothing shaved from the left and right) and 200 is the height (200 pixels shaved from the top and from the bottom).
(Note that the options in the sample code above just happen to work for the set of documents I was converting.)
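Since shave works in pixels, and the pdf is rendered at whatever density you pick, the number of pixels to shave scales with the dpi. At 600 dpi, 200 pixels is a third of an inch. Here is a small sketch for going the other way, from a margin measured in inches to a pixel count; inches_to_pixels is a hypothetical helper of my own, not an imagemagick feature.

```shell
# Hypothetical helper: convert a margin in inches to pixels at a
# given density (dpi), rounded to the nearest whole pixel.
# Usage: inches_to_pixels DPI INCHES
inches_to_pixels() {
  awk -v dpi="$1" -v inches="$2" 'BEGIN { printf "%d\n", dpi * inches + 0.5 }'
}
```

For example, inches_to_pixels 600 0.5 gives the value to shave for a half-inch running header at 600 dpi, which you could then drop into the height part of the shave argument.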
2. Make Sure We Ignore Annoying Characters Like ‘ligatures’
I found that, consistently, tesseract will add in ligatures, ruining the ability to search some words. But it is possible to keep tesseract from using them by creating a blacklist. I copied a list of ligatures from this page.
One needs to add a config file (I named mine ligatures) to /usr/local/share/tessdata/configs/ (this assumes a brew installation in Mac OS X) which contains the following:
tessedit_char_blacklist ꜲꜳÆæꜴꜵꜶꜷꜸꜹꜼꜽǱǲǳǄǅǆﬀﬃﬄﬁﬂĲĳǇǈǉǊǋǌŒœꝎꝏﬅᵫꝠꝡ
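If you would rather create that file from the command line than in a text editor, something like the following sketch should do it. write_blacklist_config is a name I made up for illustration; the directory, file name, and characters are all passed in, so nothing here assumes where your tessdata lives.

```shell
# Hypothetical helper: write a one-line tesseract config file that
# sets tessedit_char_blacklist to the given characters.
# Usage: write_blacklist_config CONFIG_DIR FILE_NAME CHARACTERS
write_blacklist_config() {
  config_dir=$1
  name=$2
  chars=$3
  mkdir -p "$config_dir" || return 1
  printf 'tessedit_char_blacklist %s\n' "$chars" > "$config_dir/$name"
}
```

With a brew installation like mine, the call would look like write_blacklist_config /usr/local/share/tessdata/configs ligatures 'ﬁﬂ', with the full list of ligatures in place of the two shown here.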


3. Convert the tiff to text
This is pretty straightforward. cd into the folder with the tiffs, then run the command:
cd tiffs/
tesseract bookdown.tiff bookdown -l eng ligatures
Here, we are assuming the text is in English (-l eng) and we load the ligature configs file (which loads the blacklist variable).
For other tips, see the Tesseract FAQ. Sometimes, files are just too noisy or tilted to work. Most of the scans I’ve worked with are pretty clean, so I’ve not had to struggle with something too complicated.
Granted, if there are problems with the image, the fixes would all have to be done in the conversion stage!
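One quick sanity check after this step is to see whether any ligatures slipped into the txt file anyway. Below is a rough sketch that counts lines containing the five most common Latin ligatures. count_ligature_lines is a hypothetical helper, and the five characters are my own pick, not the full blacklist.

```shell
# Hypothetical helper: count the lines in a text file that contain
# any of the ff, fi, fl, ffi or ffl ligature characters.
# Note: grep exits non-zero when the count is 0, but still prints it.
count_ligature_lines() {
  grep -c -F -e 'ﬀ' -e 'ﬁ' -e 'ﬂ' -e 'ﬃ' -e 'ﬄ' -- "$1"
}
```

If this prints 0 for your txt file, the blacklist is probably being picked up. A non-zero count suggests the config file was not found or not loaded.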
4. (optional) Looping Through Tiffs
If you cd into the folder full of tiffs, you can loop through all the tiffs and convert each one to text.
cd tiffs/
find . -type f -name '*.tiff' | while IFS= read -r F; do tesseract "$F" "${F%.tiff}" -l eng ligatures; done
Boom. Loops through and converts any and all tiffs in the directory (here, called tiffs/).
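OCR at 600 dpi is slow, so if you rerun the loop after adding a few new scans, you may want to skip tiffs that already have a txt file next to them. A minimal sketch of that idea follows; list_pending_tiffs is a name I invented, and it assumes the txt file sits beside the tiff with the same base name, as in the commands above.

```shell
# Hypothetical helper: list every .tiff under a directory that does
# not yet have a matching .txt file beside it.
# Usage: list_pending_tiffs [DIR]   (DIR defaults to the current directory)
list_pending_tiffs() {
  dir=${1:-.}
  find "$dir" -type f -name '*.tiff' | sort | while IFS= read -r f; do
    [ -e "${f%.tiff}.txt" ] || printf '%s\n' "$f"
  done
}
```

Its output can be fed to the same kind of loop: list_pending_tiffs . | while IFS= read -r F; do tesseract "$F" "${F%.tiff}" -l eng ligatures; done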
And that’s it!
An Example
If you want to try a real example, try it with the following pdf: bookscan.pdf
Let’s pretend you put bookscan.pdf in your downloads folder. We’ll make a new folder called tiffs/, convert the pdf, then use tesseract.
cd ~/Downloads
mkdir tiffs
convert -density 600 -depth 4 -monochrome -background white -blur '0x2' -shave '200x450' bookscan.pdf tiffs/bookscan.tiff
cd tiffs
tesseract bookscan.tiff bookscan -l eng ligatures
I get the resulting tiff and txt files.
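And since the whole point was to make the scan searchable, here is a sketch of the last step: searching the txt file. search_text is a hypothetical wrapper around grep. It ignores case and prints line numbers, which helps you find roughly where a hit sits in the book.

```shell
# Hypothetical helper: case-insensitive search of a text file,
# printing each matching line prefixed with its line number.
# Usage: search_text PATTERN FILE
search_text() {
  grep -n -i -- "$1" "$2"
}
```

For the example above, that would be something like search_text 'some word' bookscan.txt, run from inside tiffs/.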