Converting Unsearchable PDF Files (aka PDF scans) to Raw Text Using Command Line Tools `convert` and `tesseract`
Often, one gets a PDF file that is a scan of a book or text, which cannot be searched (boo!). A good (but not perfect) solution is to use Optical Character Recognition (OCR) to convert the pdf to a txt file and search that instead.
Here is my solution.
Requirements
- Command line tools `convert` (part of ImageMagick) and `tesseract`. I installed both using `homebrew`. I’m using Mac OS X 10.11.3. This is important because it affects where these are installed on my system (`/usr/local/`).
- Knowledge and comfort using the command line. It helps if you understand how to use the `find` command.
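Before starting, it can help to confirm that both tools are actually on your PATH. The following is a minimal sketch, not part of the original workflow; the function name `check_tools` is made up for illustration.

```shell
# Hypothetical helper: report which of the named commands are on PATH.
# Returns 0 if all are present, 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools convert tesseract || echo "Install the missing tools before continuing."
```

If either tool is reported missing, install it (for example with `homebrew`, as above) before moving on.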
Workflow
1. Convert pdf to tiff
Say we have a pdf called `Bookscan.pdf`. Below, we create a new directory called `tiffs/` in the same directory as `Bookscan.pdf`, then use the command line tool `convert` to convert the pdf to a tiff (here, it’s called `bookscan.tiff`).
mkdir tiffs
convert -density 600 -depth 4 -monochrome -background white -blur '0x2' -shave '0x200' Bookscan.pdf tiffs/bookscan.tiff
To learn more about the options, visit the imagemagick site. But in brief:
- `density` sets the dpi
- `depth` is the number of bits per color sample
- `monochrome` black and white only
- `blur` is useful for super sharp scans (thin letters are bad, thick good)
- `shave` strips pixels from the edges of the output image (so you need to figure out the size of the final image). Useful when books have chapter names or numbers at the top (`0` is the width shaved from the left and right edges, `200` is the height shaved from the top and bottom)
(Note that the options in the sample code above just happen to work for the set of documents I was converting.)
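To get a feel for how `-density` and `-shave` interact, here is a rough sketch of the arithmetic. It assumes a US Letter page (612 x 792 PDF points, 72 points per inch), which may not match your scan, and the function name `shaved_size` is hypothetical.

```shell
# Hypothetical helper: estimate the pixel size of one rendered page.
# Arguments: page width and height in PDF points (72 points = 1 inch),
# the -density value, and the two numbers from -shave 'WxH'.
shaved_size() {
  width_pt=$1; height_pt=$2; density=$3; shave_w=$4; shave_h=$5
  # Pixels before shaving: points * dpi / 72.
  width_px=$(( width_pt * density / 72 ))
  height_px=$(( height_pt * density / 72 ))
  # -shave removes shave_w from both sides and shave_h from top and bottom.
  echo "$(( width_px - 2 * shave_w ))x$(( height_px - 2 * shave_h ))"
}

# Assumed US Letter page at 600 dpi with -shave '0x200'
shaved_size 612 792 600 0 200   # prints 5100x6200
```

At 600 dpi, 200 pixels is a third of an inch, so this shave trims about that much from the top and from the bottom of each page.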
2. Make Sure We Ignore Annoying Characters Like ‘ligatures’
I found that, consistently, `tesseract` will add in ligatures, ruining the ability to search some words. But it is possible to keep `tesseract` from using them by creating a blacklist. I copied a list of ligatures from this page.
One needs to add a file to `/usr/local/share/tessdata/configs/` (this assumes a `brew` installation in Mac OS X), here named `ligatures`, which contains the following:
tessedit_char_blacklist ꜲꜳÆæꜴꜵꜶꜷꜸꜹꜼꜽǱǲǳǄǅǆﬀﬃﬄﬁﬂĲĳǇǈǉǊǋǌŒœꝎꝏﬅᵫꝠꝡ
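One way to create that file from the command line is sketched below. The function name `write_blacklist` is hypothetical, and the example deliberately uses a shortened character list and a scratch directory so that nothing is written into the real tessdata folder by accident.

```shell
# Hypothetical helper: write a tesseract config file named "ligatures"
# that blacklists the given characters.
# $1 = configs directory, $2 = characters to blacklist.
write_blacklist() {
  configs_dir=$1
  chars=$2
  printf 'tessedit_char_blacklist %s\n' "$chars" > "$configs_dir/ligatures"
}

# Try it in a scratch directory first, with a shortened character list.
scratch=$(mktemp -d)
write_blacklist "$scratch" 'ÆæŒœ'
cat "$scratch/ligatures"
```

For the real thing, pass `/usr/local/share/tessdata/configs` as the directory (assuming the `brew` layout described above) and paste in the full list of ligatures.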
3. Convert the tiff to text
This is pretty straightforward. `cd` into the folder with the tiffs, then run the command:
cd tiffs/
tesseract bookscan.tiff bookscan -l eng ligatures
Here, we are assuming the text is in English (`-l eng`) and we load the `ligatures` configs file (which loads the blacklist variable). The second argument, `bookscan`, is the output base name, so the text ends up in `bookscan.txt`.
For other tips, see the Tesseract FAQ. Sometimes, files are just too noisy or tilted to work. Most of the scans I’ve worked with are pretty clean, so I’ve not had to struggle with something too complicated.
Granted, if there are problems with the image, the fixes would all have to be done in the conversion stage!
4. (optional) Looping Through Tiffs
If you `cd` into the folder full of tiffs, you can loop through all the tiffs and convert them to text.
cd tiffs/
find . -type f -name '*.tiff' | while IFS= read -r F; do tesseract "$F" "${F%.tiff}" -l eng ligatures; done
Boom. Loops through and converts any and all tiffs in the directory (here, called `tiffs/`).
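If you would like to see what a loop like this will do before running OCR on a large folder, a dry run can help. This is only a sketch: the function name `preview_ocr` is made up, it only considers files ending in `.tiff`, and it prints the commands instead of running them.

```shell
# Hypothetical dry run: print the tesseract command for each tiff in a
# directory instead of running it. $1 = directory to scan.
preview_ocr() {
  find "$1" -type f -name '*.tiff' | sort | while IFS= read -r F; do
    # ${F%.tiff} strips the trailing ".tiff", leaving the output base name.
    echo "tesseract $F ${F%.tiff} -l eng ligatures"
  done
}

# Demo in a scratch directory with two empty stand-in files.
demo=$(mktemp -d)
touch "$demo/page1.tiff" "$demo/notes.txt"
preview_ocr "$demo"
```

Only `page1.tiff` shows up in the output, and its output base name is the same path without the `.tiff` extension.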
And that’s it!
An Example
If you want to try a real example, try it with the following pdf: bookscan.pdf
Let’s pretend you put `bookscan.pdf` in your downloads folder. We’ll make a new folder called `tiffs/`, convert the pdf, then use `tesseract`.
cd ~/Downloads
mkdir tiffs
convert -density 600 -depth 4 -monochrome -background white -blur '0x2' -shave '200x450' bookscan.pdf tiffs/bookscan.tiff
cd tiffs
tesseract bookscan.tiff bookscan -l eng ligatures
I get the resulting tiff and txt files.
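Since the whole point is to make the scan searchable, the last step is to search the txt file. A minimal sketch using `grep` follows; the function name `search_text` is hypothetical, and the sample text is a stand-in, not the real contents of `bookscan.txt`.

```shell
# Hypothetical helper: case-insensitive search with line numbers.
# $1 = word or phrase, $2 = text file produced by tesseract.
search_text() {
  grep -n -i -F -- "$1" "$2"
}

# Demo with a small stand-in for bookscan.txt.
sample=$(mktemp)
printf 'Chapter One\nThe office was quiet.\nAn official letter arrived.\n' > "$sample"
search_text 'offic' "$sample"
```

With the real output you would run something like `search_text 'some word' bookscan.txt` from inside `tiffs/`.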