OCR Processing

Can be pretty handy to add selectable/searchable text to a scanned document, right?

For whatever reason, half my scanned stuff isn’t searchable, so I needed to process it manually.

Tesseract

The general gist is to use tesseract, but apparently it is designed to work on one image per page, whereas in my case I’m starting from a PDF file (because the work scanner’s fancy.. ooooh!..).

pdfsandwich lets me start from a complete PDF.

Pre-requisites

Read the code kids, don’t just blindly run code from the internet, no matter how simple…

sudo apt install tesseract-ocr tesseract-ocr-all pdfsandwich
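A quick sanity check after installing can save some head scratching. This is only a sketch: check_tools is a made-up helper name (not part of either package) that reports which commands are missing from the PATH.

```shell
# Hypothetical helper (not part of tesseract or pdfsandwich):
# prints any of the given commands that are not on the PATH
# and returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_tools tesseract pdfsandwich should print nothing if both installed fine; tesseract --list-langs then shows which language packs it can see.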

Usage

Since I deal with mostly terrible quality manuals for things that aren’t around anymore, I like to keep the colour (the -rgb flag) and disable preprocessing (the -nopreproc flag), as otherwise I find a fair number of figures disappear (which is sad).

Further options are listed on the pdfsandwich site.

The options I generally use:

pdfsandwich -rgb -nopreproc test.pdf
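If there’s a whole folder of scans to get through, the same command can be wrapped in a loop. A rough sketch, not anything official: ocr_dir is a hypothetical helper name, and the _ocr.pdf output naming is my own choice here, set explicitly with pdfsandwich’s -o option.

```shell
# Hypothetical helper: run pdfsandwich with the options above on every
# PDF in a directory (default: the current one).
ocr_dir() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # The glob stays literal when nothing matches, so skip non-files.
        if [ ! -e "$pdf" ]; then
            continue
        fi
        # Skip files that are already OCR output from a previous run.
        case "$pdf" in
            *_ocr.pdf) continue ;;
        esac
        out="${pdf%.pdf}_ocr.pdf"
        # Skip inputs that already have an OCR'd copy next to them.
        if [ -e "$out" ]; then
            continue
        fi
        pdfsandwich -rgb -nopreproc -o "$out" "$pdf"
    done
}
```

Calling ocr_dir ~/manuals (the path is just an example) leaves the originals alone and writes a searchable copy beside each one. Re-running it only picks up PDFs that don’t have an _ocr.pdf copy yet.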

References

Tesseract OCR
PDFSandwich