OCR Processing
It can be pretty handy to add selectable/searchable text to a scanned document, right?
For whatever reason half my stuff doesn't have it, so I needed to process it manually.
Tesseract
The general gist is to use tesseract, but apparently it is designed to work on one image per page, whereas in my case I'm starting from a PDF file (because the work scanner's fancy.. ooooh!..)
pdfsandwich
This lets me start from a complete PDF.
Pre-requisites
Read the code, kids. Don't just blindly run code from the internet, no matter how simple…
sudo apt install tesseract-ocr tesseract-ocr-all pdfsandwich
Usage
Since I deal mostly with terrible-quality manuals for things that aren't around anymore, I like to keep the colour (using the -rgb flag) and disable preprocessing (using -nopreproc), as otherwise I find a fair number of figures disappear (which is sad).
Further options are listed on the pdfsandwich site.
The options I generally use:
pdfsandwich -rgb -nopreproc test.pdf
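If there's a whole folder of scans to get through, the same command can be wrapped in a small loop. This is only a rough sketch: the function name ocr_dir and the PDFSANDWICH variable are made up for this example, and it assumes you want each result written next to its input as name_ocr.pdf (set explicitly here with pdfsandwich's -o option).

```shell
#!/bin/sh
# Sketch: run pdfsandwich with the flags above on every PDF in a directory.
# ocr_dir and PDFSANDWICH are hypothetical names used only in this example.

# Command to run; set PDFSANDWICH=echo for a dry run that only prints the calls.
: "${PDFSANDWICH:=pdfsandwich}"

ocr_dir() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # If nothing matches, the glob stays literal; skip it.
        if [ ! -e "$pdf" ]; then
            continue
        fi
        # Skip files that are themselves OCR output from an earlier run.
        case "$pdf" in
            *_ocr.pdf) continue ;;
        esac
        out="${pdf%.pdf}_ocr.pdf"
        # Skip inputs that already have an OCR'd copy.
        if [ -e "$out" ]; then
            continue
        fi
        "$PDFSANDWICH" -rgb -nopreproc -o "$out" "$pdf"
    done
}
```

After sourcing the file, something like ocr_dir ~/manuals would process that folder, and running it with PDFSANDWICH=echo first shows which files it would touch without doing any OCR.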