Sunday, January 29, 2012

Bash One-Liner and Function to grep PDFs


Sometimes you need to find some text in PDF files. The problem is that PDFs are often compressed, making standard grep useless. Spotlight in OS X keeps PDFs indexed, and there might be something similar in Windows. There is no obvious way of doing a simple search of PDFs in Linux, however.

My solution is the following bash one-liner, which lists every PDF in a directory and its subdirectories that contains a search term. It uses the pdftotext utility from Xpdf (precompiled binaries are available, so no compilation is needed). Both search terms and PDF file names can include spaces.

$> find . -iname "*.pdf" -print0 | xargs -n1 -i -0 bash -ci 'if [ `pdftotext "{}" - | grep -ci "<search term>"` != 0 ] ; then echo "\"{}\"" ; fi ; '

Just replace <search term> with what you're looking for, and make sure you don't drop any of the spaces. They matter: the spaces around [ and ], for example, are required by the shell.
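The one-liner pastes each file name into the text of the inner shell command, so a name containing a double quote, a backtick, or a $ could be misread by that shell. A minimal sketch of a variant that avoids this is shown below. It is not the approach used in the rest of this post: the function name lspdf is hypothetical, and it assumes a find that supports -iname (GNU and BSD find both do). The search term and the file name are handed to the inner shell as positional arguments instead of being substituted into the command text.

```shell
# lspdf (hypothetical name): print every PDF under the current directory
# whose extracted text contains the search term given as the first argument.
lspdf() {
    # Inside the single-quoted script, $1 is the search term and $2 is the
    # file name; neither one ever becomes part of the script text itself.
    find . -iname '*.pdf' -exec sh -c '
        pdftotext "$2" - 2>/dev/null | grep -qi -- "$1" && printf "%s\n" "$2"
    ' sh "$1" {} \;
}
```

Running lspdf "search term" in a directory prints one matching file name per line, without the surrounding quotation marks.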

An even better option is to add this function to your .bashrc file:

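function fpdf ()
{
    # Start the reader in the background before searching.
    acroread &
    # List the matching PDFs (double-quoted), then open each one in acroread.
    find . -iname "*.pdf" -print0 | xargs -n1 -i -0 bash -ci 'if [ `pdftotext "{}" - | grep -ci "$0"` != 0 ] ; then echo "\"{}\"" ; fi ; ' "$1" | xargs -n1 acroread
}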

Now running $> fpdf "search term" in a directory will use acroread to open all PDFs in the directory and subdirectories that contain the search term. You can, of course, replace acroread with any PDF reader of your choice.

How does it work?

First, acroread is started as a background process. Every subsequent invocation of acroread then hands its file to this running instance and returns right away, so it does not block the search.

Next, bash replaces $1 with the search term.

Then, find . -iname "*.pdf" -print0 does a case-insensitive search for all files in the current directory and its subdirectories whose names end with .pdf and prints them out as a NUL-separated list. This list is piped to xargs. A file name can never contain a NUL byte, so the NUL-separated list keeps names with spaces, quotes, or newlines from being split or misread by xargs, which a newline-separated list cannot guarantee.
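The difference between the two list formats can be seen in a small experiment. This is only a sketch: it creates an empty file named a doc.pdf in a scratch directory (an assumption made for illustration) and counts how many arguments xargs passes on in each case.

```shell
# Scratch directory with one PDF-named file whose name contains a space.
tmp=$(mktemp -d)
touch "$tmp/a doc.pdf"

# Newline-separated list: xargs also splits at the space in the name.
newline_args=$(cd "$tmp" && find . -iname '*.pdf' -print | xargs -n1 printf '%s\n' | wc -l | tr -d ' ')

# NUL-separated list: the whole name reaches the command as one argument.
nul_args=$(cd "$tmp" && find . -iname '*.pdf' -print0 | xargs -0 -n1 printf '%s\n' | wc -l | tr -d ' ')

echo "newline-separated: $newline_args argument(s); NUL-separated: $nul_args argument(s)"
rm -rf "$tmp"
```

With the newline-separated list the single file name arrives as two arguments (./a and doc.pdf); with -print0 and -0 it arrives as one.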

Next, xargs runs bash once for each of the PDF files:
bash -ci 'if [ `pdftotext "{}" - | grep -ci "$0"` != 0 ] ; then echo "\"{}\"" ; fi ; ' "$1"
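xargs replaces both instances of {} with the name of the PDF file, and bash replaces $0 with the search term: the argument that follows the command string of bash -c (here the "$1" of fpdf) becomes $0 of the inner shell. For a file named a doc.pdf, the command executed by the inner bash instance is then:
if [ `pdftotext "a doc.pdf" - | grep -ci "<search term>"` != 0 ] ; then echo "\"a doc.pdf\"" ; fi ;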
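The way the search term reaches the inner shell can be checked in isolation. The sketch below uses a literal "search term" string as a stand-in for the real argument; the variable names are only for illustration.

```shell
# The first argument after the command string becomes $0 of the inner shell.
as_zero=$(bash -c 'printf "%s" "$0"' "search term")

# A more conventional layout: a placeholder fills $0 and the term arrives as $1.
as_one=$(bash -c 'printf "%s" "$1"' _ "search term")

echo "\$0 = $as_zero"
echo "\$1 = $as_one"
```

Both lines report the same search term. The function in this post uses the first form, which is why its inner command refers to $0 rather than $1.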

The section in backticks (``) is executed first. pdftotext dumps the contents of the PDF as plain text to standard output. The text is piped into grep, which counts the lines that contain the search term, ignoring case. If the count is not zero, the name of the PDF, surrounded by quotation marks, is echoed to standard output. The result is a newline-separated list of double-quoted PDF file names.
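One detail of the counting step is easy to misread: grep -c reports the number of matching lines, not the total number of occurrences. For the test against zero this makes no difference, but the sketch below shows the distinction on a made-up three-line sample. The -o option used for the second count is supported by GNU and BSD grep, though it is not part of POSIX.

```shell
# Three lines of sample text; the first contains the term twice.
sample='foo foo
bar
Foo'

# -c counts matching lines, -i ignores case.
line_count=$(printf '%s\n' "$sample" | grep -ci 'foo')

# -o prints each match on its own line, so wc -l counts occurrences.
occurrence_count=$(printf '%s\n' "$sample" | grep -oi 'foo' | wc -l | tr -d ' ')

echo "matching lines: $line_count, occurrences: $occurrence_count"
```

Here two lines match while the term occurs three times.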

Finally, this list is piped to the second xargs, which strips the quotation marks, treats each quoted name as a single argument, and calls acroread once for each file. The result is one instance of acroread that has all files matching the search term open.
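The reason for echoing the names inside quotation marks shows up when the second xargs is fed a name with a space. In this sketch, printf stands in for acroread so that the arguments can be inspected; the file name a doc.pdf is just an example.

```shell
# Quoted, as the function echoes it: xargs strips the quotes and passes one argument.
quoted=$(printf '"%s"\n' 'a doc.pdf' | xargs -n1 printf '<%s>\n')

# Unquoted: xargs splits at the space and runs the command twice.
unquoted=$(printf '%s\n' 'a doc.pdf' | xargs -n1 printf '<%s>\n')

echo "$quoted"
echo "$unquoted"
```

The quoted form yields the single argument <a doc.pdf>, while the unquoted form yields <a> and <doc.pdf> as two separate arguments, which would make the PDF reader look for two files that do not exist.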
