Sometimes you need to find some text in
PDF files. The problem is that PDFs are often compressed, making
standard grep useless. Spotlight in OS X keeps PDFs indexed, and
there might be something similar in Windows. There is no obvious way
of doing a simple search of PDFs in Linux, however.
My solution is the following bash one-liner, which lists all PDFs in a directory and its subdirectories that contain a search term. It uses the pdftotext utility from Xpdf (available precompiled, so no compilation is needed). Both search terms and PDF file names can include spaces.
$> find . -iname "*.pdf" -print0 | xargs -n1 -i -0 bash -ci 'if [ `pdftotext "{}" - | grep -ci "<search term>"` != 0 ] ; then echo "\"{}\"" ; fi ; '
Just replace <search term> with
what you're looking for (and make sure you don't miss any of the spaces—they're important).
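One caveat: the one-liner pastes each file name into the text of the shell command, so a name containing a double quote or a backtick would change the command itself. Here is a minimal sketch of a variant that passes names as arguments instead. It assumes pdftotext is on your PATH and that your find supports -iname; pdf_list_matches is a hypothetical name, not part of the one-liner above.

```shell
# Hypothetical helper: print every PDF under a directory whose text contains a term.
# Usage: pdf_list_matches "search term" [directory]
pdf_list_matches() {
    find "${2:-.}" -type f -iname '*.pdf' -exec sh -c '
        term=$1
        shift
        for pdf in "$@"; do
            # -q: stop at the first hit; -i: ignore case; -- ends the option list
            if pdftotext "$pdf" - 2>/dev/null | grep -qi -- "$term"; then
                printf "%s\n" "$pdf"
            fi
        done
    ' sh "$1" {} +
}
```

Since the file names arrive as positional parameters, the inner shell never re-parses them, and grep -q can stop reading as soon as the first match turns up.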
An even better option is to add this function, which also opens every match, to your .bashrc file:
function fpdf ()
{
    acroread &
    find . -iname "*.pdf" -print0 | xargs -n1 -i -0 bash -ci 'if [ `pdftotext "{}" - | grep -ci "$0"` != 0 ] ; then echo "\"{}\"" ; fi ; ' "$1" | xargs -n1 acroread
}
Now running
$> fpdf "search term"
in a directory will use acroread to open all PDFs in that directory and its subdirectories that contain the search term. You can, of course, replace acroread with any PDF reader of your choice.
How does it work?
First, acroread is started as a background process. Each later invocation of acroread then hands its file to this running instance and returns right away instead of blocking the search.
Next, bash replaces $1 with the search
term.
Then, find . -iname "*.pdf" -print0 does a case-insensitive search for all files in the current directory and its subdirectories whose names end with .pdf and prints them as a NUL-separated list. This list is piped to xargs, whose -0 option tells it to split on NUL. A NUL-separated list is safer than a newline-separated one because file names can contain spaces, quotes, and even newlines, but never a NUL.
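The difference is easy to see in a scratch directory. This is a small demonstration rather than part of the function: it creates one file whose name contains a space and counts how many arguments xargs sees with and without NUL separation. It assumes the temporary directory path itself contains no spaces.

```shell
# Scratch directory holding one file whose name contains a space.
demo=$(mktemp -d)
touch "$demo/a doc.pdf"

# Newline-separated names: xargs splits the path at the space.
plain=$(find "$demo" -iname '*.pdf' | xargs -n1 echo | wc -l | tr -d ' ')

# NUL-separated names: the whole path stays a single argument.
nul=$(find "$demo" -iname '*.pdf' -print0 | xargs -0 -n1 echo | wc -l | tr -d ' ')

echo "without -print0: $plain argument(s); with -print0: $nul argument(s)"
rm -rf "$demo"
```

The first count should come out as two (the path is cut in half) and the second as one.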
Next, xargs runs bash once for each of the PDF files:
bash -ci 'if [ `pdftotext "{}" - | grep -ci "$0"` != 0 ] ; then echo "\"{}\"" ; fi ; ' "$1"
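xargs replaces both instances of {} with the name of the PDF file (-i is a deprecated spelling of -I {} in GNU xargs). The outer shell has already expanded "$1" to the search term, and because it is the first argument after the -c command string, the inner bash sees it as $0. The command executed by the inner bash instance, then, is effectively: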
if [ `pdftotext "a doc.pdf" - | grep -ci "<search term>"` != 0 ] ; then echo "\"a doc.pdf\"" ; fi ;
The section in backticks (``) is executed first. pdftotext dumps the contents of the PDF as plain text to standard output (that is what the trailing - means). The text is piped into grep, which counts how many lines contain the search term, ignoring case. If the count is not zero, the name of the PDF, surrounded by quotation marks, is echoed to standard output. The result is a newline-separated list of double-quoted PDF file names.
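Two details of this step can be checked on their own, without any PDFs: how the search term reaches the inner shell as $0, and what grep -c actually counts. The strings below are made-up sample input.

```shell
# The first argument after the -c command string becomes $0 in the inner shell.
inner=$(bash -c 'printf "%s" "$0"' "search term")
echo "inner \$0: $inner"

# grep -c counts matching lines, not individual matches:
# the first line below contains "pdf" twice but is counted once.
count=$(printf 'PDF and pdf\nnothing\nanother Pdf\n' | grep -ci "pdf")
echo "matching lines: $count"
```

For the function this distinction does not matter, since it only tests whether the count is zero.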
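Finally, this list is piped to the second xargs, which calls acroread once for each file. xargs treats the double quotes as grouping characters, so a name with spaces still arrives as a single argument. The result is one instance of acroread that has open all files that match the search term.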
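The quoting in that last step can be tried without a PDF reader. In this small sketch, printf stands in for acroread so that the arguments xargs builds can be inspected; the file names are made up.

```shell
# Two quoted names on separate lines, as the first half of the pipeline prints them.
with_quotes=$(printf '"a doc.pdf"\n"b.pdf"\n' | xargs -n1 printf '[%s]\n')

# The same kind of name without quotes is split at the space.
without_quotes=$(printf 'a doc.pdf\n' | xargs -n1 printf '[%s]\n')

echo "$with_quotes"
echo "$without_quotes"
```

Each pair of brackets marks one argument, so the quoted input should keep "a doc.pdf" together while the unquoted input breaks it in two. This is why the function echoes the names with the escaped quotes around them.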