pdfstrip.py - strip objects from PDF files by specifying their object IDs

Example of use:

  a. Uncompress the pdf file with PDFtk, e.g.:

      $ pdftk input.pdf output uncompressed.pdf uncompress

  b. Find the IDs of the objects you want to strip. For images, for example,
     the pdfimages program from poppler-utils can be used like this:

      1. List all the images:

          $ pdfimages -list -all uncompressed.pdf > pdfimages.txt

      2. Extract all the images:

          $ mkdir images
          $ pdfimages -p -all uncompressed.pdf images/image

      3. Isolate the unique images, to make the analysis easier:

          $ mkdir uniq-images
          $ md5sum images/image-* | sort | uniq -w 32 | tr -s ' ' | cut -d ' ' -f 2 | while read file; do cp "$file" uniq-images/; done

      4. Compare the file names with the content of pdfimages.txt from step 1
         to find the object IDs. The result is a list of object IDs, like this:

          53,52,51,50,49,48,66,65,64,63,62,68,103,102,101,111,110,109,108,107,106

  c. Pass the list of object IDs to pdfstrip.py (the list is shown sorted
     here, just for readability):

      $ ./pdfstrip.py uncompressed.pdf stripped.pdf 48,49,50,51,52,53,62,63,64,65,66,68,101,102,103,106,107,108,109,110,111

  d. Re-compress the file:

      $ pdftk stripped.pdf output final.pdf compress
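
Step b.4 can also be done with a short script. The following is only a
sketch and is not part of pdfstrip.py: the function names are made up for
illustration, and it assumes that "pdfimages -p" names its files
image-PPP-NNN.ext (page number, then image number) and that the "object"
column is the 11th whitespace-separated field of the "-list" output. Check
both assumptions against your version of poppler-utils before relying on it.

```python
import re


def parse_pdfimages_list(text):
    """Map (page, num) to the object number found in `pdfimages -list` output."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with two integers (page, num); this skips the
        # header line and the dashed rule below it.
        if len(fields) < 11 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        obj = fields[10]  # ASSUMPTION: position of the "object" column
        if obj.isdigit():  # inline images carry no object number
            mapping[(int(fields[0]), int(fields[1]))] = int(obj)
    return mapping


def object_ids_for(filenames, mapping):
    """Return the sorted object IDs for files named like image-PPP-NNN.ext."""
    ids = set()
    for name in filenames:
        # ASSUMPTION: the name ends with -<page>-<num>.<extension>
        match = re.search(r"-(\d+)-(\d+)\.[^.]+$", name)
        if match is None:
            continue
        key = (int(match.group(1)), int(match.group(2)))
        if key in mapping:
            ids.add(mapping[key])
    return sorted(ids)
```

Call parse_pdfimages_list() on the content of pdfimages.txt, pass the names
of the image files you want to remove to object_ids_for(), and join the
result with commas to get the last argument for pdfstrip.py.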


Limitations

Sometimes pdfimages misses images, or reports them as inline even when they
are not, so you may need to look at the PDF source to spot the missing IDs.

Inline images cannot be stripped by pdfstrip, but they are easy to spot in the
PDF source: they are delimited by the markers "BI" and "EI", and there is
always an "ID" marker between the two. Deleting that part of the source by
hand usually works, but this is a brute-force approach.
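
To locate inline images in the uncompressed file, a search like the one
below may help. It is a heuristic sketch, not a PDF parser and not part of
pdfstrip.py: the function name is hypothetical, it treats "BI", "ID" and "EI"
as whitespace-delimited tokens, and it can report false matches (for example
when the image data itself contains " EI ") or miss images, so check every
hit by hand.

```python
import re

# "BI", then the image dictionary, then "ID", then the raw data, then "EI".
# Each marker must stand alone as a whitespace-delimited token.
INLINE_IMAGE = re.compile(rb"(?<!\S)BI\s.*?\sID\s.*?\sEI(?!\S)", re.DOTALL)


def find_inline_images(data):
    """Return (start, end) byte offsets of candidate inline images in data."""
    return [(m.start(), m.end()) for m in INLINE_IMAGE.finditer(data)]
```

Read the file in binary mode (open(path, "rb").read()) and pass the bytes to
find_inline_images(); each pair of offsets marks a span to inspect in the
PDF source.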