README

pdfstrip.py - strip objects from PDF files by specifying their object IDs

Example of use:

  a. Uncompress the PDF file with PDFtk, e.g.:

      $ pdftk input.pdf output uncompressed.pdf uncompress

  b. Find the object IDs you want to strip. For images, for example, the
     program pdfimages from poppler-utils can be used like this:

      1. List all the images:

          $ pdfimages -list uncompressed.pdf > pdfimages.txt

      2. Extract all the images:

          $ mkdir images
          $ pdfimages -p -all uncompressed.pdf images/image

      3. Isolate unique images, for an easier analysis:

          $ mkdir uniq-images
          $ md5sum images/image-* | sort | uniq -w 32 | tr -s ' ' | cut -d ' ' -f 2 | while read file; do cp "$file" uniq-images/; done

      4. Compare the file names with the contents of pdfimages.txt from
         step 1 and look up the corresponding object IDs. The result is a
         list of object IDs, like this:

          53,52,51,50,49,48,66,65,64,63,62,68,103,102,101,111,110,109,108,107,106

  c. Pass the list of object IDs to pdfstrip.py (here the list is shown
     sorted, just for readability):

      $ ./pdfstrip.py uncompressed.pdf stripped.pdf 48,49,50,51,52,53,62,63,64,65,66,68,101,102,103,106,107,108,109,110,111

  d. Re-compress the file:

      $ pdftk stripped.pdf output final.pdf compress
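
The lookup in step b.4 can also be scripted. What follows is a minimal Python
sketch, not part of pdfstrip.py. It assumes the column layout printed by
"pdfimages -list" (page and num in the first two columns, the object number in
the eleventh) and the image-PPP-NNN.ext file names written with the -p option.
The names parse_image_list and object_ids_for are made up for this example.
Check both assumptions against your own pdfimages.txt before relying on it.

```python
import re

# Assumed file name layout with `pdfimages -p`: <root>-<page>-<num>.<ext>
NAME_RE = re.compile(r"-(\d+)-(\d+)\.[^.]+$")


def parse_image_list(listing):
    """Map (page, num) to object ID for each row of `pdfimages -list` output."""
    objects = {}
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with two integers; this skips the header and the rule.
        if len(fields) < 11 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        obj = fields[10]
        # Rows without a numeric object column (e.g. "[inline]") are skipped.
        if obj.isdigit():
            objects[(int(fields[0]), int(fields[1]))] = int(obj)
    return objects


def object_ids_for(file_names, objects):
    """Return the sorted object IDs matching the given extracted image files."""
    ids = set()
    for name in file_names:
        match = NAME_RE.search(name)
        if not match:
            continue
        key = (int(match.group(1)), int(match.group(2)))
        if key in objects:
            ids.add(objects[key])
    return sorted(ids)
```

Feeding it the text of pdfimages.txt and the names of the files kept in
uniq-images/ gives a list of integers; joining them with commas produces the
argument expected in step c. Note that duplicates removed in step b.3 are not
included, so add their IDs separately if those copies should be stripped too.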


Limitations

Sometimes pdfimages misses images, or reports them as inline even when they
are not, so you may need to look at the PDF source to spot the missing IDs.
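
One way to look for missing IDs is to scan the uncompressed PDF source for
objects whose dictionary declares "/Subtype /Image". The Python sketch below
is a rough heuristic, not a PDF parser and not part of pdfstrip.py: it assumes
the plain "N G obj ... stream" layout seen in uncompressed files, and it can
be fooled by binary stream data that happens to look like an object header.
The name find_image_objects is made up for this example.

```python
import re

# An object header, then everything up to the start of its stream or its end.
OBJ_RE = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\b(.*?)(?:\bstream\b|\bendobj\b)", re.DOTALL
)
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def find_image_objects(data):
    """Return object numbers whose dictionary contains /Subtype /Image."""
    ids = []
    for match in OBJ_RE.finditer(data):
        # Group 3 is the text between "obj" and "stream"/"endobj".
        if IMAGE_RE.search(match.group(3)):
            ids.append(int(match.group(1)))
    return ids
```

Calling it on the bytes of uncompressed.pdf and comparing the result with the
IDs taken from pdfimages.txt shows which image objects pdfimages did not
report.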

Inline images cannot be stripped by pdfstrip.py, but they are easy to spot in
the PDF source: they are delimited by the markers "BI" and "EI", and there is
always an "ID" marker between the two. Removing that part of the source
usually works, but it is a brute-force approach.
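
The BI/ID/EI pattern can be located with a short Python sketch like the one
below. It is not part of pdfstrip.py, and the names find_inline_images and
strip_inline_images are made up for this example. It assumes the markers are
delimited by whitespace, and since image data may itself contain bytes that
look like " EI", every reported span should be checked by eye. The stripping
helper only cuts the bytes out; it does not update stream lengths or the
cross-reference table.

```python
import re

# "BI", an image dictionary, "ID", the image data, then "EI",
# each marker delimited by whitespace (or the start/end of the data).
INLINE_RE = re.compile(rb"(?<!\S)BI\s.*?\sID\s.*?\sEI(?!\S)", re.DOTALL)


def find_inline_images(data):
    """Return (start, end) byte offsets of candidate inline images."""
    return [match.span() for match in INLINE_RE.finditer(data)]


def strip_inline_images(data):
    """Brute-force removal: cut every candidate span out of the data."""
    return INLINE_RE.sub(b"", data)
```

Printing the offsets first, and cutting only after inspecting them in the PDF
source, keeps the brute-force step under control.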