pdfstrip.py - strip objects from PDF files by specifying object IDs
a. Uncompress the PDF file with PDFtk, e.g.:
$ pdftk input.pdf output uncompressed.pdf uncompress
b. Find the IDs of the objects you want to strip; for images, for example, the
program pdfimages from poppler-utils can be used like this:
1. List all the images:
$ pdfimages -list uncompressed.pdf > pdfimages.txt
2. Extract all the images:
$ pdfimages -p -all uncompressed.pdf images/image
3. Isolate unique images, for an easier analysis:
$ md5sum images/image-* | sort | uniq -w 32 | tr -s ' ' | cut -d ' ' -f 2 | while read file; do cp "$file" uniq-images/; done
4. Compare the file names with the contents of pdfimages.txt from step 1 and
find the object IDs. The result is a list of object IDs, like this:
53,52,51,50,49,48,66,65,64,63,62,68,103,102,101,111,110,109,108,107,106
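Step 4 can be partly automated. The following is a minimal sketch, not part of pdfstrip.py: the function names are hypothetical, and it assumes that `pdfimages -list` prints the columns in the order "page num type width height color comp bpc enc interp object ID ..." and that `pdfimages -p` names files `<root>-<page>-<num>.<ext>`. Check both assumptions against your poppler version before relying on it.

```python
import re

def parse_pdfimages_list(text):
    """Map (page, num) -> object number from `pdfimages -list` output.

    Assumption: the object number is the 11th whitespace-separated column.
    Rows without a numeric object column (e.g. inline images) are skipped.
    """
    objects = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11 or not fields[0].isdigit():
            continue  # header, separator, or malformed row
        if not fields[10].isdigit():
            continue  # no object number, e.g. an inline image
        objects[(int(fields[0]), int(fields[1]))] = int(fields[10])
    return objects

# Assumed file name pattern from `pdfimages -p`: <root>-<page>-<num>.<ext>
NAME_RE = re.compile(r"-(\d+)-(\d+)\.[^.]+$")

def object_ids_for_files(file_names, objects):
    """Return the object IDs of the given image files, in input order."""
    ids = []
    for name in file_names:
        match = NAME_RE.search(name)
        if not match:
            continue  # not an extracted image file name
        key = (int(match.group(1)), int(match.group(2)))
        if key in objects and objects[key] not in ids:
            ids.append(objects[key])
    return ids
```

Typical use would be to read pdfimages.txt, pass the names of the files in uniq-images/ to `object_ids_for_files`, and join the result with commas. Images that share the same content but have different object IDs still need to be collected by hand, since only one file per duplicate group ends up in uniq-images/.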
c. Pass the list of object IDs to pdfstrip.py (here the list is shown sorted,
just for readability):
$ ./pdfstrip.py uncompressed.pdf stripped.pdf 48,49,50,51,52,53,62,63,64,65,66,68,101,102,103,106,107,108,109,110,111
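If you want the list sorted as in the command above, a small helper like the following can do it. This is only an illustrative sketch: `normalize_ids` is a hypothetical name, and dropping duplicate IDs is an extra convenience, not something pdfstrip.py is known to require.

```python
def normalize_ids(id_list):
    """Return comma-separated IDs sorted numerically, without duplicates."""
    ids = {int(item) for item in id_list.split(",") if item.strip()}
    return ",".join(str(i) for i in sorted(ids))

if __name__ == "__main__":
    print(normalize_ids("53,52,51,50,49,48,66,65,64,63,62,68"))
```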
d. Re-compress the file:
$ pdftk stripped.pdf output final.pdf compress
Sometimes pdfimages misses images, or reports them as inline even when they
are not, so you may need to look at the PDF source to spot the missing IDs.
Inline images cannot be stripped by pdfstrip.py, but they are easy to spot in
the PDF source: they are delimited by the markers "BI" and "EI", and there is
always an "ID" marker between the two. Removing that part of the source by hand
usually works, but this is a brute-force approach.
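To help locate inline images in the uncompressed PDF source, the "BI" ... "ID" ... "EI" sequence can be searched for mechanically. The sketch below is a heuristic and not part of pdfstrip.py: `find_inline_images` is a hypothetical helper, it assumes each marker is surrounded by whitespace, and binary image data that happens to contain " EI " can produce a match that ends too early. Treat the offsets as hints for manual inspection only.

```python
import re

# "BI" operator, image dictionary, "ID" operator, image data, "EI" operator.
# Markers must be delimited by whitespace, so names such as /ID do not match.
INLINE_IMAGE = re.compile(rb"(?<=\s)BI\s.*?\sID\s.*?\sEI(?=\s)", re.DOTALL)

def find_inline_images(data):
    """Return (start, end) byte offsets of candidate inline images."""
    return [match.span() for match in INLINE_IMAGE.finditer(data)]

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as pdf:
        for start, end in find_inline_images(pdf.read()):
            print(f"candidate inline image at bytes {start}-{end}")
```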