pdfstrip.py - strip objects from PDF files by specifying object IDs

Example of use:

  a. Uncompress the PDF file with PDFtk, e.g.:

       $ pdftk input.pdf output uncompressed.pdf uncompress

  b. Find the IDs of the objects you want to strip; for images, for
     example, the pdfimages program from poppler-utils can be used
     like this:

     1. List all the images:

          $ pdfimages -list -all uncompressed.pdf > pdfimages.txt

     2. Extract all the images:

          $ mkdir images
          $ pdfimages -p -all uncompressed.pdf images/image

     3. Isolate unique images, for an easier analysis:

          $ mkdir uniq-images
          $ md5sum images/image-* | sort | uniq -w 32 | tr -s ' ' | cut -d ' ' -f 2 | while read file; do cp "$file" uniq-images/; done

     4. Compare the file names with the content of pdfimages.txt from
        step 1 to find the object IDs; the result is a list of object
        IDs like this:

          53,52,51,50,49,48,66,65,64,63,62,68,103,102,101,111,110,109,108,107,106

  c. Pass the list of object IDs to pdfstrip.py (here the list is shown
     sorted, just for readability):

       $ ./pdfstrip.py uncompressed.pdf stripped.pdf 48,49,50,51,52,53,62,63,64,65,66,68,101,102,103,106,107,108,109,110,111

  d. Re-compress the file:

       $ pdftk stripped.pdf output final.pdf compress

Limitations

  Sometimes pdfimages misses images, or reports them as inline even
  when they are not, so you may need to look at the PDF source to spot
  the missing IDs.

  Inline images cannot be stripped by pdfstrip, but they are easy to
  spot in the PDF source: they are delimited by the markers "BI" and
  "EI", and there is always an "ID" marker between the two. Removing
  that part of the source usually works, but this is a brute-force
  approach.
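
Spotting inline images by eye can be tedious in a large file, so the search for the "BI" ... "ID" ... "EI" markers described above could be automated. The following is only a rough sketch and not part of pdfstrip.py: the names INLINE_IMAGE_RE and find_inline_images are hypothetical, and it assumes the PDF has already been uncompressed (step a.) so that the markers appear literally in the file. The match is a heuristic; binary image data may happen to contain the same byte sequences, so every hit should be checked by hand before anything is removed.

```python
import re

# Heuristic pattern (an assumption, not a real PDF parser): "BI" as a
# standalone token, then an "ID" token, then "EI" as a standalone token.
# re.DOTALL lets ".*?" span line breaks and raw image bytes.
INLINE_IMAGE_RE = re.compile(rb"(?<!\S)BI\s.*?\sID\s.*?\sEI(?!\S)", re.DOTALL)


def find_inline_images(data):
    """Return (start, end) byte offsets of likely inline images in data."""
    return [(m.start(), m.end()) for m in INLINE_IMAGE_RE.finditer(data)]
```

To use it, read the uncompressed file in binary mode, e.g. with open("uncompressed.pdf", "rb"), pass the bytes to find_inline_images(), and inspect each reported byte range in the PDF source.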