I have thousands of documents, and some of them are scanned. So I need a script to test all PDF files in a directory. The two kinds are very different, but in the scanned ones, as mentioned below, one can find some text due to a precarious OCR process coupled to the scan. The proposal from Sudodus in the comments below seems very interesting. Look at the difference between a scanned and a non-scanned PDF:

Scanned:

    grep --color -a 'Image' AR-G1002.pdf
    >/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>stream
    >/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>stream

Not scanned:

    grep --color -a 'Image' AR-G1003.pdf
    >/Metadata 167 0 R/Pages 2 0 R/StructTreeR>/MediaBox/Parent 2 0 R/Resources>/ProcSet>/StructParents 0/Tabs/S/Type/>stream
    >/MediaBox/Parent 2 0 R/Resources>/ProcSet[>streamarents 1/Tabs/S/Type/Page>
    >/MediaBox/Parent 2 0 R/Resources>/ProcSet[>streamarents 2/Tabs/S/Type/Page>
    >/MediaBox/Parent 2 0 R/Resources>stream>/StructParents 3/Tabs/S/Type/Page>
    >/MediaBox/Parent 2 0 R/Resources>/Font>/ProcSet[/PDF/Text/ImageB/ImageC>streamge>

The number of images in the scanned file is much bigger (about one per page)!

If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/. In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).

I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files. It asks for confirmation with

    read -p "Is it OK to use this shellscript in this directory? (y/N) " ans

Make the shellscript executable,

    chmod ugo+x pdf-text-or-image

Change directory to where you have the pdf files and run the shellscript.

Identified files are moved to the subdirectories scanned, text, and s-and-t (for documents with both images and text content). Unidentified file objects, 'UFOs', remain in the current directory.

I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some pdf files of my own (that I created using LibreOffice Impress).

    Is it OK to use this shellscript in this directory? (y/N) y

    AR-G1002.pdf                            mkUSB-quick-start-manual-74.pdf      OBI-quick-start-manual.pdf
    AR-G1003.pdf                            mkUSB-quick-start-manual-75.pdf      oem.pdf
    DescriptionoftheOneButtonInstaller.pdf  mkUSB-quick-start-manual-8.pdf       pdf-text-or-image
    GrowIt.pdf                              mkUSB-quick-start-manual-9.pdf       pdf-text-or-image0
    List-files.pdf                          mkUSB-quick-start-manual-bas.pdf     README.pdf
    mkUSB-quick-start-manual-11.pdf         mkUSB-quick-start-manual-nox-11.pdf  s-and-t
    mkUSB-quick-start-manual-12-0.pdf       mkUSB-quick-start-manual-nox.pdf     scanned
    mkUSB-quick-start-manual-12.pdf         mkUSB-quick-start-manual.pdf         text

In this test the sorting is correct concerning text versus scanned/images.

If this is more about actually detecting whether a PDF was created by scanning, rather than whether the PDF has images instead of text, then you might need to dig into the metadata of the file, not just the content. In general, for the files I could find on my computer and your test files, the following is true: Scanned files have fewer than 1000 chars/page, whereas non-scanned ones always have more than 1000 chars/page. Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software. PDFs with "Microsoft Word" as creator are likely not scanned, as they are Word exports.
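The string check described above (look for /Image/ and /Text in the raw bytes of each pdf file, then sort the files into subdirectories) could be sketched roughly as follows. This is not the original pdf-text-or-image script, which is not reproduced here. The function names `classify_pdf` and `sort_pdfs` and the label `ufo` are assumptions made for illustration, and the check may miss files whose dictionaries are stored in compressed object streams or written with spaces between names.

```shell
#!/bin/bash
# Sketch only: classify pdf files by the raw strings /Image/ and /Text.
# Function names and the "ufo" label are hypothetical, not from the original script.

# Print "scanned", "text", "s-and-t" or "ufo" for one file.
classify_pdf() {
    local file=$1 has_image=0 has_text=0
    # -a treats the binary file as text; -q only sets the exit status.
    grep -aq '/Image/' -- "$file" && has_image=1
    grep -aq '/Text' -- "$file" && has_text=1
    if [ "$has_image" -eq 1 ] && [ "$has_text" -eq 1 ]; then
        echo "s-and-t"
    elif [ "$has_image" -eq 1 ]; then
        echo "scanned"
    elif [ "$has_text" -eq 1 ]; then
        echo "text"
    else
        echo "ufo"
    fi
}

# Move every *.pdf in a directory (default: current) into a
# subdirectory named after its class; unidentified files stay put.
sort_pdfs() {
    local dir=${1:-.} file class
    for file in "$dir"/*.pdf; do
        [ -f "$file" ] || continue
        class=$(classify_pdf "$file")
        [ "$class" = "ufo" ] && continue
        mkdir -p "$dir/$class"
        mv -- "$file" "$dir/$class/"
    done
}
```

After sourcing the block, `classify_pdf AR-G1002.pdf` prints a single label, and `sort_pdfs .` sorts the current directory. Since `sort_pdfs` moves files, it is safer to try it on a copy of the directory first.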