exception - Python, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found


pyPdf throws this exception:

pyPdf.utils.PdfReadError: EOF marker not found

I don't need to fix pyPdf. I need the EOF error to cause the "except" block to execute and skip over the file, but that doesn't work. It still causes the program to stop running.

Background:

Batch OCR Program for PDFs

Python, PyPDF, Adobe PDF OCR error: unsupported filter /LZWDecode

... the saga continues.

I've got 10,000 PDFs in a folder. Some are OCR'd, some are not. I can't tell them apart. Step 1 is to figure out which ones are not OCR'd and OCR them (see the other threads for details).

So I'm using pyPdf. It throws exceptions related to unrecognized characters and unsupported filters when I try to read the text. I guesstimated that if it throws an exception, the file has got text in it, so it doesn't go in the list. Problem solved, right? So:

    from pyPdf import PdfFileWriter, PdfFileReader
    import sys, os, pyPdf, re

    path = 'c:\users\homer\documents\my pdfs'

    filelist = os.listdir(path)

    has_text_list = []
    does_not_have_text_list = []

    for pdf_name in filelist:
        pdf_file_with_directory = os.path.join(path, pdf_name)
        pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))
        print pdf_name
        for i in range(0, pdf.getNumPages()):
            try:
                pdf.write("%%EOF")
                content = pdf.getPage(i).extractText()
                does_it_have_text = re.findall(r'\w{2,}', content)

                if does_it_have_text == []:
                    does_not_have_text_list.append(pdf_name)
                    print pdf_name
                else:
                    has_text_list.append(pdf_name)
            except:
                has_text_list.append(pdf_name)

    print does_not_have_text_list

But then I get this error:

pyPdf.utils.PdfReadError: EOF marker not found

It seems this comes up a lot (from Google):

http://pdfposter.origo.ethz.ch/node/31

I think this means that pyPdf opened the file, made an attempt at text processing, raised whatever exception, and did the except: block, but it's unable to go on to the next step because it doesn't know the file has ended.

There are other threads like this, and they allege it has been fixed, but it doesn't seem to have been.

Then someone has a function here to write the EOF character to the .pdf first.

http://code.activestate.com/lists/python-list/589529/

I stuck in the "pdf.write("%%EOF")" line to try to mimic this, but no dice.

So how do I get this error to run the except block? I'm also using Wing IDE, so if there's a way to use the debugger to skip over these files, that would be possible too. Thx.

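Put the pyPdf call(s) inside the try/except block also. In the code in the question, the pyPdf.PdfFileReader(open(...)) call sits outside the try, so an exception raised while the file is being opened and parsed is never caught.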
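
Here is a rough sketch of what that could look like. It is not the asker's code. It assumes that the "EOF marker not found" error is raised when the reader object is constructed, which is why the constructor is moved inside the try. The names classify_pdfs and open_reader are made up for illustration. open_reader stands for any callable that behaves like pyPdf.PdfFileReader. Unlike the original loop, this version files each PDF once rather than once per page, and it stops at the first page that has text.

```python
import os
import re


def classify_pdfs(path, open_reader):
    """Split the files in `path` into (has_text, no_text) lists.

    `open_reader` is a hypothetical hook: any callable that takes an open
    binary file and returns an object with getNumPages() and getPage(i),
    for example pyPdf.PdfFileReader.
    """
    has_text, no_text = [], []
    for pdf_name in sorted(os.listdir(path)):
        full_path = os.path.join(path, pdf_name)
        try:
            with open(full_path, 'rb') as handle:
                # Assumption: the reader parses the file when it is built,
                # so a read error can be raised right here. Keeping this
                # call inside the try lets the except block see it.
                pdf = open_reader(handle)
                found = False
                for i in range(pdf.getNumPages()):
                    content = pdf.getPage(i).extractText()
                    if re.search(r'\w{2,}', content):
                        found = True
                        break
        except Exception:
            # Same guess as in the question: a file that raises is
            # treated as already having text, and the loop moves on.
            has_text.append(pdf_name)
            continue
        if found:
            has_text.append(pdf_name)
        else:
            no_text.append(pdf_name)
    return has_text, no_text
```

With pyPdf installed, the call would presumably be classify_pdfs(path, pyPdf.PdfFileReader). Catching Exception instead of using a bare except: still lets Ctrl+C interrupt a long batch run.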

