As per our conversation in the OP, here are some options for you to consider.

Option 1:
If you are using a PDF directly as your input file:
import fitz  # PyMuPDF

input_file = '/path/to/your/pdfs/'  # should point at the PDF file to read
pdf_file = input_file
doc = fitz.open(pdf_file)
noOfPages = doc.page_count  # older PyMuPDF releases spell this doc.pageCount
for pageNo in range(noOfPages):
    page = doc.load_page(pageNo)  # formerly doc.loadPage
    # Returns a list of tuples (x0, y0, x1, y1, "line1\nline2\nline3...", block_no, block_type);
    # older releases spell this page.getText('blocks')
    pageTextblocks = page.get_text('blocks')
    pageTextblocks.sort(key=lambda block: block[3])  # order blocks top to bottom by their bottom edge (y1)
    for block in pageTextblocks:
        targetBlock = block[4]  # text content of the block; apply your own logic here to get the relevant data
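As a rough illustration of "work your logic here", the sketch below filters and orders block tuples before reading their text. It runs on plain tuples, so it does not need a PDF. The function name `text_blocks_in_reading_order` and the sample data are hypothetical; only the tuple layout (x0, y0, x1, y1, text, block_no, block_type) comes from PyMuPDF. It sorts left to right within a line, which may need to be reversed for a right-to-left script such as Hebrew.

```python
def text_blocks_in_reading_order(blocks):
    """Return the text of each text block, ordered top to bottom, then left to right."""
    # Index 6 is the block type: 0 = text, 1 = image. Keep only text blocks.
    text_only = [b for b in blocks if b[6] == 0]
    # Sort by the top edge (y0), then the left edge (x0). Coordinates are rounded so
    # that blocks sitting on nearly the same line are not reordered by tiny offsets.
    ordered = sorted(text_only, key=lambda b: (round(b[1]), round(b[0])))
    return [b[4].strip() for b in ordered]


# Hypothetical sample shaped like the output of page.get_text('blocks')
sample_blocks = [
    (300.0, 50.2, 500.0, 70.0, "right header\n", 1, 0),
    (50.0, 50.0, 250.0, 70.0, "left header\n", 0, 0),
    (50.0, 100.0, 500.0, 300.0, "<image>", 2, 1),
    (50.0, 320.0, 500.0, 400.0, "body line 1\nbody line 2\n", 3, 0),
]

for text in text_blocks_in_reading_order(sample_blocks):
    print(repr(text))
```

In the loop above you could pass `pageTextblocks` to this helper instead of sorting in place.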
Option 2:
If you are using an image as your input, you need to convert it to a PDF before processing it with the code snippet from Option 1.
doc = fitz.open(input_file)  # open the image file as a document
pdfbytes = doc.convert_to_pdf()  # convert it to in-memory PDF data (formerly doc.convertToPDF)
pdf = fitz.open("pdf", pdfbytes)  # reopen those bytes as a PDF document
A useful tip when working with images in PyMuPDF: if the image is hard to read, you can use a zoom factor to increase the resolution.
zoom = 1.2 # scale the image by 120%
mat = fitz.Matrix(zoom,zoom)
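To get a feel for what a zoom factor means in practice: PDF coordinates are in points (1/72 inch), so a zoom of 1 renders at 72 dpi. The sketch below converts between zoom and dpi. The helper names are hypothetical, and the 300 dpi figure is only a commonly quoted starting point for OCR, not a requirement.

```python
BASE_DPI = 72  # one PDF point is 1/72 inch, so zoom 1.0 renders at 72 dpi


def dpi_for_zoom(zoom):
    """Effective resolution when a page is rendered with fitz.Matrix(zoom, zoom)."""
    return zoom * BASE_DPI


def zoom_for_dpi(target_dpi):
    """Zoom factor needed to render a page at the requested resolution."""
    return target_dpi / BASE_DPI


print(round(dpi_for_zoom(1.2), 1))   # resolution produced by the zoom used above
print(round(zoom_for_dpi(300), 2))   # zoom needed for a 300 dpi render
```

If OCR quality is poor at zoom 1.2, one option is to build the matrix from `zoom_for_dpi(300)` instead, at the cost of larger images.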
Option 3:
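Since you mentioned tesseract, here is a hybrid approach using PyMuPDF and pytesseract. I am not sure whether it fits your need to extract Hebrew, but it is an idea. This example is for PDFs.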
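import fitz  # PyMuPDF
import pytesseract
from PIL import Image  # needed for Image.open below

pytesseract.pytesseract.tesseract_cmd = r'/path/to/your/tesseract/cmd'
input_file = '/path/to/pdfs'
pdf_file = input_file
fullText = ""
doc = fitz.open(pdf_file)
zoom = 1.2  # render at 120% so tesseract has more pixels to work with
mat = fitz.Matrix(zoom, zoom)
noOfPages = doc.page_count  # formerly doc.pageCount
for pageNo in range(noOfPages):
    page = doc.load_page(pageNo)  # load the page at this index (formerly doc.loadPage)
    pix = page.get_pixmap(matrix=mat)  # render the page to an image (formerly page.getPixmap)
    output = '/path/to/save/image' + str(pageNo) + '.png'  # the file is written as PNG, so name it .png
    pix.save(output)  # formerly pix.writePNG
    print('Converting PDFs to Image ... ' + output)
    text_of_each_page = pytesseract.image_to_string(Image.open(output))  # already returns a str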