ImageWriter save FLATE_DECODE image as BMP makes the output file corrupt (pdfminer.six issue, 5 comments, closed)

dotrunghieu96 commented on June 10, 2024

ImageWriter save FLATE_DECODE image as BMP makes the output file corrupt

Comments (5)

pietermarsman commented on June 10, 2024

Hi @dotrunghieu96,

Thanks for the bug report and the corresponding PR.

Could you share a PDF and some code that you use to reproduce this bug? That will allow me to understand the impact of your suggested change better.

dotrunghieu96 commented on June 10, 2024

Hi @pietermarsman, this is the file that I used
GitGuide.pdf

In code, first I was parsing the LTObjects via pages

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# open the pdf file
fp = open(pdf_doc, "rb")
# create a parser object associated with the file object
parser = PDFParser(fp)
# create a PDFDocument object that stores the document structure
doc = PDFDocument(parser)
# connect the parser and document objects
parser.set_document(doc)

Then parse the LTObjects

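from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTImage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager

# assumed defaults; the original snippet did not show how these were created
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page_number, page in enumerate(PDFPage.create_pages(doc), start=1):
    print("parsing pages:", page_number, flush=True)
    interpreter.process_page(page)
    # receive the LTPage object for this page
    layout = device.get_result()
    for lt_obj in layout:
        if isinstance(lt_obj, LTImage):
            saved_file = save_image(lt_obj, page_number, images_folder)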

In save_image, I used the ImageWriter class:

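from pdfminer.image import ImageWriter
from pdfminer.layout import LTImage

def save_image(lt_image: LTImage, page_number, images_folder):
    # ImageWriter picks the output format based on the image stream's filters
    image_writer = ImageWriter(images_folder)
    file_name = image_writer.export_image(lt_image)
    # return the name so the caller can record which file was written
    return file_name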

The problem here is that the images in the PDF are FLATE_DECODE, but ImageWriter saves them as .bmp images, which corrupts them.

So I moved the FLATE_DECODE check to a higher priority so that the _save_bytes() method is used first, and saved the image as ".jpg", which makes the saved images perfectly viewable.
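One way to check what the decoded stream actually contains is to look at its leading magic bytes instead of trusting the file extension. The sketch below is stdlib-only, and `sniff_image_format` is a hypothetical helper, not part of pdfminer.six. It assumes you already have the decoded bytes, for example from `lt_image.stream.get_data()`. It is only a heuristic: raw pixel samples could start with the same bytes by coincidence.

```python
import zlib

# Hypothetical helper, not part of pdfminer.six: guess the container format
# of already-decoded image bytes from their leading magic bytes.
MAGIC_PREFIXES = {
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"BM": "bmp",
}


def sniff_image_format(data: bytes) -> str:
    """Return a file extension guess, or "raw" if no known header is found."""
    for prefix, extension in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return extension
    # No container header: most likely bare pixel samples.
    return "raw"


# Simulate a stream whose Flate layer wraps JPEG-like data.
jpeg_like = b"\xff\xd8\xff\xe0" + bytes(16)
compressed = zlib.compress(jpeg_like)
print(sniff_image_format(zlib.decompress(compressed)))

# Simulate a stream whose Flate layer wraps bare pixel samples.
print(sniff_image_format(bytes(12)))
```

If the decoded bytes turn out to be a JPEG, writing them unchanged with a ".jpg" extension would be expected to work; if they are bare samples, they need a header (such as BMP) before a viewer can open them.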

pietermarsman commented on June 10, 2024

I cannot replicate this with the latest version.

Using

python tools/pdf2txt.py ~/Downloads/GitGuide.pdf --output-dir images

I get all the images properly formatted: some as JPGs, and a bunch as BMPs (converted to JPG so that they can be shown by GitHub).

pietermarsman commented on June 10, 2024

Let me know if the issue is still there for you, and we can reopen this issue. In that case, could you specify what you mean by "corrupt"?
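To make "corrupt" more concrete, a quick sanity check of the BMP headers can help narrow things down. The following is a minimal sketch, not part of pdfminer.six: `check_bmp_header` and `make_bmp` are hypothetical names, and the check assumes the common 40-byte BITMAPINFOHEADER layout with uncompressed (BI_RGB) pixel data.

```python
import struct


def check_bmp_header(data: bytes) -> list:
    """Return a list of problems found in the BMP headers (empty if none)."""
    if len(data) < 34:
        return ["file is too short to contain BMP headers"]
    # 14-byte file header: signature, file size, reserved, pixel data offset.
    magic, file_size, _reserved, pixel_offset = struct.unpack_from("<2sIII", data, 0)
    # Start of the info header: its size, width, height, planes, bpp, compression.
    _dib_size, width, height, _planes, bpp, compression = struct.unpack_from(
        "<IiiHHI", data, 14
    )
    problems = []
    if magic != b"BM":
        problems.append("missing 'BM' signature")
    if file_size != len(data):
        problems.append(f"header says {file_size} bytes, file has {len(data)}")
    if compression == 0:  # BI_RGB: rows are padded to a multiple of 4 bytes
        row_size = (bpp * width + 31) // 32 * 4
        expected = row_size * abs(height)
        if len(data) - pixel_offset < expected:
            problems.append(f"pixel data is shorter than {expected} bytes")
    return problems


def make_bmp(width: int, height: int, bpp: int = 24) -> bytes:
    """Build a tiny all-black BMP purely for demonstration."""
    row_size = (bpp * width + 31) // 32 * 4
    pixels = bytes(row_size * height)
    dib = struct.pack(
        "<IiiHHIIiiII", 40, width, height, 1, bpp, 0, len(pixels), 2835, 2835, 0, 0
    )
    header = struct.pack(
        "<2sIII", b"BM", 14 + len(dib) + len(pixels), 0, 14 + len(dib)
    )
    return header + dib + pixels


good = make_bmp(2, 2)
print(check_bmp_header(good))       # a well-formed file
print(check_bmp_header(good[:60]))  # the same file, truncated
```

Running `check_bmp_header(open(path, "rb").read())` on an exported file would show whether the headers are inconsistent or whether the problem lies in the pixel data itself.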

iraykhel commented on June 10, 2024

Yup, extracting .bmp doesn't work.
Crashes here:
if params and "Predictor" in params:
TypeError: argument of type 'PDFObjRef' is not iterable

If this check is bypassed, extracted .bmp is corrupted.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage
from pdfminer.image import ImageWriter

pdf_path = 'path/to/bmp'
writer = ImageWriter('path/to/storage')
pages = extract_pages(pdf_path)
# only the first page is needed to reproduce the crash
for element in next(pages):
    # images are nested inside LTFigure containers
    if isinstance(element, LTFigure):
        for sub in element:
            if isinstance(sub, LTImage):
                writer.export_image(sub)

bmpsample2.pdf
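The traceback above suggests that `params` is still an indirect object reference when the membership test runs. The sketch below illustrates that failure mode and one possible guard. It uses a stand-in class, `FakeObjRef`, rather than pdfminer's real `PDFObjRef`, and `resolve_if_ref` and `has_predictor` are hypothetical names. In pdfminer.six itself, `pdfminer.pdftypes.resolve1` serves a similar purpose; whether the actual fix should call it at this point is an assumption.

```python
class FakeObjRef:
    """Stand-in for an indirect object reference (NOT pdfminer's PDFObjRef)."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve_if_ref(obj):
    """Follow stand-in references until a plain object is reached."""
    while isinstance(obj, FakeObjRef):
        obj = obj.resolve()
    return obj


def has_predictor(params) -> bool:
    """Membership test that tolerates params being an unresolved reference."""
    params = resolve_if_ref(params)
    return isinstance(params, dict) and "Predictor" in params


ref = FakeObjRef({"Predictor": 15, "Columns": 4})

# Testing membership on the unresolved reference raises TypeError,
# because the reference object is neither a container nor iterable.
try:
    "Predictor" in ref
except TypeError as exc:
    print("unresolved reference:", exc)

# Resolving first makes the same check safe.
print(has_predictor(ref))
```

Note that the report says the extracted .bmp is still corrupted once the check is bypassed, so a guard like this would only address the crash, not the output.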
