Comments (5)

pietermarsman commented on June 10, 2024

Hi @dotrunghieu96,

Thanks for the bug report and the corresponding PR.

Could you share a PDF and some code that you use to reproduce this bug? That will allow me to understand the impact of your suggested change better.

from pdfminer.six.

dotrunghieu96 commented on June 10, 2024

Hi @pietermarsman, this is the file that I used
GitGuide.pdf

In my code, I first open the PDF so that I can parse the LTObjects page by page:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# open the pdf file (pdf_doc is the path to the PDF)
fp = open(pdf_doc, "rb")
# create a parser object associated with the file object
parser = PDFParser(fp)
# create a PDFDocument object that stores the document structure
doc = PDFDocument(parser)
# connect the parser and document objects
parser.set_document(doc)

Then I parse the LTObjects:

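from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTImage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager

# assumed setup (not shown in the original snippet): default layout parameters
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for i, page in enumerate(PDFPage.create_pages(doc)):
    page_number = i + 1
    print("parsing pages:", page_number, flush=True)
    interpreter.process_page(page)
    # receive the LTPage object for this page
    layout = device.get_result()
    for lt_obj in layout:
        if isinstance(lt_obj, LTImage):
            saved_file = save_image(lt_obj, page_number, images_folder)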

In save_image, I used the ImageWriter class:

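from pdfminer.image import ImageWriter
from pdfminer.layout import LTImage

def save_image(lt_image: LTImage, page_number, images_folder):
    image_writer = ImageWriter(images_folder)
    file_name = image_writer.export_image(lt_image)
    # return the name so the caller can store it in saved_file
    return file_name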

The problem is that the images in the PDF use FLATE_DECODE, but ImageWriter saves them as .bmp images, which corrupts them.

So I moved the FLATE_DECODE check to a higher priority so that the _save_bytes() method is used first, and saved the images as ".jpg", which makes the saved images perfectly viewable.
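
Since the report hinges on whether an exported file's extension matches its actual contents, a quick way to check is to look at the leading signature bytes of each exported file. The following is only a minimal sketch using the standard library; the helper names `sniff_image_format` and `sniff_file` are hypothetical and are not part of pdfminer.six. It relies on the fact that JPEG data starts with the bytes FF D8 FF and BMP files start with the ASCII characters "BM".

```python
# Leading "magic" bytes of the two formats discussed in this thread.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "bmp": b"BM",
}


def sniff_image_format(data: bytes) -> str:
    """Return 'jpeg', 'bmp', or 'unknown' based on the leading bytes."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_image_format(fh.read(8))
```

Running `sniff_file` on each file in the output folder would show whether a file saved as ".bmp" or ".jpg" really contains that format, which may help pin down what "corrupt" means for a given PDF.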

pietermarsman commented on June 10, 2024

I cannot replicate this with the latest version.

Using

python tools/pdf2txt.py ~/Downloads/GitGuide.pdf --output-dir images

I get all the images properly formatted. Some are saved as .jpg files:

X8

And a bunch are saved as .bmp files (converted to .jpg here so that GitHub can display them):

X44

pietermarsman commented on June 10, 2024

Let me know if the issue is still there for you, and we can reopen this issue. In that case, could you specify what you mean by "corrupt"?

iraykhel commented on June 10, 2024

Yup, extracting .bmp doesn't work.
It crashes here:
if params and "Predictor" in params:
TypeError: argument of type 'PDFObjRef' is not iterable

If this check is bypassed, the extracted .bmp is corrupted.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage
from pdfminer.image import ImageWriter

pdf_path = 'path/to/bmp'
writer = ImageWriter('path/to/storage')
pages = extract_pages(pdf_path)
# only the first page is inspected
for element in next(pages):
    if isinstance(element, LTFigure):
        for sub in element:
            if isinstance(sub, LTImage):
                writer.export_image(sub)

bmpsample2.pdf
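
The TypeError above is what Python raises when `in` is applied to an object that is neither a container nor iterable, which fits the case where `params` is still an indirect reference rather than a dictionary. The sketch below illustrates that failure mode and one way around it, resolving the reference before the membership test. It is not pdfminer.six code: `StandInRef`, `resolve_ref`, and `has_predictor` are hypothetical names used only for illustration. In pdfminer.six itself, the `resolve1` helper in `pdfminer.pdftypes` serves the purpose that `resolve_ref` serves here, but whether that is the right fix for this issue is an assumption the thread does not confirm.

```python
class StandInRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve_ref(obj):
    """Follow stand-in references until a concrete object is reached."""
    while isinstance(obj, StandInRef):
        obj = obj.resolve()
    return obj


def has_predictor(params) -> bool:
    """Membership test that tolerates params being an indirect reference."""
    params = resolve_ref(params)
    return isinstance(params, dict) and "Predictor" in params


params = StandInRef({"Predictor": 15, "Columns": 3})
try:
    "Predictor" in params  # same kind of check that crashes in the report
except TypeError as exc:
    print("unresolved reference:", exc)
print("resolved reference:", has_predictor(params))
```

Note that this only addresses the crash. The comment above reports that the extracted .bmp is still corrupted once the check is bypassed, so resolving the reference alone may not be sufficient.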
