
PDF parsing is poor (lurnby, OPEN)

Roznoshchik commented on August 17, 2024
PDF parsing is poor

from lurnby.

Comments (6)

Artaud commented on August 17, 2024

I have had great success parsing very complex PDFs to HTML using pdf2htmlEX, especially this fork: https://github.com/pdf2htmlEX/pdf2htmlEX (the original is unmaintained).
It doesn't do OCR, though.
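
For anyone who wants to try this from Python, here is a minimal sketch of shelling out to the pdf2htmlEX binary. It assumes the `pdf2htmlEX` executable is already installed and on the PATH. The helper names `build_command` and `convert`, and the choice of output file name, are made up for illustration and are not part of lurnby or pdf2htmlEX.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, zoom=1.0):
    """Build the argument list for one pdf2htmlEX run (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    return [
        "pdf2htmlEX",
        "--zoom", str(zoom),         # scale factor for the rendered pages
        "--dest-dir", str(out_dir),  # directory that receives the output
        str(pdf_path),               # input PDF
        pdf_path.stem + ".html",     # output file name inside out_dir
    ]


def convert(pdf_path, out_dir, zoom=1.0):
    """Run pdf2htmlEX and return the path of the generated HTML file."""
    if shutil.which("pdf2htmlEX") is None:
        raise RuntimeError("pdf2htmlEX binary not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_command(pdf_path, out_dir, zoom)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(out_dir) / cmd[-1]
```

Keeping the command construction separate from the subprocess call makes it easy to check the arguments without having the binary installed.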


Roznoshchik commented on August 17, 2024

Thanks @Artaud,

I'll try to play with this and see how it works. I think my biggest concern is how it would work on mobile, but I guess that should be secondary to actually having it work for the majority of inputs.

A brief look at some of the samples showed that on mobile there isn't any re-rendering; the whole page just shrinks to a tiny size.


Roznoshchik commented on August 17, 2024

Looking at this more closely, pdf2htmlEX does seem promising, but it's not a Python package, which means that using it on Heroku, where I'm currently hosting the app, would require some extra work.

I'm not sure how to compile C apps to run on Heroku, so the best bet seems to be to convert to a Docker deployment and deploy the Docker image to Heroku.

I've started that process, but it involves quite a lot of changes so will see how it goes.


Roznoshchik commented on August 17, 2024

I was able to get pdf2htmlEX running on the docker container, but it's not working with some of the pdfs. Likely some missing font libraries.

But on closer look I realized that I was mostly able to get the same output using pymupdf which I was already using. I just wasn't using the automatic html conversion. I was building the html manually.
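
To make the two routes concrete, here is a rough sketch, not lurnby's actual code. It assumes PyMuPDF is installed (imported as `fitz`) and uses `page.get_text("html")` for the automatic conversion and `page.get_text("blocks")` for a manual build. The helper names `blocks_to_html` and `pdf_to_html` are hypothetical.

```python
from html import escape


def blocks_to_html(blocks):
    """Build plain paragraphs from PyMuPDF-style text blocks.

    Each block is a tuple (x0, y0, x1, y1, text, block_no, block_type),
    the shape returned by page.get_text("blocks"); block_type 0 is text.
    """
    parts = []
    for block in blocks:
        text, block_type = block[4], block[6]
        if block_type != 0:  # skip image blocks
            continue
        # Collapse the line breaks inside a block into single spaces.
        cleaned = " ".join(text.split())
        if cleaned:
            parts.append(f"<p>{escape(cleaned)}</p>")
    return "\n".join(parts)


def pdf_to_html(path, automatic=False):
    """Convert a PDF to HTML, either automatically or block by block."""
    import fitz  # PyMuPDF; imported lazily so blocks_to_html works without it

    with fitz.open(path) as doc:
        if automatic:
            # Built-in conversion: positioned output with inline styles.
            return "\n".join(page.get_text("html") for page in doc)
        # Manual route: plain <p> tags that the reader's own CSS can style.
        return "\n".join(blocks_to_html(page.get_text("blocks")) for page in doc)
```

The manual route loses the original layout but produces markup that a reader's own stylesheet can control.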

And I remembered why I made that decision. Both pymupdf and pdf2htmlEX convert the pdf to html, but they do so with a lot of inline css to render the page exactly the same.

This kills many of Lurnby's reader functions like dark/light mode, font size adjustments, etc. And makes it a bit annoying to try and highlight text due to the way it's rendered. Removing the inline css also doesn't lead to great layouts.
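
As an illustration of what removing the inline CSS involves, here is a small regex-based sketch. It is an assumption about how one might do it, not what lurnby does. It only handles quoted `style="..."` attributes and `<style>` blocks, and a real HTML parser would be more robust. The function name `strip_inline_css` is hypothetical.

```python
import re

# Matches a quoted style attribute (simplifying assumption: the value does not
# contain a quote character of the same kind that delimits it).
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)
# Matches whole <style>...</style> blocks, including multi-line ones.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL)


def strip_inline_css(html):
    """Remove <style> blocks and style attributes from an HTML string."""
    html = STYLE_BLOCK.sub("", html)
    return STYLE_ATTR.sub("", html)
```

With the positioning rules gone, the text flows under the reader's own stylesheet, but every absolutely positioned fragment collapses into document order, which is why the resulting layout tends to look poor.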

All of this is maybe fine, but the way in which I'm currently rendering the article content into the reader means that many PDFs, even those converted to HTML using those libraries, will completely break and destroy the page layout. To pursue that option, I would need to render a separate reader for PDFs to account for any changes.

Which isn't necessarily a bad thing. Just requires a lot more research and testing to determine if that's the best way forward or not.

Another not-so-great option that I'm considering is to work with PDFs in image format. pymupdf has an option to convert a PDF page to an image. This has its own drawbacks, obviously: the text isn't selectable, it doesn't work well on both mobile and desktop, etc.
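
For reference, rendering pages to images with PyMuPDF could look roughly like this. It is a sketch that assumes PyMuPDF is installed (imported as `fitz`). The helper names `zoom_for_dpi` and `render_pages`, the default DPI, and the file naming scheme are illustrative choices, not anything from lurnby.

```python
from pathlib import Path


def zoom_for_dpi(dpi):
    """PDF user space is 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72


def render_pages(pdf_path, out_dir, dpi=144):
    """Render every page of a PDF to a PNG file and return the saved paths."""
    import fitz  # PyMuPDF; imported lazily so zoom_for_dpi works without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    zoom = zoom_for_dpi(dpi)
    saved = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # A scaling matrix controls the output resolution.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            target = out_dir / f"page-{page.number + 1:03d}.png"
            pix.save(str(target))
            saved.append(target)
    return saved
```

A higher DPI keeps the text legible when zooming on a small screen, at the cost of larger files.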

But, it aligns with another feature I'm considering which is the ability to highlight images.

I'm looking at incorporating Mozilla's screenshot library.

This would allow me to capture a portion of a page and then save that image. This way, an image-based PDF could possibly still be annotated and worked with.

In short, I'm looking at a bunch of seemingly suboptimal options.


 avatar commented on August 17, 2024

@Roznoshchik do we have any updates on this?


Roznoshchik commented on August 17, 2024

No, unfortunately. I have been too busy to do anything on this, and the Readwise team has been killing it, so it hasn't felt like there was a strong need for this.

I personally haven't been reading too many PDFs either, so it hasn't been a priority.

