The current pdf library leaves a lot to be desired. It only works fo

PDF parsing is poor about lurnby HOT 6 OPEN

roznoshchik commented on August 17, 2024

PDF parsing is poor

from lurnby.

Comments (6)

Artaud commented on August 17, 2024 1

I have had great success parsing very complex PDFs to HTML using pdf2htmlEX, especially this fork: https://github.com/pdf2htmlEX/pdf2htmlEX (the original is unmaintained).
It doesn't do OCR, though.
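
For anyone wanting to try this from Python, here is a minimal sketch of calling the pdf2htmlEX command-line tool through `subprocess`. The helper names and the `html_out` directory are made up for illustration, and the sketch assumes the `pdf2htmlEX` binary is on `PATH` and writes `<input name>.html` into the destination directory by default.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdf2htmlex_command(pdf_path, dest_dir):
    """Build the argument list for a basic pdf2htmlEX run (hypothetical helper)."""
    # --dest-dir tells pdf2htmlEX where to write the generated HTML.
    return ["pdf2htmlEX", "--dest-dir", str(dest_dir), str(pdf_path)]


def convert_pdf(pdf_path, dest_dir="html_out"):
    """Run pdf2htmlEX if it is installed and return the expected output path."""
    if shutil.which("pdf2htmlEX") is None:
        raise RuntimeError("pdf2htmlEX is not on PATH")
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdf2htmlex_command(pdf_path, dest_dir), check=True)
    # Assumption: the output file is named after the input with an .html suffix.
    return Path(dest_dir) / (Path(pdf_path).stem + ".html")
```

Calling `convert_pdf("paper.pdf")` would then be expected to produce `html_out/paper.html`, with the caveat that scanned PDFs still come out as images since there is no OCR step.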

Roznoshchik commented on August 17, 2024

Thanks @Artaud,

I'll try to play with this and see how it works. I think my biggest concern is how it would work on mobile, but I guess that should be secondary to actually having it work for the majority of inputs.

A brief look at some of the samples showed that on mobile there isn't any re-rendering; the whole page just shrinks to a tiny size.

Roznoshchik commented on August 17, 2024

Looking at this more closely, pdf2htmlEX does seem promising, but it's not a Python package, which means using it on Heroku, where I'm currently hosting the app, would require some extra work.

I'm not sure how to compile C apps to run on Heroku, so the best bet seems to be to convert to a Docker deployment and deploy the Docker image to Heroku.

I've started that process, but it involves quite a lot of changes, so we'll see how it goes.

Roznoshchik commented on August 17, 2024

I was able to get pdf2htmlEX running in the Docker container, but it's not working with some of the PDFs, likely because of some missing font libraries.

But on closer look I realized that I could get mostly the same output using PyMuPDF, which I was already using. I just wasn't using its automatic HTML conversion; I was building the HTML manually.
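
As a rough illustration of that automatic conversion, the sketch below joins the output of PyMuPDF's `page.get_text("html")` for each page. The function names are hypothetical and this is not Lurnby's actual code; `fitz` is imported lazily so the first helper works with any iterable of page-like objects.

```python
def pdf_to_html(doc):
    """Concatenate PyMuPDF's built-in HTML output for every page of `doc`."""
    # page.get_text("html") returns the page as HTML that carries its own
    # inline positioning styles.
    parts = [page.get_text("html") for page in doc]
    return "\n".join(parts)


def convert_file(path):
    """Open a PDF with PyMuPDF and return its HTML (hypothetical helper)."""
    import fitz  # PyMuPDF; assumed to be installed

    with fitz.open(path) as doc:
        return pdf_to_html(doc)
```

Something like `convert_file("example.pdf")` (a placeholder file name) should return one HTML fragment per page, joined by newlines.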

And I remembered why I made that decision. Both PyMuPDF and pdf2htmlEX convert the PDF to HTML, but they do so with a lot of inline CSS to render the page exactly like the original.

This kills many of Lurnby's reader functions like dark/light mode, font size adjustments, etc. It also makes it a bit annoying to highlight text because of the way it's rendered. Removing the inline CSS doesn't lead to great layouts either.
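
To make "removing the inline CSS" concrete, here is a rough sketch that strips `style` attributes with a regular expression. It is an illustration only, not what Lurnby does: the function name is invented, and a regex like this can misfire on unusual markup, so a real HTML parser would be the safer choice.

```python
import re

# Matches a style="..." or style='...' attribute plus its leading whitespace.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*("[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_inline_styles(html):
    """Remove inline style attributes so the reader's own CSS can apply."""
    return STYLE_ATTR.sub("", html)
```

For example, `strip_inline_styles('<p style="top:10pt;left:5pt">Hi</p>')` gives `<p>Hi</p>`. The positioning information is gone along with the styles, which is why the resulting layout tends to fall apart.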

All of this is maybe fine, but the way I'm currently rendering the article content into the reader means that many PDFs, even those converted to HTML using those libraries, will completely break the page layout. To pursue that option, I would need to render a separate reader for PDFs to account for any changes.

Which isn't necessarily a bad thing. Just requires a lot more research and testing to determine if that's the best way forward or not.

Another not-so-great option that I'm considering is to work with PDFs in image format. PyMuPDF has an option to convert a PDF page to an image. This has its own drawbacks, obviously: the text isn't selectable, the same image doesn't work well on both mobile and desktop, etc.

But, it aligns with another feature I'm considering which is the ability to highlight images.

I'm looking at incorporating Mozilla's screenshot library.

This would allow me to capture a portion of a page and then save that image. This way, an image pdf would possibly still be able to be annotated and worked on.
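
The page-to-image route mentioned above could look roughly like the sketch below, which renders each page to a PNG with PyMuPDF's `page.get_pixmap`. The function names, the output file naming, and the 144 DPI default are assumptions chosen for illustration.

```python
from pathlib import Path


def zoom_for_dpi(dpi):
    """PDF pages are measured in points (72 per inch), so zoom = dpi / 72."""
    return dpi / 72.0


def render_pages_to_png(pdf_path, out_dir, dpi=144):
    """Render every page of a PDF to a PNG file and return the written paths."""
    import fitz  # PyMuPDF; assumed to be installed

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    zoom = zoom_for_dpi(dpi)
    written = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # A scaling matrix controls the resolution of the rendered image.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            target = out / f"page-{page.number + 1}.png"
            pix.save(str(target))
            written.append(target)
    return written
```

A higher DPI gives sharper images at the cost of larger files, which matters if the same image has to serve both mobile and desktop.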

In short, I'm looking at a bunch of seemingly suboptimal options.

commented on August 17, 2024

@Roznoshchik do we have any updates on this?

Roznoshchik commented on August 17, 2024

No, unfortunately. I have been too busy to do anything on this, and the Readwise team has been killing it, so it hasn't felt like there was a strong need for this.

I personally haven't been reading too many PDFs either, so it hasn't been a priority.
