I would like it to be possible to filter elements by parts of font names. E.g. .filter

Document regular expression font mapping about py-pdf-parser HOT 9 CLOSED

Aceto1 commented on August 28, 2024

Document regular expression font mapping

from py-pdf-parser.

Comments (9)

jstockwin commented on August 28, 2024

Hey @Aceto1, I think that's a nice idea! Particularly being able to filter for more structured font info (e.g. just the size, or just the name). I don't think I have e.g. "Bold" as an attribute, though - it just comes as a string. But I could split the "Arial,Bold" from the size, certainly.
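To make the "split the name from the size" idea concrete, here is a minimal sketch in plain Python. It is not part of py-pdf-parser: the `FontInfo` type and `split_font` helper are hypothetical names, and the sketch assumes font strings always look like `PREFIX+Name,Size` or `Name,Size`, as in the examples in this thread.

```python
from typing import NamedTuple, Optional


class FontInfo(NamedTuple):
    """Hypothetical container for the parts of a raw font string."""

    prefix: Optional[str]  # e.g. "CIBPIK", or None if there is no "+" prefix
    name: str              # e.g. "Arial,Bold"
    size: float            # e.g. 12.0


def split_font(raw_font: str) -> FontInfo:
    """Split a string such as "CIBPIK+Arial,Bold,12.0" into its parts.

    Assumes the size is everything after the last comma.
    """
    name_part, _, size_part = raw_font.rpartition(",")
    prefix, plus, name = name_part.partition("+")
    if not plus:
        # No "+" present, so the whole name part is the font name.
        prefix, name = None, name_part
    return FontInfo(prefix, name, float(size_part))


info = split_font("CIBPIK+Arial,Bold,12.0")
print(info.name, info.size)
```

With parts like these, filtering by just the name or just the size becomes a simple comparison on one field, although the library itself would need to expose them.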

I think in your case there is an out-of-the-box solution, though. You can pass a font_mapping, to map the real font names (e.g. CIBPIK+Arial,Bold,12.0) to your own font names (e.g. arial_bold_12, or even better something more meaningful like page_header or bold_text). You are able to pass the font mapping as regular expressions if you pass font_mapping_is_regex=True. (See the font mapping overview, the Order Summary example, and the PDFDocument reference in the docs).

The regular expression font mapping is not well documented, and this could be improved. In particular, the order summary example has a similar issue (but the weird prefixes don't change for each font name), so we could easily point this out there.

In your case, then, perhaps something similar to the following would help solve your issue:

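from py_pdf_parser.loaders import load_file

# Map raw font strings (matched as regular expressions) to readable names.
_FONT_MAPPING = {
    r"\w{6}\+Arial,Bold,12\.0": "bold_text",  # (or other sensible names)
    # Add more fonts here as needed.
}
document = load_file(..., font_mapping=_FONT_MAPPING, font_mapping_is_regex=True)

# Select every element whose mapped font is "bold_text".
document.elements.filter_by_font("bold_text")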

You can use whatever regular expression you like. The \w{6}\+ part matches any 6 word characters (letters, digits, or underscores) followed by a + at the start of the font name.
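If you want to check a pattern before loading a PDF, a small standalone sketch with Python's `re` module can help. This is not the library's implementation: `map_font` is a hypothetical helper, the font strings are made up for illustration, and the sketch assumes patterns are matched from the start of the string (as `re.match` does). The dot in `12\.0` is escaped so that it only matches a literal dot.

```python
import re

# Hypothetical mapping in the style of the snippet above.
FONT_MAPPING = {
    r"\w{6}\+Arial,Bold,12\.0": "bold_text",
    r"\w{6}\+Arial,12\.0": "body_text",
}


def map_font(raw_font, mapping):
    """Return the name for the first pattern matching raw_font.

    Falls back to the raw font string when no pattern matches.
    """
    for pattern, name in mapping.items():
        if re.match(pattern, raw_font):
            return name
    return raw_font


# Two different six-character prefixes map to the same name.
print(map_font("CIBPIK+Arial,Bold,12.0", FONT_MAPPING))
print(map_font("XQZRTA+Arial,Bold,12.0", FONT_MAPPING))
# A font without the prefix does not match and is returned unchanged.
print(map_font("Arial,Bold,12.0", FONT_MAPPING))
```

The point is that the changing prefix no longer matters: any six word characters before the + are accepted, so one mapping entry covers every variant of the same font.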

Let me know if that helps you and we can work out how to move forward with this issue! If anything is unclear, just ask!

Aceto1 commented on August 28, 2024

Hey, thanks for the quick answer. I tested it and it works like a charm.
I could provide a PR for the docs if you want.

jstockwin commented on August 28, 2024

Glad it worked. Sure, a documentation PR would be fantastic!

I would personally do the following:

Add a note to the Font Mapping Overview, just mentioning that regular expression mapping is possible.
Add a new "Step 3" to the Order Summary Example which basically does step 2 again but now with regular expressions, making the comment that for some PDFs the first characters can change. Note that (a) there is a "full code" section at the bottom of each example, and (b) there are tests to ensure this continues to be correct. These will both need updating too.

Let me know if you need any help, otherwise I look forward to a PR!

jstockwin commented on August 28, 2024

Note: I've just converted this issue to be a documentation issue. Whilst other comments (about more structured font filtering) are still very valid, I don't really have the bandwidth to do them right now. If someone else is interested in that, do feel free to open a separate issue and reference this one and we can consider it again.

jstockwin commented on August 28, 2024

Hey @Aceto1, just wanted to check if you're still interested in producing a documentation PR for this?

I appreciate time is always short, but if you can let me know whether you expect to get around to it or not that would be great. Thanks!

Aceto1 commented on August 28, 2024

Hey, sorry been busy with exams and life stuff. I'll get to it next week for sure!

jstockwin commented on August 28, 2024

Amazing, thank you!

Aceto1 commented on August 28, 2024

Alright, is there anything else left to do on this issue that I can help with?

jstockwin commented on August 28, 2024

All good in my opinion, thanks very much!

Issue closed by #245
