Comments (9)

jstockwin commented on August 28, 2024

Hey @Aceto1, I think that's a nice idea! Particularly being able to filter for more structured font info (e.g. just the size, or just the name). I don't think I have e.g. "Bold" as an attribute, though - it just comes as a string. But I could split the "Arial,Bold" from the size, certainly.

I think in your case there is an out-of-the-box solution, though. You can pass a font_mapping, to map the real font names (e.g. CIBPIK+Arial,Bold,12.0) to your own font names (e.g. arial_bold_12, or even better something more meaningful like page_header or bold_text). You are able to pass the font mapping as regular expressions if you pass font_mapping_is_regex=True. (See the font mapping overview, the Order Summary example, and the PDFDocument reference in the docs).

The regular expression font mapping is not well documented, and this could be improved. In particular, the Order Summary example has a similar issue (although there the weird prefixes don't change for each font name), so we could easily point this out there.

In your case, then, perhaps something similar to the following would help solve your issue:

from py_pdf_parser.loaders import load_file

_FONT_MAPPING = {
    r"\w{6}\+Arial,Bold,12\.0": "bold_text",  # (or other sensible names)
    # Add more fonts here as needed.
}
document = load_file(..., font_mapping=_FONT_MAPPING, font_mapping_is_regex=True)

document.elements.filter_by_font("bold_text")

You can use whatever regular expression you like. Here, \w{6}\+ matches any six word characters (letters, digits, or underscores) followed by a literal + at the start of the font name.
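To see what such a pattern accepts without loading a PDF, here is a minimal standalone sketch using only Python's re module. The map_font helper and the second and third font strings are hypothetical, made up for illustration; this is not how py-pdf-parser applies the mapping internally, just a way to test a pattern against font names you have seen.

```python
import re

# Raw font strings as they might appear in a PDF. Only the first one comes
# from this thread; the other two are made-up examples.
RAW_FONTS = [
    "CIBPIK+Arial,Bold,12.0",
    "QWERTY+Arial,Bold,12.0",  # same font, different six-character prefix
    "CIBPIK+Arial,12.0",       # regular weight, should not be mapped
]

FONT_MAPPING = {
    # The dot is escaped so it only matches a literal "."; an unescaped "."
    # would match any character.
    r"\w{6}\+Arial,Bold,12\.0": "bold_text",
}


def map_font(raw_font, mapping):
    """Hypothetical helper: return the mapped name, or raw_font if nothing matches."""
    for pattern, name in mapping.items():
        if re.fullmatch(pattern, raw_font):
            return name
    return raw_font


mapped = [map_font(font, FONT_MAPPING) for font in RAW_FONTS]
print(mapped)  # ['bold_text', 'bold_text', 'CIBPIK+Arial,12.0']
```

Both prefixed variants of the bold font collapse to the same name, while the regular-weight font is left untouched.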

Let me know if that helps you and we can work out how to move forward with this issue! If anything is unclear, just ask!

from py-pdf-parser.

Aceto1 commented on August 28, 2024

Hey, thanks for the quick answer. I tested it and it works like a charm.
I could provide a PR for the docs if you want.

jstockwin commented on August 28, 2024

Glad it worked. Sure, a documentation PR would be fantastic!

I would personally do the following:

  1. Add a note to the Font Mapping Overview, just mentioning that regular expression mapping is possible.
  2. Add a new "Step 3" to the Order Summary Example which basically does step 2 again but now with regular expressions, making the comment that for some PDFs the first characters can change. Note that (a) there is a "full code" section at the bottom of each example, and (b) there are tests to ensure this continues to be correct. These will both need updating too.

Let me know if you need any help, otherwise I look forward to a PR!

jstockwin commented on August 28, 2024

Note: I've just converted this issue to be a documentation issue. Whilst other comments (about more structured font filtering) are still very valid, I don't really have the bandwidth to do them right now. If someone else is interested in that, do feel free to open a separate issue and reference this one and we can consider it again.

jstockwin commented on August 28, 2024

Hey @Aceto1, just wanted to check if you're still interested in producing a documentation PR for this?

I appreciate time is always short, but if you can let me know whether you expect to get around to it or not that would be great. Thanks!

Aceto1 commented on August 28, 2024

Hey, sorry been busy with exams and life stuff. I'll get to it next week for sure!

jstockwin commented on August 28, 2024

Amazing, thank you!

Aceto1 commented on August 28, 2024

Alright, is there anything else left to do on this issue that I can help with?

jstockwin commented on August 28, 2024

All good in my opinion, thanks very much!

Issue closed by #245
