prawnpdf / pdf-inspector

A collection of PDF::Reader based analysis classes for inspecting PDF output. Mainly used for testing Prawn, but will work with any PDF.

Home Page: http://prawnpdf.org

License: Other

Ruby 100.00%

pdf-inspector's Issues

Text.analyze_file([MY_PATH]).strings returns array of characters

I'm testing the contents of a PDF generated by PDFKit. When I run Text.analyze_file([MY_PATH]).strings on the file, I get an array that holds each character of the PDF content in its own index. Spaces are stored as '' (empty string). I've been able to move forward by replacing all empty strings with a space character. However, I'm now up against content that contains newline characters. The newlines are not stored in the array, so the separation between the words on either side of a newline is lost. Ever seen this sort of behaviour? I realize that a number of factors could be causing this, including my own ignorance, and I'd love to find the root of the problem, but I have no time. Right now, I'd be happy with a hack to get my tests working.
Cheers!

EDIT: So I came up with a hack that'll get me through. I remove all the whitespace characters from the array (they weren't actually empty strings, as I had believed), then join the remaining characters with exactly one space and downcase the whole thing.

def char_array_to_normalized_string(arr)
  arr.delete_if { |s| s =~ /\s/ }.join(' ').downcase
end

After I put my test strings through the same process, by calling char_array_to_normalized_string("Test String".scan(/./)), I'm able to match them against the output of pdf-inspector. It's not pretty, but it gets me where I need to go.
Cheers!
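
A non-mutating variant of the hack above could look roughly like this. It is only a sketch: normalize_chars and the sample arrays are hypothetical, not part of pdf-inspector, and it assumes strings returns one character per element as described in this issue.

```ruby
# Hypothetical helper: collapse a per-character array into a comparable string.
# reject returns a new array, so the caller's array is left untouched
# (delete_if in the original hack modifies it in place).
def normalize_chars(chars)
  chars.reject { |s| s =~ /\s/ }.join(' ').downcase
end

# Sample data standing in for Text.analyze_file(path).strings (assumed shape).
extracted = ['T', 'e', 's', 't', ' ', 'S', 't', 'r', 'i', 'n', 'g']
expected  = 'Test String'.scan(/./)

# Both sides go through the same normalization before being compared.
puts normalize_chars(extracted) == normalize_chars(expected)
```

Because every non-whitespace character ends up separated by a single space, the comparison ignores where the original spaces and newlines were, which is the trade-off the hack accepts.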

Pick a license for pdf-inspector

What license should pdf-inspector 1.0 be released with?

So far development has proceeded with no explicit license, but if we're upgrading this code to a gem in its own right, I feel we should be explicit.

Generic inspector

What about a generic inspector that catches everything, to easily compare two PDFs?

It could look like this: http://github.com/piglop/pdf-inspector/commit/511617be71eac11b318c52029f18a173feed5f33

Using it with assert_equal produces an unreadable test result, but it's great if you use unit_diff.
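
To illustrate the general idea (not necessarily what the linked commit does), a catch-all receiver can be sketched in plain Ruby with method_missing. RecordingReceiver is a hypothetical name, and the callbacks below are sent by hand for demonstration; in practice a PDF parser would send them.

```ruby
# Hypothetical catch-all receiver: records every callback it is sent.
class RecordingReceiver
  attr_reader :calls

  def initialize
    @calls = []
  end

  # Report that any method is supported, so callers that check
  # respond_to? before dispatching will still invoke the callback.
  def respond_to_missing?(_name, _include_private = false)
    true
  end

  # Store the callback name and its arguments in call order.
  def method_missing(name, *args)
    @calls << [name, args]
    nil
  end
end

# Simulated callback sequences for two documents (assumed sample data).
a = RecordingReceiver.new
a.begin_text_object
a.show_text('Hello')
a.end_text_object

b = RecordingReceiver.new
b.begin_text_object
b.show_text('Hello')
b.end_text_object

# Two documents compare equal when they produced the same callbacks.
puts a.calls == b.calls
```

Comparing the recorded arrays directly is what makes a failing assert_equal hard to read, since the whole callback list is dumped; a line-based diff of the two lists is easier to scan.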

A string with a superscript char gets parsed into two strings

If a string contains a superscript char like m², then the PDF::Inspector::Text.analyze(some_pdf).strings array will produce 2 strings:

m
²

instead of a single array entry with m²
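
As a stopgap, a test can assert on the joined text rather than on individual entries. The sketch below uses a hard-coded array in place of the real strings result; whether joining without a separator is acceptable depends on the document under test.

```ruby
# Stand-in for PDF::Inspector::Text.analyze(some_pdf).strings, using the
# split described above (assumed sample data, not real inspector output).
strings = ['m', '²']

# Joining the fragments restores the original run of text for matching.
joined = strings.join

puts joined.include?('m²')
```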

Actually publish it to Rubygems

Very useful code, but not on rubygems as advertised.

CHANGELOG does not reflect 1.2.0 updates

Looks like a minor change, but it got a dot-version bump, so might seem scary. Best to let us users know not to be afraid with a simple CHANGELOG update.

Numerals read as `\u0000` when using font feature settings

First of all, thanks for the work and effort you've put into this great library!

Bug description

We are having an issue with numerals not being read correctly by PDF::Inspector::Text.analyze. They get misinterpreted as \u0000 when we use font-feature-settings: 'tnum' as style. We are generating the PDF with Gotenberg from HTML templates.

Minimal reproducible example

<div>21.09.2023</div> gets read as 21.09.2023

while

<div style="font-feature-settings: 'tnum'">21.09.2023</div> gets read as \u0000\u0000.\u0000\u0000.\u0000\u0000\u0000\u0000.

PDFs

Here are two PDFs, one with the feature turned off and one with the feature turned on:
font_features_off.pdf
font_features_on.pdf

Further information

The UNIX tool pdftotext is able to read both versions correctly so I think the PDF is alright.
The font in use is Barlow if that makes any difference.

Any help would be appreciated!

P.S.: I'll also open an issue regarding this problem over at https://github.com/yob/pdf-reader so feel free to close this one if you think it should be handled there.

release 1.1

The next release of pdf-reader (1.4) will start printing deprecation warnings when the pre-1.0 API is used.

I'd like to release the current pdf-inspector master branch as 1.1 and move prawn to that so the prawn specs don't print deprecation warnings.

The pdf-inspector API remains the same.

Any objections?

PDF::Inspector::Text.analyze does not return strings within repeat block

When using PDF::Inspector::Text.analyze(some_pdf).strings, it does not return strings that were added via

repeat(1..page_count) do
  text_box "Hello", at: [0, bounds.top]
end

Is there a way to get a hold of these strings?