josephwright commented on May 12, 2024

Relates in some ways to #21: the same general mechanism for 'non-log' testing will probably be required.

josephwright commented on May 12, 2024

We could do this in various ways, for example a specific test mode, a switch to add the info to the .log, ...

car222222 commented on May 12, 2024

a switch to add the info to the .log, ...

The general ability to add stuff like this (as much as possible) to the .log file would be very widely useful.

FrankMittelbach commented on May 12, 2024

what about a simple flag in regression-test.tex, such as

\EXTERNALDATA{pdfinfo}{meta}

which outputs

<<PLACEHOLDER pdfinfo meta>>

in the log. The normalization (or a step next to it) could then run pdfinfo -meta on the pdf file and insert the result in place of that placeholder.

That step could support different external commands, though initially I would limit it to a defined set, say just pdfinfo.
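
A minimal sketch of how such a substitution step might look in l3build's Lua (the function name and the single-command whitelist are hypothetical, not part of any proposal above):

  -- Expand <<PLACEHOLDER cmd opt>> markers in the log by running an
  -- allowed external tool on the PDF and splicing in its output.
  local allowed = { pdfinfo = true }

  local function expand_placeholders(logtext, pdffile)
    return (logtext:gsub("<<PLACEHOLDER (%S+) (%S+)>>",
      function(cmd, opt)
        if not allowed[cmd] then return nil end -- nil keeps the marker as-is
        local pipe = io.popen(cmd .. " -" .. opt .. " " .. pdffile)
        local result = pipe:read("*a")
        pipe:close()
        return result
      end))
  end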

josephwright commented on May 12, 2024

The idea of using pdfinfo or similar is quite a good one, at least as far as testing tagging and similar goes. If you run pdfinfo -rawdates, you get basic info, e.g.

Title:          Testing Tagged PDF with LaTeX
Subject:        Testing paragraph split across pages in Tagged PDF
Author:         Dr. Ross Moore
Creator:        pdfTeX + pdfx.sty with a-1a option
Producer:       pdfTeX
CreationDate:   D:20170117160113+11'00'
ModDate:        D:20170117160113+11'00'
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          4
Encrypted:      no
Page size:      612 x 792 pts (letter)
Page rot:       0
File size:      419873 bytes
Optimized:      no
PDF version:    1.4

(PDF from web.science.mq.edu.au/~ross/TaggedPDF/test-LaTeX-article-unc.pdf); one can then look at the tag structure using pdfinfo -struct, which starts here

  Div "Topmatter"
    H "title" (block):
       /Placement /Block
       /WritingMode /LrTb
       /TextAlign /Center
       /Padding [10 10 0 0]
    P "author" (block):
       /Placement /Block
       /WritingMode /LrTb
       /TextAlign /Center
       /Padding [10 10 0 0]
      Reference (inline)
        Note (inline)
          Lbl (block)
    P "date" (block):
       /Placement /Block
       /WritingMode /LrTb
       /TextAlign /Center
       /Padding [10 10 0 0]
  NonStruct
    Object 64 0
  TOC "Contents"
    H "Contents" (block)
    TOCI
      Reference (inline)
        Link "to destination section.1" (inline)
          Object 74 0
      TOC "subsections"
        TOCI

I suspect that using a third-party tool is a better plan long-term than trying to use the PDF stream (#10), and avoids the issues that have become apparent in #21.

I'd welcome thoughts on this, particularly from @FrankMittelbach, @car222222 and @u-fischer. My idea at present would be to take the work I've already done on PDF-based testing and alter it to something like

  • Use .pvt to indicate a PDF-based test
  • Include in the .pvt instructions on which PDF tests to run, either in TeX form or perhaps
    as comments (thus parseable by Lua up-front): are there advantages to doing it from TeX?
  • Store the output of the analysis tools as a .tlg or perhaps some related extension (.plg?)
  • Use a simple diff with no need to normalise

That is all pretty easy, so the question is 'does the logic stack up'? It should cover tag testing, but I'm not sure what else might be wanted. It should, though, allow us to add other tools later.
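
A minimal sketch (in l3build's Lua; the file handling and names are hypothetical) of the last two steps above, storing the pdfinfo output as a .plg reference and comparing with a plain equality test:

  -- Run pdfinfo on the test PDF and compare the output byte-for-byte
  -- with the saved reference file: a simple diff, no normalisation.
  local function check_pdf_info(name)
    local pipe = io.popen("pdfinfo -rawdates " .. name .. ".pdf")
    local info = pipe:read("*a")
    pipe:close()
    local ref = assert(io.open(name .. ".plg", "r")) -- saved reference
    local reftext = ref:read("*a")
    ref:close()
    return info == reftext
  end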

u-fischer commented on May 12, 2024

pdfinfo is certainly quite useful, and it would be good to be able to include its output and compare it with some reference file in some test setups. But it doesn't test everything. E.g. if I remove the \tagmcend from https://github.com/u-fischer/tagpdf/blob/master/source/examples/structure/ex-patch-sectioning-koma.tex one gets an invalid pdf due to the missing EMC operator, but pdfinfo doesn't mind.

So I would need to be able to compare parts of the uncompressed pdf too.

The longer I ponder this, the more I think something like the arara rules would be useful: a per-file way to say that, e.g., this test should compare everything in the pdf from 24 0 obj to the next endstream.

It would imho be ok if every lvt always does only one or two tests, so that one doesn't need tons of different extensions for the reference files.

josephwright commented on May 12, 2024

@u-fischer I'm still getting a handle on this, but I suspect what makes sense for automated testing ('is the output as expected') isn't the same as what is needed to set up the tests ('is the output right'). One sees the same in a lot of box-based tests: in the end, it takes a human looking at the output to be sure they are correct; our tests then pick up if they change. So for tagging, I was imagining using Adobe Pro to check the PDFs are correct, then checking in appropriate data so that we can later verify they don't get broken.

I did wonder too about picking out objects: that is certainly doable, and of course ends up again with purely text files. Again, one might imagine using the same setup with some form of marker data. If done as 'magic' comments, something like

% l3build checkobjs <numbers>
% l3build pdfinfo --rawdates
% l3build pdfinfo --struct 

or if done at the macro level

\CHECKPDFOBJS{<numbers>}
\CHECKPDFINFO
\CHECKPDFSTRUCT

I guess the latter approach would work with the 'standard' .lvt extension, but it does make the internal logic easier if a separate code path can be indicated by the input file name, so I'd favour using .pvt or similar.
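
A minimal sketch of the comment-based variant in Lua (the function name is hypothetical): scanning the .lvt up-front for 'magic' lines is just a pattern match.

  -- Collect 'magic' comment lines such as '% l3build pdfinfo --struct'
  -- directly from the .lvt source for later processing.
  local function find_magic_lines(lvtfile)
    local checks = { }
    for line in io.lines(lvtfile) do
      local instruction = line:match("^%% l3build (.+)$")
      if instruction then
        checks[#checks + 1] = instruction
      end
    end
    return checks
  end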

u-fischer commented on May 12, 2024

Yes, the initial pdf must certainly be checked for correctness by a human with tools like preflight, pax3, pdfinfo and other things. The automatic testing often "only" needs to compare the two pdfs or the pdfinfo output minus some normalization (I checked miktex against texlive and the differences were few: only producer, id and time), or to restrict the comparison to some parts of the pdf.

josephwright commented on May 12, 2024

@u-fischer The pdfinfo output can be coerced to be the same: things like date can be fixed (they all should be with l3build).

What I guess I'm wondering is whether we feel testing individual obj blocks plus pdfinfo output is enough for automated testing of pre-checked PDFs. For example, you say that pdfinfo doesn't mind about missing ending tags, but what's important is whether that shows up in the tag dump. Is that the case?

josephwright commented on May 12, 2024

I've thought more about the extension business, and perhaps needing two is confusing: if one goes with macro-based switches for doing 'info' steps, then it's strange for them to be used with only one type of test input. I think this logic in l3build would make sense:

  • Run test
  • Make .tlg file and compare
  • Parse .log (or .lvt) for 'magic' lines
  • Collect info in e.g. a <name>.tpf (Test PdF) file and compare

josephwright commented on May 12, 2024

(Just to note that having macro-based commands in the .lvt for extracting PDF data doesn't mean they have to be parsed from the log: as the .lvt is likely shorter, it's probably faster to just parse that using Lua in any case.)

u-fischer commented on May 12, 2024

Sorry, I think I'm getting lost a bit: I'm not sure I really get the question.

  • I think we need testing based on the (uncompressed) pdf. E.g. to test that the BDC/EMC operators are correct in the page stream, that no objects got lost, that the escaping/encoding of /Alt and /ActualText arguments is correct, that fake spaces are in the stream, that the unicode mapping is there, etc.
  • I think that this pdf testing is conceptually not different to the current log testing: You take the output (pdf here, log there), normalize it, store it and then compare it with the normalized output of the test run.
  • I also think that the needed pdf normalization is not so difficult, probably even easier than the log normalization -- if you don't try to make tests which compare pdfs of different engines, as the pdf output of pdftex and, say, xetex is too different: one should make engine-specific tests.

So to get it working one needs variables/tools to tell l3build the output filetype (pdf/log/something else) and the "normalization function" which should be used for a test.

Regarding per-test configuration: there is/will be a need for different test setups: pdftex tests for tagging e.g. need more compilation runs than luatex. But one can get around the problem by using config-files and different test folders. On Travis one could probably use some environment variable and set the config-list depending on its value. So while it would be neat to have more control, it is not so vital.
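
A minimal sketch of such a "normalization function" in Lua (hypothetical; it masks only the run-specific fields mentioned earlier, producer, dates and /ID):

  -- Mask the fields of an uncompressed PDF that legitimately differ
  -- between runs and installations before comparison.
  local function normalise_pdf(pdftext)
    pdftext = pdftext:gsub("/Producer%s*%b()", "/Producer (...)")
    pdftext = pdftext:gsub("/CreationDate%s*%b()", "/CreationDate (...)")
    pdftext = pdftext:gsub("/ModDate%s*%b()", "/ModDate (...)")
    pdftext = pdftext:gsub("/ID%s*%[.-%]", "/ID [...]")
    return pdftext
  end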

josephwright commented on May 12, 2024

@u-fischer OK, it sounds like an approach based on extracting some parts of PDFs into .tlg or similar files is workable. So the only real question is how best to express that in input terms. We have two basic options:

  • Call all input files .lvt and pick up that the PDF should be post-processed based on a marker in the file
  • Use a different extension for 'parse the PDF' tests

I can see advantages to both approaches: at a technical level, using a different extension does make branching a bit easier. I'd welcome thoughts on what is clearer to the user.

There's then 'how do we mark parts of the PDF for comparison'. I'm leaning toward a syntax which specifies the begin- and end-of-extraction points, which might read for example

\CHECKPDFSECTION{2 0 obj}{endobj}

which could of course have a focussed variant for objects (just the number).
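
A minimal sketch of that extraction in Lua (hypothetical name; plain-text matching, no patterns):

  -- Pull out the span between a begin marker (e.g. '2 0 obj') and an
  -- end marker (e.g. 'endobj') from an uncompressed PDF.
  local function extract_section(pdftext, startmarker, endmarker)
    local s = pdftext:find(startmarker, 1, true)
    if not s then return nil end
    local _, e = pdftext:find(endmarker, s, true)
    if not e then return nil end
    return pdftext:sub(s, e)
  end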

I'll look to adjust the current code to at least demo how this might work: very much a fluid situation!

As you say, PDF-based testing is going to need to be single-engine with different configs: that all seems workable.

u-fischer commented on May 12, 2024

@josephwright I think we have a third basic option:

  • call all input files .lvt and pick up that the PDF should be post-processed based on a setting in the config-file.

I would have no problem with putting all the pdf-tests in some testfiles-pdf folder.

Regarding a dedicated extension: how does your idea to parse and compare the output of pdfinfo fit in here?

There's then 'how do we mark parts of the PDF for comparison'. I'm leaning toward a syntax which specifies the begin- and end-of-extraction points, which might read for example

That's a possibility, but imho one should first check whether it is not enough to delete a few os-specific lines from the pdf.

josephwright commented on May 12, 2024

Certainly if it's possible then simply 'thinning out' PDFs sounds good: I think the main issue is binary content.

I guess I'm slightly in favour of input files somehow 'knowing' they are for PDF-based tests, either by extension or by content. I'm sure I can find a way of doing pdfinfo: one could just make that a blanket thing.

josephwright commented on May 12, 2024

I think I've got a good idea of how we might look to tackle things now: I'll probably adjust the current code shortly so we can see how 'PDF-to-text' testing works. For the present I'll likely keep the input as .pvt files, simply as that means minimal changes. We can then discuss what is the clearest for the user.

FrankMittelbach commented on May 12, 2024

I would also prefer a separation by extension (or by a magic line inside) - this keeps things pretty clear already at the directory level, and the chances that you want to use a single test in both ways are small imho. I think it is important that the test file speaks for itself and doesn't need a config just to decide what to do with it.

car222222 commented on May 12, 2024

I am pretty sure that Ulrike has (most of:-) the correct suggestions on this one.

Is this discussion going to Ross?

In particular I strongly agree with these two points from Ulrike:

Point 1:
. . . we need testing based on the (uncompressed) pdf. E.g. to test that the BDC/EMC-operators are correct in the page stream, that no objects got lost, that the escaping/encoding of /Alt and /ActualTest arguments is correct, that fake spaces are in the stream, that the unicode mapping is there etc.

Point 2:
. . . pdf testing is conceptually not different to the current log testing: You take the output (pdf here, log there), normalize it, store it and then compare it with the normalized output of the test run.

u-fischer commented on May 12, 2024

@FrankMittelbach

I think it is important that the test file speaks for itself and doesn't need a config just to decide what to do with it

That would certainly be good. But currently a test file already doesn't always speak for itself - you need the configuration file to see which engines should compile it, and how often.

ozross commented on May 12, 2024

OK, I'm glad you guys are looking at all this stuff.

I've been using Preflight for several years now, to verify that I'm building the PDFs correctly. This includes checking that:

  • PDF syntax is correct in all regards (incl. /Alt and /ActualText delimiters, etc.);
  • that Metadata is recorded via the XMP packet – some of the author-supplied /Info fields are deprecated in later revisions of PDF/A and PDF 2.0, in favour of using XMP;
  • object streams are correctly delimited: pdfTeX had this wrong for a while, fixed now;
  • fake spaces occur only within BDC ... EMC blocks;
  • BTW, it is allowable to nest BDC ... EMC blocks, but it is rarely useful to do so, as this can result in content being duplicated upon text-extraction; e.g., for screen reading.
  • the parent tree is built correctly, with an entry for each MCID occurring on each page, in the correct order;
  • /ToUnicode map entries are present for all font characters;
  • tagging is consistent with the declared standards;
  • images use the correct colour space: else use Preflight to produce a new version of the image, which essentially wraps the original graphic up with a small piece of extra coding to do a colour conversion;
  • and many, many more things that may crop up, especially when using content/images produced external to TeX software.

There are hundreds of individual tests, grouped according to types, as in the attached image.
[screenshot of Preflight test groups, 31 July 2018]

Having a look at the details may give some ideas about what tests can usefully be done using TeX-based tools.

Hope this helps.
Ross

josephwright commented on May 12, 2024

First attempt at stripping info from the PDF file is not exactly encouraging: see the Travis-CI failures. It seems that font data is stored entirely differently in the PDF on Windows and on Linux. More importantly, there's no obvious/easy pattern to pick up on in the very simple PDF I'm using to test this. With something more complex, I'm very doubtful that it would be easy to pull out just the 'right' parts.

I'm going to see if I can get something more self-consistent if I set the various compression settings differently. However, if that doesn't work then we are likely back needing to 'opt in' material for comparison, or using external tools (pdfinfo, etc.), or both.

u-fischer commented on May 12, 2024

Could you send me an example of an uncompressed pdf from linux? (Along with the tex file, or by using one of my example files.)

josephwright commented on May 12, 2024

@u-fischer Problem solved: it's a question of getting the font setup correct (Type 1 vs Type 3): things look good now! I'll track down the remaining minor issues, then I think we'll be able to run tests on the 'massaged' PDF, which also means we can look at the output for debugging.

josephwright commented on May 12, 2024

Right, this does seem to work: I'll probably need to adjust the normalisation over time, but if we retain as much of the 'raw' PDF as we can, other tests (as outlined by @ozross) can be used to set up reference data whilst l3build can show what has changed when issues occur.

car222222 commented on May 12, 2024

there is no difference in opinion really.

Sure. But who ever suggested there was such a difference?

Note also that one of the things I agreed with was “use only uncompressed pdfs”, so maybe there was a difference in suggested actions, if not of opinions.

josephwright commented on May 12, 2024

On further testing, with Type 1 fonts we don't have to worry about PDFs varying at all between platforms, or at least not in the cases I've tried. I'll keep a separate extension in case we do have to 'mangle' the PDFs, but this looks much easier than perhaps expected.

josephwright commented on May 12, 2024

I've stuck with removing binary data: when a test fails, this means we do get a useful .diff.
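
A minimal sketch of that removal in Lua (hypothetical; assumes streams containing non-printable bytes are the binary ones worth blanking):

  -- Keep text streams intact, but blank out any stream body containing
  -- non-printable bytes, so a failing test still gives a readable .diff.
  local function strip_binary(pdftext)
    return (pdftext:gsub("stream\r?\n(.-)endstream",
      function(body)
        if body:find("[^%g%s]") then
          return "stream\n[binary data removed]\nendstream"
        end -- returning nil keeps the original stream unchanged
      end))
  end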
