pdf-font-loader-raku's Issues

Handle CFF fonts

CFF fonts are not that common, but I'm intending to port Cairo's OTF subsetting, which converts to CFF.

Font failing to decode

The attached PDF is failing to decode an Identity-H encoded font with its supplied ToUnicode CMAP.

This is apparent when processed via pdf-tag-dump.raku. This script also explores this a little further:

use PDF::Font::Loader;
use PDF::Font::Loader::FontObj;

use PDF::COS::Dict;
use PDF::Lite;
my PDF::Lite $pdf .= open: "/tmp/SSRN-id4337484.pdf";

my  PDF::COS::Dict:D $dict = $pdf.page(1)<Resources><Font><F9>;

my PDF::Font::Loader::FontObj:D $font = PDF::Font::Loader.load-font: :$dict;

my $str = "\x[3]~\0\x[4]\x[1]\x[F]\x[1]µ\x[1]l\x[1]u\x[1]\x[1E]\x[1]]\x[1]o\x[3]U\0\x[3]\x[1]\x[1E]\x[1]\x[9A]\0\x[3]\x[1]\x[2]\x[1]o\x[3]X\x[3]U\0\x[3]\x[3]î\x[3]ì\x[3]î\x[3]í\x[3]V\0";

say  $str.comb(/../).map({$font.decode($_, :str)}).join;

Produces: (bukmeiletal2021, whereas the rendered text is (Abukmeil, et al., 2021

SSRN-id4337484.pdf

Handle custom ligatures in string decoding

For example, the following CMap has a custom 'Th' ligature <00540068> (as well as a standard 'fi' ligature <00660069>). Should it decode as :str to 'Th'?

3208 0 obj
<< /Length 609 >> stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (AAAAAA+F22+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /AAAAAA+F22+0 def
/CMapType 2 def
1 begincodespacerange <02> <90> endcodespacerange
3 beginbfchar
<20> <0020>
<3b> <003B>
<90> <2019>
endbfchar
9 beginbfrange
<28> <29> <0028>
<2c> <36> <002C>
<38> <39> <0038>
<41> <50> <0041>
<52> <54> <0052>
<56> <57> <0056>
<59> <5a> <0059>
<61> <7a> <0061>
<8d> <8e> <201C>
endbfrange
2 beginbfrange
<02> <02> [<00540068>]
<03> <03> [<00660069>]
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end

endstream
endobj
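
One possible reading of the question, sketched in plain Raku without PDF::Font::Loader: treat each ToUnicode target as a run of UTF-16BE code units and decode a ligature code to the full string. The helper name `target-to-str` and the `%to-unicode` hash are illustrative assumptions, not part of the module, and the sketch only handles BMP targets (no surrogate pairs).

```raku
# Hypothetical helper, not part of PDF::Font::Loader: turn a ToUnicode
# target such as '00540068' into the string it denotes.
# Assumes BMP-only targets (no UTF-16 surrogate pairs).
sub target-to-str(Str $hex --> Str) {
    $hex.comb(4).map({ .parse-base(16).chr }).join;
}

# The two single-code ranges from the CMap above, keyed by character code.
my %to-unicode =
    0x02 => target-to-str('00540068'),
    0x03 => target-to-str('00660069');

say %to-unicode{0x02};   # Th
say %to-unicode{0x03};   # fi
```

Under this reading, a :str decode of code <02> would yield the two-character string 'Th' rather than a single code point.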

Investigate handling of OpenType collections (*.otc)

I don't think these are as common as TrueType collections, but it would be good to support them, if we can do it cheaply.

FreeType claims to support these. Hopefully handling is similar to *.ttc (TrueType collections).

Large distribution size. Overuse of DejaVuSans.ttf in test suite

I'm thinking this is the reason this module isn't uploading to zef. In any case, it is an outlier in terms of distribution size, largely because
there are multiple saved PDFs with DejaVuSans fully embedded.

As a first cut, it can probably be replaced by the 10x smaller Vera.ttf in most/all tests.

/Identity-H + /ToUnicode decoding of ligatures

The attached golfed PDF has an Identity-H encoded subset with a /ToUnicode mapping that includes the entry <0193> <00660069>,
mapping ligature CID 0x0193 to 'fi'. This is not currently respected, as in:

use Test;
plan 1;
use PDF::Lite;
use PDF::Font::Loader;
use PDF::Font::Loader::FontObj;

my PDF::Lite $pdf .= open: "identity-h-lig.pdf";

my $dict = $pdf.page(1)<Resources><Font><C2_0>;

my PDF::Font::Loader::FontObj:D $font = PDF::Font::Loader.load-font: :$dict;

my $bytes = buf8.new(0x00,0x32, 0x00,0x49, 0x01,0x93, 0x00,0x46, 0x00,0x48).decode: "latin-1";
is $font.decode($bytes, :str), 'Office';

which produces:

1..1
not ok 1 - 
# Failed test at /tmp/identity-h-lig.t line 14
# expected: 'Office'
#      got: 'Offce'
# You failed 1 test of 1

identity-h-lig.pdf
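
For reference, the expected behaviour can be sketched in plain Raku, independent of PDF::Font::Loader. The `%to-unicode` hash is a hypothetical stand-in for the font's /ToUnicode map: only the 0x0193 entry is quoted above, and the other four entries are inferred from the expected result 'Office'. `decode-identity-h` is an illustrative name, and it assumes an even number of bytes.

```raku
# Hypothetical stand-in for the font's ToUnicode map. Only the 0x0193
# entry is quoted in this issue; the others are inferred from the
# expected result 'Office'.
my %to-unicode = 0x0032 => 'O', 0x0049 => 'f', 0x0193 => 'fi',
                 0x0046 => 'c', 0x0048 => 'e';

# Identity-H uses fixed two-byte, big-endian character codes.
sub decode-identity-h(Blob $bytes, %map --> Str) {
    $bytes.list.map(-> $hi, $lo {
        my $cid = ($hi +< 8) +| $lo;
        %map{$cid} // "\x[FFFD]";   # U+FFFD for unmapped codes
    }).join;
}

my $bytes = buf8.new(0x00,0x32, 0x00,0x49, 0x01,0x93, 0x00,0x46, 0x00,0x48);
say decode-identity-h($bytes, %to-unicode);   # Office
```

The point of the sketch is that a map value may be longer than one character, so the decoder has to join strings rather than collect one code point per CID.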

Subsetted Identity-H CMaps appear to be incorrect.

As a simple experiment, if I cut and paste from the xpdf display of tmp/subset.pdf, after running t/subset.t, I get junk for the first (ttf) and third (identity-h encoded) fonts.

Seems the CMaps are not correct.

HarfBuzz shaping support

HarfBuzz does both shaping and font selection. It seems easiest to implement as a variation of identity-h encoding.

Kerning also needs to be taken over by the encoder. It's enabled by turning on the :kern feature in HarfBuzz fonts.

Glyph maps can't handle type1 charsets

Here's an example of a font that can't currently be represented. From 000377-001.pdf (attached):

19 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /GMOICK+MSTT31c531S00
  /Encoding 20 0 R
  /FirstChar 1
  /FontDescriptor 21 0 R
  /LastChar 61
  /Widths [ 1004 836 391 334 558 613 502 334 772 606 440 331 440 248 248 772 496 552 827 552 606 248 386 716 606 827 552 496 496 1004 613 613 662 496 496 552 496 613 893 388 659 769 247 769 988 604 714 823 385 604 659 659 823 604 604 823 494 659 329 329 823 ]
>>
endobj

20 0 obj
<<
  /Type /Encoding
  /Differences [ 1 /g179 /g50 /g85 /g76 /g74 /g81 /g68 /g79 /g36 /g69 /g86 /g87 /g70 /g17 /g3 /g46 /g72 /g92 /g90 /g82 /g71 /g29 /g45 /g53 /g83 /g39 /g89 /g23 /g26 /g48 /g75 /g88 /g55 /g20 /g21 /g41 /g22 /g78 /g80 /g73 /g61 /g56 /g15 /g57 /g58 /g54 /g38 /g43 /g44 /g40 /g60 /g37 /g42 /g51 /g47 /g49 /g93 /g59 /g77 /g16 /g52 ]
>>
endobj

21 0 obj
<<
  /Type /FontDescriptor
  /Ascent 0
  /CapHeight 0
  /CharSet (/g38/g179/g23/g37/g36/g52/g58/g72/g50/g20/g15/g69/g21/g73/g56/g85/g86/g16/g88/g61/g76/g87/g82/g60/g74/g43/g70/g78/g80/g81/g47/g89/g17/g57/g83/g68/g41/g3/g75/g79/g51/g44/g59/g26/g92/g93/g55/g54/g49/g46/g45/g71/g29/g53/g40/g90/g42/g77/g39/g22/g48)
  /Descent 0
  /Flags 4
  /FontBBox [ -16 -265 1004 727 ]
  /FontFile3 22 0 R
  /FontName /GMOICK+MSTT31c531S00
  /ItalicAngle 0
  /StemV 0
>>
endobj

It's completely making up its own encoding with custom glyphs and no Unicode map. We couldn't do much with it, other than render it, but it does show our representations are correct. Also we should be taking account of /CharSet to set up custom encoding -> CID mappings.
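
As a rough illustration of the data involved, the following plain-Raku sketch expands a /Differences array into a code-to-glyph-name map and splits a /CharSet string into glyph names. Both helpers are hypothetical and work on already-parsed values (names as plain strings); how glyph names are then tied to CIDs in the embedded font is not covered here.

```raku
# Expand a /Differences array into a code => glyph-name map.
# An integer resets the current code; each name takes the next code.
sub expand-differences(@diffs --> Hash) {
    my %names;
    my Int $code = 0;
    for @diffs {
        when Int { $code = $_ }
        default  { %names{$code++} = $_ }
    }
    %names;
}

# First few entries of object 20 above, with names written as plain strings.
my %enc = expand-differences([1, 'g179', 'g50', 'g85', 'g76']);
say %enc{1};   # g179
say %enc{4};   # g76

# /CharSet is a string of '/'-prefixed glyph names.
my @charset = '/g38/g179/g23'.split('/', :skip-empty);
say @charset;  # [g38 g179 g23]
```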

Reusing a font within a PDF copies it

PDF::Font::Loader::Dict builds a Font::Loader::FreeType object with a brand new dictionary and stream.

Using this font to write back to the PDF works, but at the expense of a replicated font object, including the underlying Font Descriptor, CMap and stream.

Simple, but inefficient.

Support for font shaping and subsetting

Work is underway on Raku bindings to the HarfBuzz font shaper.

Work is also separately underway on bindings to its font subsetting capability.

Would be good to integrate HarfBuzz for glyph selection at least. Ideally layout as well, but might require some work in PDF::Content to support an external shaper.

I think we need to combine it with Identity encoding and embedding.

Although both go well together, font subsetting is somewhat independent of shaping.

Note: Most Linux distros don't support HarfBuzz's subsetting capability yet, which restricts its current use to those keen enough to build HarfBuzz from source, with subsetting enabled.

FreeType thread safety issues

I'm sometimes seeing issues when running pdf2image.raku from PDF::To::Cairo. For example:

david@box:~/git/PDF-To-Cairo-raku$ raku -I . bin/pdf2image.raku /tmp/out1.pdf 
saving page 1 -> PNG /tmp/out1-001.png...
saving page 9 -> PNG /tmp/out1-009.png...
saving page 17 -> PNG /tmp/out1-017.png...
saving page 25 -> PNG /tmp/out1-025.png...
saving page 33 -> PNG /tmp/out1-033.png...
saving page 41 -> PNG /tmp/out1-041.png...
saving page 49 -> PNG /tmp/out1-049.png...
loading font: Times-Bold -> /usr/share/fonts/opentype/urw-base35/NimbusRoman-Bold.otf
error processing glyph index: 44: FreeType Error: invalid argument
  in block  at /home/david/git/rakudo/install/share/perl6/site/sources/BB3ACC8ADEDFA6495C127C660154B220EF90E5A2 (PDF::Font::Loader::Enc) line 100
A worker in a parallel iteration (hyper or race) initiated here:
  in method save-as at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 574
  in sub MAIN at bin/pdf2image.raku line 33
  in block <unit> at bin/pdf2image.raku line 18

Died at:
    bad Cairo status 1 CAIRO_STATUS_NO_MEMORY after ShowText(Raku by example 101) operation
      in sub  at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 541
      in block  at /home/david/git/rakudo/install/share/perl6/site/sources/ADBE257FC9A2D570E7135DBB29F8E0E9C69FB6F9 (PDF::Content::Ops) line 947
      in method op at /home/david/git/rakudo/install/share/perl6/site/sources/ADBE257FC9A2D570E7135DBB29F8E0E9C69FB6F9 (PDF::Content::Ops) line 945
      in block  at /home/david/git/rakudo/install/share/perl6/site/sources/ADBE257FC9A2D570E7135DBB29F8E0E9C69FB6F9 (PDF::Content::Ops) line 1024
      in method ops at /home/david/git/rakudo/install/share/perl6/site/sources/ADBE257FC9A2D570E7135DBB29F8E0E9C69FB6F9 (PDF::Content::Ops) line 1023
      in method render at /home/david/git/rakudo/install/share/perl6/site/sources/94B207285FE37A6B2E4D3E0BD474EC06051A8163 (PDF::Content::Graphics) line 81
      in method render at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 60
      in method save-as-image at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 559
      in block  at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 587

FreeType is not thread-safe for concurrent access to face objects, which seems to be what is going on here.

Investigation needed.
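
One conventional mitigation, sketched here under assumptions rather than taken from the module, is to serialise every call that touches a shared face behind a single Lock. `glyph-advance` is a hypothetical stand-in for a FreeType glyph lookup and returns a placeholder value; `Lock.protect` and `.race` are core Raku.

```raku
# One lock guarding every call that touches the shared face.
my Lock $face-lock .= new;

# Hypothetical stand-in for a FreeType glyph lookup on a shared face.
my %seen;
sub glyph-advance(Int $gid --> Int) {
    $face-lock.protect: {
        %seen{$gid}++;      # shared state, only touched under the lock
        500 + $gid;         # placeholder metric
    }
}

# Workers run in parallel, but lookups are serialised by the lock.
my @advances = (1..64).race(:batch(8)).map({ glyph-advance($_) });
say @advances.elems;   # 64
```

This trades some parallelism for safety; a per-thread face would avoid the contention, at the cost of loading the font more than once.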

What happens if 'find-font' finds multiple files?

From a trial it seems to return the first file found meeting the input criteria. So the docs should say that.

The docs should also clearly state the :find-font option requires Raku module 'FontConfig'.

Investigate font-subsetting

I'm considering doing a trial implementation with font-forge scripting.

Will most likely be implemented in PDF::Font::Loader::FreeType.cb-finish() method

Intending to create subset branch

Metrics mismatch for core fonts

Currently, FontConfig is used to load any system font that best matches the core font and takes the metrics from there.

This may cause problems depending on the font selected and how well it matches the core font metrics.

I'm seeing evidence of this with PDF::To::Cairo rendering of PDFs produced by Pod::To::PDF::API6 with core fonts.

Note that other un-embedded fonts use widths (/Widths or /W). So also setting the widths using core fonts metrics is probably better.

Ideal solution is to choose, or have fonts available that closely match core font metrics.

More investigation needed.

Example is confusing for understanding the difference between finding and loading a font

To me it is clearer to separate the two actions, if I understand correctly:

use PDF::Font::Loader :load-font, :find-font;
my $font-file = find-font: :family<DejaVu>, :slant<italic>, :lang<en>;
my $font      = load-font: :file($font-file);

# use the font in PDF-Lite
# ...
$pdf.add-page.text: {
    .font = $font;
    .text-position = [10, 600];
    .say: 'Hello, world';
}
# ...

SparkyCI build fails

Hi!

Here is the report - http://sparrowhub.io:2222/report/433

21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-fixed.t ............... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-utf16.t ............... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-utf32.t ............... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-utf8.t ................ ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-variable.t ............ ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/fontobj.t .................. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/load-font.t ................ ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/pdf-text-align.t ........... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/reuse-cid.t ................ ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/reuse-type1.t .............. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/reuse-unembedded.t ......... Dubious, test returned 255
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] All 2 subtests passed
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/subset.t ................... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/threads.t .................. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1-add-encoding.t ....... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1-encoding.t ........... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1-encoding_issue#12.t .. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1-stream.t ............. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1.t .................... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type3-basic.t .............. Dubious, test returned 255
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] All 8 subtests passed
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] All tests successful.
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader]
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Test Summary Report
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] -------------------
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/reuse-unembedded.t (Wstat: 65280 Tests: 0 Failed: 0)
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Non-zero exit status: 255
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Parse errors: Bad plan. You planned 2 tests but ran 0.
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type3-basic.t (Wstat: 65280 Tests: 1 Failed: 0)
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Non-zero exit status: 255
21:08:51 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Parse errors: Bad plan. You planned 8 tests but ran 1.
21:08:51 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Files=19, Tests=130, 156 wallclock secs
21:08:51 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Result: FAILED
21:08:51 05/12/2022 [bash: run tests] :: ===> Testing [FAIL]: PDF::Font::Loader:ver<0.6.2>:auth
21:08:51 05/12/2022 [bash: run tests] :: task exit status: 1

HTH

Aleksei

More forgiving handling of TrueType font collections

The main issue with handling TrueType collections is that it's unsafe to embed them.

So firstly, we shouldn't need to throw an error, if the font is being loaded with :!embed.

We may also be better off warning and unsetting the embed flag rather than dying.
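
A minimal sketch of that warn-and-continue policy, in plain Raku. Both helper names are hypothetical and not part of PDF::Font::Loader; the only format fact relied on is that TrueType and OpenType collection files begin with the four-byte tag 'ttcf'.

```raku
# Both TrueType and OpenType collections start with the tag 'ttcf'.
sub is-collection(Blob $header --> Bool) {
    $header.elems >= 4 && $header.subbuf(0, 4).decode('latin-1') eq 'ttcf';
}

# Hypothetical policy helper: warn and switch embedding off for
# collections, instead of throwing.
sub effective-embed(Blob $header, Bool :$embed = True --> Bool) {
    if $embed && is-collection($header) {
        warn "font collection detected; loading without embedding";
        return False;
    }
    $embed;
}

say effective-embed('ttcf'.encode('latin-1'));           # False (after a warning)
say effective-embed(buf8.new(0x00, 0x01, 0x00, 0x00));   # True
```

With `:!embed` passed explicitly, the helper returns False without warning, which matches the first point above.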

I'm also curious to see how TrueType collections interact with subsets. Some more testing needed in the Font::Subset module.