
Font loader for the PDF tool-chain

License: Artistic License 2.0

pdf-font-loader-raku's Introduction

PDF-raku

Overview

This is a low-level Raku module for accessing and manipulating data from PDF documents.

It presents a seamless view of the data in PDF or FDF documents, handling indexing, compression, encryption, fetching of indirect objects and unpacking of object streams behind the scenes. It can read, edit, create and incrementally update PDF files.

This module understands physical data structures rather than the logical document structure. It is primarily intended as a base for higher level modules, or to explore or patch data in PDF or FDF files.

It is possible to construct basic documents and perform simple edits by direct manipulation of PDF data. This requires some knowledge of how PDF documents are structured. Please see 'The Basics' and 'Recommended Reading' sections below.

Classes/roles in this module include:

PDF - PDF document root (trailer)
PDF::IO::Reader - for indexed random access to PDF files
PDF::IO::Filter - a collection of standard PDF decoding and encoding tools for PDF data streams
PDF::IO::IndObj - base class for indirect objects
PDF::IO::Serializer - data marshalling utilities for the preparation of full or incremental updates
PDF::IO::Crypt - decryption / encryption
PDF::IO::Writer - for the creation or update of PDF files
PDF::COS - Raku Bindings to PDF objects [Carousel Object System, see COS]

Example Usage

To create a one-page PDF that displays 'Hello, world!':

#!/usr/bin/env raku
# creates examples/helloworld.pdf
use PDF;
use PDF::COS::Name;
use PDF::COS::Dict;
use PDF::COS::Stream;
use PDF::COS::Type::Info;

sub prefix:</>($s) { PDF::COS::Name.COERCE($s) };

# construct a simple PDF document from scratch
my PDF $pdf .= new;
my PDF::COS::Dict $catalog = $pdf.Root = { :Type(/'Catalog') };

my @MediaBox  = 0, 0, 250, 100;

# define font /F1 as core-font Helvetica
my %Resources = :Procset[ /'PDF', /'Text'],
                :Font{
                    :F1{
                        :Type(/'Font'),
                        :Subtype(/'Type1'),
                        :BaseFont(/'Helvetica'),
                        :Encoding(/'MacRomanEncoding'),
                    },
                };

my PDF::COS::Dict $page-index = $catalog<Pages> = { :Type(/'Pages'), :@MediaBox, :%Resources, :Kids[], :Count(0) };
# add some standard metadata
my PDF::COS::Type::Info $info = $pdf.Info //= {};
$info.CreationDate = DateTime.now;
$info.Producer = "Raku PDF";

# define some basic content
my PDF::COS::Stream() $Contents = { :decoded("BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET" ) };

# create a new page. add it to the page tree
$page-index<Kids>.push: { :Type(/'Page'), :Parent($page-index), :$Contents };
$page-index<Count>++;

# save the PDF to a file
$pdf.save-as: 'examples/helloworld.pdf';

Then to update the PDF, adding another page:

#!/usr/bin/env raku
use PDF;
use PDF::COS::Stream;
use PDF::COS::Type::Info;

my PDF $pdf .= open: 'examples/helloworld.pdf';

# locate the document root and page tree
my $catalog = $pdf<Root>;
my $Parent = $catalog<Pages>;

# create additional content, use existing font /F1
my PDF::COS::Stream() $Contents = { :decoded("BT /F1 16 Tf  15 25 Td (Goodbye for now!) Tj ET" ) };

# create a new page. add it to the page-tree
$Parent<Kids>.push: { :Type( :name<Page> ), :$Parent, :$Contents };
$Parent<Count>++;

# update or create document metadata. set modification date
my PDF::COS::Type::Info $info = $pdf.Info //= {};
$info.ModDate = DateTime.now;

# incrementally update the existing PDF
$pdf.update;

Description

A PDF file consists of data structures, including dictionaries (hashes), arrays, numbers and strings, plus streams for holding graphical data such as images, fonts and general content.

PDF files are also indexed for random access and may also have internal compression and/or encryption.

They have a reasonably well specified structure. The document starts from the Root entry in the trailer dictionary, which is the main entry point into a PDF.

This module is based on the PDF 1.7 specification (ISO 32000-1:2008). It implements syntax, basic data-types, serialization and encryption rules as described in the first four chapters of the specification. Read and write access to data structures is via direct manipulation of tied arrays and hashes.

The Basics

The examples/helloworld.pdf file that we created above contains:

%PDF-1.3
%...(control characters)
1 0 obj <<
  /CreationDate (D:20151225000000Z00'00')
  /Producer (Raku PDF)
>>
endobj

2 0 obj <<
  /Type /Catalog
  /Pages 3 0 R
>>
endobj

3 0 obj <<
  /Type /Pages
  /Count 1
  /Kids [ 4 0 R ]
  /MediaBox [ 0 0 250 100 ]
  /Resources <<
    /Font <<
      /F1 6 0 R
    >>
    /Procset [ /PDF /Text ]
  >>
>>
endobj

4 0 obj <<
  /Type /Page
  /Contents 5 0 R
  /Parent 3 0 R
>>
endobj

5 0 obj <<
  /Length 44
>> stream
BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET
endstream
endobj

6 0 obj <<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Helvetica
  /Encoding /MacRomanEncoding
>>
endobj

xref
0 7
0000000000 65535 f 
0000000014 00000 n 
0000000101 00000 n 
0000000155 00000 n 
0000000334 00000 n 
0000000404 00000 n 
0000000501 00000 n 
trailer
<<
  /ID [ <d743a886fcdcf87b69c36548219ea941> <d743a886fcdcf87b69c36548219ea941> ]
  /Info 1 0 R
  /Root 2 0 R
  /Size 7
>>
startxref
610
%%EOF

The PDF is composed of a series of indirect objects. For example, the first object is:

1 0 obj <<
  /CreationDate (D:20151225000000Z00'00')
  /Producer (Raku PDF)
>> endobj

It's an indirect object with object number 1 and generation number 0, with a << ... >> delimited dictionary containing the date that the document was created and the producer. This PDF dictionary is roughly equivalent to the Raku hash:

{ :CreationDate("D:20151225000000Z00'00'"), :Producer("Raku PDF"), }
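The /CreationDate value is a PDF date string of the form D:YYYYMMDDHHmmSS followed by a timezone suffix. As a rough sketch in plain Raku (parse-pdf-date is a hypothetical helper, not part of this module, and it ignores the timezone suffix), the fields can be read into a DateTime:

```raku
# Hypothetical helper: read the date and time fields of a PDF date string.
# The timezone suffix (e.g. Z00'00') is ignored, so the result is treated as UTC.
sub parse-pdf-date(Str $date) {
    $date ~~ /^ 'D:' (\d ** 4) (\d ** 2) (\d ** 2) (\d ** 2) (\d ** 2) (\d ** 2) /
        or die "not a full PDF date: $date";
    DateTime.new:
        year => +$0, month  => +$1, day    => +$2,
        hour => +$3, minute => +$4, second => +$5;
}

# the hash from above, with the date decoded
my %info = :CreationDate("D:20151225000000Z00'00'"), :Producer("Raku PDF");
say parse-pdf-date(%info<CreationDate>);
```

The module itself exposes dates through typed accessors such as $info.CreationDate, so a helper like this is only useful for understanding the raw string.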

The bottom of the PDF contains:

trailer
<<
  /ID [ <d743a886fcdcf87b69c36548219ea941> <d743a886fcdcf87b69c36548219ea941> ]
  /Info 1 0 R
  /Root 2 0 R
  /Size 7
>>
startxref
610
%%EOF

The << ... >> delimited section is the trailer dictionary and the main entry point into the document. The entry /Info 1 0 R is an indirect reference to the first object (object number 1, generation 0) described above. The entry /Root 2 0 R points to the root of the actual PDF document, commonly known as the Document Catalog.

Immediately above the trailer is the cross reference table:

xref
0 7
0000000000 65535 f 
0000000014 00000 n 
0000000101 00000 n 
0000000155 00000 n 
0000000334 00000 n 
0000000404 00000 n 
0000000501 00000 n

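This indexes the indirect objects in the PDF for random access. Each entry gives an object's byte offset and generation number, followed by a flag: n for an object that is in use, f for a free entry.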
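To make the layout concrete, here is a small sketch in plain Raku that splits such entries into their fields. It assumes the classic fixed-width form (a 10-digit offset, a 5-digit generation number and a status flag); parse-xref-entry is a hypothetical helper and is not how PDF::IO::Reader is implemented.

```raku
# Parse one cross reference entry of the form "oooooooooo ggggg s":
# a 10-digit byte offset, a 5-digit generation number and a status flag.
sub parse-xref-entry(Str $line) {
    $line ~~ /^ (\d ** 10) ' ' (\d ** 5) ' ' (<[nf]>) /
        or die "malformed xref entry: $line";
    # 'n' marks an object that is in use, 'f' a free entry
    %( offset => +$0, gen => +$1, in-use => $2 eq 'n' );
}

my @lines = '0000000000 65535 f ', '0000000014 00000 n ', '0000000101 00000 n ';
my @entries = @lines.map: -> $line { parse-xref-entry($line) };

# the subsection header "0 7" says that numbering starts at object 0
for @entries.kv -> $obj-num, %e {
    say "object $obj-num: offset %e<offset>, generation %e<gen>, {%e<in-use> ?? 'in use' !! 'free'}";
}
```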

We can quickly put PDF to work using the Raku REPL, to better explore the document:

snoopy: ~/git/PDF-raku $ raku -M PDF
> my $pdf = PDF.open: "examples/helloworld.pdf"
ID => [CÜ{ÃHADCN:C CÜ{ÃHADCN:C], Info => ind-ref => [1 0], Root => ind-ref => [2 0]
> $pdf.keys
(Root Info ID)

This is the root of the PDF, loaded from the trailer dictionary.

> $pdf<Info>
{CreationDate => D:20151225000000Z00'00', ModDate => D:20151225000000Z00'00', Producer => Raku PDF}

That's the document information entry, commonly used to store basic meta-data about the document.

(PDF::IO has conveniently fetched indirect object 1 from the PDF, when we dereferenced this entry).

> $pdf<Root>
{Pages => ind-ref => [3 0], Type => Catalog}

The trailer Root entry references the document catalog, which contains the actual PDF content. Exploring further; the catalog potentially contains a number of pages, each with content.

> $pdf<Root><Pages>
{Count => 1, Kids => [ind-ref => [4 0]], MediaBox => [0 0 250 100], Resources => Font => F1 => ind-ref => [6 0], Type => Pages}
> $pdf<Root><Pages><Kids>[0]
{Contents => ind-ref => [5 0], Parent => ind-ref => [3 0], Type => Page}
> $pdf<Root><Pages><Kids>[0]<Contents>
{Length => 44}
"BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET"

The page /Contents entry is a PDF stream which contains graphical instructions. In the above example, they output the text Hello, world! at coordinates 15, 25, using font /F1 at size 24.

Reading and Writing of PDF files:

PDF is a base class for opening or creating PDF documents.

my $pdf = PDF.open("mydoc.pdf", :repair)
Opens an input PDF (or FDF) document.
- :!repair causes the reader to load only the trailer dictionary and cross reference tables from the tail of the PDF (Cross Reference Table or a PDF 1.5+ Stream). Remaining objects will be lazily loaded on demand.
- :repair causes the reader to perform a full scan, ignoring and recalculating the cross reference stream/index and stream lengths. This can be handy if the PDF document has been hand-edited.

$pdf.update
Performs an incremental update to the input PDF, which must be an indexed PDF (not applicable to PDFs opened with :repair, FDF or JSON files). A new section is appended to the PDF that contains only updated and newly created objects. This method can be used as a fast and efficient way to make small updates to a large existing PDF document.
- :diffs(IO::Handle $fh) - saves just the updates to an alternate location. This can later be appended to the base PDF to reproduce the updated PDF.

$pdf.save-as("mydoc-2.pdf", :compress, :stream, :preserve, :rebuild)
Saves a new document, including any updates. Options:
- :compress - compress objects for minimal size
- :!compress - uncompress objects for human readability
- :stream - write the PDF progressively
- :preserve - copy the input PDF, then incrementally update. This is generally faster and ensures that any digital signatures are not invalidated.
- :rebuild - discard any unreferenced objects and renumber the remaining objects. It may be a good idea to rebuild a PDF document that has been incrementally updated a number of times.

Note that the :compress and :rebuild options are a trade-off. The document may take longer to save, however file-sizes and the time needed to reopen the document may improve.

$pdf.save-as("mydoc.json", :compress, :rebuild); my $pdf2 = $pdf.open: "mydoc.json"
Documents can also be saved to and opened from an intermediate JSON representation. This can be handy for debugging, analysis and/or ad-hoc patching of PDF files.

Reading PDF Files

The .open method loads a PDF index (cross reference table and/or stream). The document can then be accessed randomly via the .ind-obj(...) method.

The document can be traversed by dereferencing Array and Hash objects. The reader will load indirect objects via the index, as needed.

use PDF::IO::Reader;
use PDF::COS::Name;

my PDF::IO::Reader $reader .= new;
$reader.open: 'examples/helloworld.pdf';

# objects can be directly fetched by object-number and generation-number:
my $page1 = $reader.ind-obj(4, 0).object;

# Hashes and arrays are tied. This is usually more convenient for navigating
my $pdf = $reader.trailer<Root>;
$page1 = $pdf<Pages><Kids>[0];

# Tied objects can also be updated directly.
$reader.trailer<Info><Creator> = PDF::COS::Name.COERCE: 't/helloworld.t';

Utility Scripts

pdf-rewriter.raku [--repair] [--rebuild] [--stream] [--[/]compress] [--password=Xxx] [--decrypt] [--class=Module] [--render] <pdf-or-json-file-in> [<pdf-or-json-file-out>]
This script is a thin wrapper for the PDF .open and .save-as methods. It can typically be used to:
- uncompress or render a PDF for human readability
- repair a PDF whose cross-reference index or stream lengths have become invalid
- convert between PDF and JSON

Decode Filters

Filters are used to compress or decompress stream data in objects of type PDF::COS::Stream. These are implemented as follows:

Filter Name	Short Name	Filter Class
ASCIIHexDecode	AHx	PDF::IO::Filter::ASCIIHex
ASCII85Decode	A85	PDF::IO::Filter::ASCII85
CCITTFaxDecode	CCF	NYI
Crypt		NYI
DCTDecode	DCT	NYI
FlateDecode	Fl	PDF::IO::Filter::Flate
LZWDecode	LZW	PDF::IO::Filter::LZW (`decode` only)
JBIG2Decode		NYI
JPXDecode		NYI
RunLengthDecode	RL	PDF::IO::Filter::RunLength

Input to all filters is byte strings, with characters in the range \x0 ... \xFF. latin-1 encoding is recommended to enforce this.

Each filter has encode and decode methods, which accept and return latin-1 encoded strings, or binary blobs.

my Blob $encoded = PDF::IO::Filter.encode( :dict{ :Filter<RunLengthDecode> },
                                      "This    is waaay toooooo loooong!");
say $encoded.bytes;
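For readers curious about what a filter does to the bytes, here is a minimal sketch of run-length decoding in plain Raku. It follows the RunLengthDecode rules from the PDF specification and is not the code used by PDF::IO::Filter::RunLength; run-length-decode is a hypothetical name.

```raku
# Decode RunLengthDecode data. Each run starts with a length byte L:
#   L in 0..127   - copy the next L+1 bytes literally
#   L in 129..255 - repeat the next byte 257-L times
#   L == 128      - end of data
sub run-length-decode(Blob $in) {
    my @out;
    my $i = 0;
    while $i < $in.elems {
        my $len = $in[$i++];
        last if $len == 128;
        if $len < 128 {
            @out.append: $in[$i ..^ $i + $len + 1];
            $i += $len + 1;
        }
        else {
            # fetch the byte once; xx would otherwise re-evaluate its left side
            my $byte = $in[$i++];
            @out.append: $byte xx (257 - $len);
        }
    }
    Blob.new: @out;
}

# three literal bytes 'abc', then 'o' repeated four times, then end of data
my $sample = Blob.new: 2, 'a'.ord, 'b'.ord, 'c'.ord, 253, 'o'.ord, 128;
say run-length-decode($sample).decode('latin-1');   # abcoooo
```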

Encryption

PDF::IO::Crypt supports RC4 and AES encryption (revisions /R 2 - 4 and versions /V 1 - 4 of PDF Encryption).

To open an encrypted PDF document, specify either the user or owner password: PDF.open( "enc.pdf", :password<ssh!>)

A document can be encrypted using the encrypt method: $pdf.encrypt( :owner-pass<ssh1>, :user-pass<abc>, :aes )

:aes encrypts the document using stronger V4 AES encryption, introduced with PDF 1.6.

Note that it's quite common to leave the user-password blank. This indicates that the document is readable by anyone, but may have restrictions on update, printing or copying of the PDF.

An encrypted PDF can be saved as JSON. It will remain encrypted, and passwords may be required to reopen it.

Built-in objects

PDF::COS also provides a few essential derived classes that are needed to read and write PDF files, including encryption, object streams and cross reference streams.

Class	Base Class	Description
PDF	PDF::COS::Dict	document entry point - the trailer dictionary
PDF::COS::Type::Encrypt	PDF::COS::Dict	PDF Encryption/Permissions dictionary
PDF::COS::Type::Info	PDF::COS::Dict	Document Information Dictionary
PDF::COS::Type::ObjStm	PDF::COS::Stream	PDF 1.5+ Object stream (packed indirect objects)
PDF::COS::Type::XRef	PDF::COS::Stream	PDF 1.5+ Cross Reference stream
PDF::COS::TextString	PDF::COS::ByteString	Implements the 'text-string' data-type

pdf-font-loader-raku's People


melezhik dwarring tbrowder

pdf-font-loader-raku's Issues

Investigate handling of OpenType collections (*.otc)

I don't think these are as common as TrueType collections, but it would be good to support them, if we can do it cheaply.

freetype claims to support these. Hoping handling may be similar to *.ttc (TrueType collections).

Incorrect font construction

The two attached PDFs compare HTML::Canvas::To::Cairo and HTML::Canvas::To::PDF rendering of the same font.

PDF::Font::Loader's version is different and seems to be incorrect.
cairo03.pdf
pdf-fontloader03.pdf

Example is confusing for understanding the difference between finding and loading a font

To me it is clearer to separate the two actions, if I understand correctly:

use PDF::Font::Loader :load-font, :find-font;
my $font-file = find-font: :family<DejaVu>, :slant<italic>, :lang<en>;
my $font      = load-font: $font-file;

# use the font in PDF-Lite
# ...
$pdf.add-page.text: {
    .font = $font;
    .text-position = [10, 600];
    .say: 'Hello, world';
}
# ...

Handle CID-keyed CFF fonts

This module cannot currently handle these fonts. Example attached.

NotoSand-Reg.zip

I think these should map to font sub-type Type01C.

SparrowCI test keeps failing recently

Hi! This is a log - https://ci.sparrowhub.io/report/4063

HTH

Alexey

Variable CMap encoding not supported

Example PDF with variable encoding attached. See font /F2 8 0 R on page 1 and CMap 9 0 R
ttfont_demo_jp.pdf

Investigate font-subsetting

I'm considering doing a trial implementation with font-forge scripting.

Will most likely be implemented in PDF::Font::Loader::FreeType.cb-finish() method

Intending to create subset branch

If possible, for find-font, add named param to select: mono, serif, or sans

It looks like one of the fontconfig bin programs (fc-list) shows sans, serif, or mono as "atoms" for fonts it lists. You are not presently using the "style" as a named parameter that I see so far.

HarfBuzz shaping support

HarfBuzz does both shaping and font selection. It seems to be easiest to implement as a variation of identity-h encoding.

Kerning also needs to be taken over by the encoder. It's enabled by turning on the :kern feature in HarfBuzz fonts.

Add a :kern attribute to 'find-font' for limiting font searches to fonts with kerning ability

When

Large distribution size. Overuse of DejaVuSans.ttf in test suite

I'm thinking this is the reason this module isn't uploading to zef. Anyway it is an outlier in terms of distribution size, largely because there are multiple saved PDFs with DejaVuSans fully embedded.

As a first cut, it can probably be substituted for the 10x smaller Vera.ttf in most/all tests.

Support for font shaping and subsetting

Work is underway on Raku bindings to the HarfBuzz font shaper.

Also separately underway on bindings to its font subsetting capability.

Would be good to integrate HarfBuzz for glyph selection at least. Ideally layout as well, but might require some work in PDF::Content to support an external shaper.

I think we need to combine it with Identity encoding and embedding.

Although both go well together, font subsetting is somewhat independent of shaping.

Note: Most Linux distros don't support HarfBuzz's subsetting capability yet, which restricts its current use to those keen enough to build HarfBuzz from source, with subsetting enabled.

SparkyCI build fails

Hi!

Here is the report - http://sparrowhub.io:2222/report/433

21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-fixed.t ............... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-utf16.t ............... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-utf32.t ............... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-utf8.t ................ ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/cmap-variable.t ............ ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/fontobj.t .................. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/load-font.t ................ ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/pdf-text-align.t ........... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/reuse-cid.t ................ ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/reuse-type1.t .............. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/reuse-unembedded.t ......... Dubious, test returned 255
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] All 2 subtests passed
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/subset.t ................... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/threads.t .................. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1-add-encoding.t ....... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1-encoding.t ........... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1-encoding_issue#12.t .. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1-stream.t ............. ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type1.t .................... ok
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type3-basic.t .............. Dubious, test returned 255
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] All 8 subtests passed
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] All tests successful.
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader]
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Test Summary Report
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] -------------------
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/reuse-unembedded.t (Wstat: 65280 Tests: 0 Failed: 0)
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Non-zero exit status: 255
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Parse errors: Bad plan. You planned 2 tests but ran 0.
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] t/type3-basic.t (Wstat: 65280 Tests: 1 Failed: 0)
21:08:50 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Non-zero exit status: 255
21:08:51 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Parse errors: Bad plan. You planned 8 tests but ran 1.
21:08:51 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Files=19, Tests=130, 156 wallclock secs
21:08:51 05/12/2022 [bash: run tests] :: [PDF::Font::Loader] Result: FAILED
21:08:51 05/12/2022 [bash: run tests] :: ===> Testing [FAIL]: PDF::Font::Loader:ver<0.6.2>:auth
21:08:51 05/12/2022 [bash: run tests] :: task exit status: 1

HTH

Aleksei

Encode new CMaps as UTF8

Rework PDF::Font::Loader::Enc to directly write utf-8 variable encoding for new CMaps.

Codespace range for UTF-8 encoding is, from https://adobe-type-tools.github.io/font-tech-notes/pdfs/5099.CMapResources.pdf:

4 begincodespacerange
  <00>       <7F>
  <C080>     <DFBF>
  <E08080>   <EFBFBF>
  <F0808080> <F7BFBFBF>
endcodespacerange
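As a quick sanity check of these ranges, the following plain-Raku sketch encodes a few characters as UTF-8 and finds the codespace range each code falls into. It assumes the usual CMap rule that a code matches a range when it has the same length and each byte lies between the corresponding low and high bytes. The helper names are hypothetical and unrelated to PDF::Font::Loader::Enc.

```raku
# the four codespace ranges quoted above, as (low, high) hex pairs
my @codespaces = ('00', '7F'), ('C080', 'DFBF'),
                 ('E08080', 'EFBFBF'), ('F0808080', 'F7BFBFBF');

# split a hex string such as 'C080' into its byte values (0xC0, 0x80)
sub hex-bytes(Str $hex) { $hex.comb(2).map(*.parse-base(16)).List }

# a code is in range when the lengths agree and every byte is within bounds
sub in-range(Blob $code, @lo, @hi) {
    return False unless $code.elems == @lo.elems;
    for ^$code.elems -> $i {
        return False unless @lo[$i] <= $code[$i] <= @hi[$i];
    }
    True;
}

sub codespace-for(Blob $code) {
    @codespaces.first: -> ($lo, $hi) { in-range($code, hex-bytes($lo), hex-bytes($hi)) };
}

for 'A', 'é', '€', '😀' -> $char {
    my $code = $char.encode('utf8');
    say "$char: {$code.elems} byte(s), codespace {codespace-for($code).join('..')}";
}
```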

Font failing to decode

The attached PDF is failing to decode an Identity-H encoded font with its supplied ToUnicode CMap.

This is apparent when processed via pdf-tag-dump.raku. This script also explores this a little further:

use PDF::Font::Loader;
use PDF::Font::Loader::FontObj;

use PDF::COS::Dict;
use PDF::Lite;
my PDF::Lite $pdf .= open: "/tmp/SSRN-id4337484.pdf";

my  PDF::COS::Dict:D $dict = $pdf.page(1)<Resources><Font><F9>;

my PDF::Font::Loader::FontObj:D $font = PDF::Font::Loader.load-font: :$dict;

my $str = "\x[3]~\0\x[4]\x[1]\x[F]\x[1]µ\x[1]l\x[1]u\x[1]\x[1E]\x[1]]\x[1]o\x[3]U\0\x[3]\x[1]\x[1E]\x[1]\x[9A]\0\x[3]\x[1]\x[2]\x[1]o\x[3]X\x[3]U\0\x[3]\x[3]î\x[3]ì\x[3]î\x[3]í\x[3]V\0";

say  $str.comb(/../).map({$font.decode($_, :str)}).join;

Produces: (bukmeiletal2021, whereas the rendered text is (Abukmeil, et al., 2021

SSRN-id4337484.pdf

Add an option to return a list of all font files meeting the search criteria.

The user may have thousands of fonts available meeting the input criteria and want to make his own choice.

When using .print in PDF::Lite, $font.underline-thickness, $font.underline-position, and $font.height are needed

They are helpful for fine-tuning special effects. And they should be sized accordingly by :$font-size.

Subsetted Identity-H CMaps appear to be incorrect.

As a simple experiment, if I cut and paste xpdf display from tmp/subset.pdf, after running t/subset.t, I get junk for the first (ttf) and third, identity-h encoded fonts.

Seems the CMaps are not correct.

load() method :name option, inconsistent with PDF::Content::Font :family option.

I'm inclined to rename :name to :family. Will investigate.

Is $font.height applied with $gfx.say for the resulting text position?

More forgiving handling of TrueType font collections

The main issue with handling TrueType collections is that it's unsafe to embed them.

So firstly, we shouldn't need to throw an error, if the font is being loaded with :!embed.

We may also be better off warning and unsetting the embed flag rather than dying.

I'm also curious to see how TrueType collections interact with subsets. Some more testing needed in the Font::Subset module.

Custom CFF glyph names breaking encoding

Found running pdf2image.raku (PDF::To::Cairo non-module) on PDF file 000377.pdf.

Embedded CFF fonts have custom glyph names, e.g., GMOICK+MSTT31c531S00 which has /g38 /g179...

What happens if 'find-font' finds multiple files?

From a trial it seems to return the first file found meeting the input criteria. So the docs should say that.

The docs should also clearly state the :find-font option requires Raku module 'FontConfig'.

Handle CFF fonts

Not that common, but intending to port Cairo OTF subsetting, which converts to CFF.

Handle custom ligatures in string decoding

For example, the following CMap has a custom 'Th' ligature <00540068> (as well as a standard 'fi' <00660069>). Should it decode as :str to 'Th'?

3208 0 obj
<< /Length 609 >> stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (AAAAAA+F22+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /AAAAAA+F22+0 def
/CMapType 2 def
1 begincodespacerange <02> <90> endcodespacerange
3 beginbfchar
<20> <0020>
<3b> <003B>
<90> <2019>
endbfchar
9 beginbfrange
<28> <29> <0028>
<2c> <36> <002C>
<38> <39> <0038>
<41> <50> <0041>
<52> <54> <0052>
<56> <57> <0056>
<59> <5a> <0059>
<61> <7a> <0061>
<8d> <8e> <201C>
endbfrange
2 beginbfrange
<02> <02> [<00540068>]
<03> <03> [<00660069>]
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end

endstream
endobj
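One possible reading, sketched in plain Raku: treat each ToUnicode destination as a sequence of big-endian 16-bit code units, so that a multi-unit destination decodes to a multi-character string. The helper names are hypothetical. The sketch ignores surrogate pairs and assumes that bfrange destinations with a shared base are a single code unit. It is not the decoding used by PDF::Font::Loader.

```raku
# Convert a destination such as '00540068' to a string by reading it as
# big-endian 16-bit code units (surrogate pairs are not handled here).
sub dest-to-str(Str $hex) {
    $hex.comb(4).map({ .parse-base(16).chr }).join;
}

# Expand a bfrange "<lo> <hi> <dest>": successive codes map to successive
# values of the (single code unit) destination.
sub expand-bfrange(Int $lo, Int $hi, Str $dest) {
    my $base = $dest.parse-base(16);
    ($lo .. $hi).map: -> $code { $code => dest-to-str(($base + $code - $lo).fmt('%04X')) };
}

my %to-unicode = expand-bfrange(0x41, 0x50, '0041');   # <41> <50> <0041>
%to-unicode{0x02} = dest-to-str('00540068');           # custom 'Th' ligature
%to-unicode{0x03} = dest-to-str('00660069');           # 'fi'

say %to-unicode{0x02, 0x03, 0x41, 0x50};               # (Th fi A P)
```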

Reusing a font within a PDF copies it

PDF::Font::Loader::Dict builds a Font::Loader::FreeType object with a brand new dictionary and stream.

Using this font to write back to the PDF works, but at the expense of a replicated font object, including the underlying Font Descriptor, CMap and stream.

Simple, but inefficient.

Encoding detection/setup confused by a PDF with simple WinAnsiEncoding encoded font

Also has UCS in its ToUnicode CMap name?
Example attached. Noticed while running PDF::To::Cairo's pdf2image.raku on it.
ucs-encode.pdf

Glyph maps can't handle type1 charsets

Here's an example of a font that can't currently be represented. From 000377-001.pdf (attached):

19 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /GMOICK+MSTT31c531S00
  /Encoding 20 0 R
  /FirstChar 1
  /FontDescriptor 21 0 R
  /LastChar 61
  /Widths [ 1004 836 391 334 558 613 502 334 772 606 440 331 440 248 248 772 496 552 827 552 606 248 386 716 606 827 552 496 496 1004 613 613 662 496 496 552 496 613 893 388 659 769 247 769 988 604 714 823 385 604 659 659 823 604 604 823 494 659 329 329 823 ]
>>
endobj

20 0 obj
<<
  /Type /Encoding
  /Differences [ 1 /g179 /g50 /g85 /g76 /g74 /g81 /g68 /g79 /g36 /g69 /g86 /g87 /g70 /g17 /g3 /g46 /g72 /g92 /g90 /g82 /g71 /g29 /g45 /g53 /g83 /g39 /g89 /g23 /g26 /g48 /g75 /g88 /g55 /g20 /g21 /g41 /g22 /g78 /g80 /g73 /g61 /g56 /g15 /g57 /g58 /g54 /g38 /g43 /g44 /g40 /g60 /g37 /g42 /g51 /g47 /g49 /g93 /g59 /g77 /g16 /g52 ]
>>
endobj

21 0 obj
<<
  /Type /FontDescriptor
  /Ascent 0
  /CapHeight 0
  /CharSet (/g38/g179/g23/g37/g36/g52/g58/g72/g50/g20/g15/g69/g21/g73/g56/g85/g86/g16/g88/g61/g76/g87/g82/g60/g74/g43/g70/g78/g80/g81/g47/g89/g17/g57/g83/g68/g41/g3/g75/g79/g51/g44/g59/g26/g92/g93/g55/g54/g49/g46/g45/g71/g29/g53/g40/g90/g42/g77/g39/g22/g48)
  /Descent 0
  /Flags 4
  /FontBBox [ -16 -265 1004 727 ]
  /FontFile3 22 0 R
  /FontName /GMOICK+MSTT31c531S00
  /ItalicAngle 0
  /StemV 0
>>
endobj

It's completely making up its own encoding with custom glyphs and no Unicode map. We couldn't do much with it, other than render it, but it does show our representations are correct. Also, we should be taking account of /CharSet to set up custom encoding -> cid mappings.

Freetype thread safety issues

I'm sometimes seeing issues when running pdf2image.raku from PDF::To::Cairo. For example:

david@box:~/git/PDF-To-Cairo-raku$ raku -I . bin/pdf2image.raku /tmp/out1.pdf 
saving page 1 -> PNG /tmp/out1-001.png...
saving page 9 -> PNG /tmp/out1-009.png...
saving page 17 -> PNG /tmp/out1-017.png...
saving page 25 -> PNG /tmp/out1-025.png...
saving page 33 -> PNG /tmp/out1-033.png...
saving page 41 -> PNG /tmp/out1-041.png...
saving page 49 -> PNG /tmp/out1-049.png...
loading font: Times-Bold -> /usr/share/fonts/opentype/urw-base35/NimbusRoman-Bold.otf
error processing glyph index: 44: FreeType Error: invalid argument
  in block  at /home/david/git/rakudo/install/share/perl6/site/sources/BB3ACC8ADEDFA6495C127C660154B220EF90E5A2 (PDF::Font::Loader::Enc) line 100
A worker in a parallel iteration (hyper or race) initiated here:
  in method save-as at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 574
  in sub MAIN at bin/pdf2image.raku line 33
  in block <unit> at bin/pdf2image.raku line 18

Died at:
    bad Cairo status 1 CAIRO_STATUS_NO_MEMORY after ShowText(Raku by example 101) operation
      in sub  at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 541
      in block  at /home/david/git/rakudo/install/share/perl6/site/sources/ADBE257FC9A2D570E7135DBB29F8E0E9C69FB6F9 (PDF::Content::Ops) line 947
      in method op at /home/david/git/rakudo/install/share/perl6/site/sources/ADBE257FC9A2D570E7135DBB29F8E0E9C69FB6F9 (PDF::Content::Ops) line 945
      in block  at /home/david/git/rakudo/install/share/perl6/site/sources/ADBE257FC9A2D570E7135DBB29F8E0E9C69FB6F9 (PDF::Content::Ops) line 1024
      in method ops at /home/david/git/rakudo/install/share/perl6/site/sources/ADBE257FC9A2D570E7135DBB29F8E0E9C69FB6F9 (PDF::Content::Ops) line 1023
      in method render at /home/david/git/rakudo/install/share/perl6/site/sources/94B207285FE37A6B2E4D3E0BD474EC06051A8163 (PDF::Content::Graphics) line 81
      in method render at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 60
      in method save-as-image at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 559
      in block  at /home/david/git/PDF-To-Cairo-raku/lib/PDF/To/Cairo.rakumod (PDF::To::Cairo) line 587

Freetype is not thread-safe for concurrent access to face objects, which seems to be going on here.

Investigation needed.
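One general way to guard a non-thread-safe resource in Raku is to route all access through a Lock. The sketch below uses a stand-in class (SharedFace is hypothetical and makes no FreeType calls) to show the pattern under race. Whether this is the right fix here, or whether per-thread face objects would be better, is exactly what needs investigating.

```raku
# Stand-in for a non-thread-safe resource such as a FreeType face.
# (Hypothetical class, for illustration only.)
class SharedFace {
    has Lock $!lock = Lock.new;
    has %!cache;

    # placeholder for the real, non-reentrant native call
    method !load-width(Int $gid) { 500 + $gid }

    # only one thread at a time may run the protected block
    method glyph-width(Int $gid) {
        $!lock.protect: {
            my $width = %!cache{$gid} // self!load-width($gid);
            %!cache{$gid} = $width;
        }
    }
}

my SharedFace $face .= new;

# many workers, one face: access is serialized by the lock
my @widths = (^100).race(:batch(8)).map: -> $gid { $face.glyph-width($gid) };
say @widths.sum;
```

The lock serializes glyph access, so it trades some parallelism for safety.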

Metrics mismatch for core fonts

Currently, FontConfig is used to load any system font that best matches the core font and takes the metrics from there.

This may cause problems depending on the font selected and how well it matches the core font metrics.

I'm seeing evidence of this with PDF::To::Cairo rendering of Pod::To::PDF::API6 produced PDFs with core fonts.

Note that other un-embedded fonts use widths (/Widths or /W). So also setting the widths using core fonts metrics is probably better.

Ideal solution is to choose, or have fonts available that closely match core font metrics.

More investigation needed.

/Identity-H + /ToUnicode decoding of ligatures

The attached golfed PDF has an Identity-H encoded subset with a /ToUnicode mapping that includes the entry <0193> <00660069>,
mapping ligature CID 0x0193 to 'fi'. This is not currently respected, as in:

use Test;
plan 1;
use PDF::Lite;
use PDF::Font::Loader;
use PDF::Font::Loader::FontObj;

my PDF::Lite $pdf .= open: "identity-h-lig.pdf";

my $dict = $pdf.page(1)<Resources><Font><C2_0>;

my PDF::Font::Loader::FontObj:D $font = PDF::Font::Loader.load-font: :$dict;

my $bytes = buf8.new(0x00,0x32, 0x00,0x49, 0x01,0x93, 0x00,0x46, 0x00,0x48).decode: "latin-1";
is $font.decode($bytes, :str), 'Office';

which produces:

1..1
not ok 1 - 
# Failed test at /tmp/identity-h-lig.t line 14
# expected: 'Office'
#      got: 'Offce'
# You failed 1 test of 1

identity-h-lig.pdf
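For reference, the expected behaviour can be sketched without the module. Identity-H codes are two bytes each, and a /ToUnicode entry may map one code to several characters. Only the <0193> <00660069> entry is quoted in this report. The other table entries below are inferred from the expected text and are assumptions, as is the helper name decode-identity-h.

```raku
# Hypothetical ToUnicode table for the subset. Only 0x0193 => 'fi' comes
# from the report; the rest are inferred from the expected string 'Office'.
my %to-unicode =
    0x0032 => 'O',
    0x0049 => 'f',
    0x0193 => 'fi',
    0x0046 => 'c',
    0x0048 => 'e';

# Read the bytes as big-endian two-byte codes and map each one.
# Unmapped codes become U+FFFD rather than being silently dropped.
sub decode-identity-h(Blob $bytes, %map) {
    $bytes.list.map(-> $hi, $lo { %map{ $hi * 256 + $lo } // "\x[FFFD]" }).join;
}

my $codes = Blob.new: 0x00,0x32, 0x00,0x49, 0x01,0x93, 0x00,0x46, 0x00,0x48;
say decode-identity-h($codes, %to-unicode);   # Office
```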