pdf-raku / pdf-api6

Facilitates the creation and modification of PDF files

License: Artistic License 2.0

pdf-api6's Introduction

PDF-raku

Overview

This is a low-level Raku module for accessing and manipulating data from PDF documents.

It presents a seamless view of the data in PDF or FDF documents. Behind the scenes it handles indexing, compression, encryption, fetching of indirect objects and unpacking of object streams. It can read, edit, create and incrementally update PDF files.

This module understands physical data structures rather than the logical document structure. It is primarily intended as a base for higher level modules, or as a tool to explore or patch data in PDF or FDF files.

It is possible to construct basic documents and perform simple edits by direct manipulation of PDF data. This requires some knowledge of how PDF documents are structured. Please see 'The Basics' and 'Recommended Reading' sections below.

Classes/roles in this module include:

PDF - PDF document root (trailer)
PDF::IO::Reader - for indexed random access to PDF files
PDF::IO::Filter - a collection of standard PDF decoding and encoding tools for PDF data streams
PDF::IO::IndObj - base class for indirect objects
PDF::IO::Serializer - data marshalling utilities for the preparation of full or incremental updates
PDF::IO::Crypt - decryption / encryption
PDF::IO::Writer - for the creation or update of PDF files
PDF::COS - Raku Bindings to PDF objects [Carousel Object System, see COS]

Example Usage

To create a one-page PDF that displays 'Hello, world!':

#!/usr/bin/env raku
# creates examples/helloworld.pdf
use PDF;
use PDF::COS::Name;
use PDF::COS::Dict;
use PDF::COS::Stream;
use PDF::COS::Type::Info;

sub prefix:</>($s) { PDF::COS::Name.COERCE($s) };

# construct a simple PDF document from scratch
my PDF $pdf .= new;
my PDF::COS::Dict $catalog = $pdf.Root = { :Type(/'Catalog') };

my @MediaBox  = 0, 0, 250, 100;

# define font /F1 as core-font Helvetica
my %Resources = :Procset[ /'PDF', /'Text'],
                :Font{
                    :F1{
                        :Type(/'Font'),
                        :Subtype(/'Type1'),
                        :BaseFont(/'Helvetica'),
                        :Encoding(/'MacRomanEncoding'),
                    },
                };

my PDF::COS::Dict $page-index = $catalog<Pages> = { :Type(/'Pages'), :@MediaBox, :%Resources, :Kids[], :Count(0) };
# add some standard metadata
my PDF::COS::Type::Info $info = $pdf.Info //= {};
$info.CreationDate = DateTime.now;
$info.Producer = "Raku PDF";

# define some basic content
my PDF::COS::Stream() $Contents = { :decoded("BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET" ) };

# create a new page. add it to the page tree
$page-index<Kids>.push: { :Type(/'Page'), :Parent($page-index), :$Contents };
$page-index<Count>++;

# save the PDF to a file
$pdf.save-as: 'examples/helloworld.pdf';

Then to update the PDF, adding another page:

#!/usr/bin/env raku
use PDF;
use PDF::COS::Stream;
use PDF::COS::Type::Info;

my PDF $pdf .= open: 'examples/helloworld.pdf';

# locate the document root and page tree
my $catalog = $pdf<Root>;
my $Parent = $catalog<Pages>;

# create additional content, use existing font /F1
my PDF::COS::Stream() $Contents = { :decoded("BT /F1 16 Tf  15 25 Td (Goodbye for now!) Tj ET" ) };

# create a new page. add it to the page-tree
$Parent<Kids>.push: { :Type( :name<Page> ), :$Parent, :$Contents };
$Parent<Count>++;

# update or create document metadata. set modification date
my PDF::COS::Type::Info $info = $pdf.Info //= {};
$info.ModDate = DateTime.now;

# incrementally update the existing PDF
$pdf.update;

Description

A PDF file consists of data structures, including dictionaries (hashes), arrays, numbers and strings, plus streams for holding graphical data such as images, fonts and general content.

PDF files are also indexed for random access and may also have internal compression and/or encryption.

They have a reasonably well specified structure. The document starts from the Root entry in the trailer dictionary, which is the main entry point into a PDF.

This module is based on the PDF 1.7 specification (ISO 32000-1:2008). It implements syntax, basic data-types, serialization and encryption rules as described in the first four chapters of the specification. Read and write access to data structures is via direct manipulation of tied arrays and hashes.

The Basics

The examples/helloworld.pdf file that we created above contains:

%PDF-1.3
%...(control characters)
1 0 obj <<
  /CreationDate (D:20151225000000Z00'00')
  /Producer (Raku PDF)
>>
endobj

2 0 obj <<
  /Type /Catalog
  /Pages 3 0 R
>>
endobj

3 0 obj <<
  /Type /Pages
  /Count 1
  /Kids [ 4 0 R ]
  /MediaBox [ 0 0 250 100 ]
  /Resources <<
    /Font <<
      /F1 6 0 R
    >>
    /Procset [ /PDF /Text ]
  >>
>>
endobj

4 0 obj <<
  /Type /Page
  /Contents 5 0 R
  /Parent 3 0 R
>>
endobj

5 0 obj <<
  /Length 44
>> stream
BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET
endstream
endobj

6 0 obj <<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Helvetica
  /Encoding /MacRomanEncoding
>>
endobj

xref
0 7
0000000000 65535 f 
0000000014 00000 n 
0000000101 00000 n 
0000000155 00000 n 
0000000334 00000 n 
0000000404 00000 n 
0000000501 00000 n 
trailer
<<
  /ID [ <d743a886fcdcf87b69c36548219ea941> <d743a886fcdcf87b69c36548219ea941> ]
  /Info 1 0 R
  /Root 2 0 R
  /Size 7
>>
startxref
610
%%EOF

The PDF is composed of a series of indirect objects. For example, the first object is:

1 0 obj <<
  /CreationDate (D:20151225000000Z00'00')
  /Producer (Raku PDF)
>> endobj

It's an indirect object with object number 1 and generation number 0, with a << ... >> delimited dictionary containing the producer and the date that the document was created. This PDF dictionary is roughly equivalent to the Raku hash:

{ :CreationDate("D:20151225000000Z00'00'"), :Producer("Raku PDF"), }
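The CreationDate value uses the PDF date format: D: followed by year, month, day, hour, minute and second digits, then a time-zone designator. As a rough illustration in core Raku only, the date can be unpacked as below. The helper name parse-pdf-date is hypothetical and is not part of this module; the sketch assumes all six fields are present and that the time is UTC, as in the example above.

```raku
# Sketch: convert a PDF date string into a Raku DateTime.
# `parse-pdf-date` is a hypothetical helper. It reads only the date and time
# fields and assumes UTC (a 'Z' designator), ignoring any other offset.
sub parse-pdf-date(Str $s --> DateTime) {
    $s ~~ /^ 'D:' (\d**4) (\d**2) (\d**2) (\d**2) (\d**2) (\d**2) /
        or die "unrecognised PDF date: $s";
    my ($year, $month, $day, $hour, $minute, $second) = $/.list».Int;
    DateTime.new: :$year, :$month, :$day, :$hour, :$minute, :$second;
}

say parse-pdf-date("D:20151225000000Z00'00'");
```

In the PDF specification the fields after the year are optional, so a complete parser would need to allow shorter strings and non-zero offsets.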

The bottom of the PDF contains:

trailer
<<
  /ID [ <d743a886fcdcf87b69c36548219ea941> <d743a886fcdcf87b69c36548219ea941> ]
  /Info 1 0 R
  /Root 2 0 R
  /Size 7
>>
startxref
610
%%EOF

The << ... >> delimited section is the trailer dictionary and the main entry point into the document. The entry /Info 1 0 R is an indirect reference to the first object (object number 1, generation 0) described above. The entry /Root 2 0 R points to the root of the actual PDF document, commonly known as the Document Catalog.
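To make the shape of these entries concrete, here is a minimal sketch in core Raku that picks the indirect references and the startxref offset out of the trailer text with plain regexes. It is illustrative only: the module uses a proper parser, the variable names are made up for this example, and the /ID entry is left out of the sample for brevity.

```raku
# Sketch: extract indirect references and the startxref offset from a
# trailer using regexes (hypothetical; not how the module parses PDFs).
my $tail = q:to/END/;
trailer
<<
  /Info 1 0 R
  /Root 2 0 R
  /Size 7
>>
startxref
610
%%EOF
END

# "/Key object-number generation-number R" entries are indirect references
my %refs;
for $tail.match(/ '/' (\w+) \s+ (\d+) \s+ (\d+) \s+ 'R' /, :g) -> $m {
    %refs{~$m[0]} = %( :obj-num(+$m[1]), :gen-num(+$m[2]) );
}

# the number after 'startxref' is the byte offset of the xref section
my $xref-offset = $tail ~~ / 'startxref' \s+ (\d+) / ?? +$0 !! Nil;

say "Root is object %refs<Root><obj-num> %refs<Root><gen-num>";
say "xref section starts at byte $xref-offset";
```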

Immediately above the trailer is the cross reference table:

xref
0 7
0000000000 65535 f 
0000000014 00000 n 
0000000101 00000 n 
0000000155 00000 n 
0000000334 00000 n 
0000000404 00000 n 
0000000501 00000 n

This indexes the indirect objects in the PDF by byte offset and generation number, for random access.
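Each entry is a ten-digit byte offset, a five-digit generation number and a status flag: n for an object that is in use, f for a free entry. The header line 0 7 gives the first object number and the number of entries. The following sketch decodes the table above in core Raku; parse-xref is a hypothetical helper written for this example, not a function of this module, and it assumes a single subsection.

```raku
# Sketch: decode the entries of a classic cross reference table.
# `parse-xref` is a hypothetical helper, not part of the PDF module.
sub parse-xref(Str $xref) {
    my @lines = $xref.lines.map(*.trim).grep(*.chars);
    die "not an xref section" unless @lines.shift eq 'xref';

    # subsection header: first object number, then entry count
    my ($first, $count) = @lines.shift.words».Int;

    my @entries;
    for ^$count -> $i {
        # byte offset, generation number, 'n' (in use) or 'f' (free)
        my ($offset, $gen, $status) = @lines[$i].words;
        @entries.push: %( :obj-num($first + $i), :offset($offset.Int),
                          :gen($gen.Int), :$status );
    }
    @entries;
}

my $xref = q:to/END/;
xref
0 7
0000000000 65535 f
0000000014 00000 n
0000000101 00000 n
0000000155 00000 n
0000000334 00000 n
0000000404 00000 n
0000000501 00000 n
END

for parse-xref($xref).grep(*<status> eq 'n') -> $e {
    say "object $e<obj-num> $e<gen> starts at byte $e<offset>";
}
```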

We can quickly put PDF to work using the Raku REPL, to better explore the document:

snoopy: ~/git/PDF-raku $ raku -M PDF
> my $pdf = PDF.open: "examples/helloworld.pdf"
ID => [CÜ{ÃHADCN:C CÜ{ÃHADCN:C], Info => ind-ref => [1 0], Root => ind-ref => [2 0]
> $pdf.keys
(Root Info ID)

This is the root of the PDF, loaded from the trailer dictionary.

> $pdf<Info>
{CreationDate => D:20151225000000Z00'00', ModDate => D:20151225000000Z00'00', Producer => Raku PDF}

That's the document information entry, commonly used to store basic meta-data about the document.

(PDF::IO has conveniently fetched indirect object 1 from the PDF, when we dereferenced this entry).

> $pdf<Root>
{Pages => ind-ref => [3 0], Type => Catalog}

The trailer Root entry references the document catalog, which contains the actual PDF content. Exploring further; the catalog potentially contains a number of pages, each with content.

> $pdf<Root><Pages>
{Count => 1, Kids => [ind-ref => [4 0]], MediaBox => [0 0 420 595], Resources => Font => F1 => ind-ref => [6 0], Type => Pages}
> $pdf<Root><Pages><Kids>[0]
{Contents => ind-ref => [5 0], Parent => ind-ref => [3 0], Type => Page}
> $pdf<Root><Pages><Kids>[0]<Contents>
{Length => 44}
"BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET"

The page /Contents entry is a PDF stream which contains graphical instructions. In the above example, the instructions output the text Hello, world! at coordinates 15, 25.
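Content streams use postfix notation: operands come first, followed by the operator that consumes them. For instance, /F1 24 Tf selects font /F1 at size 24, and 15 25 Td moves the text position. The sketch below groups the tokens of this simple stream into operator and operand pairs using core Raku. It is an assumption-laden illustration: parse-ops is a hypothetical helper, and it does not handle nested parentheses, escapes, arrays, dictionaries or operators containing non-alphabetic characters.

```raku
# Sketch: split a simple content stream into (operator => operands) pairs.
# `parse-ops` is a hypothetical helper, not the module's content parser.
sub parse-ops(Str $content) {
    my @ops;
    my @operands;
    # a token is either a "( ... )" literal string or a run of non-space characters
    for $content.match(/ '(' <-[)]>* ')' || \S+ /, :g)».Str -> $token {
        if $token ~~ /^ <[A..Za..z]>+ $/ {
            # a bare word is an operator; it consumes the pending operands
            @ops.push: ($token => [@operands]);
            @operands = ();
        }
        else {
            @operands.push: $token;
        }
    }
    @ops;
}

.say for parse-ops("BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET");
```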

Reading and Writing of PDF files:

PDF is a base class for opening or creating PDF documents.

my $pdf = PDF.open("mydoc.pdf", :repair)
Opens an input PDF (or FDF) document.
- :!repair causes the reader to load only the trailer dictionary and cross reference tables from the tail of the PDF (Cross Reference Table or a PDF 1.5+ Stream). Remaining objects will be lazily loaded on demand.
- :repair causes the reader to perform a full scan, ignoring and recalculating the cross reference stream/index and stream lengths. This can be handy if the PDF document has been hand-edited.

$pdf.update
This performs an incremental update to the input PDF, which must be an indexed PDF (not applicable to PDFs opened with :repair, FDF or JSON files). A new section is appended to the PDF that contains only updated and newly created objects. This method can be used as a fast and efficient way to make small updates to a large existing PDF document.
- :diffs(IO::Handle $fh) - saves just the updates to an alternate location. This can be later appended to the base PDF to reproduce the updated PDF.

$pdf.save-as("mydoc-2.pdf", :compress, :stream, :preserve, :rebuild)
Saves a new document, including any updates. Options:
- :compress - compress objects for minimal size
- :!compress - uncompress objects for human readability
- :stream - write the PDF progressively
- :preserve - copy the input PDF, then incrementally update. This is generally faster and ensures that any digital signatures are not invalidated.
- :rebuild - discard any unreferenced objects and renumber the remaining objects. It may be a good idea to rebuild a PDF document that has been incrementally updated a number of times.

Note that the :compress and :rebuild options are a trade-off. The document may take longer to save, however file-sizes and the time needed to reopen the document may improve.

$pdf.save-as("mydoc.json", :compress, :rebuild); my $pdf2 = $pdf.open: "mydoc.json"
Documents can also be saved and opened from an intermediate JSON representation. This can be handy for debugging, analysis and/or ad-hoc patching of PDF files.

Reading PDF Files

The .open method loads a PDF index (cross reference table and/or stream). The document can then be accessed randomly via the .ind-obj(...) method.

The document can be traversed by dereferencing Array and Hash objects. The reader will load indirect objects via the index, as needed.

use PDF::IO::Reader;
use PDF::COS::Name;

my PDF::IO::Reader $reader .= new;
$reader.open: 'examples/helloworld.pdf';

# objects can be directly fetched by object-number and generation-number:
my $page1 = $reader.ind-obj(4, 0).object;

# Hashes and arrays are tied. This is usually more convenient for navigating
my $pdf = $reader.trailer<Root>;
$page1 = $pdf<Pages><Kids>[0];

# Tied objects can also be updated directly.
$reader.trailer<Info><Creator> = PDF::COS::Name.COERCE: 't/helloworld.t';

Utility Scripts

pdf-rewriter.raku [--repair] [--rebuild] [--stream] [--[/]compress] [--password=Xxx] [--decrypt] [--class=Module] [--render] <pdf-or-json-file-in> [<pdf-or-json-file-out>]
This script is a thin wrapper for the PDF .open and .save-as methods. It can typically be used to:
- uncompress or render a PDF for human readability
- repair a PDF whose cross-reference index or stream lengths have become invalid
- convert between PDF and JSON

Decode Filters

Filters are used to compress or decompress stream data in objects of type PDF::COS::Stream. These are implemented as follows:

Filter Name	Short Name	Filter Class
ASCIIHexDecode	AHx	PDF::IO::Filter::ASCIIHex
ASCII85Decode	A85	PDF::IO::Filter::ASCII85
CCITTFaxDecode	CCF	NYI
Crypt		NYI
DCTDecode	DCT	NYI
FlateDecode	Fl	PDF::IO::Filter::Flate
LZWDecode	LZW	PDF::IO::Filter::LZW (`decode` only)
JBIG2Decode		NYI
JPXDecode		NYI
RunLengthDecode	RL	PDF::IO::Filter::RunLength

Input to all filters is byte strings, with characters in the range \x0 ... \xFF. latin-1 encoding is recommended to enforce this.

Each filter has encode and decode methods, which accept and return latin-1 encoded strings, or binary blobs.

use PDF::IO::Filter;
my Blob $encoded = PDF::IO::Filter.encode( :dict{ :Filter<RunLengthDecode> },
                                      "This    is waaay toooooo loooong!");
say $encoded.bytes;
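For readers curious about what such a filter does, the RunLengthDecode format is simple enough to sketch in core Raku. Each run starts with a length byte: 0 to 127 means the next length + 1 bytes are copied literally, 129 to 255 means the next byte is repeated 257 - length times, and 128 marks the end of data. The helper run-length-decode below is hypothetical and independent of PDF::IO::Filter::RunLength, whose internals are not shown in this document.

```raku
# Sketch of RunLengthDecode decoding in plain Raku (hypothetical helper,
# independent of PDF::IO::Filter::RunLength).
sub run-length-decode(Blob $in --> Blob) {
    my @out;
    my $i = 0;
    while $i < $in.elems {
        my $len = $in[$i++];
        last if $len == 128;                    # end-of-data marker
        if $len < 128 {
            # literal run: copy the next $len + 1 bytes
            @out.append: $in.subbuf($i, $len + 1).list;
            $i += $len + 1;
        }
        else {
            # repeated run: the next byte occurs 257 - $len times
            my $byte = $in[$i++];
            @out.append: $byte xx (257 - $len);
        }
    }
    Blob.new: @out;
}

# 'ab' as a 2-byte literal run, then 'c' repeated 4 times, then end-of-data
my $sample = Blob.new: 1, 0x61, 0x62, 253, 0x63, 128;
say run-length-decode($sample).decode('latin-1');
```

The repeated byte is read into a variable before using xx, because xx re-evaluates its left-hand expression for each repetition.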

Encryption

PDF::IO::Crypt supports RC4 and AES encryption (revisions /R 2 - 4 and versions /V 1 - 4 of PDF Encryption).

To open an encrypted PDF document, specify either the user or owner password: PDF.open( "enc.pdf", :password<ssh!>)

A document can be encrypted using the encrypt method: $pdf.encrypt( :owner-pass<ssh1>, :user-pass<abc>, :aes )

:aes encrypts the document using stronger V4 AES encryption, introduced with PDF 1.6.

Note that it's quite common to leave the user-password blank. This indicates that the document is readable by anyone, but may have restrictions on update, printing or copying of the PDF.

An encrypted PDF can be saved as JSON. It will remain encrypted and passwords may be required, to reopen it.

Built-in objects

PDF::COS also provides a few essential derived classes that are needed to read and write PDF files, including encryption, object streams and cross reference streams.

Class	Base Class	Description
PDF	PDF::COS::Dict	document entry point - the trailer dictionary
PDF::COS::Type::Encrypt	PDF::COS::Dict	PDF Encryption/Permissions dictionary
PDF::COS::Type::Info	PDF::COS::Dict	Document Information Dictionary
PDF::COS::Type::ObjStm	PDF::COS::Stream	PDF 1.5+ Object stream (packed indirect objects)
PDF::COS::Type::XRef	PDF::COS::Stream	PDF 1.5+ Cross Reference stream
PDF::COS::TextString	PDF::COS::ByteString	Implements the 'text-string' data-type


pdf-api6's Issues

Unable to use text methods on multiple pages

I have a sub that, given a $pdf.page, uses a text block to put text on that page.
Then I add a new page and use the same sub on the new $pdf.page.

The two pages are saved as a single file with $pdf.save-as, but, when the file is rendered, only the first page has the text on it. Is there a method to finish a page before creating a new one? I will show my sample program in a gist.

The gist: https://gist.github.com/tbrowder/8a3b7f4ec333caa6d3c524aa5a14d803#file-make-pdf-raku

Support named destinations

Among other things, these allow browsers to open PDF documents at an arbitrary location.

As an example https://docs.aspose.com/cells/net/add-pdf-bookmarks-with-named-destinations/50528349.pdf#AsposeCells--L4 opens to a location on page 3.

Probably just needs a modest extension to the existing API, e.g.: $pdf.destination( :name<foo>, :page(2), :fit(FitWindow) ); to enable my.pdf#foo as a destination.

[thanks] This looks very useful for my tax needs

David, thanks for all your tremendous work in this area!

I have seen your module updates reported from CPAN on #raku often and didn't think I had any use for them. I have been generating PostScript files by hand and Perl and now Raku for over 25 years and started converting the PS to pdf as soon as ps2pdf came along.

However, this year I see how your modules can help me greatly! This year, for the first time ever, I signed on with a local CPA so my wife can have financial help in the event of my demise.

He requires us to fill out detailed pdf forms so he can have controlled input for his work flow. Today, while wrestling with some tax stuff for him, I find that it would be nice if I can programmatically update some or all of his 16-page pdf form from my records via Raku filtering from csv dumps from our financial accounts. With the help of your wonderful modules I hope to have some success soon!

Blessings.

-Tom (@tbrowder)

Todo: Support Form-filling

After some experimentation, setting the form value is not enough to enter a displayable value (Acrobat Reader). Displays OK on Xpdf, Evince, but AR needs an Appearance. Perl 5 CAM PDF's CAM::PDF.fillFormFields() can be used as a reference implementation for this feature.

Rebase from PDF::Lite to PDF::Zen

This module currently inherits from PDF::Lite. This keeps the overall distribution lightweight.

However, there is some functional overlap between this module (PDF::API6) and PDF::Zen, which is currently under construction.

Holding ATM back for a couple of reasons:

PDF::Zen is brand new and has only just passed its first Rakudo point release 2017.08.
It's another dependency that needs to be managed. We don't yet have a CPAN like mechanism for upgrading.

Currently progressing this on the zen branch. Will review after 2017.09 point release.

README needs reorganization/transference to doco repository

This was created before https://github.com/pdf-raku/pdf-raku.github.io came into existence and has continued to grow and become a de facto documentation point for the rest of the tool-chain.

Initial problem is that it's in the wrong place. At some point the bulk of it should be moved to the doco repository and the README slimmed down to refer to it.

Valign not centering text in xforms

See results of this program:

#!/bin/env raku

use v6;
use PDF::API6;
use PDF::Page;
use PDF::XObject::Form;
use PDF::Content::Color :rgb;

my PDF::API6 $pdf .= new;
my PDF::Page $page = $pdf.add-page;
$page.media-box = [0, 0, 8.5*72, 11*72];

my $ofile = "xforms.pdf";

# create a new XObject form of size 100x50
my @BBox = [0, 0, 100, 50];
my PDF::XObject::Form $form = $page.xobject-form: :@BBox;

$form.graphics: {
    .Save;
    # color the entire form
    .FillColor = rgb(0, 0, 0); #color Black;
    .Rectangle: |@BBox;
    .paint: :fill, :stroke;
    .FillColor = rgb(1, 1, 1); #color White;
    # add some sample text
    .text: {
        .font = .core-font('Helvetica'), 14;
        .print: "White", :position[50, 25], :align<center>, :valign<center>;
    }
    .Restore;
}

# display the form a couple of times
$page.graphics: {
    .Save;
    .transform: :translate(300, 300);
    .do($form);
    .Restore;
    .Save;
    .transform: :translate(200, 300);
    .do($form);
    .Restore;
}

$pdf.save-as: $ofile;
say "see output file: $ofile";

Explain what the "codes" are in the various tables in file README.md

Are the "codes" aliases for accessors and methods? I hope so because I'm going to try them.

Implement Outlines.

For creating a user TOC/navigable index. See http://www.perlmonks.org/?node_id=1191298

Adding an image doesn't seem to work

I tried running this using different images:

use PDF::API6;
use PDF::Content::Page :PageSizes;

sub MAIN($output, $file)
{
  my PDF::API6 $pdf .= new;
  $pdf.media-box = A4;
  my $page = $pdf.add-page;
  my $gfx = $page.gfx;
  my $image = $gfx.load-image($file);
  $gfx.do($image, 10, 20);
  $pdf.save-as($output);
}

but the resulting PDF doesn't show anything, just a blank page.
Is there something I'm overlooking?

Actions annotation and field construction.

Following on from destinations, implemented with #2. See also https://metacpan.org/pod/PDF::API2::Annotation.

Click the outlines after make toc, but it doesn't jump to the special page number

I use the following code to make outlines, and it works quite fine. But when I open the output pdf file and click the outlines on the left, I expect it to jump to the specified page number, but nothing happens.

use PDF::API6;
my PDF::API6 $pdf .= new;
$pdf.add-page for 1 .. 7;
use PDF::Destination :Fit;

for 1 .. 7 -> $n {
    $pdf.page($n).text: {
        .text-position = 10,20;
        .say('text @10,20');
    }
}


sub dest(|c) { :destination($pdf.destination(|c)) }

$pdf.outlines.kids = [
          %( :Title('1. Purpose of this Document'), dest(:page(1))),
          %( :Title('2. Pre-requisites'),           dest(:page(2))),
          %( :Title('3. Compiler Speed-up'),        dest(:page(3))),
          %( :Title('4. Recompiling the Kernel for Modules'), dest(:page(4)),
             :kids[
                %( :Title('5.1. Configuring Debian or RedHat for Modules'),
                   dest(:page(5), :fit(FitXYZoom), :top(798)) ),
                %( :Title('5.2. Configuring Slackware for Modules'),
                   dest(:page(5), :fit(FitXYZoom), :top(400)) ),
                %( :Title('5.3. Configuring Other Distributions for Module'),
                   dest(:page(5), :fit(FitXYZoom), :top(200)) ),
             ],
           ),
          %( :Title('Appendix'), dest(:page(7))),
         ];

$pdf.save-as: "../tmp/make-toc.pdf";

Error while trying to write on a PDF v.1.6

I'm using this program:

use PDF::API6;

my PDF::API6 $pdf .= open('old.pdf');
my $page = $pdf.page(1);
my $font = $pdf.core-font('Helvetica-Bold');
$page.text: {
  .font = $font, 20;
  .text-position = 200, 700;
  .say('Hello World!');
}
$pdf.save-as('new.pdf');

to open an existing PDF file, write a string on it, and save the result.
While it works on a PDF v.1.3, it doesn't on a PDF v.1.6.
The error I get is:

Type check failed in assignment to @ops; expected Pair but got Str ("tba: ignored-block =...)
  in method render at /home/nando/.perl6/sources/77CED8D94355E86EA0C383F654D5B2B924829E89 (PDF::Content::Graphics) line 71
  in method gfx at /home/nando/.perl6/sources/77CED8D94355E86EA0C383F654D5B2B924829E89 (PDF::Content::Graphics) line 45
  in method text at /home/nando/.perl6/sources/77CED8D94355E86EA0C383F654D5B2B924829E89 (PDF::Content::Graphics) line 50
  in block <unit> at ./test.p6 line 8

todo: Digital Signing

Would be nice to have the ability to digitally sign documents in various ways. See Adobe technical documentation: https://www.adobe.com/devnet-docs/acrobatetk/tools/DigSig/Acrobat_DigitalSignatures_in_PDF.pdf.

There are some online services that claim to be able to verify digital signatures and could be used for verification purposes.

PDF::Class has the PDF::Signature class. I think we'll also need an independent module.

Note that verification is a harder problem than signing. All (or at least most) types of signatures need to be supported.

Using @bbox = .print: $text, :align<center> results in incorrect @bbox coords

In my use case:

my @position = [$x, $y];
my @bbox = .print: $text, :@position, :align<center>, :$font;

the resulting @bbox[0] and @bbox[2] coordinates (x values) were unchanged from what would have resulted from the default :align at @position = [$x, $y].

In other words, the :align<center> value worked correctly when rendered, but it had no effect on the @bbox values.

Note I have not yet checked effects of other :align or :valign values on @bbox.

Pulling in image from a resource?

I have a document coming from a scanner with the following structure:

% **** Page 1 ****
q
  608.4 0 0 789.12 0 0 cm
  /FXX1 Do
Q
% **** Page 2 ****
q
  605.52 0 0 787.68 0 0 cm
  /FXX1 Do
Q
% **** Page 3 ****
q
  605.16 0 0 790.56 0 0 cm
  /FXX1 Do
Q

FXX1 here is the image itself. $gfx.images doesn't see it. Is there a way to pull it out somehow and save into a PNG or whatever?
