py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Home Page: https://pypdf.readthedocs.io/en/latest/

License: Other

Python 99.93% Makefile 0.07% Shell 0.01%
help-wanted pdf pdf-documents pdf-manipulation pdf-parser pdf-parsing pypdf2 python

pypdf's Introduction

py-pdf.github.io

Website py-pdf

Install requirements

$ pip install -r requirements.txt
$ pre-commit install

Launch local server with livereload

$ invoke livereload

Adding a Python dependency

  1. Edit requirements.in
  2. Run pip-compile requirements.in to generate requirements.txt

Publish

$ make github

pypdf's People

Contributors

caxap, cclauss, dependabot[bot], dkg, egbutter, exiledkingcc, hatell, henrykeiter, j-t-1, jamma313, knowah, kushal-kumaran, maphew, marcstober, martinthoma, masterodin, mergezalot, moshekaplan, mozbugbox, mstamy2, mtd91429, pubpub-zz, rob1080, srogmann, stefan6419846, switham, sylvainpelissier, vashek, vfigueiro, vladir

pypdf's Issues

Can't getData() from /Contents List

I'm trying to dig deep into some PDFs by calling getData directly on part of a page (I am then parsing that data to find coordinates for a bit of text).

This worked for me in the past with essentially:

page = PdfFileReader(inpdf).getPage(0)
text = page.getContents().getData()   #<-- or page["/Contents"].getData()

but with my new PDFs, I am getting an error like this:
"AttributeError: 'ArrayObject' object has no attribute 'getData'"

Digging in, it looks like my old PDF was structured like this (print page) with a single IndirectObject in the contents.

{'/Contents': IndirectObject(14, 0),
 '/MediaBox': [0, 0, 662.40000, 792],
 '/Parent': IndirectObject(1, 0),
 '/Resources': {'/Font': {'/F3': IndirectObject(10, 0),
                          '/F4': IndirectObject(7, 0),
                          '/F5': IndirectObject(4, 0)},
                '/ProcSet': IndirectObject(13, 0),
                '/XObject': {}},
 '/Type': '/Page'}

Then page.getContents() returns:

{'/Filter': '/FlateDecode'}

while my new PDF is structured like this with a list of IndirectObjects in the contents:

{'/Contents': [IndirectObject(11, 0),
               IndirectObject(12, 0),
               IndirectObject(13, 0),
               IndirectObject(14, 0),
               IndirectObject(15, 0),
               IndirectObject(16, 0),
               IndirectObject(17, 0),
               IndirectObject(18, 0)],
 '/CropBox': [0, 0, 612, 792],
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': IndirectObject(5, 0),
 '/Resources': {'/Font': {'/F3': IndirectObject(24, 0),
                          '/F4': IndirectObject(26, 0),
                          '/F6': IndirectObject(29, 0),
                          '/F7': IndirectObject(30, 0)},
                '/ProcSet': IndirectObject(31, 0),
                '/XObject': {}},
 '/Rotate': 0,
 '/Type': '/Page'}

then page.getContents() returns:

[IndirectObject(11, 0),
 IndirectObject(12, 0),
 IndirectObject(13, 0),
 IndirectObject(14, 0),
 IndirectObject(15, 0),
 IndirectObject(16, 0),
 IndirectObject(17, 0),
 IndirectObject(18, 0)]

How do I get at the underlying data of /Contents? Indexing into the list with page.getContents()[0] just returns the object's reference (e.g. IndirectObject(11, 0)), and I can't call getData() on that. I can't tell whether this is a bug (caused by having a list as the contents) or whether I am missing some feature.
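One way to handle both layouts is to resolve each reference before asking for its data. The helper below is only a sketch, assuming the PyPDF2 1.x object model (IndirectObject offers getObject(), ArrayObject behaves like a list, stream objects offer getData()). The name contents_data and the two stand-in classes are hypothetical and exist only so the sketch runs without a PDF.

```python
def contents_data(contents):
    """Return the bytes of a /Contents entry, whether it is one stream or an array."""
    # Resolve an indirect reference (or return the object itself) first.
    if hasattr(contents, "getObject"):
        contents = contents.getObject()
    # An array of streams is concatenated; a newline keeps tokens apart.
    if isinstance(contents, list):
        return b"\n".join(contents_data(item) for item in contents)
    return contents.getData()


# Hypothetical stand-ins for a stream and an indirect reference.
class FakeStream(object):
    def __init__(self, data):
        self._data = data

    def getData(self):
        return self._data


class FakeRef(object):
    def __init__(self, target):
        self._target = target

    def getObject(self):
        return self._target


single = contents_data(FakeRef(FakeStream(b"q Q")))
multi = contents_data([FakeRef(FakeStream(b"BT")), FakeRef(FakeStream(b"ET"))])
```

With a real page, the call would be contents_data(page["/Contents"]), under the same assumptions.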

PyPDF2 should not overwrite warnings.formatwarning.

Hello,

PyPDF2 1.2.0 overwrites warnings.formatwarning with its own implementation (utils._formatwarning) in pdf.py line 74:

warnings.formatwarning = utils._formatwarning

Unfortunately this may cause severe side-effects if PyPDF2 is imported in a larger application. In our case the PyPDF2 implementation of formatwarning caused IndexErrors whenever a warning was raised somewhere else (and the filename argument was not to the formatter's liking).

Personally, I do not think that it is a good idea for a library to interfere with the global logging/warning infrastructure.
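Until the library stops doing this, an application can protect itself by restoring the hook after the import. The following is a minimal sketch using only the standard library; preserve_formatwarning is a hypothetical helper name, and the assignment inside the with block merely simulates what the import is reported to do.

```python
import contextlib
import warnings


@contextlib.contextmanager
def preserve_formatwarning():
    """Restore warnings.formatwarning after the wrapped block has run."""
    saved = warnings.formatwarning
    try:
        yield
    finally:
        warnings.formatwarning = saved


original = warnings.formatwarning
with preserve_formatwarning():
    # "import PyPDF2" would go here; this line simulates the override.
    warnings.formatwarning = lambda *args, **kwargs: "overridden\n"
```

After the block, warnings.formatwarning is the application's original function again.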

P.S.: Apart from this problem, we have been using PyPDF2 successfully for some time now. Nice piece of software!

Speed up parser

Currently the parser is quite slow, even for moderately sized PDFs. When I get a bit of time, I'm going to investigate different ways it could be sped up. Right now (pending some profiling, obviously) I suspect this is going to involve re-writing some of the core parser loops in something lower level like Cython. I'm looking into options to see if it's possible to write in a language which will be able to compile back to vanilla Python for the benefit of PyPy and friends.

I'm opening this issue to start discussion on the matter, and see if you've got any strong feelings either way.

Some valid but non-standard indirect objects cause PyPDF2 to fail

The issue occurs with a reference like this: /FontFile2 11 0 R

It contains more than one consecutive space, which causes PyPDF2 to fail:

/PyPDF2/generic.py", line 256, in readFromStream
    return NumberObject(num)
ValueError: invalid literal for int() with base 10: ''

This should be supported anyway.
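For illustration, a reference reader that accepts any run of whitespace between the tokens could look like the sketch below. It is not PyPDF2's parser; parse_indirect_reference is a hypothetical name, and the regular expression's \s class is used as an approximation of PDF whitespace.

```python
import re

# object number, generation number, then the keyword R,
# separated by one or more whitespace characters
_INDIRECT_REF = re.compile(rb"\s*(\d+)\s+(\d+)\s+R\b")


def parse_indirect_reference(data):
    """Return (object_number, generation) parsed from bytes such as b'11  0 R'."""
    match = _INDIRECT_REF.match(data)
    if match is None:
        raise ValueError("not an indirect reference: %r" % (data,))
    return int(match.group(1)), int(match.group(2))


ref = parse_indirect_reference(b"11   0 R")
```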

Scaling in Python 2.6

I cannot seem to get scaling to work.
If I submit a float or int to "scaleBy":
TypeError: Cannot convert float to Decimal. First convert the float to a string
If I submit a string:
TypeError: can't multiply sequence by non-int of type 'float'
If I submit a Decimal:
TypeError: unsupported operand type(s) for *: 'float' and 'Decimal'
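The first error message itself hints at a conversion path: go through a string. The sketch below shows only that conversion; to_decimal is a hypothetical helper, and whether scaleBy then accepts the result depends on how the library mixes floats and Decimals internally, which this example does not address.

```python
from decimal import Decimal


def to_decimal(value):
    """Convert an int, float, or numeric string to Decimal via its string form."""
    if isinstance(value, Decimal):
        return value
    return Decimal(str(value))


factor = to_decimal(0.5)
width = to_decimal("612") * factor
```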

Add method ignoreText

Hi,

I have a PDF and I want to remove the text from it, so that only the images remain.

I see there is an ignoreLinks method on the PdfFileWriter object. Can you add an ignoreText method?

Or can you explain how I can do this?

Thanks.

retaining bookmarks using merge

When using the merge function with two files and using the import_bookmarks=True option, the bookmarks are always off by 1 page.

The issue is further compounded by different .pdf readers. I'm seeing in Adobe the bookmarks are off by 1 page (one page behind) and in other readers like PDF Complete - they are correct.

I made the following adjustment in the source code (merger.py) _associate_bookmarks_to_pages --
for p in pages:
    if bp.getObject() == p.pagedata.getObject():
        pageno = p.id-1 ########### the -1 was added

Everything looks great in Adobe, but now the file is off by 1 page in PDF Complete. Fortunately, I only support Adobe.

After further inspection -- although bookmarks work -- the bookmarks are highlighted incorrectly when scrolling through pages. They are off by 1.

I checked the file using the getOutlines() function and saw the file was structured incorrectly with the "/Page" key being off for each item:

Eg:
[......,{'/Title': u'Summary Graph', '/Left': 0, '/Type': '/XYZ', '/Top': 0, '/Zoom': 0, '/Page': 6}, .... ]

Should read this:
[....,{'/Title': u'Summary Graph', '/Left': 0, '/Type': '/XYZ', '/Top': 0, '/Zoom': 0, '/Page': 7}, ...]

And yes I do understand pages start at "0" !

What would I need to fix the root '/Page' key? Would someone be able to help me?

DCT Filter

PyPDF2 currently lacks a filter for DCT compression (true? Even as maintainers, we sometimes forget everything there is to know about PyPDF2). How important is it that we add this? There certainly are instances "in the wild" of PDF which use DCT compression; should we care?

[See also internal Issue756.]

PDF /PageLayout and /PageMode options

Hi,

I've been using PyPDF2 to merge some PDF files, adding bookmarks to the various pages as needed. I've been using the code below to set the initial view of the output PDF so that it shows one page at a time, and displays the bookmarks navigation panel.

from PyPDF2 import PdfFileWriter
from PyPDF2.generic import NameObject

pdf = PdfFileWriter()
root = pdf.getObject(pdf._root)  # resolve the document catalog
root.update({NameObject('/PageLayout'): NameObject('/SinglePage'), NameObject('/PageMode'): NameObject('/UseOutlines')})

I'm wondering if there would be any interest in writing this into a more formal method. Maybe something like:

pdf = PdfFileWriter()
pdf.page_layout = 'SinglePage'
pdf.page_mode = 'Bookmarks'

I'm happy to write this and submit a pull request, but I thought I'd get some feedback on the syntax.

In addition to this, it would be nice to be able to modify the author, title, etc. Maybe this is already possible and I've just missed it...

PyPDF2 failing at import

I am using PyPDF2 to extract text and geometry from a PDF. This is the code snippet from my PDFText.py file:

from PyPDF2 import PdfFileReader

When I run this, I get the error below:

Traceback (most recent call last):
  File "C:\Program Files\Microsoft Visual Studio 11.0\Common7\IDE\Extensions\Microsoft\Python Tools for Visual Studio\2.0\visualstudio_py_util.py", line 76, in exec_file
    exec(code_obj, global_variables)
  File "C:\Users\xxx\documents\visual studio 2012\Projects\PDFText\PDFText\PDFText.py", line 3, in <module>
    import PyPDF2
  File "C:\Python27\lib\site-packages\PyPDF2\__init__.py", line 1, in <module>
    from .pdf import PdfFileReader, PdfFileWriter
  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 56, in <module>
    from .generic import *
  File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 1049, in <module>
    u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'),
  File "C:\Python27\lib\site-packages\PyPDF2\utils.py", line 161, in u_
    return str(s, 'unicode_escape')
TypeError: str() takes at most 1 argument (2 given)
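The traceback shows the two-argument form str(s, 'unicode_escape') being called on Python 2.7, where the built-in str accepts a single argument. A version-aware helper is one plausible shape for a fix. The sketch below is an assumption about how such a helper could be written, not a copy of the library's code.

```python
import sys

if sys.version_info[0] < 3:
    def u_(s):
        # Python 2: decode the escaped byte string into a unicode object.
        return unicode(s, "unicode_escape")  # noqa: F821 (Python 2 only)
else:
    def u_(s):
        # Python 3: str literals are already unicode, so nothing to decode.
        return s


nul = u_("\u0000")
```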

Query - is there a way to bypass security restrictions on a pdf?

I have a pdf that has security restrictions. I need to merge some content into the secured pdf. I don't need the pdf to be secured after the merge.
When I open the file and check isEncrypted, it returns true.
When I try to decrypt with an empty string, a NotImplementedError is raised: "only algorithm code 1 and 2 are supported".

The restrictions on the file are shown below.
[Image: restrictions]

At the moment, to bypass the restrictions on the file, I print the pdf to images and create a new pdf with those images. This isn't ideal as the file size becomes large and the content isn't as crisp.

Is there a better way?

PyPDF2 does not work under pypy

NumberObject is initialized incorrectly:

class NumberObject(int, PdfObject):
    def __init__(self, value):
        int.__init__(value)

The correct version would be:

class NumberObject(int, PdfObject):
    def __init__(self, value):
        int.__init__(self, value)
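Since int is immutable, its value is fixed in __new__ rather than __init__, so another option is to override __new__. The sketch below drops the PdfObject base class so that it runs standalone; that omission and the int(value) coercion are assumptions made for this example.

```python
class NumberObject(int):
    """Stand-in for the real class; the PdfObject base is left out here."""

    def __new__(cls, value):
        # The integer value must be supplied when the instance is created.
        return int.__new__(cls, int(value))


obj_number = NumberObject("11")
```

This form behaves the same way on interpreters that are strict about extra arguments passed to __init__.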

PyPDF2 - AutoCad generated PDF and Watermark

Hi

Some time ago I reported a problem regarding AutoCad generated PDFs.
This problem was solved.

I have encountered a new problem which I believe is also related to AutoCad generated PDFs.

This time I'm adding a watermark to an existing PDF.
I am able to add this watermark file (created using pyfpdf) to most of the files:

            a = PdfFileReader(open(filein, "rb")).getPage(0)
            watermark   =  PdfFileReader(file(r'c:\temp\test.pdf','rb')).getPage(0)
            a.mergePage(watermark)

filein is an AutoCad generated PDF.

This fails:

a.mergePage(watermark)

  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1594, in mergePage
    self._mergePage(page2)
  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1644, in _mergePage
    originalContent, self.pdf))
  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1557, in _pushPopGS
    stream = ContentStream(contents, pdf)
  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1986, in __init__
    self.__parseContentStream(stream)
  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 2025, in __parseContentStream
    operands.append(readObject(stream, None))
  File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 55, in readObject
    return readStringFromStream(stream)
  File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 370, in readStringFromStream
    raise utils.PdfReadError("Unexpected escaped string")
PyPDF2.utils.PdfReadError: Unexpected escaped string

Looks very similar to the last problem I reported.

Olav

Will hang on invalid PDFs

Doing some testing, I noticed that PyPDF2 will hang if it encounters an invalid PDF… for example, the skipOverComment function:

def skipOverComment(stream):
    tok = stream.read(1)
    stream.seek(-1, 1)
    if tok == b_('%'):
        while tok not in (b_('\n'), b_('\r')):
            tok = stream.read(1)

Will hang indefinitely.

I would propose three courses of action:

  1. Wrap the stream in a wrapper class which will raise an exception after a certain number of empty reads; ex:

class SafeStream(object):
    def __init__(self, stream):
        self.stream = stream
        self.seek = stream.seek
        self.tell = stream.tell
        self._empty_reads = 0

    def read(self, *args):
        res = self.stream.read(*args)
        if not res:  # an empty str or bytes result means nothing was read
            self._empty_reads += 1
            if self._empty_reads > 1000:
                raise Exception("too many empty reads")
        else:
            self._empty_reads = 0
        return res

  2. Add a script for automating fuzz testing to the repo

  3. Fix the bugs as the script from step (2) finds them

What do you think? Would you be open to patches for those?
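The hang in skipOverComment comes from the loop never seeing a newline once read() starts returning an empty result at end of file. A local fix is to treat an empty read as a terminator. The sketch below uses plain bytes literals instead of the library's b_() helper and a snake_case name so that it stands alone.

```python
import io


def skip_over_comment(stream):
    """Advance past a '%' comment, stopping at a line ending or at end of file."""
    tok = stream.read(1)
    if not tok:
        return  # already at end of file, nothing to skip
    stream.seek(-1, 1)
    if tok == b"%":
        while tok not in (b"\n", b"\r"):
            tok = stream.read(1)
            if not tok:
                break  # end of file inside the comment


unterminated = io.BytesIO(b"% comment without a newline")
skip_over_comment(unterminated)

terminated = io.BytesIO(b"%c\nrest")
skip_over_comment(terminated)
remainder = terminated.read()
```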

Add method ignoreImage

Hi,

As in my last post, "Add method ignoreText", I need to extract only the text from a PDF. I tried several products for extracting text from PDFs, but they all return the text as a string; none of them keeps the text position and fonts. I think PyPDF is the right tool for this.

I added this method to class PdfFileWriter in pdf.py:

    def ignoreImage(self, ignoreByteStringObject=False):
        pages = self.getObject(self._pages)['/Kids']
        for j in range(len(pages)):
            page = pages[j]
            pageRef = self.getObject(page)
            content = pageRef['/Contents'].getObject()
            if not isinstance(content, ContentStream):
                content = ContentStream(content, pageRef)

            # Rebuild this page's operation list, dropping graphics operators.
            _operations = []
            seq_graphics = False
            for operands, operator in content.operations:
                if operator == "Tj":
                    text = operands[0]
                    if ignoreByteStringObject:
                        if not isinstance(text, TextStringObject):
                            operands[0] = TextStringObject()
                elif operator == "'":
                    text = operands[0]
                    if ignoreByteStringObject:
                        if not isinstance(text, TextStringObject):
                            operands[0] = TextStringObject()
                elif operator == '"':
                    text = operands[2]
                    if ignoreByteStringObject:
                        if not isinstance(text, TextStringObject):
                            operands[2] = TextStringObject()
                elif operator == "TJ":
                    for i in range(len(operands[0])):
                        if ignoreByteStringObject:
                            if not isinstance(operands[0][i], TextStringObject):
                                operands[0][i] = TextStringObject()

                # Track whether we are inside a q ... Q graphics-state block.
                if operator == 'q':
                    seq_graphics = True
                if operator == 'Q':
                    seq_graphics = False
                if seq_graphics:
                    if operator in ['cm', 'w', 'J', 'j', 'M', 'd', 'ri', 'i', 'gs',
                            'W','n', 'f', 'm', 'l', 'cm', 'Do', 'sh', 'S']:
                        continue
                if operator == 're':
                    continue
                _operations.append((operands, operator))

            content.operations = _operations
            pageRef.__setitem__(NameObject('/Contents'), content)

If you think this method is helpful, can you add it?

Thanks.

MergePage rotates 1 page relative to the other, in certain pdfs

I'm merging 2 pdfs using code that works correctly for other pdfs. I'm using the mergePage method to overlay the content from one pdf on the other pdf (merge page by page).
In the image below, the numbers (highlighted by red box) should be positioned vertically.

[Image: capture]

The "base pdf" is a scan from a Xerox WorkCentre 7435. The "secondary pdf" (containing the highlighted numbers) is generated using reportlab. The "base pdf" and "secondary pdf" have portrait orientation when viewing in a pdf viewer.
Other scans (from other scanners) merge correctly.

I don't know much about how pdf structure works, but is it possible the scan isn't including some data (orientation)?

I will try to include a problem PDF when I obtain one that doesn't contain sensitive information.
Thanks
Rob

PDF split with links

I have a 483 page PDF that I use for testing (manual). The problem is that when I try to split the document, it takes almost 2 min to process the first handful of pages, and then 3 seconds to process the remaining 450+.

Pages 3-6 contain a table of contents with links to other parts of the PDF. When I take these few pages out of the document, it takes 3-4 seconds to split the 483 pages.

Any ideas why it's hanging on the table of contents (with links)?

Encryption/Decryption in Python 3

This seems to be the only feature that doesn't work under Python 3. There are several encryption algorithms, it is probably just a matter of using utils.py correctly to avoid TypeErrors.

API compatibility with PyPDF

Hi,

Is PyPDF2 fully API compatible with PyPDF? I'm trying to get PyPDF in Fedora replaced by PyPDF2, but we must know whether it will break anything, or else fix the affected applications accordingly.

Thanks !

KeyError: '/Type' when merging pages

Merging 2 pdfs. The first pdf is from paperport 11 (some old program which may not support pdf structure correctly?), I initially needed to apply the fix from #34 (to fix EOF error). The next issue I encountered is in the method: _flatten (in pdf.py) where "/Type" isn't present in the pages dictionary.
I made the following change:

 def _flatten(self, pages=None, inherit=None, indirectRef=None):
       ... 
       ...
        #this is the change I made; default t = '/Pages'. Is this the correct thing to do?
        t = "/Pages"
        if "/Type" in pages:
            t = pages["/Type"]
        ...

Should I commit a fix for this (and make it conditional on strict parameter)? Or is there a better way to pick a type?
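An alternative to always defaulting to "/Pages" is to infer the node type from its structure: intermediate page-tree nodes carry a /Kids array, while leaf pages do not. The helper below is a sketch of that heuristic on plain dictionaries; node_type is a hypothetical name and this is not presented as the library's behaviour.

```python
def node_type(node):
    """Guess whether a page-tree node is '/Pages' or '/Page' when /Type is missing."""
    if "/Type" in node:
        return node["/Type"]
    # Assumption: only intermediate nodes have a /Kids entry.
    return "/Pages" if "/Kids" in node else "/Page"


inferred_tree = node_type({"/Kids": [], "/Count": 0})
inferred_leaf = node_type({"/Contents": None})
```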

Can't read pdf

I get a mysterious error with the PDF reader using python3 on the file
"Werner - Fragen und Antworten zu Werkstoffen.pdf".
My Code:

import fnmatch
import os
from PyPDF2 import PdfFileReader

for file in os.listdir('.'):
    if fnmatch.fnmatch(file,'*.pdf'):
        print("File: "+file)
        foo = PdfFileReader(open(file,"rb"))

Error:

File: Werner - Fragen und Antworten zu Werkstoffen.pdf
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    foo = PdfFileReader(open(file,"rb"))
  File "/usr/lib/python3.3/site-packages/PyPDF2/pdf.py", line 684, in __init__
    self.read(stream)
  File "/usr/lib/python3.3/site-packages/PyPDF2/pdf.py", line 1236, in read
    streamData = BytesIO(xrefstream.getData())
  File "/usr/lib/python3.3/site-packages/PyPDF2/generic.py", line 834, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 310, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 121, in decode
    rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
  File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 121, in <listcomp>
    rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
TypeError: ord() expected string of length 1, but int found

Is something wrong with my filename, or why does this error occur?
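The filename is not involved. On Python 3, iterating over a bytes object yields integers, so calling ord() on each element fails exactly as in the traceback. A version-neutral way to get the row values is to go through bytearray, as in this sketch (row_values is a hypothetical name; the slicing mirrors the line shown in the traceback).

```python
def row_values(data, row, rowlength):
    """Return the integer byte values of one row, on Python 2 and Python 3 alike."""
    chunk = data[row * rowlength:(row + 1) * rowlength]
    # bytearray yields ints when iterated on both major Python versions.
    return list(bytearray(chunk))


second_row = row_values(b"\x01\x02\x03\x04", 1, 2)
```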

int() got an unexpected keyword argument 'base' error at line 803 in pdf.py when using PyPDF2

I execute the following code in Visual Studio 2012 using Python Tools, IronPython 2.7 and PyPDF2 v1.20.

I get this error at line 803 in pdf.py: "int() got an unexpected keyword argument 'base'"

This is my complete code:

import clr
clr.AddReference('System.Drawing')
clr.AddReference('System.Windows.Forms')

from System.Drawing import *
from System.Windows.Forms import *
from PyPDF2 import PdfFileReader

class MyForm(Form):

    def __init__(self):
        # Create child controls and initialize form
        self.Text = "Test Project"
        self.Size = Size(600, 500)

        path = "F:/Download/RealPython.pdf"
        f = open(path)
        inputpdf = PdfFileReader(open(path, "rb"))
        page = inputpdf.getPage(8)
        pagecontent = page.extractText()

        display.mediaBox.upperRight = (
            display.mediaBox.getUpperRight_x() / 2,
            display.mediaBox.getUpperRight_y() / 2
        )

Application.EnableVisualStyles()
Application.SetCompatibleTextRenderingDefault(False)

form = MyForm()
Application.Run(form)

I read that PyPDF2 is written in pure Python, so it should run with any Python implementation; that is why I am using IronPython 2.7.

Can anyone help? :)
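The error message suggests that this interpreter's int() does not accept base as a keyword. Passing the base positionally is accepted by CPython and avoids the keyword altogether. The sketch below assumes the failing call parses a hexadecimal string; what line 803 actually parses is not shown in this report, and parse_hex is a hypothetical name.

```python
def parse_hex(text):
    """Parse a hexadecimal string, passing the base as a positional argument."""
    return int(text, 16)


value = parse_hex("ff")
```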

HTML links not clickable after merge

I have two PDFs to merge, one with HTML links and another with just plain watermarks.

After merging, the links are not working, and if I reverse the merge sequence, the watermarks hide the links.

Here is my code:

    bg = PdfFileReader(file("/tmp/bg.pdf", "rb")) #plain watermarks
    fg = PdfFileReader(file("/tmp/fg.pdf", "rb"))   #text with links

    page = bg.getPage(0)
    page.mergePage(fg.getPage(0))

    output = PdfFileWriter()
    output.addPage(page)

    ostream = file('/tmp/out.pdf', 'wb')
    output.write(ostream)
    ostream.close()

PyPDF2 bails out while parsing NameObject if it's standalone

When a standalone NameObject is encountered the parsing code raises an exception.

Reproducible with:
from PyPDF2.generic import readObject
from cStringIO import StringIO
print readObject(StringIO("/deviceRGB"), None)

PyPDF2 fails with PdfStreamError("Stream has ended unexpectedly").

Some of the PDFs generated with ImageMagick (image to PDF conversion) have this standalone "/deviceRGB", and it is not followed by a space or any of the delimiters. I have come across a couple of PDFs with this problem. Unfortunately I cannot send them (client data). I'll try to create such a PDF and attach it here.
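A name reader can avoid this failure by treating end of stream as a valid terminator, alongside whitespace and delimiter characters. The following is a standalone sketch rather than the library's readObject; read_name and the two constants are hypothetical names, with the character sets taken from the PDF syntax rules.

```python
import io

DELIMITERS = b"()<>[]{}/%"
WHITESPACE = b"\x00\t\n\x0c\r "


def read_name(stream):
    """Read a PDF name such as b'/deviceRGB', ending at a delimiter or at end of file."""
    if stream.read(1) != b"/":
        raise ValueError("a name object must start with '/'")
    name = b"/"
    while True:
        tok = stream.read(1)
        if not tok:
            break  # end of stream terminates the name
        if tok in WHITESPACE or tok in DELIMITERS:
            stream.seek(-1, 1)  # leave the terminator for the next reader
            break
        name += tok
    return name


standalone = read_name(io.BytesIO(b"/deviceRGB"))
```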

Python Version Compatibility

A new PyPDF2 branch 'Python3-3' has been created, incorporating William Culver's changes from his pull request #4 . However, it currently only completely works on Python 2.6 and 2.7.

PyPDF2 failing to read unicode character

I have a PDF whose text PdfFileReader is unable to read. Instead, this is the output:

u'\n˘ˇˆ˘ˇ˙˝˛˛˚˜ !!"#$%&"˝˛˝˘˛˘˛˚˙˘ˇ˝˛˘˛$\'(˘%˘ˇ˘ˆ˘)_)˛\'+,-)"˛./0"0!123˛"4˙"5)46)!6"˙˘˘˘,˘ˇˆ˙˙ˆ˝˛˚˜ !˘ˇˆ˙˝"" ˜#˝$˛˚˜ ˆ˙˝"" ˜ %˛˚˜ !˛˚ˇ!"#$%˘ˇ&ˆ˙˝˛˝ˆ˙&˚˝\'˛˚&\'()_ˇ+˙˝"" ˜#˝$˜#( ˛˚(ˇ+,˘˘˘ˇˆˆˆˇ,ˆ--ˆˇˇ˙˝˝% ˜)˜#_#˝$$˜  ˙ ˝_˛˚ˆ-&ˆ!ˆˇ&˘+$ˆ(˙˝+˚˜,!˛˚./&0ˆˆ+$ˆ(˙˝-˛-,&˘˝ˆ. ˚%˝% ˜)˜#\* ˜!˛˚&ˆˇ%ˆ!&(12+3ˇ˙˝,˜ˆ/˛˚%#"+3("ˆˇ.!ˆˇ43ˇ(˙-,&53ˇ6ˆˇ,˝˝% ˜)˜#\* ˜!˛˚(77777777777˜#( 0123& ˜"" ˜ %˛˚˜ 77777777777˜#( _ˆ_˛ ,4+#(56˝% ˜)˜#\* ˜!7  56 _˜ˆ(  %!_ˆ_˛ ˆ˙&˚˝\'586"ˇ+((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'()_&\'(_&\'()˘536((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'&\' &\'˜ ˜˙˚ˆ-",ˇˆˇ!ˆ-ˆ,ˆ&ˆ!ˆˇ&53ˇ6ˆˇ,(˙˚&ˆ!-ˇ!6ˆˇ,˘ 8-ˇˆ-˙˝˝% ˜)˜#_ ˜!7  ˛˚(˙˚9ˇˇˆ-6ˆˇ,:;ˇˇˆ-<ˆˆ-ˇ&\' ,,˘˘ˇˇˆ-(9ˆˇˆ-!˘ˇˆ9˘ˆˇ˘˘(\n\n'

This is the output of extractText(), and it does not raise any error.

A similar issue has been posted here:

http://stackoverflow.com/questions/15583535/how-to-extract-text-from-a-pdf-file-in-python
I am using Windows, so the solution in that link does not help.

Problem with AutoCad generated PDF

Hi

I am trying to use the pyPDF2 module to merge a lot of PDF files. For some of the PDF files it fails.
The failing PDF files are files generated directly from AutoCad.


Traceback (most recent call last):
  File "", line 37, in
  File "", line 29, in main
  File "C:\Python27\lib\site-packages\PyPDF2\merger.py", line 168, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\Python27\lib\site-packages\PyPDF2\merger.py", line 116, in merge
    pages = (0, pdfr.getNumPages())


My script:

import os

def main():
    from PyPDF2 import PdfFileReader, PdfFileMerger
    doclistdir = r'xxxxxxxxxxxxxxxxxx'
    doclistfile = open(r'xxxxxxxxxx\list.txt','r')
    doclist = doclistfile.readlines()
    merger = PdfFileMerger()

    for doc in doclist:
        # Join the directory and file name with the platform's separator.
        pdfdoc = os.path.join(doclistdir, doc.strip())
        mergerelement = open(pdfdoc,'rb')
        #print 'Processing:  ' + pdfdoc
        merger.append(mergerelement)

    output = open(os.path.join(doclistdir, "document-output.pdf"), "wb")
    merger.write(output)

if __name__ == '__main__':
    main()


regards
Olav

Bad arguments to str() in u_

*** Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32. ***

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    import PyPDF2
  File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\__init__.py", line 1, in <module>
    from .pdf import PdfFileReader, PdfFileWriter
  File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\pdf.py", line 56, in <module>
    from .generic import *
  File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\generic.py", line 1042, in <module>
    u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'),
  File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\utils.py", line 161, in u_
    return str(s, 'unicode_escape')
TypeError: str() takes at most 1 argument (2 given)

Whitespace issues in extract_text()

I am not able to read the extracted text because formatting and spaces are not handled properly during extraction:

PreemptiveInformationExtractionusingUnrestrictedRelationDiscoveryYusukeShinyamaSatoshiSekineNewYorkUniversity715,Broadway,7thFloorNewYork,NY,10003fyusuke,sekineg@cs.nyu.eduAbstractWearetryingtoextendtheboundaryofInformationExtraction(IE)systems.Ex-istingIEsystemsrequirealotoftimeandhumanefforttotuneforanewscenario.

Is it true that pypdf2 is not format aware as given here: http://victorwyee.com/python/convert-pdf-to-text-pypdf-pdfminer-first-impression/

No way to redirect warning messages to standard python logging implementation

PdfFileReader sends all warning messages to stderr (or some other file that you can specify). Normally, you can redirect warnings into the logging system by using logging.captureWarnings. PdfFileReader prevents this by replacing the warnings.showwarning function in its constructor.

The only problem is that this will break backwards compatibility for the PdfFileReader constructor.
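For reference, this is how the standard redirection works when nothing replaces the hook afterwards. The sketch uses only the standard library; ListHandler is a hypothetical handler that collects messages, and the warning text is made up for the example.

```python
import logging
import warnings


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list for inspection."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger = logging.getLogger("py.warnings")  # the logger used by captureWarnings
logger.addHandler(handler)
logging.captureWarnings(True)
try:
    with warnings.catch_warnings():
        warnings.simplefilter("always")
        warnings.warn("xref table not zero-indexed")
finally:
    logging.captureWarnings(False)
    logger.removeHandler(handler)
```

Any code that reassigns warnings.showwarning after captureWarnings(True) has been called bypasses this path, which matches the behaviour described above.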

Complete operator for method removeImages

Hi,

Thank you for adding the methods removeText and removeImages.
For the method removeImages, here is a small correction so that content is handled correctly.

                if operator in ['cm', 'w', 'J', 'j', 'M', 'd', 'ri', 'i',
                        'gs', 'W', 'b', 's', 'S', 'f', 'F', 'n', 'm', 'l',
                        'c', 'v', 'y', 'h' , 'B', 'Do', 'sh'] or \
                    operator in [b'cm', b'w', b'J', b'j', b'M', b'd', b'ri', b'i',
                        b'gs', b'W', b'b', b's', b'S', b'f', b'F', b'n', b'm', b'l',
                        b'c', b'v', b'y', b'h', b'B', b'Do', b'sh']:
                    continue

Pdf form overlap issue

When merging 2 pdfs, where one has form elements and the other does not; the "check box" form element overwrites any text that is present.

Is there a workaround for this?

The example below shows the text "blah" overlapped by check box element.
[Image: check box form element overlapping text]

PdfReadError: EOF marker not found

We are getting the error PdfReadError: EOF marker not found.

Scenario: we concatenate some PDFs using pyPDF. The input can be a princeXML-supplied PDF, a normal PDF, etc.
No issues here.

princeXML PDF + Adobe PDF = pyPDF-generated concatenated PDF. This works fine.

The issue happens when we take a pyPDF-concatenated PDF of the above type and concatenate it with another normal PDF, again using pyPDF.

princeXML PDF + some pyPDF-generated PDF = pyPDF-generated concatenated PDF (expected). This works in most cases, but in some cases it fails: pyPDF complains that the EOF marker is not found in the pyPDF-generated PDF. However, that file was generated by pyPDF itself. Did pyPDF fail to write the EOF marker in some strange cases?

Can anyone look at this bug? It happens quite rarely, but some online sites handle this same PDF well. How can I attach the PDF to GitHub for inspection?

A relevant question can be seen here :
http://stackoverflow.com/questions/15177587/merge-non-standard-pdfs-with-pypdf

PdfFileMerger.addBookmark() should return the newly added bookmark

PdfFileWriter.addBookmark() returns the newly added bookmark, so it can be used as the parent in subsequent addBookmark() calls in order to create nested bookmarks.

For consistency, PdfFileMerger.addBookmark() should function similarly, however it does not, as it doesn't return anything, thus making it impossible to create nested bookmarks with PdfFileMerger.

Wrong PDF generation on windows

The below code will generate an output, but the resulting PDF is not the expected concatenation of the two original pages. Same code works as intended on Linux.

import PyPDF2

pdfList = ['top_01.pdf','top_02.pdf']

def mergePDF():
        writer = PyPDF2.PdfFileWriter()
        for pdf in pdfList :
            f = open(pdf, 'rb')
            reader = PyPDF2.PdfFileReader(f)
            writer.addPage(reader.getPage(0))
        out = open('top.pdf', 'w')
        writer.write(out)
        #out.close()

mergePDF()
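One difference between the platforms that is worth ruling out: the output file is opened with mode 'w', which is text mode. On Windows, text mode translates newline bytes on write, which can corrupt binary data such as a PDF. The sketch below shows the binary-mode pattern with a stand-in payload; in the script above, writer.write(out) would take the place of out.write(payload).

```python
import os
import tempfile

# Stand-in bytes; a real run would call writer.write(out) instead.
payload = b"%PDF-1.3\n1 0 obj\n<< >>\nendobj\n%%EOF\n"

path = os.path.join(tempfile.mkdtemp(), "top.pdf")
with open(path, "wb") as out:  # "wb": no newline translation on any platform
    out.write(payload)

with open(path, "rb") as check:
    written = check.read()
```

Using a with block also closes the file, which the original script leaves commented out.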

Here are the links :
top.pdf

top_01.pdf

top_02.pdf

Encountered a valid PDF file, but PyPDF2 fails on it

That file can be decompressed by pdftk, but PyPDF2's FlateDecode fails:

  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1751, in mergePage
    self._mergePage(page2)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1801, in _mergePage
    originalContent, self.pdf))
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1714, in _pushPopGS
    stream = ContentStream(contents, pdf)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 2158, in __init__
    stream = BytesIO(b_(stream.getData()))
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/generic.py", line 850, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 310, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 102, in decode
    data = decompress(data)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 47, in decompress
    return zlib.decompress(data)
zlib.error: Error -5 while decompressing data: incomplete or truncated stream

Here is the data to be decompressed (printed with repr):

'H\x89\xedW\xcbr\x1b\xb7\x12\xdd\xf3+f)W\x85#\xbc\x1fZ]\x8a\x0f\xc5%\x9a\x94I:\xded\xa3X\xd4\xe3F\x12mIN\xe2\xbf\xbf\xc0\x003\x83\x19\x00M\xd9I*Qr7*q\x80n4\xbaO\x9f>8\xde\x0c\x0eg\xb8\xc0\xa8D\xaa\xd8\\\x0e\x84\x94%Q\x05\xd3\xa4\xd4\xa2\xd8L\x8a\x83\xc9\xeeU\xb1\xf9\xef`hvp\xb3\xb2\xf9P\x1c|8\xaa>\x1d\xceHk\x88UIX\x81\xac\t\xaa6aM\x84\x92\xc4\xef4G\xe0\x121\xbbsH\x15/5)8-\xa5;b\xd4\x9c\xa0\xb4?\xe2\xa6\xfa\x84J\x8c\x04r_\x1e\xb2\x9b\xaa\xff\xef\xaf\xeau\x8c%&\xd5\xb7\xa2\x8d\x9c\x0bg2\tL\xcec\x8b\xa7`y\xfb\x18\xaf\x1f\x15\xc1\x06\x8a\xa3 \x87\x8d\tB\xd4\x99\xbc\x89N\xec\xdcj\x18,\x13\x84Y\xee\x16n\xc7f\x07\xaf\x13\x99\t\xc9-\x8f>>3\x80\xaa`\xc4V\x8b\xf0\xd2T\xb9\x02\x85)U\x95~V\xed]\x07v\xd7;\xef\xb7^l\x8beQd\x13\x1bF9Kow\x8bw}\xd3\xd0\xf2\xcd\xf6\xe2\xe6<\xda\xb0\xaeo\xe5\x12\xf2!\x8cl{\xf1\xf9v[}\x98n\x06\x9f\n\xac\x95\xc1\'*\x86D\x95\xaa\xa8\xfep\xa5\x8a\x0fw\xc5\xe1\xcd\x1d.&\xbb\xe2\xed\xe0\xb8\xd7\x13D\x97\xdc\xc0\x95\xc9RTI8 X\xd2\xa6\x0eT\x11\x16\xc5\xb9>on\xc8\x99[\xbe\x8d\xda\xe8\xe7\xa6\x18X\xc6W\x1dE\xfb\x7f\t\xc1\x19\xd9~\xb7\'\xa2\xcfQD7O\x91\xc3\xac9E\x08\xde0\x0e\xba\tw\xcb\n&\xe11\xf0\xf1\xd3\xf9E|\xad\xce!#8\x08M\x10R\xd5on\xc9fhHI\x92\xaad\xe38\x94\x90=\xb6\xf7\xd1\xc1OG~\xddxiYm\xdc\x14Vr*\x02\xc0\xba\xe5\x8fq\xdd\xc3c\xae\xee\xe3$\xfdx\x90\xc9Iw\xd7+\x17\x8e\xb9N\x18\xcer\xb4\x8e\x02\x7f\x1dV\x1d\xce\xd7|z2\x9a\xc3[f\xa0\xff\xc5h1\x9e\xd6\x0e\xa8L9X\x9eMW{\xa0\xb3\xdc\x13\xe5|\x9c\xd8\xe0\xf3\xe6\x0f\x99\xcf\x16?\xbe\xea\x16,\x9d\x1c\x9f\x11\x93\xc7z[L\xd0\t\x10\'\xfa6\xd3\x87OM\xa1\x82a8\x9a\xc1y\x9c\x83\xab\xa3\xcd\x9e\x1c/\xa6\x9b\xda\x01V\x9e\xd8\xdf\x87\xe9Y\x9d\xd6\x14\x88J\xa6\x8a_\x0b\xc2lg0D\xed$\x96\x0c\x97\xe6\xb6RX\xc2{\xd8\x0e\xd6\x96\x02)+\xb9(\x98\xe0\xa5\x1b\xd2q.\x9f\xa2\xc4uVC\xe0\xef\xee\xbbY98^\xbe\xab\x91\x13\x13`\'\xff\xc3\xf0\xa2\x81\x99+!\xad9y\xbcK\xb4p\xaa\xb2\xad\x8ay^\xe9\xc6\xcb\xf9|:]\x80\\}:\x1f\x9dLCd5\xea\xe9\xeczw\xbf\xed\xbaMu\xbfs#\x98Hu>x4c*\x91\xb3`\x834c2\xf6\xd0i\xf187\xbf=\xa72\xcf\x8f\xddE\xaa\xd0\xdeHu\
xaa\xb45L\xd2\x94\x9b`\xdc\xdb(\xe1\xfd\xa2;P\xc63\xe16l\xed\xed}\x94\x97\x9fc\x930\x94\xabmd\x11\xce\xb7\xffd.\x92\x18\xf2\x99\xcbn/\xe2\xdb\x82 \x7f\x043\xd5\xb9n\xce6\x00A\x99i"\xf3\x85H\xc1\xdc\xa7\x9d\x0b\x98RU2\xc30%%m3\x1d\x7f\x8e<|I\x8a\x90\x96~]-)\xea=/&\xab\x86=z\x94g\xcc\x05N\xb1\xe4\xe9\xbb\x15H\xb3\'\xcd](\xf5\xcb~tifo\xb1\xa7{\xa5H\xe0{O\xf7\nDc\x9b\xb0\'0g\xb8\x19kMGLrc\xca\xa7\xf1\xd7L\xe5|\x1e>g\xcds\xba%U\x98\x10\xd8\xd71\xf2B\xb5\n\xe3r{q\x15\x8f\xd1\x1c\xd4P\x881C\xb4\xdc>"\x03\xc2\xdd\xc0\x91\xc6s\xea\xf2\xf2\xa6\xa9\xbb\xee>p\xbd\xf7\xbf\xa6\xee\x84\xf8+Ra\xc1\x17T?~\x19<>f\xca\x07\xb7sg>\xb6\xd2\xb4\xd1X~\x1e\xb2\xce<\\\xedr\x95\xc9\x80\xa7\xd9\x9f\xd6.\x9b\xd8_\x07:\x0f\x7fr\x07j\x02W\x82b\x8d\xe2\x0e\xfc\xc6,\xe4Zr_\x12\xfe\xa6\xed\xc7\x8dH4}7$\xc4\xb3\xfc\xc1\xfac<G\xe0\xc4$\xa6R\x11\xb2r<\x84s@\xce\xe6\'\x1e:\t\xed\xd7\x95\xab\xf1\xa1\xae\x17\x04\xd2V*c\xaa=\x12|S\xc4\xb7\xfe%\xee\xa7\x10w\x93\x84b\x8e&a\xb8z\x14Z\xa3=\xa2o\xf41\x91\x91p\x03A\x98\xc1\x01z\x86\xea\x84\x10,s\x04\x9b\x9f\xb9g\x95\x96\xf6\x811\xa4\x8e\x9c\xb3i\xea^\xd4+I\x82dI-\xb9\x1b\x11\xc1\x82\xa7ZgB\x8f\xdc\x84f\x8e#q\xc9\x03\xb5A\xdd\xe8\xd4\xc8\xbeu\x9a\xb5\x96?[/g\x80\x17\xe2\xbc\x0c\x85\xb0X\x7f\x8e\x1b?\x93\x9a\x9d\xb2\xfaj\x86I\x10\x85\xaf\xd3\xd7\x85R\xd7\x8c\x0bT*\x15\xc42\x89[\xfc\xcbc="\xadrj\'d(-\xdf\x84\x14T\x17"\xf9\x82\xdc\xcc\xd6\xee}\x83\r\x07"\xaf\xc4\x0e~\xc8\xea\x82T\xbff\x88\xf1\xde\xfbU\xf6\xf5\xe9\xfc\xce\xbd\xf4&f\xf6\xa9\xfa\xe3*\xd11\xee|\xf3\xac\x15\x8a[\xa0QA\xac\xc1]\xf3\x9b\xe2\xd2\x9cy\xeb\x9e\xb4\x92T\xab\xd46o\xd0\xbb_\xdb\x89\x89\xe9\x95\x9d\xaa\xb7\x1e\xca\xd4B\xa2U)M\xdd\x8d\xbe\xe3\xce\xec\xc6\'.Y\xf8\xe7\x1dc\xf2`r^\x1a\x1d\xde\xe4\xa1\xfe\xdd\xe6\xe1S\x81\xcdo"LKI\xa6\xed)\x15\xa3=l\x8b\xeb\xe2}q?\xc0:H\xdc\xdd@\x1aO]\x07o\x8bO\x85I&QU\x88m23\xaaz=v\xbf\xb8nqsv\xbd\xbbwy\xe5\xaa\x99\xeb\x07cx\x08\xee\xd1\xa0\xf5\xe3\x8bR\xeaXC\xd9\xae\xb7~Y\x94\xbb\xefB\x9eC\x9e\xc9$"e\xfdd\xfc\xd6&\xc9\xb2r\x08\x1f\xf7\x89U\x13$\x10VH#4\xca:p[\x04\xe
2gn\x0eU\x97ty[,\xe3\xa8\x8a0\xae\xefG\x1b\x10\xc4\x98 p\xfd\x07\xec[T\x94\xa2~\x84e\xc6B\xed\x8a\x18X\xd4\xd2\xd9\x0c\x1b\xd4\xc7\xa7y\xcf\x99Vh\xf1\x89\xb9\x05\xa1\xefS\xa1u\x89\r\xb4H\xc5r\xceC\xbf\xd1k\x07\xf5\xef\x8e\x03nN\'*\xe1`\x88\xb5j\x8a|\xa0PB/\x86\xf9\xe6b\x8fLG\xba\x96\x8f!\xbc\x11\xc2pB\xd7\xc7\x94H\xe1\x93*\xdbJ\xce\xb3\xaa2l\xf3\x06\xe1\xdcf\xc9\xda\xc5\xa3:D8\xe1\t\x84\xef\x03\xe9\xd7O\x86\x00\xd0\x16\xa8\xe8\x0cN\x9dA\n\x1f\xbd<,S\xc2\xfbXv\xd0k\xb1\x8c*h\xf6\xb0\x8c*Q`=\xe0\x08\xcb\xb5\x83\xfaw\xc7A\x8d\xe5\xc8\xc1?\x03\xcbF\xdb\xf1\x0e\x96\x1b\xacVs\xe8w\xb1q\x02\x98\xab\xe5\xfae`\x8e\xb1V\x0eE\xec\x89\xba\xecITH~\x12a\x0b\x18bd\xafp\xb2\x99\xf7\x01\x87\xba\xe4\xd9\xb5\xe7\x92\xdaZD\xf6F\xc4\x10[\xab\x97\x8d737L\xc6\x02\xbca\x98<y}O\xdb\xc8\x8a\xfd_\x1e\xfc\t\x94\x1a\xc9\x03\x0f\xc8\x16\xe02A\xa9D\xf2\xfa\xf9\xc3\xfa\x08o\x1c\xd4\xbf;\x0e\xb8\xfb\x15;0\x10\xb7O\xcd\x17\x8e\xf0\xbe:  \xc0\xd5\x1f\xa8\x7f\x9f\x85\xef\xdf\xd5\x04\xf49M\x80^\xa2\xae\xe0Q\x13\xc8\x1e\xcb\xf3\x14\xcb\x0b\x0b\xd8\n\xae\xfd\x16\x90=\x92\xe7)\x92\xef\x99\xff\xcd9~q\x1e\x99=\xdd\x04v\xbb{\xdf\x04\x873R`d\'\xe8\xe6r`\x84\xa6M\xc0\x103\x7fW\xe1\xf7\xe0v\x0f&\xd2\xbeo\xdd1\x18l\x18\x9d\x92\xd3/j"lF\xa7S0\xff\xe3\xe5b\xb3Z\xce\x1b\xdfT\x11\x96\xed\x88?PJ{\x8c\xb6\x90g!m\xd7\x90g"\xa8b\x00x\xde\xe3\xfc\x8e1Q\xd8\x92}m\x8c\x10\x82!\xb8\x19\x9b\x98\x17\xdfJ\xc2i\xfc5\x8c\xfco\x07\xe0d5\x8a\xc3z\x1d\x98-N\xc0\xe2|?ZL\xfeZ\xb6\x8e\xa1\xcbzlMRlm\xe4\x85Ri\xf4\xb2\x1e]w\xeck\xf4\xb6\xf6{\x01\xbc\x9c\x8d7\xf8\x9b\x01l\x02#,\xb8\xc7\xdd@\nQ\xf6\xafu8\xa3\x85\xb4\xd1\x1a\x00ss\x88\xb0y(\x89c\xd9\xb3\xa5y\xa1\x12\xa1$\x196\x97e\x92\xb9\x07L\x85\xdb\xca\xbf\xb07z\xd8\x16\x97f\xb1\xaaF\xbd(IwQ\xda\x83\xdd"\xae2\x14,*dO\xcd-\x12\xc0\xad\xa2\xad\xdbx\x91\xdb\xa1\x95s[5lnQ\x03n5i\xa3\x8d\x17)\xe0Vs\xc8\xad\x04\xa25\x9a\x99\xe5,\xb9Ag\xd6-G\xb4u\x1b/r\xc8-PO\x8e\x80zr\x0c\xd4\x93\xe3\xa0\x9e\xf1"\x07\xa2\xc5\x12\x88\x16+ 
Z\xac\x81h\tPON\x80zr"\x81h\r\x91\xe4\x93`>\xe7\xa3\xa5P=)\xcd\xc3\xc4L\x10\xc8\xad\x84\xdcj\xc0-\xc3@\xe2\x19TO&\x80$0\xa8d\x0chA\xce1\x10-\xa7@\xb4\x1chA\xce\x81\x16\xe4\x1cjA\x01\xd5S@\xf5\x14\x0c@\x9f\xe0\x00\xfa\x04\xd4\x82\x12\x01n%\x06\xdcJ\xa8\x05\xa5\x80\xdc\x02\x94\xca%TO\x05\xb5\xa0\x82\xea\xa9\x18\x80>\x05\xd5SA\xf5\xd4\x08p\xab\x19\x00j\r\xb5\xa0\x86ZP\xeb|\xe2\x05\x02ZP \xa0\x05\x8d\x82\x83\xdc\x02\xfd)\x10\xd0\x9f\x02c\xc0-&\x80[\x0c\x94L`\x80R+\xc9\xc2\x1a\xc9Bt%\xaa\x84\x7f\x16\x8e\x1f\xbc\xa0BR\x0b\xa7\x8c\xb6\xa12z\x8a\x97/\x8a\x9f\xbe\x04[\x8e\xea\x1d\x9a\xe0P\x91Yy\x8c\x04%\xd5\xa7I`\xb0\xaa\xdfaHI\xec\x96\xdf\x87"\xee\xf4\xdd\xaav@0\xf3\x0e^7\xba\xcf\xfd>\t\x0c\xa6\xab8\x84"\xf4\xc8\x0e\xd5!A\x98e"u\x9b\xe8\x11GG\x8c\x83{\xce\xde\xc4\x8f\x18a\x80\xcd\x85y\xe9X\xa8U\x1a\xf0\xfcj\x0b\x0bu\xf8\x8d\xb9\x8b\x8c/\x0bR\x8b\xc9\xb7\xc5\xff\x00L\xc2\xa0'

The PDF file also opens correctly in OS X Preview.
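The traceback shows zlib.decompress() rejecting the stream as incomplete, while other tools apparently tolerate it. One way a reader could be made more lenient, offered as a sketch and not as PyPDF2's actual fix, is to fall back to zlib.decompressobj(), which returns whatever it managed to decode instead of raising on a truncated stream. The function name `lenient_decompress` and the sample data are made up for illustration.

```python
import zlib


def lenient_decompress(data: bytes) -> bytes:
    """Decompress zlib data, returning partial output for truncated streams.

    Hypothetical helper: tries the strict path first, then falls back to a
    streaming decompressor that does not require a complete stream.
    """
    try:
        return zlib.decompress(data)
    except zlib.error:
        decompressor = zlib.decompressobj()
        result = decompressor.decompress(data)
        result += decompressor.flush()
        return result


original = b"BT /F1 12 Tf (Hello) Tj ET\n" * 50
compressed = zlib.compress(original)
truncated = compressed[:-4]  # drop the trailing checksum bytes

try:
    zlib.decompress(truncated)
except zlib.error as exc:
    print("strict decompress failed:", exc)

recovered = lenient_decompress(truncated)
print(len(recovered), "bytes recovered")
```

The recovered bytes are not verified against the checksum, so a caller using this approach would be trading integrity checking for tolerance of damaged or sloppily written files.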

startxref is not necessarily on a different line from the location of the xref table

The PDF spec seems to require that the startxref keyword and the byte offset to the xref table be on different lines.

However, in the wild, I have found otherwise valid PDFs where the startxref keyword and the byte offset to the xref table are on the same line, like so:

...
0 8
0000000000 65535 f
0000000009 00000 n
0000305603 00000 n
0000305652 00000 n
0000000083 00000 n
0000305310 00000 n
0000305405 00000 n
0000305423 00000 n
trailer
<<
/Size 8
/Root 2 0 R
/Info 1 0 R
>>
startxref 305711
%%EOF

See here for example: https://www.docketalarm.com/cases/PTAB/IPR2014-00358/Inter_Partes_Review_of_U.S._Reissue_Pat._RE043707/docs/01-17-2014-PET-1193/Power_of_Attorney-2-Power_of_Attorney.pdf
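A tolerant reader could accept both layouts by treating any run of whitespace between the keyword and the offset as a separator. The following is a minimal sketch of that idea using a regular expression; `find_startxref` is a hypothetical helper, not the PyPDF2 implementation, and it assumes the tail of the file is already in memory.

```python
import re

# Keyword, then any whitespace (space or line break), then the offset,
# followed by the end-of-file marker.
_STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s*%%EOF")


def find_startxref(tail: bytes) -> int:
    """Return the xref offset from the last startxref entry in `tail`."""
    matches = _STARTXREF_RE.findall(tail)
    if not matches:
        raise ValueError("startxref entry not found")
    # Incrementally updated files can contain several; the last one wins.
    return int(matches[-1])


same_line = b"trailer\n<<\n/Size 8\n>>\nstartxref 305711\n%%EOF\n"
separate_lines = b"trailer\n<<\n/Size 8\n>>\nstartxref\n305711\n%%EOF\n"

print(find_startxref(same_line))       # 305711
print(find_startxref(separate_lines))  # 305711
```

Matching on whitespace in general also covers files that use CR or CR LF line endings, at the cost of being more permissive than the layout the spec describes.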
