py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Home Page: https://pypdf.readthedocs.io/en/latest/

License: Other

Python 99.93% Makefile 0.07% Shell 0.01%

help-wanted pdf pdf-documents pdf-manipulation pdf-parser pdf-parsing pypdf2 python

pypdf's Introduction

py-pdf.github.io

Website py-pdf

Install requirements

$ pip install -r requirements.txt
$ pre-commit install

Launch local server with livereload

$ invoke livereload

Adding a Python dependency

Edit requirements.in
Run pip-compile requirements.in to generate requirements.txt

Publish

$ make github

pypdf's People

Contributors

personalnovel kushal-kumaran pythoning mozbugbox jamadden ccurvey adammorris jerem thinker007 arneboon samrussell cro martijnthe mspisars haoshuji mvanderkolff nithinpb josephw trayor inductiveload dooper87 cecilkorik mouthwateringmedia onenick972 zjxtx4431 duedil-ltd jeansch tramzzz exitio thevladsoft jamma313 kvbik nvictus zejn dreispt pacoqueen tiagosab potatoym guywillett speedplane rhyspowell wearefaces agilentia juanbits snorfalorpagus purcaro andycasey lazyfunctor ascii1011 theapachecats bartoreebbo bdbaddog 171230839 haiiiiiyun dylanmc hsonntag sshekh chrishiestand kewisch mattxbart usgm twac liminggang switham necross neomanic rob1080 jeroanan talumbau hihihippp lentinj ulion a-yasui zed9 cemmanouilidis carlosfunk moubachiryounes wolever penriister sdpython egbutter synchroack h4ck3rm1k3 jasonbot caxap krikunts musray kevinlowrie vinhphu1711 letolab lothilius pyhunterpig madjar ovnicraft oyv qingzhu henrykeiter jp41 tom-kerr b-rich

pypdf's Issues

Can't getData() from /Contents List

I'm trying to dig deep into some PDFs by calling getData directly on part of a page (I am then parsing that data to find coordinates for a bit of text).

This worked for me in the past with essentially:

page = PdfFileReader(inpdf).getPage(0)
text = page.getContents().getData()   #<-- or page["/Contents"].getData()

but with my new PDFs, I am getting an error like this:
"AttributeError: 'ArrayObject' object has no attribute 'getData'"

Digging in, it looks like my old PDF was structured like this (print page) with a single IndirectObject in the contents.

{'/Contents': IndirectObject(14, 0),
 '/MediaBox': [0, 0, 662.40000, 792],
 '/Parent': IndirectObject(1, 0),
 '/Resources': {'/Font': {'/F3': IndirectObject(10, 0),
                          '/F4': IndirectObject(7, 0),
                          '/F5': IndirectObject(4, 0)},
                '/ProcSet': IndirectObject(13, 0),
                '/XObject': {}},
 '/Type': '/Page'}

Then page.getContents() returns:

{'/Filter': '/FlateDecode'}

while my new PDF is structured like this with a list of IndirectObjects in the contents:

{'/Contents': [IndirectObject(11, 0),
               IndirectObject(12, 0),
               IndirectObject(13, 0),
               IndirectObject(14, 0),
               IndirectObject(15, 0),
               IndirectObject(16, 0),
               IndirectObject(17, 0),
               IndirectObject(18, 0)],
 '/CropBox': [0, 0, 612, 792],
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': IndirectObject(5, 0),
 '/Resources': {'/Font': {'/F3': IndirectObject(24, 0),
                          '/F4': IndirectObject(26, 0),
                          '/F6': IndirectObject(29, 0),
                          '/F7': IndirectObject(30, 0)},
                '/ProcSet': IndirectObject(31, 0),
                '/XObject': {}},
 '/Rotate': 0,
 '/Type': '/Page'}

then page.getContents() returns:

[IndirectObject(11, 0),
 IndirectObject(12, 0),
 IndirectObject(13, 0),
 IndirectObject(14, 0),
 IndirectObject(15, 0),
 IndirectObject(16, 0),
 IndirectObject(17, 0),
 IndirectObject(18, 0)]

How do I get at the underlying data of /Contents? Going after the pieces of the list with page.getContents()[0] just returns the name of the object, and I can't use getData() on that. I can't tell whether this is a bug (caused by having a list as the contents) or whether I am missing some feature.
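
One possible way to handle both shapes is sketched below. It resolves every entry of the list and joins the decoded streams, since a /Contents array is meant to be read as one stream. `FakeStream`, `FakeRef`, and `contents_data` are hypothetical names used only for illustration; the sketch assumes nothing about PyPDF2 beyond the `getObject()` and `getData()` calls already shown in this report.

```python
class FakeStream:
    """Stand-in for a decoded stream object (hypothetical, not PyPDF2)."""

    def __init__(self, data):
        self._data = data

    def getObject(self):
        return self

    def getData(self):
        return self._data


class FakeRef:
    """Stand-in for an IndirectObject that resolves to a stream."""

    def __init__(self, target):
        self._target = target

    def getObject(self):
        return self._target


def contents_data(contents):
    """Return the page content bytes for a single stream or a list of them."""
    if hasattr(contents, "getObject"):
        contents = contents.getObject()  # resolve an indirect reference
    if isinstance(contents, (list, tuple)):
        # Resolve each reference, then join the pieces. Streams in an array
        # may only be split between tokens, so a newline separator is safe.
        return b"\n".join(item.getObject().getData() for item in contents)
    return contents.getData()


single = FakeRef(FakeStream(b"BT ET"))
many = [FakeRef(FakeStream(b"BT")), FakeRef(FakeStream(b"ET"))]
print(contents_data(single))
print(contents_data(many))
```

With real PyPDF2 objects the same idea would be applied to the value of page["/Contents"]; whether the library already offers a helper for this is not stated in the report.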

PyPDF2 should not overwrite warnings.formatwarning.

Hello,

PyPDF2 1.2.0 overwrites warnings.formatwarning with its own implementation (utils._formatwarning) in pdf.py line 74:

warnings.formatwarning = utils._formatwarning

Unfortunately this may cause severe side-effects if PyPDF2 is imported in a larger application. In our case the PyPDF2 implementation of formatwarning caused IndexErrors whenever a warning was raised somewhere else (and the filename argument was not to the formatter's liking).

Personally, I do not think that it is a good idea for a library to interfere with the global logging/warning infrastructure.

P.S.: Apart from this problem, we have been using PyPDF2 successfully for some time now. Nice piece of software!
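
For illustration, a library can emit warnings under its own category and leave the global formatter alone, so the application decides how they are displayed. This is a minimal sketch using only the standard `warnings` module; `PdfReadWarning` and `parse_something` are hypothetical stand-ins here, not the PyPDF2 implementation.

```python
import warnings


class PdfReadWarning(UserWarning):
    """Library-specific warning category (hypothetical stand-in)."""


def parse_something():
    # The library only emits the warning; it does not touch
    # warnings.formatwarning or warnings.showwarning.
    warnings.warn("Xref table not zero-indexed", PdfReadWarning, stacklevel=2)


def emit_and_capture():
    # Application side: capture or filter by category as needed.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        parse_something()
    return caught


for item in emit_and_capture():
    print(item.category.__name__, item.message)
```

An application that wants these warnings silenced could then call warnings.simplefilter("ignore", PdfReadWarning) without affecting warnings from other packages.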

Speed up parser

Currently the parser is quite slow, even for moderately sized PDFs. When I get a bit of time, I'm going to investigate different ways it could be sped up. Right now (pending some profiling, obviously) I suspect this is going to involve re-writing some of the core parser loops in something lower level like Cython. I'm looking into options to see if it's possible to write in a language which will be able to compile back to vanilla Python for the benefit of PyPy and friends.

I'm opening this issue to start discussion on the matter, and see if you've got any strong feelings either way.

Some valid but nonstandard indirect objects cause PyPDF2 failure

The issue is something like this: /FontFile2 11 0 R

There is more than one space between the tokens, which causes PyPDF2 to fail:

/PyPDF2/generic.py", line 256, in readFromStream
    return NumberObject(num)
ValueError: invalid literal for int() with base 10: ''

This should be supported anyway.
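
A whitespace-tolerant way to read such a reference is sketched below with a regular expression. `parse_indirect_reference` is a hypothetical helper, not the parser PyPDF2 uses; it only shows that accepting one or more whitespace characters between the tokens avoids the empty-string `int()` call from the traceback.

```python
import re

# object number, generation number, then the keyword R,
# separated by one or more whitespace characters
INDIRECT_REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def parse_indirect_reference(token):
    """Return (object_number, generation) from bytes such as b'11 0 R'."""
    match = INDIRECT_REF.match(token.strip())
    if match is None:
        raise ValueError("not an indirect reference: %r" % (token,))
    return int(match.group(1)), int(match.group(2))


print(parse_indirect_reference(b"11 0 R"))
print(parse_indirect_reference(b"11   0  R"))
```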

Scaling in python 2.6

I cannot seem to get scaling to work.
If I submit a float or int to "scaleBy":
TypeError: Cannot convert float to Decimal. First convert the float to a string
If I submit a string:
TypeError: can't multiply sequence by non-int of type 'float'
If I submit a Decimal:
TypeError: unsupported operand type(s) for *: 'float' and 'Decimal'
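
The three errors above all come from mixing float, str, and Decimal in one multiplication. A common workaround, sketched here with hypothetical helpers `to_decimal` and `scale_box` (not part of PyPDF2), is to convert every operand to Decimal through its string form before multiplying. Going through str is also what the first error message asks for.

```python
from decimal import Decimal


def to_decimal(value):
    """Convert int, float, str, or Decimal to Decimal via its string form."""
    if isinstance(value, Decimal):
        return value
    return Decimal(str(value))


def scale_box(box, factor):
    """Scale every coordinate of a box by factor using Decimal arithmetic."""
    factor = to_decimal(factor)
    return [to_decimal(coordinate) * factor for coordinate in box]


print(scale_box([0, 0, 612, 792], 0.5))
```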

"file has not been decrypted" error

Exception occurs when reading certain valid pdfs

File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1019, in getObject
raise Exception, "file has not been decrypted"

PDF: http://dis.puc.state.oh.us/ViewImage.aspx?CMID=A1001001A13L20B64142B89320

Add method ignoreText

Hi,

I have a PDF and I want to remove the text from it, keeping only the images.

I see there is a method removeLinks... no: I see there is a method ignoreLinks for the PdfFileWriter object. Can you add a method ignoreText?

Or explain how I can do this?

Thanks.

retaining bookmarks using merge

When using the merge function with two files and using the import_bookmarks=True option, the bookmarks are always off by 1 page.

The issue is further compounded by different .pdf readers. I'm seeing in Adobe the bookmarks are off by 1 page (one page behind) and in other readers like PDF Complete - they are correct.

I made the following adjustment in the source code (merger.py) _associate_bookmarks_to_pages --
for p in pages:
    if bp.getObject() == p.pagedata.getObject():
        pageno = p.id-1 ########### the -1 was added

Everything looks great in Adobe, but now in PDF Complete the file is off by 1 page... fortunately I only support Adobe.

After further inspection -- although bookmarks work -- the bookmarks are highlighted incorrectly when scrolling through pages. They are off by 1.

I checked the file using the getOutlines() function and saw the file was structured incorrectly with the "/Page" key being off for each item:

Eg:
[......,{'/Title': u'Summary Graph', '/Left': 0, '/Type': '/XYZ', '/Top': 0, '/Zoom': 0, '/Page': 6}, .... ]

Should read this:
[....,{'/Title': u'Summary Graph', '/Left': 0, '/Type': '/XYZ', '/Top': 0, '/Zoom': 0, '/Page': 7}, ...]

And yes I do understand pages start at "0" !

What would I need to fix the root '/Page' key? Would someone be able to help me?

DCT Filter

PyPDF2 currently lacks a filter for DCT compression (true? Even as maintainers, we sometimes forget everything there is to know about PyPDF2). How important is it that we add this? There certainly are instances "in the wild" of PDF which use DCT compression; should we care?

[See also internal Issue756.]

Edit document info with pypdf2

It would be nice if pypdf could edit the document meta information. Is anything like this planned?

PDF /PageLayout and /PageMode options

Hi,

I've been using PyPDF2 to merge some PDF files, adding bookmarks to the various pages as needed. I've been using the code below to set the initial view of the output PDF so that it shows one page at a time, and displays the bookmarks navigation panel.

pdf = PdfFileWriter()
root = pdf.getObject(pdf._root)
root.update({NameObject('/PageLayout'): NameObject('/SinglePage'), NameObject('/PageMode'): NameObject('/UseOutlines')})

I'm wondering if there would be any interest in writing this into a more formal method. Maybe something like:

pdf = PdfFileWriter()
pdf.page_layout = 'SinglePage'
pdf.page_mode = 'Bookmarks'

I'm happy to write this and submit a pull request, but I thought I'd get some feedback on the syntax first.

In addition to this, it would be nice to be able to modify the author, title, etc. Maybe this is already possible and I've just missed it...
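
To make the proposal concrete, here is a rough sketch of the validation such a method could perform. It works on a plain dict standing in for the document catalog; `set_page_layout` and `set_page_mode` are hypothetical names, and the accepted values are the /PageLayout and /PageMode names defined by the PDF specification.

```python
PAGE_LAYOUTS = frozenset([
    "SinglePage", "OneColumn", "TwoColumnLeft",
    "TwoColumnRight", "TwoPageLeft", "TwoPageRight",
])
PAGE_MODES = frozenset([
    "UseNone", "UseOutlines", "UseThumbs",
    "FullScreen", "UseOC", "UseAttachments",
])


def set_page_layout(root, layout):
    """Store a validated /PageLayout name in the catalog-like dict root."""
    if layout not in PAGE_LAYOUTS:
        raise ValueError("unknown page layout: %r" % (layout,))
    root["/PageLayout"] = "/" + layout


def set_page_mode(root, mode):
    """Store a validated /PageMode name in the catalog-like dict root."""
    if mode not in PAGE_MODES:
        raise ValueError("unknown page mode: %r" % (mode,))
    root["/PageMode"] = "/" + mode


catalog = {}  # stand-in for the writer's root object
set_page_layout(catalog, "SinglePage")
set_page_mode(catalog, "UseOutlines")
print(catalog)
```

A friendlier alias such as 'Bookmarks' for /UseOutlines, as in the proposal, would be a design choice layered on top of this mapping.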

PyPDF2 failing at import

I am using PyPDF2 for extracting text and geometry from a PDF and this is my code snippet of Pdftext.py file :

from PyPDF2 import PdfFileReader

When I run this, I get the following error:

Traceback (most recent call last):
  File "C:\Program Files\Microsoft Visual Studio 11.0\Common7\IDE\Extensions\Microsoft\Python Tools for Visual Studio\2.0\visualstudio_py_util.py", line 76, in exec_file
    exec(code_obj, global_variables)
  File "C:\Users\xxx\documents\visual studio 2012\Projects\PDFText\PDFText\PDFText.py", line 3, in <module>
    import PyPDF2
  File "C:\Python27\lib\site-packages\PyPDF2\__init__.py", line 1, in <module>
    from .pdf import PdfFileReader, PdfFileWriter
  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 56, in <module>
    from .generic import *
  File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 1049, in <module>
    u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'),
  File "C:\Python27\lib\site-packages\PyPDF2\utils.py", line 161, in u_
    return str(s, 'unicode_escape')
TypeError: str() takes at most 1 argument (2 given)
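
The traceback shows the two-argument form of str() running on Python 2.7, where str does not accept an encoding. A version-aware helper along the following lines avoids that. This is only a sketch of the general technique; the `u_` defined here is a stand-in and may differ from what the library does.

```python
import sys

if sys.version_info[0] < 3:
    def u_(s):
        # Python 2: decode the escape sequences into a unicode object.
        return unicode(s, "unicode_escape")  # noqa: F821 (Python 2 only)
else:
    def u_(s):
        # Python 3: string literals are already unicode.
        return s


print(repr(u_("\u0000")))
```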

Query - is there a way to bypass security restrictions on a pdf?

I have a pdf that has security restrictions. I need to merge some content into the secured pdf. I don't need the pdf to be secured after the merge.
When I open the file and check isEncrypted, it returns true.
When I try to decrypt with an empty string, a NotImplementedError is raised: "only algorithm code 1 and 2 are supported".

The restrictions on the file are shown below.

At the moment, to bypass the restrictions on the file, I print the pdf to images and create a new pdf with those images. This isn't ideal as the file size becomes large and the content isn't as crisp.

Is there a better way?

PyPDF2 does not work under pypy

NumberObject is initialized wrong

class NumberObject(int, PdfObject):
    def __init__(self, value):
        int.__init__(value)

Correct would be:

class NumberObject(int, PdfObject):
    def __init__(self, value):
        int.__init__(self, value)
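
Since int is immutable, another option that behaves the same on CPython and PyPy is to pass the value in `__new__` instead of `__init__`. The sketch below uses a stand-in `PdfObject` base class and is offered as an alternative to consider, not as the fix the project adopted.

```python
class PdfObject(object):
    """Stand-in for the library's base class (hypothetical)."""

    def getObject(self):
        return self


class NumberObject(int, PdfObject):
    def __new__(cls, value):
        # The integer value must be fixed at creation time;
        # by the time __init__ runs the int can no longer change.
        return int.__new__(cls, value)


n = NumberObject("42")
print(n + 1, isinstance(n, PdfObject))
```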

PyPDF2 - AutoCad generated PDF and Watermark

Some time ago I reported a problem with AutoCAD-generated PDFs.
That problem was solved.

I have encountered a new problem which I believe is also related to AutoCAD-generated PDFs.

This time I'm adding a watermark to an existing PDF.
I am able to add this watermark file (created using pyfpdf) to most of the files:

a = PdfFileReader(open(filein, "rb")).getPage(0)
watermark = PdfFileReader(file(r'c:\temp\test.pdf', 'rb')).getPage(0)
a.mergePage(watermark)

filein is an AutoCAD-generated PDF.

This fails:

a.mergePage(watermark)

File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1594, in mergePage
self._mergePage(page2)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1644, in _mergePage
originalContent, self.pdf))
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1557, in _pushPopGS
stream = ContentStream(contents, pdf)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1986, in __init__
self.__parseContentStream(stream)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 2025, in __parseContentStream
operands.append(readObject(stream, None))
File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 55, in readObject
return readStringFromStream(stream)
File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 370, in readStringFromStream
raise utils.PdfReadError("Unexpected escaped string")
PyPDF2.utils.PdfReadError: Unexpected escaped string

Looks very similar to the last problem I reported.

Olav

Infinite loop with PDFFileReader when file is empty

Reproducible with:

import StringIO, PyPDF2

PyPDF2.PdfFileReader(StringIO.StringIO())

Will hang on invalid PDFs

Doing some testing, I noticed that PyPDF2 will hang if it encounters an invalid PDF… for example, the skipOverComment function:

def skipOverComment(stream):
    tok = stream.read(1)
    stream.seek(-1, 1)
    if tok == b_('%'):
        while tok not in (b_('\n'), b_('\r')):
            tok = stream.read(1)

Will hang indefinitely.
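
The loop never ends because stream.read(1) returns an empty string at end of file, and an empty string is neither '\n' nor '\r'. A minimal sketch of an EOF-aware variant follows; `skip_over_comment` is a hypothetical rewrite shown with plain bytes literals, not the library's code.

```python
import io


def skip_over_comment(stream):
    """Skip a '%' comment, stopping at end of line or end of file."""
    tok = stream.read(1)
    if not tok:
        return  # empty stream: nothing to skip
    stream.seek(-1, 1)  # put the peeked byte back
    if tok == b"%":
        while tok not in (b"\n", b"\r"):
            tok = stream.read(1)
            if not tok:
                break  # EOF inside the comment: stop instead of spinning


stream = io.BytesIO(b"% comment without a newline")
skip_over_comment(stream)
print(stream.tell())
```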

I would propose three courses of action:

1. Wrap the stream in a class which will raise an exception after a certain number of empty reads; for example:

class SafeStream(object):
    def __init__(self, stream):
        self.stream = stream
        self.seek = stream.seek
        self.tell = stream.tell
        self._empty_reads = 0

    def read(self, *args):
        res = self.stream.read(*args)
        if not res:  # empty str or bytes: nothing was read
            self._empty_reads += 1
            if self._empty_reads > 1000:
                raise Exception("too many empty reads")
        else:
            self._empty_reads = 0
        return res

2. Add a script for automating fuzz testing to the repo
3. Fix the bugs as the script from step (2) finds them

What do you think? Would you be open to patches for those?

Add method ignoreImage

Hi,

Like my last post, "Add method ignoreText", I now need to extract only the text from a PDF. I tried several products for extracting text from PDFs, but they all return the text as a plain string; none of them keeps the text position and fonts. I think PyPDF is a good tool for this.

I add this method in pdf.py in class PdfFileWriter:

    def ignoreImage(self, ignoreByteStringObject=False):
        pages = self.getObject(self._pages)['/Kids']
        for j in range(len(pages)):
            page = pages[j]
            pageRef = self.getObject(page)
            content = pageRef['/Contents'].getObject()
            if not isinstance(content, ContentStream):
                content = ContentStream(content, pageRef)

            # Rebuild the operation list for this page, dropping drawing operators.
            _operations = []
            seq_graphics = False
            for operands, operator in content.operations:
                if operator == "Tj":
                    text = operands[0]
                    if ignoreByteStringObject:
                        if not isinstance(text, TextStringObject):
                            operands[0] = TextStringObject()
                elif operator == "'":
                    text = operands[0]
                    if ignoreByteStringObject:
                        if not isinstance(text, TextStringObject):
                            operands[0] = TextStringObject()
                elif operator == '"':
                    text = operands[2]
                    if ignoreByteStringObject:
                        if not isinstance(text, TextStringObject):
                            operands[2] = TextStringObject()
                elif operator == "TJ":
                    for i in range(len(operands[0])):
                        if ignoreByteStringObject:
                            if not isinstance(operands[0][i], TextStringObject):
                                operands[0][i] = TextStringObject()

                # Track whether we are inside a q ... Q graphics-state block.
                if operator == 'q':
                    seq_graphics = True
                if operator == 'Q':
                    seq_graphics = False
                if seq_graphics:
                    if operator in ['cm', 'w', 'J', 'j', 'M', 'd', 'ri', 'i', 'gs',
                            'W', 'n', 'f', 'm', 'l', 'cm', 'Do', 'sh', 'S']:
                        continue
                if operator == 're':
                    continue
                _operations.append((operands, operator))

            content.operations = _operations
            pageRef.__setitem__(NameObject('/Contents'), content)

If you think this method is helpful, can you add it?

Thanks.

MergePage rotates 1 page relative to the other, in certain pdfs

I'm merging 2 pdfs using code that works correctly for other pdfs. I'm using the mergePage method to overlay the content from one pdf on the other pdf (merge page by page).
In the image below, the numbers (highlighted by red box) should be positioned vertically.

The "base pdf" is a scan from a Xerox WorkCentre 7435. The "secondary pdf" (containing the highlighted numbers) is generated using reportlab. The "base pdf" and "secondary pdf" have portrait orientation when viewing in a pdf viewer.
Other scans (from other scanners) merge correctly.

I don't know much about how pdf structure works, but is it possible the scan isn't including some data (orientation)?

I will try to include a problem PDF when I obtain one that doesn't contain sensitive information.
Thanks
Rob

PDF split with links

I have a 483 page PDF that I use for testing (manual). The problem is that when I try to split the document, it takes almost 2 min to process the first handful of pages, and then 3 seconds to process the remaining 450+.

Pages 3-6 contain a table of contents with links to other parts of the PDF. When I take these few pages out of the document, it takes 3-4 seconds to split the 483 pages.

Any ideas why it's hanging on the table of contents (with links)?

Encryption/Decryption in Python 3

This seems to be the only feature that doesn't work under Python 3. There are several encryption algorithms; it is probably just a matter of using utils.py correctly to avoid TypeErrors.

Installation through pip does not work

Since your username has changed it's not possible to download via pip anymore.
Could you please fix this?
https://pypi.python.org/pypi/PyPDF2

Thanks!

API compatibility with PyPDF

Hi,

Is PyPDF2 fully API compatible with PyPDF ? I'm trying to get PyPDF in Fedora replaced by PyPDF2 but we must know if it won't break anything or fix application accordingly.

Thanks !

PyPDF2 warning in PyPDF2 1.19

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will not be
corrected. [pdf.py:1130]

KeyError: '/Type' when merging pages

Merging 2 pdfs. The first pdf is from paperport 11 (some old program which may not support pdf structure correctly?), I initially needed to apply the fix from #34 (to fix EOF error). The next issue I encountered is in the method: _flatten (in pdf.py) where "/Type" isn't present in the pages dictionary.
I made the following change:

def _flatten(self, pages=None, inherit=None, indirectRef=None):
    ...
    ...
    # this is the change I made; default t = '/Pages'. Is this the correct thing to do?
    t = "/Pages"
    if "/Type" in pages:
        t = pages["/Type"]
    ...

Should I commit a fix for this (and make it conditional on strict parameter)? Or is there a better way to pick a type?
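
One alternative to a fixed default is to infer the type from the node's shape: in a page tree, intermediate nodes carry /Kids while leaf pages do not. The sketch below shows that heuristic on plain dicts; `node_type` is a hypothetical helper and the choice of heuristic is an assumption, not something the maintainers have endorsed here.

```python
def node_type(node):
    """Return the node's /Type, guessing from /Kids when it is missing."""
    if "/Type" in node:
        return node["/Type"]
    # Intermediate page-tree nodes list their children under /Kids;
    # a node without /Kids is treated as a leaf page.
    return "/Pages" if "/Kids" in node else "/Page"


print(node_type({"/Kids": [], "/Count": 0}))
print(node_type({"/MediaBox": [0, 0, 612, 792]}))
```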

Can't read pdf

I get a mysterious error with the PDF reader using python3 on the file
"Werner - Fragen und Antworten zu Werkstoffen.pdf".
My Code:

import fnmatch
import os
from PyPDF2 import PdfFileReader

for file in os.listdir('.'):
    if fnmatch.fnmatch(file,'*.pdf'):
        print("File: "+file)
        foo = PdfFileReader(open(file,"rb"))

Error:

File: Werner - Fragen und Antworten zu Werkstoffen.pdf
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    foo = PdfFileReader(open(file,"rb"))
  File "/usr/lib/python3.3/site-packages/PyPDF2/pdf.py", line 684, in __init__
    self.read(stream)
  File "/usr/lib/python3.3/site-packages/PyPDF2/pdf.py", line 1236, in read
    streamData = BytesIO(xrefstream.getData())
  File "/usr/lib/python3.3/site-packages/PyPDF2/generic.py", line 834, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 310, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 121, in decode
    rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
  File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 121, in <listcomp>
    rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
TypeError: ord() expected string of length 1, but int found

Is something wrong with my filename, or why does this error occur?
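
The traceback points at Python 3 behaviour rather than at the filename: iterating over a bytes object yields integers, so calling ord() on each element fails. A small sketch of a conversion that works for byte strings on both Python 2 and 3 follows; `byte_values` and `row_values` are hypothetical helper names.

```python
def byte_values(data):
    """Return the bytes of data as a list of integers."""
    # bytearray iterates as integers on Python 2 and Python 3 alike.
    return list(bytearray(data))


def row_values(data, row, rowlength):
    """Integer values of one fixed-length row, as in the failing line."""
    return byte_values(data[row * rowlength:(row + 1) * rowlength])


print(row_values(b"\x01\x02ab\x03\x04cd", 1, 4))
```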

int() got an unexpected keyword argument 'base' error at line 803 in pdf.py when using PyPDF2

I get an error when I execute the following code in Visual Studio 2012, using Python Tools, IronPython 2.7, and PyPDF2 v1.20.

The error is "int() got an unexpected keyword argument 'base'" at line 803 in pdf.py.

This is my complete code:

import clr
clr.AddReference('System.Drawing')
clr.AddReference('System.Windows.Forms')

from System.Drawing import *
from System.Windows.Forms import *
from PyPDF2 import PdfFileReader
class MyForm(Form):
    def __init__(self):
        # Create child controls and initialize form
        self.Text = "Test Project"
        self.Size = Size(600, 500)

        path = "F:/Download/RealPython.pdf"
        f = open(path)
        inputpdf = PdfFileReader(open(path, "rb"))
        page = inputpdf.getPage(8)
        pagecontent = page.extractText()

        display.mediaBox.upperRight = (
            display.mediaBox.getUpperRight_x() / 2,
            display.mediaBox.getUpperRight_y() / 2
        )

Application.EnableVisualStyles()
Application.SetCompatibleTextRenderingDefault(False)

form = MyForm()
Application.Run(form)

I read that PyPDF2 is written in pure Python, so it should run with any Python implementation; that is why I am using IronPython 2.7.

Can anyone help? :)

Cannot read certain PDF file (tolerance issue?)

PyPDF2 can't read some files which can be read by pyPdf.

Usually these PDFs contain ^M characters in the first line, just after the comment, like this one: http://books.nips.cc/papers/files/nips24/NIPS2011_0622.pdf

Also I noticed that the function readNonWhitespace in utils.py is performing exactly the same function as skipOverWhitespace. I suppose readNonWhitespace should read non whitespace characters?

pdfcat problem parsing arg like "-2:"

The command

pdfcat foo.pdf -2: >bar.pdf

is supposed to put the last two pages of foo.pdf into bar.pdf. But argparse chokes on the "-2:".
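
argparse treats any argument that starts with "-" and is not a plain negative number as an option, which is why "-2:" is rejected. A generic workaround, independent of how pdfcat is written, is to put "--" before the page range so that everything after it is read as positional. The parser below is a hypothetical stand-in, not pdfcat's real one.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pdfcat-sketch")
    parser.add_argument("-o", "--output")
    # File names and page ranges are collected together as positionals.
    parser.add_argument("args", nargs="+")
    return parser


# "--" marks the end of options, so "-2:" is accepted as a positional.
parsed = build_parser().parse_args(["foo.pdf", "--", "-2:"])
print(parsed.args)
```

On the command line this corresponds to something like: pdfcat foo.pdf -- -2: >bar.pdf, assuming the tool passes its arguments straight to argparse.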

HTML links not clickable after merge

I have two PDFs to merge, one with HTML links and another with just plain watermarks.

After merging, the links are not working, and if I reverse the merge sequence, the watermarks will hide the links.

Here is my code:

    bg = PdfFileReader(file("/tmp/bg.pdf", "rb")) #plain watermarks
    fg = PdfFileReader(file("/tmp/fg.pdf", "rb"))   #text with links

    page = bg.getPage(0)
    page.mergePage(fg.getPage(0))

    output = PdfFileWriter()
    output.addPage(page)

    ostream = file('/tmp/out.pdf', 'wb')
    output.write(ostream)
    ostream.close()

PyPDF2 bails out while parsing NameObject if it's standalone

When a standalone NameObject is encountered the parsing code raises an exception.

Reproducible with:
from PyPDF2.generic import readObject
from cStringIO import StringIO
print readObject(StringIO("/deviceRGB"), None)

PyPDF2 fails with PdfStreamError("Stream has ended unexpectedly").

Some of the PDFs generated with ImageMagick (image-to-PDF conversion) have this standalone "/deviceRGB", which is not followed by a space or any other delimiter. I have come across a couple of PDFs with this problem. Unfortunately I cannot send them (client data). I'll try to create such a PDF and attach it here.
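
The behaviour being asked for is that end of input should end a name token just like a delimiter does. A minimal sketch of such a reader follows; `read_name` is a hypothetical function written against io.BytesIO, not the library's readObject.

```python
import io

# PDF whitespace and delimiter characters that end a name token
DELIMITERS = b" \t\n\r\x0c\x00()<>[]{}/%"


def read_name(stream):
    """Read a /Name token, stopping at a delimiter or at end of input."""
    first = stream.read(1)
    if first != b"/":
        raise ValueError("name must start with '/'")
    name = first
    while True:
        tok = stream.read(1)
        if not tok:
            break  # end of input also terminates the name
        if tok in DELIMITERS:
            stream.seek(-1, 1)  # leave the delimiter for the next reader
            break
        name += tok
    return name


print(read_name(io.BytesIO(b"/deviceRGB")))
```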

Python Version Compatibility

A new PyPDF2 branch 'Python3-3' has been created, incorporating William Culver's changes from his pull request #4 . However, it currently only completely works on Python 2.6 and 2.7.

PyPDF2 failing to read unicode character

I have a PDF whose text PdfFileReader is unable to read; instead, this is the output:

u'\n˘ˇˆ˘ˇ˙˝˛˛˚˜ !!"#$%&"˝˛˝˘˛˘˛˚˙˘ˇ˝˛˘˛$\'(˘%˘ˇ˘ˆ˘)_)˛\'+,-)"˛./0"0!123˛"4˙"5)46)!6"˙˘˘˘,˘ˇˆ˙˙ˆ˝˛˚˜ !˘ˇˆ˙˝"" ˜#˝$˛˚˜ ˆ˙˝"" ˜ %˛˚˜ !˛˚ˇ!"#$%˘ˇ&ˆ˙˝˛˝ˆ˙&˚˝\'˛˚&\'()_ˇ+˙˝"" ˜#˝$˜#( ˛˚(ˇ+,˘˘˘ˇˆˆˆˇ,ˆ--ˆˇˇ˙˝˝% ˜)˜#_#˝$$˜  ˙ ˝_˛˚ˆ-&ˆ!ˆˇ&˘+$ˆ(˙˝+˚˜,!˛˚./&0ˆˆ+$ˆ(˙˝-˛-,&˘˝ˆ. ˚%˝% ˜)˜#\* ˜!˛˚&ˆˇ%ˆ!&(12+3ˇ˙˝,˜ˆ/˛˚%#"+3("ˆˇ.!ˆˇ43ˇ(˙-,&53ˇ6ˆˇ,˝˝% ˜)˜#\* ˜!˛˚(77777777777˜#( 0123& ˜"" ˜ %˛˚˜ 77777777777˜#( _ˆ_˛ ,4+#(56˝% ˜)˜#\* ˜!7  56 _˜ˆ(  %!_ˆ_˛ ˆ˙&˚˝\'586"ˇ+((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'()_&\'(_&\'()˘536((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'&\' &\'˜ ˜˙˚ˆ-",ˇˆˇ!ˆ-ˆ,ˆ&ˆ!ˆˇ&53ˇ6ˆˇ,(˙˚&ˆ!-ˇ!6ˆˇ,˘ 8-ˇˆ-˙˝˝% ˜)˜#_ ˜!7  ˛˚(˙˚9ˇˇˆ-6ˆˇ,:;ˇˇˆ-<ˆˆ-ˇ&\' ,,˘˘ˇˇˆ-(9ˆˇˆ-!˘ˇˆ9˘ˆˇ˘˘(\n\n'

This is the output of extractText(); it does not throw any error message.

A similar issue has been posted here:

http://stackoverflow.com/questions/15583535/how-to-extract-text-from-a-pdf-file-in-python
I am using Windows, so the solution in the link is not helpful.

Problem with AutoCad generated PDF

I am trying to use the PyPDF2 module to merge a lot of PDF files. For some of the PDF files it fails.
The failing PDF files are generated directly from AutoCAD.

Traceback (most recent call last):
File "", line 37, in
File "", line 29, in main
File "C:\Python27\lib\site-packages\PyPDF2\merger.py", line 168, in append
self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
File "C:\Python27\lib\site-packages\PyPDF2\merger.py", line 116, in merge
pages = (0, pdfr.getNumPages())

My script:

def main():
    from PyPDF2 import PdfFileReader, PdfFileMerger
    doclistdir = r'xxxxxxxxxxxxxxxxxx'
    doclistfile = open(r'xxxxxxxxxx\list.txt','r')
    doclist = doclistfile.readlines()
    merger = PdfFileMerger()

    for doc in doclist:
        pdfdoc = doclistdir + '' + doc.strip()
        mergerelement = open(pdfdoc,'rb')
        #print 'Processing: ' + pdfdoc

        merger.append(mergerelement)

    output = open(doclistdir + '' + "document-output.pdf", "wb")
    merger.write(output)
    pass

if __name__ == '__main__':
    main()

regards
Olav

Bad arguments to str() in u_

*** Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32. ***

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    import PyPDF2
  File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\__init__.py", line 1, in <module>
    from .pdf import PdfFileReader, PdfFileWriter
  File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\pdf.py", line 56, in <module>
    from .generic import *
  File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\generic.py", line 1042, in <module>
    u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'),
  File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\utils.py", line 161, in u_
    return str(s, 'unicode_escape')
TypeError: str() takes at most 1 argument (2 given)

PdfFileReader barfs on unicode _namedDest ... needs to check isinstance(dest, basestring)

I noticed the way you are keeping 2to3 compatibility with isinstance(x, basestring) in pagerange.py using this Str indirection. I propose we move this Str to utils, and import it into pagerange.py and pdf.py. I will submit a pull request later tonight.

Whitespace issues in extract_text()

I am not able to extract text with proper formatting; spaces are not preserved during extraction:

PreemptiveInformationExtractionusingUnrestrictedRelationDiscoveryYusukeShinyamaSatoshiSekineNewYorkUniversity715,Broadway,7thFloorNewYork,NY,10003fyusuke,sekineg@cs.nyu.eduAbstractWearetryingtoextendtheboundaryofInformationExtraction(IE)systems.Ex-istingIEsystemsrequirealotoftimeandhumanefforttotuneforanewscenario.

Is it true that pypdf2 is not format aware as given here: http://victorwyee.com/python/convert-pdf-to-text-pypdf-pdfminer-first-impression/

No way to redirect warning messages to standard python logging implementation

PdfFileReader sends all warning messages to stderr (or some other file that you can specify). Normally, you can redirect warnings into the logging system by using logging.captureWarnings. PdfFileReader stops this by replacing the showWarning function in the constructor.

The only problem is that this will break backwards compatibility for the PdfFileReader constructor.
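
For reference, this is how the standard redirection works when nothing replaces the warning hooks: logging.captureWarnings(True) routes warnings to the "py.warnings" logger. The sketch uses only the standard library; `ListHandler` and `capture_demo` are hypothetical names, and the warning text is borrowed from another report on this page.

```python
import logging
import warnings


class ListHandler(logging.Handler):
    """Collects formatted log messages in a list (for demonstration)."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def capture_demo():
    handler = ListHandler()
    logger = logging.getLogger("py.warnings")
    logger.addHandler(handler)
    logging.captureWarnings(True)  # warnings now go to "py.warnings"
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("always")
            warnings.warn("Xref table not zero-indexed", UserWarning)
    finally:
        logging.captureWarnings(False)
        logger.removeHandler(handler)
    return handler.messages


print(capture_demo())
```

A library that swaps out the warning display function after this call would bypass the logger, which matches the behaviour described in this report.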

debug print statement

https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/merger.py#L315

Complete operator for method removeImages

Hi,

Thank you for adding the methods removeText and removeImages.
For the method removeImages, here is a small correction so that the content is handled correctly:

                if operator in ['cm', 'w', 'J', 'j', 'M', 'd', 'ri', 'i',
                        'gs', 'W', 'b', 's', 'S', 'f', 'F', 'n', 'm', 'l',
                        'c', 'v', 'y', 'h' , 'B', 'Do', 'sh'] or \
                    operator in [b'cm', b'w', b'J', b'j', b'M', b'd', b'ri', b'i',
                        b'gs', b'W', b'b', b's', b'S', b'f', b'F', b'n', b'm', b'l',
                        b'c', b'v', b'y', b'h', b'B', b'Do', b'sh']:
                    continue

Incorrect Destination Type /FitBH

When I call pdf.getOutlines() I get the error "Unknown Destination Type /FitBH"

Here's my sample pdf to reproduce: http://www.miem.gub.uy/MIEM_Dinamige-portlet/fileServlet?file=1707

I believe the issue is here: https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/generic.py#L942

It seems like it should be /FitBH instead of just FitBH (and similarly for the next three lines).
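
To illustrate why the leading slash matters, here is a small lookup keyed by the full destination names, each mapped to the arguments the PDF specification defines for it. `DESTINATION_ARGS` and `describe_destination` are hypothetical names for this sketch and do not mirror the code in generic.py.

```python
# Destination type -> names of the arguments that follow it
DESTINATION_ARGS = {
    "/XYZ": ("/Left", "/Top", "/Zoom"),
    "/Fit": (),
    "/FitH": ("/Top",),
    "/FitV": ("/Left",),
    "/FitR": ("/Left", "/Bottom", "/Right", "/Top"),
    "/FitB": (),
    "/FitBH": ("/Top",),
    "/FitBV": ("/Left",),
}


def describe_destination(typ, args):
    """Return a dict describing a destination, or raise for unknown types."""
    try:
        names = DESTINATION_ARGS[typ]
    except KeyError:
        raise ValueError("Unknown Destination Type: %r" % (typ,))
    if len(args) != len(names):
        raise ValueError("%s expects %d argument(s)" % (typ, len(names)))
    result = {"/Type": typ}
    result.update(zip(names, args))
    return result


print(describe_destination("/FitBH", [700]))
```

Comparing against "FitBH" without the slash can never match a name object such as "/FitBH", which would explain the error in this report.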

Exceptions / missing spaces in extract_text() method

extractText() method isn't broken, but throws some exceptions in these cases:

http://doctor12wer.blogspot.com/2013/06/extracttext-function-in-pypdf2-throws.html

http://stackoverflow.com/questions/17270387/pypdf2-typeerror-when-trying-to-extract-text

Pdf form overlap issue

When merging 2 pdfs, where one has form elements and the other does not; the "check box" form element overwrites any text that is present.

Is there a workaround for this?

The example below shows the text "blah" overlapped by check box element.

Huge memory/cpu utilization for 1 page PDF extraction

extractText() cpu/memory utilization is massive for the following 1 page 3 MB file. The extraction doesn't complete and the process has to be killed.

http://www.dora.state.co.us/pls/efi/efi_p2_v2_demo.show_document?p_dms_document_id=105933&p_session_id=

Multiple definitions in dictionary error

Getting a Multiple definitions in dictionary at byte 0x3ee546 for key /CPUCstFn1 for file
http://docs.cpuc.ca.gov/PublishedDocs/Efile/G000/M075/K768/75768464.PDF
when doing reader.getPage()
The file opens without an issue in Foxit.

PdfReadError: EOF marker not found

We are getting the error "PdfReadError: EOF marker not found".

Scenario: we concatenate PDFs using pyPDF. The input can be a PrinceXML-generated PDF, a normal PDF, etc.
No issues there.

PrinceXML PDF + Adobe PDF = pyPDF-generated concatenated PDF. This works fine.

The issue appears when we take a pyPDF-concatenated PDF of that kind and concatenate it with another normal PDF, again using pyPDF.

PrinceXML PDF + some pyPDF-generated PDF = pyPDF-generated concatenated PDF (expected). This works in most cases, but in some cases it fails, complaining that the EOF marker was not found in the pyPDF-generated PDF. However, that file was generated by pyPDF itself. Did pyPDF fail to write the EOF marker in some unusual cases?

Can anyone look at this bug? It happens quite rarely, but some online sites handle the same PDF without problems. How can I attach the PDF to GitHub for inspection?

A relevant question can be seen here :
http://stackoverflow.com/questions/15177587/merge-non-standard-pdfs-with-pypdf
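
To narrow this down, it may help to check the intermediate file independently of pyPDF before feeding it back in. The following is a minimal sketch using only the standard library; the helper name has_eof_marker and the 1024-byte window are assumptions made for illustration, not part of pyPDF. It only reports whether %%EOF appears near the end of the data. One thing worth ruling out with it (not confirmed by this report) is that the intermediate output was read back before it had been fully written and closed.

```python
import io
import os


def has_eof_marker(stream, window=1024):
    """Return True if b"%%EOF" appears in the last `window` bytes of a binary stream.

    Hypothetical helper for diagnosis only; it does not validate the PDF.
    """
    stream.seek(0, os.SEEK_END)
    size = stream.tell()
    # Read only the tail of the file, where the marker is expected.
    stream.seek(max(0, size - window))
    return b"%%EOF" in stream.read()


if __name__ == "__main__":
    complete = io.BytesIO(b"%PDF-1.4\n1 0 obj\nendobj\n%%EOF\n")
    cut_off = io.BytesIO(b"%PDF-1.4\n1 0 obj\nendobj\n")
    print(has_eof_marker(complete))
    print(has_eof_marker(cut_off))
```

For a file on disk, open it with mode 'rb' and pass the file object to the helper.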

PdfFileMerger.addBookmark() should return the newly added bookmark

PdfFileWriter.addBookmark() returns the newly added bookmark, so it can be used as the parent in subsequent addBookmark() calls in order to create nested bookmarks.

For consistency, PdfFileMerger.addBookmark() should behave the same way. Currently it returns nothing, which makes it impossible to create nested bookmarks with PdfFileMerger.
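
To illustrate why the return value matters, here is a toy sketch of the pattern. ToyOutline and ToyBookmark are hypothetical classes written for this example and are not PyPDF2 code; the point is only that nesting requires the caller to get a handle on the parent it just created.

```python
class ToyBookmark:
    """Hypothetical bookmark node: a title, a page number, and child bookmarks."""

    def __init__(self, title, pagenum):
        self.title = title
        self.pagenum = pagenum
        self.children = []


class ToyOutline:
    """Hypothetical outline container, standing in for a writer or merger."""

    def __init__(self):
        self.roots = []

    def add_bookmark(self, title, pagenum, parent=None):
        node = ToyBookmark(title, pagenum)
        # Attach to the given parent, or to the top level if there is none.
        target = self.roots if parent is None else parent.children
        target.append(node)
        # Returning the node is what lets callers pass it back as `parent`.
        return node


if __name__ == "__main__":
    outline = ToyOutline()
    chapter = outline.add_bookmark("Chapter 1", 0)
    outline.add_bookmark("Section 1.1", 1, parent=chapter)
    print([child.title for child in chapter.children])
```

If add_bookmark returned None, the second call would have nothing to pass as parent.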

Wrong PDF generation on windows

The code below generates an output file, but the resulting PDF is not the expected concatenation of the two original pages. The same code works as intended on Linux.

import PyPDF2

pdfList = ['top_01.pdf','top_02.pdf']

def mergePDF():
        writer = PyPDF2.PdfFileWriter()
        for pdf in pdfList :
            f = open(pdf, 'rb')
            reader = PyPDF2.PdfFileReader(f)
            writer.addPage(reader.getPage(0))
        out = open('top.pdf', 'w')
        writer.write(out)
        #out.close()

mergePDF()

Here are the links:
top.pdf

top_01.pdf

top_02.pdf
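
A likely cause is that the script above opens top.pdf with mode 'w' rather than 'wb'. On Windows, text mode translates every "\n" written into "\r\n", which changes the bytes of a binary file such as a PDF; on Linux no translation happens, which would explain the difference. The sketch below simulates that translation on any platform with the standard library only. The function name write_both_ways and the sample payload are assumptions made for illustration.

```python
import os
import tempfile


def write_both_ways(payload, directory):
    """Write `payload` once in binary mode and once in Windows-style text mode.

    Returns the bytes that actually ended up on disk for each file.
    """
    binary_path = os.path.join(directory, "binary.pdf")
    text_path = os.path.join(directory, "text.pdf")

    # Binary mode: bytes are written unchanged.
    with open(binary_path, "wb") as out:
        out.write(payload)

    # newline="\r\n" forces the translation that Windows text mode applies.
    # latin-1 maps every byte to one character, so only newlines change.
    with open(text_path, "w", encoding="latin-1", newline="\r\n") as out:
        out.write(payload.decode("latin-1"))

    with open(binary_path, "rb") as f:
        binary_bytes = f.read()
    with open(text_path, "rb") as f:
        text_bytes = f.read()
    return binary_bytes, text_bytes


if __name__ == "__main__":
    sample = b"%PDF-1.4\nstream\n\x00\x01\n%%EOF\n"
    with tempfile.TemporaryDirectory() as tmp:
        as_binary, as_text = write_both_ways(sample, tmp)
    print(len(sample), len(as_binary), len(as_text))
```

Because the cross-reference table stores byte offsets, even a few inserted "\r" bytes are enough to make the output unreadable. Opening the output with open('top.pdf', 'wb') and closing it after writer.write(out) avoids the translation.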

Encounter a valid pdf file but PyPDF2 fail on it

The file can be decompressed by pdftk, but PyPDF2's FlateDecode fails:

  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1751, in mergePage
    self._mergePage(page2)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1801, in _mergePage
    originalContent, self.pdf))
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1714, in _pushPopGS
    stream = ContentStream(contents, pdf)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 2158, in __init__
    stream = BytesIO(b_(stream.getData()))
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/generic.py", line 850, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 310, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 102, in decode
    data = decompress(data)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 47, in decompress
    return zlib.decompress(data)
zlib.error: Error -5 while decompressing data: incomplete or truncated stream

Here is the data to be decompressed (repr output):

'H\x89\xedW\xcbr\x1b\xb7\x12\xdd\xf3+f)W\x85#\xbc\x1fZ]\x8a\x0f\xc5%\x9a\x94I:\xded\xa3X\xd4\xe3F\x12mIN\xe2\xbf\xbf\xc0\x003\x83\x19\x00M\xd9I*Qr7*q\x80n4\xbaO\x9f>8\xde\x0c\x0eg\xb8\xc0\xa8D\xaa\xd8\\\x0e\x84\x94%Q\x05\xd3\xa4\xd4\xa2\xd8L\x8a\x83\xc9\xeeU\xb1\xf9\xef`hvp\xb3\xb2\xf9P\x1c|8\xaa>\x1d\xceHk\x88UIX\x81\xac\t\xaa6aM\x84\x92\xc4\xef4G\xe0\x121\xbbsH\x15/5)8-\xa5;b\xd4\x9c\xa0\xb4?\xe2\xa6\xfa\x84J\x8c\x04r_\x1e\xb2\x9b\xaa\xff\xef\xaf\xeau\x8c%&\xd5\xb7\xa2\x8d\x9c\x0bg2\tL\xcec\x8b\xa7`y\xfb\x18\xaf\x1f\x15\xc1\x06\x8a\xa3 \x87\x8d\tB\xd4\x99\xbc\x89N\xec\xdcj\x18,\x13\x84Y\xee\x16n\xc7f\x07\xaf\x13\x99\t\xc9-\x8f>>3\x80\xaa`\xc4V\x8b\xf0\xd2T\xb9\x02\x85)U\x95~V\xed]\x07v\xd7;\xef\xb7^l\x8beQd\x13\x1bF9Kow\x8bw}\xd3\xd0\xf2\xcd\xf6\xe2\xe6<\xda\xb0\xaeo\xe5\x12\xf2!\x8cl{\xf1\xf9v[}\x98n\x06\x9f\n\xac\x95\xc1\'*\x86D\x95\xaa\xa8\xfep\xa5\x8a\x0fw\xc5\xe1\xcd\x1d.&\xbb\xe2\xed\xe0\xb8\xd7\x13D\x97\xdc\xc0\x95\xc9RTI8 X\xd2\xa6\x0eT\x11\x16\xc5\xb9>on\xc8\x99[\xbe\x8d\xda\xe8\xe7\xa6\x18X\xc6W\x1dE\xfb\x7f\t\xc1\x19\xd9~\xb7\'\xa2\xcfQD7O\x91\xc3\xac9E\x08\xde0\x0e\xba\tw\xcb\n&\xe11\xf0\xf1\xd3\xf9E|\xad\xce!#8\x08M\x10R\xd5on\xc9fhHI\x92\xaad\xe38\x94\x90=\xb6\xf7\xd1\xc1OG~\xddxiYm\xdc\x14Vr*\x02\xc0\xba\xe5\x8fq\xdd\xc3c\xae\xee\xe3$\xfdx\x90\xc9Iw\xd7+\x17\x8e\xb9N\x18\xcer\xb4\x8e\x02\x7f\x1dV\x1d\xce\xd7|z2\x9a\xc3[f\xa0\xff\xc5h1\x9e\xd6\x0e\xa8L9X\x9eMW{\xa0\xb3\xdc\x13\xe5|\x9c\xd8\xe0\xf3\xe6\x0f\x99\xcf\x16?\xbe\xea\x16,\x9d\x1c\x9f\x11\x93\xc7z[L\xd0\t\x10\'\xfa6\xd3\x87OM\xa1\x82a8\x9a\xc1y\x9c\x83\xab\xa3\xcd\x9e\x1c/\xa6\x9b\xda\x01V\x9e\xd8\xdf\x87\xe9Y\x9d\xd6\x14\x88J\xa6\x8a_\x0b\xc2lg0D\xed$\x96\x0c\x97\xe6\xb6RX\xc2{\xd8\x0e\xd6\x96\x02)+\xb9(\x98\xe0\xa5\x1b\xd2q.\x9f\xa2\xc4uVC\xe0\xef\xee\xbbY98^\xbe\xab\x91\x13\x13`\'\xff\xc3\xf0\xa2\x81\x99+!\xad9y\xbcK\xb4p\xaa\xb2\xad\x8ay^\xe9\xc6\xcb\xf9|:]\x80\\}:\x1f\x9dLCd5\xea\xe9\xeczw\xbf\xed\xbaMu\xbfs#\x98Hu>x4c*\x91\xb3`\x834c2\xf6\xd0i\xf187\xbf=\xa72\xcf\x8f\xddE\xaa\xd0\xdeHu\
xaa\xb45L\xd2\x94\x9b`\xdc\xdb(\xe1\xfd\xa2;P\xc63\xe16l\xed\xed}\x94\x97\x9fc\x930\x94\xabmd\x11\xce\xb7\xffd.\x92\x18\xf2\x99\xcbn/\xe2\xdb\x82 \x7f\x043\xd5\xb9n\xce6\x00A\x99i"\xf3\x85H\xc1\xdc\xa7\x9d\x0b\x98RU2\xc30%%m3\x1d\x7f\x8e<|I\x8a\x90\x96~]-)\xea=/&\xab\x86=z\x94g\xcc\x05N\xb1\xe4\xe9\xbb\x15H\xb3\'\xcd](\xf5\xcb~tifo\xb1\xa7{\xa5H\xe0{O\xf7\nDc\x9b\xb0\'0g\xb8\x19kMGLrc\xca\xa7\xf1\xd7L\xe5|\x1e>g\xcds\xba%U\x98\x10\xd8\xd71\xf2B\xb5\n\xe3r{q\x15\x8f\xd1\x1c\xd4P\x881C\xb4\xdc>"\x03\xc2\xdd\xc0\x91\xc6s\xea\xf2\xf2\xa6\xa9\xbb\xee>p\xbd\xf7\xbf\xa6\xee\x84\xf8+Ra\xc1\x17T?~\x19<>f\xca\x07\xb7sg>\xb6\xd2\xb4\xd1X~\x1e\xb2\xce<\\\xedr\x95\xc9\x80\xa7\xd9\x9f\xd6.\x9b\xd8_\x07:\x0f\x7fr\x07j\x02W\x82b\x8d\xe2\x0e\xfc\xc6,\xe4Zr_\x12\xfe\xa6\xed\xc7\x8dH4}7$\xc4\xb3\xfc\xc1\xfac<G\xe0\xc4$\xa6R\x11\xb2r<\x84s@\xce\xe6\'\x1e:\t\xed\xd7\x95\xab\xf1\xa1\xae\x17\x04\xd2V*c\xaa=\x12|S\xc4\xb7\xfe%\xee\xa7\x10w\x93\x84b\x8e&a\xb8z\x14Z\xa3=\xa2o\xf41\x91\x91p\x03A\x98\xc1\x01z\x86\xea\x84\x10,s\x04\x9b\x9f\xb9g\x95\x96\xf6\x811\xa4\x8e\x9c\xb3i\xea^\xd4+I\x82dI-\xb9\x1b\x11\xc1\x82\xa7ZgB\x8f\xdc\x84f\x8e#q\xc9\x03\xb5A\xdd\xe8\xd4\xc8\xbeu\x9a\xb5\x96?[/g\x80\x17\xe2\xbc\x0c\x85\xb0X\x7f\x8e\x1b?\x93\x9a\x9d\xb2\xfaj\x86I\x10\x85\xaf\xd3\xd7\x85R\xd7\x8c\x0bT*\x15\xc42\x89[\xfc\xcbc="\xadrj\'d(-\xdf\x84\x14T\x17"\xf9\x82\xdc\xcc\xd6\xee}\x83\r\x07"\xaf\xc4\x0e~\xc8\xea\x82T\xbff\x88\xf1\xde\xfbU\xf6\xf5\xe9\xfc\xce\xbd\xf4&f\xf6\xa9\xfa\xe3*\xd11\xee|\xf3\xac\x15\x8a[\xa0QA\xac\xc1]\xf3\x9b\xe2\xd2\x9cy\xeb\x9e\xb4\x92T\xab\xd46o\xd0\xbb_\xdb\x89\x89\xe9\x95\x9d\xaa\xb7\x1e\xca\xd4B\xa2U)M\xdd\x8d\xbe\xe3\xce\xec\xc6\'.Y\xf8\xe7\x1dc\xf2`r^\x1a\x1d\xde\xe4\xa1\xfe\xdd\xe6\xe1S\x81\xcdo"LKI\xa6\xed)\x15\xa3=l\x8b\xeb\xe2}q?\xc0:H\xdc\xdd@\x1aO]\x07o\x8bO\x85I&QU\x88m23\xaaz=v\xbf\xb8nqsv\xbd\xbbwy\xe5\xaa\x99\xeb\x07cx\x08\xee\xd1\xa0\xf5\xe3\x8bR\xeaXC\xd9\xae\xb7~Y\x94\xbb\xefB\x9eC\x9e\xc9$"e\xfdd\xfc\xd6&\xc9\xb2r\x08\x1f\xf7\x89U\x13$\x10VH#4\xca:p[\x04\xe
2gn\x0eU\x97ty[,\xe3\xa8\x8a0\xae\xefG\x1b\x10\xc4\x98 p\xfd\x07\xec[T\x94\xa2~\x84e\xc6B\xed\x8a\x18X\xd4\xd2\xd9\x0c\x1b\xd4\xc7\xa7y\xcf\x99Vh\xf1\x89\xb9\x05\xa1\xefS\xa1u\x89\r\xb4H\xc5r\xceC\xbf\xd1k\x07\xf5\xef\x8e\x03nN\'*\xe1`\x88\xb5j\x8a|\xa0PB/\x86\xf9\xe6b\x8fLG\xba\x96\x8f!\xbc\x11\xc2pB\xd7\xc7\x94H\xe1\x93*\xdbJ\xce\xb3\xaa2l\xf3\x06\xe1\xdcf\xc9\xda\xc5\xa3:D8\xe1\t\x84\xef\x03\xe9\xd7O\x86\x00\xd0\x16\xa8\xe8\x0cN\x9dA\n\x1f\xbd<,S\xc2\xfbXv\xd0k\xb1\x8c*h\xf6\xb0\x8c*Q`=\xe0\x08\xcb\xb5\x83\xfaw\xc7A\x8d\xe5\xc8\xc1?\x03\xcbF\xdb\xf1\x0e\x96\x1b\xacVs\xe8w\xb1q\x02\x98\xab\xe5\xfae`\x8e\xb1V\x0eE\xec\x89\xba\xecITH~\x12a\x0b\x18bd\xafp\xb2\x99\xf7\x01\x87\xba\xe4\xd9\xb5\xe7\x92\xdaZD\xf6F\xc4\x10[\xab\x97\x8d737L\xc6\x02\xbca\x98<y}O\xdb\xc8\x8a\xfd_\x1e\xfc\t\x94\x1a\xc9\x03\x0f\xc8\x16\xe02A\xa9D\xf2\xfa\xf9\xc3\xfa\x08o\x1c\xd4\xbf;\x0e\xb8\xfb\x15;0\x10\xb7O\xcd\x17\x8e\xf0\xbe:  \xc0\xd5\x1f\xa8\x7f\x9f\x85\xef\xdf\xd5\x04\xf49M\x80^\xa2\xae\xe0Q\x13\xc8\x1e\xcb\xf3\x14\xcb\x0b\x0b\xd8\n\xae\xfd\x16\x90=\x92\xe7)\x92\xef\x99\xff\xcd9~q\x1e\x99=\xdd\x04v\xbb{\xdf\x04\x873R`d\'\xe8\xe6r`\x84\xa6M\xc0\x103\x7fW\xe1\xf7\xe0v\x0f&\xd2\xbeo\xdd1\x18l\x18\x9d\x92\xd3/j"lF\xa7S0\xff\xe3\xe5b\xb3Z\xce\x1b\xdfT\x11\x96\xed\x88?PJ{\x8c\xb6\x90g!m\xd7\x90g"\xa8b\x00x\xde\xe3\xfc\x8e1Q\xd8\x92}m\x8c\x10\x82!\xb8\x19\x9b\x98\x17\xdfJ\xc2i\xfc5\x8c\xfco\x07\xe0d5\x8a\xc3z\x1d\x98-N\xc0\xe2|?ZL\xfeZ\xb6\x8e\xa1\xcbzlMRlm\xe4\x85Ri\xf4\xb2\x1e]w\xeck\xf4\xb6\xf6{\x01\xbc\x9c\x8d7\xf8\x9b\x01l\x02#,\xb8\xc7\xdd@\nQ\xf6\xafu8\xa3\x85\xb4\xd1\x1a\x00ss\x88\xb0y(\x89c\xd9\xb3\xa5y\xa1\x12\xa1$\x196\x97e\x92\xb9\x07L\x85\xdb\xca\xbf\xb07z\xd8\x16\x97f\xb1\xaaF\xbd(IwQ\xda\x83\xdd"\xae2\x14,*dO\xcd-\x12\xc0\xad\xa2\xad\xdbx\x91\xdb\xa1\x95s[5lnQ\x03n5i\xa3\x8d\x17)\xe0Vs\xc8\xad\x04\xa25\x9a\x99\xe5,\xb9Ag\xd6-G\xb4u\x1b/r\xc8-PO\x8e\x80zr\x0c\xd4\x93\xe3\xa0\x9e\xf1"\x07\xa2\xc5\x12\x88\x16+ 
Z\xac\x81h\tPON\x80zr"\x81h\r\x91\xe4\x93`>\xe7\xa3\xa5P=)\xcd\xc3\xc4L\x10\xc8\xad\x84\xdcj\xc0-\xc3@\xe2\x19TO&\x80$0\xa8d\x0chA\xce1\x10-\xa7@\xb4\x1chA\xce\x81\x16\xe4\x1cjA\x01\xd5S@\xf5\x14\x0c@\x9f\xe0\x00\xfa\x04\xd4\x82\x12\x01n%\x06\xdcJ\xa8\x05\xa5\x80\xdc\x02\x94\xca%TO\x05\xb5\xa0\x82\xea\xa9\x18\x80>\x05\xd5SA\xf5\xd4\x08p\xab\x19\x00j\r\xb5\xa0\x86ZP\xeb|\xe2\x05\x02ZP \xa0\x05\x8d\x82\x83\xdc\x02\xfd)\x10\xd0\x9f\x02c\xc0-&\x80[\x0c\x94L`\x80R+\xc9\xc2\x1a\xc9Bt%\xaa\x84\x7f\x16\x8e\x1f\xbc\xa0BR\x0b\xa7\x8c\xb6\xa12z\x8a\x97/\x8a\x9f\xbe\x04[\x8e\xea\x1d\x9a\xe0P\x91Yy\x8c\x04%\xd5\xa7I`\xb0\xaa\xdfaHI\xec\x96\xdf\x87"\xee\xf4\xdd\xaav@0\xf3\x0e^7\xba\xcf\xfd>\t\x0c\xa6\xab8\x84"\xf4\xc8\x0e\xd5!A\x98e"u\x9b\xe8\x11GG\x8c\x83{\xce\xde\xc4\x8f\x18a\x80\xcd\x85y\xe9X\xa8U\x1a\xf0\xfcj\x0b\x0bu\xf8\x8d\xb9\x8b\x8c/\x0bR\x8b\xc9\xb7\xc5\xff\x00L\xc2\xa0'

The PDF file also opens correctly in OS X Preview.
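
The traceback shows zlib.decompress() rejecting the stream as incomplete or truncated, while other tools evidently accept it. One way a reader could be more tolerant is to fall back to a streaming decompressor, which returns whatever it managed to decode instead of raising when the input ends early. This is a minimal sketch, not PyPDF2's implementation; the name decompress_lenient is an assumption made for illustration.

```python
import zlib


def decompress_lenient(data):
    """Decompress zlib data, tolerating a stream that ends early.

    Hypothetical helper: tries the strict one-shot call first, then falls
    back to a decompression object that yields the bytes decoded so far.
    """
    try:
        return zlib.decompress(data)
    except zlib.error:
        decompressor = zlib.decompressobj()
        # Truncated input does not raise here; genuinely corrupt input still does.
        result = decompressor.decompress(data)
        result += decompressor.flush()
        return result


if __name__ == "__main__":
    original = b"BT /F1 12 Tf (blah) Tj ET\n" * 50
    truncated = zlib.compress(original)[:-4]  # drop the trailing checksum
    recovered = decompress_lenient(truncated)
    print(len(recovered), original.startswith(recovered))
```

The trade-off is that a truncated stream may yield incomplete content without any error, so a real reader would probably want to emit a warning when the fallback path is taken.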

startxref is not necessarily on a different line from the location of the xref table

The PDF spec seems to require that the startxref keyword and the byte offset to the xref table be on different lines.

However, in the wild, I have found otherwise valid PDFs where the startxref keyword and the byte offset to the xref table are on the same line, like so:

...
0 8
0000000000 65535 f
0000000009 00000 n
0000305603 00000 n
0000305652 00000 n
0000000083 00000 n
0000305310 00000 n
0000305405 00000 n
0000305423 00000 n
trailer
<<
/Size 8
/Root 2 0 R
/Info 1 0 R
>>
startxref 305711
%%EOF

See here for example: https://www.docketalarm.com/cases/PTAB/IPR2014-00358/Inter_Partes_Review_of_U.S._Reissue_Pat._RE043707/docs/01-17-2014-PET-1193/Power_of_Attorney-2-Power_of_Attorney.pdf
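
A parser can accept both layouts by treating any run of whitespace between the startxref keyword and the offset as a separator. The following is a minimal sketch with the standard library only; it is not how PyPDF2 reads the trailer, and the name find_startxref is an assumption made for illustration.

```python
import re

# Any whitespace (space, newline, carriage return) may separate the
# keyword from the offset, and the offset from the %%EOF marker.
_STARTXREF = re.compile(rb"startxref\s+(\d+)\s*%%EOF")


def find_startxref(tail):
    """Return the xref byte offset found in the tail of a PDF file.

    Hypothetical helper: uses the last match, since incrementally updated
    files can contain more than one startxref section.
    """
    matches = _STARTXREF.findall(tail)
    if not matches:
        raise ValueError("startxref not found")
    return int(matches[-1])


if __name__ == "__main__":
    same_line = b"trailer\n<<\n/Size 8\n>>\nstartxref 305711\n%%EOF\n"
    separate_lines = b"trailer\n<<\n/Size 8\n>>\nstartxref\n305711\n%%EOF\n"
    print(find_startxref(same_line), find_startxref(separate_lines))
```

The sketch assumes the caller has already read the last part of the file into memory as bytes.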