Website py-pdf
$ pip install -r requirements.txt
$ pre-commit install
$ invoke livereload
- Edit requirements.in
- Run pip-compile requirements.in to generate requirements.txt
$ make github
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
Home Page: https://pypdf.readthedocs.io/en/latest/
License: Other
I'm trying to dig deep into some PDFs by calling getData directly on part of a page (I am then parsing that data to find coordinates for a bit of text).
This worked for me in the past with essentially:
page = PdfFileReader(inpdf).getPage(0)
text = page.getContents().getData() #<-- or page["/Contents"].getData()
but with my new PDFs, I am getting an error like this:
"AttributeError: 'ArrayObject' object has no attribute 'getData'"
Digging in, it looks like my old PDF was structured like this (print page) with a single IndirectObject in the contents.
{'/Contents': IndirectObject(14, 0),
'/MediaBox': [0, 0, 662.40000, 792],
'/Parent': IndirectObject(1, 0),
'/Resources': {'/Font': {'/F3': IndirectObject(10, 0),
'/F4': IndirectObject(7, 0),
'/F5': IndirectObject(4, 0)},
'/ProcSet': IndirectObject(13, 0),
'/XObject': {}},
'/Type': '/Page'}
Then page.getContents() returns:
{'/Filter': '/FlateDecode'}
while my new PDF is structured like this with a list of IndirectObjects in the contents:
{'/Contents': [IndirectObject(11, 0),
IndirectObject(12, 0),
IndirectObject(13, 0),
IndirectObject(14, 0),
IndirectObject(15, 0),
IndirectObject(16, 0),
IndirectObject(17, 0),
IndirectObject(18, 0)],
'/CropBox': [0, 0, 612, 792],
'/MediaBox': [0, 0, 612, 792],
'/Parent': IndirectObject(5, 0),
'/Resources': {'/Font': {'/F3': IndirectObject(24, 0),
'/F4': IndirectObject(26, 0),
'/F6': IndirectObject(29, 0),
'/F7': IndirectObject(30, 0)},
'/ProcSet': IndirectObject(31, 0),
'/XObject': {}},
'/Rotate': 0,
'/Type': '/Page'}
then page.getContents() returns:
[IndirectObject(11, 0),
IndirectObject(12, 0),
IndirectObject(13, 0),
IndirectObject(14, 0),
IndirectObject(15, 0),
IndirectObject(16, 0),
IndirectObject(17, 0),
IndirectObject(18, 0)]
How do I get at the underlying data of /Contents? Going after the pieces of the list with page.getContents()[0] just returns the name of the object, and I can't use getData() on that. I can't tell whether this is a bug (caused by having a list as the contents) or whether I am missing some feature.
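One way to approach the question above, as a rough sketch rather than a confirmed fix: the PDF format allows /Contents to be either a single stream or an array of streams that together form one logical content stream. The helper name contents_data and the two stand-in classes are hypothetical; the sketch relies only on getObject() and getData(), which appear elsewhere in these reports.

```python
def contents_data(contents):
    """Return the raw bytes behind a page's /Contents entry.

    `contents` may be a stream, an indirect reference to a stream, or a
    list of either. Only getObject() and getData() are assumed to exist.
    """
    def resolve(obj):
        # Dereference indirect references; pass direct objects through.
        return obj.getObject() if hasattr(obj, "getObject") else obj

    contents = resolve(contents)
    if isinstance(contents, (list, tuple)):
        # An array of content streams is treated as one stream, so join
        # the parts with whitespace to keep tokens separated.
        return b"\n".join(resolve(part).getData() for part in contents)
    return contents.getData()


# Hypothetical stand-ins so the sketch runs without a PDF file.
class FakeStream(object):
    def __init__(self, data):
        self._data = data

    def getData(self):
        return self._data


class FakeRef(object):
    def __init__(self, target):
        self._target = target

    def getObject(self):
        return self._target
```

With a real page, the call would presumably be contents_data(page["/Contents"]), but that usage is untested here.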
Hello,
PyPDF2 1.2.0 overwrites warnings.formatwarning with its own implementation (utils._formatwarning) in pdf.py line 74:
warnings.formatwarning = utils._formatwarning
Unfortunately this may cause severe side-effects if PyPDF2 is imported in a larger application. In our case the PyPDF2 implementation of formatwarning caused IndexErrors whenever a warning was raised somewhere else (and the filename argument was not to the formatter's liking).
Personally, I do not think that it is a good idea for a library to interfere with the global logging/warning infrastructure.
P.S.: Apart from this problem, we have been using PyPDF2 successfully for some time now. Nice piece of software!
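Until the library stops replacing the global hook, an application could shield itself along these lines. This is only a sketch: preserve_warning_hooks and _intrusive_formatter are hypothetical names, and the assignment inside the with block merely simulates what the import described above does.

```python
import contextlib
import warnings


@contextlib.contextmanager
def preserve_warning_hooks():
    """Restore the global warning hooks after the wrapped block runs."""
    saved_format = warnings.formatwarning
    saved_show = warnings.showwarning
    try:
        yield
    finally:
        warnings.formatwarning = saved_format
        warnings.showwarning = saved_show


def _intrusive_formatter(message, category, filename, lineno, line=None):
    # Stand-in for a formatter installed by a third-party import.
    return "custom: %s\n" % message


original = warnings.formatwarning
with preserve_warning_hooks():
    # Simulates a library overwriting the hook at import time.
    warnings.formatwarning = _intrusive_formatter
restored = warnings.formatwarning is original
```

In an application, the import of the offending package would go inside the with block.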
Currently the parser is quite slow, even for moderately sized PDFs. When I get a bit of time, I'm going to investigate different ways it could be sped up. Right now (pending some profiling, obviously) I suspect this is going to involve re-writing some of the core parser loops in something lower level like Cython. I'm looking into options to see if it's possible to write in a language which will be able to compile back to vanilla Python for the benefit of PyPy and friends.
I'm opening this issue to start discussion on the matter, and see if you've got any strong feelings either way.
The issue is something like this: /FontFile2 11 0 R
There is more than one space there, which causes PyPDF2 to fail:
/PyPDF2/generic.py", line 256, in readFromStream
return NumberObject(num)
ValueError: invalid literal for int() with base 10: ''
This should be supported anyway.
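A whitespace-tolerant way to read such a reference could look like the sketch below. It is not the library's parser: parse_reference is a hypothetical helper, and it only illustrates accepting any run of PDF whitespace between the object number, the generation number, and the R keyword.

```python
import re

# PDF whitespace characters: NUL, tab, line feed, form feed, carriage
# return, and space. One or more of them may separate tokens.
_PDF_WS = rb"[\x00\t\n\x0c\r ]+"
_REFERENCE = re.compile(rb"(\d+)" + _PDF_WS + rb"(\d+)" + _PDF_WS + rb"R")


def parse_reference(data):
    """Parse b'11 0 R' (with any amount of whitespace) into (11, 0)."""
    match = _REFERENCE.match(data)
    if match is None:
        raise ValueError("not an indirect reference: %r" % (data,))
    return int(match.group(1)), int(match.group(2))
```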
I cannot seem to get scaling to work.
If I submit a float or int to "scaleBy":
TypeError: Cannot convert float to Decimal. First convert the float to a string
If I submit a string:
TypeError: can't multiply sequence by non-int of type 'float'
If I submit a Decimal:
TypeError: unsupported operand type(s) for *: 'float' and 'Decimal'
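The three errors above all come from mixing float, str, and Decimal in one multiplication. A minimal sketch of a conversion that accepts any of them, assuming both operands can be normalised through str first (scale_value is a hypothetical helper, not part of PyPDF2):

```python
from decimal import Decimal


def scale_value(value, factor):
    """Multiply two numbers given as int, float, str, or Decimal.

    Going through str() avoids the float-to-Decimal conversion error on
    older Python versions and keeps both operands the same type.
    """
    return Decimal(str(value)) * Decimal(str(factor))
```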
Exception occurs when reading certain valid pdfs
File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1019, in getObject
raise Exception, "file has not been decrypted"
PDF: http://dis.puc.state.oh.us/ViewImage.aspx?CMID=A1001001A13L20B64142B89320
Hi,
I have a PDF and I want to remove the text from it, keeping only the images.
I see there is a method ignoreLinks on the PdfFileWriter object; can you add a method ignoreText?
Or explain how I can do this?
Thanks.
When using the merge function with two files and using the import_bookmarks=True option, the bookmarks are always off by 1 page.
The issue is further compounded by different .pdf readers. I'm seeing in Adobe the bookmarks are off by 1 page (one page behind) and in other readers like PDF Complete - they are correct.
I made the following adjustment in the source code (merger.py) _associate_bookmarks_to_pages --
for p in pages:
    if bp.getObject() == p.pagedata.getObject():
        pageno = p.id-1 ########### the -1 was added
Everything looks great in Adobe, but now the file is off by 1 page in PDF Complete... fortunately I only support Adobe.
After further inspection -- although bookmarks work -- the bookmarks are highlighted incorrectly when scrolling through pages. They are off by 1.
I checked the file using the getOutlines() function and saw the file was structured incorrectly with the "/Page" key being off for each item:
Eg:
[......,{'/Title': u'Summary Graph', '/Left': 0, '/Type': '/XYZ', '/Top': 0, '/Zoom': 0, '/Page': 6}, .... ]
Should read this:
[....,{'/Title': u'Summary Graph', '/Left': 0, '/Type': '/XYZ', '/Top': 0, '/Zoom': 0, '/Page': 7}, ...]
And yes I do understand pages start at "0" !
What would I need to fix the root '/Page' key? Would someone be able to help me?
PyPDF2 currently lacks a filter for DCT compression (true? Even as maintainers, we sometimes forget everything there is to know about PyPDF2). How important is it that we add this? There certainly are instances "in the wild" of PDF which use DCT compression; should we care?
[See also internal Issue756.]
It would be nice if pypdf could edit the document meta information. Is anything like this planned?
Hi,
I've been using PyPDF2 to merge some PDF files, adding bookmarks to the various pages as needed. I've been using the code below to set the initial view of the output PDF so that it shows one page at a time, and displays the bookmarks navigation panel.
pdf = PdfFileWriter()
root = pdf.getObject(pdf._root)
root.update({NameObject('/PageLayout'): NameObject('/SinglePage'), NameObject('/PageMode'): NameObject('/UseOutlines')})
I'm wondering if there would be any interest in writing this into a more formal method. Maybe something like:
pdf = PdfFileWriter()
pdf.page_layout = 'SinglePage'
pdf.page_mode = 'Bookmarks'
I'm happy to write this and submit a pull request, but I thought I'd get some feedback on the syntax first.
In addition to this, it would be nice to be able to modify the author, title, etc. Maybe this is already possible and I've just missed it...
I am using PyPDF2 to extract text and geometry from a PDF, and this is the code snippet from my Pdftext.py file:
from PyPDF2 import PdfFileReader
When I run this, I get the error below:
Traceback (most recent call last):
File "C:\Program Files\Microsoft Visual Studio 11.0\Common7\IDE\Extensions\Microsoft\Python Tools for Visual Studio\2.0\visualstudio_py_util.py", line 76, in exec_file
exec(code_obj, global_variables)
File "C:\Users\xxx\documents\visual studio 2012\Projects\PDFText\PDFText\PDFText.py", line 3, in <module>
import PyPDF2
File "C:\Python27\lib\site-packages\PyPDF2\__init__.py", line 1, in <module>
from .pdf import PdfFileReader, PdfFileWriter
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 56, in <module>
from .generic import *
File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 1049, in <module>
u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'),
File "C:\Python27\lib\site-packages\PyPDF2\utils.py", line 161, in u_
return str(s, 'unicode_escape')
TypeError: str() takes at most 1 argument (2 given)
I have a pdf that has security restrictions. I need to merge some content into the secured pdf. I don't need the pdf to be secured after the merge.
When I open the file and check isEncrypted, it returns true.
When I try to decrypt with an empty string, a NotImplementedError is raised: "only algorithm code 1 and 2 are supported".
The restrictions on the file are shown below.
At the moment, to bypass the restrictions on the file, I print the pdf to images and create a new pdf with those images. This isn't ideal as the file size becomes large and the content isn't as crisp.
Is there a better way?
NumberObject is initialized wrong
class NumberObject(int, PdfObject):
    def __init__(self, value):
        int.__init__(value)
Correct would be:
class NumberObject(int, PdfObject):
    def __init__(self, value):
        int.__init__(self, value)
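A related point, offered as a sketch rather than as the library's code: int is immutable, so the value of an int subclass is fixed in __new__, not in __init__. A version that sets the value there could look like this (PdfObject below is a hypothetical stand-in for the real base class):

```python
class PdfObject(object):
    """Hypothetical stand-in for the library's base class."""


class NumberObject(int, PdfObject):
    def __new__(cls, value):
        # The integer value must be supplied when the object is created;
        # __init__ runs too late to change it.
        return int.__new__(cls, value)
```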
Hi
Some time ago I reported a problem regarding AutoCad generated PDFs.
That problem was solved.
I have encountered a new problem which I believe is also related to AutoCad generated PDFs.
This time I'm adding a watermark to an existing PDF.
I am able to add this watermark file (created using pyfpdf) to most of the files:
a = PdfFileReader(open(filein, "rb")).getPage(0)
watermark = PdfFileReader(file(r'c:\temp\test.pdf','rb')).getPage(0)
a.mergePage(watermark)
filein is an AutoCad generated PDF.
This fails:
a.mergePage(watermark)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1594, in mergePage
self._mergePage(page2)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1644, in _mergePage
originalContent, self.pdf))
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1557, in _pushPopGS
stream = ContentStream(contents, pdf)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1986, in __init__
self.__parseContentStream(stream)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 2025, in __parseContentStream
operands.append(readObject(stream, None))
File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 55, in readObject
return readStringFromStream(stream)
File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 370, in readStringFromStream
raise utils.PdfReadError("Unexpected escaped string")
PyPDF2.utils.PdfReadError: Unexpected escaped string
Looks very similar to the last problem I reported.
Olav
Reproducible with:
import StringIO, PyPDF2
PyPDF2.PdfFileReader(StringIO.StringIO())
Doing some testing, I noticed that PyPDF2 will hang if it encounters an invalid PDF… for example, the skipOverComment function:
def skipOverComment(stream):
    tok = stream.read(1)
    stream.seek(-1, 1)
    if tok == b_('%'):
        while tok not in (b_('\n'), b_('\r')):
            tok = stream.read(1)
Will hang indefinitely.
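The hang happens because read(1) returns an empty result at end of file, which is never a newline, so the loop never exits. A sketch of a variant that stops at end of file, using plain bytes literals instead of the b_ helper (skip_over_comment is a hypothetical name, not a proposed patch):

```python
import io


def skip_over_comment(stream):
    """Advance past a '%' comment, stopping at a newline or at EOF."""
    tok = stream.read(1)
    if not tok:
        return  # empty stream: nothing to skip
    stream.seek(-1, 1)
    if tok == b"%":
        while tok not in (b"\n", b"\r"):
            tok = stream.read(1)
            if not tok:
                return  # EOF inside the comment: stop instead of looping


unterminated = io.BytesIO(b"% no newline")
skip_over_comment(unterminated)
```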
I would propose three courses of action:
class SafeStream(object):
    def __init__(self, stream):
        self.stream = stream
        self.seek = stream.seek
        self.tell = stream.tell
        self._empty_reads = 0

    def read(self, *args):
        res = self.stream.read(*args)
        if not res:  # empty str or bytes: nothing was read
            self._empty_reads += 1
            if self._empty_reads > 1000:
                raise Exception("too many empty reads")
        else:
            self._empty_reads = 0
        return res
Add a script for automating fuzz testing to the repo
Fix the bugs as the script from step (2) finds them
What do you think? Would you be open to patches for those?
Hi,
As in my last post, "Add method ignoreText", I need to extract only the text from a PDF. I tried several products for extracting text from PDFs, but all of them return the text as a plain string; none keeps the text position and fonts. I think PyPdf is the right tool for this.
I added this method to the PdfFileWriter class in pdf.py:
def ignoreImage(self, ignoreByteStringObject=False):
    pages = self.getObject(self._pages)['/Kids']
    for j in range(len(pages)):
        page = pages[j]
        pageRef = self.getObject(page)
        content = pageRef['/Contents'].getObject()
        if not isinstance(content, ContentStream):
            content = ContentStream(content, pageRef)
        _operations = []
        seq_graphics = False
        for operands, operator in content.operations:
            if operator == "Tj":
                text = operands[0]
                if ignoreByteStringObject:
                    if not isinstance(text, TextStringObject):
                        operands[0] = TextStringObject()
            elif operator == "'":
                text = operands[0]
                if ignoreByteStringObject:
                    if not isinstance(text, TextStringObject):
                        operands[0] = TextStringObject()
            elif operator == '"':
                text = operands[2]
                if ignoreByteStringObject:
                    if not isinstance(text, TextStringObject):
                        operands[2] = TextStringObject()
            elif operator == "TJ":
                for i in range(len(operands[0])):
                    if ignoreByteStringObject:
                        if not isinstance(operands[0][i], TextStringObject):
                            operands[0][i] = TextStringObject()
            if operator == 'q':
                seq_graphics = True
            if operator == 'Q':
                seq_graphics = False
            if seq_graphics:
                if operator in ['cm', 'w', 'J', 'j', 'M', 'd', 'ri', 'i', 'gs',
                                'W','n', 'f', 'm', 'l', 'cm', 'Do', 'sh', 'S']:
                    continue
            if operator == 're':
                continue
            _operations.append((operands, operator))
        content.operations = _operations
        pageRef.__setitem__(NameObject('/Contents'), content)
If you think this method is helpful, can you add it?
Thanks.
I'm merging 2 pdfs using code that works correctly for other pdfs. I'm using the mergePage method to overlay the content from one pdf on the other pdf (merge page by page).
In the image below, the numbers (highlighted by red box) should be positioned vertically.
The "base pdf" is a scan from a Xerox WorkCentre 7435. The "secondary pdf" (containing the highlighted numbers) is generated using reportlab. The "base pdf" and "secondary pdf" have portrait orientation when viewing in a pdf viewer.
Other scans (from other scanners) merge correctly.
I don't know much about how pdf structure works, but is it possible the scan isn't including some data (orientation)?
I will try to include a problem pdf when I obtain one that doesn't contain sensitive information.
Thanks
Rob
I have a 483 page PDF that I use for testing (manual). The problem is that when I try to split the document, it takes almost 2 min to process the first handful of pages, and then 3 seconds to process the remaining 450+.
Pages 3-6 contain a table of contents with links to other parts of the PDF. When I take these few pages out of the document, it takes 3-4 seconds to split the 483 pages.
Any ideas why it's hanging on the table of contents (with links)?
This seems to be the only feature that doesn't work under Python 3. There are several encryption algorithms; it is probably just a matter of using utils.py correctly to avoid TypeErrors.
Since your username has changed it's not possible to download via pip anymore.
Could you please fix this?
https://pypi.python.org/pypi/PyPDF2
Thanks!
Hi,
Is PyPDF2 fully API compatible with PyPDF ? I'm trying to get PyPDF in Fedora replaced by PyPDF2 but we must know if it won't break anything or fix application accordingly.
Thanks !
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will not be corrected. [pdf.py:1130]
Merging 2 pdfs. The first pdf is from paperport 11 (some old program which may not support pdf structure correctly?), I initially needed to apply the fix from #34 (to fix EOF error). The next issue I encountered is in the method: _flatten (in pdf.py) where "/Type" isn't present in the pages dictionary.
I made the following change:
def _flatten(self, pages=None, inherit=None, indirectRef=None):
    ...
    ...
    # this is the change I made; default t = '/Pages'. Is this the correct thing to do?
    t = "/Pages"
    if "/Type" in pages:
        t = pages["/Type"]
    ...
Should I commit a fix for this (and make it conditional on strict parameter)? Or is there a better way to pick a type?
I get a mysterious error with the PDF reader using python3 on the file
"Werner - Fragen und Antworten zu Werkstoffen.pdf".
My Code:
import fnmatch
import os
from PyPDF2 import PdfFileReader
for file in os.listdir('.'):
    if fnmatch.fnmatch(file,'*.pdf'):
        print("File: "+file)
        foo = PdfFileReader(open(file,"rb"))
Error:
File: Werner - Fragen und Antworten zu Werkstoffen.pdf
Traceback (most recent call last):
File "test.py", line 8, in <module>
foo = PdfFileReader(open(file,"rb"))
File "/usr/lib/python3.3/site-packages/PyPDF2/pdf.py", line 684, in __init__
self.read(stream)
File "/usr/lib/python3.3/site-packages/PyPDF2/pdf.py", line 1236, in read
streamData = BytesIO(xrefstream.getData())
File "/usr/lib/python3.3/site-packages/PyPDF2/generic.py", line 834, in getData
decoded._data = filters.decodeStreamData(self)
File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 310, in decodeStreamData
data = FlateDecode.decode(data, stream.get("/DecodeParms"))
File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 121, in decode
rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 121, in <listcomp>
rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
TypeError: ord() expected string of length 1, but int found
Is something wrong with my filename, or why does this error occur?
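The traceback points away from the filename: on Python 3, iterating over a bytes object yields ints, so calling ord() on each item fails. A compatibility helper along these lines would avoid the TypeError. This is a sketch; ord_ and row_values are hypothetical names, and the slice expression mirrors the line shown in the traceback.

```python
def ord_(item):
    """Return the byte value of `item` on both Python 2 and Python 3."""
    # Python 3 bytes iterate as ints; Python 2 str iterates as 1-char strings.
    return item if isinstance(item, int) else ord(item)


def row_values(data, row, rowlength):
    """Return the byte values of one row from a flat buffer."""
    return [ord_(x) for x in data[(row * rowlength):((row + 1) * rowlength)]]
```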
When I execute the following code in Visual Studio 2012 using Python tools, ironpython 2.7 and PyPDF2 v1.20,
I get this error: "int() got an unexpected keyword argument 'base'", line 803 in pdf.py
This is my complete code:
import clr
clr.AddReference('System.Drawing')
clr.AddReference('System.Windows.Forms')
from System.Drawing import *
from System.Windows.Forms import *
from PyPDF2 import PdfFileReader
class MyForm(Form):
    def __init__(self):
        # Create child controls and initialize form
        self.Text = "Test Project"
        self.Size = Size(600, 500)
        path = "F:/Download/RealPython.pdf"
        f = open(path)
        inputpdf = PdfFileReader(open(path, "rb"))
        page = inputpdf.getPage(8)
        pagecontent = page.extractText()
        display.mediaBox.upperRight = (
            display.mediaBox.getUpperRight_x() / 2,
            display.mediaBox.getUpperRight_y() / 2
        )
Application.EnableVisualStyles()
Application.SetCompatibleTextRenderingDefault(False)
form = MyForm()
Application.Run(form)
I read that PyPDF2 is written in pure Python, so it should run with any Python implementation; that is why I am using ironpython 2.7.
Can anyone help? :)
PyPDF2 can't read some files which can be read by pyPdf.
Usually these PDFs contain ^M characters in the first line, just after the comment, like this one: http://books.nips.cc/papers/files/nips24/NIPS2011_0622.pdf
Also I noticed that the function readNonWhitespace in utils.py is performing exactly the same function as skipOverWhitespace. I suppose readNonWhitespace should read non whitespace characters?
The command
pdfcat foo.pdf -2: >bar.pdf
is supposed to put the last two pages of foo.pdf into bar.pdf. But argparse chokes on the "-2:".
I have two PDFs to merge: one with HTML links, and another with just plain watermarks.
After merging, the links do not work, and if I reverse the merge sequence, the watermarks hide the links.
Here is my code:
bg = PdfFileReader(file("/tmp/bg.pdf", "rb")) #plain watermarks
fg = PdfFileReader(file("/tmp/fg.pdf", "rb")) #text with links
page = bg.getPage(0)
page.mergePage(fg.getPage(0))
output = PdfFileWriter()
output.addPage(page)
ostream = file('/tmp/out.pdf', 'wb')
output.write(ostream)
ostream.close()
When a standalone NameObject is encountered the parsing code raises an exception.
Reproducible with:
from PyPDF2.generic import readObject
from cStringIO import StringIO
print readObject(StringIO("/deviceRGB"), None)
PyPDF2 fails with PdfStreamError("Stream has ended unexpectedly").
Some of the PDFs generated with ImageMagick (image to PDF conversion) have this standalone "/deviceRGB", and it is not followed by a space or any of the delimiters. I have come across a couple of PDFs with this problem. Unfortunately I cannot send them (client data). I'll try to create such a PDF and attach it here.
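A reader that treats end of input as a valid terminator for a name would accept this case. The sketch below is an illustration only, not the library's readObject: read_name is a hypothetical function, and the delimiter and whitespace sets follow the PDF format's token rules.

```python
import io

_DELIMITERS = b"()<>[]{}/%"
_WHITESPACE = b"\x00\t\n\x0c\r "


def read_name(stream):
    """Read a name token such as b'/deviceRGB' from a binary stream."""
    name = stream.read(1)
    if name != b"/":
        raise ValueError("a name must start with '/'")
    while True:
        tok = stream.read(1)
        if not tok:
            break  # end of input also ends the name
        if tok in _WHITESPACE or tok in _DELIMITERS:
            stream.seek(-1, 1)  # leave the terminator for the next token
            break
        name += tok
    return name


standalone = read_name(io.BytesIO(b"/deviceRGB"))
```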
A new PyPDF2 branch 'Python3-3' has been created, incorporating William Culver's changes from his pull request #4 . However, it currently only completely works on Python 2.6 and 2.7.
I have a PDF from which PdfFileReader is unable to read the text; instead, this is the output:
u'\n˘ˇˆ˘ˇ˙˝˛˛˚˜ !!"#$%&"˝˛˝˘˛˘˛˚˙˘ˇ˝˛˘˛$\'(˘%˘ˇ˘ˆ˘)_)˛\'+,-)"˛./0"0!123˛"4˙"5)46)!6"˙˘˘˘,˘ˇˆ˙˙ˆ˝˛˚˜ !˘ˇˆ˙˝"" ˜#˝$˛˚˜ ˆ˙˝"" ˜ %˛˚˜ !˛˚ˇ!"#$%˘ˇ&ˆ˙˝˛˝ˆ˙&˚˝\'˛˚&\'()_ˇ+˙˝"" ˜#˝$˜#( ˛˚(ˇ+,˘˘˘ˇˆˆˆˇ,ˆ--ˆˇˇ˙˝˝% ˜)˜#_#˝$$˜ ˙ ˝_˛˚ˆ-&ˆ!ˆˇ&˘+$ˆ(˙˝+˚˜,!˛˚./&0ˆˆ+$ˆ(˙˝-˛-,&˘˝ˆ. ˚%˝% ˜)˜#\* ˜!˛˚&ˆˇ%ˆ!&(12+3ˇ˙˝,˜ˆ/˛˚%#"+3("ˆˇ.!ˆˇ43ˇ(˙-,&53ˇ6ˆˇ,˝˝% ˜)˜#\* ˜!˛˚(77777777777˜#( 0123& ˜"" ˜ %˛˚˜ 77777777777˜#( _ˆ_˛ ,4+#(56˝% ˜)˜#\* ˜!7 56 _˜ˆ( %!_ˆ_˛ ˆ˙&˚˝\'586"ˇ+((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'()_&\'(_&\'()˘536((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'&\' &\'˜ ˜˙˚ˆ-",ˇˆˇ!ˆ-ˆ,ˆ&ˆ!ˆˇ&53ˇ6ˆˇ,(˙˚&ˆ!-ˇ!6ˆˇ,˘ 8-ˇˆ-˙˝˝% ˜)˜#_ ˜!7 ˛˚(˙˚9ˇˇˆ-6ˆˇ,:;ˇˇˆ-<ˆˆ-ˇ&\' ,,˘˘ˇˇˆ-(9ˆˇˆ-!˘ˇˆ9˘ˆˇ˘˘(\n\n'
This is the output after extracting the text, and it does not throw any error message.
A similar issue has been posted here:
http://stackoverflow.com/questions/15583535/how-to-extract-text-from-a-pdf-file-in-python
I am using Windows, so the solution in the link is not helpful.
Hi
I am trying to use the pyPDF2 module to merge a lot of PDF files. For some of the PDF files it fails.
The failing PDF files are generated directly from Autocad.
Traceback (most recent call last):
File "", line 37, in
File "", line 29, in main
File "C:\Python27\lib\site-packages\PyPDF2\merger.py", line 168, in append
self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
File "C:\Python27\lib\site-packages\PyPDF2\merger.py", line 116, in merge
pages = (0, pdfr.getNumPages())
def main():
    import os
    from PyPDF2 import PdfFileReader, PdfFileMerger
    doclistdir = r'xxxxxxxxxxxxxxxxxx'
    doclistfile = open(r'xxxxxxxxxx\list.txt','r')
    doclist = doclistfile.readlines()
    merger = PdfFileMerger()
    for doc in doclist:
        # join the directory and the file name with the platform separator
        pdfdoc = os.path.join(doclistdir, doc.strip())
        mergerelement = open(pdfdoc,'rb')
        #print 'Processing: ' + pdfdoc
        merger.append(mergerelement)
    output = open(os.path.join(doclistdir, "document-output.pdf"), "wb")
    merger.write(output)
    pass

if __name__ == '__main__':
    main()
regards
Olav
*** Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32. ***
Traceback (most recent call last):
File "test.py", line 1, in <module>
import PyPDF2
File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\__init__.py", line 1, in <module>
from .pdf import PdfFileReader, PdfFileWriter
File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\pdf.py", line 56, in <module>
from .generic import *
File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\generic.py", line 1042, in <module>
u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'),
File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\utils.py", line 161, in u_
return str(s, 'unicode_escape')
TypeError: str() takes at most 1 argument (2 given)
I noticed the way you are keeping 2to3 compatibility with isinstance(x, basestring) in pagerange.py using this Str indirection. I propose we move this Str to utils, and import it into pagerange.py and pdf.py. Will submit a pull request later tonight.
I am not able to extract readable text; formatting and spaces are not preserved during extraction:
PreemptiveInformationExtractionusingUnrestrictedRelationDiscoveryYusukeShinyamaSatoshiSekineNewYorkUniversity715,Broadway,7thFloorNewYork,NY,10003fyusuke,sekineg@cs.nyu.eduAbstractWearetryingtoextendtheboundaryofInformationExtraction(IE)systems.Ex-istingIEsystemsrequirealotoftimeandhumanefforttotuneforanewscenario.
Is it true that pypdf2 is not format aware, as stated here: http://victorwyee.com/python/convert-pdf-to-text-pypdf-pdfminer-first-impression/
PdfFileReader sends all warning messages to stderr (or some other file that you can specify). Normally, you can redirect warnings into the logging system by using logging.captureWarnings. PdfFileReader stops this by replacing the showWarning function in the constructor.
The only problem is that this will break backwards compatibility for the PdfFileReader constructor.
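For reference, this is roughly how the standard redirection is meant to work when no library replaces the warning hooks. The sketch uses only the standard library; the warning text is borrowed from another report on this page purely as sample data.

```python
import io
import logging
import warnings

# Collect everything the "py.warnings" logger emits in an in-memory buffer.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
logger = logging.getLogger("py.warnings")
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

logging.captureWarnings(True)  # route warnings.warn() into logging
try:
    with warnings.catch_warnings():
        warnings.simplefilter("always")
        warnings.warn("Xref table not zero-indexed", UserWarning)
finally:
    logging.captureWarnings(False)  # restore the previous hook
    logger.removeHandler(handler)

captured = buffer.getvalue()
```

If a constructor swaps out the hook after captureWarnings(True) has been called, later warnings bypass this path, which matches the behaviour described above.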
Hi,
Thank you for adding the methods removeText and removeImage.
For the method removeImages, here is a small correction so that the content is handled correctly:
if operator in ['cm', 'w', 'J', 'j', 'M', 'd', 'ri', 'i',
                'gs', 'W', 'b', 's', 'S', 'f', 'F', 'n', 'm', 'l',
                'c', 'v', 'y', 'h' , 'B', 'Do', 'sh'] or \
   operator in [b'cm', b'w', b'J', b'j', b'M', b'd', b'ri', b'i',
                b'gs', b'W', b'b', b's', b'S', b'f', b'F', b'n', b'm', b'l',
                b'c', b'v', b'y', b'h', b'B', b'Do', b'sh']:
    continue
When I call pdf.getOutlines() I get the error "Unknown Destination Type /FitBH"
Here's my sample pdf to reproduce: http://www.miem.gub.uy/MIEM_Dinamige-portlet/fileServlet?file=1707
I believe the issue is here: https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/generic.py#L942
It seems like it should be /FitBH instead of just FitBH (and similarly for the next three lines).
extractText() method isn't broken, but throws some exceptions in these cases:
http://doctor12wer.blogspot.com/2013/06/extracttext-function-in-pypdf2-throws.html
http://stackoverflow.com/questions/17270387/pypdf2-typeerror-when-trying-to-extract-text
extractText() cpu/memory utilization is massive for the following 1 page 3 MB file. The extraction doesn't complete and the process has to be killed.
Getting a Multiple definitions in dictionary at byte 0x3ee546 for key /CPUCstFn1 for file
http://docs.cpuc.ca.gov/PublishedDocs/Efile/G000/M075/K768/75768464.PDF
when doing reader.getPage()
The file opens without an issue in Foxit.
We are getting an error like PdfReadError: EOF marker not found.
Scenario: we concatenate some PDFs using pyPDF. The input can be a princeXML supplied PDF, a normal PDF, etc.
No issues here.
princeXML PDF + Adobe PDF = pyPDF generated concatenated PDF. This works fine.
The issue happens when we take a pyPDF concatenated PDF of the above type and concatenate it with another normal PDF, again using pyPDF itself.
princeXML PDF + some pyPDF generated PDF = pyPDF generated concatenated PDF (expected). This works in most cases, but in some cases it does not: pyPDF complains that the EOF marker was not found in the pyPDF generated PDF. However, that file was generated by pyPDF itself; did pyPDF fail to write the EOF marker in some unusual cases?
Can anyone look at this bug? It happens quite rarely, but some online sites handle this same PDF without trouble. How can I attach the PDF to GitHub for inspection?
A relevant question can be seen here :
http://stackoverflow.com/questions/15177587/merge-non-standard-pdfs-with-pypdf
PdfFileWriter.addBookmark() returns the newly added bookmark, so it can be used as the parent in subsequent addBookmark() calls in order to create nested bookmarks.
For consistency, PdfFileMerger.addBookmark() should function similarly, however it does not, as it doesn't return anything, thus making it impossible to create nested bookmarks with PdfFileMerger.
The below code will generate an output, but the resulting PDF is not the expected concatenation of the two original pages. Same code works as intended on Linux.
import PyPDF2
pdfList = ['top_01.pdf','top_02.pdf']
def mergePDF():
    writer = PyPDF2.PdfFileWriter()
    for pdf in pdfList :
        f = open(pdf, 'rb')
        reader = PyPDF2.PdfFileReader(f)
        writer.addPage(reader.getPage(0))
    out = open('top.pdf', 'w')
    writer.write(out)
    #out.close()
mergePDF()
Here are the links :
top.pdf
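A possible cause, offered as a guess rather than a diagnosis: the script opens top.pdf with mode 'w', which is text mode. On Windows under Python 2, text mode translates newline bytes on write, which can corrupt binary data such as a PDF, while on Linux the two modes behave the same. A sketch of writing and checking the output in binary mode (write_binary and the sample bytes are hypothetical):

```python
import os
import tempfile


def write_binary(path, data):
    # 'wb' disables newline translation, so the bytes reach disk unchanged.
    with open(path, "wb") as out:
        out.write(data)


sample = b"%PDF-1.4\nstream\r\n\x00\x0a\x0d\nendstream\n%%EOF\n"
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "top.pdf")
write_binary(tmp_path, sample)
with open(tmp_path, "rb") as check:
    round_tripped = check.read()
```

Using a with block also closes the file, which the commented-out out.close() in the script above never does.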
That file can be decompressed by pdftk, but the FlateDecode of PyPDF2 fails:
File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1751, in mergePage
self._mergePage(page2)
File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1801, in _mergePage
originalContent, self.pdf))
File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1714, in _pushPopGS
stream = ContentStream(contents, pdf)
File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 2158, in __init__
stream = BytesIO(b_(stream.getData()))
File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/generic.py", line 850, in getData
decoded._data = filters.decodeStreamData(self)
File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 310, in decodeStreamData
data = FlateDecode.decode(data, stream.get("/DecodeParms"))
File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 102, in decode
data = decompress(data)
File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 47, in decompress
return zlib.decompress(data)
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
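Tools that succeed on such files may simply be more forgiving about a stream whose tail is missing. A lenient fallback could be sketched as follows; lenient_decompress is a hypothetical helper, and the sample payload is made up for the demonstration rather than taken from the file in this report.

```python
import zlib


def lenient_decompress(data):
    """Decompress zlib data, tolerating a truncated tail."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        # A streaming decompressor returns whatever could be decoded
        # before the input ran out instead of raising on truncation.
        return zlib.decompressobj().decompress(data)


payload = b"BT /F1 12 Tf (hello) Tj ET\n" * 50
compressed = zlib.compress(payload)
truncated = compressed[:-4]  # drop the trailing Adler-32 checksum
recovered = lenient_decompress(truncated)
```

The trade-off is that a damaged stream is no longer detected by its checksum, so a fallback like this would probably belong behind a non-strict option.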
Here's the data to be decompressed (repr print):
'H\x89\xedW\xcbr\x1b\xb7\x12\xdd\xf3+f)W\x85#\xbc\x1fZ]\x8a\x0f\xc5%\x9a\x94I:\xded\xa3X\xd4\xe3F\x12mIN\xe2\xbf\xbf\xc0\x003\x83\x19\x00M\xd9I*Qr7*q\x80n4\xbaO\x9f>8\xde\x0c\x0eg\xb8\xc0\xa8D\xaa\xd8\\\x0e\x84\x94%Q\x05\xd3\xa4\xd4\xa2\xd8L\x8a\x83\xc9\xeeU\xb1\xf9\xef`hvp\xb3\xb2\xf9P\x1c|8\xaa>\x1d\xceHk\x88UIX\x81\xac\t\xaa6aM\x84\x92\xc4\xef4G\xe0\x121\xbbsH\x15/5)8-\xa5;b\xd4\x9c\xa0\xb4?\xe2\xa6\xfa\x84J\x8c\x04r_\x1e\xb2\x9b\xaa\xff\xef\xaf\xeau\x8c%&\xd5\xb7\xa2\x8d\x9c\x0bg2\tL\xcec\x8b\xa7`y\xfb\x18\xaf\x1f\x15\xc1\x06\x8a\xa3 \x87\x8d\tB\xd4\x99\xbc\x89N\xec\xdcj\x18,\x13\x84Y\xee\x16n\xc7f\x07\xaf\x13\x99\t\xc9-\x8f>>3\x80\xaa`\xc4V\x8b\xf0\xd2T\xb9\x02\x85)U\x95~V\xed]\x07v\xd7;\xef\xb7^l\x8beQd\x13\x1bF9Kow\x8bw}\xd3\xd0\xf2\xcd\xf6\xe2\xe6<\xda\xb0\xaeo\xe5\x12\xf2!\x8cl{\xf1\xf9v[}\x98n\x06\x9f\n\xac\x95\xc1\'*\x86D\x95\xaa\xa8\xfep\xa5\x8a\x0fw\xc5\xe1\xcd\x1d.&\xbb\xe2\xed\xe0\xb8\xd7\x13D\x97\xdc\xc0\x95\xc9RTI8 X\xd2\xa6\x0eT\x11\x16\xc5\xb9>on\xc8\x99[\xbe\x8d\xda\xe8\xe7\xa6\x18X\xc6W\x1dE\xfb\x7f\t\xc1\x19\xd9~\xb7\'\xa2\xcfQD7O\x91\xc3\xac9E\x08\xde0\x0e\xba\tw\xcb\n&\xe11\xf0\xf1\xd3\xf9E|\xad\xce!#8\x08M\x10R\xd5on\xc9fhHI\x92\xaad\xe38\x94\x90=\xb6\xf7\xd1\xc1OG~\xddxiYm\xdc\x14Vr*\x02\xc0\xba\xe5\x8fq\xdd\xc3c\xae\xee\xe3$\xfdx\x90\xc9Iw\xd7+\x17\x8e\xb9N\x18\xcer\xb4\x8e\x02\x7f\x1dV\x1d\xce\xd7|z2\x9a\xc3[f\xa0\xff\xc5h1\x9e\xd6\x0e\xa8L9X\x9eMW{\xa0\xb3\xdc\x13\xe5|\x9c\xd8\xe0\xf3\xe6\x0f\x99\xcf\x16?\xbe\xea\x16,\x9d\x1c\x9f\x11\x93\xc7z[L\xd0\t\x10\'\xfa6\xd3\x87OM\xa1\x82a8\x9a\xc1y\x9c\x83\xab\xa3\xcd\x9e\x1c/\xa6\x9b\xda\x01V\x9e\xd8\xdf\x87\xe9Y\x9d\xd6\x14\x88J\xa6\x8a_\x0b\xc2lg0D\xed$\x96\x0c\x97\xe6\xb6RX\xc2{\xd8\x0e\xd6\x96\x02)+\xb9(\x98\xe0\xa5\x1b\xd2q.\x9f\xa2\xc4uVC\xe0\xef\xee\xbbY98^\xbe\xab\x91\x13\x13`\'\xff\xc3\xf0\xa2\x81\x99+!\xad9y\xbcK\xb4p\xaa\xb2\xad\x8ay^\xe9\xc6\xcb\xf9|:]\x80\\}:\x1f\x9dLCd5\xea\xe9\xeczw\xbf\xed\xbaMu\xbfs#\x98Hu>x4c*\x91\xb3`\x834c2\xf6\xd0i\xf187\xbf=\xa72\xcf\x8f\xddE\xaa\xd0\xdeHu\
xaa\xb45L\xd2\x94\x9b`\xdc\xdb(\xe1\xfd\xa2;P\xc63\xe16l\xed\xed}\x94\x97\x9fc\x930\x94\xabmd\x11\xce\xb7\xffd.\x92\x18\xf2\x99\xcbn/\xe2\xdb\x82 \x7f\x043\xd5\xb9n\xce6\x00A\x99i"\xf3\x85H\xc1\xdc\xa7\x9d\x0b\x98RU2\xc30%%m3\x1d\x7f\x8e<|I\x8a\x90\x96~]-)\xea=/&\xab\x86=z\x94g\xcc\x05N\xb1\xe4\xe9\xbb\x15H\xb3\'\xcd](\xf5\xcb~tifo\xb1\xa7{\xa5H\xe0{O\xf7\nDc\x9b\xb0\'0g\xb8\x19kMGLrc\xca\xa7\xf1\xd7L\xe5|\x1e>g\xcds\xba%U\x98\x10\xd8\xd71\xf2B\xb5\n\xe3r{q\x15\x8f\xd1\x1c\xd4P\x881C\xb4\xdc>"\x03\xc2\xdd\xc0\x91\xc6s\xea\xf2\xf2\xa6\xa9\xbb\xee>p\xbd\xf7\xbf\xa6\xee\x84\xf8+Ra\xc1\x17T?~\x19<>f\xca\x07\xb7sg>\xb6\xd2\xb4\xd1X~\x1e\xb2\xce<\\\xedr\x95\xc9\x80\xa7\xd9\x9f\xd6.\x9b\xd8_\x07:\x0f\x7fr\x07j\x02W\x82b\x8d\xe2\x0e\xfc\xc6,\xe4Zr_\x12\xfe\xa6\xed\xc7\x8dH4}7$\xc4\xb3\xfc\xc1\xfac<G\xe0\xc4$\xa6R\x11\xb2r<\x84s@\xce\xe6\'\x1e:\t\xed\xd7\x95\xab\xf1\xa1\xae\x17\x04\xd2V*c\xaa=\x12|S\xc4\xb7\xfe%\xee\xa7\x10w\x93\x84b\x8e&a\xb8z\x14Z\xa3=\xa2o\xf41\x91\x91p\x03A\x98\xc1\x01z\x86\xea\x84\x10,s\x04\x9b\x9f\xb9g\x95\x96\xf6\x811\xa4\x8e\x9c\xb3i\xea^\xd4+I\x82dI-\xb9\x1b\x11\xc1\x82\xa7ZgB\x8f\xdc\x84f\x8e#q\xc9\x03\xb5A\xdd\xe8\xd4\xc8\xbeu\x9a\xb5\x96?[/g\x80\x17\xe2\xbc\x0c\x85\xb0X\x7f\x8e\x1b?\x93\x9a\x9d\xb2\xfaj\x86I\x10\x85\xaf\xd3\xd7\x85R\xd7\x8c\x0bT*\x15\xc42\x89[\xfc\xcbc="\xadrj\'d(-\xdf\x84\x14T\x17"\xf9\x82\xdc\xcc\xd6\xee}\x83\r\x07"\xaf\xc4\x0e~\xc8\xea\x82T\xbff\x88\xf1\xde\xfbU\xf6\xf5\xe9\xfc\xce\xbd\xf4&f\xf6\xa9\xfa\xe3*\xd11\xee|\xf3\xac\x15\x8a[\xa0QA\xac\xc1]\xf3\x9b\xe2\xd2\x9cy\xeb\x9e\xb4\x92T\xab\xd46o\xd0\xbb_\xdb\x89\x89\xe9\x95\x9d\xaa\xb7\x1e\xca\xd4B\xa2U)M\xdd\x8d\xbe\xe3\xce\xec\xc6\'.Y\xf8\xe7\x1dc\xf2`r^\x1a\x1d\xde\xe4\xa1\xfe\xdd\xe6\xe1S\x81\xcdo"LKI\xa6\xed)\x15\xa3=l\x8b\xeb\xe2}q?\xc0:H\xdc\xdd@\x1aO]\x07o\x8bO\x85I&QU\x88m23\xaaz=v\xbf\xb8nqsv\xbd\xbbwy\xe5\xaa\x99\xeb\x07cx\x08\xee\xd1\xa0\xf5\xe3\x8bR\xeaXC\xd9\xae\xb7~Y\x94\xbb\xefB\x9eC\x9e\xc9$"e\xfdd\xfc\xd6&\xc9\xb2r\x08\x1f\xf7\x89U\x13$\x10VH#4\xca:p[\x04\xe
2gn\x0eU\x97ty[,\xe3\xa8\x8a0\xae\xefG\x1b\x10\xc4\x98 p\xfd\x07\xec[T\x94\xa2~\x84e\xc6B\xed\x8a\x18X\xd4\xd2\xd9\x0c\x1b\xd4\xc7\xa7y\xcf\x99Vh\xf1\x89\xb9\x05\xa1\xefS\xa1u\x89\r\xb4H\xc5r\xceC\xbf\xd1k\x07\xf5\xef\x8e\x03nN\'*\xe1`\x88\xb5j\x8a|\xa0PB/\x86\xf9\xe6b\x8fLG\xba\x96\x8f!\xbc\x11\xc2pB\xd7\xc7\x94H\xe1\x93*\xdbJ\xce\xb3\xaa2l\xf3\x06\xe1\xdcf\xc9\xda\xc5\xa3:D8\xe1\t\x84\xef\x03\xe9\xd7O\x86\x00\xd0\x16\xa8\xe8\x0cN\x9dA\n\x1f\xbd<,S\xc2\xfbXv\xd0k\xb1\x8c*h\xf6\xb0\x8c*Q`=\xe0\x08\xcb\xb5\x83\xfaw\xc7A\x8d\xe5\xc8\xc1?\x03\xcbF\xdb\xf1\x0e\x96\x1b\xacVs\xe8w\xb1q\x02\x98\xab\xe5\xfae`\x8e\xb1V\x0eE\xec\x89\xba\xecITH~\x12a\x0b\x18bd\xafp\xb2\x99\xf7\x01\x87\xba\xe4\xd9\xb5\xe7\x92\xdaZD\xf6F\xc4\x10[\xab\x97\x8d737L\xc6\x02\xbca\x98<y}O\xdb\xc8\x8a\xfd_\x1e\xfc\t\x94\x1a\xc9\x03\x0f\xc8\x16\xe02A\xa9D\xf2\xfa\xf9\xc3\xfa\x08o\x1c\xd4\xbf;\x0e\xb8\xfb\x15;0\x10\xb7O\xcd\x17\x8e\xf0\xbe: \xc0\xd5\x1f\xa8\x7f\x9f\x85\xef\xdf\xd5\x04\xf49M\x80^\xa2\xae\xe0Q\x13\xc8\x1e\xcb\xf3\x14\xcb\x0b\x0b\xd8\n\xae\xfd\x16\x90=\x92\xe7)\x92\xef\x99\xff\xcd9~q\x1e\x99=\xdd\x04v\xbb{\xdf\x04\x873R`d\'\xe8\xe6r`\x84\xa6M\xc0\x103\x7fW\xe1\xf7\xe0v\x0f&\xd2\xbeo\xdd1\x18l\x18\x9d\x92\xd3/j"lF\xa7S0\xff\xe3\xe5b\xb3Z\xce\x1b\xdfT\x11\x96\xed\x88?PJ{\x8c\xb6\x90g!m\xd7\x90g"\xa8b\x00x\xde\xe3\xfc\x8e1Q\xd8\x92}m\x8c\x10\x82!\xb8\x19\x9b\x98\x17\xdfJ\xc2i\xfc5\x8c\xfco\x07\xe0d5\x8a\xc3z\x1d\x98-N\xc0\xe2|?ZL\xfeZ\xb6\x8e\xa1\xcbzlMRlm\xe4\x85Ri\xf4\xb2\x1e]w\xeck\xf4\xb6\xf6{\x01\xbc\x9c\x8d7\xf8\x9b\x01l\x02#,\xb8\xc7\xdd@\nQ\xf6\xafu8\xa3\x85\xb4\xd1\x1a\x00ss\x88\xb0y(\x89c\xd9\xb3\xa5y\xa1\x12\xa1$\x196\x97e\x92\xb9\x07L\x85\xdb\xca\xbf\xb07z\xd8\x16\x97f\xb1\xaaF\xbd(IwQ\xda\x83\xdd"\xae2\x14,*dO\xcd-\x12\xc0\xad\xa2\xad\xdbx\x91\xdb\xa1\x95s[5lnQ\x03n5i\xa3\x8d\x17)\xe0Vs\xc8\xad\x04\xa25\x9a\x99\xe5,\xb9Ag\xd6-G\xb4u\x1b/r\xc8-PO\x8e\x80zr\x0c\xd4\x93\xe3\xa0\x9e\xf1"\x07\xa2\xc5\x12\x88\x16+ 
Z\xac\x81h\tPON\x80zr"\x81h\r\x91\xe4\x93`>\xe7\xa3\xa5P=)\xcd\xc3\xc4L\x10\xc8\xad\x84\xdcj\xc0-\xc3@\xe2\x19TO&\x80$0\xa8d\x0chA\xce1\x10-\xa7@\xb4\x1chA\xce\x81\x16\xe4\x1cjA\x01\xd5S@\xf5\x14\x0c@\x9f\xe0\x00\xfa\x04\xd4\x82\x12\x01n%\x06\xdcJ\xa8\x05\xa5\x80\xdc\x02\x94\xca%TO\x05\xb5\xa0\x82\xea\xa9\x18\x80>\x05\xd5SA\xf5\xd4\x08p\xab\x19\x00j\r\xb5\xa0\x86ZP\xeb|\xe2\x05\x02ZP \xa0\x05\x8d\x82\x83\xdc\x02\xfd)\x10\xd0\x9f\x02c\xc0-&\x80[\x0c\x94L`\x80R+\xc9\xc2\x1a\xc9Bt%\xaa\x84\x7f\x16\x8e\x1f\xbc\xa0BR\x0b\xa7\x8c\xb6\xa12z\x8a\x97/\x8a\x9f\xbe\x04[\x8e\xea\x1d\x9a\xe0P\x91Yy\x8c\x04%\xd5\xa7I`\xb0\xaa\xdfaHI\xec\x96\xdf\x87"\xee\xf4\xdd\xaav@0\xf3\x0e^7\xba\xcf\xfd>\t\x0c\xa6\xab8\x84"\xf4\xc8\x0e\xd5!A\x98e"u\x9b\xe8\x11GG\x8c\x83{\xce\xde\xc4\x8f\x18a\x80\xcd\x85y\xe9X\xa8U\x1a\xf0\xfcj\x0b\x0bu\xf8\x8d\xb9\x8b\x8c/\x0bR\x8b\xc9\xb7\xc5\xff\x00L\xc2\xa0'
The PDF file also opens correctly in OS X Preview.
The PDF spec seems to require that the startxref keyword and the byte offset to the xref table be on different lines.
However, in the wild, I have found otherwise valid PDFs where the startxref keyword and the byte offset to the xref table are on the same line, like so:
...
0 8
0000000000 65535 f
0000000009 00000 n
0000305603 00000 n
0000305652 00000 n
0000000083 00000 n
0000305310 00000 n
0000305405 00000 n
0000305423 00000 n
trailer
<<
/Size 8
/Root 2 0 R
/Info 1 0 R
>>
startxref 305711
%%EOF
See here for example: https://www.docketalarm.com/cases/PTAB/IPR2014-00358/Inter_Partes_Review_of_U.S._Reissue_Pat._RE043707/docs/01-17-2014-PET-1193/Power_of_Attorney-2-Power_of_Attorney.pdf
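A tolerant reader could accept both layouts by treating any run of whitespace between the keyword and the offset as a separator. The following is only a minimal sketch of that idea, not PyPDF2's actual parsing code: `find_startxref_offset` is a hypothetical helper name, and the two sample byte strings are shortened stand-ins for the end of a file.

```python
import re

# Hypothetical helper (not part of PyPDF2): match "startxref", then any
# whitespace (newline or plain space), then the decimal byte offset.
_STARTXREF_RE = re.compile(rb"startxref\s+(\d+)")


def find_startxref_offset(tail: bytes) -> int:
    """Return the xref byte offset named by the last startxref in `tail`."""
    matches = _STARTXREF_RE.findall(tail)
    if not matches:
        raise ValueError("startxref keyword not found")
    # A file with incremental updates may contain several startxref
    # entries; the last one is the one that applies.
    return int(matches[-1])


# Layout the spec describes: keyword and offset on separate lines.
strict = b"trailer\n<<\n/Size 8\n>>\nstartxref\n305711\n%%EOF\n"
# Layout seen in the wild: keyword and offset on the same line.
lenient = b"trailer\n<<\n/Size 8\n>>\nstartxref 305711\n%%EOF\n"

print(find_startxref_offset(strict))
print(find_startxref_offset(lenient))
```

Both calls yield the same offset, so a reader built this way would not need a special case for the same-line variant.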