0xabu / pdfannots
Extracts and formats text annotations from a PDF file
License: MIT License
Use case 1: Sometimes, we may need annotations from just a specific chapter or a few pages from here and there. This would also speed up the extraction process as all pages need not be processed.
Use case 2: New annotations are added as one progresses through the PDF, so only the newer annotations need to be processed.
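A minimal sketch of how the page restriction in these two use cases could be expressed: a helper that turns a page specification such as `1-3,7` into page numbers a caller could use to skip everything else. The names `parse_page_spec` and `wanted_pages` and the spec syntax are assumptions for illustration, not an existing pdfannots option.

```python
from typing import List, Set


def parse_page_spec(spec: str) -> Set[int]:
    """Parse a spec like "1-3,7" into a set of 1-based page numbers."""
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
            if start > end:
                raise ValueError("invalid range: %r" % part)
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return pages


def wanted_pages(total_pages: int, spec: str) -> List[int]:
    """Return the sorted pages from spec that actually exist in the document."""
    return sorted(p for p in parse_page_spec(spec) if 1 <= p <= total_pages)
```

For use case 2, the same idea would apply with a spec covering only the pages annotated since the last run.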
Would love CSV output like this:
page,type,author,created,text
1,Highlight,John,2023-05-17T11:38:17,Text
Sounds like that should be possible but not sure how. Great tool, thanks!
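As a rough sketch of the requested layout, the standard `csv` module can write rows with exactly that header. The `Row` record below is a hypothetical stand-in for whatever the tool holds per annotation. How to obtain author and creation date from pdfannots is not shown here.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical record; the fields mirror the requested CSV header.
    page: int
    type: str
    author: str
    created: str
    text: str


def write_csv(rows, out) -> None:
    """Write the header plus one line per annotation to a text stream."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["page", "type", "author", "created", "text"])
    for r in rows:
        writer.writerow([r.page, r.type, r.author, r.created, r.text])


buf = io.StringIO()
write_csv([Row(1, "Highlight", "John", "2023-05-17T11:38:17", "Text")], buf)
print(buf.getvalue(), end="")
```

The `csv` writer also takes care of quoting when the annotation text itself contains commas or line breaks.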
It would be really great to differentiate multi-color highlights in a document. Is that possible on pdfannots' end, or is it rather something that has to be done on pdfminer.six' end?
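For context, in the PDF format an annotation dictionary may carry an optional `C` array holding the colour as 1 (gray), 3 (RGB) or 4 (CMYK) numbers between 0 and 1. A minimal sketch of turning such an array into a hex string follows. The function name is hypothetical, and whether pdfannots or pdfminer.six hands the array over in this form is an assumption.

```python
def annot_color_to_hex(c):
    """Convert a PDF /C colour array to '#rrggbb'; return None if it is empty."""
    if not c:
        return None
    vals = [float(v) for v in c]
    if len(vals) == 1:          # DeviceGray
        r = g = b = vals[0]
    elif len(vals) == 3:        # DeviceRGB
        r, g, b = vals
    elif len(vals) == 4:        # DeviceCMYK, naive conversion without a colour profile
        cy, m, y, k = vals
        r, g, b = (1 - cy) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k)
    else:
        raise ValueError("unexpected colour array length: %d" % len(vals))
    # Clamp to [0, 1] before scaling to 0..255.
    channels = [round(max(0.0, min(1.0, v)) * 255) for v in (r, g, b)]
    return "#{:02x}{:02x}{:02x}".format(*channels)
```

With a value like this per annotation, grouping or styling the output by highlight colour becomes a matter of comparing strings.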
It would be nice if the program could extract "Caret" annotations as well, which are the opposite of StrikeOut (suggestion of new text within a context).
If a pdf has the following:
"This is a
sample statement"
The output that is returned is: "This is a samplestatement"
The new line information is lost.
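To illustrate the two behaviours one might want here, this small sketch joins text fragments captured from consecutive lines either with a space or with the original line break. It is a hypothetical helper, not the code pdfannots uses.

```python
def join_lines(lines, keep_newlines=False):
    """Join per-line text fragments without gluing words together."""
    # Normalise whitespace inside each fragment and drop empty ones.
    cleaned = [" ".join(line.split()) for line in lines if line.strip()]
    separator = "\n" if keep_newlines else " "
    return separator.join(cleaned)
```

Joining with a space avoids "samplestatement", and `keep_newlines=True` preserves the line structure for callers who need it.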
Don't have time to debug this, but there's some weirdness with the command line argument of the pdf file: when the path has some extra dirs it chokes (most probably pdfminer itself or the six part).
Example: (the file dada.pdf does not contain any annotations btw)
$ python pdfannots.py dada.pdf
Document doesn't include outlines ("bookmarks") .
$ cp dada.pdf ..
$ python pdfannots.py ../dada.pdf
Document doesn't include outlines ("bookmarks") .
$ cp dada.pdf ../..
$ python pdfannots.py ../../dada.pdf
Traceback (most recent call last):
File "pdfannots.py", line 6, in <module>
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
ModuleNotFoundError: No module named 'pdfminer'
(3.6.6-test) $ pip list
Package Version
astroid 2.0.4
chardet 3.0.4
isort 4.3.4
lazy-object-proxy 1.3.1
mccabe 0.6.1
pdfminer.six 20170720
pip 18.1
pycryptodome 3.7.0
pylint 2.1.1
setuptools 39.0.1
six 1.11.0
sortedcontainers 2.0.5
typed-ast 1.1.0
wrapt 1.10.11
I would like the output in a certain format for my personal use. Is it possible to customise it?
Also, is it possible to extract the colour of the highlight or the comment box? So that I can have different formatting based on a colour?
Hi, this script seems very promising. I just used it to extract highlights and comments from a paper I just reviewed.
One issue I have is that the highlighted text and the comments are styled and formatted the same way. If the style could be changed even a little, so that comments look different, it would be easy to tell the author's text from the reviewer's comment. How can I change the script so that the reviewer's comment is either bold or italic?
Hello,
First off, I love the script! It produces beautiful output with context, etc. Great job!
For scientific reviews, since you need to provide rebuttal to each of the reviewers' notes, it would be convenient to respond to, e.g. 'page 3, note 12' or 'note 32'. I would suggest labeling the annotations not only with the page number, but also the number of the note on this page and/or global number (the note count would start at the beginning of the document then).
What do you think?
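A minimal sketch of the numbering suggested above, assuming the annotations are already in document order and each one knows its page number. The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def number_notes(pages):
    """Given one page number per annotation (in document order), return
    (global_number, page, number_on_page) tuples."""
    per_page = defaultdict(int)
    numbered = []
    for global_no, page in enumerate(pages, start=1):
        per_page[page] += 1
        numbered.append((global_no, page, per_page[page]))
    return numbered
```

A label such as "page 3, note 1" or "note 3" can then be formatted from each tuple.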
For this pdf extracting normal text works 100% fine (sorry, it's in German). But trying to extract any of the source code sections (or simple commands within the normal text) never works. Would be very glad if you could improve this situation.
Hello,
stumbled on the following error.
Traceback (most recent call last):
File "/[path]/pdfannots-master/pdfannots.py", line 8, in <module>
pdfannots.cli.main()
File "/[path]/pdfannots-master/pdfannots/cli.py", line 141, in main
doc = process_file(
File "/[path]/pdfannots-master/pdfannots/__init__.py", line 387, in process_file
annot = _mkannotation(pdftypes.dict_value(pa), page)
File "/[path]/pdfannots-master/pdfannots/__init__.py", line 58, in _mkannotation
assert isinstance(subtype, PSLiteral)
AssertionError
Using the latest master 57fd55d
with python3.
If necessary can provide the PDF in private.
Hi Andrew, may I ask how come that per License.txt Microsoft is the copyright holder for pdfannot.py?
Thank you very much for this program!
I have run pdfannots via Terminal on Mac OS and in the md output I got some bizarre errors in terms of the formatting.
In all cases, I get the same highlight twice.
Hi,
For one of the PDF files I want to process, I get the following error:
Traceback (most recent call last):
File "/home/antoinejdd/Documents/Projects/pdfannots/./pdfannots.py", line 8, in <module>
pdfannots.cli.main()
File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/cli.py", line 141, in main
doc = process_file(
File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 401, in process_file
interpreter.process_page(pdfpage)
File "/usr/local/lib/python3.9/dist-packages/pdfminer.six-20220524-py3.9.egg/pdfminer/pdfinterp.py", line 992, in process_page
self.device.end_page(page)
File "/usr/local/lib/python3.9/dist-packages/pdfminer.six-20220524-py3.9.egg/pdfminer/converter.py", line 80, in end_page
self.receive_layout(self.cur_item)
File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 195, in receive_layout
self.render(ltpage)
File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 295, in render
self.render(child)
File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 290, in render
self.update_pageseq(item)
File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 210, in update_pageseq
x.update_pageseq(component, self.compseq)
File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/types.py", line 246, in update_pageseq
self.pos.update_pageseq(component, pageseq)
File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/types.py", line 218, in update_pageseq
if self.item_hit(component):
File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/types.py", line 210, in item_hit
return (self.x >= item.x0
TypeError: '>=' not supported between instances of 'PSKeyword' and 'float'
This issue seems similar to #54
I can't publicly share the faulty pdf file, but I can send it to you in PM.
First of all, I wish to thank you for this incredible CLI-tool, since I have been searching for so long how to extract my own annotations from PDF readings (as a student) for quite some time, and pdfannots works like a charm!
Having said this, I was wondering whether it would be possible to add an option so as to only extract "detailed comments" to a markdown file, or put differently, an option to leave out "highlights".
In my own use case, I tend to highlight the PDF file using different colour schemes for readability, but I am only interested in extracting the comments that I write in the margins of the PDF (so as to add them to my reading notes).
Do you think adding such a feature would be possible, and not too difficult to implement?
This little script is great. I built an Alfred workflow around pdfannots to have a more convenient UI.
I want to share it with some colleagues but the whole setup (install python3, pip, pdfminer.six, pdfannots, Alfred with Powerpack, my workflow) is a bit tedious for the less technically adept ones. Could pdfannots just be registered at pip? It would make it a bit easier for others to set it up with a oneliner à la
pip install pdfannots
I used the code to see how it works and received an error.
Microsoft Windows [Version 10.0.16299.248]
(c) 2017 Microsoft Corporation. All rights reserved.
G:\Reference>python proba.py 10036.pdf
1Traceback (most recent call last):
File "proba.py", line 345, in <module>
main()
File "proba.py", line 342, in main
printannots(fh)
File "proba.py", line 309, in printannots
pdfannots = [ar.resolve() for ar in pdftypes.resolve1(page.annots)]
File "proba.py", line 309, in <listcomp>
pdfannots = [ar.resolve() for ar in pdftypes.resolve1(page.annots)]
AttributeError: 'dict' object has no attribute 'resolve'
Thank you for creating and maintaining it.
The description of the "INFILE" argument is "PDF files to process".
This suggests multiple files can be processed in a batch. Is this correct, and if so:
It would be really useful to be able to do something like:
pdfannots *.pdf -o allcomments.md
or have a way to do the same, but to output one file per input file, using the input file name as the output file name.
Perhaps something like this:
pdfannots *.pdf --multioutput=true -o [filename].md
Thanks!
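Until something like this exists, one output file per input can be approximated from a wrapper script. The sketch below only derives the output name from the input name. The helper `output_path_for` is hypothetical, and the wrapper would then call pdfannots once per file with the `-o` option mentioned elsewhere on this page.

```python
from pathlib import Path


def output_path_for(pdf_path, out_dir=None, suffix=".md"):
    """Map 'dir/name.pdf' to 'dir/name.md', or into out_dir if one is given."""
    source = Path(pdf_path)
    target_dir = Path(out_dir) if out_dir is not None else source.parent
    return target_dir / (source.stem + suffix)
```

For example, `output_path_for("papers/review.pdf")` gives `papers/review.md`.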
Hi, is there any good pdf file that can represent the case example for the following line
Line 384 in abf1664
How can I have pdfannots return any value at all - page number, etc. - for annotation subtype: /'Square'?
Here's why... the markup annotation - the marker highlighter annotation - used in iOS is (apparently) reported by pdfannots as subtype: /'Square'. Although, my PDF app - PDF Expert - reports this annotation type as "Rectangle".
For this use case, I simply need pdfannots to return even just a page number, for any type of annotation, at all, on my PDF.
Needed to dereference some objects to make the code work for some PDFs:
@@ -89,6 +89,14 @@ class Annotation:
     def __init__(self, pageno, tagname, coords=None, rect=None, contents=None):
         self.pageno = pageno
         self.tagname = tagname
+
+        if isinstance(coords, pdftypes.PDFObjRef):
+            coords = coords.resolve()
+        if isinstance(rect, pdftypes.PDFObjRef):
+            rect = rect.resolve()
+        if isinstance(contents, pdftypes.PDFObjRef):
+            contents = contents.resolve()
+
         if contents == '':
             self.contents = None
         else:
I found that the script cannot extract annotations when there are drawings in the PDF. Why not just ignore them?
Hi!
I recently discovered this repo, and I have to say, it does exactly what I am looking for!
The thing is for my pdfs, I am using different colors to differentiate different types of highlights. So, I am wondering if is there any plan for adding highlight color to the output?
Thanks!
Traceback (most recent call last):
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\yoga260\AppData\Local\Programs\Python\Python38\Scripts\pdfannots.exe\__main__.py", line 7, in <module>
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\cli.py", line 141, in main
doc = process_file(
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 483, in process_file
interpreter.process_page(pdfpage)
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfminer\pdfinterp.py", line 998, in process_page
self.device.end_page(page)
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfminer\converter.py", line 81, in end_page
self.receive_layout(self.cur_item)
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 264, in receive_layout
self.render(ltpage)
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 364, in render
self.render(child)
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 364, in render
self.render(child)
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 359, in render
self.update_pageseq(item)
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 279, in update_pageseq
x.update_pageseq(component, self.compseq)
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\types.py", line 232, in update_pageseq
self.pos.update_pageseq(component, pageseq)
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\types.py", line 205, in update_pageseq
if self.item_hit(component):
File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\types.py", line 197, in item_hit
return (self.x >= item.x0 # type: ignore [no-any-return]
TypeError: '>=' not supported between instances of 'PSKeyword' and 'float'
If the filename contains a space, the run fails.
For example, AB.pdf works correctly, but A B.pdf fails.
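The report does not say where the failure comes from. One common cause is the shell splitting an unquoted `A B.pdf` into two arguments. Under that assumption, the sketch below builds the command as an argument list (so nothing gets split) and shows the quoted form one would type in a POSIX shell. The `python3 -m pdfannots ... -o ...` invocation is taken from other posts on this page, and `build_command` is a hypothetical helper.

```python
import shlex


def build_command(pdf_path, out_path):
    """Return the argument list for one pdfannots run; no shell parsing involved."""
    return ["python3", "-m", "pdfannots", pdf_path, "-o", out_path]


cmd = build_command("A B.pdf", "A B.md")
# shlex.join quotes each argument the way a POSIX shell expects it.
print(shlex.join(cmd))
```

Passing `cmd` as a list to `subprocess.run` would hand the file name to the program unchanged, spaces included.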
Expected behavior: running app with help flag produces help file
Actual behavior: produces errors
$ ./pdfannots.py --help
Traceback (most recent call last):
File "/home/twood/bin/pdf_annots/./pdfannots.py", line 7, in <module>
from pdfannots.cli import main
File "/home/twood/bin/pdf_annots/pdfannots/__init__.py", line 13, in <module>
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
ModuleNotFoundError: No module named 'pdfminer'
Modify README.md to add "Installing" section including note to:
sudo pip3 install -r requirements.txt
For many of the PDFs I want to analyze I get the following error/backtrace:
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/my/pdfs.py", line 98, in get_annots
doc = pdfannots.process_file(fo, emit_progress_to=None)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 483, in process_file
interpreter.process_page(pdfpage)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 896, in process_page
self.device.end_page(page)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfminer/converter.py", line 52, in end_page
self.receive_layout(self.cur_item)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 264, in receive_layout
self.render(ltpage)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 364, in render
self.render(child)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 364, in render
self.render(child)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 359, in render
self.update_pageseq(item)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 279, in update_pageseq
x.update_pageseq(component, self.compseq)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/types.py", line 232, in update_pageseq
self.pos.update_pageseq(component, pageseq)
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/types.py", line 205, in update_pageseq
if self.item_hit(component):
File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/types.py", line 197, in item_hit
return (self.x >= item.x0 # type: ignore [no-any-return]
TypeError: '>=' not supported between instances of 'NoneType' and 'float'
I attached an example PDF file for which the error occurs. Any ideas?
This looks like a case similar to issue #48.
Once again, I have a PDF file which is a book scan, and I get a warning about popup annotations not being supported. Instead of continuing, the annotation extraction is aborted by an AssertionError.
This time, I couldn't break it down to a particular page – the issue seems to occur regardless of the page tried. I have therefore attached a sample of 10 pages, and the log I get.
WARNING: Unsupported annotation subtype: /'Popup'
WARNING: Unsupported annotation subtype: /'Popup'
Traceback (most recent call last):
File "/opt/homebrew/bin/pdfannots", line 8, in <module>
sys.exit(main())
File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/cli.py", line 141, in main
doc = process_file(
File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/__init__.py", line 472, in process_file
page.annots.sort()
File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/types.py", line 226, in __lt__
return self.pos < other.pos
File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/types.py", line 182, in __lt__
assert self._pageseq != 0
AssertionError
I have downloaded the folder including your last update, run an extraction from a pdf and I get back the following errors:
File "/Users/.../pdfannots.py", line 10, in <module>
sys.exit(main())
File "/Users/.../pdfannots/cli.py", line 141, in main
doc = process_file(
File "/Users/.../pdfannots/__init__.py", line 448, in process_file
annot = _mkannotation(pa.resolve(), page)
File "/Users/.../pdfannots/__init__.py", line 46, in _mkannotation
subtype = pa.get('Subtype')
AttributeError: 'NoneType' object has no attribute 'get'
The call function used is: python3 pdfannots.py "file_2017.pdf" -o notes.md --print-filename -p
The md output is blank.
Originally posted by @Chris-mik in #41 (comment)
With a previous version of pdfannots, I never found a way to redirect the output to a file (> or tee etc.) because whenever I added another argument to the call "python pdfannots x.pdf", I quickly (after only a few pages) got an error like this:
Traceback (most recent call last):
File "pdfannots.py", line 452, in <module>
sys.exit(main())
File "pdfannots.py", line 448, in main
prettyprint(annots, args.output, args.wrap, args.sections)
File "pdfannots.py", line 323, in prettyprint
printitem(a, fmttext(a))
File "pdfannots.py", line 304, in printitem
print(msg + "\n", file=outfile)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 15: ordinal not in range(128)
Never got any problems with "python pdfannots x.pdf" alone, worked perfectly every time.
With the new version and the "-o" argument, I still have the same issue. Am I doing anything wrong (I'm not a Python programmer)?
(Also, sometimes I cannot extract source codes from PDF files, those end up in gibberish. Obviously, some encoding problem. )
Hi, thanks again for pdfannots! I recently encountered a small issue where an unsupported annotation type completely shut down the annotation extraction. While it's understandable that not every fancy annotation type can be extracted, pdfannots shouldn't abort completely, but rather simply skip the annotation.
It took me a while to find the problematic annotation, which was made harder by the fact that it wasn't visible in normal PDF readers and was probably the result of a poor-quality PDF OCR scan.
The error was: WARNING: Unsupported annotation subtype: /'Popup'
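The skip-and-continue behaviour asked for here could look roughly like the loop below. This is a sketch only: raw annotations are modelled as plain dicts, and the `SUPPORTED` set is illustrative rather than the list pdfannots actually handles.

```python
import logging

logger = logging.getLogger("example")

# Illustrative set of subtypes; not taken from the pdfannots source.
SUPPORTED = {"Highlight", "Underline", "Squiggly", "StrikeOut", "Text"}


def convert_all(raw_annots):
    """Convert what we can and warn about the rest instead of aborting."""
    results = []
    for raw in raw_annots:
        subtype = raw.get("Subtype")
        if subtype not in SUPPORTED:
            logger.warning("Unsupported annotation subtype: %r", subtype)
            continue  # skip this annotation, keep processing the page
        results.append((subtype, raw.get("Contents")))
    return results
```

The key point is that an unknown subtype only produces a warning, and anything later that sorts or prints annotations never sees the skipped entry.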
I've run pdfannots across all of my PDFs and get this error on a few of them. The ones that failed are not very important to me, so I didn't investigate much, but I'm leaving it here in case someone else wants to.
WARNING:pdfannots:failed to retrieve outlines
ERROR:pdfannots:'int' object has no attribute 'objid'
Traceback (most recent call last):
File "/L/soft/pdfannots/pdfannots.py", line 454, in process_file
outlines = get_outlines(doc, pagesdict)
File "/L/soft/pdfannots/pdfannots.py", line 395, in get_outlines
page = pagesdict[pageref.objid]
AttributeError: 'int' object has no attribute 'objid'
(I added backtrace myself).
I've attached an example of file on which it reproduces.
Hit a WARNING: Failed to retrieve outlines: 'PDFObjRef' object is not subscriptable.
Using the latest master 57fd55d
with python3.
Can provide the PDF in private.
Hi folks, thanks for your efforts with this tool. I was wondering if there are plans to add rectangular (image) selection to the tool?
The workflow that I imagine would be to select figures / tables / formulas using the often available rectangular selection, and then have those selection saved as images (.png, .jpg, etc...) and included as links in the markdown file.
This might be due to a change in pdfminer.six's PDFObjRef.
Traceback (most recent call last):
File "pdfannots.py", line 351, in <module>
main()
File "pdfannots.py", line 348, in main
printannots(fh)
File "pdfannots.py", line 316, in printannots
pageannots = getannots(pdfannots, pageno)
File "pdfannots.py", line 149, in getannots
a = Annotation(pageno, subtype.name.lower(), pa.get('QuadPoints'), pa.get('Rect'), contents)
File "pdfannots.py", line 102, in __init__
assert len(coords) % 8 == 0
TypeError: object of type 'PDFObjRef' has no len()
Environment.
ca-certificates 2018.03.07 0
certifi 2018.11.29 py36_0
chardet 3.0.4
libcxx 4.0.1 hcfea43d_1
libcxxabi 4.0.1 hcfea43d_1
libedit 3.1.20170329 hb402a30_2
libffi 3.2.1 h475c297_4
ncurses 6.1 h0a44026_1
openssl 1.1.1a h1de35cc_0
pdfminer.six 20181108
pip 18.1 py36_0
pycryptodome 3.7.2
python 3.6.8 haf84260_0
readline 7.0 h1de35cc_5
setuptools 40.6.3 py36_0
six 1.12.0
sortedcontainers 2.1.0
sqlite 3.26.0 ha441bb4_0
tk 8.6.8 ha441bb4_0
wheel 0.32.3 py36_0
xz 5.2.4 h1de35cc_4
zlib 1.2.11 h1de35cc_3
If there is no highlighted text, the message Document doesn't include outlines ("bookmarks") is returned. pdf2text.py works fine on the text regardless of highlight status.
Many thanks for developing this wonderful tool!
Just an idea for the format of output file like this:
https://forum.zettlr.com/discussion/94/zotero-as-zettelkasten
I think it would make the citation of the notes and highlighted texts much easier.
Showing highlights first, then comments etc. feels a bit odd to me.
I would much rather have the output sorted by the page / line they're found on.
Hi!
For my specific use case it would be great to have an option to have pdfminer ignore page labels.
At the moment I am using a script that, in the resulting markdown file, adds links to the specific page in the PDF, like so:
[Page 2](<file.pdf#page=2>)
Obviously, the page labels often don't correspond to the actual page number in the file, which would make this type of switch useful.
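To make the distinction concrete, here is a small sketch that keeps the page label for display but always links to the physical page. `page_link` and its parameters are hypothetical and not part of pdfannots.

```python
def page_link(pdf_name, page_index, label=None):
    """Build a markdown link to a physical page.

    page_index is the 0-based position of the page in the file; label is the
    optional page label (e.g. "ii" or "47") used only for the link text.
    """
    physical = page_index + 1
    text = "Page %s" % (label if label is not None else physical)
    return "[%s](<%s#page=%d>)" % (text, pdf_name, physical)
```

With labels ignored, `page_link("file.pdf", 1)` produces the link format shown above.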
Hi @0xabu,
I am trying to extract headings from a PDF. Is that possible with this library?
I tried some code using it but got nothing apart from an error, and I could not find documentation or a usage example for the library.
Could you please guide me?
Thank you in advance.
Great work, I really appreciate it.
Just curious: is it possible to detect shapes such as rectangles, lines and arrows with this code, or could that be a future improvement?
Great work from the author.
While other PDFs work like a charm, I have an issue when running on this PDF.
The program raises an error:
line 341, in format_bullet
assert quotepos + quotelen <= len(paras)
AssertionError
Disabling line 341 (commenting out the assert quotepos + quotelen <= len(paras)) lets the program get further, but it then raises another error:
line 675, in main
paper_title_long=comb_details_report[paper_title_index]['all_text']
TypeError: list indices must be integers or slices, not NoneType
On Windows (not WSL), the following command produces a UTF8-encoded output file:
python3 -m pdfannots foo.pdf -o annots.txt
The following command does not, instead the resulting file has an 8-bit encoding -- on my machine, it uses some variant of ISO8859, but is presumably dependent on the codepage.
python3 -m pdfannots foo.pdf > annots.txt
As a workaround, when using Windows Python, it's advisable to use the -o option to write clean UTF8.
Related: #44
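A possible way to make redirected output independent of the console codepage is to switch the text stream to UTF-8 before writing. This is a sketch under the assumption that the stream is a regular `io.TextIOWrapper` (as `sys.stdout` normally is on Python 3.7+). `utf8_writer` is a hypothetical helper, not something pdfannots provides.

```python
import io
import sys


def utf8_writer(stream):
    """Return a text stream that encodes as UTF-8 regardless of the locale."""
    if hasattr(stream, "reconfigure"):
        # Available on io.TextIOWrapper since Python 3.7.
        stream.reconfigure(encoding="utf-8")
        return stream
    # Fallback for wrappers without reconfigure(); needs a binary .buffer.
    return io.TextIOWrapper(stream.buffer, encoding="utf-8")
```

A program would call `out = utf8_writer(sys.stdout)` once at start-up. Setting the environment variable `PYTHONIOENCODING=utf-8` before running Python should have a similar effect without code changes.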
Hey all,
Thank you all for this fantastic script! It works very well, although I found a PDF (attached) whose highlights are being severely truncated. I tweaked the boxhit function to return True if there is any overlap at all, which gave me better results, but the script still does not pick up the last line of each highlight. It looks like the original boxes and the rectangle in the Annotation object are indeed missing this last line (the annotation's y0 is bigger than the item's)...
Anyway... I can provide more info if you'd like and I'd very much appreciate any insight into fixing this although it is also possible that it is more of a pdfminer issue...
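For reference, an "any overlap at all" test between two axis-aligned rectangles, plus an overlap fraction that allows a softer threshold, can be sketched as follows. `Box`, `overlaps` and `overlap_fraction` are illustrative names and not the actual boxhit implementation.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area at all."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def overlap_fraction(item: Box, annot: Box) -> float:
    """Fraction of the item's area that lies inside the annotation box."""
    width = min(item.x1, annot.x1) - max(item.x0, annot.x0)
    height = min(item.y1, annot.y1) - max(item.y0, annot.y0)
    if width <= 0 or height <= 0:
        return 0.0
    area = (item.x1 - item.x0) * (item.y1 - item.y0)
    return (width * height) / area if area > 0 else 0.0
```

Neither test helps if the annotation rectangle itself stops short of the last line, as described above. In that case the missing area is in the PDF's annotation data rather than in the hit test.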
About half of the PDFs I try to extract from give the following error message, or a similar one:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 274: ordinal not in range(256)
I suppose an additional UTF-8 specification is needed.
I have already changed "iso8859-15" to "utf-8" in the get_annots method, yet the issue remains.
I work on a recent Ubuntu Distro and python 3.6.
Given that the absolute path of the PDF file is as follows:
C:\abc.pdf
There are three ways to execute pdfannots.py:
1. python pdfannots.py C:\abc.pdf
2. python pdfannots.py C:\abc.pdf D:\xyz.pdf E:\foo.pdf
3. If you want to hard-code the PDF path, to make it easier to debug or understand the whole code with debugging tools:
3.a)
def parse_args():
    p = argparse.ArgumentParser(description=__doc__)
    # p.add_argument("input", metavar="INFILE", type=argparse.FileType("rb"),
    #                help="PDF files to process", nargs='+')
    g = p.add_argument_group('Basic options')
    g.add_argument("-p", "--progress", default=False, action="store_true",
                   help="emit progress information")
    g.add_argument("-o", metavar="OUTFILE", type=argparse.FileType("w"), dest="output",
                   default=sys.stdout, help="output file (default is stdout)")
    g.add_argument("-n", "--cols", default=2, type=int, metavar="COLS", dest="cols",
                   help="number of columns per page in the document (default: 2)")
    g = p.add_argument_group('Options controlling output format')
    allsects = ["highlights", "comments", "nits"]
    g.add_argument("-s", "--sections", metavar="SEC", nargs="*",
                   choices=allsects, default=allsects,
                   help=("sections to emit (default: %s)" % ', '.join(allsects)))
    g.add_argument("--no-group", dest="group", default=True, action="store_false",
                   help="emit annotations in order, don't group into sections")
    g.add_argument("--print-filename", dest="printfilename", default=False, action="store_true",
                   help="print the filename when it has annotations")
    g.add_argument("-w", "--wrap", metavar="COLS", type=int,
                   help="wrap text at this many output columns")
    return p.parse_args()

def main():
    args = parse_args()
    global COLUMNS_PER_PAGE
    COLUMNS_PER_PAGE = args.cols
    for file in [r"C:\abc.pdf"]:
        file = open(file, 'rb')
        (annots, outlines) = process_file(file, args.progress)
        pp = PrettyPrinter(outlines, args.wrap)
        if args.printfilename and annots:
            print("# File: '%s'\n" % file.name)
        if args.group:
            pp.printall_grouped(args.sections, annots, args.output)
        else:
            pp.printall(annots, args.output)
    return 0
Credit to the OP from SO
With this PDF-file the words are not split. It's an OCR-scan. I tried modifying the word_margin in LAParams to no avail. When exporting the highlights using PDF Expert (my macOS-PDF Reader) it works fine though: here's the expected output.
Any thoughts?
Best regards
File cli.py, line 147:
args.output.write(line)
I replaced args.output.write(line) with
with open(args.output.name, "a", encoding="utf-8") as f:
    f.write(line)
My PDFs have a lot of diacritics and with this change I was able to get the correct output.
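An alternative to reopening the file for every line might be to request UTF-8 when the output file is first opened. `argparse.FileType` accepts an `encoding` argument, so a parser along these lines would do that. This is a sketch modelled on the `-o` option quoted elsewhere on this page, not the actual cli.py.

```python
import argparse


def make_parser():
    """Parser with an -o option whose file is always opened as UTF-8."""
    p = argparse.ArgumentParser()
    p.add_argument(
        "-o", metavar="OUTFILE", dest="output",
        type=argparse.FileType("w", encoding="utf-8"),
        help="output file, written as UTF-8",
    )
    return p
```

Unlike the append-mode workaround, this opens the file in write mode, so each run replaces the previous output instead of adding to it.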
The text annotation tool in the PDF Expert app creates annotations of subtype "FreeText". pdfannots.py doesn't support these, it complains "WARNING: Unsupported FreeText annotation ignored on page 1". It could support these annotations just like "Text" (sticky note) annotations.