0xabu / pdfannots

Extracts and formats text annotations from a PDF file

License: MIT License

Python 100.00%

pdfannots's Issues

Feature: Ability to specify page range from where to extract annotations

Use case 1: Sometimes, we may need annotations from just a specific chapter or a few pages from here and there. This would also speed up the extraction process as all pages need not be processed.
Use case 2: New annotations are added as one progresses through the PDF, so only the newer annotations need to be processed.
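
A minimal sketch of how a page-range argument could be parsed, assuming a hypothetical spec syntax such as "1-3,7" (1-based, inclusive). Neither the syntax nor the `parse_page_ranges` helper is part of pdfannots; both are illustrative only.

```python
from typing import Set


def parse_page_ranges(spec: str) -> Set[int]:
    """Parse a spec such as "1-3,7" into a set of 1-based page numbers.

    Hypothetical helper for illustration; not a pdfannots API.
    """
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError("invalid range: %r" % part)
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return pages


print(sorted(parse_page_ranges("1-3,7")))
```

A caller could then skip any page whose number is not in the returned set, which covers both the single-chapter case and the "only the newer pages" case.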

Feature: CSV output

Would love CSV output like this:

page,type,author,created,text
1,Highlight,John,2023-05-17T11:38:17,Text

Sounds like that should be possible but not sure how. Great tool, thanks!
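
A rough sketch of the requested CSV layout, using Python's standard `csv` module. It assumes each annotation has already been reduced to a plain dict with the five keys shown; that record shape and the `annotations_to_csv` name are assumptions made for this example, not pdfannots output.

```python
import csv
import io

# Column order taken from the example in the request above.
FIELDS = ["page", "type", "author", "created", "text"]


def annotations_to_csv(records):
    """Serialise an iterable of dicts (hypothetical record shape) to CSV text."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


sample = [
    {
        "page": 1,
        "type": "Highlight",
        "author": "John",
        "created": "2023-05-17T11:38:17",
        "text": "Text",
    }
]
print(annotations_to_csv(sample), end="")
```

Using `csv.DictWriter` rather than manual string joining means that annotation text containing commas, quotes or line breaks is quoted correctly.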

Feature Request: Differentiate Extracted Highlights by color

It would be really great to differentiate multi-color highlights in a document. Is that possible on pdfannots' end, or is it rather something that has to be done on pdfminer.six' end?
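
For context, the PDF format stores an annotation's colour in the optional /C entry of the annotation dictionary, as an array of numbers between 0.0 and 1.0 (three numbers for RGB). The sketch below only shows how such an array could be turned into a hex string for output; the `color_to_hex` helper is hypothetical, and how pdfannots would expose the value is not covered here.

```python
def color_to_hex(components):
    """Convert a three-element RGB colour array (floats in 0..1) to '#rrggbb'.

    Hypothetical helper; grey (1 value) and CMYK (4 values) arrays are
    deliberately not handled in this sketch.
    """
    if len(components) != 3:
        raise ValueError("this sketch only handles three-component RGB colours")
    channels = [max(0, min(255, round(float(c) * 255))) for c in components]
    return "#{:02x}{:02x}{:02x}".format(*channels)


print(color_to_hex([1.0, 1.0, 0.0]))
```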

Support "Caret" annotation

It would be nice if the program could extract "Caret" annotations as well, which are the opposite of StrikeOut (suggestion of new text within a context).

Text split by lines loses spacing

If a pdf has the following:

"This is a
sample statement"

The output that is returned is: "This is a samplestatement"

The new line information is lost.
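
To illustrate the expected behaviour, here is a small sketch that joins the lines of a captured text with single spaces. It is not the pdfannots implementation; `join_lines` is a hypothetical helper, and it ignores end-of-line hyphenation.

```python
def join_lines(lines):
    """Join text lines with single spaces, dropping empty lines.

    Hypothetical helper showing the expected result for the report above.
    """
    stripped = (line.strip() for line in lines)
    return " ".join(line for line in stripped if line)


print(join_lines(["This is a", "sample statement"]))
```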

some weirdness with the path to the PDF

Don't have time to debug this, but there's some weirdness with the command line argument of the pdf file: when the path has some extra dirs it chokes (most probably pdfminer itself or the six part).

Example: (the file dada.pdf does not contain any annotations btw)

$ python pdfannots.py dada.pdf
Document doesn't include outlines ("bookmarks") .
$ cp dada.pdf ..
$ python pdfannots.py ../dada.pdf
Document doesn't include outlines ("bookmarks") .
$ cp dada.pdf ../..
$ python pdfannots.py ../../dada.pdf
Traceback (most recent call last):
  File "pdfannots.py", line 6, in <module>
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
ModuleNotFoundError: No module named 'pdfminer'

(3.6.6-test) $ pip list
Package Version

astroid 2.0.4
chardet 3.0.4
isort 4.3.4
lazy-object-proxy 1.3.1
mccabe 0.6.1
pdfminer.six 20170720
pip 18.1
pycryptodome 3.7.0
pylint 2.1.1
setuptools 39.0.1
six 1.11.0
sortedcontainers 2.0.5
typed-ast 1.1.0
wrapt 1.10.11

Can the code be modified to change the format?

I want the output in a particular format for my personal use. Can the code be modified to produce it?
Also, is it possible to extract the colour of the highlight or the comment box, so that I can apply different formatting based on colour?

Modifying the output

Hi, this script seems very promising. I just used it to extract highlights and comments from a paper I reviewed.
One issue I have is that the highlighted text and the comments have the same style and format. If the style could be changed even a little, so that the comment looks different, it would be easy to tell the author's text from the reviewer's comment. How can I change the script so that the reviewer's comment is either bold or italic?

Feature: Add option to display the number of the annotation

Hello,
First off, I love the script! It produces beautiful output with context, etc. Great job!

For scientific reviews, since you need to provide rebuttal to each of the reviewers' notes, it would be convenient to respond to, e.g. 'page 3, note 12' or 'note 32'. I would suggest labeling the annotations not only with the page number, but also the number of the note on this page and/or global number (the note count would start at the beginning of the document then).

What do you think?
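
A minimal sketch of the proposed numbering scheme, assuming the annotations are already in document order and that only their page numbers are needed to derive the labels. The `number_annotations` name and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict


def number_annotations(pages):
    """Given one page number per annotation (in document order), return
    (page, note_on_page, global_note) tuples, all counted from 1.

    Hypothetical helper; not part of pdfannots.
    """
    per_page = defaultdict(int)
    labels = []
    for global_no, page in enumerate(pages, start=1):
        per_page[page] += 1
        labels.append((page, per_page[page], global_no))
    return labels


for page, local_no, global_no in number_annotations([1, 1, 3]):
    print("page %d, note %d (note %d overall)" % (page, local_no, global_no))
```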

Source codes extracted with wrong encoding?

For this pdf extracting normal text works 100% fine (sorry, it's in German). But trying to extract any of the source code sections (or simple commands within the normal text) never works. Would be very glad if you could improve this situation.

sample.pdf
output_sample.txt

AssertionError `assert isinstance(subtype, PSLiteral)`

Hello,

stumbled on the following error.

Traceback (most recent call last):
  File "/[path]/pdfannots-master/pdfannots.py", line 8, in <module>
    pdfannots.cli.main()
  File "/[path]/pdfannots-master/pdfannots/cli.py", line 141, in main
    doc = process_file(
  File "/[path]/pdfannots-master/pdfannots/__init__.py", line 387, in process_file
    annot = _mkannotation(pdftypes.dict_value(pa), page)
  File "/[path]/pdfannots-master/pdfannots/__init__.py", line 58, in _mkannotation
    assert isinstance(subtype, PSLiteral)
AssertionError

Using the latest master 57fd55d with python3.

If necessary can provide the PDF in private.

License

Hi Andrew, may I ask why, per License.txt, Microsoft is the copyright holder for pdfannots.py?

Md output contains highlighted text twice with different formats

Thank you very much for this program!
I have run pdfannots via Terminal on Mac OS and in the md output I got some bizarre errors in terms of the formatting.
In all cases, I get the same highlight twice.

Example 1: the highlighted text appears in two versions: i) one starting with a blockquote (>) and ii) another without. In the former version the extracted text is more polished, i.e., spaces between words are correct, end-line hyphens are removed and the highlighted text appears as a single sentence, whereas the latter version mixes up word spacing and lines are broken.
Example 2: the highlighted text appears again in two versions: i) the highlighted text is in "" and ii) the same text starts with --.
I have encountered both issues in the same PDF. I have tried changing various arguments in the pdfannots call, but the output did not change.
The PDF readers I use to make annotations are Xodo, Foxit or Adobe.

TypeError: '>=' not supported between instances of 'PSKeyword' and 'float'

Hi,

For one of the PDF files I want to process, I get the following error:

Traceback (most recent call last):
  File "/home/antoinejdd/Documents/Projects/pdfannots/./pdfannots.py", line 8, in <module>
    pdfannots.cli.main()
  File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/cli.py", line 141, in main
    doc = process_file(
  File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 401, in process_file
    interpreter.process_page(pdfpage)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer.six-20220524-py3.9.egg/pdfminer/pdfinterp.py", line 992, in process_page
    self.device.end_page(page)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer.six-20220524-py3.9.egg/pdfminer/converter.py", line 80, in end_page
    self.receive_layout(self.cur_item)
  File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 195, in receive_layout
    self.render(ltpage)
  File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 295, in render
    self.render(child)
  File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 290, in render
    self.update_pageseq(item)
  File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/__init__.py", line 210, in update_pageseq
    x.update_pageseq(component, self.compseq)
  File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/types.py", line 246, in update_pageseq
    self.pos.update_pageseq(component, pageseq)
  File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/types.py", line 218, in update_pageseq
    if self.item_hit(component):
  File "/home/antoinejdd/Documents/Projects/pdfannots/pdfannots/types.py", line 210, in item_hit
    return (self.x >= item.x0
TypeError: '>=' not supported between instances of 'PSKeyword' and 'float'

This issue seems similar to #54

I can't publicly share the faulty pdf file, but I can send it to you in PM.
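
The traceback suggests that one of the annotation's coordinates was not a number when it was compared with a layout item's x0. As a sketch only, a defensive check like the one below could be used to skip such annotations instead of crashing; `is_valid_box` is a hypothetical helper, and whether this matches the actual cause in the reporter's PDF is an assumption.

```python
from numbers import Real


def is_valid_box(box):
    """Return True if box is a sequence of exactly four real numbers.

    Hypothetical guard; booleans are rejected even though bool is an int.
    """
    return (
        isinstance(box, (list, tuple))
        and len(box) == 4
        and all(isinstance(v, Real) and not isinstance(v, bool) for v in box)
    )


print(is_valid_box([0, 0, 10.5, 20]))       # well-formed rectangle
print(is_valid_box([0, "null", 10.5, 20]))  # non-numeric token in the array
```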

Feature: Selecting Annotation-Types

First of all, thank you for this incredible CLI tool. As a student, I had been searching for a long time for a way to extract my own annotations from PDF readings, and pdfannots works like a charm!

Having said this, I was wondering whether it would be possible to add an option so as to only extract "detailed comments" to a markdown file, or put differently, an option to leave out "highlights".

In my own use case, I tend to highlight the PDF file using different colour schemes for readability, but I am only interested in extracting the comments that I write in the margins of the PDF (so as to add them to my reading notes).

Do you think adding such a feature would be possible, and not too difficult to implement?

register at pip?

This little script is great. I built an Alfred workflow around pdfannots to have a more convenient UI.

I want to share it with some colleagues but the whole setup (install python3, pip, pdfminer.six, pdfannots, Alfred with Powerpack, my workflow) is a bit tedious for the less technically adept ones. Could pdfannots just be registered at pip? It would make it a bit easier for others to set it up with a oneliner à la
pip install pdfannots

'dict' object has no attribute resolve

I used the code to see how it works and received an error.

G:\Reference>python proba.py 10036.pdf
1Traceback (most recent call last):
File "proba.py", line 345, in
main()
File "proba.py", line 342, in main
printannots(fh)
File "proba.py", line 309, in printannots
pdfannots = [ar.resolve() for ar in pdftypes.resolve1(page.annots)]
File "proba.py", line 309, in
pdfannots = [ar.resolve() for ar in pdftypes.resolve1(page.annots)]
AttributeError: 'dict' object has no attribute 'resolve'

I love this script so much

Thank you for creating and maintaining it.

Process multiple input files

The description of the "INFILE" argument is "PDF files to process".

This suggests multiple files can be processed in a batch. Is this correct, and if so:

how should the command be formatted (how to specify multiple input files)?
will the output file contain a group of annotations per filename?

It would be really useful to be able to do something like:
pdfannots *.pdf -o allcomments.md

or have a way to do the same, but to output one file per input file, using the input file name as the output file name.
Perhaps something like this:
pdfannots *.pdf - multioutput=true -o [filename].md

Thanks!
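
Until something like this exists, a small wrapper script could produce one output file per input. The sketch below only shows how the output names might be derived with `pathlib`; `output_path_for` is a hypothetical helper, and the surrounding loop that would invoke pdfannots for each file is left out.

```python
from pathlib import Path


def output_path_for(pdf_path, suffix=".md"):
    """Return the output path for a given input PDF: same directory and
    stem, different suffix. Hypothetical helper for illustration."""
    return Path(pdf_path).with_suffix(suffix)


for name in ["notes/paper.pdf", "thesis.pdf"]:
    print(name, "->", output_path_for(name))
```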

PDF example that can illustrate transformation of a text into a blockquote, and the comment into subsequent paragraphs.

Hi, is there a good example PDF that exercises the case handled by the following line?

pdfannots/pdfannots.py

Line 384 in abf1664

# Otherwise, text (if any) turns into a blockquote, and the comment (if

WARNING: Unsupported annotation subtype: /'Square'

How can I have pdfannots return any value at all - page number, etc. - for annotation subtype: /'Square'?

Here's why... the markup annotation - the marker highlighter annotation - used in iOS is (apparently) reported by pdfannots as subtype: /'Square'. Although, my PDF app - PDF Expert - reports this annotation type as "Rectangle".

For this use case, I simply need pdfannots to return even just a page number, for any type of annotation, at all, on my PDF.

dereference some stuff

Needed to dereference some objects to make the code work for some PDFs:

@@ -89,6 +89,14 @@ class Annotation:
     def __init__(self, pageno, tagname, coords=None, rect=None, contents=None):
         self.pageno = pageno
         self.tagname = tagname
+
+        if isinstance(coords, pdftypes.PDFObjRef):
+            coords = coords.resolve()
+        if isinstance(rect, pdftypes.PDFObjRef):
+            rect = rect.resolve()
+        if isinstance(contents, pdftypes.PDFObjRef):
+            contents = contents.resolve()
+
         if contents == '':
             self.contents = None
         else:
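
The same idea can be written once as a helper rather than repeated per field. The sketch below is duck-typed so it runs without pdfminer installed: `FakeRef` is a stand-in for an indirect reference and `resolve_all` is a hypothetical name. Note that pdfminer.six already ships `pdftypes.resolve1`, which is used elsewhere on this page and serves this purpose for real `PDFObjRef` objects.

```python
class FakeRef:
    """Stand-in for an indirect object reference (illustration only)."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve_all(obj, max_hops=32):
    """Follow .resolve() until a concrete value is reached.

    Hypothetical helper; max_hops guards against reference cycles.
    """
    for _ in range(max_hops):
        if not hasattr(obj, "resolve"):
            return obj
        obj = obj.resolve()
    raise ValueError("too many levels of indirection")


print(resolve_all(FakeRef(FakeRef([1.0, 2.0, 3.0, 4.0]))))
```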

don't work when there are some drawings on pdf

I found that the script cannot extract annotations when there are drawings in the PDF. Why not just ignore them?

Include the highlight color of the annotations in the output

Hi!

I recently discovered this repo, and I have to say, it does exactly what I am looking for!

The thing is for my pdfs, I am using different colors to differentiate different types of highlights. So, I am wondering if is there any plan for adding highlight color to the output?

Thanks!

new bug found :)

Traceback (most recent call last):
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\yoga260\AppData\Local\Programs\Python\Python38\Scripts\pdfannots.exe\__main__.py", line 7, in <module>
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\cli.py", line 141, in main
    doc = process_file(
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 483, in process_file
    interpreter.process_page(pdfpage)
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfminer\pdfinterp.py", line 998, in process_page
    self.device.end_page(page)
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfminer\converter.py", line 81, in end_page
    self.receive_layout(self.cur_item)
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 264, in receive_layout
    self.render(ltpage)
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 364, in render
    self.render(child)
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 364, in render
    self.render(child)
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 359, in render
    self.update_pageseq(item)
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\__init__.py", line 279, in update_pageseq
    x.update_pageseq(component, self.compseq)
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\types.py", line 232, in update_pageseq
    self.pos.update_pageseq(component, pageseq)
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\types.py", line 205, in update_pageseq
    if self.item_hit(component):
  File "c:\users\yoga260\appdata\local\programs\python\python38\lib\site-packages\pdfannots\types.py", line 197, in item_hit
    return (self.x >= item.x0 # type: ignore [no-any-return]
TypeError: '>=' not supported between instances of 'PSKeyword' and 'float'

Extraction of highlighted text fails

If the filename contains a space, the run fails

If the filename contains a space, pdfannots does not run successfully.

For example, AB.pdf works correctly, but A B.pdf fails.
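
One possible cause, which the report does not confirm, is that the shell splits an unquoted path at the space so that the program receives two arguments. The sketch below shows two ways around that from Python: passing the arguments as a list (suitable for `subprocess.run`) and quoting them for a POSIX shell with `shlex.quote`. `build_command` and `shell_line` are hypothetical helper names; the `-o` option is the one mentioned elsewhere on this page.

```python
import shlex


def build_command(pdf_path, out_path):
    """Return an argument list; each element reaches the program as exactly
    one argument, spaces included. Hypothetical helper."""
    return ["pdfannots", pdf_path, "-o", out_path]


def shell_line(args):
    """Render an argument list as a single POSIX-shell command line."""
    return " ".join(shlex.quote(arg) for arg in args)


cmd = build_command("A B.pdf", "notes.md")
print(shell_line(cmd))
```

`shlex.quote` targets POSIX shells; in the Windows command prompt the path would be wrapped in double quotes instead.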

Add installation to README.md

Expected behavior: running app with help flag produces help file

Actual behavior: produces errors
$ ./pdfannots.py --help
Traceback (most recent call last):
  File "/home/twood/bin/pdf_annots/./pdfannots.py", line 7, in <module>
    from pdfannots.cli import main
  File "/home/twood/bin/pdf_annots/pdfannots/__init__.py", line 13, in <module>
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
ModuleNotFoundError: No module named 'pdfminer'

Modify README.md to add "Installing" section including note to:

sudo pip3 install -r requirements.txt

TypeError: '>=' not supported between instances of 'NoneType' and 'float'

For many of the PDFs I want to analyze I get the following error/backtrace:

      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/my/pdfs.py", line 98, in get_annots
        doc = pdfannots.process_file(fo, emit_progress_to=None)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 483, in process_file
        interpreter.process_page(pdfpage)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 896, in process_page
        self.device.end_page(page)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfminer/converter.py", line 52, in end_page
        self.receive_layout(self.cur_item)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 264, in receive_layout
        self.render(ltpage)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 364, in render
        self.render(child)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 364, in render
        self.render(child)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 359, in render
        self.update_pageseq(item)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/__init__.py", line 279, in update_pageseq
        x.update_pageseq(component, self.compseq)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/types.py", line 232, in update_pageseq
        self.pos.update_pageseq(component, pageseq)
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/types.py", line 205, in update_pageseq
        if self.item_hit(component):
      File "/nix/store/7mz8dpy5jbq29s2qddap6rd7fiphc6sj-python3-3.9.6-env/lib/python3.9/site-packages/pdfannots/types.py", line 197, in item_hit
        return (self.x >= item.x0  # type: ignore [no-any-return]
    TypeError: '>=' not supported between instances of 'NoneType' and 'float'

I attached an example PDF file for which the error occurs. Any ideas?

kim09_biogen_small_rnas_animal.pdf

Bug: Assertion error resulting in the abortion of extraction

This looks like a case similar to issue #48.

Once again, I have a PDF file which is a book scan, and get a warning about popup annotations not being supported. And instead of continuing, an Assertion Error results in the abortion of the annotation extraction.

This time, I couldn't break it down to a particular page – the issue seems to occur regardless of the page tried. I have therefore attached a sample of 10 pages, and the log I get.

sample.pdf

WARNING: Unsupported annotation subtype: /'Popup'
WARNING: Unsupported annotation subtype: /'Popup'
Traceback (most recent call last):
  File "/opt/homebrew/bin/pdfannots", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/cli.py", line 141, in main
    doc = process_file(
  File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/__init__.py", line 472, in process_file
    page.annots.sort()
  File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/types.py", line 226, in __lt__
    return self.pos < other.pos
  File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/types.py", line 182, in __lt__
    assert self._pageseq != 0
AssertionError

'NoneType' object has no attribute 'get' in _mkannotation

I have downloaded the folder including your last update, run an extraction from a pdf and I get back the following errors:

  File "/Users/.../pdfannots.py", line 10, in <module>
    sys.exit(main())
  File "/Users/.../pdfannots/cli.py", line 141, in main
    doc = process_file(
  File "/Users/.../pdfannots/__init__.py", line 448, in process_file
    annot = _mkannotation(pa.resolve(), page)
  File "/Users/.../pdfannots/__init__.py", line 46, in _mkannotation
    subtype = pa.get('Subtype')
AttributeError: 'NoneType' object has no attribute 'get'

The command used is: python3 pdfannots.py "file_2017.pdf" -o notes.md --print-filename -p
The md output is blank.

Originally posted by @Chris-mik in #41 (comment)

Redirect output not possible?

With a previous version of pdfannots, I never found a way to redirect the output to a file (> or tee etc.) because whenever I added another argument to the call "python pdfannots x.pdf", I quickly (after only a few pages) got an error like this:

Traceback (most recent call last):
File "pdfannots.py", line 452, in
sys.exit(main())
File "pdfannots.py", line 448, in main
prettyprint(annots, args.output, args.wrap, args.sections)
File "pdfannots.py", line 323, in prettyprint
printitem(a, fmttext(a))
File "pdfannots.py", line 304, in printitem
print(msg + "\n", file=outfile)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 15: ordinal not in range(128)

Never got any problems with "python pdfannots x.pdf" alone, worked perfectly every time.
With the new version and the "-o" argument, I still have the same issue. Am I doing anything wrong (I'm not a Python programmer)?

(Also, sometimes I cannot extract source codes from PDF files, those end up in gibberish. Obviously, some encoding problem. )

Bug: assert self._pageseq != 0

Hi, thanks again for pdfannots! I recently encountered a small issue where an unsupported annotation type completely shut down the annotation extraction. While it's understandable that not every fancy annotation type can be extracted, pdfannots shouldn't abort completely, but rather simply skip the annotation.

It took me a while to find the problematic annotation. This was made harder by the fact that it wasn't an annotation visible in normal PDF readers, but probably the result of a poor-quality PDF OCR scan.

The error was: WARNING: Unsupported annotation subtype: /'Popup'
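
The "skip rather than abort" behaviour requested here can be sketched generically as a loop that converts each raw annotation inside its own try/except. This is an illustration of the pattern, not pdfannots code: `convert_all`, the `convert` callback and the set of caught exception types are all assumptions.

```python
import logging

logger = logging.getLogger("example")


def convert_all(raw_annots, convert):
    """Apply convert() to each raw annotation, logging and skipping the ones
    that fail instead of aborting the whole run. Hypothetical helper."""
    results = []
    for index, raw in enumerate(raw_annots):
        try:
            results.append(convert(raw))
        except (AssertionError, KeyError, TypeError, ValueError) as exc:
            logger.warning("skipping annotation %d: %r", index, exc)
    return results


def demo_convert(subtype):
    # Toy converter: rejects one subtype to simulate an unsupported annotation.
    if subtype == "Popup":
        raise ValueError("unsupported subtype: %s" % subtype)
    return subtype.lower()


print(convert_all(["Highlight", "Popup", "Text"], demo_convert))
```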

AttributeError: 'int' object has no attribute 'objid' while processing some pdfs

I've run pdfannots across all of my PDFs and am getting this error on a few of them. The ones that failed are not very important to me, so I didn't investigate much, but I'm leaving it here in case someone else wants to.

WARNING:pdfannots:failed to retrieve outlines
ERROR:pdfannots:'int' object has no attribute 'objid'
Traceback (most recent call last):
  File "/L/soft/pdfannots/pdfannots.py", line 454, in process_file
    outlines = get_outlines(doc, pagesdict)
  File "/L/soft/pdfannots/pdfannots.py", line 395, in get_outlines
    page = pagesdict[pageref.objid]
AttributeError: 'int' object has no attribute 'objid'

(I added backtrace myself).

I've attached an example of file on which it reproduces.

K52 Series disassembling guide.pdf

Failed to retrieve outlines: 'PDFObjRef' object is not subscriptable

Hit a WARNING: Failed to retrieve outlines: 'PDFObjRef' object is not subscriptable.

Using the latest master 57fd55d with python3.

Can provide the PDF in private.

rectangular (image) selection

Hi folks, thanks for your efforts with this tool. I was wondering if there are plans to add rectangular (image) selection to the tool?
The workflow that I imagine would be to select figures / tables / formulas using the often available rectangular selection, and then have those selection saved as images (.png, .jpg, etc...) and included as links in the markdown file.

Breaks with Highlighted text

This might be due to a change in pdfminer.six's PDFObjRef.

Traceback (most recent call last):
  File "pdfannots.py", line 351, in <module>
    main()
  File "pdfannots.py", line 348, in main
    printannots(fh)
  File "pdfannots.py", line 316, in printannots
    pageannots = getannots(pdfannots, pageno)
  File "pdfannots.py", line 149, in getannots
    a = Annotation(pageno, subtype.name.lower(), pa.get('QuadPoints'), pa.get('Rect'), contents)
  File "pdfannots.py", line 102, in __init__
    assert len(coords) % 8 == 0
TypeError: object of type 'PDFObjRef' has no len()

Environment.
ca-certificates 2018.03.07 0
certifi 2018.11.29 py36_0
chardet 3.0.4
libcxx 4.0.1 hcfea43d_1
libcxxabi 4.0.1 hcfea43d_1
libedit 3.1.20170329 hb402a30_2
libffi 3.2.1 h475c297_4
ncurses 6.1 h0a44026_1
openssl 1.1.1a h1de35cc_0
pdfminer.six 20181108
pip 18.1 py36_0
pycryptodome 3.7.2
python 3.6.8 haf84260_0
readline 7.0 h1de35cc_5
setuptools 40.6.3 py36_0
six 1.12.0
sortedcontainers 2.1.0
sqlite 3.26.0 ha441bb4_0
tk 8.6.8 ha441bb4_0
wheel 0.32.3 py36_0
xz 5.2.4 h1de35cc_4
zlib 1.2.11 h1de35cc_3

If there is no highlighted text, the message "Document doesn't include outlines ("bookmarks")" is returned. pdf2text.py works fine on the text regardless of highlight status.

Output in Zotfile style?

Many thanks for developing this wonderful tool!

Just an idea for the format of output file like this:
https://forum.zettlr.com/discussion/94/zotero-as-zettelkasten

I think it would make the citation of the notes and highlighted texts much easier.

Don't group by annotation type?

Showing highlights first, then comments etc. feels a bit odd to me.

I would much rather have the output sorted by the page / line they're found on.

Feature: Switch to turn off page label support

Hi!

For my specific use case it would be great to have an option to have pdfminer ignore page labels.

At the moment I am using a script that, in the resulting markdown file, adds links to the specific page in the PDF, like so:

[Page 2](<file.pdf#page=2>)

Obviously, the page labels often don't correspond to the actual page number in the file, which would make this type of switch useful.
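
For illustration, this is roughly how such a link can be built from the physical page position rather than the page label. It assumes a zero-based page index is available for each annotation; `page_link` is a hypothetical helper, and the link format is the one shown in the request above.

```python
def page_link(pdf_name, page_index):
    """Build a Markdown link to a physical page of a PDF.

    page_index is zero-based (an assumption of this sketch); the #page=
    fragment counts physical pages from 1.
    """
    page_no = page_index + 1
    return "[Page {n}](<{name}#page={n}>)".format(n=page_no, name=pdf_name)


print(page_link("file.pdf", 1))
```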

How to get the headings from pdf?

Hi @0xabu,
I am trying to extract headings from a PDF. Is that possible with this library?
I tried some code using this library but got nothing apart from an error, and I could not find documentation or a use case for it.

Could you please guide me?
Thank you in advance.

Sample_PDF_file.pdf

Is it possible to detect shapes like Lines and Arrows in pdf

Great work, I really appreciate it.

Just curious: is it possible to detect shapes such as rectangles, lines and arrows using this code? If not, could it be considered as a future improvement?

Assertion and TypeError

Great work from the author.
While other PDFs work like a charm, I have an issue when running pdfannots on this PDF.

The interpreter raises an error:

line 341, in format_bullet
    assert quotepos + quotelen <= len(paras)
AssertionError

Disabling line 341 allows the program to run further, but with assert quotepos + quotelen <= len(paras) commented out, the interpreter raises another error:

line 675, in main
    paper_title_long=comb_details_report[paper_title_index]['all_text']
TypeError: list indices must be integers or slices, not NoneType

Redirected stdout on Windows uses a non-UTF8 encoding

On Windows (not WSL), the following command produces a UTF8-encoded output file:

python3 -m pdfannots foo.pdf -o annots.txt

The following command does not. Instead, the resulting file has an 8-bit encoding; on my machine it uses some variant of ISO8859, but this presumably depends on the codepage.

python3 -m pdfannots foo.pdf > annots.txt

As a workaround, when using Windows Python, it's advisable to use the -o option to write clean UTF8.
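
For anyone wrapping the tool in their own script, the underlying idea is to choose the encoding explicitly instead of inheriting the console default. The sketch below wraps a binary stream in a UTF-8 text layer using the standard `io.TextIOWrapper`; `write_utf8` is a hypothetical helper and the in-memory buffer stands in for a real file or pipe.

```python
import io


def write_utf8(binary_stream, text):
    """Write text to a binary stream as UTF-8, regardless of locale.

    Hypothetical helper for illustration.
    """
    wrapper = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")
    wrapper.write(text)
    wrapper.flush()
    wrapper.detach()  # leave the underlying stream open for the caller


buf = io.BytesIO()
write_utf8(buf, "Grüße\n")
print(buf.getvalue())
```

On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` has a similar effect on the standard output stream itself, and setting the `PYTHONIOENCODING=utf-8` environment variable before running Python is another option.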

Related: #44

PDF example of truncated highlight

Hey all,
Thank you all for this fantastic script! It works very well, although I found a PDF (attached) whose highlights are being severely truncated. I tweaked the boxhit function to return True if there is any overlap at all, which gave me better results, but the script still does not pick up the last line of each highlight. It looks like the original boxes and the rectangle in the Annotation object are indeed missing this last line (the annotation's y0 is bigger than the item's)...

Anyway... I can provide more info if you'd like and I'd very much appreciate any insight into fixing this although it is also possible that it is more of a pdfminer issue...
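
To make the two hit-test strategies concrete, here is a sketch comparing an "any overlap" test with a stricter "centre inside" test on plain (x0, y0, x1, y1) tuples. Neither function is the actual boxhit code from pdfannots; the names and the tuple layout are assumptions for this example.

```python
def boxes_overlap(a, b):
    """True if rectangles a and b share a region of positive area.

    Each rectangle is (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def centre_inside(item, box):
    """True if the centre point of item lies within box (edges included)."""
    ix0, iy0, ix1, iy1 = item
    bx0, by0, bx1, by1 = box
    cx, cy = (ix0 + ix1) / 2, (iy0 + iy1) / 2
    return bx0 <= cx <= bx1 and by0 <= cy <= by1


annotation = (0, 0, 10, 10)
glyph = (9, 9, 15, 15)  # mostly outside the annotation rectangle
print(boxes_overlap(glyph, annotation), centre_inside(glyph, annotation))
```

The looser test picks up items that only graze the annotation, which matches the better results described above, but it cannot help when the annotation rectangle itself stops short of the last line.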

pwc-tax-guide.pdf

Codec issue

With about half of the PDFs I try to extract from, I get the following error message or a similar one:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 274: ordinal not in range(256)

I suppose an additional UTF-8 specification is needed.

I have already changed "iso8859-15" to "utf-8" in the get_annots method, yet the issue remains.

I work on a recent Ubuntu Distro and python 3.6.

How to start opening the new file?

Suppose the absolute path of the PDF file is:

C:\abc.pdf

There are three ways to execute pdfannots.py:

1. If you're running pdfannots.py from the command prompt, do so as, e.g.,

python pdfannots.py C:\abc.pdf

2. If you want to run it over multiple PDF files, do so as, e.g.,

python pdfannots.py C:\abc.pdf D:\xyz.pdf E:\foo.pdf

3. If you want to hard-code the PDF path, for ease of debugging or to step through the whole code with debugging tools, comment out the INFILE argument in parse_args() and open the file directly in main():

def parse_args():
    p = argparse.ArgumentParser(description=__doc__)

    # p.add_argument("input", metavar="INFILE", type=argparse.FileType("rb"),
                   # help="PDF files to process", nargs='+')

    g = p.add_argument_group('Basic options')
    g.add_argument("-p", "--progress", default=False, action="store_true",
                   help="emit progress information")
    g.add_argument("-o", metavar="OUTFILE", type=argparse.FileType("w"), dest="output",
                   default=sys.stdout, help="output file (default is stdout)")
    g.add_argument("-n", "--cols", default=2, type=int, metavar="COLS", dest="cols",
                   help="number of columns per page in the document (default: 2)")

    g = p.add_argument_group('Options controlling output format')
    allsects = ["highlights", "comments", "nits"]
    g.add_argument("-s", "--sections", metavar="SEC", nargs="*",
                   choices=allsects, default=allsects,
                   help=("sections to emit (default: %s)" % ', '.join(allsects)))
    g.add_argument("--no-group", dest="group", default=True, action="store_false",
                   help="emit annotations in order, don't group into sections")
    g.add_argument("--print-filename", dest="printfilename", default=False, action="store_true",
                   help="print the filename when it has annotations")
    g.add_argument("-w", "--wrap", metavar="COLS", type=int,
                   help="wrap text at this many output columns")

    return p.parse_args()



def main():
    args = parse_args()

    global COLUMNS_PER_PAGE
    COLUMNS_PER_PAGE = args.cols

    # Hard-coded input path for debugging; indentation must match the rest of the loop body.
    for file in [r"C:\abc.pdf"]:
        file = open(file, 'rb')
        (annots, outlines) = process_file(file, args.progress)

        pp = PrettyPrinter(outlines, args.wrap)

        if args.printfilename and annots:
            print("# File: '%s'\n" % file.name)

        if args.group:
            pp.printall_grouped(args.sections, annots, args.output)
        else:
            pp.printall(annots, args.output)

    return 0

Credit to the OP from SO

Scan with OCR: words not split

With this PDF-file the words are not split. It's an OCR-scan. I tried modifying the word_margin in LAParams to no avail. When exporting the highlights using PDF Expert (my macOS-PDF Reader) it works fine though: here's the expected output.

Any thoughts?

Best regards

Windows does not use utf-8 output encoding by default

In cli.py, line 147:
args.output.write(line)

I replaced args.output.write(line) with
with open(args.output.name, "a", encoding="utf-8") as f:
    f.write(line)

My PDFs have a lot of diacritics and with this change I was able to get the correct output.

Support "FreeText" annotations

The text annotation tool in the PDF Expert app creates annotations of subtype "FreeText". pdfannots.py doesn't support these, it complains "WARNING: Unsupported FreeText annotation ignored on page 1". It could support these annotations just like "Text" (sticky note) annotations.