pdftitle's Introduction

pdftitle

pdftitle is a small utility to extract the title of a PDF article.

If you have PDF articles whose filenames say nothing about their content, you can use this utility to extract the title and, optionally, rename the files. The utility does not look at the metadata of a PDF file, because the title in the metadata can be empty. It works for ~80% of the PDFs I have and is especially suited to PDF files of scientific articles.

pdftitle uses the pdfminer.six project to parse PDF documents, with its own implementation of the PDF device and PDF interpreter. The names of the variables and the calculations in the source code are very similar to how they are given in the PDF spec (http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf).

Installation

pip install pdftitle

Usage

pdftitle -p <pdf-file> returns the title of the document if found.

$ pdftitle -p knuth65.pdf 
On the Translation of Languages from Left to Right

pdftitle -p <pdf-file> -c renames the document file to the title of the document, if found, removing the non-ASCII characters. This command prints the new file name.

$ pdftitle -p knuth65.pdf -c
on_the_translation_of_languages_from_left_to_right.pdf
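
The renaming shown above can be approximated with a few lines of Python. This is only a sketch of one way to get from a title to an ASCII-only file name; title_to_filename is a hypothetical helper, not the function pdftitle uses, and the exact rules pdftitle applies (for punctuation, for example) may differ.

```python
import re
import unicodedata


def title_to_filename(title: str, extension: str = ".pdf") -> str:
    """Hypothetical helper: turn a title into an ASCII-only file name."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then collapse every run of non-alphanumerics into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_title.lower()).strip("_")
    return slug + extension


print(title_to_filename("On the Translation of Languages from Left to Right"))
```

For the Knuth title this prints the same name as the example above; other titles may come out differently from what pdftitle -c produces.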

(Much) more info can be seen in verbose mode with -v.

The program follows this procedure:

  1. Look into every text object in the first page of a PDF document

  2. If the font and font size are the same in consecutive text objects, group their content as one

  3. Apply the algorithm, see below.
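
Step 2 of the procedure above can be sketched as follows. The TextObject class and group_text_objects function are hypothetical names introduced for illustration; pdftitle's internal data structures are different, and joining with a single space is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class TextObject:
    """Hypothetical stand-in for a text object found on the first page."""
    font: str
    size: float
    text: str


def group_text_objects(objects):
    """Merge runs of consecutive objects that share the same font and size."""
    blocks = []
    # groupby only merges neighbours, so document order is preserved.
    for (font, size), run in groupby(objects, key=lambda o: (o.font, o.size)):
        blocks.append(TextObject(font, size, " ".join(o.text for o in run)))
    return blocks


objects = [
    TextObject("Times-Bold", 18.0, "On the"),
    TextObject("Times-Bold", 18.0, "Translation"),
    TextObject("Times-Roman", 10.0, "D. Knuth"),
]
for block in group_text_objects(objects):
    print(block.size, block.text)
```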

The assumption is that the title of the document is probably the text with the largest (or second largest, etc.) font size on the first page and the one closest to the top of the page.

One problem is that not all documents use a space character between words, so it is difficult to find word boundaries when no space is used. There is a recovery procedure for this, which may work.

A PDF may contain a character that does not exist in the font. In that case you receive an exception, and you can use the --replace-missing-char option to work around the issue.

Sometimes the found title has strange casing (for example, the first letter is lowercase but the last is uppercase). This can be corrected with the -t option.
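
According to the change log for 0.8, the -t option relies on Python's title method. The sketch below shows what str.title does to a string on its own; it does not call pdftitle, and the sample strings are made up.

```python
def to_title_case(text: str) -> str:
    """Apply Python's built-in title casing to a string."""
    return text.title()


print(to_title_case("oN tHE tRANSLATION oF lANGUAGES"))
# str.title treats an apostrophe as a word boundary, so the letter
# that follows it is capitalised as well.
print(to_title_case("knuth's algorithm"))
```

The second call illustrates a known limitation of str.title: the letter after an apostrophe is also uppercased.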

Algorithms

There are three algorithms at the moment:

  • original: finds the maximum font size, then finds the topmost (minimum Y) blocks with this font size and joins them.
  • max2: finds the maximum font size, adds the block with the maximum font size, then the block with the second largest size, and continues adding blocks of either size until a block with a different font size is found. The block order is the natural order in the PDF; no x-y sorting is performed.
  • eliot: similar to original, but can merge blocks with an arbitrary number of font sizes, ordered by size. The block order is y first, then x. The font sizes to use are given with the --eliot-tfs option as indices into the font sizes sorted from largest to smallest, so --eliot-tfs 0,1 means the largest and the second largest fonts.

Algorithms are selected with the -a option.
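
A minimal sketch of the idea behind the original algorithm is given here. Block and pick_title_original are hypothetical names, not pdftitle's own code. The sketch assumes PDF user space with the origin at the bottom-left corner, so "topmost" is taken to mean the largest y value; if your coordinates grow downward, replace max with min.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: font size, position and content."""
    font_size: float
    x: float
    y: float
    text: str


def pick_title_original(blocks):
    """Join the topmost blocks that use the largest font size."""
    if not blocks:
        return None
    max_size = max(b.font_size for b in blocks)
    largest = [b for b in blocks if b.font_size == max_size]
    # Assumption: bottom-left origin, so the top of the page has the largest y.
    top_y = max(b.y for b in largest)
    return " ".join(b.text for b in largest if b.y == top_y)


blocks = [
    Block(10.0, 50.0, 700.0, "Author Name"),
    Block(18.0, 50.0, 750.0, "On the"),
    Block(18.0, 200.0, 750.0, "Translation"),
    Block(18.0, 50.0, 100.0, "Large footer text"),
]
print(pick_title_original(blocks))
```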

Changes

0.11:

  • functionally same as 0.10, including some pylint fixes.

0.10:

  • --page-number argument added. Related issue is here.
  • potentially a fix implemented for some files having non-zero Trm[1] and Trm[2] elements. This change might cause different outputs than previous versions of pdftitle. This is related to the issue raised here.
  • verbose and error messages improved.
  • pdfminer version updated.

0.9:

  • retrieve_spaces function is made non-recursive.
  • eliot algorithm is implemented for this issue, test file is woo2019.pdf
  • eliot-tfs option is implemented for eliot algorithm.
  • stack trace was printed only in verbose mode, this behavior is changed and now stack trace is printed always if there is an error.

0.8:

  • make the title like title case (-t) using Python title method.
  • pdfminer version updated.
  • algorithm flag (-a). default is the original algorithm so no change.
  • max2 algorithm is implemented for this issue, test file is paran2010.pdf.

0.7:

  • changes and fixes for pylint based on Jakob Guldberg Aaes's recommendation.
  • no functional changes.

0.6:

  • rename file name to title (-c). Contributed by Tommy Odland.
  • pdfminer version updated.

0.5:

  • fixed install problem with 0.4
  • pdfminer version updated.

0.4:

  • Merged #e4bb0d6 to detect and remove duplicate spaces in the returned title. Contributed by Jakob Guldberg Aaes (https://github.com/jakob1379).

0.3:

0.2:

  • changed version string to major.minor format.
  • pdftitle can be used as a library for a project, use get_title_from_io method
  • added chardet as a dependency
  • algorithm is changed but there are problems with finding the word boundaries

pdftitle's People

Contributors

fabien-couthouis, jakob1379, metebalci, spiritus44, tommyod

pdftitle's Issues

Title from incomplete file

Is it somehow possible to get a title from an incomplete file? This currently doesn't work. I assume the title is toward the top of a file, so an entire file shouldn't really be necessary. This will prevent needless full downloads of large files.

Algorithm "eliot" implemented incorrectly

In the implementation of the "eliot" algorithm, the y coordinates are sorted low-to-high:

pdftitle/pdftitle.py

Lines 543 to 548 in 5ebc1a0

# sort the selected blocks, put y min first, then x min if y min is
# same
# 1000000 is a magic number here, assuming no x value is greater
# than that
selected_blocks = sorted(selected_blocks,
                         key=lambda b:b[3]*1000000 + b[2])

Since the origin of a pdf is the bottom-left corner, the y coordinates should be sorted high-to-low, as in the implementation of the "original" algorithm:

pdftitle/pdftitle.py

Lines 492 to 494 in 5ebc1a0

# find the one with the highest y coordinate
# this is the most close to top
max_y = max(max_blocks, key=lambda x: x[3])[3]

Sorting could be done as:

selected_blocks = sorted(selected_blocks, key=lambda b: (-b[3], b[2]))

since tuples follow lexicographical ordering (see here)
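
To illustrate the proposed key, here is a small self-contained example. The tuple layout is an assumption made for the demo: only index 2 (x) and index 3 (y) mirror the snippets quoted above; the other fields and the sample values are made up.

```python
# Hypothetical blocks: (text, font_size, x, y), with the origin at the
# bottom-left corner, so a larger y is closer to the top of the page.
blocks = [
    ("second line", 18.0, 72.0, 700.0),
    ("right half", 18.0, 300.0, 740.0),
    ("left half", 18.0, 72.0, 740.0),
]

# Tuples compare element by element: highest y first, then lowest x.
ordered = sorted(blocks, key=lambda b: (-b[3], b[2]))
reading_order = [b[0] for b in ordered]
print(reading_order)
```

With a tuple key there is no need for the 1000000 multiplier, and the result does not depend on any bound on x.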

improve space detection and remove pdfminer high level code

Text in a PDF file might not contain a space character; instead, the space might be indicated by an additional horizontal offset between the last glyph of one word and the first glyph of the next. pdfminer has high-level code that detects this, i.e. it checks whether the gap between characters is greater than a certain threshold (possibly specified in the font file). It would be better to do this manually and also to insert a space when the vertical position changes (a title spanning more than one line). When this is done, I think the get_title_from_io method can be simplified by removing the TextConverter and PDFPageInterpreter related parts.
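
The gap-based detection described in this issue could look roughly like the sketch below. Everything here is an assumption for illustration: the glyph tuple layout, the join_glyphs name, and the threshold of a quarter of the font size are not taken from pdftitle or pdfminer, and the sketch handles a single line only.

```python
def join_glyphs(glyphs, gap_ratio=0.25):
    """Rebuild a line of text, inserting a space where the horizontal gap is wide.

    glyphs: iterable of (char, x_start, x_end, font_size) in reading order.
    gap_ratio: hypothetical threshold, as a fraction of the font size.
    """
    out = []
    prev_end = None
    for char, x_start, x_end, font_size in glyphs:
        # A gap clearly wider than normal letter spacing is treated as a space.
        if prev_end is not None and x_start - prev_end > gap_ratio * font_size:
            out.append(" ")
        out.append(char)
        prev_end = x_end
    return "".join(out)


line = [
    ("O", 0.0, 7.0, 10.0),
    ("n", 7.0, 12.0, 10.0),
    ("t", 16.0, 19.0, 10.0),  # starts 4 units after "n" ends
    ("h", 19.0, 24.0, 10.0),
    ("e", 24.0, 28.0, 10.0),
]
print(join_glyphs(line))
```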

Exception thrown

Here is a pdf where the extraction fails: 10.1.1.160.2604.pdf

Traceback (most recent call last):
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 974, in to_unichr
return self.cid2unicode[cid]
KeyError: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 404, in draw_cid
unichar = ts.Tf.to_unichr(cid)
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 976, in to_unichr
raise PDFUnicodeNotDefined(None, cid)
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 2)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 701, in run
title = get_title_from_file(args.pdf)
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 581, in get_title_from_file
return get_title_from_io(raw_file)
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 462, in get_title_from_io
interpreter.process_page(page)
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 991, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
self.execute(list_value(streams))
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 1036, in execute
func(*args)
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 292, in do_Tj
self.do_TJ([s])
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 324, in do_TJ
self.device.process_string(self.mpts, seq)
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 378, in process_string
self.draw_cid(ts, cid)
File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 410, in draw_cid
"exist in the font") from unicode_not_defined
Exception: PDF contains a unicode char that does not exist in the font

Failed to identify title from JMF

Hello metebalci,
I am not able to use pdftitle -p PDF to extract the title of scientific articles from the Journal of Medicinal Food.

For example, this file does not produce a title:
woo2019.pdf

Is it possible to adjust the algorithm a bit for this kind of article?

I have tried the new option pdftitle -a max2 -p PDF without success. I do not see a list of parameters that can be passed to -a in the readme, so to the best of my knowledge, from reading this GitHub repository, there are only the options -a max2 and -a default. If there are others, please note that I have not tried them.

Thank you!

Give suggestion to use `--replace-missing-char` in the error message?

I think this would help the user to discover the option. Currently the error message just says "PDF contains a unicode char that does not exist in the font"; maybe add "or pass --replace-missing-char=" " to forcefully ignore this error" there.

Alternatively, it's also possible to replace the missing character with some dummy character and, if the extracted title does not contain any dummy character, silently ignore the error, since there's no harm.

Making replace-missing-char the default is also an option, but I think it's a conscious design choice not to do so. In my case, though, I'd overwhelmingly want to use that option.

Can't set parameters from API

I might have overlooked something, but it seems there is no way to adjust the parameters from API calls, e.g. you can't call get_title_from_file(path, algo='max2').

New PyPI release

@metebalci Please make a new PyPI release. There are fixes that exist in the repo but don't exist in the release, preventing reasonable use of the package. I look forward to using the package, but it needs work.

Error installing with pip

I get the following error when running pip install pdftitle:

Collecting pdftitle
  Downloading https://files.pythonhosted.org/packages/e8/dc/2cfe5ebeb595d546e488d3586c92c638fb040e835680ea212413ae94293e/pdftitle-0.4.tar.gz
    ERROR: Command errored out with exit status 1:
     command: /home/tommy/anaconda3/envs/py37/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-jbtqfepv/pdftitle/setup.py'"'"'; __file__='"'"'/tmp/pip-install-jbtqfepv/pdftitle/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-jbtqfepv/pdftitle/pip-egg-info
         cwd: /tmp/pip-install-jbtqfepv/pdftitle/
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-jbtqfepv/pdftitle/setup.py", line 7, in <module>
        with open(path.join(here, 'README.md'), encoding='utf-8') as f:
      File "/home/tommy/anaconda3/envs/py37/lib/python3.7/codecs.py", line 904, in open
        file = builtins.open(filename, mode, buffering)
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-jbtqfepv/pdftitle/README.md'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Seems that setup.py wants to get long_description from README.md, but in the pypi source files the README is not available.

Couldn't extract title from a PDF with first page image

❯ pdftitle -p .\Downloads\test.pdf

Traceback (most recent call last):
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 701, in run
    title = get_title_from_file(args.pdf)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 581, in get_title_from_file
    return get_title_from_io(raw_file)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 476, in get_title_from_io
    dev.recover_last_paragraph()
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 341, in recover_last_paragraph
    raise Exception("current block is None, this might be a bug. " +
Exception: current block is None, this might be a bug. please report it together with the pdf file

# Using pdfminer's pdf2txt
➜ pdf2txt .\Downloads\test.pdf

C++/CLI in Action

# Using poppler/xpdf's pdftotext
➜ pdftotext .\Downloads\test.pdf -

C++/CLI in Action

Here is the file: test.pdf

Get title from bytes

Currently, the get_title function allows getting the title from a path to a file. I'd like to be able to get the title from a bytes object instead. It should be easy to allow the code to also do this. Currently I'm having to create my own modified get_title function, duplicating most of its functionality, and this is too hacky. Thanks.
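
One possible workaround, sketched under assumptions: the change log mentions a get_title_from_io method for library use, and if that function accepts any binary file-like object, the bytes can be wrapped in io.BytesIO. The title_from_bytes helper below is hypothetical, and the extractor is passed in as an argument so that the sketch runs without pdftitle installed.

```python
import io


def title_from_bytes(data: bytes, extract):
    """Wrap raw bytes in a file-like object and hand it to `extract`.

    `extract` is any callable that takes a binary file object. With pdftitle
    installed it could be pdftitle.get_title_from_io, assuming that function
    accepts an in-memory stream.
    """
    return extract(io.BytesIO(data))


# Stand-in extractor for the demo: it just reads the first four bytes.
def fake_extract(stream):
    return stream.read(4).decode("ascii")


print(title_from_bytes(b"%PDF-1.7 ...", fake_extract))
```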

Read titles for every page and populate in a table or list

Hi, I was looking for some algorithms to read the pdf title and I came across your project which is doing the job perfectly for the first page.

Could you please add some arguments that select a specific page or extract the whole pdf pages in form of a table or list?

Thanks

raise Exception cause app crash

Is it possible to show an informative error message instead of crashing the application?

Traceback (most recent call last):
  File "/home/zk/.local/lib/python3.9/site-packages/pdftitle.py", line 701, in run
    title = get_title_from_file(args.pdf)
  File "/home/zk/.local/lib/python3.9/site-packages/pdftitle.py", line 581, in get_title_from_file
    return get_title_from_io(raw_file)
  File "/home/zk/.local/lib/python3.9/site-packages/pdftitle.py", line 476, in get_title_from_io
    dev.recover_last_paragraph()
  File "/home/zk/.local/lib/python3.9/site-packages/pdftitle.py", line 341, in recover_last_paragraph
    raise Exception("current block is None, this might be a bug. " +
Exception: current block is None, this might be a bug. please report it together with the pdf file

Large text returned as title

URL: http://www.pnas.org/content/pnas/suppl/2018/10/09/1809045115.DCSupplemental/pnas.1809045115.sapp.pdf

The returned title is:
1 SUPPORTIV E INFORMATION for Bruce N. Ames Perspective SI - 1 - V itamin and Mineral D eficiencies Numerous studies link poor nutrition to a variety of diseases of aging , as shown in the following sampling of recent references (1 - 12) . SI - 2 - Triage T heory Vitamin K (phylloquinone) is necessary for the function of 16 enzymes . A tri age rationing process is support ed by an analysis of the behavior of these enzymes under a mimic of vitamin K sho rtage (13) . Recent studies provide additional support: a Mendelian Randomization (MR) epidemiology study showed that both all - cause and cardiovascular disease ( CVD ) mortality are caused by vitamin K1 inadeq uacy, and confirmed that the low level of the inactive form of Mgp protein, which normally prevents arterial calcification, is diagnostic for vitamin K1 deficiency (14) . I ncreased dietary intake of vitamin K1 and menaquinone (K2=MK2) and other derivatives (such as MK7 in natto) was associated wi th lower all - cause cancer and CVD mortality (15) . A study of 166 adolescents supports the CVD findings by showing that subclinical cardiac stru cture and function variables are most favorable at higher phylloquinone (vitamin K1) levels (16 ) . Selenium i s necessary for the function of 25 enzymes . A triage - related rationing was also shown to be operating in the case of selenium (17) . A 4 - year Randomized Clinical Trial (RCT) (18) of selenium supplemen tation (200 µ g/ d) +CoQ10 (2 00mg /d) significant ly reduced CVD mortality risk by more than 40 % , a nd a lso significant ly reduc ed hypertension, IHD, impaired cardiac function, and diabetes in 443 elderly people in rural Sweden ( where soil is low in selenium) du ring a follow - up time of 1 2 years ; i mprovement in CVD biomarkers, such as echocardiography and natriuretic peptide levels, was also observed. 
SI - 3 - S urvival V /M that are also L ongevity V /M Vitamin D : A meta - analysis of vitamin D versus mortality in 5 Northern European countries (n=~29,000), using subjects of median age 62 years, showed that a blood level of 25(OH)D of less than 12 ng/ml was associated with maximum mortality, while levels between 30 to 40 ng/ml were as sociated with the lowest mortality (19) . Rodent eviden ce also showed that mutations in the vitamin D rec eptor in mice resulted in premature aging (20) . A meta - analysis of 32 studies (n = ~500,000) on vitamin D and all - cause mortality showed th at t he mortality hazard ratio between subjects with the lowest quantile (<9 ng/ml) and those with t he highest (>50 ng/ml ) serum levels of 25(OH)D was 1.9 (p=0.001). Levels of 25(OH)D less than or equal to 30 ng/ml were associated with significantly higher (p < .01 ) all - cause mortality than levels greater than 30 ng/ml (21) . A 12 - year German study of elderly individuals (n=9,579) in a statistically simulated intervention with vitamin D showed a large decrease in all - cause mortality and cancer (22) . A 29 year - long study of 95,000 Danes showed that a decreased plasma level of 25(OH)D was associated with early mortality and an increased risk of ischemic heart disease and myocardial infarction (23) . An MR analysis of this study showed that a low 25(OH)D level was causally associated with all - cause mortality and cancer mortality

[Feature request] author name and title

Hi and thanks for this pkg.
I was wondering whether it would be feasible to add the ability to extract the author's last name as well as the title? For instance, given a PDF file, output something like authorLastNameYear_Title?

"TypeError: 'NoneType' object is not subscriptable" almost every time I make inference

Traceback (most recent call last):
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 669, in run
    title = get_title_from_file(args.pdf)
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 557, in get_title_from_file
    return get_title_from_io(raw_file)
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 452, in get_title_from_io
    dev.recover_last_paragraph()
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 340, in recover_last_paragraph
    if len(self.current_block[4]) > 0:
TypeError: 'NoneType' object is not subscriptable

Break up digraph?

Currently, the program may output digraph for certain PDFs.

For example https://arxiv.org/pdf/1506.02640.pdf .

$ pdftitle -p 1506.02640.pdf 
You Only Look Once: Uniﬁed, Real-Time Object Detection

Note the ﬁ in Uniﬁed.

I think it may be preferable to break it up by default?
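
One way to break up such ligatures is Unicode compatibility normalization from the standard library. This is a sketch of a possible post-processing step, not something pdftitle is known to do; break_ligatures is a hypothetical name. Note that NFKC also rewrites other compatibility characters (superscript digits, for example), which may or may not be wanted in a title.

```python
import unicodedata


def break_ligatures(text: str) -> str:
    """Replace compatibility characters such as the "fi" ligature with plain letters."""
    return unicodedata.normalize("NFKC", text)


# "\ufb01" is the single-character "fi" ligature (LATIN SMALL LIGATURE FI).
raw_title = "You Only Look Once: Uni\ufb01ed, Real-Time Object Detection"
print(break_ligatures(raw_title))
```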

App fails when title text is inside XForm.

The algorithm only considers the case where the title text is at the top level, while in many PDF files the title is actually inside an XForm or a multi-level XForm. Can you improve the app to search all objects on the first page?

Tests and continuous integration (CI) by travis seems to be broken

The link behind the test-badge (i.e. https://www.travis-ci.com/metebalci/pdftitle) leads to a 404 error. I think this is because travis ceased its free service some time ago.

Also, when I run the tests locally on my machine, they fail due to pylint returning a nonzero exit code.
However, the actual tests of the functionality work just fine.

I wanted to propose https://cloud.drone.io. However, unexpectedly, I could not start a build process for the forked repo. Maybe I will find the problem or an alternative solution. Meanwhile, I would propose removing the invalid badge from the README.md file.

Exception: current block is None

For some of my academic papers an exception occurs:
Exception: current block is None, this might be a bug. please report it together with the pdf file

The papers are:

  • "Why Does Social Exclusion Hurt? The Relationship Between Social and Physical Pain" (MacDonald & Leary 2005)
  • "Psychology and Epigenetics" (Masterpasqua 2009)

Remove duplicate spaces when returning title

I found pdftitle, which has proven to be a true gem! From time to time it returns duplicate spaces in the title, which can easily be fixed in two ways:

  1. using a regular expression: just before returning the title in get_title_from_io, add
title = re.sub(' +' , ' ', title)
  2. using string manipulation
title = ' '.join(title.split())

Preferably a check with ' ' in title should be performed to check whether the correction should be executed or not.
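
The two suggestions can be compared side by side. The function names below are made up for the comparison; only the two one-line expressions come from this issue.

```python
import re


def collapse_with_regex(title: str) -> str:
    """Replace every run of spaces with a single space."""
    return re.sub(" +", " ", title)


def collapse_with_split(title: str) -> str:
    """Split on any whitespace and re-join with single spaces."""
    return " ".join(title.split())


sample = "On  the   Translation of Languages"
print(collapse_with_regex(sample))
print(collapse_with_split(sample))
```

Both give the same result for this sample. They differ at the edges: the split-based version also strips leading and trailing whitespace and treats tabs and newlines as separators, while the regular expression only touches runs of the space character.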
