atlanhq / camelot

Camelot: PDF Table Extraction for Humans

Home Page: https://camelot-py.readthedocs.io

License: Other

Python 99.67% Makefile 0.33%

pdf table extract for-humans

camelot's Introduction

Camelot: PDF Table Extraction for Humans

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

Note: You can also check out Excalibur, which is a web interface for Camelot!

Here's how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_sqlite
>>> tables[0].df # get a pandas DataFrame!

Cycle Name	KI (1/km)	Distance (mi)	Percent Fuel Savings
			Improved Speed	Decreased Accel	Eliminate Stops	Decreased Idle
2012_2	3.30	1.3	5.9%	9.5%	29.2%	17.4%
2145_1	0.68	11.2	2.4%	0.1%	9.5%	2.7%
4234_1	0.59	58.7	8.5%	1.3%	8.5%	3.3%
2032_2	0.17	57.8	21.7%	0.3%	2.7%	1.2%
4171_1	0.07	173.9	58.1%	1.6%	2.1%	0.5%

There's a command-line interface too!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Why Camelot?

You are in control: Unlike other libraries and tools, which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
Export to multiple formats, including JSON, Excel, HTML and SQLite.
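
As a rough illustration of the metric-based filtering described above, the sketch below keeps only the tables whose parsing report clears chosen thresholds. The 'accuracy' and 'whitespace' keys come from the parsing_report shown earlier; the helper name keep_good_tables and the threshold values are assumptions made for this example.

```python
def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Return the tables whose parsing report passes both thresholds.

    `tables` is any iterable of objects exposing a `parsing_report` dict
    with 'accuracy' and 'whitespace' keys (as in the example above).
    The function name and default thresholds are hypothetical.
    """
    good = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            good.append(table)
    return good
```

With Camelot installed, this could be applied to the result of camelot.read_pdf('foo.pdf') before exporting anything.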

See comparison with other PDF table extraction libraries and tools.

Installation

Using conda

The easiest way to install Camelot is to install it with conda, which is a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:

$ pip install "camelot-py[cv]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[cv]"

Documentation

Great documentation is available at http://camelot-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check the latest sources with:

$ git clone https://www.github.com/camelot-dev/camelot

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install "camelot-py[dev]"

Testing

After installation, you can run tests using:

$ python setup.py test

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License; see the LICENSE file for details.

camelot's People

vinayak-mehta christinegarcia vaibhavmule gridl derek-zl ideaplexus gitking shafiahmed dgreyling nofeetbird0321 vigaeatery latticeai dtsukiyama thecooltechguy valrcs adolfoeliazat rowhitswami 42b rockystevejobs m8e tumkit15 trendingtechnology dipakkr number0 tanmay4osme jmuskaan72 shelltips yurybolkonsky ares7 guptam tarsbase miraj-tariq devhttps hehuanshu96 hhy5277 eamanu kyoungrok0517 lishiji1992 shaunstanislauslau ozandogrultan yueguangguang jainkshitij jonathanlloyd jamwine icqw1983 zischwartz cxz makarevichy rbares lisa718 suyashb95 roysh claudiopinheiro surexdirect mokacao alex-code4okc sbhttcha jplattel sergiocorreia arlm-attic pecey diemesleno ssekuwanda lgf124 danish-mehmood cuphan gitouyou sainideepak abhayjb machdata sravyaysk paulobh 99ddzc migsr22 jaynoel wangchuan2008888 shinroo asifkhan69 gjwhw erodxd jeffrey98-ai haginile akshayjh jackblackpearl project-renard-survey fireae saurabharch ravedata liuyuhua-ha awesome-python shahidash davidkong0987 yatintaluja hckzwf mchaitanyak35 vivekpd15 todatamining mugurd macicco bhumilharia

camelot's Issues

Don't merge columns when a negative tolerance is given

Modify _merge_columns to account for a negative tolerance parameter and not merge x-axis column projections if they are within that tolerance.
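
One possible reading of this, as a sketch only: treat each column projection as an (x0, x1) interval and merge neighbours only when the tolerance allows it. The function merge_columns below is a hypothetical stand-in, not Camelot's _merge_columns. With a negative tol, two intervals must overlap by at least abs(tol) before they are merged.

```python
def merge_columns(intervals, tol=0.0):
    """Merge (x0, x1) intervals that overlap or lie within `tol` of each other.

    A negative `tol` makes merging stricter: two intervals are merged only
    if they overlap by at least abs(tol). Hypothetical sketch, not
    Camelot's actual `_merge_columns`.
    """
    merged = []
    for x0, x1 in sorted(intervals):
        if merged and x0 <= merged[-1][1] + tol:
            # Extend the previous column to cover this projection.
            merged[-1] = (merged[-1][0], max(merged[-1][1], x1))
        else:
            merged.append((x0, x1))
    return merged
```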

Handle /tmp folder deletion

Do it in pdf.py itself.

Add option to get output as html/json/tsv

Would be useful.

Preserve table captions and other relevant info

Right now, it only parses text that is present within a table's boundary. Important information like table captions and footnotes should also be parsed.

Take care of super and subscripts

See the pdfminer code and find a way to handle super and subscripts.

troughs vs. text for stream

There are two ways of arriving at the final grid for the Stream method: analyse the text to get the vertices, or analyse the blank spaces (troughs) to get the vertices.
This will be parameterised; by default, the Stream method should use both to arrive at a final set of vertices.

Add logic for unorthodox spanning cells

We should discuss more of these cases.

Metrics to measure the quality of a parse

Extend vertical text detection to cells in Lattice

To classify a table as vertical, Lattice needs at least 80% of its LTChars to be vertical. This could be removed and extended to cells to solve these kinds of cases.

pagination for debug screen viewing in bulk

We do not want the screen to get blocked.

longterm

Intelligence regarding line width parameterisation - for better splits in lattice

Not all lines are useful. PDFs are often divided into sections with different line widths, so a parameter is needed for tuning the line width at which something is considered a line.

Improve image generation

Currently, poor image quality sometimes causes extra lines to be detected, which leads to more or fewer joints contributing to table building. To deal with extra joints, a filter based on line length can be added. To deal with missing joints, a better-quality image could be produced by tweaking Wand parameters, or heuristics like merging close contours could be used.
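
The line-length filter mentioned above could look roughly like the following sketch. Lines are assumed to be (x1, y1, x2, y2) tuples in image coordinates, and min_length is a hypothetical parameter; neither is taken from Camelot's code.

```python
import math


def filter_short_lines(lines, min_length):
    """Drop line segments shorter than `min_length` (in pixels).

    Each line is an (x1, y1, x2, y2) tuple. Hypothetical helper.
    """
    kept = []
    for x1, y1, x2, y2 in lines:
        if math.hypot(x2 - x1, y2 - y1) >= min_length:
            kept.append((x1, y1, x2, y2))
    return kept
```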

Improve API and CLI

This thread is for discussion about the new interface.

Make UI to modify detected lines, joints, contours

Something like tabula-table-editor (code: tabulapdf/tabula-table-editor) could be made using matplotlib widgets and event handling.

Fix testing framework

This thread is for discussion around improving tests.

Indicate the presence of lines explicitly in output

In case of tables where there are insufficient lines to demarcate all cells, the relatively smaller lines should be shown in the output as rows or columns of cells filled with the word line, just to help out someone in post processing.

Tolerance parameter for overlap and its cascading effect

Will be used for generating better grid coordinates for stream. Happens within the set of rows with mode number of columns and the overlap will lead to increase or decrease in the number of columns which stream guesses. We don't go beyond this 1-pass in the recursion.

Can initializing Stream column generation be improved?

Currently, the list of x-axis projections of text objects is sorted in (-y, x) style. These projections are then sequentially merged based on overlaps. This may fail when the projection of the last text object in a column extends into the projection of the first text object in the next column, as the two columns will be merged into one. A better way to do this would be to group text objects into columns based on their list index and only merge their boundaries, though more discussion is required.

Add option to specify table bbox for Stream

One way would be to give a dict on page numbers with list of table_bbox, subsequently changing how ncolumns and columns are passed to Stream.

Add option to return page numbers along with row/column distribution

Vertical tables should be parsed correctly

There can be many variations like horizontal table, vertical headers or vertical table, vertical headers etc.

Replace Lattice chars with textlines

When chars are filled into cells after being sorted by their coordinates, 6^th gets filled as th6. Replacing chars with textlines, which are made by grouping chars, would give 6th as output. We need to take care of splitting textlines that span multiple cells, though. Both methods should also have an option to specify whether textlines that span multiple cells should be split.

Add Build/CI

This thread is for discussion about what CI tool to use.

Add option for batch input

For example,

camelot method pdf_dir/

camelot method *.pdf

Initially the files would be processed sequentially, but in the future, support for distributed processing should be implemented.
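
A minimal sketch of how the two invocations above might resolve to a list of files, using only the standard library. The helper name collect_pdfs is hypothetical, and the proposed CLI itself is not implemented here.

```python
from pathlib import Path


def collect_pdfs(target):
    """Return a sorted list of PDF paths for a directory or a glob pattern.

    `target` may be a directory such as 'pdf_dir/' or a pattern such as
    '*.pdf'. The helper name is hypothetical.
    """
    path = Path(target)
    if path.is_dir():
        return sorted(path.glob("*.pdf"))
    # Otherwise treat the last component as a glob pattern inside its parent.
    return sorted(path.parent.glob(path.name))
```

The returned paths could then be processed one after another, for example by passing str(pdf) to camelot.read_pdf in a loop.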

Shift text up in hybrid spanning cells

Shift text up based on the presence of horizontal lines and some metric based on blank rows. If the vertical lines are not present, then Stream-generated columns or user-given separators should be used.

For example:

Catch and log warnings

The UserWarnings raised in Lattice and Stream are not logged into a file right now.
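
One way this could be done with the standard library, as a sketch: logging.captureWarnings redirects everything issued through the warnings module to the 'py.warnings' logger, which can then be given a file handler. The function name and the log format are assumptions for this example.

```python
import logging
import warnings


def log_warnings_to_file(path):
    """Route warnings issued via the `warnings` module into a log file.

    Returns the file handler so the caller can flush or close it.
    """
    # After this call, warnings are sent to the 'py.warnings' logger.
    logging.captureWarnings(True)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logging.getLogger("py.warnings").addHandler(handler)
    return handler
```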

Add milestones to the issues

To keep track of which things are needed most immediately and which can be added iteratively.

Port to Python 3

Have to figure out if the following dependencies work as expected, in Python 3:

https://github.com/pdfminer/pdfminer.six (unofficial port)
OpenCV

Thinking of starting with using things from __future__ wherever possible (mostly print, division and unicode_literals), and create a release for Python 2. That should make it easy for us to create a Python 3 release once the deps are sorted.

Fix magic grid extension

In Stream, the table boundary is taken as (0, width) and (0, height) where width and height are PDF dimensions. This should be changed to (x0 of first text object, x1 of last text object on x-axis) and (y0 of first text object, y1 of last text object on y-axis) respectively because logic.
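
A sketch of the proposed boundary calculation, assuming each text object exposes x0, y0, x1 and y1 attributes as pdfminer layout objects do. It takes the minimum and maximum over all objects rather than only the first and last, which gives the same result when the objects are sorted; the helper name is hypothetical.

```python
def text_bbox(text_objects):
    """Return (x0, y0, x1, y1) enclosing all text objects.

    Each object is assumed to expose x0, y0, x1, y1 attributes.
    Hypothetical helper, not part of Camelot.
    """
    objs = list(text_objects)
    if not objs:
        raise ValueError("no text objects on the page")
    return (
        min(t.x0 for t in objs),
        min(t.y0 for t in objs),
        max(t.x1 for t in objs),
        max(t.y1 for t in objs),
    )
```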

Improve logging

Current logs for such cases are of the form "WARNING:root:Text did not fit any column." and don't provide any details.

PDFMiner sanity check

Check if the LTObjects hierarchy is changed when modifying margins. Also, see how PDFMiner generates newlines and spaces.

Informative output log

The output log should contain more information like the number of tables found, display any tables that couldn't be parsed etc.

Replace imagemagick with ghostscript

imagemagick uses ghostscript for PDF->PNG conversion. However, it calls gs twice, first for PDF->PS and then for PS->PNG. According to this SO answer, quality is lost in the first step. gs can do PDF->PNG in one go and with better quality. (Checked this on some PDFs in which more joints were being detected, #51.)

Improve documentation

This thread is for discussion around documentation, what to add/improve, where to host etc.

Multi-page tables

There should be a way to merge tables which span multiple pages.

Needs documentation

Documentation explaining the working, installation instructions, etc. should exist. The README should also be improved.

Process partial grid using lattice itself

A partial grid can be processed using the Lattice method itself if the grid is generated using the Hough transform, but this will need some version of a bounding box to be specified; that calculation is already done using contours.
The page will need to be divided before running the Hough transform.

Page splitting is very slow for some PDFs

The function that checks for page rotation is the culprit. pdfminer's layout analysis takes a long time for such pdfs. Examples: the RNTB pdfs from un-sdg.

Process pages in chunks

When loading a lot of pages (~42000) at once, memory fills up quickly and parsing stalls. One way to solve this would be to process pages in chunks (think of generators), or maybe to use multiprocessing. @sharky93 Would you like to add something?
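
The generator idea could be sketched as follows. page_chunks is a hypothetical helper that lazily yields page-range strings, so that only one chunk of pages needs to be parsed and held in memory at a time; how each chunk is actually handed to the parser is left open.

```python
def page_chunks(n_pages, chunk_size):
    """Yield page-range strings such as '1-50', '51-100', ... lazily.

    Pages are 1-indexed and each range is inclusive. Hypothetical helper.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(1, n_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, n_pages)
        yield f"{start}-{end}"
```

In Camelot releases whose read_pdf accepts a pages string, each yielded range could be passed as that argument, with the results exported before moving on to the next chunk.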

Extend lattice algorithm to work on image-based pdfs

Will need to explore python-based tesseract libraries for OCR.

Make stats more usable

Use-cases:

Help the user drop tables in an ETL workflow based on parsing accuracy, whitespace in table cells.

Tolerance parameter for assigning text object into cell for Stream

Currently, LTTextLH are assigned to cells based on their x0 coord. Additionally, the area overlap between the LTTextLH and cell could be used for better assignment. A tolerance parameter for minimum area overlap required for assignment into a cell can be added. For example (see image), it makes sense to add the third text box to column 2 instead of 1. @sharky93
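
A sketch of the area-overlap idea, assuming text boxes and cells are both plain (x0, y0, x1, y1) tuples. The names overlap_fraction, assign_cell and min_overlap are hypothetical; min_overlap plays the role of the proposed tolerance parameter.

```python
def overlap_fraction(text, cell):
    """Fraction of the text box's area that lies inside the cell.

    Both arguments are (x0, y0, x1, y1) tuples.
    """
    w = min(text[2], cell[2]) - max(text[0], cell[0])
    h = min(text[3], cell[3]) - max(text[1], cell[1])
    area = (text[2] - text[0]) * (text[3] - text[1])
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def assign_cell(text, cells, min_overlap=0.5):
    """Return the index of the best-overlapping cell, or None.

    `min_overlap` is the hypothetical tolerance: the minimum fraction of
    the text box that must fall inside a cell for it to be assigned.
    """
    best_index, best_fraction = None, 0.0
    for i, cell in enumerate(cells):
        fraction = overlap_fraction(text, cell)
        if fraction > best_fraction:
            best_index, best_fraction = i, fraction
    return best_index if best_fraction >= min_overlap else None
```

A text box whose x0 falls in the first column but whose area lies mostly in the second is assigned to the second column under this rule, which is the behaviour the issue asks for.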

idea of stats or informative log is separate from the parsing score that is generated

For example, imagine a PDF that is mostly text, with a small table that is of interest. The score is affected only by how the table is extracted, whereas it would also be good to have stats on which LTText objects were ignored, and so on.

Deprecate joint tolerance in Lattice

jtol is used to take care of any errors that might arise while converting from image coordinate space to the pdf coordinate space, converting line contours to lines. There is no need for it to be user configurable.

Update to the stream algorithm regarding left and right extreme grid points

Modifications to the grid are needed after the mode calculation, for rows where data lies outside the grid formed using the mode number of columns.

Add option to modify the ToUnicode map

To be updated.

What license to choose?

This thread is for discussion around licenses.

Autodetect bounding box for Stream

This will help in cases where a pdf page has two or more tables with a box outline, but without internal lines to demarcate cells. Need to think on how to integrate find_table_contours from imgproc with Stream column generation.

Add verbose option and log to file

Find some way to write the log buffer to a file if log is True.

Configuration to guide the morph transform

In cases where the user knows which section of the PDF houses the interesting things (mostly tables here), a simple configuration option to guide the image processing will reduce run times further.
For example, when half of the PDF is blank, there is no need to process those pixels.
