PDF table scraper
-----------------

This script attempts to extract tabular data from a PDF file.

It treats every page of the PDF as a table and tries to make sense of it.
The output should be much easier to parse and reasonably clean, but the
results still need to be checked manually.

It can currently export the data as .html (for visualization), as .csv, or as
a Python pickle, for reuse in another script.
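
As a minimal sketch of reusing the exported data in another script, the snippet below reads a CSV file and a pickle file back into Python. The helper names `load_csv_rows` and `load_pickle` are hypothetical, and the UTF-8 encoding is an assumption. The structure of the pickled object is not documented here, so the code only loads and returns it. Only unpickle files you produced yourself, since `pickle.load` can execute arbitrary code.

```python
import csv
import pickle


def load_csv_rows(path):
    """Return the CSV file as a list of rows, each row a list of cell strings."""
    # UTF-8 is an assumption; adjust if the exported file uses another encoding.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def load_pickle(path):
    """Return whatever object was pickled; its structure is not assumed here."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

For example, `load_csv_rows("table.csv")` would return the rows of a file written with `--csv table.csv`, where `table.csv` is a placeholder name.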

    ~/pdf_table_scraper$ ./pdf_table_scraper.py -h
    usage: pdf_table_scraper.py [-h] [--vskip VSKIP] [--page PAGE] [--html HTML]
                                [--csv CSV] [--pickle PICKLE] [--tmp_xml TMP_XML]
                                [-v]
                                filename

    Extracts a table from a .pdf file

    positional arguments:
      filename           the .pdf file

    optional arguments:
      -h, --help         show this help message and exit
      --vskip VSKIP      max vertical space between consecutive lines in the same
                         paragraph (usually ~8)
      --page PAGE        run the script on a specific page
      --html HTML        A filename for html output
      --csv CSV          A filename for csv output
      --pickle PICKLE    A filename for Python .pickle output
      --tmp_xml TMP_XML  A temporary XML file (output of pdftohtml)
      -v                 Increase the verbosity level
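
The options listed in the help text can also be assembled from another Python script. The sketch below builds the argument list and passes it to `subprocess.run`. The helper names `build_command` and `scrape` and the example file names are hypothetical; the flags are the ones shown in the help output. It assumes the script is executable and located at `./pdf_table_scraper.py`.

```python
import subprocess


def build_command(pdf, page=None, vskip=None, html=None, csv_path=None,
                  pickle_path=None, script="./pdf_table_scraper.py"):
    """Build the argument list using only flags shown in the help text."""
    command = [script]
    options = [("--vskip", vskip), ("--page", page), ("--html", html),
               ("--csv", csv_path), ("--pickle", pickle_path)]
    for flag, value in options:
        # Skip any option the caller did not set.
        if value is not None:
            command += [flag, str(value)]
    # The PDF filename is the positional argument and goes last.
    command.append(pdf)
    return command


def scrape(pdf, **options):
    """Run the scraper and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(pdf, **options), check=True)
```

With these helpers, `scrape("report.pdf", page=3, csv_path="page3.csv")` runs `./pdf_table_scraper.py --page 3 --csv page3.csv report.pdf`.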

Contributors: nathanncohen
