Coder Social home page Coder Social logo

pd3f.com's Introduction

pd3f

Experimental, use with care.

pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula. It's built upon the output of Parsr. Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, i.e., due to hyphens. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, new lines and / or spaces. It uses language models to guess how the original text looked like.

pd3f is especially useful for languages with long words such as German. It was mainly developed to parse German letters and official documents. Besides German pd3f supports English, Spanish, French and Italian. More languages will be added a later stage.

pd3f includes a Web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.

Documentation

Check out the full Documentation at: https://pd3f.com/docs/

Future Work / TODO

PDFs are hard to process and it's hard to extract information. So the results of this tool may not satisfy you. There will be more work to improve this software but altogether, it's unlikely that it will successfully extract all the information anytime soon.

Here some things that will get improved.

statics about how long processing (per page) took in the past

  • calculate runtime based on job.started_at and job.ended_at
  • Get average runtime of jobs and store data in redis list

more information about PDF

  • NER
  • entity linking
  • extract keywords
  • use textacy

add more language

  • check if flair has model
  • what to do if there is no fast model?

Python client

  • simple client based on request
  • send whole folders

Markdown / HTML export

  • go beyond text

use pdf-scripts / allow more processing

  • reduce size
  • repair PDF
  • detect if scanned
  • force to OCR again

improve logs / get better feedback

  • show uncertainty of ML model
  • allow different log levels

Related Work

Development

Install and use poetry.

Initially run:

./dev.sh --build

Omit --build if the Docker images do not need to get build. Right now Docker + poetry is not able to cache the installs so building the image all the time is uncool.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcomed when they fix bugs or improve the code quality.

License

Affero General Public License 3.0

pd3f.com's People

Contributors

gcushen avatar jfilter avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.