marianna13 / doc2dataset

A tool to extract text (and images) from documents (like PDFs)

License: MIT License

Makefile 1.41% Python 98.59%
big-data dataset document image interleaved multimodal text

doc2dataset's Introduction

doc2dataset

Easily extract text (and images) from a batch of PDF files while preserving the original text formatting.

Install

pip install git+https://github.com/marianna13/doc2dataset.git

Python examples

Check out these examples to use doc2dataset:

API

This module exposes a single function, pdf_extractor, which takes the same arguments as the command line tool:

  • file_list: file (csv, parquet, txt, etc.) containing the paths of the documents (required)
  • output_format: format of the output dataset (default = "files"), can be
    • files, samples saved in a subdirectory for each shard (useful for debugging)
    • webdataset, samples saved in tars (useful for efficient loading)
    • parquet, samples saved in parquet (as bytes)
  • output_folder: desired location of the output dataset (default = "dataset")
  • input_format: format of the input (default = "csv"), can be
    • txt, text file with a url in each line
    • csv, csv file with urls (and captions + metadata)
    • tsv, tsv file with urls (and captions + metadata)
    • parquet, loads urls and metadata as parquet
  • file_col: column in the input (if it has columns) that contains the filename (default = "filename")
  • distributor: whether to use multiprocessing or pyspark (default = "multiprocessing")
  • processes_count: number of parallel processes (default = 1)
  • save_figures: whether to save figures (default = True)
  • min_words_per_page: minimum number of words per page (default = 100)
  • max_images_per_page: maximum number of images per page (default = 5)
  • min_image_size: minimum image size (default = 0)
  • max_image_area: maximum image area (default = None)
  • max_aspect_ratio: maximum aspect ratio (default = None)
  • get_language: whether to detect the language of the text using pycld2 (default = False)
  • remove_digits: whether to remove digits (default = False); enabling this can interfere with images
  • count_words: whether to count words (non-punctuation characters) (default = True)
  • max_pages: maximum number of pages per document (default = None); decreasing this can speed up processing
  • get_drawings: whether to extract SVG images (default = False)
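
As a minimal sketch of how these arguments fit together, the snippet below builds a CSV file list with a "filename" column and collects keyword arguments for pdf_extractor. The helper names (file_list_csv, extractor_kwargs, run_extraction), the example values, and the top-level import path `from doc2dataset import pdf_extractor` are assumptions made for illustration; only the parameter names come from the list above.

```python
import csv
import io


def file_list_csv(paths, file_col="filename"):
    """Return CSV text with a header row (file_col) and one document path per row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([file_col])
    writer.writerows([p] for p in paths)
    return buf.getvalue()


def extractor_kwargs(list_path, output_folder="dataset"):
    """Collect keyword arguments, using only parameter names from the list above."""
    return {
        "file_list": list_path,
        "input_format": "csv",
        "file_col": "filename",
        "output_format": "files",
        "output_folder": output_folder,
        "processes_count": 1,
        "max_pages": 10,  # illustrative value; the documented default is None
    }


def run_extraction(list_path):
    """Call the extractor (hypothetical wrapper, not part of doc2dataset)."""
    # Assumption: pdf_extractor is importable from the package top level.
    from doc2dataset import pdf_extractor

    pdf_extractor(**extractor_kwargs(list_path))
```

To try it, write the text returned by file_list_csv(["a.pdf", "b.pdf"]) to a file such as files.csv and then call run_extraction("files.csv"); the import happens inside the function, so the two helpers above it work without doc2dataset installed.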

Output examples

sample_output.md

For development

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run tests:

pip install -r requirements-test.txt

Then:

make lint
make test

You can use make black to reformat the code.

doc2dataset's Issues

Extract SVG images

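import io
from xml.dom import minidom

import numpy as np
from bs4 import BeautifulSoup
from cairosvg import svg2png
from PIL import Image
from svg.path import parse_path, Line


def get_xy(z):
    # svg.path represents points as complex numbers (not used below)
    return z.real, z.imag


# `page` is assumed to be a PyMuPDF page object obtained elsewhere
svg = page.get_svg_image()
svg_doc = minidom.parseString(svg)

# drop every <use> element
for use in svg_doc.getElementsByTagName('use'):
    use.parentNode.removeChild(use)

# drop every <path> element that has an id attribute
for path in svg_doc.getElementsByTagName('path'):
    if path.getAttribute('id'):
        path.parentNode.removeChild(path)

# page size as floats, stripping the 'pt' unit suffix
svg_root = svg_doc.getElementsByTagName('svg')[0]
w = float(svg_root.getAttribute('width').replace('pt', ''))
h = float(svg_root.getAttribute('height').replace('pt', ''))

svg_str = str(BeautifulSoup(svg_doc.toxml(), 'lxml').find('svg'))

# rasterize what is left to PNG bytes, then load the result as an array
png_bytes = svg2png(file_obj=io.StringIO(svg_str), output_width=w, output_height=h, dpi=90)
img = np.array(Image.open(io.BytesIO(png_bytes)))

# np.where returns (row, column) index arrays, i.e. (y, x)
y, x = np.where(img[:, :, 0] != 0)  # coordinates of SVG images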
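
One possible next step, sketched here as an assumption rather than something the issue specifies, is to collapse the nonzero-pixel coordinates into a single bounding box. The function name nonzero_bbox is hypothetical; it takes one image channel as a 2D array.

```python
import numpy as np


def nonzero_bbox(channel):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if all are zero."""
    # np.nonzero returns (row, column) index arrays, i.e. (y, x)
    ys, xs = np.nonzero(channel)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

For the rasterized page, this could be called as nonzero_bbox(img[:, :, 0]). A single box only makes sense when the page holds one drawing; separating several drawings would need a connected-component step that is not shown here.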
