TessPage

Toolset for Tesseract training with PageXML Ground-Truth

Install & Setup

  1. Clone tesspage:

    $ git clone https://github.com/Jatzelberger/tesspage
    
  2. Install dependencies:

    $ sudo apt install -y tesseract-ocr libtesseract-dev libtool pkg-config make wget bash unzip bc
    $ cd tesspage
    $ pip install -r requirements.txt
    
  3. Setup:

    $ python3 tesspage.py setup
    

Structure

(after setup)

tesspage
├─ README.md
├─ requirements.txt         required pip packages
├─ LICENSE                  license
├─ tesspage                 tesspage files
│  └─ ...
├─ tesstrain¹               tesstrain files
│  ├─ data                  trained model data
│  └─ ...
├─ data
│  ├─ eval                  default dir for evaluation
│  ├─ ground_truth          default dir for ground_truth output
│  ├─ ocr_input             default dir for tesseract image input
│  ├─ ocr_output            default dir for tesseract output
│  ├─ tessconfigs²          tesseract config files
│  ├─ tessdata_best³        training start model
│  └─ training_data         default dir for pagexml input
└─ tesspage.py              entry point

¹ tesstrain, ² tessconfigs, ³ tessdata_best

Usage

Generate Ground-Truth

Copy the PageXML files and their matching image files to ./data/training_data (or a custom folder).

Run the script to generate single-line image files and matching ground-truth .txt files:

python3 tesspage.py generate [--training_data <input_folder>] [--ground_truth <output_folder>]
  • --training_data: input folder containing pagexml and image files [default: ./data/training_data/]
  • --ground_truth: output folder (line image and text files after exec) [default: ./data/ground_truth/]
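
To illustrate what the text side of ground-truth generation involves, here is a minimal sketch that reads the text lines out of a PageXML document using only the Python standard library. It is not tesspage's actual implementation: the function name `extract_lines`, the returned dictionary layout and the inline `SAMPLE` document are assumptions made for this example. Cropping the line images from the page image is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line PageXML document, used only for this example.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0001.png" imageWidth="800" imageHeight="600">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="10,20 300,20 300,60 10,60"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def extract_lines(pagexml: str) -> list:
    """Return id, bounding box and text for every TextLine in a PageXML string."""
    root = ET.fromstring(pagexml)
    # PageXML uses a versioned default namespace; read it from the root tag
    # instead of hard-coding one schema version.
    ns = {"pc": root.tag[1:].split("}")[0]} if root.tag.startswith("{") else {}
    prefix = "pc:" if ns else ""

    lines = []
    for line in root.iterfind(f".//{prefix}TextLine", ns):
        coords = line.find(f"{prefix}Coords", ns)
        # Only the line-level transcription, not word-level TextEquiv children.
        unicode_el = line.find(f"{prefix}TextEquiv/{prefix}Unicode", ns)
        if coords is None or unicode_el is None:
            continue
        points = [
            tuple(int(float(v)) for v in pair.split(","))
            for pair in coords.get("points", "").split()
        ]
        if not points:
            continue
        xs, ys = zip(*points)
        lines.append(
            {
                "id": line.get("id"),
                # Axis-aligned box around the line polygon: (left, top, right, bottom).
                "bbox": (min(xs), min(ys), max(xs), max(ys)),
                "text": (unicode_el.text or "").strip(),
            }
        )
    return lines


for entry in extract_lines(SAMPLE):
    print(entry["id"], entry["bbox"], entry["text"])
```

Each returned entry holds what one ground-truth pair needs: the bounding box could be handed to an image library to crop the line image, and the text could be written to the matching .txt file.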

Train Model

Run the script to train a custom Tesseract model from a base model, using the single-line image files and ground-truth .txt files:

python3 tesspage.py training [--model_name <name>] [--start_model <model>] [--data_dir <folder>] [--ground_truth <folder>] [--tessdata <folder>] [--max_iterations <number>] [ARGS ...]
  • --model_name: name of trained model [default: foo]
  • --start_model: start model, either a previously trained model or a language code (e.g. "eng") from langdata [default: eng]
  • --data_dir: tesstrain data dir [default: ./tesstrain/data/]
  • --ground_truth: ground truth folder (line image and text files) [default: ./data/ground_truth/]
  • --tessdata: training start model folder [default: ./data/tessdata_best/]
  • --max_iterations: training iterations [default: 10000]
  • ARGS: additional tesstrain arguments (see the tesstrain documentation for the full argument list)
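
When training is started from another Python script, the documented options can be assembled into an argument list. The helper below is a hypothetical convenience wrapper, not part of tesspage. It uses only the flags and defaults listed above and assumes the command is run from the repository root.

```python
def build_training_command(
    model_name="foo",
    start_model="eng",
    data_dir="./tesstrain/data/",
    ground_truth="./data/ground_truth/",
    tessdata="./data/tessdata_best/",
    max_iterations=10000,
    extra_args=(),
):
    """Assemble the documented `training` invocation as an argv list."""
    return [
        "python3", "tesspage.py", "training",
        "--model_name", model_name,
        "--start_model", start_model,
        "--data_dir", data_dir,
        "--ground_truth", ground_truth,
        "--tessdata", tessdata,
        "--max_iterations", str(max_iterations),
        *extra_args,  # passed through unchanged as the trailing ARGS
    ]


cmd = build_training_command(model_name="my_model", max_iterations=5000)
print(" ".join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would then start the training run and raise an exception if the script exits with an error.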

Run Tesseract

Run Tesseract OCR with a custom model:

python3 tesspage.py tesseract --model_name <name> [--input <path>] [--output <path>] [--data_dir <folder>] [--config_dir <config_dir>] [--config <config>] [ARGS ...]
  • --model_name: select model, either language or custom trained model
  • --input: input directory or image file
  • --output: output directory
  • --data_dir: tesstrain data dir [default: ./tesstrain/data/]
  • --config_dir: directory containing the output config files [default: ./data/tessconfigs/configs/]
  • --config: config file to be used (txt, pdf, hocr, tsv, pagexml, ...) [default: txt]
  • ARGS: additional Tesseract arguments (see the Tesseract command-line guide)

Evaluate Model

Run the script to evaluate trained models by character error rate (CER) and word error rate (WER):

python3 tesspage.py eval [--eval_input <folder>]
  • --eval_input: supports .txt, .hocr and .xml files [default: ./data/eval/]
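
For readers unfamiliar with the two metrics, the sketch below shows the usual definition: the edit (Levenshtein) distance between reference and prediction, divided by the reference length, counted in characters for CER and in whitespace-separated words for WER. This is a generic illustration. How tesspage normalizes text or aggregates scores over files is not documented here and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate relative to the reference length."""
    return edit_distance(reference, prediction) / max(len(reference), 1)


def wer(reference: str, prediction: str) -> float:
    """Word error rate relative to the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / max(len(ref_words), 1)


print(cer("kitten", "sitting"))                            # 0.5
print(wer("the quick brown fox", "the quick brown dog"))   # 0.25
```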

Name Pattern:

  • Reference files: <name>.gt.<extension>
  • Prediction files: <name>.<extension>
  • Supported extensions: .txt, .hocr, .xml
  • Example: 0001.gt.xml / 0001.xml
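
The name pattern can be checked before running an evaluation. The following sketch pairs each reference file with its prediction file according to the pattern above. The function `pair_eval_files` is a hypothetical helper for illustration and not part of tesspage. The demo at the end builds a temporary folder so the example is self-contained.

```python
import tempfile
from pathlib import Path

SUPPORTED_EXTENSIONS = {".txt", ".hocr", ".xml"}


def pair_eval_files(folder):
    """Return (reference, prediction) path pairs following <name>.gt.<ext> / <name>.<ext>."""
    folder = Path(folder)
    pairs = []
    for ref in sorted(folder.glob("*.gt.*")):
        marker = ".gt" + ref.suffix
        if ref.suffix not in SUPPORTED_EXTENSIONS or not ref.name.endswith(marker):
            continue
        name = ref.name[: -len(marker)]
        pred = folder / (name + ref.suffix)
        if pred.is_file():  # references without a prediction are skipped
            pairs.append((ref, pred))
    return pairs


# Demo on a throwaway folder: only 0001 has both a reference and a prediction.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("0001.gt.xml", "0001.xml", "0002.gt.txt", "notes.md"):
        (Path(tmp) / filename).write_text("")
    demo_pairs = [(ref.name, pred.name) for ref, pred in pair_eval_files(tmp)]

print(demo_pairs)  # [('0001.gt.xml', '0001.xml')]
```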

Help

python3 tesspage.py -h

ZPD

Developed at the Zentrum für Philologie und Digitalität (ZPD) at the Julius-Maximilians-Universität Würzburg.
