ochre's Introduction

Ochre

Ochre is a toolbox for OCR post-correction. Please note that this software is experimental and very much a work in progress!

  • Overview of OCR post-correction data sets
  • Preprocess data sets
  • Train character-based language models/LSTMs for OCR post-correction
  • Do the post-correction
  • Assess the performance of OCR post-correction
  • Analyze OCR errors

Ochre contains ready-to-use data processing workflows (based on CWL). The software also allows you to create your own (OCR post-correction related) workflows. Examples of how to create these can be found in the notebooks directory (to be able to use those, make sure you have Jupyter Notebooks installed). This directory also contains notebooks that show how results can be analyzed and visualized.

Data sets

Installation

git clone git@github.com:KBNLresearch/ochre.git
cd ochre
pip install -r requirements.txt
python setup.py develop
  • Using the CWL workflows requires (the development version of) nlppln and its requirements (see installation guidelines).
  • To run a CWL workflow, type: cwltool (or cwl-runner) path/to/workflow.cwl <inputs> (if you run the command without inputs, the tool will tell you which inputs are required and how to specify them). For more information on running CWL workflows, have a look at the nlppln documentation. This is especially relevant for Windows users.
  • Please note that some of the CWL workflows contain absolute paths; to use them on your own machine, regenerate them using the associated Jupyter Notebooks.
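
For example, running one of the packed workflows without inputs lists the inputs it requires:

cwl-runner ochre/cwl/align-dir-pack.cwl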

Preprocessing

The software needs the data in the following formats:

  • ocr: text files containing the ocr-ed text, one file per unit (article, page, book, etc.)
  • gs: text files containing the gold standard (correct) text, one file per unit (article, page, book, etc.)
  • aligned: json files containing aligned character sequences:
{
    "ocr": ["E", "x", "a", "m", "p", "", "c"],
    "gs": ["E", "x", "a", "m", "p", "l", "e"]
}
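
These aligned files are produced by the CWL workflows described below (ochre's own aligner uses edlib). For illustration only, a minimal sketch using just Python's standard library could generate the same format; the placement of gaps may differ from ochre's output:

import difflib
import json

def align_chars(ocr_text, gs_text, empty_char=''):
    """Character-align two strings, padding gaps with empty_char."""
    ocr_seq, gs_seq = [], []
    matcher = difflib.SequenceMatcher(None, ocr_text, gs_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ocr_part, gs_part = list(ocr_text[i1:i2]), list(gs_text[j1:j2])
        # pad the shorter side so both sequences keep the same length
        length = max(len(ocr_part), len(gs_part))
        ocr_part += [empty_char] * (length - len(ocr_part))
        gs_part += [empty_char] * (length - len(gs_part))
        ocr_seq.extend(ocr_part)
        gs_seq.extend(gs_part)
    return {'ocr': ocr_seq, 'gs': gs_seq}

with open('aligned/1.json', 'w') as f:
    json.dump(align_chars('Exampc', 'Example'), f)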

Corresponding files in these directories should have the same name (or at least the same prefix), for example:

├── gs
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
├── ocr
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
└── aligned
    ├── 1.json
    ├── 2.json
    └── 3.json

To create data in these formats, CWL workflows are available. First, run a preprocess workflow to create the gs and ocr directories containing the expected files. Next, run an align workflow to create the aligned directory.

To create the alignments, run one of:

  • align-dir-pack.cwl to align all files in the gs and ocr directories
  • align-test-files-pack.cwl to align the test files in a data division

These workflows are stand-alone (packed); the associated notebook is align-workflow.ipynb.

Training networks for OCR post-correction

First, you need to divide the data into a train, validation and test set:

python -m ochre.create_data_division /path/to/aligned

The result of this command is a json file containing lists of file names, for example:

{
    "train": ["1.json", "2.json", "3.json", "4.json", "5.json", ...],
    "test": ["6.json", ...],
    "val": ["7.json", ...]
}
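
Such a division file can be read back with the standard json module, for example (file name hypothetical):

import json

with open('datadivision.json') as f:
    division = json.load(f)

print(len(division['train']), 'training files')
print(division['test'])  # e.g. ['6.json', ...]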
  • Training script: lstm_synched.py

OCR post-correction

If you trained a model, you can use it to correct OCR text using the lstm_synced_correct_ocr command:

python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file

or

cwltool /path/to/ochre/cwl/lstm_synced_correct_ocr.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --txt /path/to/ocr/text/file

The command creates a text file containing the corrected text.
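
For example, with a hypothetical model file, charset file, and OCR text file:

python -m ochre.lstm_synced_correct_ocr models/lstm.h5 models/chars.txt ocr/1.txt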

To generate corrected text for the test files of a dataset, do:

cwltool /path/to/ochre/cwl/post_correct_test_files.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --datadivision /path/to/data/division --in_dir /path/to/directory/with/ocr/text/files

To run it for a directory of text files, use:

cwltool /path/to/ochre/cwl/post_correct_dir.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --in_dir /path/to/directory/with/ocr/text/files

(These CWL workflows are stand-alone; the associated notebook is post_correction_workflows.ipynb.)

  • TODO: explain merging of predictions

Performance

To calculate performance of the OCR (post-correction), the external tool ocrevalUAtion is used. More information about this tool can be found on the website and wiki.

Two workflows are available for calculating performance. The first calculates performance for all files in a directory. To use it type:

cwltool /path/to/ochre/cwl/ocrevaluation-performance-wf-pack.cwl#main --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]

The second calculates performance for all files in the test set:

cwltool /path/to/ochre/cwl/ocrevaluation-performance-test-files-wf-pack.cwl --datadivision /path/to/datadivision.json --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]

Both of these workflows are stand-alone (packed). The corresponding Jupyter notebook is ocr-evaluation-workflow.ipynb.

To use the ocrevalUAtion tool in your workflows, you have to add it to the WorkflowGenerator's steps library:

wf.load(step_file='https://raw.githubusercontent.com/nlppln/ocrevaluation-docker/master/ocrevaluation.cwl')
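
For example, in a workflow-generation notebook (a minimal sketch; it assumes nlppln's WorkflowGenerator is used as in the notebooks above):

from nlppln import WorkflowGenerator

with WorkflowGenerator() as wf:
    # add the ocrevaluation step to the steps library
    wf.load(step_file='https://raw.githubusercontent.com/nlppln/ocrevaluation-docker/master/ocrevaluation.cwl')
    # the ocrevaluation step should now be listed
    print(wf.list_steps())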
  • TODO: explain how to calculate performance with ignore case (or use lowercase-directory.cwl)

OCR error analysis

Different types of OCR errors exist, e.g., structural vs. random mistakes. OCR post-correction methods may be suitable for fixing different types of errors. Therefore, it is useful to gain insight into what types of OCR errors occur. We chose to approach this problem on the word level. In order to be able to compare OCR errors on the word level, words in the OCR text and gold standard text need to be mapped. CWL workflows are available to do this. To create word mappings for the test files of a dataset, use:

cwltool  /path/to/ochre/cwl/word-mapping-test-files.cwl --data_div /path/to/datadivision --gs_dir /path/to/directory/containing/the/gold/standard/texts --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv

To create word mappings for two directories of files, do:

cwltool  /path/to/ochre/cwl/word-mapping-wf.cwl --gs_dir /path/to/directory/containing/the/gold/standard/texts/ --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv

(These workflows can be regenerated using the notebook word-mapping-workflow.ipynb.)

The result is a csv file containing mapped words. The first column contains a word id, the second the gold standard word, and the third the OCR word:

,gs,ocr
0,Hello,Hcllo
1,World,World
2,!,.

This csv file can be used to analyze the errors. See notebooks/categorize errors based on word mappings.ipynb for an example.
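
As a trivial illustration (not the heuristics from that notebook), the word mapping can be loaded with pandas and the differing words counted:

import pandas as pd

# keep_default_na=False prevents empty words from being read as NaN
wm = pd.read_csv('name-of-the-output-file.csv', index_col=0, keep_default_na=False)
errors = wm[wm['gs'] != wm['ocr']]
print('{} of {} words contain OCR errors'.format(len(errors), len(wm)))
print(errors.head())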

We use heuristics to categorize the following types of errors (ochre/ocrerrors.py):

  • TODO: add error types

OCR quality measure

Jupyter notebook

  • better (more balanced) training data is needed.

Generating training data

  • Scramble gold standard text

Ideas

  • Visualization of probabilities for each character (do the ocr mistakes have lower probability?) (probability=color)

License

Copyright (c) 2017-2018, Koninklijke Bibliotheek, Netherlands eScience Center

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


ochre's Issues

Working without aligned file

Hi
I’m conducting research on OCR corpora, and I would like to use this project to evaluate how differences in the training corpus affect the quality of the post-correction.
However, I have OCR files and GS files without the aligned JSON files that are needed. Is there a way to generate them (maybe a Smith-Waterman algorithm?) or to work without them?

Thanks
Omri

Error in align_output_to_input

In utils.py there is a try/except block that tries to align two strings.
In case of an exception, the code continues using the variable that was defined in the try block.

def align_output_to_input(input_str, output_str, empty_char=u'@'):
    t_output_str = output_str.encode('ASCII', 'replace')
    t_input_str = input_str.encode('ASCII', 'replace')
    try:
        r = edlib.align(t_input_str, t_output_str, task='path')
    except:
        print(input_str)
        print(output_str)
    r1, r2 = align_characters(input_str, output_str, r.get('cigar'),
                              empty_char=empty_char, sanity_check=False)
    while len(r2) < len(input_str):
        r2.append(empty_char)
    return u''.join(r2)

I don’t know whether it is acceptable to get an exception there, but if so, you can’t use r in the next statement. And if not, the try is redundant.

My input and output string (the prints) are:
.rj-f - j r . m, w. 1 -
.rj-f - j r . m, w. 1 –

And the exception is error: 'bytes' object has no attribute 'encode'
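
A possible defensive rewrite (just a sketch, not a tested fix; it reuses the edlib and align_characters imports already in utils.py) would decode bytes inputs first and let alignment failures propagate:

def align_output_to_input(input_str, output_str, empty_char=u'@'):
    # decode first, so bytes inputs don't crash on .encode()
    if isinstance(input_str, bytes):
        input_str = input_str.decode('utf-8')
    if isinstance(output_str, bytes):
        output_str = output_str.decode('utf-8')
    t_output_str = output_str.encode('ASCII', 'replace')
    t_input_str = input_str.encode('ASCII', 'replace')
    # no try/except: if alignment fails, r would be undefined anyway
    r = edlib.align(t_input_str, t_output_str, task='path')
    r1, r2 = align_characters(input_str, output_str, r.get('cigar'),
                              empty_char=empty_char, sanity_check=False)
    while len(r2) < len(input_str):
        r2.append(empty_char)
    return u''.join(r2)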

Please advise.

Thanks!

Issues in testing

We are using ochre for Indian-language data. After successfully installing ochre and training a model, we are unable to run a test. When we run this command
python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file
we get the following error. I would also like to mention that we have installed dateutil 2.5.1, but the same error still comes up:

/usr/bin/python: dateutil 2.5.0 is the minimum required version
I was wondering if you could please help me to fix the above issue.

In addition, I would like to know whether we need to prepare the charset manually to test the data, or whether it is generated automatically.

Permanent failure with VU recipe

Hi,
I'm trying to run the code with VU DNC dataset.
The link you provided didn't work and I downloaded it from here.
Now, when I run the vudnc-preprocess.cwl as follows:

in_dir="/home/dataset/VU/FoLiACMDI"
ocr_dir_name="/home/dataset/VU/Preprocess/ocr"
gs_dir_name="/home/dataset/VU/Preprocess/gs"
aligned_dir_name="/home/dataset/VU/Preprocess/aligned"
tmp_dir="/home/ochre/vu-tmp/"
tmp_dir_out="/home/ochre/vu-tmp-out/"
cachedir="/home/ochre/cachedir/"
align_m="align_m.csv"
align_c="align_c.csv"
ocr_n="ocr_n.csv"
gs_n="gs_n.csv"

cwltool |cwl-runner ochre/cwl/vudnc-preprocess.cwl --in_dir $in_dir --ocr_dir_name $ocr_dir_name --gs_dir_name $gs_dir_name --aligned_dir_name $aligned_dir_name --ocr_n $ocr_n --gs_n $gs_n --align_m $align_m --align_c $align_c

However, it permanently fails with the following message:

[step merge-json] Cannot make job: Value for file:///home/ochre/ochre/cwl/align-texts-wf.cwl#merge-json/in_files not specified

[workflow align-texts-wf] completed permanentFail

I'd be grateful if you could help to figure out the problem.
Thanks
H

Using ochre to evaluate synthetic ocr post processing dataset generation

Hi,
I’m working on a method for synthetically generating an OCR post-processing dataset.
I think that ochre could be a great project for benchmarking different datasets and evaluating which is better.
The evaluation method I was thinking about is to create one evaluation dataset and several synthetic datasets, then train ochre’s model on each synthetic dataset, correct the (very different) evaluation dataset, and see which model corrects it better (based on CER and WER metrics).

Here is one of my datasets (random errors based on some texts from Gutenberg):
https://drive.google.com/open?id=1TUd3M7StziFibGGLbpSth_wb1ZfE2DmI
And here is the evaluation dataset (a clean version of the ICDAR 2017 dataset):
https://drive.google.com/open?id=1zyIKlErr_Aho5UQgTXzJukRZCcZX2MiY
(13 files)

My problem is that I’m not much of a Python developer (more of a Java developer) and I’m not familiar with CWL.
I was wondering whether you plan to provide more documentation and how-tos for this project,
and whether you could add this scenario to your examples?

Thanks!
Omri

Where to start?

I am an intern at a company, and although I have little experience with Python and machine learning, they want me to be part of an OCR post-correction project like yours. But when I look at your project, it seems very confusing to me. Can you provide a guideline on how to put your code together and run it? It is very important for me. Best regards.

Error during preprocessing

Hi,
I am just trying to run this project and encountered several problems in the preprocessing part. I am new to CWL, so my questions may be quite basic; thanks in advance for your help:

  1. Since vudnc-preprocess.cwl can be run as stand-alone, how should I run it? Please give me more detailed instructions.

  2. When running the first cell of vudnc-preprocess-workflow.ipynb, it gives the following error:

ValueError Traceback (most recent call last)
in ()
13
14 changes_files, metadata_files = wf.align(file1=ocr, file2=gs, scatter=['file1', 'file2'], scatter_method='dotproduct')
---> 15 metadata = wf.merge_json(in_files=metadata_files, name=align_metadata)
16 changes = wf.merge_json(in_files=changes_files, name=align_changes)
17
(Something omitted here)
ValueError: "merge-json" not found in steps library. Please check your spelling or load additional steps

So, how can I solve this error?

  3. When running the first cell of , I made no modifications except setting working_dir='/home/ycsun/ochre/123/'; it then gives the following warnings:

WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/align-dir.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/sac-preprocess.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/ocrevaluation-performance-wf-pack.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/icdar2017st-extract-data-all.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/word-mapping-dir.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/word-mapping-test-files-wf.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/ocrevaluation-performance-test-files-wf-pack.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/align-test-files.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/kb-tss-preprocess-all.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:../123/icdar2017st-extract-data_.cwl:21:1: checking field steps
../123/icdar2017st-extract-data_.cwl:28:3: checking object ../123/icdar2017st-extract-data_.cwl#icdar2017st-extract-text-1
../123/icdar2017st-extract-data_.cwl:29:5: Field run contains undefined reference to file:///home/ycsun/icdar2017st-extract-text.cwl
../123/icdar2017st-extract-data_.cwl:22:3: checking object ../123/icdar2017st-extract-data_.cwl#ls-2
../123/icdar2017st-extract-data_.cwl:23:5: Field run contains undefined reference to file:///home/ycsun/ls.cwl
../123/icdar2017st-extract-data_.cwl:39:3: checking object ../123/icdar2017st-extract-data_.cwl#save-files-to-dir
../123/icdar2017st-extract-data_.cwl:40:5: Field run contains undefined reference to file:///home/ycsun/save-files-to-dir.cwl
../123/icdar2017st-extract-data_.cwl:46:3: checking object ../123/icdar2017st-extract-data_.cwl#save-files-to-dir-5
../123/icdar2017st-extract-data_.cwl:47:5: Field run contains undefined reference to file:///home/ycsun/save-files-to-dir.cwl
../123/icdar2017st-extract-data_.cwl:53:3: checking object ../123/icdar2017st-extract-data_.cwl#save-files-to-dir-9
../123/icdar2017st-extract-data_.cwl:54:5: Field run contains undefined reference to file:///home/ycsun/save-files-to-dir.cwl

In order to process the ICDAR 2017 dataset, what modifications should I make in the corresponding files?

Unsatisfying results

Hi,
I'm trying to test ochre on the ICDAR 2017 dataset (English only). I'm not using the workflows, but running ochre myself:

  1. I took all the English monographs and periodicals and cleaned them of special characters.
  2. I split the files into two folders (ocr and gs) as needed.
  3. I created alignment files using Hirschberg's algorithm and wrote JSON files.
  4. I called create_data_division.py.
  5. I called lstm_synced.py (bilstm, lower=false).
  6. Then, on each file, I called lstm_synced_correct_ocr.py.

Using ocrevalUAtion, I calculated the CER and WER and got unsatisfying results; in most cases the results are worse than the OCR itself.

Am I doing something wrong?
Did you get better results?

All chars assumption

Hi,
The train_lstm step writes an “all chars” text file, on the assumption that the training data contains all the characters in the corpus. But this is not necessarily true: training is done on limited data, and it may miss rare characters that turn up in the correction step.
Is that OK, or is this something that needs to be addressed?
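
One possible workaround (just a sketch, not part of ochre; the exact charset-file format ochre expects may differ) would be to build the character set over the whole corpus instead of only the training files:

import glob

# collect every character that occurs in the gold standard and OCR texts
chars = set()
for name in glob.glob('gs/*.txt') + glob.glob('ocr/*.txt'):
    with open(name, encoding='utf-8') as f:
        chars.update(f.read())

with open('chars.txt', 'w', encoding='utf-8') as f:
    f.write(''.join(sorted(chars)))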

Thanks!
Omri

print error - ICDAR2017_shared_task_workflows.ipynb

Hi guys,

I suggest changing print wf.list_steps() to print(wf.list_steps()) in the notebook ICDAR2017_shared_task_workflows.ipynb

Also, I was not able to run cwltool ochre/cwl/ICDAR2017_shared_task_workflows. This is what I got:
ochre/cwl/vudnc-preprocess-pack.cwl: error: argument --archive is required

/usr/bin/python: dateutil 2.5.0 is the minimum required version

When I use the command python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file

I receive the message /usr/bin/python: dateutil 2.5.0 is the minimum required version
But dateutil 2.5.0 is already installed.

About OCR_aligned and Lost or missing text

Hi,
I'm working on OCR post-correction tasks and ochre really helps me a lot, but I still have some questions and look forward to your reply.
When using ochre for OCR post-correction tasks, we only have the OCR_input. So how can I get the OCR_aligned from the OCR_input without gs? Alternatively, how do I deal with lost or missing text without aligned text?
Thanks!

Pretrained models

I was not able to find pretrained models/weights files. Are some available?
