
ochre's Issues

Using ochre to evaluate synthetic OCR post-processing dataset generation

Hi,
I'm working on a method for synthetically generating an OCR post-processing dataset.
I think that ochre could be a great project for benchmarking different datasets and evaluating which one is better.
The evaluation method I was thinking about is to create one evaluation dataset and several synthetic training datasets, then train ochre's model on each of them, correct the evaluation dataset (which is very different), and see which model corrects it best (based on CER and WER metrics).
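For comparing the datasets, a minimal CER/WER computation could look something like this (a sketch for illustration only, not ochre's or ocrevalUAtion's own evaluation code):

def edit_distance(ref, hyp):
    # classic dynamic-programming Levenshtein distance over two sequences
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(gs_text, ocr_text):
    # character error rate: edit distance normalised by gold standard length
    return edit_distance(list(gs_text), list(ocr_text)) / len(gs_text)

def wer(gs_text, ocr_text):
    # word error rate: the same distance over whitespace-separated tokens
    gs_words = gs_text.split()
    return edit_distance(gs_words, ocr_text.split()) / len(gs_words)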

Here is one of my datasets (random errors based on some texts from Gutenberg):
https://drive.google.com/open?id=1TUd3M7StziFibGGLbpSth_wb1ZfE2DmI
And here is the evaluation dataset (a clean version of the ICDAR 2017 dataset):
https://drive.google.com/open?id=1zyIKlErr_Aho5UQgTXzJukRZCcZX2MiY
(13 files)

My problem is that I'm not much of a Python developer (more of a Java developer), and I'm not familiar with CWL.
I was wondering whether you plan to provide more documentation and how-tos for this project?
And could you add this scenario to your examples?

Thanks!
Omri

Working without aligned file

Hi
I'm conducting research on OCR corpora, and I would like to use this project to evaluate how differences in the training corpus affect the quality of the post-processing.
However, I have OCR files and GS files without the aligned JSON file that is needed. Is there a way to generate it (maybe with a Smith-Waterman algorithm?) or to work without it?
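
Something along these lines is what I had in mind (a sketch using edlib, which utils.py already imports; the JSON keys 'ocr' and 'gs' and the file names are my assumptions, so the exact schema should be checked against ochre's own align step; note also that edlib only handles alphabets of up to 256 distinct symbols and that getNiceAlignment needs a recent edlib release):

import json
import edlib

def align_pair(ocr_text, gs_text, gap='@'):
    # global (Needleman-Wunsch) alignment; task='path' makes edlib return the CIGAR
    result = edlib.align(ocr_text, gs_text, mode='NW', task='path')
    # expand the CIGAR into two gap-padded character sequences
    nice = edlib.getNiceAlignment(result, ocr_text, gs_text, gapSymbol=gap)
    return {'ocr': list(nice['query_aligned']),
            'gs': list(nice['target_aligned'])}

# hypothetical file names, just to show the round trip
with open('page.ocr.txt', encoding='utf-8') as f_ocr, \
        open('page.gs.txt', encoding='utf-8') as f_gs:
    aligned = align_pair(f_ocr.read(), f_gs.read())
with open('page.aligned.json', 'w', encoding='utf-8') as f_out:
    json.dump(aligned, f_out, ensure_ascii=False)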

Thanks
Omri

All chars assumption

Hi,
The train_lstm step writes an "all chars" text file under the assumption that the training data contains every character in the corpus. But this is not necessarily true: training is done on limited data, so it may miss rare characters that do occur at correction time.
Is that OK, or is it something that needs to be addressed?
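
If it does need addressing, one idea would be to map characters the model has never seen to a known placeholder before correction (just a sketch of the idea, not ochre's current behaviour; the chars.txt name, the OCR file name, and the space placeholder are my assumptions):

def replace_unknown_chars(text, known_chars, replacement=' '):
    # replace every character that is missing from the "all chars" file
    known = set(known_chars)
    return ''.join(c if c in known else replacement for c in text)

with open('chars.txt', encoding='utf-8') as f:  # the "all chars" file from training
    known_chars = f.read()
with open('page.ocr.txt', encoding='utf-8') as f:  # hypothetical OCR text file
    ocr_text = replace_unknown_chars(f.read(), known_chars)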

Thanks!
Omri

Error in align_output_to_input

In utils.py there is a try–except block that tries to align two strings.
If an exception is raised, the code continues and uses the variable that was assigned inside the try block, which is then undefined.

def align_output_to_input(input_str, output_str, empty_char=u'@'):
    t_output_str = output_str.encode('ASCII', 'replace')
    t_input_str = input_str.encode('ASCII', 'replace')
    try:
        r = edlib.align(t_input_str, t_output_str, task='path')
    except:
        print(input_str)
        print(output_str)
    r1, r2 = align_characters(input_str, output_str, r.get('cigar'),
                              empty_char=empty_char, sanity_check=False)
    while len(r2) < len(input_str):
        r2.append(empty_char)
    return u''.join(r2)

I don't know whether it is acceptable to get an exception there, but if it is, you can't use r in the next statement. And if it isn't, the try is redundant.

My input and output strings (the prints) are:
.rj-f - j r . m, w. 1 -
.rj-f - j r . m, w. 1 –

And the exception is error: 'bytes' object has no attribute 'encode'
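
If the exception itself is expected, maybe the function should decode bytes arguments and stop instead of continuing with an undefined r. Something like this, perhaps (a sketch only, reusing utils.py's existing edlib import and align_characters helper, not necessarily the right fix):

def align_output_to_input(input_str, output_str, empty_char=u'@'):
    # accept both str and bytes, since bytes apparently reach this function
    if isinstance(input_str, bytes):
        input_str = input_str.decode('utf-8')
    if isinstance(output_str, bytes):
        output_str = output_str.decode('utf-8')
    t_output_str = output_str.encode('ASCII', 'replace')
    t_input_str = input_str.encode('ASCII', 'replace')
    try:
        r = edlib.align(t_input_str, t_output_str, task='path')
    except Exception:
        print(input_str)
        print(output_str)
        raise  # without an alignment, r is undefined and nothing sensible can be returned
    r1, r2 = align_characters(input_str, output_str, r.get('cigar'),
                              empty_char=empty_char, sanity_check=False)
    while len(r2) < len(input_str):
        r2.append(empty_char)
    return u''.join(r2)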

Please advise.

Thanks!

Issues in testing

We are using ochre for Indian-language data. After successfully installing ochre and building a model, we are unable to run a test. When we run this command:
python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file
we get the following error. I would also like to mention that we have installed dateutil 2.5.1, but the same error still appears:

/usr/bin/python: dateutil 2.5.0 is the minimum required version
I was wondering if you could please help me to fix the above issue.

In addition, I would like to know whether we need to prepare the charset manually to test the data, or whether it is generated automatically.
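
If it has to be prepared manually, I assume collecting every character that occurs in the training texts would be enough, roughly like this (a sketch; the paths and the one-string file format are my assumptions, so the file ochre writes during training should be compared against it):

import glob

chars = set()
for path in glob.glob('/path/to/training/texts/*.txt'):  # placeholder path
    with open(path, encoding='utf-8') as f:
        chars.update(f.read())

with open('chars.txt', 'w', encoding='utf-8') as f:  # hypothetical output name
    f.write(''.join(sorted(chars)))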

About OCR_aligned and Lost or missing text

Hi,
I'm working on OCR post-correction tasks and ochre really helps me a lot, but I still have some questions and look forward to your reply.
When using ochre for OCR post-correction, we only have the OCR_input. So how can I get OCR_aligned from OCR_input without a GS? Otherwise, how should lost or missing text be handled without aligned text?
Thanks!

print error - ICDAR2017_shared_task_workflows.ipynb

Hi guys,

I suggest changing print wf.list_steps() to print(wf.list_steps()) in the notebook ICDAR2017_shared_task_workflows.ipynb.

Also, I was not able to run cwltool on ochre/cwl/ICDAR2017_shared_task_workflows. This is what I got:
ochre/cwl/vudnc-preprocess-pack.cwl: error: argument --archive is required

Error during preprocessing

Hi,
I am trying to run this project and encountered several problems in the preprocessing part. I am new to CWL, so my questions may be quite basic; thanks in advance for your help:

  1. Since vudnc-preprocess.cwl can be run stand-alone, how should I run it? Please give me more detailed instructions.

  2. When running the first cell of vudnc-preprocess-workflow.ipynb, I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
     13
     14 changes_files, metadata_files = wf.align(file1=ocr, file2=gs, scatter=['file1', 'file2'], scatter_method='dotproduct')
---> 15 metadata = wf.merge_json(in_files=metadata_files, name=align_metadata)
     16 changes = wf.merge_json(in_files=changes_files, name=align_changes)
     17
(something omitted here)
ValueError: "merge-json" not found in steps library. Please check your spelling or load additional steps

So, how can I solve this error?
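
From the message it sounds like the CWL steps shipped in ochre/cwl (among them merge-json) were never loaded. Is something like the following the intended way to load them (a sketch; the steps path is my guess, based on where I checked out ochre)?

from scriptcwl import WorkflowGenerator

with WorkflowGenerator(working_dir='/home/ycsun/ochre/123/') as wf:
    # load the CWL step definitions shipped with ochre, so that steps such as
    # merge-json are known before the notebook cells try to use them
    wf.load(steps_dir='/home/ycsun/ochre/ochre/cwl/')
    print(wf.list_steps())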

  3. When running the first cell of , I made no modification except setting working_dir='/home/ycsun/ochre/123/', and it then gives the following warnings:

WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/align-dir.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/sac-preprocess.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/ocrevaluation-performance-wf-pack.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/icdar2017st-extract-data-all.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/word-mapping-dir.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/word-mapping-test-files-wf.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/ocrevaluation-performance-test-files-wf-pack.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/align-test-files.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/kb-tss-preprocess-all.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:../123/icdar2017st-extract-data_.cwl:21:1: checking field steps
../123/icdar2017st-extract-data_.cwl:28:3: checking object ../123/icdar2017st-extract-data_.cwl#icdar2017st-extract-text-1
../123/icdar2017st-extract-data_.cwl:29:5: Field run contains undefined reference to file:///home/ycsun/icdar2017st-extract-text.cwl
../123/icdar2017st-extract-data_.cwl:22:3: checking object ../123/icdar2017st-extract-data_.cwl#ls-2
../123/icdar2017st-extract-data_.cwl:23:5: Field run contains undefined reference to file:///home/ycsun/ls.cwl
../123/icdar2017st-extract-data_.cwl:39:3: checking object ../123/icdar2017st-extract-data_.cwl#save-files-to-dir
../123/icdar2017st-extract-data_.cwl:40:5: Field run contains undefined reference to file:///home/ycsun/save-files-to-dir.cwl
../123/icdar2017st-extract-data_.cwl:46:3: checking object ../123/icdar2017st-extract-data_.cwl#save-files-to-dir-5
../123/icdar2017st-extract-data_.cwl:47:5: Field run contains undefined reference to file:///home/ycsun/save-files-to-dir.cwl
../123/icdar2017st-extract-data_.cwl:53:3: checking object ../123/icdar2017st-extract-data_.cwl#save-files-to-dir-9
../123/icdar2017st-extract-data_.cwl:54:5: Field run contains undefined reference to file:///home/ycsun/save-files-to-dir.cwl

In order to process the ICDAR 2017 dataset, what modifications should I make to the corresponding files?

Permanent failure with VU recipe

Hi,
I'm trying to run the code with the VU DNC dataset.
The link you provided didn't work, so I downloaded it from here.
Now, when I run vudnc-preprocess.cwl as follows:

in_dir="/home/dataset/VU/FoLiACMDI"
ocr_dir_name="/home/dataset/VU/Preprocess/ocr"
gs_dir_name="/home/dataset/VU/Preprocess/gs"
aligned_dir_name="/home/dataset/VU/Preprocess/aligned"
tmp_dir="/home/ochre/vu-tmp/"
tmp_dir_out="/home/ochre/vu-tmp-out/"
cachedir="/home/ochre/cachedir/"
align_m="align_m.csv"
align_c="align_c.csv"
ocr_n="ocr_n.csv"
gs_n="gs_n.csv"

cwltool ochre/cwl/vudnc-preprocess.cwl --in_dir $in_dir --ocr_dir_name $ocr_dir_name --gs_dir_name $gs_dir_name --aligned_dir_name $aligned_dir_name --ocr_n $ocr_n --gs_n $gs_n --align_m $align_m --align_c $align_c

However, it permanently fails with the following message:

[step merge-json] Cannot make job: Value for file:///home/ochre/ochre/cwl/align-texts-wf.cwl#merge-json/in_files not specified

[workflow align-texts-wf] completed permanentFail

I'd be grateful if you could help me figure out the problem.
Thanks
H

Unsatisfying results

Hi,
I'm trying to test ochre on the ICDAR 2017 dataset (English only). I'm not using the workflows but running ochre's scripts myself:

  1. I took all the English monographs and periodicals and cleaned them of special characters.
  2. I split the files into two folders (ocr and gs) as needed.
  3. I created alignment files using Hirschberg's algorithm and wrote them out as JSON files.
  4. I called create_data_division.py.
  5. I called lstm_synced.py (bilstm, lower=false).
  6. Then, on each file, I called lstm_synced_correct_ocr.py.

Then, using ocrevalUAtion, I calculated the CER and WER and got unsatisfying results: in most cases the corrected output scores worse than the OCR itself.

Am I doing something wrong?
Did you get better results?

/usr/bin/python: dateutil 2.5.0 is the minimum required version

When I use the command python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file

I receive the message: /usr/bin/python: dateutil 2.5.0 is the minimum required version
But dateutil 2.5.0 is already installed.
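
Would checking which dateutil the failing interpreter actually sees help? For example (just a diagnostic sketch, run with the same /usr/bin/python that prints the error):

import sys
import dateutil

print(sys.executable)        # which Python is actually being used
print(dateutil.__version__)  # which dateutil version that interpreter imports
print(dateutil.__file__)     # where that dateutil is installed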

Pretrained models

I was not able to find pretrained model/weights files. Are any available?

Where to start?

I am an intern at a company and, although I have little experience with Python and machine learning, they want me to be part of an OCR post-correction project like yours. When I look at your project, it is very confusing to me. Can you provide a guideline on how to put your code together and run it? It is very important for me. Best regards.
