
ochre's Issues

Using ochre to evaluate synthetic OCR post-processing dataset generation

Hi,
I'm working on a method for synthetically generating an OCR post-processing dataset.
I think that ochre could be a great project for benchmarking different datasets and evaluating which one is better.
The evaluation method I was thinking about is to create one evaluation dataset and several synthetic training datasets, then train ochre's model on each of them, correct the evaluation dataset (which is very different), and see which model corrects it best (based on CER and WER metrics).
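For comparing the datasets, a minimal CER/WER computation could look something like this (a sketch for illustration only, not ochre's or ocrevalUAtion's own evaluation code):

def edit_distance(ref, hyp):
    # classic dynamic-programming Levenshtein distance over two sequences
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(gs_text, ocr_text):
    # character error rate: edit distance normalised by gold standard length
    return edit_distance(list(gs_text), list(ocr_text)) / len(gs_text)

def wer(gs_text, ocr_text):
    # word error rate: the same distance over whitespace-separated tokens
    gs_words = gs_text.split()
    return edit_distance(gs_words, ocr_text.split()) / len(gs_words)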

Here is one of my datasets (random errors based on some texts from Gutenberg):
https://drive.google.com/open?id=1TUd3M7StziFibGGLbpSth_wb1ZfE2DmI
And here is the evaluation dataset (a clean version of the ICDAR 2017 dataset):
https://drive.google.com/open?id=1zyIKlErr_Aho5UQgTXzJukRZCcZX2MiY
(13 files)

My problem is that I'm not much of a Python developer (more of a Java developer), and I'm not familiar with CWL.
I was wondering whether you plan to provide more documentation and how-tos for this project?
And could you add this scenario to your examples?

Thanks!
Omri

Working without aligned file

Hi
I'm conducting research on OCR corpora, and I would like to use this project to evaluate how differences in the training corpus affect the quality of the post-processing.
However, I have OCR files and GS files without the aligned JSON file that is needed. Is there a way to generate it (maybe with a Smith-Waterman algorithm?) or to work without it?
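
Something along these lines is what I had in mind (a sketch using edlib, which utils.py already imports; the JSON keys 'ocr' and 'gs' and the file names are my assumptions, so the exact schema should be checked against ochre's own align step; note also that edlib only handles alphabets of up to 256 distinct symbols and that getNiceAlignment needs a recent edlib release):

import json
import edlib

def align_pair(ocr_text, gs_text, gap='@'):
    # global (Needleman-Wunsch) alignment; task='path' makes edlib return the CIGAR
    result = edlib.align(ocr_text, gs_text, mode='NW', task='path')
    # expand the CIGAR into two gap-padded character sequences
    nice = edlib.getNiceAlignment(result, ocr_text, gs_text, gapSymbol=gap)
    return {'ocr': list(nice['query_aligned']),
            'gs': list(nice['target_aligned'])}

# hypothetical file names, just to show the round trip
with open('page.ocr.txt', encoding='utf-8') as f_ocr, \
        open('page.gs.txt', encoding='utf-8') as f_gs:
    aligned = align_pair(f_ocr.read(), f_gs.read())
with open('page.aligned.json', 'w', encoding='utf-8') as f_out:
    json.dump(aligned, f_out, ensure_ascii=False)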

Thanks
Omri

All chars assumption

Hi,
The train_lstm step writes an "all chars" text file under the assumption that the training data contains every character in the corpus. But this is not necessarily true: training is done on limited data, so it may miss rare characters that do occur at correction time.
Is that OK, or is it something that needs to be addressed?
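
If it does need addressing, one idea would be to map characters the model has never seen to a known placeholder before correction (just a sketch of the idea, not ochre's current behaviour; the chars.txt name, the OCR file name, and the space placeholder are my assumptions):

def replace_unknown_chars(text, known_chars, replacement=' '):
    # replace every character that is missing from the "all chars" file
    known = set(known_chars)
    return ''.join(c if c in known else replacement for c in text)

with open('chars.txt', encoding='utf-8') as f:  # the "all chars" file from training
    known_chars = f.read()
with open('page.ocr.txt', encoding='utf-8') as f:  # hypothetical OCR text file
    ocr_text = replace_unknown_chars(f.read(), known_chars)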

Thanks!
Omri

Error in align_output_to_input

In utils.py there is a try–except block that tries to align two strings.
If an exception is raised, the code continues and uses the variable that was assigned inside the try block, which is then undefined.

def align_output_to_input(input_str, output_str, empty_char=u'@'):
    t_output_str = output_str.encode('ASCII', 'replace')
    t_input_str = input_str.encode('ASCII', 'replace')
    try:
        r = edlib.align(t_input_str, t_output_str, task='path')
    except:
        print(input_str)
        print(output_str)
    r1, r2 = align_characters(input_str, output_str, r.get('cigar'),
                              empty_char=empty_char, sanity_check=False)
    while len(r2) < len(input_str):
        r2.append(empty_char)
    return u''.join(r2)

I don't know whether it is acceptable to get an exception there, but if it is, you can't use r in the next statement. And if it isn't, the try is redundant.

My input and output strings (the prints) are:
.rj-f - j r . m, w. 1 -
.rj-f - j r . m, w. 1 –

And the exception is error: 'bytes' object has no attribute 'encode'
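
If the exception itself is expected, maybe the function should decode bytes arguments and stop instead of continuing with an undefined r. Something like this, perhaps (a sketch only, reusing utils.py's existing edlib import and align_characters helper, not necessarily the right fix):

def align_output_to_input(input_str, output_str, empty_char=u'@'):
    # accept both str and bytes, since bytes apparently reach this function
    if isinstance(input_str, bytes):
        input_str = input_str.decode('utf-8')
    if isinstance(output_str, bytes):
        output_str = output_str.decode('utf-8')
    t_output_str = output_str.encode('ASCII', 'replace')
    t_input_str = input_str.encode('ASCII', 'replace')
    try:
        r = edlib.align(t_input_str, t_output_str, task='path')
    except Exception:
        print(input_str)
        print(output_str)
        raise  # without an alignment, r is undefined and nothing sensible can be returned
    r1, r2 = align_characters(input_str, output_str, r.get('cigar'),
                              empty_char=empty_char, sanity_check=False)
    while len(r2) < len(input_str):
        r2.append(empty_char)
    return u''.join(r2)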

Please advise.

Thanks!

Issues in testing

We are using ochre for Indian-language data. After successfully installing ochre and building a model, we are unable to run a test. When we run this command:
python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file
we get the following error. I would also like to mention that we have installed dateutil 2.5.1, but the same error still appears:

/usr/bin/python: dateutil 2.5.0 is the minimum required version
I was wondering if you could please help me to fix the above issue.

In addition, I would like to know whether we need to prepare the charset manually to test the data, or whether it is generated automatically.
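
If it has to be prepared manually, I assume collecting every character that occurs in the training texts would be enough, roughly like this (a sketch; the paths and the one-string file format are my assumptions, so the file ochre writes during training should be compared against it):

import glob

chars = set()
for path in glob.glob('/path/to/training/texts/*.txt'):  # placeholder path
    with open(path, encoding='utf-8') as f:
        chars.update(f.read())

with open('chars.txt', 'w', encoding='utf-8') as f:  # hypothetical output name
    f.write(''.join(sorted(chars)))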

About OCR_aligned and Lost or missing text

Hi,
I'm working on OCR post-correction tasks and ochre really helps me a lot, but I still have some questions and look forward to your reply.
When using ochre for OCR post-correction, we only have the OCR_input. So how can I get OCR_aligned from OCR_input without a GS? Otherwise, how should lost or missing text be handled without aligned text?
Thanks!

print error - ICDAR2017_shared_task_workflows.ipynb

Hi guys,

I suggest changing print wf.list_steps() to print(wf.list_steps()) in the notebook ICDAR2017_shared_task_workflows.ipynb.

Also, I was not able to run cwltool on ochre/cwl/ICDAR2017_shared_task_workflows. This is what I got:
ochre/cwl/vudnc-preprocess-pack.cwl: error: argument --archive is required

Error during preprocessing

Hi,
I am trying to run this project and encountered several problems in the preprocessing part. I am new to CWL, so my questions may be quite basic; thanks in advance for your help:

  1. Since vudnc-preprocess.cwl can be run stand-alone, how should I run it? Please give me more detailed instructions.

  2. When running the first cell of vudnc-preprocess-workflow.ipynb, I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
     13
     14 changes_files, metadata_files = wf.align(file1=ocr, file2=gs, scatter=['file1', 'file2'], scatter_method='dotproduct')
---> 15 metadata = wf.merge_json(in_files=metadata_files, name=align_metadata)
     16 changes = wf.merge_json(in_files=changes_files, name=align_changes)
     17
(something omitted here)
ValueError: "merge-json" not found in steps library. Please check your spelling or load additional steps

So, how can I solve this error?
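
From the message it sounds like the CWL steps shipped in ochre/cwl (among them merge-json) were never loaded. Is something like the following the intended way to load them (a sketch; the steps path is my guess, based on where I checked out ochre)?

from scriptcwl import WorkflowGenerator

with WorkflowGenerator(working_dir='/home/ycsun/ochre/123/') as wf:
    # load the CWL step definitions shipped with ochre, so that steps such as
    # merge-json are known before the notebook cells try to use them
    wf.load(steps_dir='/home/ycsun/ochre/ochre/cwl/')
    print(wf.list_steps())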

  3. When running the first cell of , I made no modification except setting working_dir='/home/ycsun/ochre/123/', and it then gives the following warnings:

WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/align-dir.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/sac-preprocess.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/ocrevaluation-performance-wf-pack.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/icdar2017st-extract-data-all.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/word-mapping-dir.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/word-mapping-test-files-wf.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/ocrevaluation-performance-test-files-wf-pack.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/align-test-files.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:Not loading "/home/ycsun/ochre/123/kb-tss-preprocess-all.cwl", because it is a packed workflow.
WARNING:scriptcwl.library:../123/icdar2017st-extract-data_.cwl:21:1: checking field steps
../123/icdar2017st-extract-data_.cwl:28:3: checking object ../123/icdar2017st-extract-data_.cwl#icdar2017st-extract-text-1
../123/icdar2017st-extract-data_.cwl:29:5: Field run contains undefined reference to file:///home/ycsun/icdar2017st-extract-text.cwl
../123/icdar2017st-extract-data_.cwl:22:3: checking object ../123/icdar2017st-extract-data_.cwl#ls-2
../123/icdar2017st-extract-data_.cwl:23:5: Field run contains undefined reference to file:///home/ycsun/ls.cwl
../123/icdar2017st-extract-data_.cwl:39:3: checking object ../123/icdar2017st-extract-data_.cwl#save-files-to-dir
../123/icdar2017st-extract-data_.cwl:40:5: Field run contains undefined reference to file:///home/ycsun/save-files-to-dir.cwl
../123/icdar2017st-extract-data_.cwl:46:3: checking object ../123/icdar2017st-extract-data_.cwl#save-files-to-dir-5
../123/icdar2017st-extract-data_.cwl:47:5: Field run contains undefined reference to file:///home/ycsun/save-files-to-dir.cwl
../123/icdar2017st-extract-data_.cwl:53:3: checking object ../123/icdar2017st-extract-data_.cwl#save-files-to-dir-9
../123/icdar2017st-extract-data_.cwl:54:5: Field run contains undefined reference to file:///home/ycsun/save-files-to-dir.cwl

In order to process the ICDAR 2017 dataset, what modifications should I make to the corresponding files?

Permanent failure with VU recipe

Hi,
I'm trying to run the code with the VU DNC dataset.
The link you provided didn't work, so I downloaded it from here.
Now, when I run vudnc-preprocess.cwl as follows:

in_dir="/home/dataset/VU/FoLiACMDI"
ocr_dir_name="/home/dataset/VU/Preprocess/ocr"
gs_dir_name="/home/dataset/VU/Preprocess/gs"
aligned_dir_name="/home/dataset/VU/Preprocess/aligned"
tmp_dir="/home/ochre/vu-tmp/"
tmp_dir_out="/home/ochre/vu-tmp-out/"
cachedir="/home/ochre/cachedir/"
align_m="align_m.csv"
align_c="align_c.csv"
ocr_n="ocr_n.csv"
gs_n="gs_n.csv"

cwltool ochre/cwl/vudnc-preprocess.cwl --in_dir $in_dir --ocr_dir_name $ocr_dir_name --gs_dir_name $gs_dir_name --aligned_dir_name $aligned_dir_name --ocr_n $ocr_n --gs_n $gs_n --align_m $align_m --align_c $align_c

However, it permanently fails with the following message:

[step merge-json] Cannot make job: Value for file:///home/ochre/ochre/cwl/align-texts-wf.cwl#merge-json/in_files not specified

[workflow align-texts-wf] completed permanentFail

I'd be grateful if you could help me figure out the problem.
Thanks
H

Unsatisfying results

Hi,
I'm trying to test ochre on the ICDAR 2017 dataset (English only). I'm not using the workflows but running ochre's scripts myself:

  1. I took all the English monographs and periodicals and cleaned them of special characters.
  2. I split the files into two folders (ocr and gs) as needed.
  3. I created alignment files using Hirschberg's algorithm and wrote them out as JSON files.
  4. I called create_data_division.py.
  5. I called lstm_synced.py (bilstm, lower=false).
  6. Then, on each file, I called lstm_synced_correct_ocr.py.

Then, using ocrevalUAtion, I calculated the CER and WER and got unsatisfying results: in most cases the corrected output scores worse than the OCR itself.

Am I doing something wrong?
Did you get better results?

/usr/bin/python: dateutil 2.5.0 is the minimum required version

When I use the command python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file

I receive the message: /usr/bin/python: dateutil 2.5.0 is the minimum required version
But dateutil 2.5.0 is already installed.
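
Would checking which dateutil the failing interpreter actually sees help? For example (just a diagnostic sketch, run with the same /usr/bin/python that prints the error):

import sys
import dateutil

print(sys.executable)        # which Python is actually being used
print(dateutil.__version__)  # which dateutil version that interpreter imports
print(dateutil.__file__)     # where that dateutil is installed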

Pretrained models

I was not able to find pretrained model/weights files. Are any available?

Where to start?

I am an intern at a company and, although I have little experience with Python and machine learning, they want me to be part of an OCR post-correction project like yours. When I look at your project, it is very confusing to me. Can you provide a guideline on how to put your code together and run it? It is very important for me. Best regards.
