
Comments (3)

jvdzwaan commented on July 24, 2024

Thanks! The signature of wf.list_steps() changed, so, yes, you should do print(wf.list_steps()).
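To illustrate why the extra print() is needed: a minimal sketch, assuming (as the reply above implies) that list_steps() now returns the listing as a string instead of printing it. The StandInWorkflow class and the step names are hypothetical stand-ins, not the real workflow object or its step library.

```python
class StandInWorkflow:
    """Hypothetical stand-in for a workflow object; not the real API."""

    def __init__(self, steps):
        # Map of step name -> human-readable signature (illustrative only).
        self._steps = dict(steps)

    def list_steps(self):
        # Returns the listing as a string; it does not print anything itself.
        return "\n".join(
            f"{name}: {signature}"
            for name, signature in sorted(self._steps.items())
        )


wf = StandInWorkflow({
    "step_a": "out = wf.step_a(in_file)",  # hypothetical step
    "step_b": "out = wf.step_b(in_dir)",   # hypothetical step
})

# In a script, a bare call discards the returned string, so nothing is shown.
wf.list_steps()

# Wrapping the call in print() displays the listing.
print(wf.list_steps())
```

In an interactive session a bare call would echo the string's repr, with newlines shown as \n, so print() is still the readable option there.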

Please note that the workflow is for preprocessing the vudnc data; it has nothing to do with the ICDAR 2017 shared task. Also, I do not recommend using the vudnc data, because it is very noisy. But if you do want to preprocess it anyway, run:

cwltool ochre/cwl/vudnc-preprocess-pack.cwl --archive path/to/vudnc/archive
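The same command can also be driven from a Python script. This is only a sketch, assuming cwltool is installed and on the PATH and that the script runs from the directory containing ochre/; the helper names build_command and run_preprocess are hypothetical, and the archive path is the placeholder from the command above.

```python
import shutil
import subprocess

# Workflow path taken from the command above.
WORKFLOW = "ochre/cwl/vudnc-preprocess-pack.cwl"


def build_command(archive, workflow=WORKFLOW):
    """Build the cwltool argument list for the preprocessing workflow."""
    return ["cwltool", workflow, "--archive", archive]


def run_preprocess(archive):
    """Run the workflow, failing early if cwltool is not installed."""
    if shutil.which("cwltool") is None:
        raise RuntimeError("cwltool was not found on PATH")
    # check=True raises CalledProcessError if cwltool exits with an error.
    return subprocess.run(build_command(archive), check=True)


if __name__ == "__main__":
    # Show the command that would be executed; the path is a placeholder.
    print(" ".join(build_command("path/to/vudnc/archive")))
```

Passing the arguments as a list avoids shell quoting problems when the archive path contains spaces.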


thiagopx commented on July 24, 2024

You are correct. I meant that I was not able to run vudnc-preprocess-pack.cwl.

For good results in English, do you recommend using the English monograph partition of ICDAR? I trained on the monograph and periodical partitions separately, but the validation accuracy and loss were not good (nor were the tests I ran).

I would like to help with some additional documentation to improve reproducibility, but I need a roadmap of how to get significant results (mainly for English documents).


jvdzwaan commented on July 24, 2024

Unfortunately, ochre is not (yet) fit for training good OCR post-correction models. I plan to work on it in the future, but only as a hobby project. So no promises there!

Generally speaking, the OCR post-correction datasets are small. That's why I'm making a list of them, so they can be used for generalization. I don't think that training on the English monograph data will give you a model that will work on other data, because OCR errors tend to depend on time period, font, the OCR software that was used, etc.

