

HTREC 2022: "One Rule Based System to rule them all" - Improving the HTR output of Greek papyri and Byzantine manuscripts in simple way

Ranking

1st place synthetic and original data on leaderboard.

1st place synthetic data.

4th place original data.

The code was created by Team error_404_name_not_found (@konstantina_liagkou and @manos_papadatos).

The best single model we obtained during the competition was a rule-based model, with the following scores on real and synthetic data:

Data / Metrics      CERR     WERR
real                0.439    1.822
synthetic           0.096    1.292
real & synthetic    0.278    1.575

Introduction

The digitization of ancient texts is essential for analyzing ancient corpora and preserving cultural heritage. However, the transcription of ancient handwritten text using optical character recognition (OCR) methods remains challenging. Handwritten text recognition (HTR) concerns the conversion of scanned images of handwritten text into machine-encoded text. In contrast with OCR where the text to be transcribed is printed, HTR is more challenging and can lead to transcribed text that includes many more errors or even to no transcription at all when training data on the specific script (e.g., medieval) are not available.

Existing work on HTR combines OCR models and natural language processing (NLP) methods from fields such as grammatical error correction (GEC), which can assist with the task of post-correcting transcription errors. The post-correction task has been reported as expensive, time-consuming, and challenging for the human expert, especially for OCRed text of historical newspapers, where the error rate is as low as 10%. The focus of the HTREC challenge is the post-correction of HTR transcription errors, building on recent NLP advances such as the successful applications of Transformers and transfer learning. The ground truth of the evaluation set is used to score participating systems in terms of character error rate (CER).
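As a rough illustration of the metric, the sketch below computes CER as the character-level edit distance between a transcription and its ground truth, divided by the length of the ground truth. The function names are hypothetical, and `cer_reduction` assumes that the CERR column in the results table denotes the drop in CER after post-correction; the official evaluation script of the challenge may differ in details such as normalization or scaling.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate of a prediction against the ground truth."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(prediction, reference) / len(reference)


def cer_reduction(htr_output: str, corrected: str, reference: str) -> float:
    """Assumed reading of CERR: CER before correction minus CER after it."""
    return cer(htr_output, reference) - cer(corrected, reference)
```

Under this reading, a positive value means the post-correction system moved the text closer to the ground truth, and a negative value means it introduced more errors than it fixed.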


Code

First, we performed exploratory data analysis on the Greek text (eda.ipynb).

The challenge organizers shared some baseline models, which we brought together in baselines.ipynb.

Our initial approach was to apply advanced machine learning techniques. The first model we tried was a char-to-char model (lstm_seq2seq.ipynb). We then used a BERT-to-BERT model that fine-tunes either Ancient Greek BERT or Greek BERT (bert2bert.ipynb). However, the best scores were obtained with rule-based models (best_RuleBased.ipynb). The results folder contains the predictions of all the models, and error_analysis.ipynb compares all the results.

The data folder contains the dataset from the challenge (train.csv, test.csv), which includes both original and synthetic data (original_test.csv and synthetic_test.csv).
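To give a feel for what a rule-based post-corrector can look like, here is a minimal sketch that learns word-level substitution rules from pairs of HTR output and ground truth, then applies them to new text. It is not the rule set used in best_RuleBased.ipynb, which is not described here; the function names, the position-based token alignment and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def learn_word_rules(pairs, min_count=2):
    """Learn {wrong_word: corrected_word} rules from (htr_text, truth) pairs.

    Assumption: only pairs with the same number of whitespace-separated
    tokens are used, so that tokens can be aligned by position.
    """
    counts = defaultdict(Counter)
    for htr_text, truth in pairs:
        htr_tokens, truth_tokens = htr_text.split(), truth.split()
        if len(htr_tokens) != len(truth_tokens):
            continue  # skip pairs that cannot be aligned one-to-one
        for wrong, right in zip(htr_tokens, truth_tokens):
            counts[wrong][right] += 1

    rules = {}
    for wrong, targets in counts.items():
        right, seen = targets.most_common(1)[0]
        # Keep a rule only if the most frequent target differs from the
        # input word and was observed often enough to be trusted.
        if right != wrong and seen >= min_count:
            rules[wrong] = right
    return rules


def apply_rules(text, rules):
    """Replace every token that has a learned rule; leave the rest as-is."""
    return " ".join(rules.get(token, token) for token in text.split())
```

A lookup table like this can only fix errors it has already seen, but it never touches words it has no evidence about, which limits the risk of making a transcription worse.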

If you find our work useful for your research, please cite it as:

@inproceedings{liagkou-papadatos-2022-htrec,
    title = {HTREC 2022: "One Rule Based System to rule them all" - Improving the HTR output of Greek papyri and Byzantine manuscripts in simple way},
    author = "Liagkou, Konstantina and Papadatos, Emmanouil",
    month = nov,
    year = "2022",
    address = "Venice, Italy"
}
