
methal-project / fete

Fast Encoding of Theater in TEI: Automatic TEI generation based on OCR output

License: GNU General Public License v3.0

alsatian-theater digital-humanities humanites-numeriques ocr sequence-labeling sequence-labelling tei tei-xml theatre-alsacien conditional-random-fields

FETE: Fast Encoding of Theater in TEI

Introduction

FETE is an application that generates TEI for the body of theater plays in Alsatian, based on OCR output. It takes HOCR or ALTO as input. It outputs TEI for the play's body: <div> elements for acts and scenes (with the relevant @type attribute), stage directions (<stage>) and character speech turns (<sp> elements and their children, including the <speaker> element). When speech is in verse, lines are encoded as <l> elements.
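To make the target structure concrete, here is a minimal sketch that builds a tiny TEI body fragment by hand with Python's standard library. It is not actual FETE output: the @type values ("act", "scene"), the speaker name and the text are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_example_body():
    """Build a tiny, hand-made TEI body fragment (illustrative, not FETE output)."""
    body = ET.Element("body")
    # Assumed @type values; the README only says the "relevant" @type is set.
    act = ET.SubElement(body, "div", {"type": "act"})
    scene = ET.SubElement(act, "div", {"type": "scene"})

    stage = ET.SubElement(scene, "stage")
    stage.text = "A room."

    sp = ET.SubElement(scene, "sp")
    speaker = ET.SubElement(sp, "speaker")
    speaker.text = "FIRST CHARACTER"
    # Verse speech is encoded line by line with <l> elements.
    for line in ("First verse line,", "second verse line."):
        ET.SubElement(sp, "l").text = line
    return body


if __name__ == "__main__":
    body = build_example_body()
    ET.indent(body)  # pretty-printing helper, available from Python 3.9
    print(ET.tostring(body, encoding="unicode"))
```

Running the script prints the nested <div>, <stage>, <sp>, <speaker> and <l> elements described above.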

The <teiHeader> element for each play needs to be encoded separately, as do the play's <castList> element, any other front matter preceding the play's first scene, and any back matter following the last scene.

Inspired by earlier work (e.g. Grobid), the tool uses Conditional Random Fields (CRF) as implemented in sklearn-crfsuite. Lexical and typographical cues present in the OCR output, as well as token coordinates on the page, are exploited to generate TEI elements.
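As a rough illustration of the kind of input a CRF of this type works with, the sketch below turns OCR tokens into feature dictionaries that combine typographical cues with a page coordinate. The feature names and the token representation are hypothetical; the features FETE actually uses are defined in sklearn_crfsuite/features.py. sklearn-crfsuite models are trained on sequences of such dictionaries, one dictionary per token.

```python
def token_features(text, x0, page_width):
    """Return a feature dict for one OCR token (feature names are hypothetical)."""
    return {
        "lower": text.lower(),
        "is_upper": text.isupper(),
        "is_title": text.istitle(),
        "ends_with_period": text.endswith("."),
        "ends_with_colon": text.endswith(":"),
        # Relative horizontal position of the token's left edge on the page.
        "rel_x": round(x0 / page_width, 2),
    }


def line_to_features(tokens, page_width):
    """tokens: list of (text, x0) pairs, e.g. read from OCR bounding boxes."""
    return [token_features(text, x0, page_width) for text, x0 in tokens]


if __name__ == "__main__":
    for features in line_to_features([("HANS.", 100), ("Guten", 300)], 1000):
        print(features)
```

A capitalized token ending in a period near the left margin might, for instance, be a cue for a <speaker> element, but which cues matter in practice is learned from the training data.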

The tool was developed by Andrew Briand (University of Washington), in the context of work supervised by Pablo Ruiz within the Methal project (University of Strasbourg); the project is creating a large TEI-encoded corpus of theater in Alsatian varieties.

Application structure

  • example: example input, XML output obtained with it and CRF model used to predict the output.
  • hocr2alto: Scripts to convert between HOCR and ALTO formats.
    • Usage is documented in the script
    • Requires the ocr-fileformat package
  • sklearn_crfsuite: The main program is in this directory; see Generating TEI and Training a model below for its usage.
  • utils: Some scripts for common manipulations to HOCR and TEI documents. Usage described in the scripts.

Requirements

The tool requires the packages listed in requirements.txt. To install them, you can run pip install -r requirements.txt from the directory where requirements.txt resides.

It is not required, but it is good practice, to create a virtual environment for projects using the tool and install the requirements there. To create an environment, you can use venv or, if you have Anaconda, run conda create --name fete python=3.12. Then activate the environment (conda activate fete) and run pip install -r requirements.txt once the environment is active.

Generating TEI

To generate TEI based on a directory of HOCR files, use the following command from within the sklearn_crfsuite directory:

python main.py MODEL_PATH HOCR_DIRECTORY OUTPUT_TEI_PATH

For instance:

python main.py ../example/models/model-exp3-20221226.crf ../example/inputs/hocr-verbotte-fahne ../example/outputs/verbotte-fahne-exp3.xml

This will generate the TEI file ../example/outputs/verbotte-fahne-exp3.xml from the HOCR files in ../example/inputs/hocr-verbotte-fahne.
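If several plays need to be processed, the documented command can be wrapped in a small script. The following is a sketch under assumptions: it supposes one HOCR subdirectory per play under a common input directory, and the function names and output naming scheme are hypothetical. Only the main.py argument order comes from the usage shown above.

```python
import subprocess
import sys
from pathlib import Path


def build_command(model_path, hocr_dir, output_path):
    """Assemble the documented main.py call as an argument list."""
    return [sys.executable, "main.py", str(model_path), str(hocr_dir), str(output_path)]


def predict_all(model_path, inputs_root, outputs_root, dry_run=True):
    """Prepare (and optionally run) one main.py call per subdirectory of inputs_root."""
    commands = []
    for hocr_dir in sorted(Path(inputs_root).iterdir()):
        if not hocr_dir.is_dir():
            continue  # skip stray files next to the HOCR directories
        # Assumed naming scheme: one XML file named after the input directory.
        output_path = Path(outputs_root) / f"{hocr_dir.name}.xml"
        cmd = build_command(model_path, hocr_dir, output_path)
        commands.append(cmd)
        if not dry_run:
            # Must be run from within the sklearn_crfsuite directory.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in predict_all("../example/models/model-exp3-20221226.crf",
                           "../example/inputs", "../example/outputs"):
        print(" ".join(cmd))
```

With dry_run=True (the default here) the commands are only printed, which makes it easy to check the paths before anything is overwritten.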

Training a model

Training data consist of HOCR files and manually corrected TEI for them.

At the moment the training corpus is located in predefined directories inside sklearn_crfsuite:

  • html: The HOCR (from which the features are computed)
  • tei: The TEI (the labels to predict)

To train a model, use the following command from within sklearn_crfsuite:

python train.py html tei MODEL_OUTPUT_PATH

For instance:

python train.py html tei ../example/models/model-exp3-new.crf

The exp3 infix in the model filename has the following origin: several feature combinations were implemented in the tool. The best one was called exp3, and this model was trained with it, so we chose to include exp3 in the filename (output files are named manually).

Postprocessing the output XML

Let's show this with an example. If you trained a model using the example command above and use it to predict TEI for ../example/inputs/hocr-verbotte-fahne, your results should reproduce ../example/outputs/verbotte-fahne-exp3.xml.

The prediction looks reasonable, but you'll see that the file is not valid XML. This is because the model is designed to handle the play's body, from the start of the first act to the final curtain, but not the front matter and back matter that may precede and follow it. Since we did not remove the HOCR files for the front matter and back matter, the model tried to generate TEI from them as well, and errors in those portions were expected. Once the portions generated from the front matter and back matter are removed, the file is valid XML. To see the difference, compare ../example/outputs/verbotte-fahne-exp3.xml with ../example/outputs/verbotte-fahne-exp3-postpro.xml.

Instead of postprocessing the output XML by removing the front- and backmatter content, we could also remove the input HOCR files (or paragraphs if the body does not start and end on its own page) for such content before generating the XML output.
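A quick way to see whether an output file still needs postprocessing is to try parsing it. The sketch below assumes that "not valid XML" above means the file does not parse; it checks well-formedness only, not conformance to a TEI schema. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column of the first problem found.
        return False, str(err)
    return True, None


if __name__ == "__main__":
    print(check_well_formed("<body><div type='act'/></body>"))
    print(check_well_formed("<body><sp></body>"))
```

To check a file on disk, read its contents and pass them to the function. The position reported in the error message can help locate the front matter or back matter portions that need to be removed.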

Adapting to other languages

The lexical cues used by the tool are currently suitable for Alsatian theater. Paratext in Alsatian theater is often in German and sometimes in French. Accordingly, lexical cues are now provided in Alsatian varieties, besides German and French.

The tool's lexical features (see sklearn_crfsuite/features.py) could be adapted to further languages. For training, a corpus of plays in HOCR (or ALTO) and their corresponding TEI-encoded versions is needed (see Training a model above).
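As an illustration of what adapting lexical cues could involve, the sketch below keeps heading cue words in a per-language dictionary and extends it with a further language. All names and the data structure are hypothetical, and the word lists are examples only; the cues FETE actually uses, and how they are stored, are defined in sklearn_crfsuite/features.py.

```python
import re

# Illustrative cue words only (German and French terms for acts and scenes).
ACT_SCENE_CUES = {
    "de": {"akt", "aufzug", "szene", "auftritt"},
    "fr": {"acte", "scène"},
}


def add_language(cues, lang, words):
    """Return a copy of cues extended with lowercase cue words for lang."""
    extended = {key: set(value) for key, value in cues.items()}
    extended.setdefault(lang, set()).update(word.lower() for word in words)
    return extended


def is_heading_cue(token, cues):
    """True if the token, without trailing punctuation, is a known cue word."""
    word = re.sub(r"\W+$", "", token).lower()
    return any(word in words for words in cues.values())


if __name__ == "__main__":
    cues = add_language(ACT_SCENE_CUES, "en", ["Act", "Scene"])
    print(is_heading_cue("Scene.", cues))            # cue known after the extension
    print(is_heading_cue("Scene.", ACT_SCENE_CUES))  # not in the original lists
```

A boolean like this could then be used as one more token feature when training a model for the new language.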

How to cite

The software may be cited as:

  • Briand, Andrew & Ruiz Fabo, Pablo (2023). FETE: Fast Encoding of Theater in TEI.

You can also cite a related publication:

  • Ruiz Fabo, Pablo, Bernhard, Delphine, Briand, Andrew & Werner, Carole. (2024). Computational drama analysis from almost zero electronic text: The case of Alsatian theater. To appear in Andresen, Melanie & Reiter, Nils (eds.). Computational Drama Analysis: Reflecting Methods and Interpretations. Preprint at https://univoak.eu/islandora/object/islandora:157880

fete's Issues

python modules missing?

I am not an expert user of python, so forgive me if there is an obvious explanation for this. But when I cloned everything here, I expected to be able to retrain your model like this:

(base) lou@foxglove:~/Public/FETE/sklearn_crfsuite$ python train.py html tei wibble

But this does not work. :-(

Traceback (most recent call last):
  File "train.py", line 1, in <module>
    from acts import *
  File "/mnt/86fb67d9-80ad-4e7d-b46e-5a8987b36728/Public/FETE/sklearn_crfsuite/acts.py", line 8, in <module>
    import sklearn_crfsuite
ModuleNotFoundError: No module named 'sklearn_crfsuite'
