HAREM Datasets Preprocessing

The HAREM collections are popular Portuguese datasets that are commonly used for the Named Entity Recognition (NER) task. In their original XML format, some phrases can have multiple entity identification solutions, and entities can be assigned more than one class (<ALT> tags and | characters indicate multiple solutions). This annotation scheme is good for representing vagueness and indeterminacy. However, it introduces complications when modeling NER as a sequence tagging problem, especially during evaluation, because a single true answer is required.

The script xml_to_json.py converts the XML file to JSON format and selects a single solution for all <ALT> tags and vague entities:

  1. For each Entity with multiple classes, it selects the first valid class.
  2. For each <ALT> tag, it selects the solution with the highest number of entities.
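The two selection rules above can be sketched as follows. This is a minimal illustration, not the code in xml_to_json.py: the function names, the `(text, label)` pair representation, and the tie-breaking behavior are all assumptions made for the example.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical representation: an entity candidate is (text, label),
# where label is None for plain text that is not an entity.
Span = Tuple[str, Optional[str]]


def first_valid_class(categ: str, valid: Set[str]) -> Optional[str]:
    """Rule 1: return the first class of a '|'-separated string that is in `valid`."""
    for label in categ.split("|"):
        label = label.strip()
        if label in valid:
            return label
    return None  # no class of this entity is valid


def count_entities(alternative: List[Span]) -> int:
    """Count the spans in one alternative that carry an entity label."""
    return sum(1 for _, label in alternative if label is not None)


def pick_alternative(alternatives: List[List[Span]]) -> List[Span]:
    """Rule 2: return the alternative with the highest number of entities.

    max() keeps the first maximal item, so ties go to the earliest
    alternative. That tie-break is an assumption of this sketch.
    """
    return max(alternatives, key=count_entities)


valid = {"PESSOA", "ORGANIZACAO", "LOCAL", "TEMPO", "VALOR"}
alts = [
    [("Banco de Portugal", "ORGANIZACAO")],
    [("Banco", "ORGANIZACAO"), ("de", None), ("Portugal", "LOCAL")],
]
chosen = pick_alternative(alts)
```

Here `chosen` is the second alternative, because it contains two entities while the first contains one.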

The script has been tested on the First HAREM and MiniHAREM XML files.

Total and Selective scenarios

Recent works often train on and report performance for two scenarios: Total and Selective. The Total scenario corresponds to the full dataset with 10 entity classes:

  1. PESSOA (Person)
  2. ORGANIZACAO (Organization)
  3. LOCAL (Location)
  4. TEMPO (Date)
  5. VALOR (Value)
  6. ABSTRACCAO (Abstraction)
  7. ACONTECIMENTO (Event)
  8. COISA (Thing)
  9. OBRA (Title)
  10. OUTRO (Other)

The Selective scenario considers only the first 5 classes of the list above.

The script is compatible with both scenarios and selects entities according to the chosen scenario.
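As a rough sketch of how the two scenarios relate, the Selective label set is simply the first five entries of the Total list. The constant and function names below, and the dict-based entity representation, are hypothetical and are not taken from xml_to_json.py.

```python
from typing import Dict, List

# Class names in the order listed above.
TOTAL_CLASSES = [
    "PESSOA", "ORGANIZACAO", "LOCAL", "TEMPO", "VALOR",
    "ABSTRACCAO", "ACONTECIMENTO", "COISA", "OBRA", "OUTRO",
]
# The Selective scenario keeps only the first 5 classes.
SELECTIVE_CLASSES = TOTAL_CLASSES[:5]


def classes_for(scenario: str) -> List[str]:
    """Return the entity classes used in the given scenario."""
    if scenario == "total":
        return list(TOTAL_CLASSES)
    if scenario == "selective":
        return list(SELECTIVE_CLASSES)
    raise ValueError(f"unknown scenario: {scenario!r}")


def filter_entities(entities: List[Dict[str, str]], scenario: str) -> List[Dict[str, str]]:
    """Drop entities whose label is not part of the scenario (hypothetical helper)."""
    allowed = set(classes_for(scenario))
    return [entity for entity in entities if entity["label"] in allowed]


entities = [
    {"text": "Lisboa", "label": "LOCAL"},
    {"text": "Os Lusíadas", "label": "OBRA"},
]
kept = filter_entities(entities, "selective")
```

With the sample input, `kept` contains only the LOCAL entity, since OBRA is not among the Selective classes.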

Usage

The scripts are tested with Python 3.6.

Install the requirements:

$ pip install -r requirements.txt

Run the script:

$ python xml_to_json.py path_to_xml_file.xml --scenario [total|selective]

The converted file will be saved with the same name as the input file and the suffix -{scenario}.json.
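The naming rule can be illustrated with a small helper. This is only a sketch: `output_path` is a hypothetical function, the file name `FirstHAREM.xml` is a placeholder, and the sketch assumes the `.xml` extension is dropped before the suffix is appended, which the text above does not state explicitly.

```python
from pathlib import Path


def output_path(xml_path: str, scenario: str) -> Path:
    """Build the output path: same directory and base name, plus '-{scenario}.json'.

    Assumption: the input extension is removed before adding the suffix.
    """
    path = Path(xml_path)
    return path.with_name(f"{path.stem}-{scenario}.json")


out = output_path("data/FirstHAREM.xml", "selective")
```

Under that assumption, `out.name` is `FirstHAREM-selective.json` and the file sits next to the input XML.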

Tests

To run the tests, first install the test requirements, then run the test script:

$ pip install -r requirements_test.txt
$ HAREM_DATA_DIR=test_files/ python tests.py

harem_preprocessing's Issues

Documentation Update on the Second HAREM support addition

As highlighted in PR #6, the addition of the Second HAREM conversion feature requires documentation updates covering the following:

  1. Clarification on the format differences in the Second HAREM dataset and output file compared to the first and mini versions.
  2. Update the Usage section of the documentation with the new dataset conversion usage.
  3. Update the Usage section of the documentation with the output files options.

Add the Second Harem processing

As described in the project documentation, the script was tested on the First and mini HAREM datasets.

The Second HAREM dataset changed the XML pattern, adding the <p> tag to separate sentences in the dataset. As a result, the current script does not process this version of the dataset.

The script should be modified so that it also works on the Second HAREM XML format, outputting a JSON file for each document in the dataset.
