
Thai (scriptshifter issue, open)

lcnetdev commented on July 24, 2024
Thai

from scriptshifter.

Comments (4)

scossu commented on July 24, 2024

Plangsarn for Thai:
The link is http://164.115.23.167/plangsarn/. It was developed jointly by Thammasat University and the National Electronics and Computer Technology Center of Thailand. It is free to use for anyone with Internet access and follows the ALA-LC romanization table. The converter's accuracy is about 90%.

This is just informational. We may or may not want to integrate it. First we should find out whether the same result is achievable with a table and hooks. According to https://www.loc.gov/catdir/cpso/romanization/thai.pdf, the process is not trivial.


scossu commented on July 24, 2024

Fixed with #79.


scossu commented on July 24, 2024

Unfortunately, Aksharamukha does not provide ALA-LC compatibility. There are two options:

  1. Adapt Plangsarn (if we have access to the source code to reverse-engineer)
  2. Start from scratch under catalogers' supervision.


scossu commented on July 24, 2024

Following the 2024-06-20 meeting with LC catalogers and further discussion, we agreed on the following:

  1. Script-to-Roman transliteration is not deterministic because ALA-LC romanization adds spaces between words that are not present in the source script.
  2. Roman-to-script transliteration is not deterministic because romanized Thai assimilates multiple forms of the same letter into one, and it is impossible to recover the original form from the ALA-LC romanization.
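To make point 2 concrete, here is a minimal sketch of why a many-to-one table cannot be inverted. The handful of consonant mappings below reflects my reading of the ALA-LC Thai table and should be checked against the official PDF; the names `S2R_SUBSET` and `invert` are made up for this illustration and are not part of scriptshifter.

```python
from collections import defaultdict

# Illustrative subset only (assumption: verify against the ALA-LC Thai table).
# Several distinct Thai consonants share one romanized form.
S2R_SUBSET = {
    "ศ": "s",
    "ษ": "s",
    "ส": "s",
    "ข": "kh",
    "ค": "kh",
    "ฆ": "kh",
}


def invert(table):
    """Map each romanized form to every Thai character that produces it."""
    inverse = defaultdict(list)
    for thai, roman in table.items():
        inverse[roman].append(thai)
    return dict(inverse)


R2S_CANDIDATES = invert(S2R_SUBSET)

# Any romanized form with more than one candidate cannot be reversed
# without extra information (a dictionary, context, or a human).
ambiguous = {r: c for r, c in R2S_CANDIDATES.items() if len(c) > 1}

if __name__ == "__main__":
    for roman, candidates in ambiguous.items():
        print(roman, "->", candidates)
```

With this subset, both "s" and "kh" resolve to three candidates each, so an R2S pass can only guess or ask for help.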

Currently, the most accurate tool available (for S2R ONLY) is Plangsarn. The integration capabilities of this tool are still TBD.

LC Thai catalogers' workflow needs the R2S function because Romanized text is inserted first in the cataloging process.

Given the above conditions, the ideal approach (in terms of accuracy, not time efficiency) would be to propose significant changes to the ALA-LC tables so that at least R2S transliteration becomes lossless, by mapping each variant of a Thai character to a distinct Roman character with an added diacritic. (This may be a long-running task, but it would be the most beneficial for the community.)
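As a rough illustration of what "lossless" would mean here: if every Thai character in the table gets its own Roman form, the table is one-to-one and R2S becomes a plain reverse lookup. The diacritics chosen below are purely hypothetical placeholders, not a proposed revision of ALA-LC, and the function names are invented for this sketch.

```python
# Hypothetical one-to-one table: the diacritics are placeholders for
# illustration only and do NOT come from any ALA-LC table or proposal.
LOSSLESS_SUBSET = {
    "ศ": "\u015b",  # s with acute (hypothetical choice)
    "ษ": "\u1e63",  # s with dot below (hypothetical choice)
    "ส": "s",
}


def is_lossless(table):
    """A table is reversible only if no two keys share a value."""
    return len(set(table.values())) == len(table)


def romanize(text, table):
    """Replace each mapped character; leave everything else untouched."""
    return "".join(table.get(ch, ch) for ch in text)


def deromanize(text, table):
    """Reverse lookup; valid here because each Roman form is one code point."""
    inverse = {roman: thai for thai, roman in table.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sample = "ศษส"
    roman = romanize(sample, LOSSLESS_SUBSET)
    print(roman, deromanize(roman, LOSSLESS_SUBSET) == sample)
```

A real table would also have to cover vowels, tone marks, and multi-character Roman forms, which this sketch deliberately ignores.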

Similarly, removing artificial word splitting from S2R transliteration can be proposed. It is unknown at the moment how the community will react to this proposal. @RandyBarry will lead this initiative.

For S2R spacing issues, I'm testing the integration of ML-based part-of-speech analysis tools such as https://huggingface.co/KoichiYasuoka/roberta-base-thai-spm-upos, which have yielded good results (though not 100% accurate, especially on words that can be ambiguously compounded). This seems viable as an interim solution until a deterministic one based on an updated ALA-LC table is available.
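A minimal sketch of the spacing step that would follow tagging, assuming the tagger returns (piece, label) pairs where a `B-` prefix starts a word and an `I-` prefix continues one. That label scheme is an assumption, and the pairs below are hand-written placeholders rather than real model output; the sketch does not call the model at all.

```python
def group_words(tagged):
    """Merge sub-word pieces into words using B-/I- label prefixes.

    `tagged` is a list of (piece, label) pairs. The B-/I- convention is an
    assumption about the tagger's output format, not a documented guarantee.
    """
    words = []
    for piece, label in tagged:
        if label.startswith("I-") and words:
            words[-1] += piece  # continuation of the previous word
        else:
            words.append(piece)  # start of a new word
    return words


def insert_spaces(tagged):
    """Return the source text with a space at every detected word boundary."""
    return " ".join(group_words(tagged))


# Hand-written placeholder input (NOT real tagger output).
SAMPLE = [("ภา", "B-NOUN"), ("ษา", "I-NOUN"), ("ไทย", "B-PROPN")]

if __name__ == "__main__":
    print(insert_spaces(SAMPLE))
```

The spaced string can then go through the usual table-based S2R pass. Ambiguous compounds would still need a cataloger's review, since the tagger decides where the boundaries fall.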

