
Comments (4)

r12a commented on June 14, 2024

In case it helps, you can see the examples at the following URLs, and by clicking on "Show codepoints" (just above the large text box) you can see the underlying sequence of characters.

Tamil: http://r12a.github.io/pickers/taml/?text=%E0%AE%95%E0%AF%81

Balinese: http://r12a.github.io/pickers/bali/?text=%E1%AC%93%E1%AD%84%E1%AC%B1%E1%AD%80
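For anyone who would rather not open the pickers, the `text` parameter in each URL can also be decoded directly. Here is a minimal Python sketch of that idea; the helper name `codepoints_from_url` is made up for illustration, and the character names printed depend on the Unicode data bundled with the local Python build.

```python
from urllib.parse import urlparse, parse_qs
import unicodedata

# The two picker URLs quoted above.
TAMIL_URL = "http://r12a.github.io/pickers/taml/?text=%E0%AE%95%E0%AF%81"
BALINESE_URL = "http://r12a.github.io/pickers/bali/?text=%E1%AC%93%E1%AD%84%E1%AC%B1%E1%AD%80"


def codepoints_from_url(url):
    """Return the code points of the percent-encoded `text` query parameter.

    Hypothetical helper, not part of any tool mentioned in this thread.
    """
    # parse_qs percent-decodes the value as UTF-8 by default.
    text = parse_qs(urlparse(url).query)["text"][0]
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    for url in (TAMIL_URL, BALINESE_URL):
        for cp in codepoints_from_url(url):
            # Fall back to a placeholder if the local Unicode data lacks a name.
            name = unicodedata.name(chr(cp), "<unnamed>")
            print(f"U+{cp:04X} {name}")
        print()
```

The Tamil example decodes to two code points and the Balinese example to four, which is the underlying sequence that "Show codepoints" displays.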

from handwriting-recognition.

wacky6 commented on June 14, 2024

Updated "grapheme cluster" to "grapheme".

As for whether ᬓ᭄ᬱᭀ should be a single unit: if it can't be broken down, then it probably should be.

That aside, I am not sure whether recognizer models (for complex scripts) can handle these subtle differences. If they can't, the graphemeSet hint is essentially ignored.


r12a commented on June 14, 2024

I should have explained the point about the Balinese in a little more detail. (I'm hoping to create some permanent resources that describe these kinds of issues, but in the meantime I'll write something here.) The tool I pointed to for viewing the Balinese can help you understand this by analysing the text, but for clarity let me point out some of the issues here (and this is by no means a complicated scenario as complex scripts go).

The sequence of characters in memory is:

[Image: screenshot of the code point sequence]

When displayed, this results in the following. The black text indicates the first grapheme cluster (2 code points, including one that becomes invisible in this situation, though not in others). The third glyph from the left (SA) is shown as a special conjoined form (which indicates that there is no vowel sound between this and the previous letter). The brown text (all of it) indicates the glyphs associated with the 2nd grapheme cluster (2 code points, one of which - the vowel o~ɔ - is split around the whole consonant cluster).

[Image: screenshot of the rendered text, with the first grapheme cluster in black and the second in brown]

Note, btw, that the 1st glyph on the left could also represent a different vowel (e~ɛ) were it not (eventually) followed by the final glyph on the right.

Even though this is 2 grapheme clusters, the sequence cannot be broken in the middle at a line end, although other text operations, such as backspacing, do affect only part of the sequence.
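As a rough illustration of the two clusters, the sketch below attaches each combining mark to the character before it. This is a deliberate simplification, not the full Unicode text segmentation algorithm (UAX #29); the function name `approximate_clusters` is hypothetical, and real code should rely on a proper segmentation library.

```python
import unicodedata

# The Balinese sequence from the picker URL earlier in this thread.
BALINESE = "\u1B13\u1B44\u1B31\u1B40"
# The Tamil sequence from the same comment.
TAMIL = "\u0B95\u0BC1"


def approximate_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified stand-in for grapheme cluster segmentation (assumption:
    general category M* means "attach to the previous character").
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # Combining mark: extend the current cluster.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    for label, text in (("Tamil", TAMIL), ("Balinese", BALINESE)):
        clusters = approximate_clusters(text)
        print(label, "clusters:", len(clusters))
        for cluster in clusters:
            print("  ", " ".join(f"U+{ord(ch):04X}" for ch in cluster))
```

Under this approximation the Balinese string splits into two clusters of two code points each, even though, as noted above, a line break must not fall between them.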

All this is to illustrate the kinds of things that crop up when trying to figure out what is written by looking at the visual text of languages written in complex scripts. Of course, it's nowhere near as simple as for Latin. A good deal of contextual analysis is needed, multiple visual sequences need to be mapped to the same code points, the number of permutations of glyph combinations can be quite large, and the minimal units used for comparison may need to be smaller or larger than one grapheme cluster.


wacky6 commented on June 14, 2024

Closing. Terminology has been updated to "grapheme" / "user-perceived character".
