<a href="https://github.com/WICG/handwriting-recognition/blob/main/explainer.md#recogn

Definition of grapheme cluster about handwriting-recognition HOT 4 CLOSED

wicg commented on June 14, 2024

Definition of grapheme cluster

from handwriting-recognition.

Comments (4)

r12a commented on June 14, 2024

In case it helps, you can see the examples at the following URLs, and by clicking on "Show codepoints" (just above the large text box) you can see the underlying sequence of characters.

Tamil: http://r12a.github.io/pickers/taml/?text=%E0%AE%95%E0%AF%81

Balinese: http://r12a.github.io/pickers/bali/?text=%E1%AC%93%E1%AD%84%E1%AC%B1%E1%AD%80

wacky6 commented on June 14, 2024

Updated "grapheme cluster" to "grapheme".

As for whether ᬓ᭄ᬱᭀ should be a single unit, I think that if it can't be broken down, it probably should be.

That aside, I am not sure whether recognizer models (for complex scripts) can handle these subtle differences. If they can't, the graphemeSet hint is essentially ignored.

r12a commented on June 14, 2024

I should have explained the point about the Balinese in a little more detail. (I'm hoping to create some permanent resources that describe these kinds of issues, but in the meantime I'll write something here.) The tool I pointed to for viewing the Balinese can help you understand this by analysing the text, but for clarity let me point out some of the issues here (and this is by no means a complicated scenario as complex scripts go).

The sequence of characters in memory is:
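The code points can be recovered from the percent-encoded `text` parameter of the Balinese picker URL given earlier in the thread. Here is a minimal sketch using only the Python standard library; the helper name `code_points_from_url` is made up for illustration, and the character names printed depend on the Unicode database shipped with your Python build.

```python
from urllib.parse import urlsplit, parse_qs
import unicodedata

# URL copied from the earlier comment in this thread.
BALINESE_URL = (
    "http://r12a.github.io/pickers/bali/"
    "?text=%E1%AC%93%E1%AD%84%E1%AC%B1%E1%AD%80"
)


def code_points_from_url(url: str) -> list[tuple[str, str]]:
    """Return (U+XXXX, character name) pairs for the URL's `text` parameter.

    Hypothetical helper: parse_qs percent-decodes the query as UTF-8.
    """
    text = parse_qs(urlsplit(url).query)["text"][0]
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for code_point, name in code_points_from_url(BALINESE_URL):
    print(code_point, name)
```

For the Balinese link this yields four code points: U+1B13, U+1B44, U+1B31, U+1B40.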

When displayed, this results in the following. The black text indicates the first grapheme cluster (2 code points, including one that becomes invisible in this situation, though not in others). The third glyph from the left (SA) is shown as a special conjoined form (which indicates that there is no vowel sound between this and the previous letter). The brown text (all of it) indicates the glyphs associated with the 2nd grapheme cluster (2 code points, one of which - the vowel o~ɔ - is split around the whole consonant cluster).

Note, btw, that the 1st glyph on the left could also represent a different vowel (e~ɛ) were it not (eventually) followed by the final glyph on the right.

Even though this is 2 grapheme clusters, the sequence cannot be broken in the middle at a line end, although other text operations, such as backspacing, do affect only part of the sequence.
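To make the "2 grapheme clusters of 2 code points each" count concrete, here is a rough sketch that attaches each combining mark (general category `M*`) to the preceding base character. This is a deliberate simplification, not the segmentation algorithm of UAX #29, and the function name `rough_clusters` is made up for illustration. It happens to reproduce the split described above for this particular string, but should not be relied on for other text.

```python
import unicodedata


def rough_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplified assumption: any character whose general category starts
    with "M" (Mn, Mc, Me) joins the previous cluster. Real grapheme
    cluster segmentation has many more rules than this.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The four Balinese code points from the example in this thread.
balinese = "\u1b13\u1b44\u1b31\u1b40"

for cluster in rough_clusters(balinese):
    print([f"U+{ord(ch):04X}" for ch in cluster])
```

Note that this split says nothing about line breaking or rendering: as described above, the two clusters still form one visual unit that cannot be broken at a line end.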

All this to illustrate the kind of things that crop up when trying to figure out what is written by looking at the visual text of languages written in complex scripts. Of course, it's nowhere near as simple as for Latin. A good deal of contextual analysis is needed, multiple visual sequences need to be mapped to the same code points, the number of permutations of glyph combinations can be quite large, and the minimal units used for comparison may need to be equivalent to less or more than one grapheme cluster.

wacky6 commented on June 14, 2024

Closing. The terminology has been updated to "grapheme" / "user-perceived character".
