Comments (4)
In case it helps, you can see the examples at the following URLs, and by clicking on "Show codepoints" (just above the large text box) you can see the underlying sequence of characters.
Tamil: http://r12a.github.io/pickers/taml/?text=%E0%AE%95%E0%AF%81
Balinese: http://r12a.github.io/pickers/bali/?text=%E1%AC%93%E1%AD%84%E1%AC%B1%E1%AD%80
from handwriting-recognition.
Updated "grapheme cluster" to "grapheme".
As for whether ᬓ᭄ᬱᭀ should be a single unit: I think that if it can't be broken down, it probably should be.
That aside, I am not sure whether recognizer models for complex scripts can handle these subtle differences. If they cannot, the graphemeSet hint is essentially ignored.
I should have explained the point about the Balinese in a little more detail. (I'm hoping to create some permanent resources that describe these kinds of issues, but in the meantime I'll write something here.) The tool I pointed to for viewing the Balinese can help you understand this by analysing the text, but for clarity let me point out some of the issues here (and this is by no means a complicated scenario as complex scripts go).
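The sequence of characters in memory is:
U+1B13 BALINESE LETTER KA
U+1B44 BALINESE ADEG ADEG
U+1B31 BALINESE LETTER SA SAPA
U+1B40 BALINESE VOWEL SIGN TALING TEDUNG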
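One way to inspect this sequence without opening the picker is to decode the `text` parameter of the Balinese URL given earlier. The following is only a minimal sketch: the helper name `code_points` is made up for illustration, and the character names come from whatever Unicode data ships with the local Python build.

```python
import unicodedata
from urllib.parse import urlsplit, parse_qs

# URL copied from the earlier comment; it is only parsed here, never fetched.
BALINESE_URL = "http://r12a.github.io/pickers/bali/?text=%E1%AC%93%E1%AD%84%E1%AC%B1%E1%AD%80"


def code_points(url: str) -> list[tuple[str, str]]:
    """Return (U+XXXX, character name) pairs for the URL's `text` parameter.

    `code_points` is a hypothetical helper, not part of any picker API.
    """
    # parse_qs percent-decodes the query string as UTF-8 by default.
    text = parse_qs(urlsplit(url).query)["text"][0]
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for cp, name in code_points(BALINESE_URL):
    print(cp, name)
```

The same function should work on the Tamil URL, since both pickers carry the text in the same query parameter.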
When displayed, this results in the following. The black text indicates the first grapheme cluster (2 code points, including one that becomes invisible in this situation, though not in others). The third glyph from the left (SA) is shown as a special conjoined form (which indicates that there is no vowel sound between this and the previous letter). The brown text (all of it) indicates the glyphs associated with the 2nd grapheme cluster (2 code points, one of which - the vowel o~ɔ - is split around the whole consonant cluster).
Note, btw, that the 1st glyph on the left could also represent a different vowel (e~ɛ) were it not (eventually) followed by the final glyph on the right.
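The split vowel can also be seen at the code point level, because the vowel sign has a canonical decomposition into two parts. The sketch below assumes that the left-hand glyph corresponds to the first part and the right-hand glyph to the second; that mapping is my reading of the description above, not something the thread states.

```python
import unicodedata

# U+1B40 is the vowel sign (o~ɔ) at the end of the sequence discussed above.
TALING_TEDUNG = "\u1b40"

# NFD applies the canonical decomposition, which splits the sign in two.
parts = unicodedata.normalize("NFD", TALING_TEDUNG)

for ch in parts:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

With current Unicode data this should list a TALING part followed by a TEDUNG part, which is one reason a recognizer cannot decide what the left-hand glyph means until it has seen the right-hand end of the cluster.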
Even though this is 2 grapheme clusters, the sequence cannot be broken in the middle at a line end, although other text operations, such as backspacing, do affect only part of the sequence.
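To make the difference between these units concrete, here is a rough sketch that groups each base character with the combining marks that follow it. This is a deliberate simplification and not the full Unicode segmentation algorithm; `rough_clusters` is a made-up name, and modelling backspace as "remove the last code point" is an assumption for illustration only.

```python
import unicodedata


def rough_clusters(text: str) -> list[str]:
    """Group each non-mark character with the combining marks after it.

    A simplification for illustration; real segmenters follow UAX #29
    and may give different results depending on the Unicode version.
    """
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


SEQUENCE = "\u1b13\u1b44\u1b31\u1b40"  # the four code points discussed above

clusters = rough_clusters(SEQUENCE)
for cluster in clusters:
    print([f"U+{ord(ch):04X}" for ch in cluster])

# Assumed backspace model: drop only the final code point (the vowel sign).
after_backspace = SEQUENCE[:-1]
print(len(SEQUENCE), "->", len(after_backspace), "code points")
```

Under this approximation the sequence falls into two clusters of two code points each, while the line-breaking unit is the whole sequence and the assumed backspace unit is a single code point, so three different operations see three different "minimal units".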
All this is to illustrate the kinds of things that crop up when trying to figure out what is written by looking at the visual text of languages written in complex scripts. Of course, it's nowhere near as simple as for Latin. A good deal of contextual analysis is needed, multiple visual sequences need to be mapped to the same code points, the number of permutations of glyph combinations can be quite large, and the minimal units used for comparison may need to be smaller or larger than one grapheme cluster.
Closing. The terminology has been updated to "grapheme" / "user-perceived character".