widecharwidth's People

Contributors

artoria2e5, faho, krobelus, ridiculousfish, slysven, thefallentree, wez, xtaixe

widecharwidth's Issues

Hangul Jamo vowels and trailing consonants should probably be 0 width

I've been looking at widths reported for Hangul Jamo in wcwidth implementations.

In glibc and MirBSD xterm, U+1160..U+11FF and U+D7B0..U+D7FF have 0 width.

In xterm/ncurses, glib (g_unichar_iszerowidth), and Rust's
unicode-width, U+1160..U+11FF have 0 width.

Konsole gave U+1160..U+11FF width 0 until October 2018, when it
switched from a wcwidth() based on Markus Kuhn's implementation to one
generated from Unicode data files; since then it returns width 1
(https://bugs.kde.org/show_bug.cgi?id=396435#c21).

musl, libunistring, and vim/NeoVim seem to know
nothing about Hangul Jamo and return width 1.

Some context follows:

Korean Hangul is a writing system which uses syllable blocks
consisting of alphabetic components (jamo). A syllable consists of one
or more leading consonants, one or more vowels, and zero or more
trailing consonants.

Unicode has precomposed syllable blocks at U+AC00..U+D7A3 (11,172 code points).
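
The count of 11,172 follows from the layout of that block: 19 leading consonants times 21 vowels times 28 trailing slots (27 consonants plus "none"). A minimal Python sketch of that arithmetic, using the constants from the Unicode Standard's Hangul syllable composition algorithm; the function name `compose_syllable` is made up for illustration:

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"


def compose_syllable(lead, vowel, trail=None):
    """Map modern conjoining jamo (L, V, optional T) to a precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    # Index 0 is reserved for "no trailing consonant", so a real T starts at 1.
    t_index = 0 if trail is None else ord(trail) - T_BASE
    t_ok = t_index == 0 if trail is None else 1 <= t_index < T_COUNT
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT and t_ok):
        raise ValueError("not a modern L/V/T jamo combination")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


if __name__ == "__main__":
    print(L_COUNT * V_COUNT * T_COUNT)  # size of the precomposed block
    decomposed = "\u1112\u1161\u11ab"
    # The hand-rolled arithmetic should agree with NFC normalization.
    print(compose_syllable(*decomposed) == unicodedata.normalize("NFC", decomposed))
```

Only the modern jamo take part in this mapping; archaic jamo and the Extended-A/B blocks have no precomposed form and are rendered by conjoining alone.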

There are also component Jamos:

Hangul Jamo (U+1100..U+11FF).

U+1100..U+115F Choseong (initial, Leading Consonants) have
East_Asian_Width=Wide and Hangul_Syllable_Type=Leading_Jamo
U+1160..U+11A7 Jungseong (medial, Vowels) have
East_Asian_Width=Neutral and Hangul_Syllable_Type=Vowel_Jamo
U+11A8..U+11FF Jongseong (final, Trailing consonants) have
East_Asian_Width=Neutral and Hangul_Syllable_Type=Trailing_Jamo

U+A960..U+A97F Hangul Jamo Extended-A (choseong) have East_Asian_Width=Wide
U+D7B0..U+D7FF Hangul Jamo Extended-B (jungseong and jongseong) have
East_Asian_Width=Neutral
U+3130..U+318F Hangul Compatibility Jamo have no conjoining behavior
U+FFA0..U+FFDF half-width forms have no conjoining behavior.
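
The East_Asian_Width values listed above can be spot-checked against the Unicode database bundled with Python. This is only a sanity-check sketch: `EXPECTED` and `mismatches` are names invented here, the result depends on the Unicode version your Python ships, and as far as I know the standard `unicodedata` module does not expose Hangul_Syllable_Type, so only the width property is checked.

```python
import unicodedata

# (code point, East_Asian_Width value as stated in the list above)
EXPECTED = [
    (0x1100, "W"),  # first choseong
    (0x115F, "W"),  # last choseong (choseong filler)
    (0x1160, "N"),  # first jungseong (jungseong filler)
    (0x11A8, "N"),  # first jongseong
    (0xA960, "W"),  # Hangul Jamo Extended-A
    (0xD7B0, "N"),  # Hangul Jamo Extended-B
]


def mismatches(expected):
    """Return (code point, expected, actual) for every entry that disagrees."""
    result = []
    for cp, want in expected:
        got = unicodedata.east_asian_width(chr(cp))
        if got != want:
            result.append((cp, want, got))
    return result


if __name__ == "__main__":
    for cp, want, got in mismatches(EXPECTED):
        print(f"U+{cp:04X}: expected {want}, database says {got}")
```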

U+1100..U+11FF, U+A960..U+A97F, and U+D7B0..U+D7FF have conjoining
behavior: a sequence of L+V+T* gets rendered as a single syllable block.
wcwidth() implementations tend to give U+1100..U+115F width 2 and
U+1160..U+11FF width 0, so the resulting syllable block has the
correct total width.

U+D7B0..U+D7FF should also have width 0.
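
To make the proposal concrete, here is a small Python sketch of the width assignment described above. It is not widecharwidth's generated table: `WIDTH_RANGES`, `hangul_width`, and `jamo_string_width` are illustrative names, and the ranges are copied from this issue (whole blocks, including currently unassigned code points).

```python
# (first, last, width) for the Hangul ranges discussed in this issue.
WIDTH_RANGES = [
    (0x1100, 0x115F, 2),  # choseong: carries the width of the whole block
    (0x1160, 0x11FF, 0),  # jungseong and jongseong: conjoin, add no width
    (0xA960, 0xA97F, 2),  # Extended-A choseong
    (0xAC00, 0xD7A3, 2),  # precomposed syllables
    (0xD7B0, 0xD7FF, 0),  # Extended-B jungseong and jongseong
]


def hangul_width(ch):
    """Return the proposed width, or None if ch is outside the ranges above."""
    cp = ord(ch)
    for first, last, width in WIDTH_RANGES:
        if first <= cp <= last:
            return width
    return None


def jamo_string_width(text):
    """Sum the proposed widths; reject anything this sketch does not cover."""
    total = 0
    for ch in text:
        width = hangul_width(ch)
        if width is None:
            raise ValueError(f"U+{ord(ch):04X} is outside the sketch's ranges")
        total += width
    return total


if __name__ == "__main__":
    # A decomposed L+V+T sequence and its precomposed form get the same width.
    print(jamo_string_width("\u1112\u1161\u11ab"), jamo_string_width("\ud55c"))
```

With these widths a decomposed L+V+T sequence and the equivalent precomposed syllable both occupy two columns, which is the property the zero-width treatment is meant to preserve.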

glibc gave width 0 to conjoining jungseong and jongseong at:

commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <[email protected]>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog

            [BZ #21750]
            * charmaps/UTF-8: Refresh.

diff --git a/localedata/ChangeLog b/localedata/ChangeLog
index 04ef5ad071..9e05b4a652 100644
--- a/localedata/ChangeLog
+++ b/localedata/ChangeLog
@@ -1,3 +1,17 @@
+2017-07-14  Thorsten Glaser  <[email protected]>
+
+       [BZ #21750]
+       * charmaps/UTF-8: Refresh.
+       * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
+       * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
+       * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
+       * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
+       * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.
+       [BZ #19852]
+       * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
+       UnicodeData lines so the latter have precedence; remove hack
+       to group output by EastAsianWidth ranges.
+

[ ... snip ...]

commit 6e540caa21616d5ec5511fafb22819204525138e
Author: Mike FABIAN <[email protected]>
Date:   Tue Jun 16 08:29:40 2020 +0200

    Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]

    Reviewed-by: Carlos O'Donell <[email protected]>

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0

Whitespace?

Right now U+0020 gives widechar_nonprint (in the JS version, with the PR applied so that it runs at all), which would become "0" in the example code. It might be a good idea to have a dedicated return value for whitespace, so callers know when to use width 1. (For most "western" spaces, even the em quads, width 1 seems to be the convention in monospace fonts.)
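
Until something like that exists, a caller could special-case spaces before consulting the table. The Python sketch below is a caller-side workaround under my own assumptions, not part of widecharwidth: `space_width` is a made-up helper that treats every space separator (general category Zs) as one column, except East Asian wide or fullwidth ones such as U+3000, which get two.

```python
import unicodedata


def space_width(ch):
    """Return 1 or 2 for a Unicode space separator (category Zs), else None."""
    if unicodedata.category(ch) != "Zs":
        return None  # not a space separator; the caller falls back to its table
    # Wide/fullwidth spaces such as U+3000 IDEOGRAPHIC SPACE take two columns.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


if __name__ == "__main__":
    for ch in (" ", "\u2001", "\u3000", "\t", "a"):
        print(f"U+{ord(ch):04X}", space_width(ch))
```

Tabs and newlines are control characters rather than Zs, so they fall through to whatever the caller does for non-printing characters.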

Additional (negative return value) classification needed for "Non-characters"?

Is there any merit in detecting, and returning a further special value for, a code point passed to widechar_wcwidth(...) that is formally declared to be a "non-character"? These will obviously never be assigned, but they are distinct from the "unassigned" code points indicated by a widechar_unassigned return value.

Or, given that the suggested width for those is 0, is this an implementation detail that this project should not need to cover?

Erroneous comment!

widechar_non_character = -7 // The character is a noncharacter (e.g. a surrogate).

These ranges of characters have nothing to do with surrogates (the means by which UTF-16 conveys Unicode code points beyond the BMP, the Basic Multilingual Plane, as pairs of 16-bit code units). UTF-16, just like UTF-8, has been a variable-length encoding since somewhere around Unicode 3.0 IIRC, which causes problems for coders (and some languages and libraries) that failed to remember this and think that 16 bits are enough to store a whole Unicode code point.

Instead, they are code points that must be treated as valid data but that the Unicode Consortium will never assign, so an application can use them internally...
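
The distinction is easy to state in code. In the Python sketch below the two predicates are illustrative helpers, not functions from this project; the ranges are the ones the Unicode Standard defines: noncharacters are U+FDD0..U+FDEF plus the last two code points of every plane, while surrogates are U+D800..U+DFFF.

```python
MAX_CODE_POINT = 0x10FFFF


def is_surrogate(cp):
    """True for the UTF-16 surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    """True for the 66 code points permanently reserved as noncharacters."""
    if not 0 <= cp <= MAX_CODE_POINT:
        return False
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    all_cps = range(MAX_CODE_POINT + 1)
    print(sum(map(is_noncharacter, all_cps)))  # number of noncharacters
    print(any(is_surrogate(cp) and is_noncharacter(cp) for cp in all_cps))
```

The two sets do not overlap, which is why "e.g. a surrogate" is a misleading example for widechar_non_character.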

Slow on ASCII chars. Maybe check them first?

Soo... I made fish_wcwidth a ton faster by adding this to the beginning:

    // Normal ASCII characters.
    // These are the most common case, so it's worth checking them first.
    if (wc < 0x7f && wc >= 0x20) return 1;

I would wager that, given how prevalent these characters are in any machine-generated text (source code, but also HTML, JSON, et al.), it's well worth checking them first, as opposed to returning 1 after checking everything else.
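
To show how much work the early return skips, here is a Python sketch of the same ordering as the C snippet above. `slow_width` is only a rough stand-in built on `unicodedata`, not widecharwidth's real table lookup, and `width` and `count_slow_lookups` are made-up names; the sketch counts how often the slow path is reached.

```python
import unicodedata


def slow_width(ch):
    """Rough stand-in for a full table lookup (NOT widecharwidth's tables)."""
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1  # control character, reported as non-printing
    if category in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def width(ch, stats):
    """Check printable ASCII first; fall back to the slow lookup otherwise."""
    cp = ord(ch)
    if 0x20 <= cp < 0x7F:
        return 1  # fast path: same bounds as the C snippet above
    stats["slow"] += 1
    return slow_width(ch)


def count_slow_lookups(text):
    """Return (total width, number of characters that needed the slow path)."""
    stats = {"slow": 0}
    total = sum(width(ch, stats) for ch in text)
    return total, stats["slow"]


if __name__ == "__main__":
    print(count_slow_lookups('print("\ud55c\uae00")'))
```

For mostly-ASCII input such as source code, nearly every character returns before any table is touched.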

Suggestion: drop output down to a single column to minimise git noise

table_columns = 3

Might I suggest that there may be an advantage to making this value one? This would have the eventual effect (after two iterations) of reducing the size of any git commit or diff between two versions, so that only ranges that ARE different get picked out as being different. Currently a change of just one character added or removed (as a singleton) can produce a sizeable diff. I mean, fancy trying to see where the changes are in this part of the unassigned character ranges:
image

I suppose a branch parallel to master with this sole change is another possibility...
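
The effect is easy to reproduce with a toy table. The Python sketch below does not use the project's generator: `format_table` and `changed_lines` are hypothetical helpers that render the same list of ranges at a given column count and count the lines a unified diff would mark after one range is inserted.

```python
import difflib


def format_table(ranges, columns):
    """Render (lo, hi) ranges as initializer rows, `columns` entries per line."""
    cells = [f"{{0x{lo:05X}, 0x{hi:05X}}}" for lo, hi in ranges]
    return [", ".join(cells[i:i + columns]) + ","
            for i in range(0, len(cells), columns)]


def changed_lines(old, new, columns):
    """Count the added and removed lines in a unified diff of the two tables."""
    diff = difflib.unified_diff(format_table(old, columns),
                                format_table(new, columns),
                                lineterm="", n=0)
    return sum(1 for line in diff
               if line[:1] in ("+", "-") and not line.startswith(("+++", "---")))


if __name__ == "__main__":
    old = [(i * 0x10, i * 0x10 + 5) for i in range(12)]
    new = sorted(old + [(0x8, 0x9)])  # one new singleton-style range near the top
    for columns in (3, 1):
        print(columns, changed_lines(old, new, columns))
```

With three columns, every row after the insertion point is reflowed and shows up as changed; with one column, only the new row appears in the diff.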
