ridiculousfish / widecharwidth


public domain wcwidth implementation

License: Other

Makefile 0.27% Python 24.87% C++ 19.62% JavaScript 16.49% Rust 19.50% Java 19.25%

widecharwidth's People

faho slysven fluffos doytsujin wez krobelus loee vadi2 silverhook artoria2e5

widecharwidth's Issues

Non-characters are not specifically identified as such

Could we have another special return value added please for the 66 Unicode Non-characters?

See the second section of https://www.unicode.org/faq/private_use.html. As it stands, those code points are valid Unicode code points (they may be used internally by an application). Although they share the General_Category value Cn (Unassigned) with the unassigned reserved code points in the standard, they have a different interpretation.
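For reference, the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. A minimal sketch of a check for them follows; the function name is hypothetical and is not part of widecharwidth.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # 32 contiguous noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes (U+nFFFE, U+nFFFF).
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# Count them across the whole code space as a sanity check.
NONCHARACTER_COUNT = sum(is_noncharacter(cp) for cp in range(0x110000))
```

A generator could use a predicate like this to split these code points out of the unassigned (Cn) ranges before emitting its tables.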

Hangul Jamo vowels and trailing consonants should probably be 0 width

I've been looking at widths reported for Hangul Jamo in wcwidth implementations.

In glibc and MirBSD xterm, U+1160..U+11FF and U+D7B0..U+D7FF have 0 width.

In xterm/ncurses, glib (g_unichar_iszerowidth), and Rust's
unicode-width, U+1160..U+11FF have 0 width.

Konsole had U+1160..U+11FF with 0 width until October 2018, but moving
from a wcwidth() based on the Markus Kuhn one to one generated from
Unicode datafiles caused it to return width 1
(https://bugs.kde.org/show_bug.cgi?id=396435#c21).

musl, libunistring, and vim/NeoVim seem to know
nothing about Hangul Jamo and return width 1.

Some context follows:

Korean Hangul is a writing system which uses syllable blocks
consisting of alphabetic components. A syllable consists of one or
more Leading Consonants, one or more Vowels, and zero or more trailing
consonants.

Unicode has precomposed syllable blocks at U+AC00..U+D7A3 (11172 code points).

There are also component Jamos:

Hangul Jamo (U+1100..U+11FF).

U+1100..U+115F Choseong (initial, Leading Consonants) have
    East_Asian_Width=Wide and Hangul_Syllable_Type=Leading_Jamo
U+1160..U+11A7 Jungseong (medial, Vowels) have
    East_Asian_Width=Neutral and Hangul_Syllable_Type=Vowel_Jamo
U+11A8..U+11FF Jongseong (final, Trailing consonants) have
    East_Asian_Width=Neutral and Hangul_Syllable_Type=Trailing_Jamo

U+A960..U+A97F Hangul Jamo Extended-A (choseong) have East_Asian_Width=Wide
U+D7B0..U+D7FF Hangul Jamo Extended-B (jungseong and jongseong) have
    East_Asian_Width=Neutral
U+3130..U+318F Hangul Compatibility Jamo have no conjoining behavior
U+FFA0..U+FFDF half-width forms have no conjoining behavior

U+1100..U+11FF, U+A960..U+A97F, and U+D7B0..U+D7FF have conjoining
behavior: a sequence of L+V+T* gets rendered as a syllable block.
wcwidth() implementations tend to give U+1100..U+115F width 2 and
U+1160..U+11FF width 0, so the resulting syllable block has the
correct total width.

U+D7B0..U+D7FF should also have width 0.
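To illustrate the width arithmetic, here is a minimal sketch that assigns width 2 to conjoining choseong and width 0 to conjoining jungseong and jongseong, using the ranges listed above. The function names are hypothetical, and this is not how any of the libraries mentioned implement it.

```python
def jamo_width(cp: int) -> int:
    """Proposed cell width for a conjoining Hangul Jamo code point."""
    # Leading consonants (choseong), including Extended-A: Wide.
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97F:
        return 2
    # Vowels and trailing consonants, including Extended-B: zero width,
    # because they combine into the block started by the choseong.
    if 0x1160 <= cp <= 0x11FF or 0xD7B0 <= cp <= 0xD7FF:
        return 0
    raise ValueError("not a conjoining Hangul Jamo: U+%04X" % cp)


def jamo_sequence_width(text: str) -> int:
    """Sum the per-code-point widths of an L+V+T* sequence."""
    return sum(jamo_width(ord(ch)) for ch in text)


# U+1112 (choseong) + U+1161 (jungseong) + U+11AB (jongseong)
# render as a single syllable block occupying two cells.
HAN_DECOMPOSED = "\u1112\u1161\u11ab"
```

With these values the decomposed sequence gets the same total width as a precomposed syllable from U+AC00..U+D7A3.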

glibc gave width 0 to conjoining jungseong and jongseong at:

commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <[email protected]>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog

            [BZ #21750]
            * charmaps/UTF-8: Refresh.

diff --git a/localedata/ChangeLog b/localedata/ChangeLog
index 04ef5ad071..9e05b4a652 100644
--- a/localedata/ChangeLog
+++ b/localedata/ChangeLog
@@ -1,3 +1,17 @@
+2017-07-14  Thorsten Glaser  <[email protected]>
+
+       [BZ #21750]
+       * charmaps/UTF-8: Refresh.
+       * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
+       * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
+       * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
+       * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
+       * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.
+       [BZ #19852]
+       * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
+       UnicodeData lines so the latter have precedence; remove hack
+       to group output by EastAsianWidth ranges.
+

[ ... snip ...]

commit 6e540caa21616d5ec5511fafb22819204525138e
Author: Mike FABIAN <[email protected]>
Date:   Tue Jun 16 08:29:40 2020 +0200

    Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]

    Reviewed-by: Carlos O'Donell <[email protected]>

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0

JS array defs should not have [] after ident

Right now we get:

/* Non-characters. */
const widechar_nonchar_table[] = [

This is a syntax error in JavaScript.
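One way the generator could handle this is to pick the declaration syntax per output language. The sketch below is only an illustration: the helper name and the exact C++ declaration text are assumptions, not code taken from generate.py.

```python
# Hypothetical per-language templates for opening a range table.
# JavaScript arrays are declared without "[]" after the identifier.
TABLE_OPEN = {
    "cpp": "static const struct widechar_range {name}[] = {{",
    "js": "const {name} = [",
}


def table_open(lang: str, name: str, comment: str) -> str:
    """Return the comment plus the opening line of a table declaration."""
    return "/* %s */\n%s" % (comment, TABLE_OPEN[lang].format(name=name))


JS_DECL = table_open("js", "widechar_nonchar_table", "Non-characters.")
CPP_DECL = table_open("cpp", "widechar_nonchar_table", "Non-characters.")
```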

Whitespace?

Right now U+0020 gives widechar_nonprint (in the JS version, with the PR that at least makes it run), which would become "0" in the example code. It might be a good idea to have a return value for whitespace, so we know when to use 1. (For most "western" spaces, even the EM quads, width 1 seems to be the convention in monospace fonts.)
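As a rough sketch of what such a whitespace rule could look like, the following uses Python's standard unicodedata module: space separators (General_Category Zs) get width 1 unless they are East Asian Wide or Fullwidth, such as U+3000. The function name and the rule itself are assumptions for illustration, not the project's behavior.

```python
import unicodedata


def space_width(ch: str):
    """Return a cell width for space separators, or None for anything else."""
    if unicodedata.category(ch) != "Zs":
        return None  # not a space separator; use the normal lookup
    # IDEOGRAPHIC SPACE (U+3000) is Fullwidth and takes two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1
```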

Shield emoji width regression

The 🛡 emoji before, generated on 2020-03-21:

Width of 🛡 is reported as 2.

After, generated on 2021-04-17:

Width of 🛡 is reported as 1, and makes the text overlap.

The only change is an update to the widechar_width file.

Sorry we didn't mention it earlier - things were hectic at the time.

Additional (negative return value) classification needed for "Non-characters"?

Is there any merit in detecting and returning a further special value when widechar_wcwidth(...) is called with a code point that is formally declared to be a "non-character"? These will obviously never be assigned, but they are distinct from the "unassigned" code points indicated by a widechar_unassigned return value.

Or, given that the suggested width for those is 0, is this an implementation detail that this project need not cover?

Erroneous comment!

widecharwidth/widechar_width.h

Line 29 in 5aeade7

    
           widechar_non_character = -7 // The character is a noncharacter (e.g. a surrogate).

These ranges of characters have nothing to do with surrogates (the means by which UTF-16 conveys Unicode code points beyond the BMP, the Basic Multilingual Plane, as pairs of 16-bit code units). UTF-16, just like UTF-8, has been a variable-length encoding since somewhere around Unicode 3.0 IIRC, which causes problems for coders (and some languages and libraries) that forget this and assume 16 bits are enough to store any Unicode code point.

Instead, they are code points that must be treated as valid data but will never be assigned by the Unicode Consortium, so they can be used internally by an application...
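To make the distinction concrete, here is a small sketch of how UTF-16 splits a code point beyond the BMP into a surrogate pair, and a check showing that a noncharacter such as U+FDD0 is not in the surrogate range. The function names are made up for this example.

```python
def is_surrogate(cp: int) -> bool:
    """Surrogates occupy U+D800..U+DFFF and only exist to serve UTF-16."""
    return 0xD800 <= cp <= 0xDFFF


def utf16_units(cp: int) -> list:
    """Encode one code point as a list of UTF-16 code units."""
    if is_surrogate(cp) or not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        return [cp]  # BMP code points fit in a single 16-bit unit
    v = cp - 0x10000  # 20 bits, split into two 10-bit halves
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


SHIELD_UNITS = utf16_units(0x1F6E1)  # the shield emoji needs two units
```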

Slow on ASCII chars. Maybe check them first?

Soo... I made fish_wcwidth a ton faster by adding this to the beginning:

    // Normal ASCII characters.
    // These are the most common case, so it's worth checking them first.
    if (wc < 0x7f && wc >= 0x20) return 1;

I would wager that, given how prevalent these characters are in any machine-generated text (source code, but also HTML, JSON, et al.), it's well worth checking them first, as opposed to returning 1 after checking everything else.
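The same idea, sketched in Python for illustration: the printable ASCII check runs before any table search. The two-entry table here is a made-up stand-in for the generated tables, and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical stand-in for a generated table of wide ranges (sorted).
WIDE_RANGES = [(0x1100, 0x115F), (0xAC00, 0xD7A3)]
_STARTS = [lo for lo, _hi in WIDE_RANGES]


def table_width(cp: int) -> int:
    """Slow path: binary search over the range table."""
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= WIDE_RANGES[i][1]:
        return 2
    return 1


def width(cp: int) -> int:
    """Fast path for printable ASCII, then fall back to the table."""
    if 0x20 <= cp < 0x7F:
        return 1
    return table_width(cp)
```

The fast path costs two comparisons and skips the search entirely for the most common input.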

Suggestion: drop output down to single column to minimise git noise

widecharwidth/generate.py

Line 227 in 5b7e917

table_columns = 3

Might I suggest that there may be an advantage to setting this value to one? This would have the eventual effect (after two iterations) of reducing the size of any git commit or diff between two versions, so that only the ranges that ARE different show up as changed. Currently a single character added or removed (as a singleton) can produce a sizeable diff. I mean, fancy trying to see where the changes are in this part of the unassigned character ranges:
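A small sketch of why the column count matters for diffs: inserting one range near the start of a table shifts every later entry when rows hold three ranges, but touches only one line when rows hold one. The formatting function and sample ranges are invented for this illustration and are not the generator's actual output code.

```python
def format_table(ranges, columns):
    """Format (lo, hi) ranges as table rows with `columns` entries per row."""
    lines = []
    for i in range(0, len(ranges), columns):
        row = ranges[i:i + columns]
        cells = ", ".join("{0x%05X, 0x%05X}" % r for r in row)
        lines.append("    " + cells + ",")
    return lines


def added_lines(old, new, columns):
    """Count output lines in the new table that were not in the old one."""
    old_lines = set(format_table(old, columns))
    return sum(1 for line in format_table(new, columns) if line not in old_lines)


# Nine made-up ranges, then the same table with one range added in front.
OLD = [(i * 0x10, i * 0x10 + 5) for i in range(1, 10)]
NEW = [(0x0, 0x5)] + OLD
```

In this example `added_lines(OLD, NEW, 3)` counts every row of the new table as changed, while `added_lines(OLD, NEW, 1)` counts only the inserted range.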

I suppose a branch parallel to master with this sole change is another possibility...

Python 3 compatibility

This is apparently not compatible with Python 3, which would be nice to have, considering that Python 2 is on its way out (https://pythonclock.org/).

Update emoji-data.txt version...

It seems that

widecharwidth/generate.py

Line 23 in 18b0d81

EMOJI_DATA_URL = 'https://unicode.org/Public/emoji/5.0/emoji-data.txt'

refers to version 5.0 of this file ( https://unicode.org/Public/emoji/5.0/emoji-data.txt ), yet there is a later one at ( https://unicode.org/Public/emoji/12.0/emoji-data.txt ), so perhaps it is time to update that line?
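One option would be to keep the version in a single constant so that future bumps are a one-word change. This is only a sketch: the helper and constant names are hypothetical, and it assumes other versions live at the same URL layout as the 5.0 and 12.0 files above, which may not hold for every release.

```python
# Hypothetical: version kept separately from the URL pattern.
EMOJI_VERSION = "12.0"


def emoji_data_url(version: str) -> str:
    """Build the emoji-data.txt URL for a given emoji version."""
    return "https://unicode.org/Public/emoji/%s/emoji-data.txt" % version


EMOJI_DATA_URL = emoji_data_url(EMOJI_VERSION)
```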
