ridiculousfish / widecharwidth
public domain wcwidth implementation
License: Other
Could we please have another special return value added for the 66 Unicode noncharacters?
See the second section of https://www.unicode.org/faq/private_use.html. As it stands, those code points are valid Unicode code points (they may be used internally by an application). Although they share the General_Category value Cn (Unassigned) with the unassigned reserved code points in the standard, they have a different interpretation.
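As a rough illustration of what such a check involves, here is a minimal Python sketch. The function name `is_noncharacter` is hypothetical and not part of widecharwidth; the ranges are the standard ones: the 32 code points U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters.

    Hypothetical helper for illustration; not part of widecharwidth.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # The contiguous block of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    for cp in (0xFDD0, 0xFFFE, 0x10FFFF, 0xFFFD, 0xE000):
        print("U+%04X" % cp, is_noncharacter(cp))
```

A width function could run this check before its "unassigned" lookup and return a distinct sentinel, so callers can tell the two cases apart.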
I've been looking at widths reported for Hangul Jamo in wcwidth implementations.
In glibc and MirBSD xterm, U+1160..U+11FF and U+D7B0..U+D7FF have 0 width.
In xterm/ncurses, glib (g_unichar_iszerowidth), and Rust's
unicode-width, U+1160..U+11FF have 0 width.
Konsole had U+1160..U+11FF with 0 width until October 2018, but moving
from a wcwidth() based on the Markus Kuhn one to one generated from
Unicode datafiles caused it to return width 1
(https://bugs.kde.org/show_bug.cgi?id=396435#c21).
musl, libunistring, and vim/NeoVim seem to know
nothing about Hangul Jamo and return width 1.
Some context follows:
Korean Hangul is a writing system which uses syllable blocks
consisting of alphabetic components. A syllable consists of one or
more Leading Consonants, one or more Vowels, and zero or more trailing
consonants.
Unicode has precomposed syllable blocks at U+AC00..U+D7A3 (11172 code points).
There are also component Jamos:
Hangul Jamo (U+1100..U+11FF).
U+1100..U+115F Choseong (initial, Leading Consonants) have
East_Asian_Width=Wide and Hangul_Syllable_Type=Leading_Jamo
U+1160..U+11A7 Jungseong (medial, Vowels) have
East_Asian_Width=Neutral and Hangul_Syllable_Type=Vowel_Jamo
U+11A8..U+11FF Jongseong (final, Trailing consonants) have
East_Asian_Width=Neutral and Hangul_Syllable_Type=Trailing_Jamo
U+A960..U+A97F Hangul Jamo Extended-A (choseong) have East_Asian_Width=Wide
U+D7B0..U+D7FF Hangul Jamo Extended-B (jungseong and jongseong) have
East_Asian_Width=Neutral
U+3130..U+318F Hangul Compatibility Jamo have no conjoining behavior
U+FFA0..U+FFDF half-width forms have no conjoining behavior.
U+1100..U+11FF, U+A960..U+A97F, U+D7B0..U+D7FF have conjoining
behavior, a sequence of L+V+T* gets rendered as a syllable block.
wcwidth() implementations tend to give U+1100..U+115F width 2 and
U+1160..U+11FF width 0, so that the resulting syllable block has the
correct total width.
By the same reasoning, U+D7B0..U+D7FF should also have width 0.
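The width scheme described above can be sketched in Python as follows. This is only an illustration under the ranges quoted in this report; the names `hangul_cell_width` and `hangul_text_width` are hypothetical and do not come from any of the libraries mentioned.

```python
from typing import Optional


def hangul_cell_width(cp: int) -> Optional[int]:
    """Width for Hangul code points under the scheme described above.

    Returns None for code points outside the Hangul ranges listed here.
    """
    # Leading consonants (choseong): occupy the full double-width cell.
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97F:
        return 2
    # Vowels and trailing consonants (jungseong, jongseong): zero width,
    # because they are drawn inside the leading consonant's cell.
    if 0x1160 <= cp <= 0x11FF or 0xD7B0 <= cp <= 0xD7FF:
        return 0
    # Precomposed syllable blocks.
    if 0xAC00 <= cp <= 0xD7A3:
        return 2
    return None


def hangul_text_width(text: str) -> int:
    """Sum the widths of a string made only of the Hangul ranges above."""
    total = 0
    for ch in text:
        w = hangul_cell_width(ord(ch))
        if w is None:
            raise ValueError("not covered by this sketch: U+%04X" % ord(ch))
        total += w
    return total


if __name__ == "__main__":
    # The same syllable, decomposed (L + V + T) and precomposed.
    print(hangul_text_width("\u1112\u1161\u11ab"))
    print(hangul_text_width("\ud55c"))
```

With these assignments a decomposed L+V+T sequence and its precomposed syllable both come out at two columns, which is the property the zero-width treatment is meant to preserve.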
glibc gave width 0 to conjoining jungseong and jongseong at:
commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <[email protected]>
Date: Fri Jul 14 14:02:50 2017 +0200
Refresh generated charmap data and ChangeLog
[BZ #21750]
* charmaps/UTF-8: Refresh.
diff --git a/localedata/ChangeLog b/localedata/ChangeLog
index 04ef5ad071..9e05b4a652 100644
--- a/localedata/ChangeLog
+++ b/localedata/ChangeLog
@@ -1,3 +1,17 @@
+2017-07-14 Thorsten Glaser <[email protected]>
+
+ [BZ #21750]
+ * charmaps/UTF-8: Refresh.
+ * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
+ * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
+ * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
+ * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
+ * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.
+ [BZ #19852]
+ * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
+ UnicodeData lines so the latter have precedence; remove hack
+ to group output by EastAsianWidth ranges.
+
[ ... snip ...]
commit 6e540caa21616d5ec5511fafb22819204525138e
Author: Mike FABIAN <[email protected]>
Date: Tue Jun 16 08:29:40 2020 +0200
Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to
0 [BZ #26120]
Reviewed-by: Carlos O'Donell <[email protected]>
diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
<UABE8> 0
<UABED> 0
<UAC00>...<UD7A3> 2
+<UD7B0>...<UD7C6> 0
+<UD7CB>...<UD7FB> 0
<UF900>...<UFA6D> 2
<UFA70>...<UFAD9> 2
<UFB1E> 0
Right now we get:
/* Non-characters. */
const widechar_nonchar_table[] = [
Which is an error.
Right now U+0020 gives widechar_nonprint
(in the JS version, with the PR that makes it run at least), which would become "0" in the example code. It might be a good idea to have a return value for whitespace, so we know when to use 1. (For most "western" spaces, even the EM quads, a width of 1 seems to be the convention in monospace fonts.)
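One possible way to decide when a space should count as one column is sketched below in Python, using the standard `unicodedata` module. The helper name `space_width` and the rule itself (category Zs gets width 1, unless East_Asian_Width is Wide or Fullwidth) are assumptions made for illustration, not behavior of widecharwidth.

```python
import unicodedata
from typing import Optional


def space_width(cp: int) -> Optional[int]:
    """Return a column width for space separators, or None otherwise.

    Hypothetical helper: treats General_Category Zs as printable space.
    """
    ch = chr(cp)
    if unicodedata.category(ch) != "Zs":
        return None
    # IDEOGRAPHIC SPACE (U+3000) is Fullwidth, so it takes two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


if __name__ == "__main__":
    for cp in (0x0020, 0x2003, 0x3000, 0x0061):
        print("U+%04X" % cp, space_width(cp))
```

A caller could consult this before falling back to the general table, so that U+0020 never ends up reported as non-printable.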
The ๐ก emoji before, generated on 2020-03-21: width of 2.
After, generated on 2021-04-17: the width of ๐ก is reported as 1, which makes the text overlap.
The only change is an update to widechar_width file.
Sorry we didn't mention it earlier - things were hectic at the time.
Is there any merit in detecting and returning a further special value when the widechar_wcwidth(...) function is called with a code point that is formally declared to be a "non-character"? These are never going to be assigned, but they are distinct from the "unassigned" code points indicated by a widechar_unassigned return value.
Or, given that the suggested width to use for those is 0, is this an implementation detail that this project should not need to cover?
widecharwidth/widechar_width.h
Line 29 in 5aeade7
These ranges of characters have nothing to do with surrogates (the means by which UTF-16 conveys Unicode code points beyond the BMP, the Basic Multilingual Plane, as pairs of 16-bit code units). UTF-16, just like UTF-8, has been a variable-length encoding since somewhere around Unicode 3.0 IIRC, which causes problems for coders (and some programming languages and libraries) that fail to remember this and assume that 16 bits are enough to store a whole Unicode code point.
Instead, they are code points that must be treated as valid data but will never be assigned by the Unicode Consortium, so they can be used internally by an application...
Soo... I made fish_wcwidth a ton faster by adding this to the beginning:
// Normal ASCII characters.
// These are the most common case, so it's worth checking them first.
if (wc < 0x7f && wc >= 0x20) return 1;
I would wager that, given how prevalent these characters are in any machine-generated text (source code, but also HTML, JSON et al.), it's quite worth it to check them first, as opposed to returning 1 after checking everything else.
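The control flow of that fast path can be sketched in Python as below. It only shows the ordering of the checks; the speed-up reported above was measured in fish's C++ code, and the names `with_ascii_fast_path` and `table_lookup` are hypothetical stand-ins.

```python
from typing import Callable, List

calls: List[int] = []  # records which code points reach the slow path


def table_lookup(cp: int) -> int:
    """Hypothetical stand-in for the full table-driven width lookup."""
    calls.append(cp)
    return 2 if 0xAC00 <= cp <= 0xD7A3 else 1


def with_ascii_fast_path(slow_width: Callable[[int], int]) -> Callable[[int], int]:
    """Wrap a width function so printable ASCII is answered immediately."""
    def width(cp: int) -> int:
        # Printable ASCII (U+0020..U+007E) is the most common case.
        if 0x20 <= cp < 0x7F:
            return 1
        return slow_width(cp)
    return width


fast_width = with_ascii_fast_path(table_lookup)

if __name__ == "__main__":
    print(fast_width(ord("a")), fast_width(0xD55C), calls)
```

Only code points outside U+0020..U+007E ever reach the table lookup, which is where the saving comes from on mostly-ASCII input.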
Line 227 in 5b7e917
Might I suggest that there may be an advantage to making this value one? This would have the eventual effect (after two iterations) of reducing the size of any git commit or diff between two versions, so that only ranges that actually differ show up as changed. Currently, adding or removing a single character (as a singleton) can produce a sizeable difference. I mean, fancy trying to see where the changes are in this part of the unassigned character ranges:
I suppose a branch parallel to master with this sole change is another possibility...
This thing is apparently not compatible with Python 3. Considering that Python 2 is on its way out (https://pythonclock.org/), Python 3 support would be nice to have.
It seems that
Line 23 in 18b0d81