
Comments (11)

sharkdp commented on May 21, 2024 2

@kilobyte & @lo48576 Thank you very much for the detailed explanations! In order to keep things simple, I'd like to go with the --ascii option that @lo48576 suggested.

Other opinions?

from hexyl.

lo48576 commented on May 21, 2024 1

@kilobyte In the old days, yes, they were full width because they used multiple bytes.
But that is not the only reason they are full width.
Some characters (especially punctuation marks) should be full width in (at least) Japanese environments, and they should not be forced to half width for every language in the world.
I don't think this is a defect of terminals, though we do suffer from it.
An appropriate EAW setting is necessary for many users.

EAW (East Asian Width)

EAW characters are characters which "should be rendered as full width (double width) in a CJK context, but as half width (single width) in other contexts".
This includes Greek letters (like α), but not only alphabetic characters.
Many non-alphabetic symbols are included too, such as ruled lines (─), the times symbol ×, the ellipsis …, and many others.
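
As a rough illustration of the property described above, Python's standard `unicodedata` module reports the East Asian Width class of a character. This is only a sketch and has nothing to do with hexyl's own code; the helper name `eaw_class` is made up for the example.

```python
import unicodedata


def eaw_class(ch: str) -> str:
    """Return the East Asian Width class of a single character.

    Possible values are "A" (ambiguous), "N" (neutral), "Na" (narrow),
    "H" (halfwidth), "W" (wide) and "F" (fullwidth).
    """
    return unicodedata.east_asian_width(ch)


if __name__ == "__main__":
    # Characters mentioned in this thread, plus two for comparison.
    for ch in ["α", "×", "…", "─", "x", "漢"]:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {eaw_class(ch)}")
```

Characters reported as "A" are the ones whose width depends on the environment.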

wcwidth

It is hard to detect the context in which EAW characters are used, so locale information (usually specified by $LANG) is normally used instead.
Each locale has a corresponding charmap (in /usr/share/i18n/charmaps/), and the two are connected by the /etc/locale.gen file (on Linux).

wcwidth consults the charmap database for the current locale (usually $LANG) and returns the character width.
So, for example, the width of α can be 1 in some locales and environments but 2 in others.

full width font and terminal

TR11 recommends rendering alphabetic characters as half (single) width, but I think this does not apply to non-alphabetic symbols (typically punctuation marks: they should absolutely be rendered as full (double) width in CJK, while they are rendered as half width in non-CJK areas).

The generic UTF-8 locale settings provided by glibc return half width for EAW characters (because, I think, most users and developers live in non-CJK areas), but users can change this, and doing so is completely legitimate.
So a terminal can use full width for EAW characters without having any bug, and we cannot say it is the terminal's problem.

SSH and wcwidth

The charmap database is consulted by glibc (or something like that), so if users run apps on a server over SSH, the server's locale and charmap database are used.
This can be problematic in some cases, for example: "I use a locale and charmap that specify full width for EAW characters, but my server uses the C locale, which uses half width for EAW characters".

Problem

Hexyl uses some EAW characters (as far as I know: all ruled lines, ×, and •, but there might be more).
They are full (double) width in some environments, but hexyl always treats them as half (single) width, so the layout breaks.
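
To make the breakage concrete, here is a small sketch that estimates how many terminal cells a line occupies when ambiguous characters are drawn narrow versus wide. The `ambiguous_wide` flag stands in for the terminal's behaviour and is an assumption of the sketch, since a program cannot reliably query it. The two strings are invented fragments in the style of a hex viewer's border and data row, not actual hexyl output, and combining or control characters are ignored.

```python
import unicodedata


def char_width(ch: str, ambiguous_wide: bool) -> int:
    """Approximate cell width of one character under a given assumption."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A":
        # The terminal decides; we can only model both possibilities.
        return 2 if ambiguous_wide else 1
    return 1


def line_width(line: str, ambiguous_wide: bool) -> int:
    """Sum of the approximate cell widths of all characters in a line."""
    return sum(char_width(ch, ambiguous_wide) for ch in line)


border = "┌────────┬"  # hypothetical border fragment, all box-drawing characters
row = "│00000000│"     # hypothetical data row, box-drawing only at the edges

if __name__ == "__main__":
    for wide in (False, True):
        print(f"ambiguous_wide={wide}: border={line_width(border, wide)}, "
              f"row={line_width(row, wide)}")
```

When ambiguous characters are narrow, both lines have the same width. When they are wide, the border grows much more than the row, which is the misalignment described above.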

How to solve

Make symbols customizable

IMHO, the best option is to make some special symbols customizable.
In this case, users can modify config files to use +-| as ruled lines, x instead of ×, and . instead of •.
This might be useful for EAW users, or for users with a poor font or terminal.
(And non-CJK users can keep using good-looking box-drawing characters.)
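
One possible shape for such a customizable symbol set is sketched below. Everything here is hypothetical: the `Symbols` class, its field names, and `load_symbols` are invented for illustration, and hexyl has no such config format as far as this thread shows.

```python
from dataclasses import dataclass, replace
from typing import Mapping


@dataclass(frozen=True)
class Symbols:
    """Hypothetical set of glyphs a hex viewer needs for its layout."""
    horizontal: str  # horizontal ruled line
    vertical: str    # vertical ruled line
    junction: str    # corners and crossings
    cross: str       # glyph used where × is printed today
    bullet: str      # glyph used where • is printed today


# Default set with box-drawing characters, and an ASCII-only set.
UNICODE_SYMBOLS = Symbols("─", "│", "┼", "×", "•")
ASCII_SYMBOLS = Symbols("-", "|", "+", "x", ".")


def load_symbols(overrides: Mapping[str, str]) -> Symbols:
    """Start from the Unicode defaults and apply user-supplied overrides."""
    return replace(UNICODE_SYMBOLS, **overrides)
```

A user who only has trouble with × and • could then override just those two fields, for example `load_symbols({"cross": "x", "bullet": "."})`, and keep the box-drawing borders.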

Add "ASCII-only" option

If customizability is not important, a simple --ascii CLI option (or something like that) can be added.
This is less flexible, but about as useful as the first option.

Make hexyl wcwidth-aware

This works for some environments, but won't work as expected for some remote (SSH) environments, as @kilobyte pointed out.


be5invis commented on May 21, 2024 1

@12101111 Some legacy Chinese fonts may use full width for U+00B7.
I agree with @sharkdp's --ascii option. It would be better (on win32) to detect the user's code page with GetACP and turn on ascii under CP 932, 936, 949, and 950.
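
The code page check suggested above could look roughly like the following sketch. `GetACP` is the real Win32 call, reached here through `ctypes`; the function names and the decision to return `None` outside Windows are assumptions made for the example.

```python
import sys

# Code pages named in the comment above: Japanese (932), Simplified Chinese (936),
# Korean (949) and Traditional Chinese (950).
CJK_CODE_PAGES = frozenset({932, 936, 949, 950})


def ascii_recommended(code_page: int) -> bool:
    """True if the given ANSI code page is one of the CJK pages listed above."""
    return code_page in CJK_CODE_PAGES


def current_ansi_code_page():
    """Return the ANSI code page on Windows, or None on other platforms."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily; windll only exists on Windows
    return ctypes.windll.kernel32.GetACP()


def default_ascii_mode() -> bool:
    """Decide whether ASCII output should be on by default (sketch only)."""
    code_page = current_ansi_code_page()
    return code_page is not None and ascii_recommended(code_page)
```

Note that this only reflects the machine the program runs on, so it shares the remote-session limitation discussed elsewhere in the thread.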


kilobyte commented on May 21, 2024

So EastAsianAmbiguous is still a thing in some terminals? That was about as bad an idea as CJK Unification.

I'd argue that your terminal is broken and needs to be fixed. The Unicode standard says:

In modern practice, most alphabetic characters are rendered by variable-width fonts using narrow characters, even if their encoding in common legacy sets uses multiple bytes.

although when I asked them about improving some EAW settings that made no sense, they refused to take a stance, saying the whole EAW database is obsolete and shouldn't be used anymore. They didn't provide a replacement, so I guess it's time to ask for an explicit database. But that would take many months for a draft, a year for a release, then several years to be actually obeyed by terminals.

In the interim, I'd say a tool like hexyl should avoid using any EastAsianAmbiguous characters: running under a CJK/non-CJK locale doesn't say anything about the terminal, as a Japanese person ssh-ing to a company server will have EastAsianAmbiguous=N on the machine running hexyl but EastAsianAmbiguous=W in the terminal. And even if you have ssh sending the locale correctly, there's no such option for serial links or e-mail.

And, there's less than 256 byte values to display, so avoiding such characters is trivial.


sharkdp commented on May 21, 2024

In the interim, I'd say a tool like hexyl should avoid using any EastAsianAmbiguous characters

I'm not really familiar with the details here. What is an "EastAsianAmbiguous character"? × is the "Multiplication sign" character in the "Latin-1 Supplement" block of Unicode and • is the "Bullet" character in the "General Punctuation" block of Unicode. What do either of these have to do with East Asian characters?

I understand that your terminal somehow prints these characters with a width of 2. Could we call wcwidth('×') in hexyl and fall back to another character if it returns 2? (see https://docs.rs/wcwidth/1.0.1/wcwidth/fn.char_width.html ?)


kilobyte commented on May 21, 2024

Alas, it's not so easy — wcwidth() returns 1 for those. And there's nothing you can do on the machine running hexyl — the display depends on the receiving side, possibly years after hexyl was run.

I do consider assuming that EastAsianAmbiguous allows width 2 a defect in the terminal: the whole concept comes from a technical detail of some ancient systems that assumed the width of every character is the same as the number of bytes it takes to encode within that particular legacy encoding. So not only is it compat with something badly obsolete, it's also ambiguous wrt which ancient encoding it's striving to be compatible with. Some of those characters will display as narrow, some as wide, and you have no way to detect that.

There's a database: package "unicode-data", file /usr/share/unicode/EastAsianWidth.txt — anything marked as "A" is dangerous to use as it may exhibit this problem on some terminals. Anything "N", "Na" and "H" is safely narrow, anything "F" and "W" is wide. Here's the official standard.

But, as you need just a few characters, a solution seems trivial: just avoid anything marked as "A"; there's enough good alternatives to choose from.
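
A minimal sketch of reading that database and applying the "avoid A" rule follows. It assumes the usual `EastAsianWidth.txt` layout of `code point or range ; class # comment`, treats unlisted code points as "N", and uses a three-line sample instead of the real file. The function names are invented for the example.

```python
def parse_east_asian_width(lines):
    """Parse EastAsianWidth.txt-style lines into (first, last, class) tuples."""
    ranges = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_points, eaw = (field.strip() for field in line.split(";")[:2])
        first, _, last = code_points.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), eaw))
    return ranges


def lookup(ranges, ch):
    """Return the class of a character; unlisted code points count as "N" here."""
    cp = ord(ch)
    for first, last, eaw in ranges:
        if first <= cp <= last:
            return eaw
    return "N"


def is_safely_narrow(ranges, ch):
    """Apply the rule from the comment above: N, Na and H are safely narrow."""
    return lookup(ranges, ch) in ("N", "Na", "H")


# Small sample in the file's format, used instead of the real database.
SAMPLE = """\
0041..005A;Na    # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00D7;A           # MULTIPLICATION SIGN
2500..254B;A     # BOX DRAWINGS LIGHT HORIZONTAL..
""".splitlines()
```

To use the installed database instead of the sample, the open file object can be passed directly, e.g. `parse_east_asian_width(open("/usr/share/unicode/EastAsianWidth.txt", encoding="utf-8"))`, assuming the package mentioned above is installed.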


sharkdp commented on May 21, 2024

In the output above, there also seems to be a problem with the box drawing characters. I don't think there are reasonably good looking alternatives(?)


12101111 commented on May 21, 2024

Adding an option makes the program more complicated, and CJK users would always have to turn it on.
Maybe changing U+2022 "Bullet" to U+00B7 "Middle dot" would be better?
The middle dot seems to work in both CJK and Western environments (for me).
I tried most Chinese monospaced fonts and found that × and · are half width, while • is full width.
It seems that × is also full width in Japanese fonts. I don't know if there is an alternative to ×.


lo48576 commented on May 21, 2024

@12101111 The problem is not only the bullet and the cross sign, but also the ruled lines...
They are usually half width in Western environments, but full width in CJK ones (at least in almost all Japanese fonts I know).


be5invis commented on May 21, 2024

@lo48576 Actually, the problem is that the command line application cannot know the font that the console is using and decide how many cells would be used to render such symbols under FAREAST environments :(
@sharkdp Always remember: text is hard.


lo48576 commented on May 21, 2024

+1, it would be good if --ascii were enabled automatically in some environments (in the future).
Then the border option should have three modes, --ascii={never,auto,always}, like the --color option of many tools (ls, grep, etc.), I think.
For example, plain hexyl would behave as hexyl --ascii=auto, and hexyl --ascii would behave as hexyl --ascii=always.
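
The three-mode behaviour proposed here can be sketched with Python's `argparse`. This only illustrates the option semantics; hexyl itself is not written in Python, and the program name `hexyl-sketch` is made up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a tri-state --ascii option (sketch only)."""
    parser = argparse.ArgumentParser(prog="hexyl-sketch")
    parser.add_argument(
        "--ascii",
        nargs="?",                            # the value is optional
        choices=["never", "auto", "always"],
        const="always",                       # bare --ascii means always
        default="auto",                       # no flag at all means auto
        help="when to draw borders and placeholders with ASCII only",
    )
    return parser


if __name__ == "__main__":
    for argv in ([], ["--ascii"], ["--ascii=never"]):
        print(argv, "->", build_parser().parse_args(argv).ascii)
```

One design caveat with an optional value: in this sketch a following positional argument would be taken as the mode, so the `--ascii=MODE` form is the unambiguous way to pass a value.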

