boinkor-net / chars


cha(rs) is a commandline tool to display information about unicode characters

Home Page: https://github.com/boinkor-net/chars

License: MIT License

Languages: Rust 99.95%, Makefile 0.02%, Shell 0.03%

Topics: characters, cli, rust, unicode


chars's Issues

`cargo +nightly clippy` has many complaints

I just ran cargo +nightly clippy on chars. There were many style nits!

The code base could use some style cleanups and a cargo +nightly clippy travis target (:

Suggestion: Unicode version codepoint was added

I deal with Unicode a fair bit and chars is a handy tool. Sometimes it would be convenient to know which Unicode version assigned a particular codepoint.

For example, the output from chars might look something like this. The version information need not be shown by default; it could require a command-line flag if it were deemed too noisy.

$ chars party
U+0001F973, &#129395; 0x0001F973, \0374563, UTF-8: f0 9f a5 b3, UTF-16BE: d83edd73
Width: 2, prints as 🥳
Quotes as \u{1f973}
Unicode name: FACE WITH PARTY HORN AND PARTY HAT
Unicode version: 11.0

U+0001F389, &#127881; 0x0001F389, \0371611, UTF-8: f0 9f 8e 89, UTF-16BE: d83cdf89
Width: 2, prints as 🎉
Quotes as \u{1f389}
Unicode name: PARTY POPPER
Unicode version: 6.0

I think the information is available via the DerivedAge.txt file in the UCD.
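A rough sketch of how that lookup could work, assuming the `start[..end] ; version # comment` line layout of DerivedAge.txt. The two sample lines are illustrative rather than copied verbatim from the UCD, and the names `AgeEntry`, `parse_age_line` and `version_of` are made up for this example; none of this is taken from the chars code base.

```rust
/// One parsed entry: an inclusive code point range and the version that assigned it.
#[derive(Debug, PartialEq)]
struct AgeEntry {
    start: u32,
    end: u32,
    version: String,
}

/// Parse one line in the `start[..end] ; version # comment` layout.
/// Returns None for blank lines, comment-only lines and malformed lines.
fn parse_age_line(line: &str) -> Option<AgeEntry> {
    // Everything after '#' is a comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let (range, version) = data.split_once(';')?;
    let range = range.trim();
    // A single code point is treated as a one-element range.
    let (start, end) = match range.split_once("..") {
        Some((a, b)) => (a, b),
        None => (range, range),
    };
    Some(AgeEntry {
        start: u32::from_str_radix(start.trim(), 16).ok()?,
        end: u32::from_str_radix(end.trim(), 16).ok()?,
        version: version.trim().to_string(),
    })
}

/// Linear scan; a real table would be sorted and binary-searched.
fn version_of(entries: &[AgeEntry], c: char) -> Option<&str> {
    let cp = c as u32;
    entries
        .iter()
        .find(|e| e.start <= cp && cp <= e.end)
        .map(|e| e.version.as_str())
}

fn main() {
    // Illustrative sample lines (assumed layout, not verbatim UCD data).
    let sample = [
        "# comment line",
        "1F380..1F393  ; 6.0 #  [20] RIBBON..GRADUATION CAP",
        "1F973..1F976  ; 11.0 #   [4] FACE WITH PARTY HORN AND PARTY HAT..FREEZING FACE",
    ];
    let entries: Vec<AgeEntry> = sample.iter().filter_map(|l| parse_age_line(l)).collect();
    for c in ['🥳', '🎉'] {
        match version_of(&entries, c) {
            Some(v) => println!("Unicode version: {}", v),
            None => println!("Unicode version: unknown"),
        }
    }
}
```

In practice the table would probably be generated at build time, the same way the name data is, instead of being parsed at runtime.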

Allow effective searching for flags and other zwj-joined symbols

Turns out we can't find, e.g., the transgender flag (new in unicode 13!) - its codepoints are

U+1F3F3
U+FE0F
U+200D
U+26A7
U+FE0F

...meaning we can only find the constituent codepoints, but not the whole. That's a problem for all kinds of flags, family configurations and other glyphs composed of multiple codepoints.

The sequences have names, so we ought to be able to retrieve them.
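As a sketch of what retrieving them could look like: the code below assumes the `code points ; type ; name # comment` layout that, as far as I can tell, emoji-zwj-sequences.txt uses. The function names `parse_sequence_line` and `find_by_name` are hypothetical and the sample line is illustrative.

```rust
/// Parse `cp cp cp ; type ; name # comment` into (sequence, name).
/// Returns None for comments, blank lines and anything that does not fit the layout.
fn parse_sequence_line(line: &str) -> Option<(String, String)> {
    let data = line.split('#').next()?.trim();
    let mut fields = data.split(';').map(str::trim);
    let code_points = fields.next()?;
    let _type_field = fields.next()?;
    let name = fields.next()?;

    // Join the hex code points into one string holding the whole sequence.
    let mut sequence = String::new();
    for hex in code_points.split_whitespace() {
        let cp = u32::from_str_radix(hex, 16).ok()?;
        sequence.push(char::from_u32(cp)?);
    }
    if sequence.is_empty() {
        return None;
    }
    Some((sequence, name.to_string()))
}

/// Case-insensitive substring search over the sequence names.
fn find_by_name<'a>(table: &'a [(String, String)], query: &str) -> Vec<&'a (String, String)> {
    let query = query.to_lowercase();
    table
        .iter()
        .filter(|(_, name)| name.to_lowercase().contains(&query))
        .collect()
}

fn main() {
    // Illustrative line in the assumed layout.
    let line = "1F3F3 FE0F 200D 26A7 FE0F ; RGI_Emoji_ZWJ_Sequence ; transgender flag # E13.0";
    let table: Vec<(String, String)> = parse_sequence_line(line).into_iter().collect();
    for (sequence, name) in find_by_name(&table, "transgender") {
        let cps: Vec<String> = sequence.chars().map(|c| format!("U+{:04X}", c as u32)).collect();
        println!("{} prints as {} ({})", name, sequence, cps.join(" "));
    }
}
```

The result of a search would then be the whole sequence, with the constituent codepoints listed as detail.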

Searches with many results

For searches like chars arrow (2820 lines!) or chars box (875), the results are not easy to read. It would be nice to have a single-line-per-result mode that outputs U+XXXX, prints as X, Unicode name: XXX.

I see two possible approaches:

Automatically switch to single-line results if more than some number match (more than one?)
Add a command-line argument to enable this mode
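Both approaches could share one formatter. A minimal sketch, where `one_line`, `use_single_line` and the threshold value are placeholders I made up, and the Unicode name is passed in by the caller instead of being looked up:

```rust
/// Hypothetical cut-off: switch to compact output once more than this many results match.
const SINGLE_LINE_THRESHOLD: usize = 1;

/// Render one result in the compact form proposed above.
fn one_line(c: char, name: &str) -> String {
    format!("U+{:04X}, prints as {}, Unicode name: {}", c as u32, c, name)
}

/// Compact mode is on if the (hypothetical) flag was given or many results matched.
fn use_single_line(result_count: usize, flag_given: bool) -> bool {
    flag_given || result_count > SINGLE_LINE_THRESHOLD
}

fn main() {
    let results = [('→', "RIGHTWARDS ARROW"), ('🎉', "PARTY POPPER")];
    if use_single_line(results.len(), false) {
        for (c, name) in results {
            println!("{}", one_line(c, name));
        }
    }
}
```

This sketch pads the codepoint to at least four hex digits, so 🎉 comes out as U+1F389 rather than the U+0001F389 that the full output uses.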

Pull in unicode_names as an internal crate

We appear to be dependent on a very recent unicode_names (or at least a synced-up one). Since that hasn't updated since unicode 8 (and that took a year), maybe we could pull in a slimmed-down version of https://github.com/ProgVal/unicode_names2 and release that as a workspace crate. It does use the same data file as we do, after all!

Maybe the same might apply to the unicode-width crate too, but it's less noticeable for my use case. Let's try unicode_names first.

Suggestion: support `--help` and `--version`

It would be nice if chars supported --help and --version. Most other command-line applications support those options, and when I first tried chars, I expected them to work.
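A minimal sketch of the behaviour I have in mind, using only the standard library. The `Action` enum, the `parse_args` function, the short flags and the usage text are all made up for illustration and are not how chars parses its arguments today.

```rust
#[derive(Debug, PartialEq)]
enum Action {
    Help,
    Version,
    Search(Vec<String>),
}

/// Only the first argument is inspected, so a query can still contain "--help" later on.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Action {
    let args: Vec<String> = args.into_iter().collect();
    let first = args.first().cloned();
    match first.as_deref() {
        Some("--help") | Some("-h") => Action::Help,
        Some("--version") | Some("-V") => Action::Version,
        _ => Action::Search(args),
    }
}

fn main() {
    // skip(1) drops the program name.
    match parse_args(std::env::args().skip(1)) {
        Action::Help => println!("usage: chars [--help] [--version] QUERY..."),
        // CARGO_PKG_VERSION is set by cargo at build time; fall back when built without it.
        Action::Version => println!("chars {}", option_env!("CARGO_PKG_VERSION").unwrap_or("unknown")),
        Action::Search(query) => println!("searching for {:?}", query),
    }
}
```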

Difficulty searching for small triangles

I can see that there are several small triangles that exist:

$ chars 'DOWN-POINTING TRIANGLE'

U+0001F783, 🞃 0x0001F783, \0373603, UTF-8: f0 9f 9e 83, UTF-16BE: d83ddf83
Width: 1, prints as 🞃
Quotes as \u{1f783}
Unicode name: BLACK DOWN-POINTING ISOSCELES RIGHT TRIANGLE

U+0001F53D, 🔽 0x0001F53D, \0372475, UTF-8: f0 9f 94 bd, UTF-16BE: d83ddd3d
Width: 2, prints as 🔽
Quotes as \u{1f53d}
Unicode name: DOWN-POINTING SMALL RED TRIANGLE

U+0001F53B, 🔻 0x0001F53B, \0372473, UTF-8: f0 9f 94 bb, UTF-16BE: d83ddd3b
Width: 2, prints as 🔻
Quotes as \u{1f53b}
Unicode name: DOWN-POINTING RED TRIANGLE

U+2BC6, ⯆ 0x2BC6, \025706, UTF-8: e2 af 86, UTF-16BE: 2bc6
Width: 1, prints as ⯆
Quotes as \u{2bc6}
Unicode name: BLACK MEDIUM DOWN-POINTING TRIANGLE CENTRED

U+29E9, ⧩ 0x29E9, \024751, UTF-8: e2 a7 a9, UTF-16BE: 29e9
Width: 1, prints as ⧩
Quotes as \u{29e9}
Unicode name: DOWN-POINTING TRIANGLE WITH RIGHT HALF BLACK

U+29E8, ⧨ 0x29E8, \024750, UTF-8: e2 a7 a8, UTF-16BE: 29e8
Width: 1, prints as ⧨
Quotes as \u{29e8}
Unicode name: DOWN-POINTING TRIANGLE WITH LEFT HALF BLACK

U+26DB, ⛛ 0x26DB, \023333, UTF-8: e2 9b 9b, UTF-16BE: 26db
Width: 1 (2 in CJK context), prints as ⛛
Quotes as \u{26db}
Unicode name: HEAVY WHITE DOWN-POINTING TRIANGLE

U+25BF, ▿ 0x25BF, \022677, UTF-8: e2 96 bf, UTF-16BE: 25bf
Width: 1, prints as ▿
Quotes as \u{25bf}
Unicode name: WHITE DOWN-POINTING SMALL TRIANGLE

U+25BE, ▾ 0x25BE, \022676, UTF-8: e2 96 be, UTF-16BE: 25be
Width: 1, prints as ▾
Quotes as \u{25be}
Unicode name: BLACK DOWN-POINTING SMALL TRIANGLE

U+25BD, ▽ 0x25BD, \022675, UTF-8: e2 96 bd, UTF-16BE: 25bd
Width: 1 (2 in CJK context), prints as ▽
Quotes as \u{25bd}
Unicode name: WHITE DOWN-POINTING TRIANGLE

U+25BC, ▼ 0x25BC, \022674, UTF-8: e2 96 bc, UTF-16BE: 25bc
Width: 1 (2 in CJK context), prints as ▼
Quotes as \u{25bc}
Unicode name: BLACK DOWN-POINTING TRIANGLE

U+23F7, ⏷ 0x23F7, \021767, UTF-8: e2 8f b7, UTF-16BE: 23f7
Width: 1, prints as ⏷
Quotes as \u{23f7}
Unicode name: BLACK MEDIUM DOWN-POINTING TRIANGLE

U+23EC, ⏬ 0x23EC, \021754, UTF-8: e2 8f ac, UTF-16BE: 23ec
Width: 2, prints as ⏬
Quotes as \u{23ec}
Unicode name: BLACK DOWN-POINTING DOUBLE TRIANGLE

$

But, when I try to look at only the small triangles:

$ chars 'SMALL TRIANGLE'
$

I get nothing. If I search for medium triangles:

$ chars 'MEDIUM TRIANGLE'

U+0001F827, 🠧 0x0001F827, \0374047, UTF-8: f0 9f a0 a7, UTF-16BE: d83edc27
Width: 1, prints as 🠧
Quotes as \u{1f827}
Unicode name: DOWNWARDS TRIANGLE-HEADED ARROW WITH MEDIUM SHAFT

U+0001F826, 🠦 0x0001F826, \0374046, UTF-8: f0 9f a0 a6, UTF-16BE: d83edc26
Width: 1, prints as 🠦
Quotes as \u{1f826}
Unicode name: RIGHTWARDS TRIANGLE-HEADED ARROW WITH MEDIUM SHAFT

U+0001F825, 🠥 0x0001F825, \0374045, UTF-8: f0 9f a0 a5, UTF-16BE: d83edc25
Width: 1, prints as 🠥
Quotes as \u{1f825}
Unicode name: UPWARDS TRIANGLE-HEADED ARROW WITH MEDIUM SHAFT

U+0001F824, 🠤 0x0001F824, \0374044, UTF-8: f0 9f a0 a4, UTF-16BE: d83edc24
Width: 1, prints as 🠤
Quotes as \u{1f824}
Unicode name: LEFTWARDS TRIANGLE-HEADED ARROW WITH MEDIUM SHAFT

U+0001F807, 🠇 0x0001F807, \0374007, UTF-8: f0 9f a0 87, UTF-16BE: d83edc07
Width: 1, prints as 🠇
Quotes as \u{1f807}
Unicode name: DOWNWARDS ARROW WITH MEDIUM TRIANGLE ARROWHEAD

U+0001F806, 🠆 0x0001F806, \0374006, UTF-8: f0 9f a0 86, UTF-16BE: d83edc06
Width: 1, prints as 🠆
Quotes as \u{1f806}
Unicode name: RIGHTWARDS ARROW WITH MEDIUM TRIANGLE ARROWHEAD

U+0001F805, 🠅 0x0001F805, \0374005, UTF-8: f0 9f a0 85, UTF-16BE: d83edc05
Width: 1, prints as 🠅
Quotes as \u{1f805}
Unicode name: UPWARDS ARROW WITH MEDIUM TRIANGLE ARROWHEAD

U+0001F804, 🠄 0x0001F804, \0374004, UTF-8: f0 9f a0 84, UTF-16BE: d83edc04
Width: 1, prints as 🠄
Quotes as \u{1f804}
Unicode name: LEFTWARDS ARROW WITH MEDIUM TRIANGLE ARROWHEAD

U+2BC8, ⯈ 0x2BC8, \025710, UTF-8: e2 af 88, UTF-16BE: 2bc8
Width: 1, prints as ⯈
Quotes as \u{2bc8}
Unicode name: BLACK MEDIUM RIGHT-POINTING TRIANGLE CENTRED

U+2BC7, ⯇ 0x2BC7, \025707, UTF-8: e2 af 87, UTF-16BE: 2bc7
Width: 1, prints as ⯇
Quotes as \u{2bc7}
Unicode name: BLACK MEDIUM LEFT-POINTING TRIANGLE CENTRED

U+2BC6, ⯆ 0x2BC6, \025706, UTF-8: e2 af 86, UTF-16BE: 2bc6
Width: 1, prints as ⯆
Quotes as \u{2bc6}
Unicode name: BLACK MEDIUM DOWN-POINTING TRIANGLE CENTRED

U+2BC5, ⯅ 0x2BC5, \025705, UTF-8: e2 af 85, UTF-16BE: 2bc5
Width: 1, prints as ⯅
Quotes as \u{2bc5}
Unicode name: BLACK MEDIUM UP-POINTING TRIANGLE CENTRED

U+23F7, ⏷ 0x23F7, \021767, UTF-8: e2 8f b7, UTF-16BE: 23f7
Width: 1, prints as ⏷
Quotes as \u{23f7}
Unicode name: BLACK MEDIUM DOWN-POINTING TRIANGLE

U+23F6, ⏶ 0x23F6, \021766, UTF-8: e2 8f b6, UTF-16BE: 23f6
Width: 1, prints as ⏶
Quotes as \u{23f6}
Unicode name: BLACK MEDIUM UP-POINTING TRIANGLE

U+23F5, ⏵ 0x23F5, \021765, UTF-8: e2 8f b5, UTF-16BE: 23f5
Width: 1, prints as ⏵
Quotes as \u{23f5}
Unicode name: BLACK MEDIUM RIGHT-POINTING TRIANGLE

U+23F4, ⏴ 0x23F4, \021764, UTF-8: e2 8f b4, UTF-16BE: 23f4
Width: 1, prints as ⏴
Quotes as \u{23f4}
Unicode name: BLACK MEDIUM LEFT-POINTING TRIANGLE

$

I still get plenty of results.
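For reference, the matching I expected can be sketched like this: a name matches when every word of the query appears as a whole word in it, in any order. This is an assumption about the desired behaviour, not a description of how chars searches today, and `matches_all_words` is a made-up name.

```rust
/// True if every whitespace-separated query word occurs as a whole word in `name`.
/// Name words are split on anything non-alphanumeric, so "DOWN-POINTING" yields two words.
fn matches_all_words(name: &str, query: &str) -> bool {
    let name_words: Vec<String> = name
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_uppercase())
        .collect();
    query
        .split_whitespace()
        .all(|q| name_words.iter().any(|w| *w == q.to_uppercase()))
}

fn main() {
    let names = [
        "WHITE DOWN-POINTING SMALL TRIANGLE",
        "BLACK DOWN-POINTING SMALL TRIANGLE",
        "BLACK DOWN-POINTING TRIANGLE",
    ];
    for name in names {
        if matches_all_words(name, "SMALL TRIANGLE") {
            println!("{}", name);
        }
    }
}
```

Under that rule, the 'SMALL TRIANGLE' query would return the two SMALL TRIANGLE names listed in the first output.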

Suggestion: Make output colorful

Hello.

chars works pretty well and really helps a lot. However, its output looks a little boring and different parts of output take a while to distinguish.

So it would be great if the output were colorful. What do you think?
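A minimal sketch of one way to do it with plain ANSI escape sequences and no extra crates. The choice of styles, the `paint` helper and the decision to honour the NO_COLOR environment variable are my assumptions; `std::io::IsTerminal` needs a reasonably recent Rust toolchain.

```rust
use std::io::IsTerminal;

// Standard ANSI SGR sequences.
const BOLD: &str = "\x1b[1m";
const CYAN: &str = "\x1b[36m";
const RESET: &str = "\x1b[0m";

/// Wrap `text` in a style when colour is enabled, otherwise return it unchanged.
fn paint(text: &str, style: &str, enabled: bool) -> String {
    if enabled {
        format!("{}{}{}", style, text, RESET)
    } else {
        text.to_string()
    }
}

fn main() {
    // Only colour when writing to a terminal and NO_COLOR is unset,
    // so piping the output into other tools keeps working.
    let enabled = std::io::stdout().is_terminal() && std::env::var_os("NO_COLOR").is_none();
    println!("{} {}", paint("Unicode name:", BOLD, enabled), paint("PARTY POPPER", CYAN, enabled));
}
```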

Suggestion: output something when there’s no results

When I ran,

chars --help

I was confused because chars gave me no output. From what I can tell, chars searched for “--help”, didn’t find anything and printed nothing as a result. It would be nice if chars printed something along the lines of “No results for ‘--help’.” That would make what chars is doing clearer.
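Something along these lines is what I mean; the `render` function is a made-up stand-in for wherever chars prints its results.

```rust
/// Join the formatted results, or say so explicitly when there are none.
fn render(query: &str, results: &[String]) -> String {
    if results.is_empty() {
        format!("No results for ‘{}’.", query)
    } else {
        // Blank line between results, as in the normal output.
        results.join("\n\n")
    }
}

fn main() {
    let results: Vec<String> = Vec::new();
    println!("{}", render("--help", &results));
}
```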

[Feature] unicode character lookup by description

Being able to quickly look up a unicode character from your terminal could prove very useful (being able to call cha from vim, for example).

Is this in scope? If so I might submit a pull trying to implement this.

`cargo install` fails

When naively doing cargo install following the README, it fails with:

13:25~/git/chars(master)$ cargo install
error: found a virtual manifest at `/data/data/com.termux/files/home/git/chars/Cargo.toml` instead of a package manifest

cargo install chars --git https://github.com/antifuchs/chars.git works fine.


AUR Package

Hi, thanks for building this.

Just thought I'd let you know that I've added an AUR package for chars to make it easy to install on Arch Linux with the system packaging tools.

It might be worth including a link to the package in the installation section of the README.

Suggestion: Include HTML character entity reference names in output and in search

With your tool it is possible to look up unicode characters by various criteria as you've stated in your readme, including "unicode name" and "also known as".

In HTML, named character escape sequences are available for things like the less than and the greater than signs, but also for quite a few other characters.

Back in the day, before UTF-8 encoding support was widespread, we'd use the ISO-8859-1 encoding for our HTML and we'd use named character escape sequences for characters like æ, ø, å for example.

Some of those names stuck with me, and I sometimes search for those characters by those names on Google if I am on a machine where inputting said characters directly is not possible or just too cumbersome.

Even on my MacBook Air, where I can generally long-press certain keys to access other characters, some applications implement text input that does not support the long-press functionality, so I go to some other window on-screen and either long-press there, or search for it on Google whichever is most convenient at the time (convenience in this case is determined by which other windows I happen to have on screen at that moment).

I pretty much always have at least one terminal window open at any time, and if I don't then opening the terminal is fast and simple.

Prior to purchasing my MacBook Air, when I was running Linux on a ThinkPad, I made a few simple shell scripts named after the HTML character entity references for the characters I most commonly needed: æ, ø, å, Æ, Ø, Å, that is, aelig, oslash, aring, AElig, Oslash, Aring. When executed, they would print the UTF-8 encoded byte sequence for the character in question.

oslash

ø

A full list of all HTML character entity references can be found at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML

Aside from the six mentioned above, the most notable for me personally are laquo, raquo, ndash, mdash, eacute and Eacute. They are all useful IMO, though, and if you agree to include the HTML character entity reference names, it would make the most sense to include them all.

So to get to the point, my suggestion is that based upon the table at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML, an additional field be added for applicable characters in the output for chars.

Some examples of what the output of chars would look like:

Example 1

chars U+002A

ASCII 2/a,  42, 0x2a, 0052, bits 00101010
Width: 1, prints as *
Unicode name: ASTERISK
Also known as: Star, Splat, Aster, Times, Gear, Dingle, Bug, Twinkle, Glob
HTML entity names: ast, midast

Example 2

chars U+00AE

LATIN1 ae, 174, 0xae, 0256, bits 10101110
Width: 1 (2 in CJK context), prints as ®
Quotes as \u{ae}
Unicode name: REGISTERED SIGN
HTML entity names: reg, circledR, REG

Example 3

chars U+00C6

LATIN1 c6, 198, 0xc6, 0306, bits 11000110
Width: 1 (2 in CJK context), prints as Æ
Upper case. Downcases to æ
Quotes as \u{c6}
Unicode name: LATIN CAPITAL LETTER AE
HTML entity name: AElig

In the examples above, a field named "HTML entity names" (where multiple names exist) or "HTML entity name" (where only one name exists) has been added.
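The singular/plural field could be produced roughly like this. The three-entry table only covers the examples above and stands in for the full list from the Wikipedia table; `entity_names` and `entity_field` are names I made up.

```rust
/// Tiny stand-in for the full entity table; only the three examples above.
fn entity_names(c: char) -> &'static [&'static str] {
    match c {
        '*' => &["ast", "midast"],
        '®' => &["reg", "circledR", "REG"],
        'Æ' => &["AElig"],
        _ => &[],
    }
}

/// Build the extra output line, or None when the character has no entity name.
fn entity_field(c: char) -> Option<String> {
    let names = entity_names(c);
    match names.len() {
        0 => None,
        1 => Some(format!("HTML entity name: {}", names[0])),
        _ => Some(format!("HTML entity names: {}", names.join(", "))),
    }
}

fn main() {
    for c in ['*', '®', 'Æ', 'x'] {
        if let Some(line) = entity_field(c) {
            println!("{}", line);
        }
    }
}
```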

Furthermore, I request that a case-sensitive search be performed on this field where present, so that one can search for entity names and get results like those shown in the following examples: