boinkor-net / chars
cha(rs) is a commandline tool to display information about unicode characters
Home Page: https://github.com/boinkor-net/chars
License: MIT License
I just ran cargo +nightly clippy on chars. There were many style nits!
The code base could use some style cleanups and a cargo +nightly clippy travis target (:
I deal with Unicode a fair bit and chars is a handy tool. Sometimes it would be convenient to know which Unicode version assigned a particular codepoint.
E.g. the output from chars might look something like this. If the version information is deemed too noisy, it could be hidden by default and shown only when a command line flag is given.
$ chars party
U+0001F973, 🥳 0x0001F973, \0374563, UTF-8: f0 9f a5 b3, UTF-16BE: d83edd73
Width: 2, prints as 🥳
Quotes as \u{1f973}
Unicode name: FACE WITH PARTY HORN AND PARTY HAT
Unicode version: 11.0
U+0001F389, 🎉 0x0001F389, \0371611, UTF-8: f0 9f 8e 89, UTF-16BE: d83cdf89
Width: 2, prints as 🎉
Quotes as \u{1f389}
Unicode name: PARTY POPPER
Unicode version: 6.0
I think the information is available via the DerivedAge.txt file in the UCD.
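To make the suggestion concrete, here is a minimal sketch of how such a lookup could work. It assumes the contents of DerivedAge.txt are already available as a string and that data lines have the usual UCD shape `code point or range ; version # comment`. The function names `parse_age_line` and `unicode_age` are hypothetical and not part of chars.

```rust
/// Parse one DerivedAge.txt line into (first, last, version).
/// Returns None for blank lines, comment lines, and anything malformed.
fn parse_age_line(line: &str) -> Option<(u32, u32, String)> {
    // Everything after '#' is a comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let (range, version) = data.split_once(';')?;
    let range = range.trim();
    let (first, last) = match range.split_once("..") {
        // A range such as "1F973..1F976".
        Some((a, b)) => (
            u32::from_str_radix(a, 16).ok()?,
            u32::from_str_radix(b, 16).ok()?,
        ),
        // A single code point such as "1F389".
        None => {
            let cp = u32::from_str_radix(range, 16).ok()?;
            (cp, cp)
        }
    };
    Some((first, last, version.trim().to_string()))
}

/// Find the Unicode version that assigned `c`, scanning the file linearly.
/// (Hypothetical helper; a real implementation would likely pre-build a table.)
fn unicode_age(derived_age: &str, c: char) -> Option<String> {
    let cp = c as u32;
    derived_age
        .lines()
        .filter_map(parse_age_line)
        .find(|(first, last, _)| *first <= cp && cp <= *last)
        .map(|(_, _, version)| version)
}
```

A linear scan is good enough for a sketch; generating a sorted table at build time and binary-searching it would probably fit a command-line tool better.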
Turns out we can't find, e.g., the transgender flag (new in unicode 13!) - its codepoints are
U+1F3F3
U+FE0F
U+200D
U+26A7
U+FE0F
...meaning we can only find the constituent codepoints, but not the whole. That's a problem for all kinds of flags, family configurations and other glyphs composed of multiple codepoints.
The sequences have names, so we ought to be able to retrieve them.
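One possible source for those names is the emoji-zwj-sequences.txt data file. The sketch below assumes its data lines look like `code points ; type ; name # comment`, which is my reading of the file and not something chars does today. The function name `parse_sequence_line` is hypothetical.

```rust
/// Parse one data line into (name, sequence-as-string).
/// Returns None for comments, blank lines, and lines that do not fit
/// the assumed "code points ; type ; name" layout.
fn parse_sequence_line(line: &str) -> Option<(String, String)> {
    // Strip the trailing comment, then split the three fields.
    let data = line.split('#').next()?.trim();
    let mut fields = data.split(';').map(str::trim);
    let code_points = fields.next()?;
    let _type_field = fields.next()?;
    let name = fields.next()?;

    // Turn "1F3F3 FE0F 200D 26A7 FE0F" into the actual string.
    let mut sequence = String::new();
    for hex in code_points.split_whitespace() {
        let cp = u32::from_str_radix(hex, 16).ok()?;
        sequence.push(char::from_u32(cp)?);
    }
    if sequence.is_empty() {
        return None;
    }
    Some((name.to_string(), sequence))
}
```

With a table built from such lines, a search for "transgender flag" could return the whole sequence instead of only its constituent codepoints. Lines that hold a range of code points in the first field are skipped by this sketch.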
For searches like chars arrow (2820 lines!) or chars box (875), the results are not easy to read. It would be nice to have a mode that prints a single line per result, in the form U+XXXX, prints as X, Unicode name: XXX.
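As a rough illustration of the requested format, a formatter could look like the following. The function name `one_line` is hypothetical, and the Unicode name is passed in by the caller because the name lookup itself is outside this sketch.

```rust
/// Format one search result on a single line:
/// "U+XXXX, prints as X, Unicode name: XXX".
/// The code point is printed in upper-case hex, padded to at least 4 digits.
fn one_line(c: char, name: &str) -> String {
    format!(
        "U+{:04X}, prints as {}, Unicode name: {}",
        c as u32, c, name
    )
}
```

For example, `one_line('⯆', "BLACK MEDIUM DOWN-POINTING TRIANGLE CENTRED")` yields `U+2BC6, prints as ⯆, Unicode name: BLACK MEDIUM DOWN-POINTING TRIANGLE CENTRED`.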
I see two possible approaches:
We appear to be dependent on a very recent unicode_names (or at least a synced-up one). Since that hasn't updated since unicode 8 (and that took a year), maybe we could pull in a slimmed-down version of https://github.com/ProgVal/unicode_names2 and release that as a workspace crate. It does use the same data file as we do, after all!
Maybe the same might apply to the unicode-width crate too, but it's less noticeable for my use case. Let's try unicode_names first.
It would be nice if chars supported --help and --version. Most other command-line applications support those options, and when I first tried chars, I expected them to work.
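A minimal sketch of what that could look like, using only the standard library. The `Action` enum and `classify_args` are hypothetical names, the short flags `-h` and `-V` are my assumption, and so is the choice to show help when no arguments are given.

```rust
/// What the program should do for a given argument list.
#[derive(Debug, PartialEq)]
enum Action {
    Help,
    Version,
    Search(Vec<String>),
}

/// Decide what to do based on the first argument
/// (the program name must already be stripped).
fn classify_args(args: &[String]) -> Action {
    match args.first().map(String::as_str) {
        // Assumption: running with no arguments shows the help text.
        Some("--help") | Some("-h") | None => Action::Help,
        Some("--version") | Some("-V") => Action::Version,
        // Anything else is treated as a list of search terms.
        _ => Action::Search(args.to_vec()),
    }
}
```

In `main`, this would be called with `std::env::args().skip(1).collect::<Vec<_>>()`, and the version string could come from `env!("CARGO_PKG_VERSION")`.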
I can see that there are several small triangles that exist:
$ chars 'DOWN-POINTING TRIANGLE'
U+0001F783, 🞃 0x0001F783, \0373603, UTF-8: f0 9f 9e 83, UTF-16BE: d83ddf83
Width: 1, prints as 🞃
Quotes as \u{1f783}
Unicode name: BLACK DOWN-POINTING ISOSCELES RIGHT TRIANGLE
U+0001F53D, 🔽 0x0001F53D, \0372475, UTF-8: f0 9f 94 bd, UTF-16BE: d83ddd3d
Width: 2, prints as 🔽
Quotes as \u{1f53d}
Unicode name: DOWN-POINTING SMALL RED TRIANGLE
U+0001F53B, 🔻 0x0001F53B, \0372473, UTF-8: f0 9f 94 bb, UTF-16BE: d83ddd3b
Width: 2, prints as 🔻
Quotes as \u{1f53b}
Unicode name: DOWN-POINTING RED TRIANGLE
U+2BC6, ⯆ 0x2BC6, \025706, UTF-8: e2 af 86, UTF-16BE: 2bc6
Width: 1, prints as ⯆
Quotes as \u{2bc6}
Unicode name: BLACK MEDIUM DOWN-POINTING TRIANGLE CENTRED
U+29E9, ⧩ 0x29E9, \024751, UTF-8: e2 a7 a9, UTF-16BE: 29e9
Width: 1, prints as ⧩
Quotes as \u{29e9}
Unicode name: DOWN-POINTING TRIANGLE WITH RIGHT HALF BLACK
U+29E8, ⧨ 0x29E8, \024750, UTF-8: e2 a7 a8, UTF-16BE: 29e8
Width: 1, prints as ⧨
Quotes as \u{29e8}
Unicode name: DOWN-POINTING TRIANGLE WITH LEFT HALF BLACK
U+26DB, ⛛ 0x26DB, \023333, UTF-8: e2 9b 9b, UTF-16BE: 26db
Width: 1 (2 in CJK context), prints as ⛛
Quotes as \u{26db}
Unicode name: HEAVY WHITE DOWN-POINTING TRIANGLE
U+25BF, ▿ 0x25BF, \022677, UTF-8: e2 96 bf, UTF-16BE: 25bf
Width: 1, prints as ▿
Quotes as \u{25bf}
Unicode name: WHITE DOWN-POINTING SMALL TRIANGLE
U+25BE, ▾ 0x25BE, \022676, UTF-8: e2 96 be, UTF-16BE: 25be
Width: 1, prints as ▾
Quotes as \u{25be}
Unicode name: BLACK DOWN-POINTING SMALL TRIANGLE
U+25BD, ▽ 0x25BD, \022675, UTF-8: e2 96 bd, UTF-16BE: 25bd
Width: 1 (2 in CJK context), prints as ▽
Quotes as \u{25bd}
Unicode name: WHITE DOWN-POINTING TRIANGLE
U+25BC, ▼ 0x25BC, \022674, UTF-8: e2 96 bc, UTF-16BE: 25bc
Width: 1 (2 in CJK context), prints as ▼
Quotes as \u{25bc}
Unicode name: BLACK DOWN-POINTING TRIANGLE
U+23F7, ⏷ 0x23F7, \021767, UTF-8: e2 8f b7, UTF-16BE: 23f7
Width: 1, prints as ⏷
Quotes as \u{23f7}
Unicode name: BLACK MEDIUM DOWN-POINTING TRIANGLE
U+23EC, ⏬ 0x23EC, \021754, UTF-8: e2 8f ac, UTF-16BE: 23ec
Width: 2, prints as ⏬
Quotes as \u{23ec}
Unicode name: BLACK DOWN-POINTING DOUBLE TRIANGLE
$
But, when I try to look at only the small triangles:
$ chars 'SMALL TRIANGLE'
$
I get nothing. If I search for medium triangles:
$ chars 'MEDIUM TRIANGLE'
U+0001F827, 🠧 0x0001F827, \0374047, UTF-8: f0 9f a0 a7, UTF-16BE: d83edc27
Width: 1, prints as 🠧
Quotes as \u{1f827}
Unicode name: DOWNWARDS TRIANGLE-HEADED ARROW WITH MEDIUM SHAFT
U+0001F826, 🠦 0x0001F826, \0374046, UTF-8: f0 9f a0 a6, UTF-16BE: d83edc26
Width: 1, prints as 🠦
Quotes as \u{1f826}
Unicode name: RIGHTWARDS TRIANGLE-HEADED ARROW WITH MEDIUM SHAFT
U+0001F825, 🠥 0x0001F825, \0374045, UTF-8: f0 9f a0 a5, UTF-16BE: d83edc25
Width: 1, prints as 🠥
Quotes as \u{1f825}
Unicode name: UPWARDS TRIANGLE-HEADED ARROW WITH MEDIUM SHAFT
U+0001F824, 🠤 0x0001F824, \0374044, UTF-8: f0 9f a0 a4, UTF-16BE: d83edc24
Width: 1, prints as 🠤
Quotes as \u{1f824}
Unicode name: LEFTWARDS TRIANGLE-HEADED ARROW WITH MEDIUM SHAFT
U+0001F807, 🠇 0x0001F807, \0374007, UTF-8: f0 9f a0 87, UTF-16BE: d83edc07
Width: 1, prints as 🠇
Quotes as \u{1f807}
Unicode name: DOWNWARDS ARROW WITH MEDIUM TRIANGLE ARROWHEAD
U+0001F806, 🠆 0x0001F806, \0374006, UTF-8: f0 9f a0 86, UTF-16BE: d83edc06
Width: 1, prints as 🠆
Quotes as \u{1f806}
Unicode name: RIGHTWARDS ARROW WITH MEDIUM TRIANGLE ARROWHEAD
U+0001F805, 🠅 0x0001F805, \0374005, UTF-8: f0 9f a0 85, UTF-16BE: d83edc05
Width: 1, prints as 🠅
Quotes as \u{1f805}
Unicode name: UPWARDS ARROW WITH MEDIUM TRIANGLE ARROWHEAD
U+0001F804, 🠄 0x0001F804, \0374004, UTF-8: f0 9f a0 84, UTF-16BE: d83edc04
Width: 1, prints as 🠄
Quotes as \u{1f804}
Unicode name: LEFTWARDS ARROW WITH MEDIUM TRIANGLE ARROWHEAD
U+2BC8, ⯈ 0x2BC8, \025710, UTF-8: e2 af 88, UTF-16BE: 2bc8
Width: 1, prints as ⯈
Quotes as \u{2bc8}
Unicode name: BLACK MEDIUM RIGHT-POINTING TRIANGLE CENTRED
U+2BC7, ⯇ 0x2BC7, \025707, UTF-8: e2 af 87, UTF-16BE: 2bc7
Width: 1, prints as ⯇
Quotes as \u{2bc7}
Unicode name: BLACK MEDIUM LEFT-POINTING TRIANGLE CENTRED
U+2BC6, ⯆ 0x2BC6, \025706, UTF-8: e2 af 86, UTF-16BE: 2bc6
Width: 1, prints as ⯆
Quotes as \u{2bc6}
Unicode name: BLACK MEDIUM DOWN-POINTING TRIANGLE CENTRED
U+2BC5, ⯅ 0x2BC5, \025705, UTF-8: e2 af 85, UTF-16BE: 2bc5
Width: 1, prints as ⯅
Quotes as \u{2bc5}
Unicode name: BLACK MEDIUM UP-POINTING TRIANGLE CENTRED
U+23F7, ⏷ 0x23F7, \021767, UTF-8: e2 8f b7, UTF-16BE: 23f7
Width: 1, prints as ⏷
Quotes as \u{23f7}
Unicode name: BLACK MEDIUM DOWN-POINTING TRIANGLE
U+23F6, ⏶ 0x23F6, \021766, UTF-8: e2 8f b6, UTF-16BE: 23f6
Width: 1, prints as ⏶
Quotes as \u{23f6}
Unicode name: BLACK MEDIUM UP-POINTING TRIANGLE
U+23F5, ⏵ 0x23F5, \021765, UTF-8: e2 8f b5, UTF-16BE: 23f5
Width: 1, prints as ⏵
Quotes as \u{23f5}
Unicode name: BLACK MEDIUM RIGHT-POINTING TRIANGLE
U+23F4, ⏴ 0x23F4, \021764, UTF-8: e2 8f b4, UTF-16BE: 23f4
Width: 1, prints as ⏴
Quotes as \u{23f4}
Unicode name: BLACK MEDIUM LEFT-POINTING TRIANGLE
$
I still get plenty of results.
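For reference, this is the behaviour I would expect, written as a small sketch. It is not how chars actually searches (I have not checked the implementation); it simply treats a query as a set of words that must all occur in the name, splitting on anything that is not alphanumeric. The names `words` and `matches_all_words` are hypothetical.

```rust
/// Split a string into upper-cased words, treating spaces, hyphens and
/// any other non-alphanumeric character as separators.
fn words(s: &str) -> Vec<String> {
    s.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_uppercase())
        .collect()
}

/// True if every word of the query occurs as a whole word in the name.
/// An empty query matches nothing.
fn matches_all_words(query: &str, name: &str) -> bool {
    let query_words = words(query);
    let name_words = words(name);
    !query_words.is_empty() && query_words.iter().all(|w| name_words.contains(w))
}
```

Under this rule, 'SMALL TRIANGLE' matches WHITE DOWN-POINTING SMALL TRIANGLE and BLACK DOWN-POINTING SMALL TRIANGLE, and 'MEDIUM TRIANGLE' still matches the names listed above.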
Hello.
chars works pretty well and really helps a lot. However, its output looks a little plain, and it takes a while to tell the different parts of the output apart.
So it would be great if the output were colorful. What do you think?
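To show roughly what I mean, here is a small sketch using plain ANSI SGR escape sequences and no extra crates. The helper names `paint` and `colored_name_line` are hypothetical, and the choice of dim labels with bold values is only one possible style.

```rust
/// Wrap `text` in an ANSI SGR sequence (e.g. 1 = bold, 2 = dim, 32 = green).
/// When `enabled` is false the text is returned unchanged, so output piped
/// to a file or another program can stay free of escape codes.
fn paint(text: &str, sgr_code: u8, enabled: bool) -> String {
    if enabled {
        format!("\x1b[{}m{}\x1b[0m", sgr_code, text)
    } else {
        text.to_string()
    }
}

/// Render the "Unicode name" line with a dim label and a bold value.
fn colored_name_line(name: &str, enabled: bool) -> String {
    format!(
        "{} {}",
        paint("Unicode name:", 2, enabled),
        paint(name, 1, enabled)
    )
}
```

The `enabled` flag would ideally be false when standard output is not a terminal.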
When I ran chars --help, I was confused because chars gave me no output. From what I can tell, chars searched for “--help”, didn’t find anything and printed nothing as a result. It would be nice if chars printed something along the lines of “No results for ‘--help’.” That would make what chars is doing clearer.
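A minimal sketch of the idea, assuming the search step hands back a list of already formatted result blocks. The function name `render_results` is hypothetical, and the sketch uses plain ASCII quotes in the message.

```rust
/// Join the formatted results, or explain that the query matched nothing.
fn render_results(query: &str, descriptions: &[String]) -> String {
    if descriptions.is_empty() {
        // Say explicitly that the search ran and found nothing.
        format!("No results for '{}'.", query)
    } else {
        descriptions.join("\n\n")
    }
}
```

Printing the message to standard error and exiting with a non-zero status would also let scripts detect the empty case.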
Being able to quickly look up a unicode character from your terminal could prove very useful (being able to call cha from vim, for example).
Is this in scope? If so, I might submit a pull request that tries to implement this.
When naively running cargo install following the README, it fails with:
13:25~/git/chars(master)$ cargo install
error: found a virtual manifest at `/data/data/com.termux/files/home/git/chars/Cargo.toml` instead of a package manifest
Running cargo install chars --git https://github.com/antifuchs/chars.git instead works fine.
Hi, thanks for building this.
Just thought I'd let you know that I've added an AUR package for chars to make it easy to install on Arch Linux with the system packaging tools.
It might be worth including a link to the package in the installation section of the README.
With your tool it is possible to look up unicode characters by various criteria as you've stated in your readme, including "unicode name" and "also known as".
In HTML, named character escape sequences are available for things like the less than and the greater than signs, but also for quite a few other characters.
Back in the day, before UTF-8 encoding support was widespread, we'd use the ISO-8859-1 encoding for our HTML and we'd use named character escape sequences for characters like æ, ø, å for example.
Some of those names stuck with me, and I sometimes search for those characters by those names on Google if I am on a machine where inputting said characters directly is not possible or just too cumbersome.
Even on my MacBook Air, where I can generally long-press certain keys to access other characters, some applications implement text input that does not support the long-press functionality. In that case I go to some other window on-screen and either long-press there or search on Google, whichever is more convenient at the time (which in turn depends on the other windows I happen to have on screen at that moment).
I pretty much always have at least one terminal window open at any time, and if I don't then opening the terminal is fast and simple.
Prior to purchasing my MacBook Air, when I was running Linux on a ThinkPad, I made a few simple shell scripts that were named after the HTML character entity references for the characters that I most commonly needed (æ, ø, å, Æ, Ø, Å): aelig, oslash, aring, AElig, Oslash, Aring. When executed, they would print the UTF-8 encoded byte sequence for the character in question.
oslash
ø
A full list of all HTML character entity references can be found at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML
Aside from the six mentioned above, the most notable ones for me personally are laquo, raquo, ndash, mdash, eacute and Eacute. They are all useful IMO, though, and if you agree to include the HTML character entity reference names, it would make the most sense to include them all.
So to get to the point, my suggestion is that, based upon the table at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML, an additional field be added for applicable characters in the output of chars.
Some examples of what the output of chars would look like:
Example 1
chars U+002A
ASCII 2/a, 42, 0x2a, 0052, bits 00101010
Width: 1, prints as *
Unicode name: ASTERISK
Also known as: Star, Splat, Aster, Times, Gear, Dingle, Bug, Twinkle, Glob
HTML entity names: ast, midast
Example 2
chars U+00AE
LATIN1 ae, 174, 0xae, 0256, bits 10101110
Width: 1 (2 in CJK context), prints as ®
Quotes as \u{ae}
Unicode name: REGISTERED SIGN
HTML entity names: reg, circledR, REG
Example 3
chars U+00C6
LATIN1 c6, 198, 0xc6, 0306, bits 11000110
Width: 1 (2 in CJK context), prints as Æ
Upper case. Downcases to æ
Quotes as \u{c6}
Unicode name: LATIN CAPITAL LETTER AE
HTML entity name: AElig
In the examples above, a field named "HTML entity names" (where multiple names exist) or "HTML entity name" (where only one name exists) has been added.
Furthermore, I request that case-sensitive search is performed on this field where present, so that one can search for them and get results like shown in the following examples:
Example 1
chars Oslash
LATIN1 d8, 216, 0xd8, 0330, bits 11011000
Width: 1 (2 in CJK context), prints as Ø
Upper case. Downcases to ø
Quotes as \u{d8}
Unicode name: LATIN CAPITAL LETTER O WITH STROKE
HTML entity name: Oslash
Example 2
chars oslash
LATIN1 f8, 248, 0xf8, 0370, bits 11111000
Width: 1 (2 in CJK context), prints as ø
Lower case. Upcases to Ø
Quotes as \u{f8}
Unicode name: LATIN SMALL LETTER O WITH STROKE
HTML entity name: oslash
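To sketch how the lookup could behave, here is a tiny table covering only the six entities I mentioned first; a real implementation would generate the full table from the entity list. The names `ENTITIES`, `entity_to_char` and `entity_names` are hypothetical.

```rust
/// A deliberately tiny sample of HTML entity names and their characters.
const ENTITIES: &[(&str, char)] = &[
    ("AElig", 'Æ'),
    ("aelig", 'æ'),
    ("Oslash", 'Ø'),
    ("oslash", 'ø'),
    ("Aring", 'Å'),
    ("aring", 'å'),
];

/// Case-sensitive search: "Oslash" and "oslash" are different entities.
fn entity_to_char(name: &str) -> Option<char> {
    ENTITIES
        .iter()
        .find(|(entity, _)| *entity == name)
        .map(|(_, c)| *c)
}

/// All entity names for a character, for the "HTML entity name(s)" field.
fn entity_names(c: char) -> Vec<&'static str> {
    ENTITIES
        .iter()
        .filter(|(_, ch)| *ch == c)
        .map(|(entity, _)| *entity)
        .collect()
}
```

Because the comparison is exact, `entity_to_char("Oslash")` gives Ø, `entity_to_char("oslash")` gives ø, and `entity_to_char("OSLASH")` gives nothing.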