
uniseg's Introduction

Unicode Text Segmentation for Go


This Go package implements Unicode Text Segmentation according to Unicode Standard Annex #29, Unicode Line Breaking according to Unicode Standard Annex #14 (Unicode version 15.0.0), and monospace font string width calculation similar to wcwidth.

Background

Grapheme Clusters

In Go, strings are read-only slices of bytes. They can be turned into Unicode code points using a for range loop or by converting them: []rune(str). However, multiple code points may be combined into one user-perceived character, or what the Unicode specification calls a "grapheme cluster". Here are some examples:

String | Bytes (UTF-8) | Code points (runes) | Grapheme clusters
Käse | 6 bytes: 4b 61 cc 88 73 65 | 5 code points: 4b 61 308 73 65 | 4 clusters: [4b],[61 308],[73],[65]
🏳️‍🌈 | 14 bytes: f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88 | 4 code points: 1f3f3 fe0f 200d 1f308 | 1 cluster: [1f3f3 fe0f 200d 1f308]
🇩🇪 | 8 bytes: f0 9f 87 a9 f0 9f 87 aa | 2 code points: 1f1e9 1f1ea | 1 cluster: [1f1e9 1f1ea]

This package provides tools to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.
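To make the table concrete, here is a minimal sketch (assuming fmt and this package are imported) that prints the three counts for the "Käse" example:

str := "Ka\u0308se" // "Käse" spelled with a combining diaeresis (U+0308)
fmt.Println(len(str))                         // 6 (bytes)
fmt.Println(len([]rune(str)))                 // 5 (code points)
fmt.Println(uniseg.GraphemeClusterCount(str)) // 4 (grapheme clusters)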

Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Searching may also use word boundaries in determining matching items. This package provides tools to determine word boundaries within strings.
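As a rough sketch of one such use (not from the original README), words in a string can be counted with FirstWordInString, ignoring segments without letters or digits; this assumes fmt, strings, unicode, and this package are imported:

str := "Hello, world!"
state := -1
words := 0
var segment string
for len(str) > 0 {
	segment, str, state = uniseg.FirstWordInString(str, state)
	// Count only segments containing at least one letter or digit;
	// punctuation and spaces are returned as their own segments.
	if strings.IndexFunc(segment, func(r rune) bool {
		return unicode.IsLetter(r) || unicode.IsDigit(r)
	}) >= 0 {
		words++
	}
}
fmt.Println(words)
// 2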

Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides tools to determine sentence boundaries within strings.
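A minimal sketch using FirstSentenceInString (assuming fmt and this package are imported); note that each sentence keeps its trailing space:

str := "Hello. World."
state := -1
var sentence string
for len(str) > 0 {
	sentence, str, state = uniseg.FirstSentenceInString(str, state)
	fmt.Printf("(%s)\n", sentence)
}
// (Hello. )
// (World.)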

Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters).
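A minimal sketch using FirstLineSegmentInString (assuming fmt and this package are imported); the returned boolean indicates whether the text must be broken after the segment, for example after a newline:

str := "First line.\nSecond line."
state := -1
var (
	segment   string
	mustBreak bool
)
for len(str) > 0 {
	segment, str, mustBreak, state = uniseg.FirstLineSegmentInString(str, state)
	fmt.Printf("%q (must break: %t)\n", segment, mustBreak)
}
// "First " (must break: false)
// "line.\n" (must break: true)
// "Second " (must break: false)
// "line." (must break: true)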

Monospace Width

Most terminals or text displays / text editors using a monospace font (for example source code editors) use a fixed width for each character. Some characters such as emojis or characters found in Asian and other languages may take up more than one character cell. This package provides tools to determine the number of cells a string will take up when displayed in a monospace font. See here for more information.
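For example (a small sketch, assuming fmt and this package are imported), East Asian wide characters occupy two cells each:

fmt.Println(uniseg.StringWidth("Hello"))    // 5
fmt.Println(uniseg.StringWidth("世界"))      // 4
fmt.Println(uniseg.StringWidth("Hi, 世界!")) // 9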

Installation

go get github.com/rivo/uniseg

Examples

Counting Characters in a String

n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
fmt.Println(n)
// 2

Calculating the Monospace String Width

width := uniseg.StringWidth("🇩🇪🏳️‍🌈!")
fmt.Println(width)
// 5

Using the Graphemes Class

This is the most convenient method of iterating over grapheme clusters:

gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
	fmt.Printf("%x ", gr.Runes())
}
// [1f44d 1f3fc] [21]
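Recent versions of the Graphemes type also expose accessors such as Str, Positions, and Width; a sketch, assuming those methods are available in the version you use:

gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
	from, to := gr.Positions() // byte offsets of the cluster in the original string
	fmt.Printf("%q bytes %d-%d width %d\n", gr.Str(), from, to, gr.Width())
}
// "👍🏼" bytes 0-8 width 2
// "!" bytes 8-9 width 1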

Using the Step or StepString Function

This avoids allocating a new Graphemes object but it requires the handling of states and boundaries:

str := "๐Ÿ‡ฉ๐Ÿ‡ช๐Ÿณ๏ธโ€๐ŸŒˆ"
state := -1
var c string
for len(str) > 0 {
	c, str, _, state = uniseg.StepString(str, state)
	fmt.Printf("%x ", []rune(c))
}
// [1f1e9 1f1ea] [1f3f3 fe0f 200d 1f308]
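The ignored boundaries value packs additional information about the position after each cluster. Here is a hedged sketch of decoding it, assuming the MaskWord, MaskLine, LineMustBreak, and ShiftWidth constants documented for Step:

str := "Hi, 世界!"
state := -1
var (
	c          string
	boundaries int
)
for len(str) > 0 {
	c, str, boundaries, state = uniseg.StepString(str, state)
	fmt.Printf("%q width=%d wordBoundary=%t mustBreakLine=%t\n",
		c,
		boundaries>>uniseg.ShiftWidth,                      // monospace width of the cluster
		boundaries&uniseg.MaskWord != 0,                    // word boundary after this cluster?
		boundaries&uniseg.MaskLine == uniseg.LineMustBreak) // mandatory line break after it?
}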

Advanced Examples

The Graphemes class offers the most convenient way to access all functionality of this package. But in some cases, it may be better to use the specialized functions directly. For example, if you're only interested in word segmentation, use FirstWord or FirstWordInString:

str := "Hello, world!"
state := -1
var c string
for len(str) > 0 {
	c, str, state = uniseg.FirstWordInString(str, state)
	fmt.Printf("(%s)\n", c)
}
// (Hello)
// (,)
// ( )
// (world)
// (!)

Similarly, use FirstSentence or FirstSentenceInString, or FirstLineSegment or FirstLineSegmentInString, to determine sentence and line segmentation only.

If you're only interested in the width of characters, use FirstGraphemeCluster or FirstGraphemeClusterInString. It is much faster than using Step, StepString, or the Graphemes class because it does not include the logic for word / sentence / line boundaries.
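A short sketch (assuming fmt and this package are imported): FirstGraphemeClusterInString also returns the monospace width of each cluster:

str := "🇩🇪🏳️‍🌈!"
state := -1
var (
	cluster string
	width   int
)
for len(str) > 0 {
	cluster, str, width, state = uniseg.FirstGraphemeClusterInString(str, state)
	fmt.Printf("%q has width %d\n", cluster, width)
}
// "🇩🇪" has width 2
// "🏳️‍🌈" has width 2
// "!" has width 1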

Finally, if you need to reverse a string while preserving grapheme clusters, use ReverseString:

fmt.Println(uniseg.ReverseString("🇩🇪🏳️‍🌈"))
// 🏳️‍🌈🇩🇪

Documentation

Refer to https://pkg.go.dev/github.com/rivo/uniseg for the package's documentation.

Dependencies

This package does not depend on any packages outside the standard library.

Sponsor this Project

Become a Sponsor on GitHub to support this project!

Your Feedback

Add your issue here on GitHub, preferably before submitting any PRs. Feel free to get in touch if you have any questions.

uniseg's People

Contributors

dchapes, dolmen, elliotwutingfeng, fmatzy, junegunn, meowgorithm, rivo, shogo82148


uniseg's Issues

Error building using gcc go 1.13.1

c29cfd2
go version go1.18 gccgo (GCC) 13.1.1 20230429 linux/amd64

go build
# github.com/rivo/uniseg
./properties.go:137:20: error: expected ‘(’
  137 | func propertySearch[E interface{ [3]int | [4]int }](dictionary []E, r rune) (result E) {
      |                    ^
./properties.go:137:23: error: expected ‘]’
  137 | func propertySearch[E interface{ [3]int | [4]int }](dictionary []E, r rune) (result E) {
      |                       ^
./properties.go:137:23: error: expected ‘;’ or newline after top level declaration
./properties.go:140:9: error: expected declaration
  140 |         to := len(dictionary)
      |         ^
./properties.go:141:9: error: expected declaration
  141 |         for to > from {
      |         ^
./properties.go:143:17: error: expected declaration
  143 |                 cpRange := dictionary[middle]
      |                 ^
./properties.go:144:17: error: expected declaration
  144 |                 if int(r) < cpRange[0] {
      |                 ^
./properties.go:146:25: error: expected declaration
  146 |                         continue
      |                         ^
./properties.go:147:17: error: expected declaration
  147 |                 }
      |                 ^
./properties.go:148:17: error: expected declaration
  148 |                 if int(r) > cpRange[1] {
      |                 ^
./properties.go:150:25: error: expected declaration
  150 |                         continue
      |                         ^
./properties.go:151:17: error: expected declaration
  151 |                 }
      |                 ^
./properties.go:152:17: error: expected declaration
  152 |                 return cpRange
      |                 ^
./properties.go:153:9: error: expected declaration
  153 |         }
      |         ^
./properties.go:154:9: error: expected declaration
  154 |         return
      |         ^
./properties.go:155:1: error: expected declaration
  155 | }
      | ^

Codegen for Unicode Test cases?

After reading #6 (comment), I have a question about how those parts of the test cases were added and how they are maintained.

I.e. the lines starting at:

uniseg/grapheme_test.go

Lines 43 to 47 in f699dde

// The following tests are taken from
// http://www.unicode.org/Public/12.0.0/ucd/auxiliary/GraphemeBreakTest.txt,
// see https://www.unicode.org/license.html for the Unicode license agreement.
{original: "\u0020\u0020", expected: [][]rune{{0x0020}, {0x0020}}}, // รท [0.2] SPACE (Other) รท [999.0] SPACE (Other) รท [0.3]
{original: "\u0020\u0308\u0020", expected: [][]rune{{0x0020, 0x0308}, {0x0020}}}, // รท [0.2] SPACE (Other) ร— [9.0] COMBINING DIAERESIS (Extend_ExtCccZwj) รท [999.0] SPACE (Other) รท [0.3]

Were these entered by hand or generated by code that you have but that isn't committed?

If the latter, would you please consider committing it so that those tests can be re-generated, updated, etc. via go generate? If you have such code but it's not of a suitable quality to commit (e.g. you hacked something disposable together that did the job "good enough" for the first commit) then, if you'd like, I could take that code and clean it up for you.

If the former, would you be open to a PR that, given a URL, would generate the relevant test cases? I imagine something like a grapheme_test_gen.go file with a // +build generate tag that generates just the sub-slice of unicode.org tests to a new file, e.g. grapheme_gen_test.go, which is then referenced from the existing hand-written grapheme_test.go file and has an appropriate //go:generate go run grapheme_test_gen.go https://www.unicode.org/Public/13.0.0/ucd/auxiliary/GraphemeBreakTest.txt line that could easily be changed.

Further, are there other parts of the code that could possibly benefit from code generation from other pages at https://www.unicode.org/Public/13.0.0/ucd/auxiliary/ or elsewhere? Possibly the rules in grTransitions or the property codePoints slice? Would you also be open to PRs for such changes? The goal wouldn't be to harm code readability in any way but instead to allow easier and far less error-prone updating as/when Unicode changes/evolves. Examples of doing such things can be found in places such as golang.org/x/text/unicode.

Thank you.

FirstGraphemeCluster does not need to preserve state across grapheme clusters

Hi,

The FirstGraphemeCluster function can be used to iteratively extract grapheme clusters from a string (without additional allocations). The documentation mentions that a state should be passed in (initially set to -1); it is then returned and should be passed again on the next call, in order to preserve some state across calls of this function.

This state contains the current grapheme cluster parser state, and the property of the next codepoint.

It did not make sense to me that decoding a grapheme cluster depended on earlier state: I'd expected each grapheme cluster to be fully independent.

To test this, I took the full test case for grapheme cluster boundary processing of Unicode 14.0 (the version supported by the library), and ran a simple test by calling FirstGraphemeClusterInString and comparing the results with the spec:

  • When preserving the state across grapheme clusters: everything works (as expected: the library is compliant 😋)
  • When explicitly resetting the state to -1 across calls to FirstGraphemeClusterInString (should be incorrect): everything still works, all tests pass!!!

This would mean that even when not preserving any state, the actual grapheme clusters that are returned are always the same.

So, from my understanding, there shouldn't be any need for state at all between calls to the library, and the state parameter could be fully deprecated.

Full test case (see the TODO line), try running in the Go playground (prints All tests passed): https://gist.github.com/delthas/0965a2c198b3a114fbb6706435786b73
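A minimal sketch of the pattern described above (not taken from the linked gist), assuming fmt and this package are imported, where the state is deliberately reset to -1 on every call:

str := "🇩🇪🏳️‍🌈"
var c string
for len(str) > 0 {
	// Pass -1 instead of the previously returned state on every call.
	c, str, _, _ = uniseg.FirstGraphemeClusterInString(str, -1)
	fmt.Printf("%x ", []rune(c))
}
// [1f1e9 1f1ea] [1f3f3 fe0f 200d 1f308]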

Decode last grapheme cluster

Would it be possible to efficiently decode the last grapheme cluster in a string/byte slice? The First functions are great, but it would be nice to be able to iterate in reverse as well, just like with runes with utf8.DecodeLastRune.

I briefly looked over the Unicode spec, and it said "By constructing a state table for the reverse direction from the same specification of the rules, reverse iteration is possible," which suggests that it is possible, though I don't know if it's easy to implement.

Thanks!

Emoji detection

Since there is the emojiPresentation map, could this library be extended to detect emojis? I have a use case where I want to remove emojis from text, but due to a lack of options it seems I have to use github.com/forPelevin/gomoji, which uses this library but also ships the entire emoji database, a 1.25 MB map that needs to be loaded into memory, which I am not liking. Hence my question.

Unicode 15.1.0 support

Unicode 15.1.0 is released.

It contains changes to both UAX #14 and UAX #29:

Significant updates have been made to UAX #14, Unicode Line Breaking Algorithm and UAX #29, Unicode Text Segmentation adding better support for scripts of South and Southeast Asia, including grapheme cluster support for aksaras and consonant conjuncts, and line breaking at orthographic syllable boundaries.

Make runeWidth a public function

There are many modules using uniseg for the StringWidth function that could also use a RuneWidth function. Could this function be made accessible so that there is no need for incorrect implementations to be used?

Examples of another module providing both of these functions:
RuneWidth - https://github.com/mattn/go-runewidth/blob/master/runewidth.go#L307
StringWidth - https://github.com/mattn/go-runewidth/blob/master/runewidth.go#L322

As you can see, the StringWidth function relies on uniseg. Any application using go-runewidth for StringWidth could swap directly over to uniseg for the same capability. Comparing the benchmarks of the functions shows nearly 2x faster performance for uniseg.

Unfortunately, runeWidth is not public so I am unable to compare the performance. I do not see any negatives with making this function public. Is this something you would consider changing?
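Until such a function is exported, a rough stopgap (not part of the issue, and ignoring grapheme context) is to measure a single rune as a standalone string:

// Approximates a per-rune width by measuring the rune in isolation.
func runeWidth(r rune) int {
	return uniseg.StringWidth(string(r))
}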

go get with error

[root@VM-4-7-centos width]# go version
go version go1.17.12 linux/amd64

[root@VM-4-7-centos width]# go get github.com/rivo/uniseg

github.com/rivo/uniseg

/root/go/src/github.com/rivo/uniseg/properties.go:137:6: missing function body
/root/go/src/github.com/rivo/uniseg/properties.go:137:20: syntax error: unexpected [, expecting (

Building with bazel

Have you used uniseg in a bazel project? I'm not able to import it due to there not being a BUILD.bazel file.

Issue

└─$ go run main.go

github.com/rivo/uniseg

../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:137:20: error: expected ‘(’
137 | func propertySearch[E interface{ [3]int | [4]int }](dictionary []E, r rune) (result E) {
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:137:23: error: expected ‘]’
137 | func propertySearch[E interface{ [3]int | [4]int }](dictionary []E, r rune) (result E) {
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:137:23: error: expected ‘;’ or newline after top level declaration
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:140:9: error: expected declaration
140 | to := len(dictionary)
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:141:9: error: expected declaration
141 | for to > from {
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:143:17: error: expected declaration
143 | cpRange := dictionary[middle]
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:144:17: error: expected declaration
144 | if int(r) < cpRange[0] {
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:146:25: error: expected declaration
146 | continue
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:147:17: error: expected declaration
147 | }
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:148:17: error: expected declaration
148 | if int(r) > cpRange[1] {
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:150:25: error: expected declaration
150 | continue
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:151:17: error: expected declaration
151 | }
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:152:17: error: expected declaration
152 | return cpRange
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:153:9: error: expected declaration
153 | }
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:154:9: error: expected declaration
154 | return
| ^
../../../go/pkg/mod/github.com/rivo/[email protected]/properties.go:155:1: error: expected declaration
155 | }

Inconsistencies for some string widths

Have been doing some testing with terminal compatibility and seeing how uniseg differs in string width results.

Using WezTerm, I have found the results nearly identical when comparing 4733 emojis. Great result compared to many other terminals. 👍

However the following table was output from my test app with some differences. I am uncertain where the problems are and am seeking some guidance on the expected outcome.

┌12┐ ┌ String Width ────────────────────┐ ┌ Info ─────────────────────┐
│🏻│ │ Library: 0 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [1f3fb]
│🏼│ │ Library: 0 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [1f3fc]
│🏽│ │ Library: 0 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [1f3fd]
│🏾│ │ Library: 0 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [1f3fe]
│🏿│ │ Library: 0 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [1f3ff]
│〰│ │ Library: 1 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [3030]
│〽│ │ Library: 1 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [303d]
│*️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [2a fe0f 20e3]
│0️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [30 fe0f 20e3]
│1️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [31 fe0f 20e3]
│2️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [32 fe0f 20e3]
│3️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [33 fe0f 20e3]
│4️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [34 fe0f 20e3]
│5️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [35 fe0f 20e3]
│6️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [36 fe0f 20e3]
│7️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [37 fe0f 20e3]
│8️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [38 fe0f 20e3]
│9️⃣ │ │ Library: 2 Terminal: 1 Result: ✗ │ │ Padding: 1 Variant:  true │ Codepoints: [39 fe0f 20e3]
│🈂│ │ Library: 1 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [1f202]
│🈷│ │ Library: 1 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [1f237]
│㊗│ │ Library: 1 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [3297]
│㊙│ │ Library: 1 Terminal: 2 Result: ✗ │ │ Padding: 0 Variant: false │ Codepoints: [3299]
└──┘ └──────────────────────────────────┘ └───────────────────────────┘
Success: 4711, Total: 4733 (99.54%)

In this output Library means uniseg and Terminal means WezTerm. My main concerns are the first 5 being reported as a width of 0 when clearly there is some output.

However, I do realise that these may be like other selectors and by themselves act like ZWJ (0 width). Is this a case where it only applies with people emojis and changes their skin colour? Does this mean by themselves they are incomplete (and meaningless)?

WezTerm was also configured with the following option:

config.unicode_version = 14

Adding @wez to this because either way, there are some minor Unicode differences between uniseg and WezTerm so there is a bug somewhere. 😄

@jquast I used your method from ucs-detect to query the cursor position before and after sending the emoji.

properties.go err

Hi, I got a problem compiling my ngrok program, which relies on uniseg but generated an error. I checked that the local properties.go and the remote one are consistent. What should I do?

Characters like Japanese and Chinese have a width of 1.5

I am using this module for my CLI and I can't seem to get characters like Japanese and Chinese to work.

After some digging I found that each such character has a width of 1.5, not 1. So I think that to solve it, we can check if the character is from that specific range, multiply by 1.5, and then round it off so it can work. There may be better solutions, but this is one that I can think of!

I don't know any terminal which supports my module with these character sets.

Questions about possible PR for performance improvements

Before I spend further time on cleaning it up (currently it works well for me), or worse, just fire off a pull request as-is, I'd like to know if you would be open to some performance improvements that change the internals of the Graphemes type¹ and relegate the existing grTransitions map to use only at startup to generate a map of all the possible transitions². If you are open to it, do you have any preferences for such a PR (over and above the tview contributing guide which you've mentioned applies here)? E.g. do you prefer such a PR to be all-in-one single-commit or split up piece-by-piece into multiple commits changing one logical piece at a time with commit comments describing each step in detail (which could make code review easier)?

I have code that makes the above changes (plus some other simpler ones) that together improve performance significantly on my (admittedly currently simple³) benchmark test cases. benchstat reports ~-45% CPU and ~-65% allocations.
The only performance regression appears to be the (hopefully rare) case of client code doing repeated Reset/Runes calls on the same Graphemes object; i.e. something like the following, when the outer loop count is high, is ~10% slower and does more allocations (one small one per Runes call vs. two large ones per NewGraphemes call):

for g := uniseg.NewGraphemes(str); someCondition; g.Reset() {
	for g.Next() {
		doSomething(g.Runes())
	}
}

Also, I'm cognisant of the valid concern that optimisation should not (usually) be done at the cost of readability/complexity/maintenance; I personally find the changes I have to mostly improve readability (but of course that can be subjective opinion) and improve (or not change) maintainability.

Let me know what you think, thanks!


¹ Replacing the constructed codePoints and indices rune slices with just the original string.
I only imagine this could be an issue if you have ideas for future features/changes that relied on the existing pre-built codePoints or indices slices. E.g. something like randomly addressing grapheme clusters by index rather than the current sequential access. (By the way, a side effect of this is that the Runes method would no longer return a sub-slice of internal state that the client can, but shouldn't, change.)

² Currently my change is in the form of:

var grAllTransitions = func() map[…]… {
	var grTransitions = … existing map …
	for each grXXX, prXXX combination {
		… make an entry using existing code logic to find transition …
	}
	return new map
}()

(Here grAllTransitions is a map as a package global variable; grTransitions is a local variable so that it is either stack allocated or garbage collected after initialisation, partially mitigating the increased package memory requirements.)
Next() then just does a simple lookup in grAllTransitions avoiding the run-time logic of picking between entries. Currently this changes the 30 entry map into a 165 entry map (but the map entries are each slightly smaller as they can exclude the rule number; overall with some other type changes the total map size goes from ~2.7 to ~10 Kbytes on amd64). I know the properties code has a comment indicating that it only has a subset of Unicode properties so a possible concern would be if you intend/expect to need many more properties (or states) than currently in use (11 states × 15 properties = 165 entries).
An alternative with the same performance benefits would be to move the grTransitions map to a source file only used by go generate to generate a static version of grAllTransitions. This would have the benefit of no package initialisation code but at the cost of go generate complexity.

³ Exact benchmark values depend heavily on the input string length; one TODO item I have is to add benchmarks for short, medium, and long inputs as well as variation from plain ASCII through to all multi-rune-grapheme-clusters inputs.
Any suggestions on sources of more representative inputs to benchmark against? I'm currently just ramming together the original field of testCases[10:20] as a single input to the benchmarks ☹️ (i.e. a single input of 139 bytes, 44 runes, 18 grapheme clusters).
Also, the current benchmark results linked to above are effectively doing:

BenchmarkGraphemeXXX/New:   benchLoop { NewGraphemes(s);            for g.Next { g.XXX() } }
BenchmarkGraphemeXXX/Reset: NewGraphemes(s); benchLoop { g.Reset(); for g.Next { g.XXX() } }

Result difference between uniseg.GraphemeClusterCount and PHP grapheme_strlen

Hello!

I observed the following difference with the mentioned sequences. To be honest I'm not sure which one is correct, but could you help confirm whether the result is expected with the uniseg library?

Thank you!

--

Golang with uniseg.GraphemeClusterCount

package main

import (
	"fmt"

	"github.com/rivo/uniseg" // v0.4.3
)

func main() {
	fmt.Println(uniseg.GraphemeClusterCount("\r\n\uFE0E"))
	fmt.Println(uniseg.GraphemeClusterCount("\n\uFE0E"))
}

Output:

1
2

https://goplay.tools/snippet/WBIJQfKZs7g

PHP 8.0.28 with grapheme_strlen

<?php
printf("%d\n", grapheme_strlen("\r\n\u{FE0E}"));
printf("%d\n", grapheme_strlen("\n\u{FE0E}"));

Output:

2
2

https://onlinephp.io/c/2cb86

Any chance of implementing word-segmenting?

Hello! I could really use a Golang implementation of the word-splitting rules in UAX#29. I think that with the kind of parser framework you have here, it'd be relatively easy to implement (I could even hack away at a probably-poor pull request, if you were interested). Have you considered adding it to the library?

Changelog? Semver?

Hi!

Thank you for this project!

Could you please add a changelog to it? And info on whether you follow semver? Because without these it's hard to decide when and how to start using a newer version.

Please consider tagging this repo

Hello,

I'm attempting to clean up a go.mod file to make it easier to discover what versions changed. One of the dependencies is this repo, which appears to have no semver-compatible tags. Please consider tagging this repo (e.g. v0.0.1) so that it is easier to see when it's updated in go.mod files without having to remember dates and hashes.

Thank you for your consideration, and for your work on this module!

Feature request: Grapheme.String()

Could you please make Grapheme compliant with Stringer interface?

// String returns the string used to initialize the cluster satisfying Stringer interface.
func (g *Graphemes) String() string {
	return g.original
}

Thanks!

libstemmer

hey Rivo !

Do you think I can use uniseg to replace libstemmer?
It's for indexing in bleve!

Syntax error in properties.go:130:20

Dear rivo,

Thank you very much for your contribution to uniseg. I encountered an error while installing ngrok in the "make release-client" step. It seems to be due to a syntax error in the file properties.go. The details are as follows:

[screenshot: 未命名 ("Untitled")]

as the function name is not purple here.
[screenshot: 未命名2 ("Untitled 2")]

Support for Unicode 13

Unicode 13.0 was released on 10 March 2020, so I think this repo should be updated accordingly.

Variation Selectors incorrectly modify some StringWidths

When a Variation Selector follows a character that doesn't support it, the width should not be altered. Currently, uniseg reports a width of 2 for any grapheme which has a VS16 selector in it, regardless of whether the first rune is an emoji or not.

Example

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	// 2
	fmt.Println(uniseg.StringWidth("x\uFE0F"))
}

From the Unicode standard:

[image: excerpt from the Unicode standard on variation selectors]

Proposed solution

When encountering a VS16 selector, uniseg should verify that the previous rune is indeed an emoji.

Improve emoji consistency with older terminals

For accurate rendering of emojis, it is important that both the terminal and the library are consistent in determining the width of emojis. Unicode 14 clarified that emoji presentation using variation selector 16 is double width. This change has created a problem.

Many terminals (such as macOS Terminal, Alacritty, Hyper, VS Code) do not support the latest Unicode standard. This means they may not display newer grapheme clusters correctly. However, an even bigger problem is that they render older emojis differently, especially those using variation selector 16.

One of the guiding principles of uniseg is @rivo's aim for perfection. uniseg feels like a reference implementation and helps identify problems with other implementations. But users don't care about perfection, they care about compatibility.

This puts uniseg in a difficult place. While uniseg is correct, from a compatibility perspective, it is providing different results to the majority of terminals. This creates a poor experience as the only option is to tell the developer of the terminal to upgrade their handling of Unicode. With many terminals dependent on xterm.js this is nearly impossible for them to fix.

Can we find a way to support older terminals without trying to support multiple Unicode versions?

A global option to override how variation selector 16 is handled is obviously one approach. The precedent has already been set with EastAsianAmbiguousWidth. It doesn't change much but would solve the biggest rendering difference. iTerm2 provided an advanced option specifically for this case. WezTerm has an option for choosing which Unicode standard is used.

Another approach (for which I have created a proof of concept) is the ability to override the result for specific code points. Implementing this as a global (with all the downsides that entails) allows all dependencies that rely on uniseg to provide the same consistent results. While my implementation is crude, it should demonstrate the idea. The benefit is that this would shift compatibility overrides to the applications that use uniseg - the tech debt stays out of uniseg.

I'd certainly understand if this was marked as WONTFIX, clearly this is not a problem with uniseg, but it is something that creates problems for Go applications that rely on uniseg but have users on older terminals. Maintaining my fork with the hack is fairly easy, but felt an issue to discuss this might be appropriate.

golang v1.19 support

Currently we are having a problem using this library in Go 1.19 projects.

note: module requires Go 1.18

Go 1.19 is now the latest stable version. Are you planning to support 1.19 soon?
