tc39 / proposal-intl-segmenter

Unicode text segmentation for ECMAScript

Home Page: https://tc39.github.io/proposal-intl-segmenter/


proposal-intl-segmenter's Issues

CreateIterResult is not specified

In https://tc39.github.io/proposal-intl-segmenter/#sec-segment-iterator-prototype-next
steps 5 and 13 mention CreateIterResult, but no such operation is defined in the current spec.
"
5. If done is true, return CreateIterResult(undefined, true).
...
13. Return CreateIterResult(result, false).
"
@littledan @Ms2ger
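
For context, ECMA-262 defines an abstract operation named CreateIterResultObject(value, done), which is presumably what these steps intend. A rough JavaScript rendering of what it produces; the function name and the placeholder result below are illustrative only, not spec text:

```javascript
// Illustrative stand-in for ECMA-262's CreateIterResultObject(value, done).
function createIterResultObject(value, done) {
  return { value, done };
}

// Roughly what step 5 would return once iteration is finished.
const finished = createIterResultObject(undefined, true);

// Roughly what step 13 would return; the result shape here is a placeholder.
const inProgress = createIterResultObject({ index: 4 }, false);

console.log(finished, inProgress);
```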

Consider removing the locale from grapheme segmentation

A strong piece of feedback from the September 2016 TC39 meeting (from @bterlson, @rniwa and others) was that the locale is not needed for grapheme breaks, so it should not be a parameter. However, I later spoke with Mark Davis, who said that logic for "extended grapheme clusters", e.g., for Indic scripts, is still in flux, and he recommended that all new APIs for grapheme segmentation take a locale as a parameter for future-proofing.

Intl.Segmenter vs. Intl.Segmentation

This is perhaps minor, and it's difficult to muster strong feelings about it, but I believe that the new API proposed here should be named Intl.Segmentation rather than Intl.Segmenter. Not only is "segmentation" the term used in UAX #29, but it would also be more consistent with the existing constructors, which with the exception of Collator are not agent nouns even though they could be (e.g., we have Intl.NumberFormat rather than Intl.NumberFormatter and Intl.PluralRules rather than Intl.Pluralizer).

Please provide opinions on this change.

// analogous to: let formatter = new Intl.NumberFormat("fr");
let segmenter = new Intl.Segmentation("fr", {granularity: "word"});

// analogous to: formatter.format(number)
let boundaries = segmenter.segment(input);

False statement in Q: Why is this an Intl API instead of String methods

The only internationalization-related functions that are placed as methods on String, Date and Number are simpler functions that just convert to a string, e.g., toLocaleString(), toLocaleUpperCase()--these may take a locale argument, but no options bag.

this is not true:

date.toLocaleString(locales, options)
date.toLocaleDateString(locales, options)
date.toLocaleTimeString(locales, options)
number.toLocaleString(locales, options)
typedArray.toLocaleString(locales, options)

all take i18n-related options.
Usually these APIs are a subset of what is also exposed in different ways on the Intl object, e.g. new Intl.DateTimeFormat(locales, options).format(date), which also has formatToParts() that is not on the Date prototype. I think it's therefore fair to discuss whether parts of this Intl API should also be exposed on the String prototype to make it easier to use.
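
A small sketch of the overlap described above, using Date. The variable names are arbitrary, and explicit date fields plus a fixed time zone are passed so that both paths receive identical options:

```javascript
const date = new Date(Date.UTC(2020, 0, 15));
const dateLocales = "en-US";
const dateOptions = { year: "numeric", month: "long", day: "numeric", timeZone: "UTC" };

// The prototype method accepts the same locales/options pair...
const viaMethod = date.toLocaleDateString(dateLocales, dateOptions);
// ...as the Intl.DateTimeFormat constructor.
const viaIntl = new Intl.DateTimeFormat(dateLocales, dateOptions).format(date);

console.log(viaMethod === viaIntl);

// formatToParts() is only reachable through the Intl object.
const partTypes = new Intl.DateTimeFormat(dateLocales, dateOptions)
  .formatToParts(date)
  .map((part) => part.type);
console.log(partTypes);
```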

Review from the W3C i18n WG

The W3C Internationalization Working Group looks deeply into issues of text display, including line breaking, which this proposal also touches on. If they are available, I'd appreciate a review from them. cc @aphillips @stpeter

Add FAQ: why expose the API on `Intl` instead of `String`?

Provide a locale-less segmenter

We should have the ability to create a segmenter that does exactly what UAX 29 tells it to do, and nothing more.

Convenience API suggestion: return the length

E.g. Intl.Segmenter.count("string", "fr", { type: "grapheme" }) would be quite useful for e.g. a Twitter-style service limiting you to 140 chars.
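
A hedged sketch of how such a count can be obtained in userland. It assumes an engine that ships Intl.Segmenter in its final form (which uses the option name granularity rather than type); countGraphemes is a hypothetical helper, not a proposed API:

```javascript
// Hypothetical helper: counts grapheme clusters rather than UTF-16 code units.
function countGraphemes(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count++;
  }
  return count;
}

const accented = "e\u0301"; // "e" followed by a combining acute accent
console.log(accented.length);                // 2 code units
console.log(countGraphemes(accented, "fr")); // 1 grapheme cluster
```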

Clarify initial value of [[SegmentIteratorBreakType]]

CreateSegmentIterator has:

Let iterator.[[SegmentIteratorBreakType]] be an implementation-dependent string representing a break at the edge of a string.

while AdvanceSegmentIterator has:

Set iterator.[[SegmentIteratorBreakType]] to a string representing the type of break found, using one of the values found in the table Table 2, or undefined if the boundaries of the string are reached, or if there is no meaningful type for the granularity.

I'm guessing the same restrictions are supposed to apply to the implementation-dependent value set in CreateSegmentIterator.

precedingSegmentType misrepresents list-valued segment data

As I've mentioned before, there are issues with using a scalar-valued precedingSegmentType to expose list-valued segment data:

Imprecision. The word- and sentence-granularity types in the current spec collapse together several combinations of what logically and in ICU are independent flags (e.g., a "word" can contain or not contain letters/numbers/kana/ideographs/etc., a non-word "none" can contain punctuation or whitespace, a "term" sentence may or may not include whitespace and/or a line terminator after its terminal punctuation).
Incompleteness. Consumers can observe that a sentence segment has terminating punctuation, but cannot easily determine what that punctuation is.
Nonconfigurability. Implementations are required to test characters as they iterate through a string, but consumers have no way to express specific details that matter for their application such as the presence of numbers or non-ASCII characters or special-significance symbols like # or @.
Future-hostility. If we ever do want to be more expressive, backwards compatibility will motivate preserving the existing field and its semantics with ugliness like describing a word-granularity segment "&" as { index: 42, precedingSegmentType: "none", precedingSegmentTags: ["punctuation"] }.

If collecting details about segments during internal iteration and exposing them later to avoid the need for author-level re-iteration is important (and it seems to be), then this API should expose that information in a way that doesn't suffer from the above issues.

For example, replacing string precedingSegmentType with array-of-strings precedingSegmentTags would address imprecision and future-hostility and also suggest a straightforward future extension to address incompleteness and nonconfigurability (e.g., new Intl.Segmenter(locale, {granularity: "word", customTags})).
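
To make the list-valued idea concrete, here is a rough sketch of how tags could be derived for a segment. The tag names and the tagsFor helper are hypothetical (they appear in neither the spec nor ICU); the checks use standard Unicode property escapes:

```javascript
// Hypothetical tag names, each paired with a Unicode property test.
const TAG_TESTS = [
  ["letter", /\p{L}/u],
  ["number", /\p{N}/u],
  ["punctuation", /\p{P}/u],
  ["whitespace", /\p{White_Space}/u],
];

// Returns every tag whose character class occurs somewhere in the segment.
function tagsFor(segment) {
  return TAG_TESTS.filter(([, pattern]) => pattern.test(segment)).map(([tag]) => tag);
}

console.log(tagsFor("A113")); // ["letter", "number"]
console.log(tagsFor("&"));    // ["punctuation"]
```

A scalar type would have to pick one label for "A113", whereas a list can carry both.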

Should segmentIterator.following(n) match a break position at n?

Current text from the README:

%SegmentIterator%.prototype.following(index)
Move the iterator to the next break position after the given code unit index index, or if no index is provided, after its current position. Returns true if the end of the string was reached.

This is divergent from analogous behavior in RegExp.prototype.exec (starting at lastIndex) and String.prototype.indexOf(searchString, position) (and possibly other preexisting APIs), which start at rather than after the relevant index. It would also be a bit odd for "reset" behavior to look like iterator.following(-1). Perhaps this aspect should be reconsidered.
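
The "start at the index" behavior of the two existing APIs mentioned above can be seen in a short example:

```javascript
const haystack = "abcabc";

// String.prototype.indexOf begins its search *at* position 3.
const fromIndexOf = haystack.indexOf("a", 3);

// A global RegExp begins matching *at* lastIndex.
const pattern = /a/g;
pattern.lastIndex = 3;
const fromExec = pattern.exec(haystack).index;

console.log(fromIndexOf, fromExec); // 3 3
```

Under the README text quoted above, following(3) would instead skip a break located exactly at index 3.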

Should next() include a property with the position?

The String.prototype.codePoints proposal does. See also tc39/proposal-string-prototype-codepoints#3

Code point vs code unit in position

In the October 2018 Intl call, @gibson042 raised some concern about the use of code units rather than code points to describe the offset.

To me, code unit is the only usable measure, given that that's what JS strings are based on; I didn't quite understand Richard's argument and can't reproduce it here well; maybe he could chime in with it.

cc @sffc @litherum
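
A brief illustration of why code unit offsets line up with JS string operations. This is only a sketch and assumes an engine that ships Intl.Segmenter in its final form, which may differ from the draft under discussion:

```javascript
const input = "\u{1F600}a"; // one astral code point (two code units), then "a"
const graphemes = new Intl.Segmenter("en", { granularity: "grapheme" });

// Offsets reported by the segmenter are UTF-16 code unit indices.
const offsets = Array.from(graphemes.segment(input), (s) => s.index);
console.log(offsets); // [0, 2]

// Because slice() also counts code units, the offset can be used directly.
console.log(input.slice(offsets[1])); // "a"
```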

Where is the spec of reverseSegment?

https://tc39.github.io/proposal-intl-segmenter/#segment-iterator-objects mentions
"and Intl.Segment.prototype.reverseSegment",
but I cannot find a specification of Intl.Segment.prototype.reverseSegment.

Should we either

add a specification of Intl.Segment.prototype.reverseSegment, or
remove "and Intl.Segment.prototype.reverseSegment" from the following text:
"The methods Intl.Segment.prototype.segment and Intl.Segment.prototype.reverseSegment return iterators over the segments for a particular string. This section describes those iterator objects."

so that it reads
"The method Intl.Segment.prototype.segment returns iterators over the segments for a particular string. This section describes those iterator objects."

Match ECMA-402 formatting

`undefined` → *undefined*
*0* → 0
CR → <CR>, LF → <LF>, etc.
properly reference r.[[locale]] rather than r.[[Locale]] (cf. ResolveLocale)
add a section to describe the internal slots of segment iterators
etc.

Add back granularity: "line"?

Now that the interface has been clarified as iterating over boundaries, there is potential to reintroduce { granularity: "line" } implementing UAX #14 or an implementation-specific alternative. https://twitter.com/AaronPresley/status/1116424359223054336 provides a clear (if limited) use case that was missing at the time of #49.

But on the other hand, there's nothing preventing us from keeping it scoped out of this proposal and potentially adding it in a followup. I'm comfortable with either, but didn't want to bury the new demand that has emerged since December.

Script runs

Would splitting text into script runs be in scope for the Intl.Segmenter API?

The invocation of ResolveLocale is missing an argument

The current specification of Intl.Segmenter invokes ResolveLocale with only 4 arguments, omitting the mandatory localeData:

Let r be ResolveLocale(%Segmenter%.[[AvailableLocales]], requestedLocales, opt, %Segmenter%.[[RelevantExtensionKeys]]).

Add an @@iterator method to the iterator prototype

Without it, these objects don't actually implement the Iterable interface and for-of/Array.from/etc. won't properly consume them.

Based on the random access pattern, I think we just want a @@iterator that returns the receiver.
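
A minimal sketch of the suggestion, using a hypothetical stand-in object rather than a real segment iterator; only the iteration protocol matters here:

```javascript
// Hypothetical iterator factory; makeIterator is not part of any spec.
function makeIterator(values) {
  let position = 0;
  return {
    next() {
      return position < values.length
        ? { value: values[position++], done: false }
        : { value: undefined, done: true };
    },
    // The suggested @@iterator: simply return the receiver.
    [Symbol.iterator]() {
      return this;
    },
  };
}

const it = makeIterator([4, 5, 8]);
it.next(); // consume the first value manually
console.log(Array.from(it)); // [5, 8]
```

Because the receiver is returned, for-of and Array.from continue from wherever the iterator currently is, which fits the random access pattern.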

Should we be supporting multiple ways of grapheme breaking for a particular locale?

There are two ways of grapheme breaking consonant conjuncts in Indic texts. The Unicode bug seems to want to provide APIs for both. Should we do so too?

Add lb option

The lb Unicode extension key is equivalent to the "strictness" option. In the spirit of parity between options and tags, we should add support for this key to the segmenter.

Lone surrogates

How does this API deal with lone surrogates? Are those a valid grapheme cluster?
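
One way to probe this empirically, assuming an engine that ships Intl.Segmenter in its final form. This does not settle what the draft should say; it only shows that the reported segments still cover the whole input when it contains an unpaired surrogate:

```javascript
const lone = "a\uD800b"; // unpaired high surrogate between two letters
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

const pieces = Array.from(graphemeSegmenter.segment(lone), (s) => s.segment);

// How many segments are reported is left to the engine's segmentation data.
console.log(pieces.length);
// The segments partition the input, so joining them restores it.
console.log(pieces.join("") === lone);
```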

Support CSS word-break property

CSS defines a couple tailorings of line breaking, in the word-break property, which can have three values: normal, keep-all, break-all. None of these, including break-all, expose grapheme breaks, but rather they are modifications of UAX 14. Expose these tailorings for direct usage through Intl.Segmenter.

cc @eaenet

{granularity: "line"} promotes reimplementing paragraph layout in script

The only use case I can imagine for line break iterators would be people trying to do their own paragraph layout themselves (e.g. eventually painting into a canvas).

The best way to perform paragraph layout in a browser is to use HTML elements and CSS. An author trying to do it themselves with JavaScript would almost certainly end up with something slower, less correct, and less accessible than the browser's engine.

This probably isn't true for the other segmenters - I can think of plenty of use cases for the other ones, but if there is wide adoption of line breaking, specifically, it would be unfortunate for the Web.

Editorial: Refactor for normative reference by higher web specs

Other web specs do various kinds of breaking, especially line breaking. Factor the Intl.Segmenter spec text such that there is an abstract algorithm that they can call to get at the break, to cement the fact that Intl.Segmenter uses the same breaking algorithm as higher web specs.

cc @annevk

Incomplete amount of exposed state on %SegmentIteratorPrototype%

It's a bit strange for each segmentIterator to expose some but not all of the data returned by its next method. I'd be in favor of dropping breakType, or—if keeping it serves some important purpose—replacing both it and position/index with an accessor that echoes the most recent object from next.

Should we include all the break types that ICU includes, or a more limited set?

ICU break types are documented here. Some things seem especially important (whether something is a word or not, whether it's a hard or soft line break, whether the sentence break is induced by a line break or punctuation token), and others seem less important (whether the word starts with a number, letter, ideographic or kana character).

Should we be taking that latter category of distinctions within word breaks with us when defining Intl.Segmenter? I'm no expert here, but the existing ICU distinctions feel a bit arbitrary. Also, when I ran a simple test, it seems like katakana and hiragana characters are categorized as "ideo"--is the kana category just historical? I don't see these distinctions documented within UAX 29 either--they don't seem to correspond to values of the Word_Break property.

I wonder if, in a new API, we should just group the word break types for number, letter, kana and ideo together into a category for the word (as opposed to the whitespace or punctuation). Thoughts?

cc @jungshik

Rename Segment Iterator to Boundary Iterator

Segment Iterator → Boundary Iterator
SegmentIterator* → BoundaryIterator*

cf. #67 (comment)

Please, BoundaryIterator rather than BreakIterator. "Boundary" is better because it applies more generally (e.g., there are grapheme boundaries but not really grapheme breaks). It was also the predominant term recorded in meeting notes:

RG: It sounds like there's a rough consensus over making this an iterator over the boundaries.

FT: How about "boundaryType"?

RG: It seems like we have an agreement on the conceptual model. I'd like to follow up with a PR. If you're breaking on words, for example, do you need to distinguish segments that are whitespace, for example, compared to a segment of letters?

Should only strings and objects be allowed in Segmenter.prototype.segment?

Currently we call ToString on the argument passed to Segmenter.prototype.segment giving us some nice wat results like:

var s = new Intl.Segmenter("en", {granularity: "word"});
s.segment().next().value.segment // "undefined"
s.segment(null).next().value.segment // "null"
s.segment(true).next().value.segment // "true"
s.segment(false).next().value.segment // "false"

Should we type check if the argument is a string or object and then only ToString them?
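
A sketch of what such a check could look like in userland. segmentStrict is a hypothetical wrapper, and the code assumes an engine that ships Intl.Segmenter in its final form, which still applies ToString to the argument:

```javascript
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

// Hypothetical stricter wrapper: accept only strings and (non-null) objects.
function segmentStrict(segmenter, input) {
  const isString = typeof input === "string";
  const isObject = typeof input === "object" && input !== null;
  if (!isString && !isObject) {
    throw new TypeError("segment() expects a string or an object");
  }
  return segmenter.segment(input); // objects still go through ToString
}

// Unwrapped call: null is stringified before segmentation.
console.log([...wordSegmenter.segment(null)][0].segment); // "null"

// Wrapped call: the same input is rejected up front.
try {
  segmentStrict(wordSegmenter, null);
} catch (error) {
  console.log(error instanceof TypeError); // true
}
```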

Should "breakType" be renamed to "segmentType"?

During one of our design reviews, a colleague asked why we name this API "Segmenter" instead of BreakIterator but at the same time use the term "breakType" rather than "segmentType". He suggested that if we name this API "Segmenter", we should make all the names consistent and therefore rename "breakType" in the spec to "segmentType".

Different word for the two different "types"?

This is really minor, and aligning with existing APIs and vocabulary takes precedence IMO. But it might be worth considering whether we can use a different word (e.g. "kind") or some prefix ("segmentType") to disambiguate between the constructor option and the breakType.

Provide an example of complete iteration in the README

The current example in README starts to iterate over the words of "Ceci n'est pas une pipe", but breaks after the first result. I think it would be helpful to see the full results.
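
For illustration, a complete iteration might look like the following. This is a sketch against Intl.Segmenter as it eventually shipped in engines (segment, index, and isWordLike fields), which is not necessarily the shape of the README example at the time of this issue:

```javascript
const sentence = "Ceci n'est pas une pipe";
const frenchWords = new Intl.Segmenter("fr", { granularity: "word" });

// Collect every segment, including the whitespace between words.
const rows = Array.from(frenchWords.segment(sentence));

for (const { segment, index, isWordLike } of rows) {
  console.log(index, JSON.stringify(segment), isWordLike);
}
```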

Support preceding and following methods

ICU's BreakIterator class supports preceding and following methods to find breaks before or after a given character offset, without iterating from the beginning. It works by starting from the given offset and iterating in the reverse direction until a "safe" breakpoint (not dependent on context) is found past the offset, and then moving the opposite direction until a contextual break is found. Note that this is different from simply slicing the string and using the reverse or forward iterator on it, because context created by the characters before or after the sliced range would be lost.

This is useful e.g. when word wrapping glyphs, in which case the maximum possible character index that will fit on the line is known. In this case, one could use the preceding method to find the nearest valid line boundary prior to that character index. This would be faster than iterating all of the possible breaks from the beginning of the line until the character index is passed.

I'm not sure of the best API here, but a possible one could be to add the methods to the SegmentIterator. This would have the effect of moving the iterator to the nearest preceding or following break to the provided character index, and returning an iteration result similar to the one returned by next.
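
For comparison, the API that engines eventually shipped offers random access through a containing(index) method on the object returned by segment(), rather than preceding/following. A hedged sketch of finding the boundaries around an offset that way:

```javascript
const line = "Hello world";
const lineSegments = new Intl.Segmenter("en", { granularity: "word" }).segment(line);

// Look up the segment that covers code unit index 7 without iterating from 0.
const hit = lineSegments.containing(7);
console.log(hit.segment, hit.index); // "world" 6

// The boundaries before and after the offset follow from the segment itself.
const boundaryBefore = hit.index;
const boundaryAfter = hit.index + hit.segment.length;
console.log(boundaryBefore, boundaryAfter); // 6 11
```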

Rename %IteratorPrototype%.index to lastIndex?

They seem roughly equal in clarity, but lastIndex would be more consistent with e.g. RegExp.

Docs(MDN) : Documentation for Intl.Segmenter

Create documentation for Intl.Segmenter

Review Readme documentation and examples
Create MDN Main Docs Page

MDN Pages :

prototype
constructor
methods

Interactive Examples MDN :

Segmenter Generic Usage
Segmenter.prototype.segment(string)
Segmenter.prototype.next()
Segmenter.prototype.following(index)
Segmenter.prototype.preceding(index)
Segmenter.prototype.index
Segmenter.prototype.breakType

Browser compat-data :

Segmenter.prototype.segment(string)
Segmenter.prototype.next()
Segmenter.prototype.following(index)
Segmenter.prototype.preceding(index)
Segmenter.prototype.index
Segmenter.prototype.breakType

Support strictness

From @jungshik in tc39/ecma402#60

We also need to support 'strictness' (for lack of a better term) either as a separate option or as values of 'type'.

CSS3 has 'strict', 'normal', 'loose' (and 'auto') for line-break and ICU/CLDR support them. (When v8BreakIterator was written, there was no such distinction.)

Rename segment type values

UAX #29 rules describe where boundaries exist and don't exist, but don't classify segments or use properties to describe them. ICU does apply properties, but seems to do so as multiple flags rather than as a single label—for example, the word segment "A113" might be described by both LETTER and NUMBER (assuming the data are derived from/analogous to the Default Word Boundary Specification), while the sentence segment "Why?\n" is described by STERM and LF (assuming the data are derived from/analogous to the Default Sentence Boundary Specification).

I'm still not sure how I feel about collapsing what is logically a list into a scalar, but taking that as given for the sake of discussion, I'd like to see better naming.

Suggestions:

grapheme: undefined → "grapheme"
word: "none" → "space" and "punctuation" (i.e., three types rather than two)
sentence: "term" → "terminated"
sentence: "sep" → "fragment"

Convenience API suggestion: return an array

E.g. Intl.Segmenter.segments("string", "fr", { type: "word" })

Allow implementations to determine their own breaking rules

Different web browsers use different line breaking rules. WebKit and Blink use ICU's algorithms based on the Unicode standard, whereas Edge and Firefox use other algorithms. Some of these browsers might not even ship Unicode line breaking data.

Rather than normatively referencing Unicode algorithms here, instead say "such as" in a note, and leave it to implementations to use the appropriate breaking algorithms.

Enhance expressiveness and simplicity

Add a method to jump to a particular offset
Add a method to find the previous break
Remove the current convenience iterator in favor of even higher level convenience functionality along the lines of what @domenic suggests in other bugs here

Will this need full ICU in Node.js even for the 'en' locale?

With the last v8-canary build on Windows 7 x64 ([email protected], [email protected]):

'use strict';

console.log(Intl.Segmenter);
console.log(new Intl.Segmenter('en', {granularity: 'word'}));

With full ICU (node --icu-data-dir=.\node_modules\full-icu --harmony test.js):

[Function: Segmenter]
Segmenter [Intl.Segmenter] {}

Without full ICU (node --harmony test.js):

[Function: Segmenter]


#
# Fatal error in , line 0
# Check failed: U_SUCCESS(status).
#
#
#
#FailureMessage Object: 000000000021DB80

What is the prototype of %SegmentIteratorPrototype%?

Does %SegmentIteratorPrototype% have a [[Prototype]] slot with value %IteratorPrototype%? I don't see any spec text that says this. Is this an oversight?

Here's how the ECMA262 iterators specify this:
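
Independently of the spec wording, the relationship can be probed from script. A minimal sketch, assuming an engine that ships Intl.Segmenter (the shipped API may differ from the draft discussed in this issue):

```javascript
// %IteratorPrototype% is reachable as the grandparent of an array iterator.
const IteratorPrototype = Object.getPrototypeOf(
  Object.getPrototypeOf([][Symbol.iterator]())
);

// Obtain a segment iterator and walk up its prototype chain.
const segmentIterator = new Intl.Segmenter("en").segment("abc")[Symbol.iterator]();
const segmentIteratorProto = Object.getPrototypeOf(segmentIterator);

console.log(Object.getPrototypeOf(segmentIteratorProto) === IteratorPrototype);
```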

Cite special word break needs?

Some languages, such as Thai, Japanese, or Chinese, do not use spaces between words. Proper word breaking in these languages depends on special algorithms usually coupled with dictionaries. Since common operations that use word breaking include by-word text selection or indexing for full text search and these operations want true word boundary detection, it would be useful to note the special requirements of these languages. I believe the ICU library now incorporates several unencumbered dictionaries. I call this out because the references in the draft such as Unicode/TR29 and CLDR do not provide this support.

Consider adding @@toStringTag "Intl.Segmenter"

See tc39/proposal-intl-locale#44 and tc39/ecma402#176

Intl.Segmenter constructor should coerce options to Object first

In Intl.Segmenter, we have:

5. Let matcher be ? GetOption(options, "localeMatcher", "string", « "lookup",  "best fit" », "best fit").
6. Set opt.[[localeMatcher]] to matcher.
7. Let lineBreakStyle be ? GetOption(options, "lineBreakStyle", "string", « "strict",  "normal", "loose" », "normal").

and then much later we have the coercion of the options argument in to an Object:

11. If options is undefined, then
    a. Let options be ObjectCreate(null).
12. Else
    b. Let options be ? ToObject(options).

This is inconsistent with all the other intl objects. The coercion to Object should happen before properties of the options argument are accessed.
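
The intended ordering can be sketched in plain JavaScript. The helpers below are simplified, illustrative approximations of the ToObject and GetOption abstract operations, not spec text, and readSegmenterOptions is a hypothetical name:

```javascript
// Approximates "if undefined, ObjectCreate(null); else ToObject(options)".
function coerceOptions(options) {
  if (options === undefined) return Object.create(null);
  if (options === null) throw new TypeError("Cannot convert null to object");
  return Object(options);
}

// Approximates GetOption for string-valued options.
function getOption(options, property, allowed, fallback) {
  const value = options[property];
  if (value === undefined) return fallback;
  const text = String(value);
  if (!allowed.includes(text)) {
    throw new RangeError(`Invalid value for ${property}: ${text}`);
  }
  return text;
}

function readSegmenterOptions(rawOptions) {
  // Coercion happens first, before any property of options is read.
  const options = coerceOptions(rawOptions);
  const localeMatcher = getOption(options, "localeMatcher", ["lookup", "best fit"], "best fit");
  const lineBreakStyle = getOption(options, "lineBreakStyle", ["strict", "normal", "loose"], "normal");
  return { localeMatcher, lineBreakStyle };
}

console.log(readSegmenterOptions(undefined));
console.log(readSegmenterOptions({ lineBreakStyle: "strict" }));
```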

cc @FrankYFTang

Add lw, ss options?

There are additional options for breaking, specifically:

lw -- word break style (normal, keepall, breakall, matching CSS)
ss -- sentence break suppression (none, standard -- standard might be the better behavior here)

It's not clear how important these options are. When we discussed this issue in the ECMA 402 VC meeting in January 2018, @srl295 argued that all options should be presented, while @jungshik argued that the line breaking options are the most important ones. Do these options require taking up additional data size? If so, this is an additional argument against them.

Reconsider "graphemes"

"grapheme" is a vague term. It can be used to mean the written counterpart of a phoneme (in which case "ee" in "feel" is a grapheme). It can also be used to talk about individual marks (like an "accent"). Unicode never tries to talk in terms of graphemes, and I don't think we should either. This is why Unicode defines grapheme cluster; to get rid of this ambiguity.

We could use "grapheme clusters" here, but we're doing tailored segmentation, so it's not really that.

Unicode defines GCs and EGCs as an approximation of "user-perceived character". I wonder if we can use "characters" instead. Though that's just as ambiguous.

Just hoping we can come up with a better name here.

Is this an API for iterating segments, or boundaries?

The README and spec text seem to bounce somewhat between describing the functionality as "iterating over segments" and "iterating over breaks/boundaries", and the confusion bleeds a little bit into the proposed interfaces. Can we settle on a consistent model and align everything to that?

Relevant Unicode vocabulary is described at #44 (comment) , but the even shorter summary is that both grapheme/word/sentence segmentation and line breaking completely partition a nonempty string into a sequence of nonempty segments terminated by boundaries if we normalize treatment at the start of text (where UAX #29 recognizes a boundary but UAX #14 prohibits a break opportunity). A grapheme boundary follows every "character", a word boundary follows every "word" and every non-word character, a sentence boundary follows every sentence-terminating punctuation after immediately following linear whitespace and up to one line terminator, and a line break opportunity follows every space. In all granularities, a boundary immediately precedes the end of text.

I personally feel like break associates more strongly with lines than with graphemes, words, or sentences, so I'd like to avoid general use of that term, leaving for consideration boundary vs. segment.

Given nonempty input,

a segment iterator identifying segments by the first included position would always have a first result starting at code unit index 0 and a final result starting at or before code unit index len − 1, and results (being segments) could easily be categorized by their constituent code points (although describing line segments as hard vs. soft or mandatory vs. optional is less intuitive than one would hope).
a boundary iterator identifying boundaries by the immediately following position would always have a first result starting at or after code unit index 1 and a final result starting at code unit index len, and results (being boundaries) would have and possibly reference preceding segments (though line boundaries could be intuitively categorized as hard vs. soft or mandatory vs. optional).

My preference is leaning towards the latter, but this determination should really be made by analyzing the consequences upon consumers, both via next and via methods like following/preceding (cf. #52).
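
The two models can be compared side by side on a small input. This sketch uses Intl.Segmenter as shipped in current engines, which reports segments; the boundary positions are derived here for illustration only:

```javascript
const sample = "ab cd";
const sampleSegments = Array.from(
  new Intl.Segmenter("en", { granularity: "word" }).segment(sample)
);

// Segment model: each result is identified by its first included position.
const segmentStarts = sampleSegments.map((s) => s.index);

// Boundary model: each result is identified by the position that follows it.
const boundaries = sampleSegments.map((s) => s.index + s.segment.length);

console.log(segmentStarts); // [0, 2, 3]
console.log(boundaries);    // [2, 3, 5]
```

The first segment starts at 0 and the last boundary equals the string length, matching the two descriptions above.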

"lb" Unicode extension key follow-up

https://tc39.github.io/proposal-intl-segmenter/#sec-Intl.Segmenter-internal-slots

CLDR defines several extension keys, but this specification does not expose them.

The note should be updated now that "lb" is exposed.

https://tc39.github.io/proposal-intl-segmenter/#sec-Intl.Segmenter

The default value for "lineBreakStyle" needs to be removed, otherwise it's not possible to modify the line break style using "lb" (because option values always override Unicode extension values, cf. ResolveLocale abstract op).
IOW new Intl.Segmenter("de-u-lb-strict", {granularity:"line"}).resolvedOptions().lineBreakStyle is "normal" with the current spec, because the default option value is always used.
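
The same precedence rule can be observed with an existing constructor. The sketch below uses Intl.Collator and its "kn" extension key purely as an analogy, since the "lb" key and lineBreakStyle option are specific to the draft discussed here:

```javascript
// With no option given, the Unicode extension value is honored.
const fromTag = new Intl.Collator("en-u-kn-true").resolvedOptions().numeric;

// An explicit option value overrides the extension value in the tag.
const fromOption = new Intl.Collator("en-u-kn-true", { numeric: false })
  .resolvedOptions().numeric;

console.log(fromTag, fromOption); // true false
```

A non-undefined default for an option behaves like the explicit value in the second call, so the extension key could never take effect.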

Convenience: method to get current token from iterator

In every example I've written using this (like this one) I end up with this pattern:

const words = [];
const iterator = new Intl.Segmenter(locale, {type: 'word'}).segment(text);
let pos = iterator.index();
for (let {index, breakType} of iterator) {
  if (breakType !== 'none')
    words.push(text.slice(pos, index));
  pos = index;
}

i.e. I need to maintain pos tracking the previous index and slice the string myself. How about:

%SegmentIterator%.prototype.segment

Return a substring of the input string from last index to the current index.

...so you can just write:

const words = [];
const iterator = new Intl.Segmenter(locale, {type: 'word'}).segment(text);
for (let {index, breakType} of iterator) {
  if (breakType !== 'none')
    words.push(iterator.segment());
}
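
For comparison, a sketch of the same task against Intl.Segmenter as it eventually shipped in engines, where each iteration result carries the substring (segment) and a word-likeness flag (isWordLike) directly. extractWords is a hypothetical helper name:

```javascript
// Hypothetical helper: collect only the word-like segments of a string.
function extractWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike) words.push(segment);
  }
  return words;
}

console.log(extractWords("Hello, world!", "en")); // ["Hello", "world"]
```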
