tc39 / proposal-intl-segmenter
Unicode text segmentation for ECMAScript
Home Page: https://tc39.github.io/proposal-intl-segmenter/
In https://tc39.github.io/proposal-intl-segmenter/#sec-segment-iterator-prototype-next, steps 5 and 13 mention CreateIterResult, but no such operation is defined in the current spec.
"
5. If done is true, return CreateIterResult(undefined, true).
...
13. Return CreateIterResult(result, false).
"
@littledan @Ms2ger
A strong piece of feedback from the September 2016 TC39 meeting (from @bterlson, @rniwa and others) was that the locale is not needed for grapheme breaks, so it should not be a parameter. However, I later spoke with Mark Davis, who said that logic for "extended grapheme clusters", e.g., for Indic scripts, is still in flux, and he recommended that all new APIs for grapheme segmentation take a locale as a parameter for future-proofing.
This is perhaps minor, and it's difficult to muster strong feelings about it, but I believe that the new API proposed here should be named Intl.Segmentation rather than Intl.Segmenter. Not only is "segmentation" the term used in UAX #29, but it would also be more consistent with the existing constructors, which with the exception of Collator are not agent nouns even though they could be (e.g., we have Intl.NumberFormat rather than Intl.NumberFormatter and Intl.PluralRules rather than Intl.Pluralizer).
Please provide opinions on this change.
// analogous to: let formatter = new Intl.NumberFormat("fr");
let segmenter = new Intl.Segmentation("fr", {granularity: "word"});
// analogous to: formatter.format(number)
let boundaries = segmenter.segment(input);
The only internationalization-related functions that are placed as methods on String, Date and Number are simpler functions that just convert to a string, e.g., toLocaleString(), toLocaleUpperCase()--these may take a locale argument, but no options bag.
this is not true:
date.toLocaleString(locales, options)
date.toLocaleDateString(locales, options)
date.toLocaleTimeString(locales, options)
number.toLocaleString(locales, options)
typedArray.toLocaleString(locales, options)
all take i18n-related options.
Usually these APIs are a subset of what is also exposed in different ways on the Intl object, e.g. new Intl.DateTimeFormat(locales, options).format(date), which also has formatToParts() that is not on the Date prototype. I think it's therefore fair to discuss whether parts of this Intl API should also be exposed on the String prototype to make it easier to use.
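The overlap described above can be checked directly. This is a minimal sketch, assuming a runtime with ECMA-402 support; the helper name `formatBothWays`, the locale, and the options are arbitrary illustrations and not part of the proposal.

```javascript
// Format one number through the prototype method and through Intl.NumberFormat.
function formatBothWays(amount, locale, options) {
  const viaMethod = amount.toLocaleString(locale, options);
  const formatter = new Intl.NumberFormat(locale, options);
  const viaIntl = formatter.format(amount);
  // formatToParts() is only reachable through the Intl object.
  const partTypes = formatter.formatToParts(amount).map((part) => part.type);
  return { viaMethod, viaIntl, partTypes };
}

const result = formatBothWays(1234.5, "de-DE", { style: "currency", currency: "EUR" });
console.log(result.viaMethod === result.viaIntl); // both paths produce the same string
console.log(result.partTypes);
```

The prototype method is the convenient subset; the constructor exposes the extra surface.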
The W3C Internationalization Working Group looks deeply into issues of text display, including line breaking, which this proposal also touches on. If they are available, I'd appreciate a review from them. cc @aphillips @stpeter
We should have the ability to create a segmenter that does exactly what UAX 29 tells it to do, and nothing more.
E.g. Intl.Segmenter.count("string", "fr", { type: "grapheme" }) would be quite useful for e.g. a Twitter-style service limiting you to 140 chars.
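For comparison, a counting helper can be written in user code. This is a sketch only: it assumes the Intl.Segmenter API as it later shipped in ECMA-402 (option name `granularity`, iterable result of `segment()`), which differs from the static `count` suggested here, and `countGraphemes` is a hypothetical name.

```javascript
// Count user-perceived characters (grapheme clusters) rather than code units.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

const accented = "e\u0301"; // "e" followed by a combining acute accent
console.log(accented.length);          // 2 code units
console.log(countGraphemes(accented)); // 1 grapheme cluster
```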
CreateSegmentIterator has:
Let iterator.[[SegmentIteratorBreakType]] be an implementation-dependent string representing a break at the edge of a string.
while AdvanceSegmentIterator has:
Set iterator.[[SegmentIteratorBreakType]] to a string representing the type of break found, using one of the values found in the table Table 2, or undefined if the boundaries of the string are reached, or if there is no meaningful type for the granularity.
I'm guessing the same restrictions are supposed to apply to the implementation-dependent value set in CreateSegmentIterator.
As I've mentioned before, there are issues with using a scalar-valued precedingSegmentType to expose list-valued segment data:
#
or @
."&"
as { index: 42, precedingSegmentType: "none", precedingSegmentTags: ["punctuation"] }.
If collecting details about segments during internal iteration and exposing them later to avoid the need for author-level re-iteration is important (and it seems to be), then this API should expose that information in a way that doesn't suffer from the above issues.
For example, replacing string precedingSegmentType with array-of-strings precedingSegmentTags would address imprecision and future-hostility and also suggest a straightforward future extension to address incompleteness and nonconfigurability (e.g., new Intl.Segmenter(locale, {granularity: "word", customTags})).
Current text from the README:
%SegmentIterator%.prototype.following(index)
Move the iterator to the next break position after the given code unit index index, or if no index is provided, after its current position. Returns true if the end of the string was reached.
This is divergent from analogous behavior in RegExp.prototype.exec (starting at lastIndex) and String.prototype.indexOf(searchString, position) (and possibly other preexisting APIs), which start at rather than after the relevant index. It would also be a bit odd for "reset" behavior to look like iterator.following(-1). Perhaps this aspect should be reconsidered.
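The "start at, not after" behavior of the two preexisting APIs can be seen in a short sketch. The helper names `indexOfFrom` and `execFrom` are illustrative only.

```javascript
// String.prototype.indexOf considers a match that begins exactly at `position`.
function indexOfFrom(text, search, position) {
  return text.indexOf(search, position);
}

// A global RegExp also considers a match that begins exactly at lastIndex.
function execFrom(source, text, lastIndex) {
  const pattern = new RegExp(source, "g");
  pattern.lastIndex = lastIndex;
  const match = pattern.exec(text);
  return match === null ? -1 : match.index;
}

console.log(indexOfFrom("abcabc", "a", 3)); // 3, not -1
console.log(execFrom("a", "abcabc", 3));    // 3, not -1
```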
The String.prototype.codePoints proposal does. See also tc39/proposal-string-prototype-codepoints#3
In the October 2018 Intl call, @gibson042 raised some concern about the use of code units rather than code points to describe the offset.
To me, code unit is the only usable measure, given that that's what JS strings are based on; I didn't quite understand Richard's argument and can't reproduce it here well; maybe he could chime in with it.
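To make the distinction concrete, here is a sketch of how offsets look when measured in code units. It assumes the Intl.Segmenter API as later shipped in ECMA-402, where each segment reports an `index` in code units; `graphemeOffsets` is a hypothetical helper.

```javascript
// Collect the starting offset of each grapheme cluster, in UTF-16 code units.
function graphemeOffsets(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (segment) => segment.index);
}

const text = "\u{1F600}a"; // an emoji outside the BMP, then "a"
console.log(text.length);             // 3 code units
console.log(Array.from(text).length); // 2 code points
console.log(graphemeOffsets(text));   // [0, 2]: "a" starts at code unit 2
```

A code-unit offset can be passed straight to String.prototype.slice; a code-point offset would need converting first.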
https://tc39.github.io/proposal-intl-segmenter/#segment-iterator-objects mentions
"and Intl.Segment.prototype.reverseSegment"
but I cannot find the spec text for Intl.Segment.prototype.reverseSegment.
Should we add
as
"The methods Intl.Segment.prototype.segment returns iterators over the segments for a particular string. This section describes those iterator objects."
`undefined` → *undefined*
*0* → 0
CR → <CR>, LF → <LF>, etc.
Now that the interface has been clarified as iterating over boundaries, there is potential to reintroduce { granularity: "line" } implementing UAX #14 or an implementation-specific alternative. https://twitter.com/AaronPresley/status/1116424359223054336 provides a clear (if limited) use case that was missing at the time of #49.
But on the other hand, there's nothing preventing us from keeping it scoped out of this proposal and potentially adding it in a followup. I'm comfortable with either, but didn't want to bury the new demand that has emerged since December.
Would splitting text into script runs be in scope for the Intl.Segmenter API?
The current specification of Intl.Segmenter invokes ResolveLocale with only 4 arguments, omitting the mandatory localeData:
Let r be ResolveLocale(%Segmenter%.[[AvailableLocales]], requestedLocales, opt, %Segmenter%.[[RelevantExtensionKeys]]).
Without it, these objects don't actually implement the Iterable interface, and for-of/Array.from/etc. won't properly consume them.
Based on the random access pattern, I think we just want a @@iterator that returns the receiver.
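A minimal sketch of that pattern, using a stand-in object rather than the real segment iterator; `makeBoundaryIterator` and its array argument are hypothetical.

```javascript
// An iterator whose @@iterator method returns the receiver, so that
// for-of and Array.from can consume it from its current position.
function makeBoundaryIterator(boundaries) {
  let position = 0;
  return {
    next() {
      if (position >= boundaries.length) {
        return { value: undefined, done: true };
      }
      const value = boundaries[position];
      position += 1;
      return { value, done: false };
    },
    [Symbol.iterator]() {
      return this;
    },
  };
}

const iterator = makeBoundaryIterator([4, 5, 10]);
console.log(iterator[Symbol.iterator]() === iterator); // true
console.log(Array.from(iterator));                     // [4, 5, 10]
```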
There are two ways of grapheme breaking consonant conjuncts in Indic texts. The unicode bug seems to want to provide APIs for both. Should we do so too?
The lb Unicode extension key is equivalent to the "strictness" option. In the spirit of parity between options and tags, we should add support for this key to the segmenter.
How does this API deal with lone surrogates? Are those a valid grapheme cluster?
CSS defines a couple of tailorings of line breaking in the word-break property, which can have three values: normal, keep-all, and break-all. None of these, including break-all, expose grapheme breaks; rather, they are modifications of UAX 14. Expose these tailorings for direct usage through Intl.Segmenter.
cc @eaenet
The only use case I can imagine for line break iterators would be people trying to do their own paragraph layout themselves (e.g. eventually painting into a canvas).
The best way to perform paragraph layout in a browser is to use HTML elements and CSS. An author trying to do it themselves with JavaScript would almost certainly end up with something slower, less correct, and less accessible than doing it with the browser's engine.
This probably isn't true for the other segmenters - I can think of plenty of use cases for the other ones, but if there is wide adoption of line breaking, specifically, it would be unfortunate for the Web.
Other web specs do various kinds of breaking, especially line breaking. Factor the Intl.Segmenter spec text such that there is an abstract algorithm that they can call to get at the break, to cement the fact that Intl.Segmenter uses the same breaking algorithm as higher web specs.
cc @annevk
It's a bit strange for each segmentIterator to expose some but not all of the data returned by its next method. I'd be in favor of dropping breakType, or (if keeping it serves some important purpose) replacing both it and position/index with an accessor that echoes the most recent object from next.
ICU break types are documented here. Some things seem especially important (whether something is a word or not, whether it's a hard or soft line break, whether the sentence break is induced by a line break or punctuation token), and others seem less important (whether the word starts with a number, letter, ideographic or kana character).
Should we be taking that latter category of distinctions within word breaks with us when defining Intl.Segmenter? I'm no expert here, but the existing ICU distinctions feel a bit arbitrary. Also, when I ran a simple test, it seems like katakana and hiragana characters are categorized as "ideo"--is the kana category just historical? I don't see these distinctions documented within UAX 29 either--they don't seem to correspond to values of the Word_Break property.
I wonder if, in a new API, we should just group the word break types for number, letter, kana and ideo together into a category for the word (as opposed to the whitespace or punctuation). Thoughts?
cc @jungshik
cf. #67 (comment)
Please, BoundaryIterator rather than BreakIterator. "Boundary" is better because it applies more generally (e.g., there are grapheme boundaries but not really grapheme breaks). It was also the predominant term recorded in meeting notes:
- RG: It sounds like there's a rough consensus over making this an iterator over the boundaries.
- FT: How about "boundaryType"?
- RG: It seems like we have an agreement on the conceptual model. I'd like to follow up with a PR. If you're breaking on words, for example, do you need to distinguish segments that are whitespace, for example, compared to a segment of letters?
Currently we call ToString on the argument passed to Segmenter.prototype.segment, giving us some nice wat results like:
var s = new Intl.Segmenter("en", {granularity: "word"});
s.segment().next().value.segment // "undefined"
s.segment(null).next().value.segment // "null"
s.segment(true).next().value.segment // "true"
s.segment(false).next().value.segment // "false"
Should we type check if the argument is a string or object and then only ToString them?
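One way the suggested check could look in user code is sketched below. `toSegmentableString` is a hypothetical helper, not spec text: strings pass through, objects are converted with ToString semantics, and every other primitive is rejected.

```javascript
// Accept strings as-is, convert objects via their string conversion,
// and throw for undefined, null, booleans, numbers, and so on.
function toSegmentableString(input) {
  if (typeof input === "string") {
    return input;
  }
  if (typeof input === "object" && input !== null) {
    return String(input);
  }
  const kind = input === null ? "null" : typeof input;
  throw new TypeError(`Cannot segment a value of type ${kind}`);
}

console.log(toSegmentableString("Ceci"));                    // "Ceci"
console.log(toSegmentableString({ toString: () => "pipe" })); // "pipe"
```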
During one of our design reviews, a colleague asked why we name this API "Segmenter" instead of BreakIterator while at the same time using the term "breakType" rather than "segmentType". He suggested that if we name this API "Segmenter", we should make all the names consistent and therefore rename "breakType" in the spec to "segmentType".
This is really minor, and aligning with existing APIs and vocabulary takes precedence IMO. But it might be worth considering whether we can use a different word (e.g. "kind") or some prefix ("segmentType") to disambiguate between the constructor option and the breakType.
The current example in README starts to iterate over the words of "Ceci n'est pas une pipe", but breaks after the first result. I think it would be helpful to see the full results.
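A sketch of what a full listing could look like. It assumes the Intl.Segmenter API as later shipped in ECMA-402 (segment objects with `segment`, `index`, and `isWordLike`), not the draft README's iterator; `listWordSegments` is a hypothetical helper. The exact segments depend on the implementation's data, so no output is shown.

```javascript
// List every word-granularity segment of a string, including the
// whitespace and punctuation segments between words.
function listWordSegments(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text), ({ segment, index, isWordLike }) => ({
    segment,
    index,
    isWordLike,
  }));
}

for (const entry of listWordSegments("Ceci n'est pas une pipe", "fr")) {
  console.log(entry.index, JSON.stringify(entry.segment), entry.isWordLike);
}
```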
ICU's BreakIterator class supports preceding and following methods to find breaks before or after a given character offset, without iterating from the beginning. It works by starting from the given offset and iterating in the reverse direction until a "safe" breakpoint (not dependent on context) is found past the offset, and then moving the opposite direction until a contextual break is found. Note that this is different from simply slicing the string and using the reverse or forward iterator on it, because context created by the characters before or after the sliced range would be lost.
This is useful e.g. when word wrapping glyphs, in which case the maximum possible character index that will fit on the line is known. In this case, one could use the preceding method to find the nearest valid line boundary prior to that character index. This would be faster than iterating all of the possible breaks from the beginning of the line until the character index is passed.
I'm not sure of the best API here, but a possible one could be to add the methods to the SegmentIterator. This would have the effect of moving the iterator to the nearest preceding or following break to the provided character index, and returning an iteration result similar to the one returned by next.
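For reference, random access of this kind can be sketched with the `containing(index)` method of the Intl.Segmenter API as later shipped in ECMA-402. That API has no "line" granularity, so word granularity stands in here, and the helper names `boundaryAtOrBefore` and `boundaryAfter` are hypothetical.

```javascript
// Find the segment containing a code unit index without iterating from the start.
function segmentContaining(text, index, locale) {
  const segments = new Intl.Segmenter(locale, { granularity: "word" }).segment(text);
  return segments.containing(index); // undefined when index is out of range
}

// Nearest boundary at or before the index: the start of the containing segment.
function boundaryAtOrBefore(text, index, locale = "en") {
  const found = segmentContaining(text, index, locale);
  return found === undefined ? undefined : found.index;
}

// Nearest boundary after the index: the end of the containing segment.
function boundaryAfter(text, index, locale = "en") {
  const found = segmentContaining(text, index, locale);
  return found === undefined ? undefined : found.index + found.segment.length;
}

console.log(boundaryAtOrBefore("hello world", 8)); // 6
console.log(boundaryAfter("hello world", 8));      // 11
```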
They seem roughly equal in clarity, but lastIndex would be more consistent with e.g. RegExp.
Create documentation for Intl.Segmenter
MDN Pages :
Interactive Examples MDN :
Browser compat-data :
From @jungshik in tc39/ecma402#60
We also need to support 'strictness' (for lack of a better term) either as a separate option or as values of 'type'.
CSS3 has 'strict', 'normal', 'loose' (and 'auto') for line-break and ICU/CLDR support them. (when v8BreakIterator was written, there's no such distinction).
UAX #29 rules describe where boundaries exist and don't exist, but don't classify segments or use properties to describe them. ICU does apply properties, but seems to do so as multiple flags rather than as a single label; for example, the word segment "A113" might be described by both LETTER and NUMBER (assuming the data are derived from/analogous to the Default Word Boundary Specification), while the sentence segment "Why?\n" is described by STERM and LF (assuming the data are derived from/analogous to the Default Sentence Boundary Specification).
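To illustrate why a list fits better than a scalar, here is a rough sketch that tags a segment with every category it matches. The tag names and the Unicode property tests are assumptions chosen for illustration; they are not ICU's rule-status values and not proposed spec behavior.

```javascript
// Hypothetical tag table: each tag applies when its pattern matches anywhere
// in the segment, so one segment can carry several tags at once.
const TAG_PATTERNS = [
  ["letter", /\p{L}/u],
  ["number", /\p{N}/u],
];

function tagsFor(segment) {
  return TAG_PATTERNS
    .filter(([, pattern]) => pattern.test(segment))
    .map(([tag]) => tag);
}

console.log(tagsFor("A113")); // ["letter", "number"]
console.log(tagsFor("pipe")); // ["letter"]
console.log(tagsFor(" "));    // []
```

A scalar label would have to pick one of "letter" or "number" for "A113", which is the imprecision in question.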
I'm still not sure how I feel about collapsing what is logically a list into a scalar, but taking that as given for the sake of discussion, I'd like to see better naming.
Suggestions:
E.g. Intl.Segmenter.segments("string", "fr", { type: "word" })
Different web browsers use different line breaking rules. WebKit and Blink use ICU's algorithms based on the Unicode standard, whereas Edge and Firefox use other algorithms. Some of these browsers might not even ship Unicode line breaking data.
Rather than normatively referencing Unicode algorithms here, instead say "such as" in a note, and leave it to implementations to use the appropriate breaking algorithms.
With the last v8-canary build on Windows 7 x64 ([email protected], [email protected]):
'use strict';
console.log(Intl.Segmenter);
console.log(new Intl.Segmenter('en', {granularity: 'word'}));
With full ICU (node --icu-data-dir=.\node_modules\full-icu --harmony test.js):
[Function: Segmenter]
Segmenter [Intl.Segmenter] {}
Without full ICU (node --harmony test.js):
[Function: Segmenter]
#
# Fatal error in , line 0
# Check failed: U_SUCCESS(status).
#
#
#
#FailureMessage Object: 000000000021DB80
Does %SegmentIteratorPrototype% have a [[Prototype]] slot with value %IteratorPrototype%? I don't see any spec text that says this. Is this an oversight?
Here's how the ECMA262 iterators specify this:
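The same relationship can also be observed at runtime. The sketch below recovers %IteratorPrototype% from an array iterator and checks a segment iterator against it; it assumes the Intl.Segmenter API as later shipped in ECMA-402, and `inheritsFromIteratorPrototype` is a hypothetical helper.

```javascript
// %IteratorPrototype% is the prototype of %ArrayIteratorPrototype%.
const IteratorPrototype = Object.getPrototypeOf(
  Object.getPrototypeOf([][Symbol.iterator]())
);

// True when the iterator's own prototype object inherits directly
// from %IteratorPrototype%.
function inheritsFromIteratorPrototype(iterator) {
  return Object.getPrototypeOf(Object.getPrototypeOf(iterator)) === IteratorPrototype;
}

const segments = new Intl.Segmenter("en").segment("abc");
console.log(inheritsFromIteratorPrototype([][Symbol.iterator]()));       // array iterator
console.log(inheritsFromIteratorPrototype(segments[Symbol.iterator]())); // segment iterator
```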
Some languages, such as Thai, Japanese, or Chinese, do not use spaces between words. Proper word breaking in these languages depends on special algorithms usually coupled with dictionaries. Since common operations that use word breaking include by-word text selection and indexing for full text search, and these operations want true word boundary detection, it would be useful to note the special requirements of these languages. I believe the ICU library now incorporates several unencumbered dictionaries. I call this out because the references in the draft such as Unicode/TR29 and CLDR do not provide this support.
In Intl.Segmenter, we have:
5. Let matcher be ? GetOption(options, "localeMatcher", "string", « "lookup", "best fit" », "best fit").
6. Set opt.[[localeMatcher]] to matcher.
7. Let lineBreakStyle be ? GetOption(options, "lineBreakStyle", "string", « "strict", "normal", "loose" », "normal").
and then much later we have the coercion of the options argument into an Object:
11. If options is undefined, then
a. Let options be ObjectCreate(null).
12. Else
b. Let options be ? ToObject(options).
This is inconsistent with all the other Intl objects. The coercion to Object should happen before properties of the options argument are accessed.
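The intended ordering can be sketched in plain JavaScript. `coerceOptions` and `readSegmenterOptions` are hypothetical stand-ins for the spec steps, and only one option is read for brevity.

```javascript
// Mirror "If options is undefined, ObjectCreate(null); else ToObject(options)".
// ToObject throws a TypeError for null, which Object(null) would not do.
function coerceOptions(options) {
  if (options === undefined) {
    return Object.create(null);
  }
  if (options === null) {
    throw new TypeError("Cannot convert null to object");
  }
  return Object(options);
}

// Coerce first, then read properties from the coerced object.
function readSegmenterOptions(options) {
  const coerced = coerceOptions(options);
  const granularity = coerced.granularity === undefined
    ? "grapheme"
    : String(coerced.granularity);
  return { granularity };
}

console.log(readSegmenterOptions(undefined));               // { granularity: "grapheme" }
console.log(readSegmenterOptions({ granularity: "word" })); // { granularity: "word" }
```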
cc @FrankYFTang
There are additional options for breaking, specifically:
It's not clear how important these options are. When we discussed this issue in the ECMA 402 VC meeting in January 2018, @srl295 argued that all options should be presented, while @jungshik argued that the line breaking options are the most important ones. Do these options require taking up additional data size? If so, this is an additional argument against them.
"grapheme" is a vague term. It can be used to mean the written counterpart of a phoneme (in which case "ee" in "feel" is a grapheme). It can also be used to talk about individual marks (like an "accent"). Unicode never tries to talk in terms of graphemes, and I don't think we should either. This is why Unicode defines grapheme cluster; to get rid of this ambiguity.
We could use "grapheme clusters" here, but we're doing tailored segmentation, so it's not really that.
Unicode defines GCs and EGCs as an approximation of "user-perceived character". I wonder if we can use "characters" instead. Though that's just as ambiguous.
Just hoping we can come up with a better name here.
The README and spec text seem to bounce somewhat between describing the functionality as "iterating over segments" and "iterating over breaks/boundaries", and the confusion bleeds a little bit into the proposed interfaces. Can we settle on a consistent model and align everything to that?
Relevant Unicode vocabulary is described at #44 (comment) , but the even shorter summary is that both grapheme/word/sentence segmentation and line breaking completely partition a nonempty string into a sequence of nonempty segments terminated by boundaries if we normalize treatment at the start of text (where UAX #29 recognizes a boundary but UAX #14 prohibits a break opportunity). A grapheme boundary follows every "character", a word boundary follows every "word" and every non-word character, a sentence boundary follows every sentence-terminating punctuation after immediately following linear whitespace and up to one line terminator, and a line break opportunity follows every space. In all granularities, a boundary immediately precedes the end of text.
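The partition property can be checked mechanically. This sketch assumes the Intl.Segmenter API as later shipped in ECMA-402, which offers grapheme, word, and sentence granularities but no line breaking; `isPartition` is a hypothetical helper.

```javascript
// Verify that the segments are nonempty, contiguous, and cover the whole string.
function isPartition(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  let expectedIndex = 0;
  for (const { segment, index } of segmenter.segment(text)) {
    if (segment.length === 0 || index !== expectedIndex) {
      return false;
    }
    expectedIndex += segment.length;
  }
  // The final boundary must coincide with the end of text.
  return expectedIndex === text.length;
}

const sample = "Why?\nBecause A113 said so.";
for (const granularity of ["grapheme", "word", "sentence"]) {
  console.log(granularity, isPartition(sample, granularity)); // true for each
}
```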
I personally feel like break associates more strongly with lines than with graphemes, words, or sentences, so I'd like to avoid general use of that term, leaving for consideration boundary vs. segment.
Given nonempty input,
My preference is leaning towards the latter, but this determination should really be made by analyzing the consequences upon consumers, both via next and via methods like following/preceding (cf. #52).
https://tc39.github.io/proposal-intl-segmenter/#sec-Intl.Segmenter-internal-slots
CLDR defines several extension keys, but this specification does not expose them.
https://tc39.github.io/proposal-intl-segmenter/#sec-Intl.Segmenter
new Intl.Segmenter("de-u-lb-strict", {granularity:"line"}).resolvedOptions().lineBreakStyle is "normal" with the current spec, because the default option value is always used.
In every example I've written using this (like this one) I end up with this pattern:
const words = [];
const iterator = new Intl.Segmenter(locale, {type: 'word'}).segment(text);
// Remember the previous boundary so each segment can be sliced out by hand.
let pos = iterator.index();
for (let {index, breakType} of iterator) {
  if (breakType !== 'none')
    words.push(text.slice(pos, index));
  pos = index;
}
i.e. I need to maintain pos tracking the previous index and slice the string myself. How about:
%SegmentIterator%.prototype.segment
Return a substring of the input string from the last index to the current index.
...so you can just write:
const words = [];
const iterator = new Intl.Segmenter(locale, {type: 'word'}).segment(text);
for (let {index, breakType} of iterator) {
  if (breakType !== 'none')
    words.push(iterator.segment());
}
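For comparison, the same task needs no manual position tracking in the Intl.Segmenter API as later shipped in ECMA-402, where each iteration result carries the substring itself. This is a sketch under that assumption: it uses `granularity` and `isWordLike` rather than the draft's `type` and `breakType`, and `extractWords` is a hypothetical helper.

```javascript
// Collect only the word-like segments; the segment text is provided directly.
function extractWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike) {
      words.push(segment);
    }
  }
  return words;
}

console.log(extractWords("Hello, world!", "en")); // ["Hello", "world"]
```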