
proposal-intl-segmenter's Introduction

Intl.Segmenter: Unicode segmentation in JavaScript

Stage 4 proposal, champion Richard Gibson

Motivation

A code point is not a "letter" or a displayed unit on the screen. That designation goes to the grapheme, which can consist of multiple code points (e.g., including accent marks, conjoining Korean characters). Unicode defines a grapheme segmentation algorithm to find the boundaries between graphemes. This may be useful in implementing advanced editors/input methods, or other forms of text processing.

Unicode also defines an algorithm for finding boundaries between words and sentences, which CLDR tailors per locale. These boundaries may be useful, for example, in implementing a text editor which has commands for jumping or highlighting words and sentences.

Grapheme, word and sentence segmentation is defined in UAX 29. Web browsers need an implementation of this kind of segmentation to function, and shipping it to JavaScript saves memory and network bandwidth as compared to expecting developers to implement it themselves in JavaScript.

Chrome has been shipping its own nonstandard segmentation API called Intl.v8BreakIterator for a few years. However, for a few reasons, this API does not seem suitable for standardization. This explainer outlines a new API which attempts to be more in accordance with modern, post-ES2015 JavaScript API design.

Examples

Segment iteration

Objects returned by the segment method of an Intl.Segmenter instance find boundaries and expose segments between them via the Iterable interface.

// Create a locale-specific word segmenter
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});

// Use it to get an iterator for a string
let input = "Moi?  N'est-ce pas.";
let segments = segmenter.segment(input);

// Use that for segmentation!
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}
// console.log output:
// segment at code units [0, 3): «Moi» (word-like)
// segment at code units [3, 4): «?»
// segment at code units [4, 6): «  »
// segment at code units [6, 11): «N'est» (word-like)
// segment at code units [11, 12): «-»
// segment at code units [12, 14): «ce» (word-like)
// segment at code units [14, 15): « »
// segment at code units [15, 18): «pas» (word-like)
// segment at code units [18, 19): «.»

For flexibility and advanced use cases, they also support direct random access.

// ┃0 1 2 3 4 5┃6┃7┃8┃9
// ┃A l l o n s┃-┃y┃!┃
let input = "Allons-y!";

let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
let segments = segmenter.segment(input);
let current = undefined;

current = segments.containing(0)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(5)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(6)
// → { index: 6, segment: "-", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → { index: 7, segment: "y", isWordLike: true }

current = segments.containing(current.index + current.segment.length)
// → { index: 8, segment: "!", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → undefined

API

A polyfill is available for a historical snapshot of this proposal.

new Intl.Segmenter(locale, options)

Creates a new locale-dependent Segmenter. If options is provided, it is treated as an object and its granularity property specifies the segmenter granularity ("grapheme", "word", or "sentence", defaulting to "grapheme").
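
As a minimal sketch of the default granularity (assuming a runtime that ships Intl.Segmenter; the sample string and variable names are chosen here for illustration), omitting options yields a grapheme segmenter, so a letter plus combining accent, or a run of conjoining Korean jamo, comes back as a single segment:

```javascript
// No options bag: granularity defaults to "grapheme".
const graphemeSegmenter = new Intl.Segmenter("en");
const resolvedGranularity = graphemeSegmenter.resolvedOptions().granularity;

// "e" + U+0301 COMBINING ACUTE ACCENT, then three conjoining Hangul jamo.
const sample = "e\u0301\u1112\u1161\u11AB";
const graphemes = Array.from(graphemeSegmenter.segment(sample), (s) => s.segment);

console.log(resolvedGranularity);              // "grapheme"
console.log(sample.length, graphemes.length);  // 5 2
```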

Intl.Segmenter.prototype.segment(string)

Creates a new Iterable %Segments% instance for the input string using the Segmenter's locale and granularity.

Segment data

Segments are described by plain objects with the following data properties:

  • segment is the string segment.
  • index is the code unit index in the string at which the segment begins.
  • input is the string being segmented.
  • isWordLike is true when granularity is "word" and the segment is word-like (consisting of letters/numbers/ideographs/etc.), false when granularity is "word" and the segment is not word-like (consisting of spaces/punctuation/etc.), and undefined when granularity is not "word".
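
To illustrate these properties outside of word granularity, here is a small sketch using a sentence segmenter (the sample text is an assumption made for this example). Each data object carries the full input string, and isWordLike reads as undefined:

```javascript
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const paragraph = "Hello world. How are you?";
const sentenceData = [...sentenceSegmenter.segment(paragraph)];

for (const { segment, index, input, isWordLike } of sentenceData) {
  // input is the whole original string; isWordLike is undefined here
  // because the granularity is not "word".
  console.log(index, JSON.stringify(segment), input === paragraph, isWordLike);
}
// 0 "Hello world. " true undefined
// 13 "How are you?" true undefined
```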

Methods of %Segments%.prototype:

%Segments%.prototype.containing(index)

Returns a segment data object describing the segment in the string including the code unit at the specified index, or undefined if the index is out of bounds.
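
One way this might be used is to look up the word under a cursor position. The helper name wordAt below is hypothetical and not part of the proposal; it is only a sketch built on containing:

```javascript
const wordSegments = new Intl.Segmenter("en", { granularity: "word" })
  .segment("Hello, world");

// Hypothetical helper: the word-like segment under a cursor position, if any.
function wordAt(segments, cursor) {
  const data = segments.containing(cursor);
  return data && data.isWordLike ? data.segment : undefined;
}

console.log(wordAt(wordSegments, 9));   // "world"
console.log(wordAt(wordSegments, 5));   // undefined (the comma is not word-like)
console.log(wordAt(wordSegments, 12));  // undefined (index 12 is out of bounds)
```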

%Segments%.prototype[Symbol.iterator]

Creates a new %SegmentIterator% instance which will lazily find segments in the input string using the Segmenter's locale and granularity, keeping track of its current position within the string.

Methods of %SegmentIterator%.prototype:

%SegmentIterator%.prototype.next()

The next method implements the Iterator interface, finding the next segment and returning a corresponding IteratorResult object whose value property is a segment data object as described above.
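
A for-of loop normally drives this method, but it can also be called by hand. The following sketch (sample string assumed for illustration) steps through a three-segment string until the iterator completes:

```javascript
const manualIterator = new Intl.Segmenter("en", { granularity: "word" })
  .segment("a b")[Symbol.iterator]();

const first = manualIterator.next();   // value is the data object for "a"
const second = manualIterator.next();  // value is the data object for " "
const third = manualIterator.next();   // value is the data object for "b"
const last = manualIterator.next();    // { value: undefined, done: true }

console.log(first.value.segment, first.value.index, first.done);  // a 0 false
console.log(last.value, last.done);                                // undefined true
```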

FAQ

Why should we pass a locale and options bag for grapheme boundaries? Isn't there just one way to do it?

The situation is a little more complicated, e.g., for Indic scripts. Work is ongoing to support grapheme boundary options for these scripts better; see this bug, and in particular this CLDR wiki page. Seems like CLDR/ICU don't support this yet, but it's planned.

Shouldn't we be putting new APIs in built-in modules?

If built-in modules had shipped before this proposal reached Stage 3, that would have been a good option. However, so far the idea in TC39 has been not to block either thing on the other. Built-in modules still have some big questions to resolve, e.g., how/whether polyfills should interact with them.

Why is line breaking not included?

Line breaking was provided in an earlier version of this API, but it is excluded because a line breaking API on its own would be incomplete: line breaking is typically used when laying out text, and text layout requires a larger set of APIs, e.g., determining the width of a rendered string of text. For this reason, we suggest continued development of a line breaking API as part of the CSS Houdini effort.

Why is hyphenation not included?

Hyphenation is expected to have a different sort of API shape for various reasons:

  • Adding a hyphenation break may change the spelling of the affected text
  • There may be hyphenation breaks of different priorities
  • Hyphenation plays into line layout and font rendering in a more complex way, and we might want to expose it at that level (e.g., in the Web Platform rather than ECMAScript)
  • Hyphenation is just a less well-developed thing in the internationalization world. CLDR and ICU don't support it yet; certain web browsers are only getting support for it now in CSS. It's often not done perfectly. It could use some more time to bake. By contrast, word, grapheme, sentence and line breaks have been in the Unicode specification for a long time; this is a shovel-ready project.

Why is random-access stateless?

It would be possible to expose methods on %SegmentIterator%.prototype that mutate internal state (e.g., seek([inclusiveStartIndex = thisIterator.index + 1]) and seekBefore([exclusiveLastIndex = thisIterator.index])), and in fact these were part of earlier designs. They were dropped for consistency with other ECMA-262 iterators (whose movement is always forward and without gaps). If real-world use reveals that their absence is an ergonomic and/or performance flaw, they can be added in a followup proposal.
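
Stateless random access is still enough to move in either direction. As a rough sketch, the hypothetical generator segmentsBackward below (not part of the proposal) walks from the end of the string to the start using only containing:

```javascript
// Hypothetical helper: yield segment data objects from last to first.
function* segmentsBackward(segments, input) {
  let position = input.length - 1;
  while (position >= 0) {
    const data = segments.containing(position);
    yield data;
    // Jump to the last code unit of the previous segment.
    position = data.index - 1;
  }
}

const phrase = "Allons-y!";
const phraseSegments = new Intl.Segmenter("fr", { granularity: "word" }).segment(phrase);
const reversed = Array.from(segmentsBackward(phraseSegments, phrase), (d) => d.segment);
console.log(reversed);  // [ '!', 'y', '-', 'Allons' ]
```

Whether repeated containing calls are as cheap as a dedicated backward iterator is implementation-dependent.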

Why is this an Intl API instead of String methods?

All of these boundary types are actually locale-dependent, and some allow complex options. The result of the segment method is an iterable %Segments% object rather than a plain value. For many non-trivial cases like this, analogous APIs are put in ECMA-402's Intl object. This allows for the work that happens on each instantiation to be shared, improving performance. We could make a convenience method on String as a follow-on proposal.
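
To make the sharing concrete, here is a sketch of an author-level convenience layer that reuses one segmenter per locale and granularity. The names segmenterCache, getSegmenter and countGraphemes are hypothetical and only illustrate the idea:

```javascript
// Hypothetical cache: one Intl.Segmenter per locale/granularity pair.
const segmenterCache = new Map();

function getSegmenter(locale, granularity) {
  const key = `${locale}|${granularity}`;
  if (!segmenterCache.has(key)) {
    segmenterCache.set(key, new Intl.Segmenter(locale, { granularity }));
  }
  return segmenterCache.get(key);
}

// Hypothetical String-style convenience built on the cache.
function countGraphemes(string, locale = "en") {
  let count = 0;
  for (const _ of getSegmenter(locale, "grapheme").segment(string)) count++;
  return count;
}

console.log(countGraphemes("cafe\u0301"));  // 4
console.log(countGraphemes("💙"));          // 1
console.log(segmenterCache.size);           // 1 (both calls shared one instance)
```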

What exactly does the index refer to?

An index n refers to the code unit index within a string that is potentially the start of a segment. For example, when iterating over the string "Hello, world💙" by words in English, segments will start at indexes 0, 5, 6, 7, and 12 (i.e., the string gets segmented like ┃Hello┃,┃ ┃world┃💙┃, with the final segment consisting of a surrogate pair of two code units encoding a single code point). The definition of these boundary indexes does not depend on whether forwards or backwards iteration is used.
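
The example can be checked with a short sketch (variable names are illustrative). Note that the string's length is 14 code units even though it ends at the 13th code point:

```javascript
const greeting = "Hello, world💙";
const starts = Array.from(
  new Intl.Segmenter("en", { granularity: "word" }).segment(greeting),
  (data) => data.index
);

console.log(starts);           // per the description above: [ 0, 5, 6, 7, 12 ]
console.log(greeting.length);  // 14
```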

What happens when segmenting an empty string?

No segments will be found, and iterators will complete immediately upon first next() access.
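
A brief sketch of that behavior (names chosen for illustration); random access on the empty string also finds nothing, since every index is out of bounds:

```javascript
const emptySegments = new Intl.Segmenter("en", { granularity: "word" }).segment("");
const emptyResult = emptySegments[Symbol.iterator]().next();

console.log(emptyResult.done);             // true
console.log([...emptySegments].length);    // 0
console.log(emptySegments.containing(0));  // undefined
```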

What happens when I try to use random access with non-Number values?

Someone's in QA. 😉 The containing argument is processed into an integer Number—null, undefined, and NaN become 0, Booleans become 0 or 1, Strings are parsed as string numeric literals, Objects are cast to primitives, and Symbols and BigInts fail with a TypeError exception. Fractional components are truncated, but infinite Numbers are accepted as-is (although they are always out of bounds and will therefore never find a segment).
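
The conversions listed above can be observed with a sketch like the following, which reuses the "Allons-y!" example (the startOf helper is hypothetical and just reports the index of the segment that was found):

```javascript
const lookups = new Intl.Segmenter("fr", { granularity: "word" }).segment("Allons-y!");

// Hypothetical helper: start index of the segment found, or undefined.
const startOf = (value) => {
  const data = lookups.containing(value);
  return data === undefined ? undefined : data.index;
};

console.log(startOf(null), startOf(undefined), startOf(NaN));  // 0 0 0
console.log(startOf(true));      // 0 (true becomes 1, which is inside "Allons")
console.log(startOf("7"));       // 7 (parsed as the Number 7)
console.log(startOf(6.9));       // 6 (fractional component truncated)
console.log(startOf(Infinity));  // undefined (always out of bounds)

const typeErrors = [];
for (const bad of [Symbol("s"), 1n]) {
  try {
    lookups.containing(bad);
  } catch (error) {
    typeErrors.push(error instanceof TypeError);
  }
}
console.log(typeErrors);         // [ true, true ]
```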

Implementations

proposal-intl-segmenter's Issues

Support preceding and following methods

ICU's BreakIterator class supports preceding and following methods to find breaks before or after a given character offset, without iterating from the beginning. It works by starting from the given offset and iterating in the reverse direction until a "safe" breakpoint (not dependent on context) is found past the offset, and then moving the opposite direction until a contextual break is found. Note that this is different from simply slicing the string and using the reverse or forward iterator on it, because context created by the characters before or after the sliced range would be lost.

This is useful e.g. when word wrapping glyphs, in which case the maximum possible character index that will fit on the line is known. In this case, one could use the preceding method to find the nearest valid line boundary prior to that character index. This would be faster than iterating all of the possible breaks from the beginning of the line until the character index is passed.

I'm not sure of the best API here, but a possible one could be to add the methods to the SegmentIterator. This would have the effect of moving the iterator to the nearest preceding or following break to the provided character index, and returning an iteration result similar to the one returned by next.
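
For reference, with the containing method described in the README, similar lookups can be approximated at author level. The sketch below is an assumption-laden illustration, not the API requested here: followingBoundary and precedingBoundary are hypothetical names, and both simply return undefined when the lookup index falls outside the string.

```javascript
// Hypothetical helper: first boundary strictly after `offset`, or undefined.
function followingBoundary(segments, offset) {
  const data = segments.containing(offset);
  return data === undefined ? undefined : data.index + data.segment.length;
}

// Hypothetical helper: last boundary strictly before `offset`, or undefined.
function precedingBoundary(segments, offset) {
  const data = segments.containing(offset - 1);
  return data === undefined ? undefined : data.index;
}

const lineSegments = new Intl.Segmenter("fr", { granularity: "word" }).segment("Allons-y!");
console.log(followingBoundary(lineSegments, 2));  // 6
console.log(precedingBoundary(lineSegments, 8));  // 7
console.log(precedingBoundary(lineSegments, 0));  // undefined
```

Because containing works on the whole string, context from before or after the offset is not lost, unlike slicing the string first.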

Should we include all the break types that ICU includes, or a more limited set?

ICU break types are documented here. Some things seem especially important (whether something is a word or not, whether it's a hard or soft line break, whether the sentence break is induced by a line break or punctuation token), and others seem less important (whether the word starts with a number, letter, ideographic or kana character).

Should we be taking that latter category of distinctions within word breaks with us when defining Intl.Segmenter? I'm no expert here, but the existing ICU distinctions feel a bit arbitrary. Also, when I ran a simple test, it seems like katakana and hiragana characters are categorized as "ideo"--is the kana category just historical? I don't see these distinctions documented within UAX 29 either--they don't seem to correspond to values of the Word_Break property.

I wonder if, in a new API, we should just group the word break types for number, letter, kana and ideo together into a category for the word (as opposed to the whitespace or punctuation). Thoughts?

cc @jungshik

{granularity: "line"} promotes reimplementing paragraph layout in script

The only use case I can imagine for line break iterators would be people trying to do their own paragraph layout themselves (e.g. eventually painting into a canvas).

The best way to perform paragraph layout in a browser is to use HTML elements and CSS. An author trying to do it themselves with JavaScript would almost certainly end up with something slower, less correct, and less accessible than the browser's engine.

This probably isn't true for the other segmenters - I can think of plenty of use cases for the other ones, but if there is wide adoption of line breaking, specifically, it would be unfortunate for the Web.

Intl.Segmenter vs. Intl.Segmentation

This is perhaps minor, and it's difficult to muster strong feelings about it, but I believe that the new API proposed here should be named Intl.Segmentation rather than Intl.Segmenter. Not only is "segmentation" the term used in UAX #29, but it would also be more consistent with the existing constructors, which with the exception of Collator are not agent nouns even though they could be (e.g., we have Intl.NumberFormat rather than Intl.NumberFormatter and Intl.PluralRules rather than Intl.Pluralizer).

Please provide opinions on this change.

// analogous to: let formatter = new Intl.NumberFormat("fr");
let segmenter = new Intl.Segmentation("fr", {granularity: "word"});

// analogous to: formatter.format(number)
let boundaries = segmenter.segment(input);

Docs(MDN) : Documentation for Intl.Segmenter

Create documentation for Intl.Segmenter

  • Review Readme documentation and examples
  • Create MDN Main Docs Page

MDN Pages :

  • prototype
  • constructor
  • methods

Interactive Examples MDN :

  • Segmenter Generic Usage
  • Segmenter.prototype.segment(string)
  • Segmenter.prototype.next()
  • Segmenter.prototype.following(index)
  • Segmenter.prototype.preceding(index)
  • Segmenter.prototype.index
  • Segmenter.prototype.breakType

Browser compat-data :

  • Segmenter.prototype.segment(string)
  • Segmenter.prototype.next()
  • Segmenter.prototype.following(index)
  • Segmenter.prototype.preceding(index)
  • Segmenter.prototype.index
  • Segmenter.prototype.breakType

Match ECMA-402 formatting

  • `undefined` → *undefined*
  • *0* → 0
  • CR → <CR>, LF → <LF>, etc.
  • properly reference r.[[locale]] rather than r.[[Locale]] (cf. ResolveLocale)
  • add a section to describe the internal slots of segment iterators
  • etc.

Consider removing the locale from grapheme segmentation

A strong piece of feedback from the September 2016 TC39 meeting (from @bterlson, @rniwa and others) was that the locale is not needed for grapheme breaks, so it should not be a parameter. However, I later spoke with Mark Davis, who said that logic for "extended grapheme clusters", e.g., for Indic scripts, is still in flux, and he recommended that all new APIs for grapheme segmentation take a locale as a parameter for future-proofing.

Should segmentIterator.following(n) match a break position at n?

Current text from the README:

%SegmentIterator%.prototype.following(index)
Move the iterator to the next break position after the given code unit index index, or if no index is provided, after its current position. Returns true if the end of the string was reached.

This is divergent from analogous behavior in RegExp.prototype.exec (starting at lastIndex) and String.prototype.indexOf(searchString, position) (and possibly other preexisting APIs), which start at rather than after the relevant index. It would also be a bit odd for "reset" behavior to look like iterator.following(-1). Perhaps this aspect should be reconsidered.

Different word for the two different "types"?

This is really minor, and aligning with existing APIs and vocabulary takes precedence IMO. But it might be worth considering whether we can use a different word (e.g. "kind") or some prefix ("segmentType") to disambiguate between the constructor option and the breakType.

"lb" Unicode extension key follow-up

https://tc39.github.io/proposal-intl-segmenter/#sec-Intl.Segmenter-internal-slots

CLDR defines several extension keys, but this specification does not expose them.

  • The note should be updated now that "lb" is exposed.

https://tc39.github.io/proposal-intl-segmenter/#sec-Intl.Segmenter

  • The default value for "lineBreakStyle" needs to be removed, otherwise it's not possible to modify the line break style using "lb" (because option values always override Unicode extension values, cf. ResolveLocale abstract op).
  • IOW new Intl.Segmenter("de-u-lb-strict", {granularity:"line"}).resolvedOptions().lineBreakStyle is "normal" with the current spec, because the default option value is always used.

Add lb option

The lb Unicode extension key is equivalent to the "strictness" option. In the spirit of parity between options and tags, we should add support for this key to the segmenter.

False statement in Q: Why is this an Intl API instead of String methods

The only internationalization-related functions that are placed as methods on String, Date and Number are simpler functions that just convert to a string, e.g., toLocaleString(), toLocaleUpperCase()--these may take a locale argument, but no options bag.

this is not true:

date.toLocaleString(locales, options)
date.toLocaleDateString(locales, options)
date.toLocaleTimeString(locales, options)
number.toLocaleString(locales, options)
typedArray.toLocaleString(locales, options)

all take i18n-related options.
Usually these APIs are a subset of what is also exposed in different ways on the Intl object, e.g. new Intl.DateTimeFormat(locales, options).format(date), which also has formatToParts() that is not on the Date prototype. I think it's therefore fair to discuss whether parts of this Intl API should also be exposed on the String prototype to make it easier to use.

Should "breakType" rename to "segmentType"

During one of our design reviews, a colleague asked why we name this API "Segmenter" instead of "BreakIterator" but at the same time use the term "breakType" rather than "segmentType". He suggested that if we name this API "Segmenter", we should make all the names consistent and therefore rename "breakType" in the spec to "segmentType".

Incomplete amount of exposed state on %SegmentIteratorPrototype%

It's a bit strange for each segmentIterator to expose some but not all of the data returned by its next method. I'd be in favor of dropping breakType, or—if keeping it serves some important purpose—replacing both it and position/index with an accessor that echoes the most recent object from next.

Editorial: Refactor for normative reference by higher web specs

Other web specs do various kinds of breaking, especially line breaking. Factor the Intl.Segmenter spec text such that there is an abstract algorithm that they can call to get at the break, to cement the fact that Intl.Segmenter uses the same breaking algorithm as higher web specs.

cc @annevk

Enhance expressiveness and simplicity

  • Add a method to jump to a particular offset
  • Add a method to find the previous break
  • Remove the current convenience iterator in favor of even higher level convenience functionality along the lines of what @domenic suggests in other bugs here

Code point vs code unit in position

In the October 2018 Intl call, @gibson042 raised some concern about the use of code units rather than code points to describe the offset.

To me, code unit is the only usable measure, given that that's what JS strings are based on; I didn't quite understand Richard's argument and can't reproduce it here well; maybe he could chime in with it.

cc @sffc @litherum

Convenience: method to get current token from iterator

In every example I've written using this (like this one) I end up with this pattern:

const words = [];
const iterator = new Intl.Segmenter(locale, {type: 'word'}).segment(text);
let pos = iterator.index();
for (let {index, breakType} of iterator) {
  if (breakType !== 'none')
    words.push(text.slice(pos, index));
  pos = index;
}

i.e. I need to maintain pos tracking the previous index and slice the string myself. How about:

%SegmentIterator%.prototype.segment

Return a substring of the input string from last index to the current index.

...so you can just write:

const words = [];
const iterator = new Intl.Segmenter(locale, {type: 'word'}).segment(text);
for (let {index, breakType} of iterator) {
  if (breakType !== 'none')
    words.push(iterator.segment());
}
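
For comparison, in the API described in the README, each iteration result already carries the segment string and an isWordLike flag, so the same word collection needs no manual position tracking. A sketch, with sample values for locale and text assumed for illustration:

```javascript
const locale = "en";
const text = "The quick brown fox.";

const words = [];
const wordSegmenter = new Intl.Segmenter(locale, { granularity: "word" });
for (const { segment, isWordLike } of wordSegmenter.segment(text)) {
  // Skip whitespace and punctuation segments.
  if (isWordLike) words.push(segment);
}

console.log(words);  // [ 'The', 'quick', 'brown', 'fox' ]
```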

Clarify initial value of [[SegmentIteratorBreakType]]

CreateSegmentIterator has:

Let iterator.[[SegmentIteratorBreakType]] be an implementation-dependent string representing a break at the edge of a string.

while AdvanceSegmentIterator has:

Set iterator.[[SegmentIteratorBreakType]] to a string representing the type of break found, using one of the values found in the table Table 2, or undefined if the boundaries of the string are reached, or if there is no meaningful type for the granularity.

I'm guessing the same restrictions are supposed to apply to the implementation-dependent value set in CreateSegmentIterator.

Lone surrogates

How does this API deal with lone surrogates? Are those a valid grapheme cluster?
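
One way to probe this in a given engine is sketched below (names are illustrative). Since segments partition the input, the lone surrogate has to land in some segment; the sketch prints each segment's code units so the placement can be inspected rather than assumed:

```javascript
// "a", a lone high surrogate (U+D83D), then "b".
const withLoneSurrogate = "a\uD83Db";
const pieces = Array.from(
  new Intl.Segmenter("en", { granularity: "grapheme" }).segment(withLoneSurrogate),
  (data) => data.segment
);

// Show the code units of each segment in hex.
for (const piece of pieces) {
  const units = Array.from({ length: piece.length }, (_, i) =>
    piece.charCodeAt(i).toString(16)
  );
  console.log(units.join(" "));
}

// The segments always concatenate back to the original string.
const lossless = pieces.join("") === withLoneSurrogate;
console.log(lossless);
```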

Where is the spec of reverseSegment?

https://tc39.github.io/proposal-intl-segmenter/#segment-iterator-objects mentions
"and Intl.Segment.prototype.reverseSegment"
but I cannot find the spec of Intl.Segment.prototype.reverseSegment

Should we

  1. add the specification of Intl.Segment.prototype.reverseSegment, or
  2. remove "and Intl.Segment.prototype.reverseSegment" from the following text:
    "The methods Intl.Segment.prototype.segment and Intl.Segment.prototype.reverseSegment return iterators over the segments for a particular string. This section describes those iterator objects."

so that it reads
"The method Intl.Segment.prototype.segment returns iterators over the segments for a particular string. This section describes those iterator objects."

Intl.Segmenter constructor should coerce options to Object first

In Intl.Segmenter, we have:

5. Let matcher be ? GetOption(options, "localeMatcher", "string", « "lookup",  "best fit" », "best fit").
6. Set opt.[[localeMatcher]] to matcher.
7. Let lineBreakStyle be ? GetOption(options, "lineBreakStyle", "string", « "strict",  "normal", "loose" », "normal").

and then much later we have the coercion of the options argument into an Object:

11. If options is undefined, then
    a. Let options be ObjectCreate(null).
12. Else
    b. Let options be ? ToObject(options).

This is inconsistent with all the other intl objects. The coercion to Object should happen before properties of the options argument are accessed.

cc @FrankYFTang

Support CSS word-break property

CSS defines a couple of tailorings of line breaking, in the word-break property, which can have three values: normal, keep-all, break-all. None of these, including break-all, expose grapheme breaks, but rather they are modifications of UAX 14. Expose these tailorings for direct usage through Intl.Segmenter.

cc @eaenet

Reconsider "graphemes"

"grapheme" is a vague term. It can be used to mean the written counterpart of a phoneme (in which case "ee" in "feel" is a grapheme). It can also be used to talk about individual marks (like an "accent"). Unicode never tries to talk in terms of graphemes, and I don't think we should either. This is why Unicode defines grapheme cluster; to get rid of this ambiguity.

We could use "grapheme clusters" here, but we're doing tailored segmentation, so it's not really that.

Unicode defines GCs and EGCs as an approximation of "user-perceived character". I wonder if we can use "characters" instead. Though that's just as ambiguous.

Just hoping we can come up with a better name here.

Add back granularity: "line"?

Now that the interface has been clarified as iterating over boundaries, there is potential to reintroduce { granularity: "line" } implementing UAX #14 or an implementation-specific alternative. https://twitter.com/AaronPresley/status/1116424359223054336 provides a clear (if limited) use case that was missing at the time of #49.

But on the other hand, there's nothing preventing us from keeping it scoped out of this proposal and potentially adding it in a followup. I'm comfortable with either, but didn't want to bury the new demand that has emerged since December.

Rename segment type values

UAX #29 rules describe where boundaries exist and don't exist, but don't classify segments or use properties to describe them. ICU does apply properties, but seems to do so as multiple flags rather than as a single label—for example, the word segment "A113" might be described by both LETTER and NUMBER (assuming the data are derived from/analogous to the Default Word Boundary Specification), while the sentence segment "Why?\n" is described by STERM and LF (assuming the data are derived from/analogous to the Default Sentence Boundary Specification).

I'm still not sure how I feel about collapsing what is logically a list into a scalar, but taking that as given for the sake of discussion, I'd like to see better naming.

Suggestions:

  • grapheme: undefined → "grapheme"
  • word: "none" → "space" and "punctuation" (i.e., three types rather than two)
  • sentence: "term" → "terminated"
  • sentence: "sep" → "fragment"

Review from the W3C i18n WG

The W3C Internationalization Working Group looks deeply into issues of text display, including line breaking, which this proposal also touches on. If they are available, I'd appreciate a review from them. cc @aphillips @stpeter

Is this an API for iterating segments, or boundaries?

The README and spec text seem to bounce somewhat between describing the functionality as "iterating over segments" and "iterating over breaks/boundaries", and the confusion bleeds a little bit into the proposed interfaces. Can we settle on a consistent model and align everything to that?

Relevant Unicode vocabulary is described at #44 (comment), but the even shorter summary is that both grapheme/word/sentence segmentation and line breaking completely partition a nonempty string into a sequence of nonempty segments terminated by boundaries if we normalize treatment at the start of text (where UAX #29 recognizes a boundary but UAX #14 prohibits a break opportunity). A grapheme boundary follows every "character", a word boundary follows every "word" and every non-word character, a sentence boundary follows every sentence-terminating punctuation mark together with any immediately following linear whitespace and up to one line terminator, and a line break opportunity follows every space. In all granularities, a boundary immediately precedes the end of text.

I personally feel like break associates more strongly with lines than with graphemes, words, or sentences, so I'd like to avoid general use of that term, leaving for consideration boundary vs. segment.

Given nonempty input,

  • a segment iterator identifying segments by the first included position would always have a first result starting at code unit index 0 and a final result starting at or before code unit index len − 1, and results (being segments) could easily be categorized by their constituent code points (although describing line segments as hard vs. soft or mandatory vs. optional is less intuitive than one would hope).
  • a boundary iterator identifying boundaries by the immediately following position would always have a first result starting at or after code unit index 1 and a final result starting at code unit index len, and results (being boundaries) would have and possibly reference preceding segments (though line boundaries could be intuitively categorized as hard vs. soft or mandatory vs. optional).

My preference is leaning towards the latter, but this determination should really be made by analyzing the consequences upon consumers, both via next and via methods like following/preceding (cf. #52).
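
The two models are easy to convert between. As a sketch against the segment-oriented API described in the README, the hypothetical adapter boundariesOf below (not part of any proposal text) exposes each segment as the boundary that terminates it:

```javascript
// Hypothetical adapter: view segments as boundaries, each identified by the
// code unit index immediately following a segment.
function* boundariesOf(segments) {
  for (const { index, segment, isWordLike } of segments) {
    yield { boundary: index + segment.length, precedingSegment: segment, isWordLike };
  }
}

const boundaryList = Array.from(
  boundariesOf(new Intl.Segmenter("fr", { granularity: "word" }).segment("Allons-y!")),
  (b) => b.boundary
);
console.log(boundaryList);  // [ 6, 7, 8, 9 ]
```

As described in the second bullet, the first boundary is at or after index 1 and the last one equals the string length (9 here).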

Rename Segment Iterator to Boundary Iterator

  • Segment Iterator → Boundary Iterator
  • SegmentIterator* → BoundaryIterator*

cf. #67 (comment)

Please, BoundaryIterator rather than BreakIterator. "Boundary" is better because it applies more generally (e.g., there are grapheme boundaries but not really grapheme breaks). It was also the predominant term recorded in meeting notes:

  • RG: It sounds like there's a rough consensus over making this an iterator over the boundaries.
  • FT: How about "boundaryType"?
  • RG: It seems like we have an agreement on the conceptual model. I'd like to follow up with a PR. If you're breaking on words, for example, do you need to distinguish segments that are whitespace, for example, compared to a segment of letters?

Cite special word break needs?

Some languages, such as Thai, Japanese, or Chinese, do not use spaces between words. Proper word breaking in these languages depends on special algorithms usually coupled with dictionaries. Since common operations that use word breaking include by-word text selection or indexing for full text search and these operations want true word boundary detection, it would be useful to note the special requirements of these languages. I believe the ICU library now incorporates several unencumbered dictionaries. I call this out because the references in the draft such as Unicode/TR29 and CLDR do not provide this support.
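
A quick way to see what a particular implementation does for such text is sketched below. The sample sentence is an assumption for illustration, and no expected output is given because the word boundaries depend on the dictionary data the engine ships:

```javascript
const japanese = "吾輩は猫である。";
const japaneseSegments = [
  ...new Intl.Segmenter("ja", { granularity: "word" }).segment(japanese),
];

// Word-like segments as found by this implementation (results may vary).
const japaneseWords = japaneseSegments
  .filter((data) => data.isWordLike)
  .map((data) => data.segment);
console.log(japaneseWords);

// Whatever the boundaries are, the segments still cover the whole input.
const coversInput = japaneseSegments.map((data) => data.segment).join("") === japanese;
console.log(coversInput);
```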

Allow implementations to determine their own breaking rules

Different web browsers use different line breaking rules. WebKit and Blink use ICU's algorithms based on the Unicode standard, whereas Edge and Firefox use other algorithms. Some of these browsers might not even ship Unicode line breaking data.

Rather than normatively referencing Unicode algorithms here, instead say "such as" in a note, and leave it to implementations to use the appropriate breaking algorithms.

Add lw, ss options?

There are additional options for breaking, specifically:

  • lw -- word break style (normal, keepall, breakall, matching CSS)
  • ss -- sentence break suppression (none, standard -- standard might be the better behavior here)

It's not clear how important these options are. When we discussed this issue in the ECMA 402 VC meeting in January 2018, @srl295 argued that all options should be presented, while @jungshik argued that the line breaking options are the most important ones. Do these options require taking up additional data size? If so, this is an additional argument against them.

Will this need full ICU in Node.js even for the 'en' locale?

With the last v8-canary build on Windows 7 x64 ([email protected], [email protected]):

'use strict';

console.log(Intl.Segmenter);
console.log(new Intl.Segmenter('en', {granularity: 'word'}));

With full ICU (node --icu-data-dir=.\node_modules\full-icu --harmony test.js):

[Function: Segmenter]
Segmenter [Intl.Segmenter] {}

Without full ICU (node --harmony test.js):

[Function: Segmenter]


#
# Fatal error in , line 0
# Check failed: U_SUCCESS(status).
#
#
#
#FailureMessage Object: 000000000021DB80

precedingSegmentType misrepresents list-valued segment data

As I've mentioned before, there are issues with using a scalar-valued precedingSegmentType to expose list-valued segment data:

  1. Imprecision. The word- and sentence-granularity types in the current spec collapse together several combinations of what logically and in ICU are independent flags (e.g., a "word" can contain or not contain letters/numbers/kana/ideographs/etc., a non-word "none" can contain punctuation or whitespace, and a "term" sentence may or may not include whitespace and/or a line terminator after its terminal punctuation).
  2. Incompleteness. Consumers can observe that a sentence segment has terminating punctuation, but cannot easily determine what that punctuation is.
  3. Nonconfigurability. Implementations are required to test characters as they iterate through a string, but consumers have no way to express specific details that matter for their application such as the presence of numbers or non-ASCII characters or special-significance symbols like # or @.
  4. Future-hostility. If we ever do want to be more expressive, backwards compatibility will motivate preserving the existing field and its semantics with ugliness like describing a word-granularity segment "&" as { index: 42, precedingSegmentType: "none", precedingSegmentTags: ["punctuation"] }.

If collecting details about segments during internal iteration and exposing them later to avoid the need for author-level re-iteration is important (and it seems to be), then this API should expose that information in a way that doesn't suffer from the above issues.

For example, replacing string precedingSegmentType with array-of-strings precedingSegmentTags would address imprecision and future-hostility and also suggest a straightforward future extension to address incompleteness and nonconfigurability (e.g., new Intl.Segmenter(locale, {granularity: "word", customTags})).
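
The list-valued shape can be sketched in userland. This is only an emulation of the data shape, not the proposed API: the tag names, the helper segmentWithTags, and the customTags format are all hypothetical, and it re-tests each segment in script, which is exactly the re-iteration a built-in version would avoid.

```javascript
// Hypothetical tag names; none of these are part of the proposal.
const TAG_TESTS = [
  ["letter", /\p{L}/u],
  ["number", /\p{N}/u],
  ["punctuation", /\p{P}/u],
  ["whitespace", /\p{White_Space}/u],
];

// Hypothetical helper: segment `text` by word and attach a list of tags
// to each segment instead of a single scalar type.
function segmentWithTags(text, locale, customTags = []) {
  const tests = [...TAG_TESTS, ...customTags];
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text), ({ segment, index }) => ({
    index,
    segment,
    // A segment gets every tag whose pattern matches somewhere inside it.
    tags: tests.filter(([, re]) => re.test(segment)).map(([name]) => name),
  }));
}

// A consumer-supplied tag for special-significance symbols such as # or @.
const tagged = segmentWithTags("Hi, #42!", "en", [["symbolic", /[#@]/]]);
for (const { index, segment, tags } of tagged) {
  console.log(index, JSON.stringify(segment), tags);
}
```

Because tags is an array, a segment can carry several independent flags at once, and new tags can be added later without changing the meaning of existing ones.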

Should only strings and objects be allowed in Segmenter.prototype.segment?

Currently we call ToString on the argument passed to Segmenter.prototype.segment, giving us some nice wat results like:

const s = new Intl.Segmenter("en", {granularity: "word"});
s.segment()[Symbol.iterator]().next().value.segment // "undefined"
s.segment(null)[Symbol.iterator]().next().value.segment // "null"
s.segment(true)[Symbol.iterator]().next().value.segment // "true"
s.segment(false)[Symbol.iterator]().next().value.segment // "false"

Should we type check that the argument is a string or an object, and only then call ToString on it?
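
A minimal sketch of such a check, written as a userland wrapper rather than a spec change. The name strictSegment is hypothetical, and treating functions as objects is an assumption of this sketch.

```javascript
// Hypothetical wrapper: reject anything that is not a string or an object
// before handing it to segment(), which would otherwise apply ToString.
function strictSegment(segmenter, input) {
  const isObject =
    (typeof input === "object" || typeof input === "function") && input !== null;
  if (typeof input !== "string" && !isObject) {
    const kind = input === null ? "null" : typeof input;
    throw new TypeError(`Cannot segment a value of type ${kind}`);
  }
  return segmenter.segment(input);
}

const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

// Strings pass through unchanged.
console.log([...strictSegment(wordSegmenter, "a b")].map((x) => x.segment)); // ["a", " ", "b"]

// undefined, null, booleans, and numbers are rejected instead of stringified.
try {
  strictSegment(wordSegmenter, null);
} catch (err) {
  console.log(err instanceof TypeError); // true
}
```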

Script runs

Would splitting text into script runs be in scope for the Intl.Segmenter API?

Support strictness

From @jungshik in tc39/ecma402#60

We also need to support 'strictness' (for lack of a better term) either as a separate option or as values of 'type'.

CSS3 has 'strict', 'normal', 'loose' (and 'auto') for line-break, and ICU/CLDR support them. (When v8BreakIterator was written, there was no such distinction.)
