Comments (5)

littledan avatar littledan commented on September 23, 2024

Good idea. (Or should we call it Intl.Breaker?) I added this to the October 2018 Intl meeting agenda: https://github.com/tc39/ecma402/blob/master/meetings/agenda-2018-10-18.md

from proposal-intl-segmenter.

gibson042 avatar gibson042 commented on September 23, 2024

UAX #29 employs the following vocabulary:

  • [significant] text element: a sentence, word, or user-perceived character
  • segmentation: the process of boundary determination
  • boundary: a transition point between two segments
  • segment: synonym for [significant] text element
  • break: synonym for boundary
  • grapheme cluster: an algorithmically-defined approximation of a user-perceived character

UAX #14 adds:

  • line break: a position in text where one line ends
  • [line] break opportunity: a position in text where a line is allowed to end
  • mandatory break: a character property that requires an immediately following line break
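As a rough illustration of the UAX #29 vocabulary above, the sketch below separates segments (text elements) from boundaries (breaks). It relies on `Intl.Segmenter` as later standardized in ECMA-402, which is not the draft interface under discussion in this thread; the helper names `segmentsOf` and `boundariesOf` are made up for this example.

```javascript
// Segments: the text elements themselves (words, sentences, or graphemes).
function segmentsOf(text, locale, granularity) {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Boundaries: the transition points between segments, as code unit offsets.
// A non-empty text with n segments has n + 1 boundaries, counting both ends.
function boundariesOf(text, locale, granularity) {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  const boundaries = [0];
  for (const { segment, index } of segmenter.segment(text)) {
    boundaries.push(index + segment.length);
  }
  return boundaries;
}

const text = "Ceci n'est pas une pipe";
console.log(segmentsOf(text, "fr", "word"));
console.log(boundariesOf(text, "fr", "word"));
```

Note that the segments carry the text while the boundaries are bare offsets, which is why any descriptive property has to come from a segment.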

Since this proposal is derived from those technical reports, it would be nice if its interface hewed as closely to them as is practical. ICU demonstrates that there is value in providing detail beyond the mere position of boundaries, but taking its interface (which targets low-level languages and has grown organically in specialized directions) as gospel seems like a mistake. Model accuracy is also important: boundaries don't have properties of their own, but their preceding segments do (and in combination rather than as partitioners, cf. getRuleStatusVec). Even mandatory vs. optional line break opportunities ("hard" vs. "soft" in ICU vocabulary) are determined by whether or not the last character of the preceding segment is a terminator.

I'm not sure this proposal should include reflection of segment characteristics, but if it does then we should avoid the singular "type" altogether, in anticipation of future extensions describing segments by multiple dimensions (e.g., a word being foreign to the segmenter locale, having code points from multiple general categories, etc.). Do we want granularity-specific properties (e.g., mandatory: true or terminatingPunctuation: "!")? Or perhaps an array or set that is always present and contains granularity-specific values (e.g., segmentTags: ["word"])? But if you're worried about performance, it might be best to leave such determinations out of the implementation itself, or make them opt-in at iterator construction time.
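To make the "set of tags, opt-in" option concrete, here is a minimal sketch. Everything in it is hypothetical: the function `describeSegment`, the `includeTags` option, and the tag names are invented for illustration and are not part of the proposal or of ICU.

```javascript
// Hypothetical: describe a segment along several dimensions at once,
// instead of assigning it a single "type".
function describeSegment(segment, { includeTags = false } = {}) {
  const record = { segment };
  // Opt-in: callers who only need positions skip the classification work.
  if (!includeTags) return record;

  const tags = new Set();
  if (/\p{L}/u.test(segment)) tags.add("letter");      // contains a letter
  if (/\p{N}/u.test(segment)) tags.add("number");      // contains a digit
  if (/^\s+$/u.test(segment)) tags.add("whitespace");  // whitespace only
  if (/[.?!]$/u.test(segment)) tags.add("terminated"); // ends in a terminator
  record.segmentTags = tags;
  return record;
}

console.log(describeSegment("R2D2", { includeTags: true }).segmentTags);
console.log(describeSegment("R2D2")); // no segmentTags unless requested
```

A set lets one segment carry several tags at once (here both "letter" and "number"), which a singular `type` field cannot express.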

gibson042 avatar gibson042 commented on September 23, 2024

The current text does a poor job of defining what breakType is. Its possible values seem to describe segments rather than boundaries, and the text does not specify which boundary-adjacent segment they correspond to. This is especially confusing for backwards iteration: what is the proper value of breakType after (new Intl.Segmenter("fr", {granularity: "word"})).segment("Ceci n'est pas une pipe").preceding(8)? There's also the issue of a missing definition for "numbers, letters, kana characters, ideographic characters, etc" and "sentence terminator ('.', '?', '!', etc.)".
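The ambiguity can be shown by listing both segments that touch the boundary in question. This sketch uses `Intl.Segmenter` and `containing()` as later standardized in ECMA-402, since the draft `preceding()` method discussed here is not available in current engines; `segmentsAroundPrecedingBoundary` is a hypothetical helper name.

```javascript
// Find the nearest boundary strictly before `position` and report the
// segment on each side of it.
function segmentsAroundPrecedingBoundary(text, locale, position) {
  const segments = new Intl.Segmenter(locale, { granularity: "word" }).segment(text);
  // That boundary is the start of the segment containing the code unit
  // at position - 1. containing() returns undefined when out of range.
  const after = segments.containing(position - 1);
  if (after === undefined) return undefined;
  const boundary = after.index;
  // The segment ending at the boundary, if the boundary is not the text start.
  const before = boundary > 0 ? segments.containing(boundary - 1) : undefined;
  return { boundary, before, after };
}

const result = segmentsAroundPrecedingBoundary("Ceci n'est pas une pipe", "fr", 8);
console.log(result.boundary, result.before?.segment, result.after.segment);
```

A single breakType value reported at that boundary could plausibly describe either `before` or `after`, and the two will generally differ in kind (for example, whitespace on one side and a word on the other).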

I am in favor of removing breakType because it is easy for consumers to check the break-preceding code unit at index - 1 on their own, but if breakType or a renamed equivalent remains then it needs a better and more complete specification.
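A consumer-side check of the break-preceding code unit might look like the sketch below. The terminator set is an assumption made for illustration, since the draft's "etc." leaves the real list undefined, and `followsSentenceTerminator` is a hypothetical helper name.

```javascript
// Assumed terminator set; the draft text does not define the full list.
const SENTENCE_TERMINATORS = new Set([".", "?", "!"]);

// Report whether the code unit immediately before boundary `index`
// is one of the assumed sentence terminators.
function followsSentenceTerminator(text, index) {
  if (index <= 0 || index > text.length) return false;
  return SENTENCE_TERMINATORS.has(text[index - 1]);
}

console.log(followsSentenceTerminator("Hi! Bye.", 3)); // the unit at 2 is "!"
console.log(followsSentenceTerminator("Hi! Bye.", 4)); // the unit at 3 is " "
```

One caveat: under UAX #29, trailing whitespace normally stays with the sentence it follows, so at sentence granularity a practical check would probably need to skip back over whitespace before looking for the terminator.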

gibson042 avatar gibson042 commented on September 23, 2024

The initial question of this issue was resolved by 242ce14.

littledan avatar littledan commented on September 23, 2024

Closing per #72
