Comments (5)
Good idea. (Or, should we call it, Intl.Breaker???) Added to the October 2018 Intl meeting agenda. https://github.com/tc39/ecma402/blob/master/meetings/agenda-2018-10-18.md
from proposal-intl-segmenter.
UAX #29 employs the following vocabulary:
- [significant] text element: a sentence, word, or user-perceived character
- segmentation: the process of boundary determination
- boundary: a transition point between two segments
- segment: synonym for [significant] text element
- break: synonym for boundary
- grapheme cluster: an algorithmically-defined approximation of a user-perceived character
UAX #14 adds:
- line break: a position in text where one line ends
- [line] break opportunity: a position in text where a line is allowed to end
- mandatory break: a character property that requires an immediately following line break
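To make the vocabulary concrete, here is a minimal sketch of how boundaries relate to segments. It assumes the Intl.Segmenter API as it later shipped in ECMA-402, which differs from the draft interface discussed in this thread; `boundaries` is a hypothetical helper name.

```javascript
// Sketch only: relies on the shipped Intl.Segmenter API, not the draft under discussion.
// `boundaries` is a hypothetical helper, not part of any specification.
function boundaries(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  const result = [];
  for (const { segment, index } of segmenter.segment(text)) {
    // The first segment starts at the initial boundary.
    if (result.length === 0) result.push(index);
    // Every segment ends at a boundary: its start index plus its length.
    result.push(index + segment.length);
  }
  return result;
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT is two code points
// but a single grapheme cluster (one user-perceived character).
const graphemeBoundaries = boundaries("e\u0301a", "grapheme");
console.log(graphemeBoundaries); // [0, 2, 3]
```

Each boundary is a transition point between two segments, so n segments yield n + 1 boundaries.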
Since this proposal is derived from those technical reports, it would be nice if the interface it introduces hewed as closely to them as practical. ICU demonstrates that there is value in providing detail beyond the mere position of boundaries, but taking its interface (which targets low-level languages and has grown organically in specialized directions) as gospel seems like a mistake. Model accuracy is also important: boundaries don't have properties of their own, but their preceding segments do (and in combination rather than as partitioners, cf. getRuleStatusVec). Even mandatory vs. optional line break opportunities (or "hard" vs. "soft" in ICU vocabulary) are determined by whether or not the last character of the preceding segment is a terminator.
I'm not sure this proposal should include reflection of segment characteristics, but if it does then we should avoid the singular "type" altogether, in anticipation of future extensions describing segments by multiple dimensions (e.g., a word being foreign to the segmenter locale, having code points from multiple general categories, etc.). Do we want granularity-specific properties (e.g., mandatory: true or terminatingPunctuation: "!")? Or perhaps an array or set that is always present and contains granularity-specific values (e.g., segmentTags: ["word"])? But if you're worried about performance, it might be best to leave such determinations out of the implementation itself, or make them opt-in at iterator construction time.
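One way to picture the multi-valued, opt-in alternative is the sketch below. Everything in it is hypothetical: `describeSegments`, the `withTags` option, and the tag names are illustrative assumptions, and the segmentation itself leans on the Intl.Segmenter API as it later shipped rather than on the draft discussed here.

```javascript
// Hypothetical sketch: `describeSegments`, `withTags`, and the tag names
// are illustrative only and not part of the proposal or any shipped API.
function describeSegments(text, locale, granularity, { withTags = false } = {}) {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  const described = [];
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    const entry = { segment, index };
    if (withTags) {
      // A segment may carry several tags at once instead of one "type".
      const segmentTags = [];
      if (isWordLike) segmentTags.push("word");
      if (/\p{L}/u.test(segment)) segmentTags.push("letter");
      if (/\p{Nd}/u.test(segment)) segmentTags.push("digit");
      entry.segmentTags = segmentTags;
    }
    described.push(entry);
  }
  return described;
}

// Tags are computed only when the caller opts in.
const plain = describeSegments("abc 123", "en", "word");
const tagged = describeSegments("abc 123", "en", "word", { withTags: true });
```

Because tagging is requested at construction time, callers that only need positions would not pay for the extra classification.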
The current text does a poor job of defining what breakType is. Possible values seem to describe segments rather than boundaries, and it is not specified which boundary-adjacent segment they correspond to. This is especially confusing for backwards iteration: what is the proper value of breakType after (new Intl.Segmenter("fr", {granularity: "word"})).segment("Ceci n'est pas une pipe").preceding(8)? There's also the issue of a missing definition for "numbers, letters, kana characters, ideographic characters, etc" and "sentence terminator ('.', '?', '!', etc.)".
I am in favor of removing breakType because it is easy for consumers to check the break-preceding code unit at index - 1 on their own, but if breakType or a renamed equivalent remains then it needs a better and more complete specification.
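As a rough illustration of the consumer-side check, the sketch below inspects the code unit at index - 1 for a given boundary. The helper names and the terminator set are assumptions made for this example and are not taken from the proposal text.

```javascript
// Illustrative assumption: a tiny terminator set, not a specified definition.
const TERMINATORS = new Set([".", "?", "!"]);

// Returns the code unit immediately before a boundary, or undefined at index 0.
function precedingCodeUnit(text, boundaryIndex) {
  return boundaryIndex > 0 ? text[boundaryIndex - 1] : undefined;
}

// True when the boundary directly follows one of the assumed terminators.
function followsTerminator(text, boundaryIndex) {
  return TERMINATORS.has(precedingCodeUnit(text, boundaryIndex));
}

console.log(followsTerminator("Hi! Bye", 3)); // true: text[2] is "!"
console.log(followsTerminator("Hi! Bye", 4)); // false: text[3] is " "
```

Note that UAX #29 sentence boundaries generally fall after any trailing spaces, so a real check would probably need to skip back over whitespace before testing for a terminator.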
The initial question of this issue was resolved by 242ce14.
Closing per #72