Comments (19)
Ping @gibson042. The "ECMA-402 notes, 2019-01-17" show:
"Richard Gibson will add more detail in the ticket",
"RG: Let's end up with something internally consistent. I'll sketch the alternatives. They are logically equivalent, but what we have right now is a mixture, and I don't like that."
"RG will take action item to prepare and clean up the changes to consolidate on segment, to discuss next meeting."
Could you give me an ETA for that? Thanks
from proposal-intl-segmenter.
Random access, returning "edge of input" boolean
Add following and preceding methods to boundary iterators that accept an optional code unit index argument and return iterator result values after updating iterator state. The methods could alternatively return a boolean as currently specified ("true if the {beginning,end} of the string was reached") [after updating iterator state].
// ┃0 1 2 3 4 5┃6┃7┃8┃
// ┃A l l o n s┃-┃y┃!┃
boundaries.following(0) // → false
boundaries.following(5) // → false
boundaries.following(6) // → false
boundaries.following(8) // → true
boundaries.following(9) // → RangeError
// preceding operations include an implicit initial decrement to avoid infinite loops
// (optionally limited to no-argument invocations).
boundaries.preceding(9) // → false
boundaries.preceding(8) // → false
boundaries.preceding(6) // → true
boundaries.preceding(5) // → true
boundaries.preceding(1) // → true
boundaries.preceding(0) // → RangeError
Please react to express approval or disapproval.
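The boolean-returning variant above can be illustrated with a small stand-in. This is only a sketch: `BooleanBoundaryCursor` is a hypothetical name, it takes a precomputed list of boundary indexes (including 0 and the input length) instead of running a real segmenter, and it assumes that "following" and "preceding" mean strictly after and strictly before the given code unit index.

```javascript
// Hypothetical stand-in for a boundary iterator; not part of any specification.
class BooleanBoundaryCursor {
  // `boundaries` is a sorted list of boundary indexes, including 0 and the input length.
  constructor(boundaries) {
    this.boundaries = boundaries;
    this.start = boundaries[0];
    this.end = boundaries[boundaries.length - 1];
    this.index = this.start; // iterator state: the current boundary
  }

  // Move to the first boundary strictly after `from`.
  // Returns true if the end of the string was reached.
  following(from = this.index) {
    if (!(from >= this.start && from < this.end)) {
      throw new RangeError(`no boundary follows index ${from}`);
    }
    this.index = this.boundaries.find((b) => b > from);
    return this.index === this.end;
  }

  // Move to the last boundary strictly before `from`.
  // Returns true if the beginning of the string was reached.
  preceding(from = this.index) {
    if (!(from > this.start && from <= this.end)) {
      throw new RangeError(`no boundary precedes index ${from}`);
    }
    this.index = this.boundaries.filter((b) => b < from).pop();
    return this.index === this.start;
  }
}

// Boundaries of "Allons-y!" at word granularity, as in the diagram above.
const cursor = new BooleanBoundaryCursor([0, 6, 7, 8, 9]);
const visited = [];
let atEnd = false;
while (!atEnd) {
  atEnd = cursor.following(); // no argument: continue from the current boundary
  visited.push(cursor.index);
}
console.log(visited); // [ 6, 7, 8, 9 ]
```

With this shape the caller has to read the new position from iterator state after each call, since the return value only signals whether an edge of the input was reached.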
Exposed iterator state
Add lastIndex and/or input properties to boundary iterators and/or an input property to iterator result values, in broad alignment with the RegExp interface.
let boundaries = segmenter.segment(input);
// Iterate over boundaries.
-let lastIndex = 0;
+let lastIndex = boundaries.lastIndex;
for (let {index} of boundaries) {
- let segment = input.slice(lastIndex, index);
+ let segment = boundaries.input.slice(lastIndex, index);
console.log(`boundary before index ${index}, after segment «${segment}»`);
lastIndex = index;
}
Please react to express approval or disapproval.
Richer iterator result values
Add lastIndex and/or input properties to boundary iterators and/or input/granularity/etc. properties to iterator result values, in broad alignment with the RegExp interface.
let boundaries = segmenter.segment(input);
// Iterate over boundaries.
let lastIndex = 0;
-for (let {index} of boundaries) {
+for (let {granularity, input, index} of boundaries) {
let segment = input.slice(lastIndex, index);
- console.log(`boundary before index ${index}, after segment «${segment}»`);
+ console.log(`${granularity} boundary before index ${index}, after segment «${segment}»`);
lastIndex = index;
}
Please react to express approval for inclusion or approval for exclusion.
Segment capture
Add a precedingSegment to iteration result values. If the Exposed iterator state extension is accepted, then the property could also be added to boundary iterators. If the Random access extension is accepted, then preceding calls should clear/exclude the property and replace it with followingSegment.
-for (let {index} of boundaries) {
- let segment = input.slice(lastIndex, index);
- console.log(`boundary before index ${index}, after segment «${segment}»`);
- lastIndex = index;
+for (let {index, precedingSegment} of boundaries) {
+ console.log(`boundary before index ${index}, after segment «${precedingSegment}»`);
}
Please react to express approval or disapproval.
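To make the shape of the segment capture results concrete, here is a minimal sketch. The generator `boundariesWithSegments` is a hypothetical helper, and the boundary indexes are supplied by hand instead of coming from a real segmenter; only the `index` and `precedingSegment` property names are taken from the proposal text above.

```javascript
// Hypothetical helper: yields one result per boundary, capturing the text
// between the previous boundary and this one as `precedingSegment`.
function* boundariesWithSegments(input, boundaryIndexes) {
  let lastIndex = 0;
  for (const index of boundaryIndexes) {
    yield { index, precedingSegment: input.slice(lastIndex, index) };
    lastIndex = index;
  }
}

// Boundaries of "Allons-y!" at word granularity, supplied by hand.
const results = [...boundariesWithSegments("Allons-y!", [6, 7, 8, 9])];
for (const { index, precedingSegment } of results) {
  console.log(`boundary before index ${index}, after segment «${precedingSegment}»`);
}
// boundary before index 6, after segment «Allons»
// boundary before index 7, after segment «-»
// boundary before index 8, after segment «y»
// boundary before index 9, after segment «!»
```

The consumer no longer needs to track `lastIndex` or hold a reference to the input string, which is the convenience of segment iteration that this extension is meant to preserve.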
I think that's a good summary of the changes, and I am happy to make them (with two corrections: preceedingSegmentType → precedingSegmentType and "before the nth code point" → "before the nth code unit"). As for Segmenter vs. Segmentation, I'll open a separate issue for discussion: #69.
I don't have a strong opinion on naming. We've been fairly effective at coming to a conclusion on these naming issues in Intl meetings (whereas a thread on GitHub can just go on forever), so let's discuss this there. Please make a PR to the Intl agenda if that sounds good to you.
Resolved in the meeting to move to boundary iteration, with corresponding renames and interface changes.
Huh, I'm not sure we made that resolution... If we end up removing line break support, maybe segment iteration makes more sense.
We seemed to conclude in the ECMA-402 meeting that we should switch to a break-based API. PRs welcome to make this more concrete.
The PR is still forthcoming, but here's a sketch. I have separated the minimal interface (pure iterators that expose only boundary indexes) from the many extensions to it that have been discussed in this repository (but not including breakType, which I believe to be both a premature optimization and an overly proprietary/restrictive affordance). It would be nice to have some feedback on those extensions before specifying them, so please use reactions to vote on the following comments.
Minimal interface
// Top-level constructor name now aligns with the rest of Intl
// (e.g., "NumberFormat" rather than "NumberFormatter").
let segmenter = new Intl.Segmentation("fr", {granularity: "word"});
let input = "Allons-y!";
let boundaries = segmenter.segment(input);
// Iterate over boundaries.
let lastIndex = 0;
for (let {index} of boundaries) {
let segment = input.slice(lastIndex, index);
console.log(`boundary before index ${index}, after segment «${segment}»`);
lastIndex = index;
}
// console.log output:
// boundary before index 6, after segment «Allons»
// boundary before index 7, after segment «-»
// boundary before index 8, after segment «y»
// boundary before index 9, after segment «!»
Random access (optional)
Add following and preceding methods to boundary iterators that accept an optional code unit index argument and return iterator result values after updating iterator state. The methods could alternatively return a boolean as currently specified ("true if the {beginning,end} of the string was reached"), but I personally much prefer this richer signature.
// ┃0 1 2 3 4 5┃6┃7┃8┃
// ┃A l l o n s┃-┃y┃!┃
boundaries.following(0) // → { index: 6 }
boundaries.following(5) // → { index: 6 }
boundaries.following(6) // → { index: 7 }
boundaries.following(8) // → { index: 9 }
boundaries.following(9) // → RangeError
// preceding operations include an implicit initial decrement to avoid infinite loops
// (optionally limited to no-argument invocations).
boundaries.preceding(9) // → { index: 8 }
boundaries.preceding(8) // → { index: 7 }
boundaries.preceding(6) // → { index: 0 }
boundaries.preceding(5) // → { index: 0 }
boundaries.preceding(1) // → { index: 0 }
boundaries.preceding(0) // → RangeError
Exposed iterator state (optional)
Add lastIndex and/or input properties to boundary iterators and/or input/granularity/etc. properties to iterator result values, in broad alignment with the RegExp interface.
let boundaries = segmenter.segment(input);
// Iterate over boundaries.
-let lastIndex = 0;
-for (let {index} of boundaries) {
+let lastIndex = boundaries.lastIndex;
+for (let {granularity, input, index} of boundaries) {
let segment = input.slice(lastIndex, index);
- console.log(`boundary before index ${index}, after segment «${segment}»`);
+ console.log(`${granularity} boundary before index ${index}, after segment «${segment}»`);
lastIndex = index;
}
Segment capture (optional)
Add a precedingSegment to iteration result values. If the Exposed iterator state extension is accepted, then the property could also be added to boundary iterators. If the Random access extension is accepted, then preceding calls should clear/exclude the property and replace it with followingSegment.
-for (let {index} of boundaries) {
- let segment = input.slice(lastIndex, index);
- console.log(`boundary before index ${index}, after segment «${segment}»`);
- lastIndex = index;
+for (let {index, precedingSegment} of boundaries) {
+ console.log(`boundary before index ${index}, after segment «${precedingSegment}»`);
}
Random access, returning iterator result values
Add following and preceding methods to boundary iterators that accept an optional code unit index argument and return iterator result values after updating iterator state. The methods could alternatively return a boolean as currently specified ("true if the {beginning,end} of the string was reached"), but I personally much prefer this richer signature.
// ┃0 1 2 3 4 5┃6┃7┃8┃
// ┃A l l o n s┃-┃y┃!┃
boundaries.following(0) // → { index: 6 }
boundaries.following(5) // → { index: 6 }
boundaries.following(6) // → { index: 7 }
boundaries.following(8) // → { index: 9 }
boundaries.following(9) // → RangeError
// preceding operations include an implicit initial decrement to avoid infinite loops.
boundaries.preceding(9) // → { index: 8 }
boundaries.preceding(8) // → { index: 7 }
boundaries.preceding(6) // → { index: 0 }
boundaries.preceding(5) // → { index: 0 }
boundaries.preceding(1) // → { index: 0 }
boundaries.preceding(0) // → RangeError
Please react to express approval or disapproval.
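The result-returning signature can be sketched as follows. `createBoundaryCursor` is a hypothetical factory, not the proposed API itself: it works over a hand-supplied, sorted list of boundary indexes (including 0 and the input length) and assumes strict "after" and "before" comparisons, which is what makes repeated no-argument `preceding()` calls advance instead of looping on the same boundary.

```javascript
// Hypothetical factory for a random-access boundary cursor; a sketch only.
function createBoundaryCursor(boundaryIndexes) {
  const start = boundaryIndexes[0];
  const end = boundaryIndexes[boundaryIndexes.length - 1];
  let current = start; // iterator state shared by both methods

  return {
    // First boundary strictly after `from` (defaults to the current boundary).
    following(from = current) {
      if (!(from >= start && from < end)) {
        throw new RangeError(`no boundary follows index ${from}`);
      }
      current = boundaryIndexes.find((b) => b > from);
      return { index: current };
    },
    // Last boundary strictly before `from` (defaults to the current boundary).
    preceding(from = current) {
      if (!(from > start && from <= end)) {
        throw new RangeError(`no boundary precedes index ${from}`);
      }
      current = boundaryIndexes.filter((b) => b < from).pop();
      return { index: current };
    },
  };
}

// Boundaries of "Allons-y!" at word granularity, as in the diagram above.
const cursor = createBoundaryCursor([0, 6, 7, 8, 9]);
console.log(cursor.following(5)); // { index: 6 }
console.log(cursor.preceding(5)); // { index: 0 }
console.log(cursor.preceding(9)); // { index: 8 }
console.log(cursor.preceding());  // { index: 7 }, continuing from boundary 8
```

Returning the result object means a caller gets the new position directly, without a second read of iterator state.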
@gibson042 Thanks for digging into this more, but I'm not sure why we need to make all of these changes. They seem like they are addressing orthogonal issues from the break vs segment question. I wrote up #67 to be a minimal change to switch to break iteration. It's still a "Segmenter" which provides the break iterator factory, and no capabilities are removed from the specification. I don't think there are any off-by-one errors, but your review would be appreciated to double-check that.
The motivation for random access was discussed in other issues, e.g., #9, cc @devongonett. I'd say the removal of line breaking, rather than the switch from segments to breaks, could be a reason to remove random access, but I also don't see any downsides to enabling random access.
With the exception of renaming Intl.Segmenter to Intl.Segmentation and considering input, they're not orthogonal, at least not completely. Random access arguments have a different interpretation if we're looking for boundaries rather than segments, as does exposed iterator state (specifically index), and segment capture would be the means of exposing beneficial aspects of segment iteration that motivated it in the first place.
As for random access specifically, the downside is the ability to alter the internal state of an iterator other than through the iterator protocol's next method, which could disrupt active iteration in a way that isn't possible with other built-in iterators. I don't know if it's enough to justify removal, but it's certainly a consideration.
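The disruption concern can be demonstrated with a toy iterator. This is an illustrative sketch under stated assumptions: `createBoundaries` is a hypothetical factory over a hand-supplied list of boundary indexes, and it assumes that `next` and `following` share a single position, which is one plausible reading of "updating iterator state".

```javascript
// Hypothetical boundary iterator whose position is shared by next() and following().
function createBoundaries(boundaryIndexes) {
  let position = -1; // slot in boundaryIndexes of the last boundary returned

  return {
    [Symbol.iterator]() {
      return this;
    },
    // Iterator protocol: advance to the next boundary.
    next() {
      position += 1;
      return position < boundaryIndexes.length
        ? { value: { index: boundaryIndexes[position] }, done: false }
        : { value: undefined, done: true };
    },
    // Random access: jump to the first boundary strictly after `from`.
    following(from) {
      const found = boundaryIndexes.findIndex((b) => b > from);
      if (found === -1) {
        throw new RangeError(`no boundary follows index ${from}`);
      }
      position = found;
      return { index: boundaryIndexes[position] };
    },
  };
}

const boundaries = createBoundaries([6, 7, 8, 9]);
const seen = [];
for (const { index } of boundaries) {
  seen.push(index);
  // A random-access call in the loop body moves the shared position to boundary 8.
  if (index === 6) boundaries.following(7);
}
console.log(seen); // [ 6, 9 ]
```

The loop never observes boundaries 7 and 8, even though nothing in the `for...of` statement itself changed. Array and Map iterators offer no comparable way to reposition them from outside the protocol.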
granularity
I'm not sure what granularity is here. Could you write out the example output to show us? From my understanding, in the current draft the term granularity refers to grapheme, line, and sentence, and stays constant within the same segmenter.
@FrankYFTang your understanding is correct. It would be used by a generic handler that did not itself create the iterator and therefore has no other means of determining its granularity.
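As a rough illustration of that generic-handler case, the sketch below uses a hypothetical `describeBoundaries` function that receives an iterable it did not create, plus a hypothetical `fakeBoundaries` generator standing in for a real boundary iterator. Only the `granularity`, `input`, and `index` result properties come from the extension described in this thread.

```javascript
// Hypothetical generic handler: it did not create the iterator, so the only
// way it can learn the granularity is from the iteration results themselves.
function describeBoundaries(boundaries) {
  const lines = [];
  for (const { granularity, index } of boundaries) {
    lines.push(`${granularity} boundary before index ${index}`);
  }
  return lines;
}

// Hypothetical stand-in for a boundary iterator with richer result values.
function* fakeBoundaries(granularity, input, boundaryIndexes) {
  for (const index of boundaryIndexes) {
    yield { granularity, input, index };
  }
}

const lines = describeBoundaries(fakeBoundaries("word", "Allons-y!", [6, 7, 8, 9]));
console.log(lines[0]); // word boundary before index 6
```

The value is the same on every result from a given iterator, so the property is redundant for code that constructed the segmenter and useful only for code that was handed the iterator.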
I have to say, I'm still lost on the rationale for the name "Segmentation" vs "Segmenter". I thought the "doer" part of speech would make sense since we're talking about a factory of iterators, and so a Segmenter would be something that vends those iterators. This is not something I feel really strongly about, though.
In the Intl call, we concluded to rename breakType to preceedingSegmentType. We also concluded that, whether iterating forward or backward, the index n refers to a boundary before the nth code point, so, for example, preceding 5 on "hello world" is 0.
Are there any other changes that we should make? I believe we reaffirmed the motivation for backwards iteration and random access on that call. @gibson042 pledged to make these changes during the call.
OK, I've uploaded #70 , #71 and #72 to address the agreed-upon points above. Reviews would be welcome! cc @Ms2ger @gibson042
Closing due to #59 (comment)