Comments (19)

FrankYFTang commented on September 24, 2024

Ping @gibson042. The "ECMA-402 notes, 2019-01-17" show:
"Richard Gibson will add more detail in the ticket",
"RG: Let's end up with something internally consistent. I'll sketch the alternatives. They are logically equivalent, but what we have right now is a mixture, and I don't like that."
"RG will take action item to prepare and clean up the changes to consolidate on segment, to discuss next meeting."
Could you give me an ETA for that? Thanks.

from proposal-intl-segmenter.

gibson042 commented on September 24, 2024

Random access, returning "edge of input" boolean

Add following and preceding methods to boundary iterators that accept an optional code unit index argument and, after updating iterator state, return a boolean as currently specified ("true if the {beginning,end} of the string was reached").

// ┃0 1 2 3 4 5┃6┃7┃8┃
// ┃A l l o n s┃-┃y┃!┃
boundaries.following(0) // → false
boundaries.following(5) // → false
boundaries.following(6) // → false
boundaries.following(8) // → true
boundaries.following(9) // → RangeError

// preceding operations include an implicit initial decrement to avoid infinite loops
// (optionally limited to no-argument invocations).
boundaries.preceding(9) // → false
boundaries.preceding(8) // → false
boundaries.preceding(6) // → true
boundaries.preceding(5) // → true
boundaries.preceding(1) // → true
boundaries.preceding(0) // → RangeError
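
For illustration only, the boolean-returning variant can be modeled over a fixed list of boundary indexes. `makeBoundaries` and its `boundaryIndexes` argument are hypothetical names, not part of the proposal; the list stands in for whatever the segmenter would compute, and the default argument (the current position) is one possible reading of "optional".

```javascript
// Hypothetical model of the boolean-returning variant; not spec text.
// `boundaryIndexes` is assumed sorted, starting at 0 and ending at input.length.
function makeBoundaries(input, boundaryIndexes) {
  let index = 0; // iterator state: the boundary most recently landed on
  return {
    get index() {
      return index;
    },
    // Move to the first boundary strictly after `from`;
    // return true if that boundary is the end of the input.
    following(from = index) {
      if (!Number.isInteger(from) || from < 0 || from >= input.length) {
        throw new RangeError(`following(${from}) is out of range`);
      }
      index = boundaryIndexes.find((b) => b > from);
      return index === input.length;
    },
    // Move to the last boundary strictly before `from`;
    // return true if that boundary is the start of the input.
    preceding(from = index) {
      if (!Number.isInteger(from) || from <= 0 || from > input.length) {
        throw new RangeError(`preceding(${from}) is out of range`);
      }
      index = [...boundaryIndexes].reverse().find((b) => b < from);
      return index === 0;
    },
  };
}

const boundaries = makeBoundaries("Allons-y!", [0, 6, 7, 8, 9]);
console.log(boundaries.following(5), boundaries.index); // false 6
console.log(boundaries.following(8), boundaries.index); // true 9
console.log(boundaries.preceding(6), boundaries.index); // true 0
```

With a boolean return value, callers have to read the new position from iterator state (here the assumed `index` getter), which is one reason the richer return signature may be more convenient.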

Please react with +1 to express approval or -1 to express disapproval.

gibson042 commented on September 24, 2024

Exposed iterator state

Add lastIndex and/or input properties to boundary iterators and/or an input property to iterator result values, in broad alignment with the RegExp interface.

 let boundaries = segmenter.segment(input);
 
 // Iterate over boundaries.
-let lastIndex = 0;
+let lastIndex = boundaries.lastIndex;
 for (let {index} of boundaries) {
-  let segment = input.slice(lastIndex, index);
+  let segment = boundaries.input.slice(lastIndex, index);
   console.log(`boundary before index ${index}, after segment «${segment}»`);
   lastIndex = index;
 }
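
For reference, the RegExp behavior this extension alludes to looks like the following: with the `g` flag, `exec` advances the regular expression's `lastIndex`, and each match result carries `index` and `input`. This only illustrates the existing RegExp interface, not the proposed boundary iterator; the variable names are arbitrary.

```javascript
const re = /[a-z]+/gi; // the g flag makes exec() stateful via lastIndex
const text = "Allons-y!";
const rows = [];
let match;
while ((match = re.exec(text)) !== null) {
  rows.push({
    matched: match[0],
    index: match.index, // where this match starts
    lastIndex: re.lastIndex, // where the next search will begin
    sameInput: match.input === text, // the result remembers its input
  });
}
console.log(rows);
// [ { matched: 'Allons', index: 0, lastIndex: 6, sameInput: true },
//   { matched: 'y', index: 7, lastIndex: 8, sameInput: true } ]
```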

Please react with +1 to express approval or -1 to express disapproval.

gibson042 commented on September 24, 2024

Richer iterator result values

Add input/granularity/etc. properties to iterator result values, in broad alignment with the RegExp interface.

 let boundaries = segmenter.segment(input);
 
 // Iterate over boundaries.
 let lastIndex = 0;
-for (let {index} of boundaries) {
+for (let {granularity, input, index} of boundaries) {
   let segment = input.slice(lastIndex, index);
-  console.log(`boundary before index ${index}, after segment «${segment}»`);
+  console.log(`${granularity} boundary before index ${index}, after segment «${segment}»`);
   lastIndex = index;
 }

Please react with +1 to express approval for inclusion or -1 to express approval for exclusion.

gibson042 commented on September 24, 2024

Segment capture

Add a precedingSegment to iteration result values. If the Exposed iterator state extension is accepted, then the property could also be added to boundary iterators. If the Random access extension is accepted, then preceding calls should clear/exclude the property and replace it with followingSegment.

-for (let {index} of boundaries) {
-  let segment = input.slice(lastIndex, index);
-  console.log(`boundary before index ${index}, after segment «${segment}»`);
-  lastIndex = index;
+for (let {index, precedingSegment} of boundaries) {
+  console.log(`boundary before index ${index}, after segment «${precedingSegment}»`);
 }
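
A minimal sketch of how result values with `precedingSegment` could be produced, assuming the boundary indexes are already known. `boundariesWithSegments` is a hypothetical generator and the hard-coded index list stands in for the segmenter's output.

```javascript
// Hypothetical generator: yields { index, precedingSegment } for each boundary.
// `boundaryIndexes` is assumed sorted and to exclude the initial boundary at 0.
function* boundariesWithSegments(input, boundaryIndexes) {
  let lastIndex = 0;
  for (const index of boundaryIndexes) {
    // The segment that ends at this boundary starts at the previous one.
    yield { index, precedingSegment: input.slice(lastIndex, index) };
    lastIndex = index;
  }
}

for (const { index, precedingSegment } of boundariesWithSegments("Allons-y!", [6, 7, 8, 9])) {
  console.log(`boundary before index ${index}, after segment «${precedingSegment}»`);
}
// boundary before index 6, after segment «Allons»
// boundary before index 7, after segment «-»
// boundary before index 8, after segment «y»
// boundary before index 9, after segment «!»
```

The bookkeeping of `lastIndex` moves from every caller into the iterator, which is the convenience this extension is after.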

Please react with +1 to express approval or -1 to express disapproval.

gibson042 commented on September 24, 2024

I think that's a good summary of the changes, and I am happy to make them (with two corrections: preceedingSegmentType → precedingSegmentType and "before the nth code point" → "before the nth code unit"). As for Segmenter vs. Segmentation, I'll open a separate issue for discussion: #69.

littledan commented on September 24, 2024

I don't have a strong opinion on naming. We've been fairly effective at coming to a conclusion on these naming issues in Intl meetings (whereas a thread on GitHub can just go on forever), so let's discuss this there. Please make a PR to the Intl agenda if that sounds good to you.

gibson042 commented on September 24, 2024

Resolved in meeting to boundary iteration, with corresponding renames and interface changes.

littledan commented on September 24, 2024

Huh, I'm not sure we made that resolution... If we end up removing line break support, maybe segment iteration makes more sense.

littledan commented on September 24, 2024

We seemed to conclude in the ECMA-402 meeting that we should switch to a break-based API. PRs welcome to make this more concrete.

gibson042 commented on September 24, 2024

The PR is still forthcoming, but here's a sketch. I have separated the minimal interface (pure iterators that expose only boundary indexes) and many extensions to it that have been discussed in this repository (but not including breakType, which I believe to be both a premature optimization and an overly proprietary/restrictive affordance). It would be nice to have some feedback on those extensions before specifying them, so please use +1/-1 reactions to vote on the following comments.

Minimal interface

// Top-level constructor name now aligns with the rest of Intl
// (e.g., "NumberFormat" rather than "NumberFormatter").
let segmenter = new Intl.Segmentation("fr", {granularity: "word"});

let input = "Allons-y!";
let boundaries = segmenter.segment(input);

// Iterate over boundaries.
let lastIndex = 0;
for (let {index} of boundaries) {
  let segment = input.slice(lastIndex, index);
  console.log(`boundary before index ${index}, after segment «${segment}»`);
  lastIndex = index;
}
// console.log output:
// boundary before index 6, after segment «Allons»
// boundary before index 7, after segment «-»
// boundary before index 8, after segment «y»
// boundary before index 9, after segment «!»
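
The `Intl.Segmentation` constructor above is a proposal-stage sketch. As a rough, runnable approximation, the same boundary indexes can be derived from the `Intl.Segmenter` API available in current engines, which yields segments together with their start index rather than boundaries. `boundaryIndexes` is a hypothetical helper, and the exact segmentation depends on the engine's locale data, so no output is shown here.

```javascript
// Sketch only: derives boundary-style results from Intl.Segmenter,
// whose iteration results describe segments ({ segment, index, ... }).
function* boundaryIndexes(input, locale, granularity) {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  for (const { segment, index } of segmenter.segment(input)) {
    // `index` is where the segment starts; the boundary follows the segment.
    yield { index: index + segment.length };
  }
}

const input = "Allons-y!";
let lastIndex = 0;
for (const { index } of boundaryIndexes(input, "fr", "word")) {
  const segment = input.slice(lastIndex, index);
  console.log(`boundary before index ${index}, after segment «${segment}»`);
  lastIndex = index;
}
```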

Random access (optional)

Add following and preceding methods to boundary iterators that accept an optional code unit index argument and return iterator result values after updating iterator state. The methods could alternatively return a boolean as currently specified ("true if the {beginning,end} of the string was reached"), but I personally much prefer this richer signature.

// ┃0 1 2 3 4 5┃6┃7┃8┃
// ┃A l l o n s┃-┃y┃!┃
boundaries.following(0) // → { index: 6 }
boundaries.following(5) // → { index: 6 }
boundaries.following(6) // → { index: 7 }
boundaries.following(8) // → { index: 9 }
boundaries.following(9) // → RangeError

// preceding operations include an implicit initial decrement to avoid infinite loops
// (optionally limited to no-argument invocations).
boundaries.preceding(9) // → { index: 8 }
boundaries.preceding(8) // → { index: 7 }
boundaries.preceding(6) // → { index: 0 }
boundaries.preceding(5) // → { index: 0 }
boundaries.preceding(1) // → { index: 0 }
boundaries.preceding(0) // → RangeError

Exposed iterator state (optional)

Add lastIndex and/or input properties to boundary iterators and/or input/granularity/etc. properties to iterator result values, in broad alignment with the RegExp interface.

 let boundaries = segmenter.segment(input);
 
 // Iterate over boundaries.
-let lastIndex = 0;
-for (let {index} of boundaries) {
+let lastIndex = boundaries.lastIndex;
+for (let {granularity, input, index} of boundaries) {
   let segment = input.slice(lastIndex, index);
-  console.log(`boundary before index ${index}, after segment «${segment}»`);
+  console.log(`${granularity} boundary before index ${index}, after segment «${segment}»`);
   lastIndex = index;
 }

Segment capture (optional)

Add a precedingSegment to iteration result values. If the Exposed iterator state extension is accepted, then the property could also be added to boundary iterators. If the Random access extension is accepted, then preceding calls should clear/exclude the property and replace it with followingSegment.

-for (let {index} of boundaries) {
-  let segment = input.slice(lastIndex, index);
-  console.log(`boundary before index ${index}, after segment «${segment}»`);
-  lastIndex = index;
+for (let {index, precedingSegment} of boundaries) {
+  console.log(`boundary before index ${index}, after segment «${precedingSegment}»`);
 }

gibson042 commented on September 24, 2024

Random access, returning iterator result values

Add following and preceding methods to boundary iterators that accept an optional code unit index argument and return iterator result values after updating iterator state. The methods could alternatively return a boolean as currently specified ("true if the {beginning,end} of the string was reached"), but I personally much prefer this richer signature.

// ┃0 1 2 3 4 5┃6┃7┃8┃
// ┃A l l o n s┃-┃y┃!┃
boundaries.following(0) // → { index: 6 }
boundaries.following(5) // → { index: 6 }
boundaries.following(6) // → { index: 7 }
boundaries.following(8) // → { index: 9 }
boundaries.following(9) // → RangeError

// preceding operations include an implicit initial decrement to avoid infinite loops.
boundaries.preceding(9) // → { index: 8 }
boundaries.preceding(8) // → { index: 7 }
boundaries.preceding(6) // → { index: 0 }
boundaries.preceding(5) // → { index: 0 }
boundaries.preceding(1) // → { index: 0 }
boundaries.preceding(0) // → RangeError

Please react with +1 to express approval or -1 to express disapproval.

littledan commented on September 24, 2024

@gibson042 Thanks for digging into this more, but I'm not sure why we need to make all of these changes. They seem like they are addressing orthogonal issues from the break vs segment question. I wrote up #67 to be a minimal change to switch to break iteration. It's still a "Segmenter" which provides the break iterator factory, and no capabilities are removed from the specification. I don't think there are any off-by-one errors, but your review would be appreciated to double-check that.

The motivation for random access was discussed in other issues, e.g., #9, cc @devongonett. I'd say the removal of line breaking, rather than the switch from segments to breaks, could be a reason to remove random access, but I also don't see any downsides to enabling random access.

gibson042 commented on September 24, 2024

With the exception of renaming Intl.Segmenter to Intl.Segmentation and considering input, they're not orthogonal, at least not completely. Random access arguments have a different interpretation if we're looking for boundaries rather than segments, as does exposed iterator state (specifically index), and segment capture would be the means of exposing beneficial aspects of segment iteration that motivated it in the first place.

As for random access specifically, downsides are an ability to alter the internal state of an iterator other than by the iterator protocol next, which could disrupt active iteration in a way that isn't possible with other built-in iterators. I don't know if it's enough to justify removal, but it's certainly a consideration.
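
The concern can be illustrated with a toy iterator whose `next` and `following` share one position. `makeBoundaryIterator` is a hypothetical model with hard-coded boundaries, not the specified behavior.

```javascript
// Hypothetical model: next() and following() both move the same position.
function makeBoundaryIterator(input, boundaryIndexes) {
  let position = 0;
  const advance = (from) => {
    const index = boundaryIndexes.find((b) => b > from);
    if (index === undefined) return { value: undefined, done: true };
    position = index;
    return { value: { index }, done: false };
  };
  return {
    [Symbol.iterator]() {
      return this;
    },
    next() {
      return advance(position);
    },
    following(from = position) {
      if (from < 0 || from >= input.length) {
        throw new RangeError(`following(${from}) is out of range`);
      }
      return advance(from).value;
    },
  };
}

const boundaries = makeBoundaryIterator("Allons-y!", [6, 7, 8, 9]);
const seen = [];
let rewound = false;
for (const { index } of boundaries) {
  seen.push(index);
  if (index === 8 && !rewound) {
    rewound = true;
    // Some other code "peeks" at the first boundary and moves the shared state.
    boundaries.following(0);
  }
}
console.log(seen); // [ 6, 7, 8, 7, 8, 9 ]
```

The `for...of` loop revisits boundaries 7 and 8 because the random access call rewound the position mid-iteration. Without the `rewound` guard, this loop would never terminate.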

FrankYFTang commented on September 24, 2024

> granularity

Not sure what granularity is here. Could you put down the example output to show us? From my understanding, in the current draft, the term granularity refers to grapheme, line and sentence, and stays constant within the same segmenter.

gibson042 commented on September 24, 2024

@FrankYFTang your understanding is correct. It would be used by a generic handler that did not itself create the iterator and therefore has no other means of determining its granularity.
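
A sketch of that use case, assuming result values carry `granularity` and `input` as in the "Richer iterator result values" extension. `richBoundaries` and `describe` are hypothetical names, and the boundary list is hard-coded for illustration.

```javascript
// Hypothetical producer: every result value repeats the (constant) granularity and input.
function* richBoundaries(input, granularity, boundaryIndexes) {
  for (const index of boundaryIndexes) {
    yield { granularity, input, index };
  }
}

// A generic handler that receives only the iterator, not the segmenter or the string.
function describe(boundaries) {
  const lines = [];
  let lastIndex = 0;
  for (const { granularity, input, index } of boundaries) {
    lines.push(`${granularity} segment «${input.slice(lastIndex, index)}»`);
    lastIndex = index;
  }
  return lines;
}

console.log(describe(richBoundaries("Allons-y!", "word", [6, 7, 8, 9])));
// [ 'word segment «Allons»', 'word segment «-»', 'word segment «y»', 'word segment «!»' ]
```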

littledan commented on September 24, 2024

I have to say, I'm still lost on the rationale for the name "Segmentation" vs "Segmenter". I thought the "doer" part of speech would make sense since we're talking about a factory of iterators, and so a Segmenter would be something that vends those iterators. This is not something I feel really strongly about, though.


In the Intl call, we concluded to rename breakType to preceedingSegmentType. We also concluded that, whether iterating forward or backward, the index n refers to a boundary before the nth code point, so, for example, preceding 5 on "hello world" is 0.

Are there any other changes that we should make? I believe we reaffirmed the motivation for backwards iteration and random access on that call. @gibson042 pledged to make these changes during the call.
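
The index convention can be checked with a few lines, assuming the word boundaries of "hello world" fall at indexes 0, 5, 6 and 11. That list is an assumption for illustration; a real segmenter would compute it, and the helper functions are hypothetical.

```javascript
// Assumed word boundaries of "hello world": |hello| |world|
const wordBoundaries = [0, 5, 6, 11];

// Last boundary strictly before n, and first boundary strictly after n.
const preceding = (n) => Math.max(...wordBoundaries.filter((b) => b < n));
const following = (n) => Math.min(...wordBoundaries.filter((b) => b > n));

console.log(preceding(5)); // 0
console.log(following(5)); // 6
console.log(preceding(6)); // 5
```

Because the comparison is strict in both directions, starting exactly on a boundary (index 5 here) always moves away from it, which keeps repeated calls from getting stuck.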

littledan commented on September 24, 2024

OK, I've uploaded #70, #71 and #72 to address the agreed-upon points above. Reviews would be welcome! cc @Ms2ger @gibson042

littledan commented on September 24, 2024

Closing due to #59 (comment)
