Comments (12)
yeah sgtm
from proposal-intl-locale.
I'll wait for the stakeholders I CC'ed here, and file an issue against ecma402 based on this. Thank you!
To me this sounds like we can basically specify a parsing + emitting algorithm in the spec, with the ability to override specific options.
What's the benefit of it? We have the syntax specified via EBNF in Unicode LDML - https://unicode.org/reports/tr35/tr35.html#Unicode_language_identifier
What's the benefit of copying that into our spec instead of referencing? How would it make it any cleaner for you?
However, the spec is very vague about the parsing, and it only spells out the Unicode extension, not the others.
How does it spell out unicode extensions differently from other subtags?
Besides, in order to know if, e.g., a language matches the unicode_language_subtag production of a unicode_locale_id, we have to parse it anyway.
That's true, but I don't understand how it is an issue. Everywhere the spec specifies that a value has to match something, you need to parse it to know whether it matches, no?
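To make the "match the production" step concrete, here is a minimal sketch of checking a string against the unicode_language_subtag production from UTS 35 (two to three letters, or five to eight letters). The function name and regex are illustrative, not from the spec text:

```typescript
// Illustrative check: does a string match the UTS 35
// unicode_language_subtag production (alpha{2,3} | alpha{5,8})?
function isUnicodeLanguageSubtag(subtag: string): boolean {
  return /^([a-zA-Z]{2,3}|[a-zA-Z]{5,8})$/.test(subtag);
}
```

Even this simplest kind of "does it match" question already implies running (part of) a parser over the input.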
during reconstruction implementers seem to be expected to re-parse specific pieces.
What do you mean by this? Reconstruction of an Intl.Locale requires parsing of the input string, yes.
I don't mean copying the EBNF into the spec. What I mean is: during the constructor, we have to parse the locale, which results in some form of data structure, e.g.:
```typescript
interface PuExtension {
  type: 'x';
  value: string;
}

interface Keyword {
  key: string;
  value: string;
}

interface TransformedExtension {
  type: 't';
  fields: string[];
  lang?: UnicodeLanguageId;
}

interface UnicodeExtension {
  type: 'u';
  keywords: Keyword[];
  attributes?: string[];
}

interface UnicodeLanguageId {
  lang: string;
  script?: string;
  region?: string;
  variants?: string[];
}

interface UnicodeLocaleId {
  lang: UnicodeLanguageId;
  unicodeExtension: UnicodeExtension;
  transformedExtension: TransformedExtension;
  puExtension: PuExtension;
  otherExtensions: Record<string, string>;
}
```
The data structure also inherently encodes structural-integrity checks, like no duplicate singletons and such.
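For example, the "no duplicate singletons" rule mentioned above could be checked with something like the following sketch (the function name is illustrative; a real parser would enforce this while building the structure rather than rescanning the tag):

```typescript
// Illustrative structural-integrity check: a unicode_locale_id must not
// repeat an extension singleton (a single-character subtag like u, t, x).
function hasDuplicateSingleton(tag: string): boolean {
  const singletons = tag
    .toLowerCase()
    .split("-")
    .filter((s) => s.length === 1);
  return new Set(singletons).size !== singletons.length;
}
```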
At this point I know it's structurally valid, along with all the components in the locale. However, in the spec, most modification algorithms involve replacing a well-formed substring with another substring, which in my head means:
- Re-parse the old substring
- Validate the new substring
- Replace certain pieces in the old substring w/ the new substring
- Serialize the result
But if I already parsed the original input into a data structure, why do I have to re-parse to conform to the substring language of the spec? IMO it might be easier to define the internal slots with a data structure like tc39/proposal-unified-intl-numberformat#26 (comment), have any option that overrides replace that slot, and at the end specify a serialization algorithm (or reference one). So the flow in my head is something like:
Constructor -> parse the locale into Internal Slots -> applyOptions -> serialize
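The flow above can be sketched in miniature. This handles only the unicode_language_id part (extensions omitted) and all names are illustrative, not from the ECMA-402 spec:

```typescript
// Minimal sketch of: parse -> apply options over slots -> serialize.
// Only language/script/region/variants are handled; extensions are ignored.
interface UnicodeLanguageId {
  lang: string;
  script?: string;
  region?: string;
  variants: string[];
}

function parseLanguageId(tag: string): UnicodeLanguageId {
  const subtags = tag.split("-");
  const id: UnicodeLanguageId = { lang: subtags.shift() ?? "", variants: [] };
  // unicode_script_subtag: exactly four letters
  if (/^[a-zA-Z]{4}$/.test(subtags[0] ?? "")) id.script = subtags.shift();
  // unicode_region_subtag: two letters or three digits
  if (/^([a-zA-Z]{2}|\d{3})$/.test(subtags[0] ?? "")) id.region = subtags.shift();
  id.variants = subtags; // remaining subtags treated as variants here
  return id;
}

function applyOptions(
  id: UnicodeLanguageId,
  options: { language?: string; region?: string }
): void {
  // Each option simply overwrites one field of the data structure.
  if (options.language !== undefined) id.lang = options.language;
  if (options.region !== undefined) id.region = options.region;
}

function serialize(id: UnicodeLanguageId): string {
  return [id.lang, id.script, id.region, ...id.variants]
    .filter((s): s is string => s !== undefined)
    .join("-");
}
```

With this shape, no step after the initial parse ever touches the tag as a string until final serialization.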
Does that make sense?
I think I understand now what area your concern is around!
Re-parse the old substring
I don't understand what makes you see the replacements as requiring re-parsing of the substring.
Implementations are free to store an intermediate representation of the data for use in the algorithms. All implementers do this across their code, and usually not in internal slots as defined by the spec.
In my mind, the internal slots are mostly useful for defining the input data for the algorithms in an implementation-independent model (usually supplied by CLDR, in our case).
@anba, @littledan , @sffc, @jswalden - thoughts?
To be transparent - we're aiming to request advancement of this proposal to Stage 4 during the ongoing TC39 meeting.
I'd also appreciate the position of all stakeholders (esp. @longlho) on whether this issue should cause us to drop this advancement request from the agenda.
I don't wanna hold Stage 4 back, and since most ECMA-402 implementations use ICU, I think the end result of the API will be correct :) From a non-ICU implementer's perspective, though, this is fairly non-straightforward.
I understand that the intermediate representation is up to implementers, but based on the current language of the spec, the intermediate representations being passed around in abstract operations are all String-based (per the language of "replacing a substring with another substring"), so it's becoming an implicit requirement for implementers.
Take the language getter, for example. loc.[[Locale]] is a string, since we return "the substring of locale corresponding to the unicode_language_subtag production of the unicode_language_id". But given that we already parsed the input and applied options to it, this seems to implicitly mean: parse, apply options, store the result as a string, and then, when the getter gets triggered, re-parse/re-validate that string and return the correct unicode_language_subtag substring.
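In other words, the substring-based spec language seems to imply a getter shaped like the sketch below, which re-derives the language subtag from the stored string on every call. The helper name is illustrative, and the "first subtag" shortcut glosses over edge cases such as "und" and grandfathered-style tags:

```typescript
// What a string-based [[Locale]] slot seems to imply for the `language`
// getter: re-extract the unicode_language_subtag from the stored string.
function getLanguage(localeSlot: string): string {
  // Re-parse: the first subtag of a unicode_locale_id is the
  // unicode_language_subtag (ignoring edge cases for this sketch).
  return localeSlot.split("-")[0] ?? "";
}
```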
I think from a non-ICU implementer perspective this is fairly non-straightforward.
I did write a very early polyfill back in 2016, as well as a Rust implementation, and this has not been an area of concern for me when reading the spec. I wonder if that's because of my implicit assumptions and experience with ECMA-402?
But given we already parsed it and apply options to it already, it seems to implicitly mean that we parse, apply options, store it as string, then when the getter gets triggered, reparse/revalidate that, and then return the correct unicode_language_subtag substring.
I also assume that, but I'm not sure what the value is of including an exact structure for the intermediate data stored by implementations.
I think the value of it is turning the language getter into just return loc.[[Language]] (given that [[Language]] is an internal slot). Having an intermediate data structure, as I mentioned, can also correctly reflect structural integrity: no duplicate singletons, no duplicate variant subtags, and the like.
I did take a look at your early polyfill :) I'd say that if we just had constructor + toString, the intermediate representation would stay an impl detail, because all you need is parse -> <some data structure> -> serialize. With the getters it becomes parse -> <some data structure> -> get a field from the data structure, but the data structure is actually not specified.
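The alternative being argued for here can be sketched as follows: with a structured slot record, the getter is a plain field read and no string re-parsing is involved. The slot and function names are illustrative, not the spec's:

```typescript
// Sketch of structured internal slots: the `language` getter becomes a
// field read instead of a substring extraction from a stored tag string.
interface LocaleSlots {
  language: string;
  script?: string;
  region?: string;
}

function languageGetter(slots: LocaleSlots): string {
  return slots.language; // just read the slot; no re-parse needed
}
```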
Side question: does this spec effectively get rid of grandfathered locales?
Thanks for the Rust impl; I took a look as well. I think it's not fully following the ECMA spec (https://github.com/zbraniecki/unic-locale/blob/master/unic-locale-impl/src/lib.rs#L255), and it seems like there is an internal impl data structure.
I think, all in all, the spec can be implemented, but it'd be a lot clearer/easier with more structure to it.
Side question: does this spec effectively get rid of grandfathered locales?
I believe our switch to Unicode BCP47 Locale Identifiers did.
I think all in all, the spec can be implemented but it'd be a lot clearer/easier having more structure to it.
My position is that it's a tradeoff between a "clearer" spec and overspecification that attempts to describe what internal logic should do. At best, implementers will diverge without any observable impact; at worst, we'll have some observable impact from such internal fields being defined.
I'm open to making such a change as a separate PR against the ECMA-402 spec after merging this into the spec, if other stakeholders agree with you.
Is that an acceptable way forward for you?
I agree with @zbraniecki that this specification was designed to permit more straightforward internal representations that don't imply reparsing. I'm open to editorial PRs to make this change. I think these PRs should land in the ecma402 repo, not here, given that Intl.Locale has already been merged into the main spec.