unicode-org / rust-discuss

OmnICU-SC: For discussion of i18n in Rust.

License: Other

rust-discuss's Issues

Mailing List

It's good to have a Google Groups mailing list so we can share docs and calendar events and things. Typically for these things I make an alias in the chromium.org domain. This doesn't mean we're affiliated with Chromium; it's just the most convenient GSuite account to use since a lot of us already have accounts there, and it's the same GSuite that owns the mailing lists and docs for MFWG and TC39-TG2.

Any concerns with making [email protected]?

Future of intl_pluralrules

https://github.com/zbraniecki/pluralrules is currently maintained by me, but I'd prefer to grow the maintainers group.

It's in good shape, with helper crates to parse the plural rules into Rust functions and a script to generate the Rust tables from CLDR JSON data, which hopefully makes it easy to maintain and to keep updated to the latest CLDR.
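
For readers unfamiliar with the idea, here is a minimal hand-written sketch of what "a plural rule as a Rust function" can look like. It follows the CLDR cardinal rule for Polish restricted to integer inputs; the names `PluralCategory` and `plural_rule_pl` are made up for this illustration and are not the crate's generated output or API.

```rust
/// CLDR plural categories relevant to Polish cardinals.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum PluralCategory {
    One,
    Few,
    Many,
    Other,
}

/// Illustrative cardinal rule for Polish (`pl`), integer operands only.
/// Hypothetical helper, not the code generated by the pluralrules tooling.
pub fn plural_rule_pl(i: u64) -> PluralCategory {
    let i10 = i % 10;
    let i100 = i % 100;
    if i == 1 {
        // one: i = 1
        PluralCategory::One
    } else if (2..=4).contains(&i10) && !(12..=14).contains(&i100) {
        // few: i % 10 = 2..4 and i % 100 != 12..14
        PluralCategory::Few
    } else if i10 <= 1 || (5..=9).contains(&i10) || (12..=14).contains(&i100) {
        // many: i != 1 and i % 10 = 0..1, or i % 10 = 5..9, or i % 100 = 12..14
        PluralCategory::Many
    } else {
        // CLDR reserves `other` for fractional values in Polish;
        // integer inputs never reach this branch.
        PluralCategory::Other
    }
}
```

For example, `plural_rule_pl(22)` yields `Few` while `plural_rule_pl(12)` yields `Many`.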

I'd like to donate the crate to an org which would make it possible for others to update the crate as needed.

unic_datetime

I started experimenting with a DateTime formatter. The codebase is very messy, and I apologize for that, but I believe I can share the initial performance results.

For the minimum POC I focused on a single locale (pl) and 10 different combinations of dateStyle/timeStyle. I omitted two which require timezone names.

Since DateTime patterns take much more space than pluralrules or locale RTL/likelySubtags data which I worked with before, I experimented with three different models of loading data, as per #5:

Patterns are parsed and inlined into .rs code
Patterns are fetched and parsed from JSON CLDR
Patterns are fetched from an already parsed binary resource file

For the binary scenario I used the bincode crate: I fetched the JSON CLDR data, parsed the patterns, and serialized the resulting structure to a res file, which I then loaded into memory at runtime.

I got the following results:

ICU4C_65 - 849 us
UNIC JSON CLDR - 150 us
UNIC bin resource - 70 us
UNIC inlined - 24 us

I'd appreciate it if someone could try to replicate my measurements and verify the results.
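
For anyone attempting to replicate the numbers, a rough timing helper using only the standard library might look like the sketch below. This is not the harness used for the measurements above (that code is not shown here); `mean_time` and `as_micros_f64` are hypothetical names, and a dedicated benchmarking crate would give more trustworthy statistics.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once as a warm-up, then `iterations` more times,
/// and returns the mean wall-clock time per timed call.
pub fn mean_time<F: FnMut()>(iterations: u32, mut f: F) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    // Warm-up call so one-time costs (lazy init, cold caches) are not timed.
    f();
    let start = Instant::now();
    for _ in 0..iterations {
        f();
    }
    start.elapsed() / iterations
}

/// Converts a duration to fractional microseconds for reporting.
pub fn as_micros_f64(d: Duration) -> f64 {
    d.as_secs_f64() * 1e6
}
```

The closure passed to `mean_time` would wrap one full load-and-format cycle for the data-loading model being measured.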

If the results hold, I believe this is one more piece of evidence that investment in Rust-based crates may lead to significant performance gains.

I haven't evaluated memory use, but the ca-gregorian.json file for pl is 18201 bytes and the pl.res generated from it is 2187 bytes. I'd expect the latter to take about the same amount of space in memory.

Test corpus

In order to maintain compatibility between ICU and Rust Unicode crates, we'd need to develop test data which is platform independent and can be shared between at least Java, C++ and Rust.

Future of open-i18n

Let's kick off the discussion on what we want to do with https://github.com/open-i18n

It covers some of ICU's functionality, in particular:

unic-char: Unicode Character Tools.
unic-ucd: Unicode Character Database (UAX#44).
unic-bidi: Unicode Bidirectional Algorithm (UAX#9).
unic-normal: Unicode Normalization Forms (UAX#15).
unic-segment: Unicode Text Segmentation Algorithms (UAX#29).
unic-idna: Unicode IDNA Compatibility Processing (UTS#46).
unic-emoji: Unicode Emoji (UTS#51).

I'm not sure how up to date those are relative to ICU 65, or how well maintained they are. My impression was that Behnam is looking to grow the maintainers group and/or hand them over. I'm also not sure whether they should be merged with crates such as unic-langid, intl_pluralrules, or some future date/time formatter, or whether it's OK to keep the Unicode character crates separate from the higher-level intl formatting and locale management.

Thoughts?

Data sources

The three crates I maintain (intl_pluralrules, unic-langid and unic-locale) currently all store their data in source code tables generated as a separate step.

This provides very good performance, but it is inflexible and may lead to data duplication.

It would be good for all crates around Unicode in Rust to have cohesive data loading models. I suggest two:

baked in data as an optional feature
data loaded from optimized resource files

We'd need to design such a format, write tooling for building the resource files, and define an API for loading them into memory.
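
To make the two models concrete, here is a minimal sketch of one shared interface with a baked-in implementation and a resource-blob implementation. Everything in it is an assumption for illustration: the `DataProvider` trait, the key names, the pattern values (not copied from CLDR), and the toy length-prefixed format. In a real crate the baked table would presumably sit behind a Cargo feature, and the blob format would need far more care (versioning, alignment, zero-copy access).

```rust
/// Hypothetical common interface over both data-loading models.
pub trait DataProvider {
    fn get(&self, key: &str) -> Option<&str>;
}

/// Illustrative data; values are placeholders, not real CLDR patterns.
pub static BAKED: &[(&str, &str)] = &[
    ("pl/date/short", "dd.MM.y"),
    ("pl/time/short", "HH:mm"),
];

/// Model 1: data compiled into the binary as a static table.
pub struct BakedProvider;

impl DataProvider for BakedProvider {
    fn get(&self, key: &str) -> Option<&str> {
        BAKED.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
    }
}

/// Toy resource format: repeated [key_len: u8][key][value_len: u8][value].
/// Sketch only: assumes every field is shorter than 256 bytes.
pub fn encode(pairs: &[(&str, &str)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (k, v) in pairs {
        for field in [k, v] {
            out.push(field.len() as u8);
            out.extend_from_slice(field.as_bytes());
        }
    }
    out
}

/// Model 2: data loaded at runtime from a pre-built resource blob.
pub struct BlobProvider {
    bytes: Vec<u8>,
}

impl BlobProvider {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { bytes }
    }
}

/// Reads one length-prefixed UTF-8 field at `pos`; returns it and the next offset.
fn read_field(bytes: &[u8], pos: usize) -> Option<(&str, usize)> {
    let len = *bytes.get(pos)? as usize;
    let start = pos + 1;
    let field = bytes.get(start..start + len)?;
    Some((std::str::from_utf8(field).ok()?, start + len))
}

impl DataProvider for BlobProvider {
    fn get(&self, key: &str) -> Option<&str> {
        let mut pos = 0;
        // Linear scan over records; returns None on a miss or a malformed blob.
        while pos < self.bytes.len() {
            let (k, next) = read_field(&self.bytes, pos)?;
            let (v, next) = read_field(&self.bytes, next)?;
            if k == key {
                return Some(v);
            }
            pos = next;
        }
        None
    }
}
```

Code that formats dates would depend only on `DataProvider`, so a consumer could pick baked data for speed or a resource file for flexibility without touching the formatting logic.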

License

It may be quite early to start the conversation, but I'd rather have it early than have to chase people around later.

Most of the intl-related crates in Rust use the MIT/Apache dual-license model.

rust-unic - https://github.com/open-i18n/rust-unic
intl_pluralrules - https://github.com/zbraniecki/pluralrules/tree/master/intl_pluralrules
unicode-rs - https://github.com/unicode-rs
unic-locale - https://github.com/zbraniecki/unic-locale

Now, @jfkthame released a hyphenation crate which uses MPL-2.0.

I filed jfkthame/mapped_hyph#6 to ask to add MIT/Apache, but I'm also wondering if that combo is reasonable for all stakeholders to move forward with.

Future of unic-locale

https://github.com/zbraniecki/unic-locale contains two crates - unic-langid and unic-locale. The former separates out Unicode LanguageIdentifier parsing/modifying/serializing and the latter adds Unicode Extensions.

The former is quite feature complete on the well-formed level, with optional features such as likely subtags and layout information (character orientation - LTR/RTL) and scripts to easily update the data to latest CLDR as needed.

I'd like to donate the crates to an org that will distribute the maintenance and reduce the bus factor.

UTS 39

Just opening an issue to let y'all know we're working on UTS #39 Unicode Security Mechanisms in Rust here, to be used in the Rust compiler.

(Perhaps we should maintain a list of all Rust projects maintained by members of this group in the readme? Once we start consolidating things it would probably be useful)

ICU functionality support in Rust

This is similar to issue #7 in that it's calling for support for a well understood library in Rust.

Also mentioned in https://www.arewewebyet.org/topics/i18n/

Full disclosure, my project (Fuchsia OS) needs ICU today, so we started by wrapping ICU4C, learning from other work in this space. See references below for details.

It's hard to do justice to all the references to prior art, but I'll try, and will try to backfill with more information where possible.

Bindings

Related issues

Data updates cohesion and policies

With a number of crates that use data from Unicode and CLDR tables, it would be beneficial to design a policy and scripting around cohesive updates of those, to avoid the scenario where one crate uses Unicode 12 while another is stuck on Unicode 11, etc.

Some open questions:

Should a CLDR data update be considered a minor or major update?
Is there a value in some meta-crate which would collect the subcrates around a particular version of Unicode/CLDR (so, rust-icu 65 would depend on all crates in versions using Unicode 12 and CLDR 36)?
Can we design some basic tooling to make updating the data of all Unicode-related crates easier: point at a CLDR dir, update the code for all crates, release?
... ?

Unicode Charter

Hi all,

I wrote up an OmnICU Charter for sanctioning under the Unicode Consortium. Please take a look:

http://bit.ly/omnicu-charter

Please leave comments. Now is the best time to amend this charter so we can all come into agreement on the goals of an industry project under Unicode.

Regarding this repo: I could either (1) move this repo to unicode-org/omnicu, (2) leave this repo alone indefinitely and make a new repo unicode-org/omnicu, or (3) make a new unicode-org/omnicu and also move this repo into something like unicode-org/rust-discuss. Thoughts?

ECMA-402 support in Rust

I think there is value in supporting well thought-out I18N APIs in Rust as well as in other languages. No single person can be an all-around expert on all matters I18N, just by the nature of the subject, and we would be well served if we all work together to minimize the amount of wheel reinvention. ECMA-402 in particular has had a tremendous amount of work invested in it, and seems like a well understood basis for this work.

A few topics that seem of interest in the Rust world as of today:

ECMA 402 compatible implementation for Rust
ECMA 402 compatible API definition, to enable multiple implementations of ECMA 402 to exist if needed: not all requirements are equally valid in all contexts.
Conformance tests, to ensure implementations are mutually compatible
A way for Rust to contribute language-neutral proposals to ECMA 402 in case we would like to propose additional functionality.
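
As a rough illustration of how an API definition, multiple implementations, and shared conformance vectors could fit together, consider the sketch below. The `PluralRules` trait only mirrors the shape of ECMA-402's `Intl.PluralRules.prototype.select`; the trait, both implementations, and the test vectors are hypothetical and exist only for this example.

```rust
/// Hypothetical API definition shaped like `Intl.PluralRules.prototype.select`.
pub trait PluralRules {
    fn select(&self, n: u64) -> &'static str;
}

/// One implementation: English cardinal rules for integers.
pub struct EnglishRules;

impl PluralRules for EnglishRules {
    fn select(&self, n: u64) -> &'static str {
        if n == 1 { "one" } else { "other" }
    }
}

/// A deliberately non-conformant implementation, to show a failing case.
pub struct AlwaysOther;

impl PluralRules for AlwaysOther {
    fn select(&self, _n: u64) -> &'static str {
        "other"
    }
}

/// Shared, implementation-independent test vectors for English: (input, expected).
pub const EN_VECTORS: &[(u64, &str)] = &[(0, "other"), (1, "one"), (2, "other"), (21, "other")];

/// Returns the inputs for which `imp` disagrees with the vectors.
pub fn failures(imp: &dyn PluralRules, vectors: &[(u64, &str)]) -> Vec<u64> {
    vectors
        .iter()
        .filter(|(n, want)| imp.select(*n) != *want)
        .map(|(n, _)| *n)
        .collect()
}
```

Because the vectors are plain data, the same list could in principle be exported to a language-neutral file and reused by implementations in other languages.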

Traitify `unic_langid::LanguageIdentifier`

Hi folks. Now that both unic-locale and rust_icu have language ID implementations, may I propose an exercise? Let's see what it would take to factor out one or a few common traits from the two implementations, so that we can have lazy and/or eager parsing of the locales and basic manipulation.
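
To seed the discussion, here is a minimal sketch of what such a trait could look like, with one eager and one lazy implementation. None of this is the actual API of unic-langid or rust_icu: the `LanguageId` trait, both structs, and the heavily simplified `language[-script][-region]` parsing are assumptions made for illustration.

```rust
/// Hypothetical read-only surface shared by both implementations.
pub trait LanguageId {
    fn language(&self) -> String;
    fn region(&self) -> Option<String>;
}

/// A region subtag is two ASCII letters or three ASCII digits.
fn is_region(s: &str) -> bool {
    (s.len() == 2 && s.chars().all(|c| c.is_ascii_alphabetic()))
        || (s.len() == 3 && s.chars().all(|c| c.is_ascii_digit()))
}

/// Simplified parser: language, optional 4-character script, optional region.
/// Variants, extensions, and validation are intentionally ignored.
fn split_tag(tag: &str) -> (String, Option<String>) {
    let mut parts = tag.split(|c: char| c == '-' || c == '_');
    let language = parts.next().unwrap_or("").to_ascii_lowercase();
    let mut next = parts.next();
    if next.map_or(false, |p| p.len() == 4) {
        next = parts.next(); // skip a script subtag such as "Cyrl"
    }
    let region = next.filter(|p| is_region(p)).map(|p| p.to_ascii_uppercase());
    (language, region)
}

/// Eager: parses once at construction and stores the fields.
pub struct EagerLangId {
    language: String,
    region: Option<String>,
}

impl EagerLangId {
    pub fn new(tag: &str) -> Self {
        let (language, region) = split_tag(tag);
        Self { language, region }
    }
}

impl LanguageId for EagerLangId {
    fn language(&self) -> String {
        self.language.clone()
    }
    fn region(&self) -> Option<String> {
        self.region.clone()
    }
}

/// Lazy: keeps the raw tag and parses only when a field is requested.
pub struct LazyLangId {
    raw: String,
}

impl LazyLangId {
    pub fn new(tag: &str) -> Self {
        Self { raw: tag.to_string() }
    }
}

impl LanguageId for LazyLangId {
    fn language(&self) -> String {
        split_tag(&self.raw).0
    }
    fn region(&self) -> Option<String> {
        split_tag(&self.raw).1
    }
}

/// Code written against the trait works with either implementation.
pub fn same_region(a: &dyn LanguageId, b: &dyn LanguageId) -> bool {
    a.region() == b.region()
}
```

Returning owned `String`s keeps the trait simple; a real design would likely want borrowed or fixed-size subtag types to avoid allocation.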

WDYT? @zbraniecki , others?

OmnICU API surface

I made a doc discussing the API surface for OmnICU in several host languages. I have a section about Rust.

https://docs.google.com/document/d/1tXACn0p2EuzSCJ0Gd8Nd1RV-DgdYG54N6lDwqYeQE4Q/edit#

One big open question is about named arguments delivery. In ECMAScript, we just use object literals and it's easy. In Rust, I suggested two options:

Struct that implements the Default trait
Builder pattern
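
For comparison, here is a small sketch of both options applied to a made-up set of date/time options. The type and field names (`DateTimeOptions`, `Style`, `hour12`, and so on) are hypothetical and are not taken from the linked doc.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Style {
    Short,
    Medium,
    Long,
}

/// Option 1: a plain options struct that implements `Default`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DateTimeOptions {
    pub date_style: Option<Style>,
    pub time_style: Option<Style>,
    pub hour12: bool,
}

impl Default for DateTimeOptions {
    fn default() -> Self {
        Self {
            date_style: Some(Style::Medium),
            time_style: None,
            hour12: false,
        }
    }
}

/// Option 2: a builder that starts from the same defaults.
#[derive(Debug, Default)]
pub struct DateTimeOptionsBuilder {
    options: DateTimeOptions,
}

impl DateTimeOptionsBuilder {
    pub fn new() -> Self {
        Self::default()
    }
    pub fn date_style(mut self, style: Style) -> Self {
        self.options.date_style = Some(style);
        self
    }
    pub fn time_style(mut self, style: Style) -> Self {
        self.options.time_style = Some(style);
        self
    }
    pub fn hour12(mut self, enabled: bool) -> Self {
        self.options.hour12 = enabled;
        self
    }
    pub fn build(self) -> DateTimeOptions {
        self.options
    }
}

/// Call site with option 1: name the fields you care about, default the rest.
pub fn via_default() -> DateTimeOptions {
    DateTimeOptions {
        time_style: Some(Style::Short),
        ..Default::default()
    }
}

/// Call site with option 2: chain setters, then build.
pub fn via_builder() -> DateTimeOptions {
    DateTimeOptionsBuilder::new().time_style(Style::Short).build()
}
```

The struct form is closest to an ECMAScript object literal, while the builder leaves room for validation in `build` and for adding options later without exposing public fields.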

Thoughts?

An introduction

Hi 👋,

Manish suggested I introduce myself to this effort via an issue, so here goes: I work at YesLogic on Prince. I've recently been replacing our C-based Unicode/UCD 9.0 related code with Rust and updating it to Unicode 12.1.

We use lookup tables that trade off some space for faster lookups than the typical binary search approach used in the Rust standard library and Unicode crates in the ecosystem. My approach has been to use ucd-generate to generate simple, space-efficient tables and then use a build script to generate the lookup tables from those.
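
The trade-off can be sketched as follows: a compact sorted range table searched with binary search, versus a two-level table built from it that answers a query with two array indexes. This is only an illustration of the general technique; the range data is invented, and the layout (256-entry deduplicated blocks) is an assumption, not the actual table format used in the crates mentioned here.

```rust
use std::cmp::Ordering;

/// Compact source table: (first code point, last code point, property value).
/// Illustrative values only, not real UCD data.
pub const RANGES: &[(u32, u32, u8)] = &[
    (0x0030, 0x0039, 1),
    (0x0041, 0x005A, 2),
    (0x0061, 0x007A, 3),
    (0x0600, 0x06FF, 4),
];

/// Space-efficient lookup: binary search over sorted ranges, 0 if not found.
pub fn lookup_binary(cp: u32) -> u8 {
    RANGES
        .binary_search_by(|&(start, end, _)| {
            if cp < start {
                Ordering::Greater
            } else if cp > end {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        })
        .map(|i| RANGES[i].2)
        .unwrap_or(0)
}

/// Faster lookup: block index plus deduplicated 256-entry blocks.
pub struct TwoLevelTable {
    index: Vec<u16>,        // block number -> position in `blocks`
    blocks: Vec<[u8; 256]>, // unique blocks of property values
}

impl TwoLevelTable {
    /// Builds the table for code points 0..=max_cp (rounded up to a full block).
    /// In practice this step would run in a build script.
    pub fn build(max_cp: u32) -> Self {
        let mut index = Vec::new();
        let mut blocks: Vec<[u8; 256]> = Vec::new();
        for block_no in 0..=(max_cp >> 8) {
            let mut block = [0u8; 256];
            for (offset, slot) in block.iter_mut().enumerate() {
                *slot = lookup_binary((block_no << 8) | offset as u32);
            }
            // Reuse an identical block if one was already stored.
            let id = match blocks.iter().position(|b| *b == block) {
                Some(id) => id,
                None => {
                    blocks.push(block);
                    blocks.len() - 1
                }
            };
            index.push(id as u16);
        }
        Self { index, blocks }
    }

    /// Two array indexes per query; code points past the table map to 0.
    pub fn lookup(&self, cp: u32) -> u8 {
        match self.index.get((cp >> 8) as usize) {
            Some(&id) => self.blocks[id as usize][(cp & 0xFF) as usize],
            None => 0,
        }
    }

    /// Number of unique blocks stored, a rough measure of the space cost.
    pub fn block_count(&self) -> usize {
        self.blocks.len()
    }
}
```

Deduplicating blocks keeps the space cost manageable, since large stretches of the code space share the same property value.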

Where possible I've extended existing crates, but I've also published several new crates. I'm waiting for some initial feedback on my first ucd-generate pull request before opening more, although BurntSushi seems busy at the moment. The thought is that when all the ucd-generate changes are merged I'll open PRs against the other crates that need that newer ucd-generate version.

So far I have:

Added support for bidi-mirrored, joining-type, arabic-shaping, and case mapping to ucd-generate
Updated unicode-script to Unicode 12.1 data with ucd-generate and implemented the lookup table approach
Updated unicode-bidi to Unicode 12.1 data with ucd-generate and implemented the lookup table approach
Published https://crates.io/crates/unicode-case-mapping
Published https://crates.io/crates/unicode-general-category
Published https://crates.io/crates/unicode-joining-type

Manish suggested this work and the new crates might be relevant, and from our point of view an "ICU in Rust" is also of interest.

Regards,
Wes

Hyphenation

@jfkthame wrote mapped_hyph which, as far as I know, is not overlapping with any other Rust Intl crate!