arty-name / locale-index-of
Find „cafe“ in „Fondation Café“
Sorry to trouble you, again.
Currently, when Intl.Segmenter isn't available, the library falls back to its own code to segment the string (lines 58 to 65 in cb7f400).
I just learned, however, that this won't handle surrogate pairs properly. For example,
[...segmenter.segment('𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡')]
Expected:
[
  { segment: '𝟘', index: 0 },
  { segment: '𝟙', index: 2 },
  { segment: '𝟚', index: 4 },
  { segment: '𝟛', index: 6 },
  { segment: '𝟜', index: 8 },
  { segment: '𝟝', index: 10 },
  { segment: '𝟞', index: 12 },
  { segment: '𝟟', index: 14 },
  { segment: '𝟠', index: 16 },
  { segment: '𝟡', index: 18 }
]
Actual:
[
  { segment: '\ud835', index: 0 },
  { segment: '\udfd8', index: 1 },
  { segment: '\ud835', index: 2 },
  { segment: '\udfd9', index: 3 },
  { segment: '\ud835', index: 4 },
  { segment: '\udfda', index: 5 },
  { segment: '\ud835', index: 6 },
  { segment: '\udfdb', index: 7 },
  { segment: '\ud835', index: 8 },
  { segment: '\udfdc', index: 9 },
  { segment: '\ud835', index: 10 },
  { segment: '\udfdd', index: 11 },
  { segment: '\ud835', index: 12 },
  { segment: '\udfde', index: 13 },
  { segment: '\ud835', index: 14 },
  { segment: '\udfdf', index: 15 },
  { segment: '\ud835', index: 16 },
  { segment: '\udfe0', index: 17 },
  { segment: '\ud835', index: 18 },
  { segment: '\udfe1', index: 19 }
]
The length and index of a string refer to code units, not code points. The string iterator, which steps through code points, is actually the preferred method in almost all cases, as one doesn't normally want to deal with code units:
const segmenter = {
  *segment(string) {
    let index = 0;
    for (const segment of string) {
      yield { segment, index };
      index += segment.length;
    }
  }
}
In the case of string searching, I think it's actually fine to split at code units. It should still be able to get a correct match.
But since it's a very easy fix, I don't think there's any reason not to handle it properly, unless one wants to support IE, which doesn't support String.prototype[@@iterator]().
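If IE support did matter, surrogate pairs could still be paired by hand. The following is only a sketch under that assumption: segmentByCodePoint is a hypothetical name, not something from the library, and it returns an array instead of a generator because IE has no generators either.

```javascript
// Sketch only: an ES5-style fallback that needs neither Intl.Segmenter
// nor the string iterator. `segmentByCodePoint` is a hypothetical helper.
function segmentByCodePoint(string) {
  var segments = [];
  var index = 0;
  while (index < string.length) {
    var first = string.charCodeAt(index);
    var length = 1;
    // A high surrogate followed by a low surrogate forms one code point.
    if (first >= 0xd800 && first <= 0xdbff && index + 1 < string.length) {
      var second = string.charCodeAt(index + 1);
      if (second >= 0xdc00 && second <= 0xdfff) {
        length = 2;
      }
    }
    // `index` stays in code units, matching what Intl.Segmenter reports.
    segments.push({ segment: string.slice(index, index + length), index: index });
    index += length;
  }
  return segments;
}
```

A lone surrogate is kept as its own one-unit segment, so the indices always add up to the string length.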
...
4.11.0 to 4.12.0. This version is covered by your current version range and after updating it in your project the build failed.
tape is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.
The new version differs by 9 commits.
42c84d6 v4.12.0
3e0a341 [Deps] update is-regex, string.prototype.trim
ba7e2b2 [Dev Deps] update eslint
f3a5925 [Tests] use shared travis-ci configs
6e94800 [Deps] update deep-equal, glob, object-inspect, resolve, string.prototype.trim
8150c3b [Refactor] use is-regex instead of instanceof RegExp
24487cb add tap-nyc to pretty-reporters
c283615 [New] when the error type is wrong, show the message and stack
44cbbf5 [Tests] add a test for the wrong kind of error
See the full diff
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
In some locales, e.g. sk-SK, there are letters that consist of more than one character: ch is one letter in Slovak, so calling:
const shouldNotFindAnything = localeIndexOf('ch', 'h', 'sk-SK');
should return -1. If we use e.g.
"ch".localeCompare("h", "sk-SK");
we get 1, which shows that collation in localeCompare works correctly (for en it compares "c" with "h" first and returns -1).
Another test case that you can use to fix the problem is:
const shouldBeSeven = localeIndexOf('chodit hore', 'ho', 'sk-SK');
The first h, at position 1, should be ignored as it is part of ch. Expected result: 7. Current result: 1.
The cause appears to be Intl.Segmenter: it does not treat ch as a single unit for this locale. A fix would require either filing an issue against the segmenter and getting it fixed there, or using some other function as a workaround.
const segmenterSk = new Intl.Segmenter('sk-SK', { granularity: 'grapheme' });
Array.from(segmenterSk.segment("chh"));
// [
//   { segment: 'c', index: 0, input: 'chh' },
//   { segment: 'h', index: 1, input: 'chh' },
//   { segment: 'h', index: 2, input: 'chh' }
// ]
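One possible workaround, sketched here under loud assumptions, is to merge known digraphs into a single segment before searching. The DIGRAPHS table and segmentWithDigraphs are hypothetical names invented for this illustration; neither Intl nor the library provides them, and the Slovak list (ch, dz, dž) is entered by hand.

```javascript
// Hypothetical, hand-maintained table; not part of Intl or locale-index-of.
const DIGRAPHS = { sk: ['ch', 'dz', 'dž'] };

// Sketch: yields { segment, index } like the fallback segmenter, but keeps
// a listed digraph together as one segment. Indices are in code units.
function* segmentWithDigraphs(string, language) {
  const digraphs = DIGRAPHS[language] || [];
  const chars = Array.from(string); // split by code point
  let index = 0;
  for (let i = 0; i < chars.length; i++) {
    let segment = chars[i];
    const next = chars[i + 1];
    if (next !== undefined && digraphs.includes((segment + next).toLowerCase())) {
      segment += next;
      i++; // the second letter was consumed as part of the digraph
    }
    yield { segment, index };
    index += segment.length;
  }
}

const segments = Array.from(segmentWithDigraphs('chodit hore', 'sk'));
```

With this segmentation the h at position 1 is never a segment of its own, so a segment-by-segment search for ho can only start at position 7. The sketch matches precomposed letters only and would miss a decomposed dž.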
It seems that it does not currently handle decomposed forms. Example:
const collator = new Intl.Collator('fr', { sensitivity: 'base' })
indexOf(collator, 'caf\u0065\u0301 au lait, caf\u00e9 au lait', 'cafe au lait')
// expected: 0
// actual: 15
A simple solution would be to use Intl.Segmenter to align the strings:
const indexOf_ = (segmenter, collator, string, substring) => {
  const substringLength = Array.from(segmenter.segment(substring)).length
  const segments = Array.from(segmenter.segment(string))
  for (let i = 0; i <= segments.length - substringLength; i++) {
    const potentialMatch = segments
      .slice(i, i + substringLength)
      .map(x => x.segment).join('')
    if (collator.compare(potentialMatch, substring) === 0) {
      return segments[i].index
    }
  }
  return -1
}
const segmenter = new Intl.Segmenter('fr', { granularity: 'grapheme' })
const collator = new Intl.Collator('fr', { sensitivity: 'base' })
indexOf_(segmenter, collator, 'caf\u0065\u0301 au lait, caf\u00e9 au lait', 'cafe au lait')
// => 0
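Because a decomposed match can be longer in code units than the search term, a caller that wants to highlight the match also needs its length. The variant below is a sketch along the same lines; findMatch is a hypothetical name, and it assumes a runtime with Intl.Segmenter and full ICU data.

```javascript
// Sketch: same approach as indexOf_, but also reports the matched length
// in code units. `findMatch` is a hypothetical helper, not a library API.
const findMatch = (segmenter, collator, string, substring) => {
  const substringLength = Array.from(segmenter.segment(substring)).length
  if (substringLength === 0) return { index: 0, length: 0 }
  const segments = Array.from(segmenter.segment(string))
  for (let i = 0; i <= segments.length - substringLength; i++) {
    const potentialMatch = segments
      .slice(i, i + substringLength)
      .map(x => x.segment).join('')
    if (collator.compare(potentialMatch, substring) === 0) {
      // potentialMatch.length counts code units, so e + U+0301 counts as 2
      return { index: segments[i].index, length: potentialMatch.length }
    }
  }
  return null
}

const frSegmenter = new Intl.Segmenter('fr', { granularity: 'grapheme' })
const frCollator = new Intl.Collator('fr', { sensitivity: 'base' })
const decomposed = findMatch(frSegmenter, frCollator, 'caf\u0065\u0301 au lait, caf\u00e9 au lait', 'cafe au lait')
const precomposed = findMatch(frSegmenter, frCollator, 'Fondation Caf\u00e9', 'cafe')
```

Here decomposed.length is one larger than the search term's length, since the matched é occupies two code units.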