Comments (14)
I wonder if this feature could benefit from Intl.Segmenter (it requires a polyfill for Firefox). Segmenter takes a locale and automatically determines where the word boundaries should be, potentially reducing library size and improving tokenization performance. It works on the server side too.
from orama.
@SoTosorrow we could add rules for languages such as Chinese, where we operate on tokens differently. But we need examples and documentation to understand how to handle them properly; we might need your help here
Hi @SoTosorrow, absolutely, any PR is very much appreciated!
> Hi @SoTosorrow, absolutely, any PR is very much appreciated!

Thanks for your reply.
I just realized that Lyra's index starts from the beginning of each split word. For example, Lyra can match "lov" in "i love her", but cannot match "ove" in "i love her" (with exact).
This means that for languages written without separators (or with few of them), such as Chinese and Japanese, the same rules cannot simply be applied.
A Chinese sentence always looks like "ABC,EF" ("iloveher,ofcourse"), so I cannot find the sentence by searching "B" ("love") or "C" ("her"); I can only find it with "A..." ("ilove").
It seems I can't put together a PR that easily, hhhhhhh
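The limitation described above can be sketched with a toy prefix matcher (a hypothetical illustration, not Orama/Lyra internals): a whitespace tokenizer keeps an unspaced Chinese sentence as a single token, so only prefixes of the whole sentence can ever match.

```typescript
// A toy whitespace tokenizer: Chinese text has no spaces,
// so the whole sentence survives as one token.
const tokenize = (raw: string) => raw.split(/\s+/).filter(Boolean);
const tokens = tokenize("我爱她"); // ["我爱她"] — a single token

// Prefix matching against whole tokens, as in a prefix-tree index.
const matchesPrefix = (query: string) =>
  tokens.some((t) => t.startsWith(query));

console.log(matchesPrefix("我")); // true: a prefix of the token
console.log(matchesPrefix("爱")); // false: a mid-token substring
```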
> Hi @SoTosorrow, absolutely, any PR is very much appreciated!

It's not easy to support Chinese (or any other language that runs words together without separators) by appending a simple regular expression to pure Lyra; to retrieve Chinese text, the words need to be broken apart before "insert" and "search".
Should I add the regular expression and tell users that Chinese sentences need to be pre-processed first, or give up on this approach?
> @SoTosorrow we could add rules for languages such as Chinese, where we operate on tokens differently. But we need examples and documentation to understand how to handle them properly; we might need your help here
I'd love to help with examples and documentation; I'll share the relevant information once I've sorted it out.
Should I open a discussion for the examples and documentation, or continue in this issue?
Let's open a discussion for that; it will act as future documentation.
> Let's open a discussion for that; it will act as future documentation.

Copy that! Thanks.
> I wonder if this feature could benefit from Intl.Segmenter (it requires a polyfill for Firefox). Segmenter takes a locale and automatically determines where the word boundaries should be, potentially reducing library size and improving tokenization performance. It works on the server side too.

It seems to work. I will run more tests — thanks for your guidance!
@SoTosorrow Did you manage to get Chinese working? If so, could you provide an example?
Based on the help provided in the comments above, I implemented a Chinese tokenizer using Intl.Segmenter, which may help you. Intl.Segmenter works great in Chrome and Cloudflare Workers.
// override the default english tokenizer
const chineseTokenizer = {
  language: "english",
  normalizationCache: new Map(),
  tokenize: (raw: string) => {
    const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
    const _iterator = segmenter.segment(raw)[Symbol.iterator]();
    return Array.from(_iterator).map((i) => i.segment);
  },
};

const db: Orama<typeof schema> = await create({
  schema,
  components: {
    tokenizer: chineseTokenizer,
  },
});
Update:
Although no errors were reported, most of the time I couldn't find the results I wanted, so I think further adaptation is needed somewhere. I won't be able to pursue it further; for now I'll choose another engine for my project.
I have also tried Intl.Segmenter based on the comments above, but the results on Chinese are not always good, and there may be some dependency issues.
I have also tried other word-segmentation libraries such as "jieba"; some of them give good results, but they introduce additional third-party packages and required modifying the core segmentation function (at the time) to adapt it to Chinese.
Considering the possible impact, I stopped.
@SoTosorrow Which search engine did you choose in the end? I'm going to try Algolia.
> @SoTosorrow Which search engine did you choose in the end? I'm going to try Algolia.

I didn't end up using a JS search service, so I'm afraid I can't give you more suggestions.