Comments (7)
@andersonbcdefg Can I try your dataset?
from usearch.
https://huggingface.co/datasets/teknium/OpenHermes-2.5
from usearch.
Great, @andersonbcdefg I downloaded this data file . What model/service did you use to vectorize the text? Can you share your vectorization technique/code? That way i can try to reproduce exactly what you are seeing.
from usearch.
Thanks for the issue, @andersonbcdefg! Current version has very high variance depending on the dataset and other function arguments. I will be releasing a different algorithm in v3 😉
Will you be open to help test it before the public release?
from usearch.
i'm down! i did just rip all the usearch clustering out of my codebase and replace it with K-means, haha. but i'll test it outside of prod on similar datasets and see if it's faster, and if so i can put it back! :D I do think the resulting K-means clusters are a bit worse, but it's 20 seconds for streaming k-means vs 10 minutes for usearch so that was a pretty major difference.
from usearch.
@sourcesync I didn't do anything crazy just compute dense embeddings with an open-source model like BGE-small over the "conversations" field converted to text by unrolling the conversation with "user:" and "assistant:" prefixes
from usearch.
@sourcesync I didn't do anything crazy just compute dense embeddings with an open-source model like BGE-small over the "conversations" field converted to text by unrolling the conversation with "user:" and "assistant:" prefixes
Thanks @andersonbcdefg. Per Ash's comment, I was assuming there is something going on with the statistical distribution of your vectors giving you worst-case performance. When I get a chance, I'll try a different vectorizer using the same fields you used.
from usearch.
Related Issues (20)
- Bug: error: linking with `cc` failed: exit status: 1 in rust crate HOT 20
- Bug: crash when hardware concurrency is exceeded HOT 5
- Bug: index.search returns invalid keys when k > index size HOT 5
- Bug: Deadlock in concurrent update()s HOT 5
- Bug: Replacing initial entry affects visibility of other entries HOT 2
- Feature: Cross compilation of sqlite extension for ios and android for react native apps HOT 2
- Bug: Issues index dtype=i8 with Inner Product Metrics HOT 27
- Feature parity between GoLang and C
- Feature: Java search API extension to batch search and ANN.
- Bug: Segfault when dimensions of added vector don't add up (Rust) HOT 4
- Bug: Failed to run c++ examples. HOT 2
- Bug: Arm64 versions starting at v10.0 and up give the error Fatal Python error: Illegal instruction HOT 3
- Low index performance after `clear()` HOT 2
- Bug: Syntax Error with Jest in ESM HOT 3
- Bug: Rust build does not use simsimd (`index.hardware_acceleration()` reports `serial`) HOT 2
- Bug: cannot open old database (created with 2.9.2) with new version (2.12.0) HOT 4
- Feature: adding `py.typed` metadata to `python/usearch` HOT 1
- Bug: npm package does not support esm in nodejs project. HOT 2
- "usearch_sqlite" binary for Windows HOT 1
- Bug: Rust test_add_remove_vector fails on main-dev HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from usearch.