Describe the bug Not sure if it's a bug, but the Usearch README le

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Great, <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Thanks for the issue, <a class="user-mention notranslate" data-hovercard-type="user" d

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

Bug: Clustering is really really slow about usearch HOT 7 OPEN

andersonbcdefg commented on June 1, 2024

Bug: Clustering is really really slow

from usearch.

Comments (7)

sourcesync commented on June 1, 2024

@andersonbcdefg Can I try your dataset?

from usearch.

andersonbcdefg commented on June 1, 2024

https://huggingface.co/datasets/teknium/OpenHermes-2.5

from usearch.

sourcesync commented on June 1, 2024

Great, @andersonbcdefg I downloaded this data file . What model/service did you use to vectorize the text? Can you share your vectorization technique/code? That way i can try to reproduce exactly what you are seeing.

from usearch.

ashvardanian commented on June 1, 2024

Thanks for the issue, @andersonbcdefg! Current version has very high variance depending on the dataset and other function arguments. I will be releasing a different algorithm in v3 😉
Will you be open to help test it before the public release?

from usearch.

andersonbcdefg commented on June 1, 2024

i'm down! i did just rip all the usearch clustering out of my codebase and replace it with K-means, haha. but i'll test it outside of prod on similar datasets and see if it's faster, and if so i can put it back! :D I do think the resulting K-means clusters are a bit worse, but it's 20 seconds for streaming k-means vs 10 minutes for usearch so that was a pretty major difference.

from usearch.

andersonbcdefg commented on June 1, 2024

@sourcesync I didn't do anything crazy just compute dense embeddings with an open-source model like BGE-small over the "conversations" field converted to text by unrolling the conversation with "user:" and "assistant:" prefixes

from usearch.

sourcesync commented on June 1, 2024

@sourcesync I didn't do anything crazy just compute dense embeddings with an open-source model like BGE-small over the "conversations" field converted to text by unrolling the conversation with "user:" and "assistant:" prefixes

Thanks @andersonbcdefg. Per Ash's comment, I was assuming there is something going on with the statistical distribution of your vectors giving you worst-case performance. When I get a chance, I'll try a different vectorizer using the same fields you used.

from usearch.

Recommend Projects

Bug: Clustering is really really slow about usearch HOT 7 OPEN

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent