Right now Chroma is built as a "single-space" embedding store. Meaning it stores 1 emb


"Multi-space" about chroma HOT 3 CLOSED

chroma-core commented on July 4, 2024

"Multi-space"

from chroma.

Comments (3)

jeffchuber commented on July 4, 2024

What would this look like in practice....

Our application should support many embedding spaces at once. An embedding space is specific to a model_version and a layer. A model_version may have many layers tracked and therefore also many embedding spaces. Furthermore, an app may have many model_versions as well.
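To make the hierarchy above concrete, here is a minimal sketch of how a space could be identified. The class name `EmbeddingSpace`, its fields, and the `layers_by_model_version` helper are hypothetical, not part of Chroma; the only assumption taken from the description is that a space is determined by app, model_version, and layer.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class EmbeddingSpace:
    """Hypothetical identifier: one space per (app, model_version, layer)."""
    app: str
    model_version: str
    layer: str

    def key(self) -> str:
        # A flat string key, usable as a table name suffix or a metadata value.
        return f"{self.app}/{self.model_version}/{self.layer}"


def layers_by_model_version(spaces: Iterable[EmbeddingSpace]) -> Dict[str, List[str]]:
    """Group tracked layers under their model_version."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for space in spaces:
        grouped[space.model_version].append(space.layer)
    return dict(grouped)


spaces = [
    EmbeddingSpace("demo_app", "v1", "layer_3"),
    EmbeddingSpace("demo_app", "v1", "layer_7"),
    EmbeddingSpace("demo_app", "v2", "layer_3"),
]
```

Because the dataclass is frozen, instances are hashable and can serve directly as dictionary keys, whether or not app, model_version, and layer ever become separate tables.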

My default assumption is that we don't track model_version, layer, or app as separate tables - at least not in the user facing api. They can be columns or stored in some key-value list...

Should they have their own dedicated columns? Or be configurable by the user? How much flexibility do we care to offer here?

Will table schemas be heterogeneous across embedding spaces, or will they be consistent? Do we care? I think we probably only really care that they have a column named embedding_data, one named category_name, and one named user_identifer... or that we have a mapping or functional mapping to that data.
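One way the "mapping" option could work, as a rough sketch: each space supplies an optional dictionary from the three logical column names to whatever its table actually calls them. The function `resolve_columns` and the example column names are made up for illustration; the three logical names are spelled as in the comment above.

```python
from typing import Dict, Iterable, Optional

# Logical columns every space must expose, directly or via a mapping.
REQUIRED_COLUMNS = ("embedding_data", "category_name", "user_identifer")


def resolve_columns(
    table_columns: Iterable[str],
    mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return {logical name: physical column}, or raise if one is missing."""
    available = set(table_columns)
    mapping = mapping or {}
    resolved: Dict[str, str] = {}
    missing = []
    for logical in REQUIRED_COLUMNS:
        # Fall back to the logical name when no mapping entry is given.
        physical = mapping.get(logical, logical)
        if physical in available:
            resolved[logical] = physical
        else:
            missing.append(logical)
    if missing:
        raise ValueError(f"no column found for: {', '.join(missing)}")
    return resolved
```

With this shape, schemas can differ freely across spaces as long as each one resolves the three required names.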

Given that duckdb runs in-memory, we have the option to not load all of the data at once and to keep a bunch of it "offline". This is really nice because it speeds everything up. However, we then need some mapping of (1) what tables are available and (2) where to find them. Where do we store the information for (1) and (2)? In another OLTP database, perhaps? Then duckdb needs to further manage bringing these tables in and out of memory, perhaps keeping some persistently loaded in memory until some event happens or the user says they should be unloaded. This offline (storage) / online (high availability) split is similar to a feature store.
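A rough sketch of what such a catalog might look like, kept independent of duckdb on purpose: the `loader` callable stands in for whatever actually reads a table into memory, and `TableCatalog`, `CatalogEntry`, `pinned`, and `max_online` are all hypothetical names. Eviction here is least-recently-used, which is only one of several policies that could fit.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class CatalogEntry:
    location: str         # (2) where to find the table, e.g. a Parquet path
    pinned: bool = False  # pinned tables stay online until unloaded explicitly


class TableCatalog:
    def __init__(self, loader: Callable[[str], Any], max_online: int = 2):
        self._loader = loader
        self._max_online = max_online
        self._entries: Dict[str, CatalogEntry] = {}
        self._online: "OrderedDict[str, Any]" = OrderedDict()

    def register(self, name: str, location: str, pinned: bool = False) -> None:
        self._entries[name] = CatalogEntry(location, pinned)

    def available(self) -> List[str]:
        """(1) What tables exist, whether online or offline."""
        return sorted(self._entries)

    def online(self) -> List[str]:
        """Tables currently in memory, least recently used first."""
        return list(self._online)

    def get(self, name: str) -> Any:
        if name in self._online:
            self._online.move_to_end(name)  # mark as recently used
            return self._online[name]
        table = self._loader(self._entries[name].location)
        self._online[name] = table
        self._evict()
        return table

    def unload(self, name: str) -> None:
        """Explicit user request to take a table offline."""
        self._online.pop(name, None)

    def _evict(self) -> None:
        # Never evict the table that was just requested (the last key).
        # Pinned tables are skipped, so they can push usage past the limit.
        for candidate in list(self._online)[:-1]:
            if len(self._online) <= self._max_online:
                break
            if not self._entries[candidate].pinned:
                del self._online[candidate]
```

In this sketch the catalog itself lives in process memory; persisting `_entries` somewhere durable is exactly the open question raised above.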

We also would need to think about how to partition the parquet files - and specifically how new data is written quickly.

Beyond storing the embeddings and related columns, we also need to think about how we store the indices in this paradigm. We would also persist and load indexes. Some considerations about indexes: knowing the mapping from parquet files to index file(s), and knowing whether they are stale or not. An additional complexity is that I am not sure one index per space is a good assumption. Indexes have settings like l2/cos/ip, different strategies like ngt, annoy, hnsw, and more. We might want to maintain different indices for different use cases.
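As an illustration of tracking several indexes per space plus their staleness, here is a small sketch. It assumes staleness can be decided by comparing a per-space data version against the version each index was built from; `IndexSpec`, `IndexRegistry`, and the method names are hypothetical, and no real index library is involved.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IndexSpec:
    metric: str    # e.g. "l2", "cos", "ip"
    strategy: str  # e.g. "hnsw", "annoy", "ngt"


class IndexRegistry:
    def __init__(self) -> None:
        # Bumped on every write to a space's data files.
        self._data_version: Dict[str, int] = {}
        # Data version each (space, spec) index was last built from.
        self._built_from: Dict[Tuple[str, IndexSpec], int] = {}

    def record_write(self, space: str) -> None:
        self._data_version[space] = self._data_version.get(space, 0) + 1

    def record_build(self, space: str, spec: IndexSpec) -> None:
        self._built_from[(space, spec)] = self._data_version.get(space, 0)

    def is_stale(self, space: str, spec: IndexSpec) -> bool:
        built = self._built_from.get((space, spec))
        # An index that was never built counts as stale.
        return built is None or built != self._data_version.get(space, 0)

    def indexes_for(self, space: str) -> List[IndexSpec]:
        return [spec for (name, spec) in self._built_from if name == space]
```

Keying on `(space, spec)` rather than on the space alone is what lets one space carry, say, both a cosine HNSW index and an L2 Annoy index at once.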


jeffchuber commented on July 4, 2024

@levand would love your thoughts here as well!


levand commented on July 4, 2024

Wish I'd read this before I left comments on #29, which were largely focused on this :)

So yeah. Do we want to introduce a "space" as a first-class part of the data model? Or keep the "dataset" (aka a single embedding space, aka one layer of one model of one app) as the central idiom, and everything else handled via metadata?

If we handle each dataset (aka embedding space) independently, then that's the very natural partition point in Parquet. We'd just end up with dozens/hundreds of Parquet files instead of one or a few.
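For what it's worth, a sketch of what that partitioning could look like on disk, assuming a Hive-style `key=value` directory layout and one new file per write batch so that appends never rewrite existing files. The function names, the `part-NNNNN.parquet` naming, and the directory keys are illustrative only.

```python
from pathlib import PurePosixPath


def dataset_dir(root: str, app: str, model_version: str, layer: str) -> PurePosixPath:
    """Directory holding all Parquet files for one dataset (embedding space)."""
    return (
        PurePosixPath(root)
        / f"app={app}"
        / f"model_version={model_version}"
        / f"layer={layer}"
    )


def batch_path(root: str, app: str, model_version: str, layer: str, batch_number: int) -> PurePosixPath:
    """Path for one appended batch; each batch becomes its own small file."""
    return dataset_dir(root, app, model_version, layer) / f"part-{batch_number:05d}.parquet"
```

The trade-off is many small files per dataset, which would probably need periodic compaction.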

