"Multi-space" about chroma HOT 3 CLOSED

chroma-core avatar chroma-core commented on July 4, 2024
"Multi-space"

from chroma.

Comments (3)

jeffchuber avatar jeffchuber commented on July 4, 2024

What would this look like in practice?

Our application should support many embedding spaces at once. An embedding space is specific to a model_version and a layer. A model_version may have many layers tracked, and therefore many embedding spaces. Furthermore, an app may have many model_versions as well.
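To make the hierarchy concrete, here is a minimal sketch of how one embedding space could be keyed. The class name `EmbeddingSpace`, the helper `spaces_for_model`, and the sample values are assumptions for illustration only, not part of chroma's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpace:
    """Hypothetical key for one embedding space: one layer of one model_version of one app."""
    app: str
    model_version: str
    layer: str


# An app may have many model_versions, and each model_version many layers,
# so a set of these keys enumerates every space the application tracks.
spaces = {
    EmbeddingSpace("demo_app", "v1", "layer_3"),
    EmbeddingSpace("demo_app", "v1", "layer_7"),
    EmbeddingSpace("demo_app", "v2", "layer_3"),
}


def spaces_for_model(all_spaces, app, model_version):
    """Return the spaces belonging to one model_version of one app, sorted by layer name."""
    return sorted(
        (s for s in all_spaces if s.app == app and s.model_version == model_version),
        key=lambda s: s.layer,
    )
```

Because the key is frozen and hashable, it could serve as a dictionary key or be flattened into three plain columns, which fits either storage option discussed here.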

My default assumption is that we don't track model_version, layer, or app as separate tables, at least not in the user-facing API. They can be columns or stored in some key-value list...

Should they have their own dedicated columns? Or be configurable by the user? How much flexibility do we care to offer here?

Will table schemas be heterogeneous across embedding spaces or will they be consistent? Do we care? I think we probably only really care that they have a column named embedding_data, one named category_name, and one named user_identifer... or that we have a mapping (or functional mapping) to that data.
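A rough sketch of the "mapping" option: each space keeps whatever schema it likes, and a small per-space mapping resolves the three required logical columns to physical ones. The function name `resolve_columns` and the mapping shape are hypothetical.

```python
# Logical columns every embedding space must expose, spelled as in the discussion above.
REQUIRED_COLUMNS = ("embedding_data", "category_name", "user_identifer")


def resolve_columns(table_columns, mapping=None):
    """Map each required logical column to a physical column of the table.

    `mapping` is an optional dict of logical name -> physical name; logical
    names missing from it are assumed to exist under their own name.
    Raises KeyError if a required column cannot be found.
    """
    mapping = mapping or {}
    resolved = {}
    for logical in REQUIRED_COLUMNS:
        physical = mapping.get(logical, logical)
        if physical not in table_columns:
            raise KeyError(f"no column for {logical!r} (looked for {physical!r})")
        resolved[logical] = physical
    return resolved
```

With this approach, schemas can stay heterogeneous as long as each space can satisfy the mapping.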

Given that DuckDB runs in-memory, we have the option to not load all of the data at once and to keep a bunch of it "offline". This is really nice because it speeds up everything. However, we then need some mapping of (1) what tables are available and (2) where to find them. Where do we store the information for (1) and (2)? In another OLTP store, perhaps? DuckDB then also needs to manage bringing these tables in and out of memory, perhaps keeping some persistently loaded until some event happens or the user says they should be unloaded. This offline (storage) / online (high availability) split is similar to a feature store.
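As a sketch of what the (1)/(2) mapping could look like, here is a tiny catalog kept in SQLite via Python's standard `sqlite3` module. The class `TableCatalog`, its schema, and its method names are assumptions; it only tracks metadata and does not itself load anything into DuckDB.

```python
import sqlite3


class TableCatalog:
    """Hypothetical catalog: which tables exist, where they live, and whether they are loaded."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS catalog ("
            "table_name TEXT PRIMARY KEY, "
            "location TEXT NOT NULL, "
            "loaded INTEGER NOT NULL DEFAULT 0)"
        )

    def register(self, table_name, location):
        """Record (or overwrite) where a table's data is stored; it starts as not loaded."""
        self.db.execute(
            "INSERT OR REPLACE INTO catalog (table_name, location, loaded) VALUES (?, ?, 0)",
            (table_name, location),
        )
        self.db.commit()

    def locate(self, table_name):
        """Return the storage location of a table, or None if it is unknown."""
        row = self.db.execute(
            "SELECT location FROM catalog WHERE table_name = ?", (table_name,)
        ).fetchone()
        return row[0] if row else None

    def set_loaded(self, table_name, loaded):
        """Mark a table as brought into (True) or evicted from (False) memory."""
        self.db.execute(
            "UPDATE catalog SET loaded = ? WHERE table_name = ?",
            (int(loaded), table_name),
        )
        self.db.commit()

    def loaded_tables(self):
        """List the names of tables currently marked as loaded."""
        rows = self.db.execute(
            "SELECT table_name FROM catalog WHERE loaded = 1 ORDER BY table_name"
        )
        return [r[0] for r in rows]
```

Whatever loads and unloads tables would call `set_loaded` around those operations; the eviction policy (event-driven or user-driven) is left open here, as it is in the discussion.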

We would also need to think about how to partition the Parquet files, and specifically how new data is written quickly.

Beyond storing the embeddings and related columns, we also need to think about how we store the indices in this paradigm. We would also persist and load indexes. Some considerations about indexes: knowing the mapping from Parquet files to index file(s), and knowing whether they are stale or not. An additional complexity is that I am not sure one index per space is a good assumption. Indexes have settings like l2/cos/ip, different strategies like ngt, annoy, hnsw, and more. We might want to maintain different indices for different use cases.
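One possible shape for that bookkeeping, as a hedged sketch: each index record remembers which Parquet files it was built from, so staleness becomes a set comparison, and a space can own several records with different strategies and metrics. The names `IndexRecord`, `is_stale`, and `indexes_for` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRecord:
    """Hypothetical registry entry linking one persisted index to its source files."""
    space: str
    strategy: str          # e.g. "hnsw", "annoy", "ngt"
    metric: str            # e.g. "l2", "cos", "ip"
    index_file: str
    built_from: frozenset  # Parquet files covered when the index was built


def is_stale(record, current_parquet_files):
    """Treat an index as stale when the space's set of Parquet files has changed."""
    return record.built_from != frozenset(current_parquet_files)


def indexes_for(records, space):
    """Return every index kept for a space; more than one is allowed."""
    return [r for r in records if r.space == space]
```

This only detects files being added or removed. Catching in-place rewrites of an existing file would need an extra signal such as a checksum or modification time.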

jeffchuber avatar jeffchuber commented on July 4, 2024

@levand would love your thoughts here as well!

levand avatar levand commented on July 4, 2024

Wish I'd read this before I left comments on #29, which were largely focused on this :)

So yeah. Do we want to introduce a "space" as a first-class part of the data model? Or keep the "dataset" (aka a single embedding space, aka one layer of one model of one app) as the central idiom, and everything else handled via metadata?

If we handle each dataset (aka embedding space) independently, then that's the very natural partition point in Parquet. We'd just end up with dozens/hundreds of Parquet files instead of one or a few.
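A quick sketch of what one-file-per-dataset could look like on disk. The `key=value` directory naming follows the common Hive-style partitioning convention; the function name `dataset_path`, the root directory, and the `data.parquet` file name are assumptions for illustration.

```python
from pathlib import PurePosixPath


def dataset_path(root, app, model_version, layer):
    """Build the Parquet location for one dataset (one layer of one model of one app)."""
    return (
        PurePosixPath(root)
        / f"app={app}"
        / f"model_version={model_version}"
        / f"layer={layer}"
        / "data.parquet"
    )
```

Each dataset gets its own file, so adding a new space never touches existing files, at the cost of having many small files to track.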
