Comments (3)
What would this look like in practice....
Our application should support many embedding spaces at once. An embedding space is specific to a `model_version` and a `layer`. A `model_version` may have many `layers` tracked, and therefore also many embedding spaces. Furthermore, an `app` may have many `model_versions` as well.
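As a rough illustration of that hierarchy, here is a minimal sketch that treats an embedding space as nothing more than an (`app`, `model_version`, `layer`) key. The `EmbeddingSpace` class, the `spaces_for` helper, and the sample values are all hypothetical and not part of any existing API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EmbeddingSpace:
    """Hypothetical key: one embedding space = one layer of one model_version of one app."""
    app: str
    model_version: str
    layer: str


# Made-up sample data: one app, two model versions, several tracked layers.
SPACES = [
    EmbeddingSpace("demo_app", "v1", "layer_3"),
    EmbeddingSpace("demo_app", "v1", "layer_7"),
    EmbeddingSpace("demo_app", "v2", "layer_3"),
]


def spaces_for(app: str, model_version: Optional[str] = None) -> List[EmbeddingSpace]:
    """Return every space of an app, optionally narrowed to one model_version."""
    return [
        s for s in SPACES
        if s.app == app and (model_version is None or s.model_version == model_version)
    ]
```

Nothing here requires separate tables; the three fields could equally be plain columns or key-value metadata.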
My default assumption is that we don't track `model_version`, `layer`, or `app` as separate tables, at least not in the user-facing API. They can be columns or stored in some key-value list...
Should they have their own dedicated columns? Or be configurable by the user? How much flexibility do we care to offer here?
Will table schemas be heterogeneous across embedding spaces, or will they be consistent? Do we care? I think we probably only really care that they have a column named `embedding_data`, one named `category_name`, and one named `user_identifer`... or that we have a mapping or functional mapping to that data.
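One way the "required columns or a mapping" idea could look is sketched below. The three logical names come from the paragraph above; the `resolve_columns` function and the shape of the `mapping` argument are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, Optional

# Logical column names taken from the discussion above.
REQUIRED = ("embedding_data", "category_name", "user_identifer")


def resolve_columns(
    table_columns: Iterable[str],
    mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map each required logical column to a physical column in the table.

    `mapping` (hypothetical) lets a user say "my embeddings live in column X".
    Logical names without an entry fall back to a column of the same name.
    """
    available = set(table_columns)
    mapping = mapping or {}
    resolved = {}
    missing = []
    for logical in REQUIRED:
        physical = mapping.get(logical, logical)
        if physical in available:
            resolved[logical] = physical
        else:
            missing.append(logical)
    if missing:
        raise ValueError(f"no column found for: {', '.join(missing)}")
    return resolved
```

With something like this, schemas can stay heterogeneous across spaces as long as every table resolves the three logical columns.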
Given DuckDB runs in-memory, we have the option to not load all of the data at once and keep a bunch of it "offline". This is really nice because it speeds everything up. However, we then need some mapping of (1) what tables are available and (2) where to find them. Where do we store the information for (1) and (2)? In another OLTP store, perhaps? DuckDB then needs to manage bringing these tables in and out of memory, perhaps keeping some persistently loaded until some event happens or the user says they should be unloaded. This offline (storage) / online (high availability) split is similar to a feature store.
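To make the catalog question concrete, here is a minimal in-process sketch of (1) which tables exist, (2) where they live, and which ones are currently "online". The `Catalog` class, its eviction policy (least recently used, skipping pinned tables), and the `max_loaded` limit are all assumptions; the actual load step is left as a comment rather than a real DuckDB call.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TableEntry:
    path: str             # where the offline copy lives, e.g. a Parquet file
    loaded: bool = False  # whether the table is currently online
    pinned: bool = False  # pinned tables are never evicted automatically


class Catalog:
    """Hypothetical registry of available tables and their online/offline state."""

    def __init__(self, max_loaded: int = 2) -> None:
        self.entries: Dict[str, TableEntry] = {}
        self.max_loaded = max_loaded
        self._recent: List[str] = []  # loaded table names, least recently used first

    def register(self, name: str, path: str, pinned: bool = False) -> None:
        self.entries[name] = TableEntry(path=path, pinned=pinned)

    def load(self, name: str) -> str:
        entry = self.entries[name]  # raises KeyError for unknown tables
        if name in self._recent:
            self._recent.remove(name)
        self._recent.append(name)
        entry.loaded = True
        # A real implementation would bring the data into DuckDB here,
        # for example by selecting from read_parquet(entry.path).
        self._evict()
        return entry.path

    def unload(self, name: str) -> None:
        self.entries[name].loaded = False
        if name in self._recent:
            self._recent.remove(name)

    def _evict(self) -> None:
        # Drop the least recently used unpinned table until under the limit;
        # the table that was just loaded (last in the list) is never a candidate.
        while len(self._recent) > self.max_loaded:
            victim = next(
                (n for n in self._recent[:-1] if not self.entries[n].pinned), None
            )
            if victim is None:
                break
            self.unload(victim)

    def loaded(self) -> List[str]:
        return list(self._recent)
```

Whether this state lives in process memory, in a small metadata table, or in a separate database is exactly the open question; the sketch only shows the bookkeeping involved.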
We would also need to think about how to partition the Parquet files, and specifically how new data gets written quickly.
Beyond storing the embeddings and related columns, we also need to think about how we store the indices in this paradigm. We would also persist and load indexes. Some considerations about indexes: knowing the mapping from Parquet files to index file(s), and knowing whether they are stale or not. An additional complexity is that I am not sure one index per space is a good assumption. Indexes have settings like `l2/cos/ip`, different strategies like `ngt`, `annoy`, `hnsw`, etc. We might want to maintain different indices for different use cases.
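A sketch of what tracking several indexes per space, plus staleness, might involve. It assumes a simple per-space write counter as the notion of "data version"; the `IndexKey` and `IndexRegistry` names and the file paths are hypothetical, and no real index library is called.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IndexKey:
    space: str     # which embedding space the index covers
    metric: str    # e.g. "l2", "cos", "ip"
    strategy: str  # e.g. "hnsw", "annoy", "ngt"


class IndexRegistry:
    """Hypothetical bookkeeping: which index files exist and whether they are stale."""

    def __init__(self) -> None:
        self.data_version: Dict[str, int] = {}               # space -> write counter
        self.builds: Dict[IndexKey, Tuple[str, int]] = {}    # key -> (path, version at build)

    def record_write(self, space: str) -> None:
        """Call whenever new data lands in a space's Parquet files."""
        self.data_version[space] = self.data_version.get(space, 0) + 1

    def record_build(self, key: IndexKey, path: str) -> None:
        """Call after an index has been (re)built and persisted at `path`."""
        self.builds[key] = (path, self.data_version.get(key.space, 0))

    def is_stale(self, key: IndexKey) -> bool:
        """An index is stale if its space was written to after the index was built."""
        _, built_version = self.builds[key]
        return built_version < self.data_version.get(key.space, 0)

    def indexes_for(self, space: str) -> List[IndexKey]:
        return [k for k in self.builds if k.space == space]
```

Keying on (space, metric, strategy) rather than on space alone is what allows more than one index per space.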
from chroma.
@levand would love your thoughts here as well!
Wish I'd read this before I left comments on #29, which were largely focused on this :)
So yeah. Do we want to introduce a "space" as a first-class part of the data model? Or keep the "dataset" (aka a single embedding space, aka one layer of one model of one app) as the central idiom, and everything else handled via metadata?
If we handle each dataset (aka embedding space) independently, then that's the very natural partition point in Parquet. We'd just end up with dozens/hundreds of Parquet files instead of one/few.
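For illustration, a per-dataset layout could be as simple as one directory per embedding space. The sketch below assumes Hive-style `key=value` directory names and a single `data.parquet` file per dataset; both choices, and the `dataset_path` helper itself, are hypothetical.

```python
from pathlib import PurePosixPath


def dataset_path(root: str, app: str, model_version: str, layer: str) -> PurePosixPath:
    """Build the (assumed) storage location for one dataset, i.e. one embedding space."""
    return (
        PurePosixPath(root)
        / f"app={app}"
        / f"model_version={model_version}"
        / f"layer={layer}"
        / "data.parquet"
    )


# Sample values are made up.
print(dataset_path("store", "demo_app", "v1", "layer_3"))
# store/app=demo_app/model_version=v1/layer=layer_3/data.parquet
```

New data for one space then only touches that space's files, which is the partitioning benefit described above.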