chroma-core / chroma
the AI-native open-source embedding database
Home Page: https://www.trychroma.com/
License: Apache License 2.0
In several places, we convert UUID objects back and forth between UUIDs and strings. This isn't great, and it's hard to keep track of which form they should be in, and when.
We should add types to make sure they are what we expect, and then refactor to do this conversion as little as possible, or preferably not at all.
An example:
Line 115 in ddaabeb
It's ambiguous here what type collection_uuid should be.
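One way to pin this down, sketched below with a hypothetical `ensure_uuid` helper (not an existing Chroma function): normalize at the API boundary once, so everything internal can assume `uuid.UUID` and the back-and-forth conversion disappears.

```python
import uuid
from typing import Union


def ensure_uuid(value: Union[str, uuid.UUID]) -> uuid.UUID:
    """Normalize a collection identifier to a uuid.UUID at the boundary,
    so internal read/write paths never have to guess which form they hold."""
    if isinstance(value, uuid.UUID):
        return value
    return uuid.UUID(value)
```

With the type annotation in place, a type checker flags any remaining spot that passes a raw string deeper into the stack.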
Currently we store data in the COCO JSON format. That looks like this:
{
  "annotations": [{
    "bbox": [-0.88869536, -2.7152133, 127.68482208, 62.74983978],
    "category_id": 43,
    "category_name": "knife"
  }]
}
More here on that format: https://roboflow.com/formats/coco-json
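For reference, reading the format above takes only the stdlib; `load_coco_annotations` is a hypothetical helper matching the sample shape shown (with `category_name` inline), and `bbox` follows the COCO `[x, y, width, height]` convention.

```python
import json


def load_coco_annotations(path):
    """Read a COCO-style JSON file and return (category_name, bbox) pairs.
    bbox is [x, y, width, height] per the COCO convention."""
    with open(path) as f:
        data = json.load(f)
    return [(a["category_name"], a["bbox"]) for a in data["annotations"]]
```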
Questions
Next steps
Review various data formats. Look further at roboflow, see if the labeling providers have some documentation. Consider both vision as well as NLP applications.
chroma-client is mainly responsible for the public API and for ferrying data to the backend.
chroma-server is mainly responsible for storing data and running computations.
Inside chroma there are 2 wolves:
Wolf B is fairly obvious... there is chroma-client and chroma-server, and they work together.
Wolf A however... does chroma-client have extra functionality to handle the in-memory use case? Or is there separate code that is shared between chroma-server and chroma-client?
I think some code has to be shared... the code for "doing the maths" - and possibly more, things like data formats, etc. How should this be structured?
Chroma uses strict requirement pins, which means that pip will only install it as part of an application whose requirement pins are exactly the same. As Chroma is a framework that will (hopefully!) be included in many other applications, having strict pins severely limits that uptake by making it automatically incompatible with applications that have even a minor version difference of something as common as FastAPI, requests, or Pydantic.
If it's of interest: at Prefect we found this article very helpful in thinking through the implications of pinning versions, and we have generally left our frameworks (which are intended to be used in other software) pinned only to lower bounds that we know are compatible with our usage.
This version is meant to get a handle on the external and internal APIs, not yet to define all the specifics of what inputs they take, if they do single vs batch mode, etc.
...and the index, and saves it to the db.chroma directory at the level the Python function was executed from.
...and regenerate distance values? On every new batch of datapoints?
Collections should store their UUIDs and pass them through read/write paths rather than requiring every path to look up collection_name -> collection_uuid.
Without real auditing, I already know the counts are off in the objects set. I don't plan to go crazy with the audit, but just some sanity checks before asking Anton and Jeff to review.
The new files will be converted into parquet for testing.
One of the requirements for the package is uuid, a package from 2006 supporting Python 2. uuid is now a built-in library, so it should be removed from requirements.
Based on our conversations so far, write up a possible round-trip customer data plan for MVP.
Should include:
Currently if you try to get nearest neighbors before building an index, you get a cryptic error message. We have the state to give users a great error message, so do this.
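A minimal sketch of the shape this could take (the `NoIndexError` name and function signature are hypothetical, not the current API): check the state we already have and raise an actionable message before the low-level query ever runs.

```python
class NoIndexError(Exception):
    """Raised when a query arrives before any index has been built."""


def get_nearest_neighbors(index, embedding, n_results=10):
    """Fail with a helpful, actionable message instead of a cryptic
    low-level error when the index hasn't been built yet."""
    if index is None:
        raise NoIndexError(
            "No index found. Call create_index() after adding embeddings, "
            "then retry get_nearest_neighbors()."
        )
    return index.knn_query(embedding, k=n_results)
```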
Any production service needs a way to configure regular backups. If we go with duckdb/parquet
this could be as simple as a bucket policy to create an archive of any new or updated files every 24h.
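If we instead drive the archive from our side, the incremental piece is just "which files changed in the window"; `files_changed_since` below is a hypothetical stdlib-only helper, with the actual upload left to a bucket policy or a sync tool.

```python
import os
import time


def files_changed_since(root, seconds=24 * 3600):
    """Collect parquet files modified within the window; these are the
    candidates for the next incremental backup archive."""
    cutoff = time.time() - seconds
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if name.endswith(".parquet") and os.path.getmtime(path) >= cutoff:
                changed.append(path)
    return changed
```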
So far we've only explored the simple in-process version. Consider multi-process, server-side.
Unless we want to continue on relying on the user to manually trigger the formation of the index, and running other downstream analytics, we want to have an internal worker/queue system for running these ourselves based on triggers and schedules.
The current client-server bridge uses http+json+rest because it was easy, but certainly not because it was fast or ideal. Explore alternatives to increase speed.
The core algorithms have a lot of redundant computation we could avoid by modularizing them further and caching results as they become available.
https://github.com/chroma-core/chroma/blob/jeff/end-to-end-v1/chroma-client/tests/test_client.py#L18
I need to stub out the server to test the client...
How to do this? @levand any ideas?
One concern about using DuckDB and parquet is maintaining correctness even when potentially many requests per second are coming in to add new embeddings to the production data space.
The other concern is multiple users in the org querying or pulling data from a service at the same time.
Will this work? Will there be collisions?
#87 introduces the core algorithms, but representative sampling using class labels is currently broken.
This is probably because hnswlib filtering doesn't let us get a connected graph, and so ANN search fails. We should understand how / when this fails. There are possibly easy fixes, i.e. generating a separate index per dataset per class, or increasing the edge factor, but the real solution is to understand how to use hnsw properly in our context.
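To make the "separate index per class" workaround concrete, here is a sketch of the partitioning idea. The bucket here is just a list scanned brute-force so the example is self-contained; in practice each bucket would be its own hnswlib index, which sidesteps the connectivity problem because no filtering happens inside a single graph.

```python
from collections import defaultdict


def build_per_class_indexes(embeddings, labels):
    """Partition embeddings by class label so each class gets its own
    'index' (a plain list here; an hnswlib index per bucket in practice)."""
    buckets = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        buckets[label].append(emb)
    return dict(buckets)


def nearest_in_class(indexes, query, label):
    """Brute-force nearest neighbour within one class bucket."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(indexes[label], key=lambda emb: sq_dist(emb, query))
```

The trade-off is memory and build time scaling with the number of classes, which is why understanding hnsw's filtering behavior properly is still the real fix.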
We don't know exactly what users want yet in a frontend, but here are some very general ideas
Down the road
Step 1 would be to figure out a sketch of how we would package it up.
One of the known potential issues with using hnswlib vs other more "full-featured" options is updating the index without recomputing the whole index.
This is actually important because updating the index is a core part of what we do.
Deleting from an index is also useful because users may upload data accidentally and want to remove it without starting completely over.
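One common way ANN libraries handle deletion without a rebuild is tombstoning: mark ids as deleted, filter them at query time, and only rebuild periodically. The sketch below shows the idea with a self-contained brute-force store (hnswlib's mark_deleted is similar in spirit); the class and its API are illustrative, not Chroma's.

```python
class SoftDeleteIndex:
    """Tombstone-style deletion: mark ids as deleted and filter them out
    of query results, rebuilding the underlying index only periodically."""

    def __init__(self):
        self._items = {}       # id -> embedding
        self._deleted = set()  # tombstoned ids

    def add(self, item_id, embedding):
        self._items[item_id] = embedding

    def delete(self, item_id):
        self._deleted.add(item_id)  # no rebuild needed

    def nearest(self, query):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        live = ((i, e) for i, e in self._items.items() if i not in self._deleted)
        return min(live, key=lambda pair: sq_dist(pair[1], query))[0]
```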
Because of query_dataframe. It should be overridden.
We think the changes we wanted are now upstream. Test!
get_nearest_neighbor can return NaN for the labels if we happen to query unlabeled datasets. This breaks FastAPI's JSON serialization with the error ValueError: Out of range float values are not JSON compliant.
We should replace the NaNs or handle them accordingly. The current workaround when using the server is to not query unlabeled datasets in get_nearest_neighbor calls, like so - which is likely what is desired anyway:
chroma_client.get_nearest_neighbors(embedding, n_results=N, where={'dataset':'training'})
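The server-side fix could be as small as sanitizing the response before it hits serialization; `sanitize_for_json` below is a hypothetical helper, mapping NaN to None (JSON null).

```python
import math


def sanitize_for_json(obj):
    """Recursively replace NaN floats with None so the JSON encoder
    never sees an out-of-range float value."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, list):
        return [sanitize_for_json(v) for v in obj]
    if isinstance(obj, dict):
        return {k: sanitize_for_json(v) for k, v in obj.items()}
    return obj
```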
space_key is dumb. The idea in theory makes sense... there is an embedding space, and there is a key to denote that unique embedding space.
But I think it's very confusing, and DX includes good naming! (hardest problem blah blah)
Other candidates:
environment
scope
Depending on our ANN lib (e.g. Hnsw) we'll have an artifact computed for a set of rows. That has to:
Obvious, so users don't have to send unencrypted packets over the wire.
More available here: https://fastapi.tiangolo.com/deployment/concepts/
If we stay with parquet, I imagine we will want to support moving these files off the local machines they are built on and off to object storage on whatever cloud the user is using.
If I'm using create_collection in a notebook cell (or in another Python function), the code will fail on the second run because the collection already exists. This forces users to change their code from create to get collection, or to use an or statement like this:
repos = chroma_client.get_collection(name="my_repos") or chroma_client.create_collection(name="my_repos")
Inspired by Rails ActiveRecord's find_or_create_by, Chroma could support a get_or_create function that creates the collection OR returns it if it already exists.
If the collection exists, Chroma prints a notice. This will help avoid issues in which users write into someone else's collection, or write data into a collection thinking it's brand new when it already has data in it.
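A client-side version could look like the sketch below. It assumes get_collection raises when the collection is missing, which may not match the current client's behavior; the function name and error handling are illustrative only.

```python
def get_or_create_collection(client, name):
    """Return the existing collection, or create it. Prints a notice when
    it already existed so users don't silently write into old data."""
    try:
        collection = client.get_collection(name=name)
    except Exception:  # assumed: get_collection raises when missing
        return client.create_collection(name=name)
    print(f"Collection '{name}' already exists; returning it.")
    return collection
```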
Right now Chroma is built as a "single-space" embedding store. Meaning it stores 1 embedding space at a time (from 1 layer, 1 trained model, 1 app).
Think about what it would take to make it "multi-store".
If we stick with duckdb+parquet, this probably also means various kinds of parquet partitioning.
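For example, hive-style partitioning on the embedding space would give each space its own directory, queryable and backable-up independently. pyarrow's write_to_dataset with partition_cols produces exactly this layout; the helper below just computes the path shape as a dependency-free illustration (the column name model_space is an assumption).

```python
import os


def partition_path(root, model_space, filename):
    """Hive-style partition path, e.g. what
    pyarrow.parquet.write_to_dataset(table, root, partition_cols=["model_space"])
    would produce for one partition value."""
    return os.path.join(root, f"model_space={model_space}", filename)
```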
https://fastapi.tiangolo.com/deployment/server-workers/
https://fastapi.tiangolo.com/deployment/docker/#official-docker-image-with-gunicorn-uvicorn
Would need to think about how this would affect things being in-memory and making sure those could be shared.
Useful to make sure that a remote Chroma is as expected or whether a new feature is available yet.
We need to be able to report on how many users (anonymously) are using Chroma. This is going to be an opt-out part of the setup flow. We should make it very easy for the user to tell what we are sending back so they can verify we are only sending back very lightweight fully-anonymized usage and no user data ever.
We may want to add telemetry to both the client and the server.
If we do add telemetry to the client, we need to make sure that our requests are non-blocking.
To enable opt-out we are going to have to add a .env
file to the application.
Should we send events directly to the downstream data store or use our own tracking domain? eg https://telemetry.trychroma.com
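The opt-out plus non-blocking requirements could both be handled with a couple of lines; the env var name CHROMA_TELEMETRY and the send_event helper below are assumptions for illustration, and the actual transport (a POST to the tracking domain) is passed in as sender.

```python
import os
import threading


def telemetry_enabled():
    """Opt-out via environment / .env: CHROMA_TELEMETRY=0 disables reporting."""
    return os.environ.get("CHROMA_TELEMETRY", "1") != "0"


def send_event(event, sender):
    """Fire-and-forget: run the network call on a daemon thread so
    telemetry can never block or crash a user request."""
    if not telemetry_enabled():
        return None
    t = threading.Thread(target=sender, args=(event,), daemon=True)
    t.start()
    return t
```

Keeping the event payload a plain dict also makes it trivial to log exactly what would be sent, which helps users verify that only anonymized usage data leaves their machine.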
The current random sampler is biased toward images with more embeddings in them because we sample from embeddings, not images.
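Sampling images first, then embeddings within each sampled image, removes that bias; the sketch below assumes a mapping from image id to its embeddings, which is a hypothetical shape for our data.

```python
import random


def sample_embeddings_unbiased(embeddings_by_image, k, seed=None):
    """Sample k images uniformly first, then one embedding per sampled
    image, so images with many embeddings are not over-represented."""
    rng = random.Random(seed)
    image_ids = rng.sample(list(embeddings_by_image), k)
    return [rng.choice(embeddings_by_image[i]) for i in image_ids]
```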
If we generated something like a custom quality score for a giant set of embeddings... do we update in place (discouraged in ClickHouse), or put the scores into a new table?
A user reported this bug.
Make sure the client respects overriding the model space! It wasn't being respected for many functions.
Currently the API has a bunch of functions the user shouldn't call. The one that stands out most to me is where_with_model_space, which is purely a utility function.
Currently I've marked all those I think should be removed with "
When the user deploys this service, they will want to:
We should offer some smart defaults around this.
The default behavior is clearing the directory on exit instead of persisting it:
Lines 319 to 321 in c4bce6d
As opposed to the doc:
By default Chroma uses an in-memory database, which gets persisted on exit and loaded on start (if it exists). This is fine for many experimental / prototyping workloads, limited by your machine's memory.
Suggestion:
- Return plain Python types - list, dict, etc. Why? Because we do not want to put the burden on the user to learn Apache Arrow, or to add extra dependencies to their application like Pandas if they aren't using them right now.
- Duckdb and Clickhouse both support direct import and query of Apache Arrow files, and I assume data structures too, though this needs to be tested.
- Offer fetch, fetch_arrow, fetch_df, fetch_numpy to give users more flexibility in the format they get back. The client or API would handle this transform, probably at that endpoint... all wrapped around fetch_arrow if we use that internally.
- We use the client mainly in the deployed context, but we also want to support direct import of chroma_core for in-memory notebook usage... we will want to offer these various fetch_* options at both layers.
- We might start with just insert_arrow and fetch_arrow, however.
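The single-transform-point idea above could look like the dispatch below. It assumes the internal result is a pyarrow.Table (to_pylist, to_pandas, and column(...).to_numpy() are the real pyarrow conversion calls); the fetch signature itself is hypothetical.

```python
def fetch(table, format="python"):
    """Everything wraps the Arrow result; the endpoint converts to the
    format the caller asked for. `table` is assumed to be a pyarrow.Table
    (or anything exposing the same conversion methods)."""
    if format == "arrow":
        return table
    if format == "python":
        return table.to_pylist()
    if format == "df":
        return table.to_pandas()
    if format == "numpy":
        return {name: table.column(name).to_numpy() for name in table.column_names}
    raise ValueError(f"unknown format: {format}")
```

Keeping Arrow as the internal currency means the pandas and numpy dependencies are only imported when a user actually asks for those formats.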
however.Github Actions uses older versions of OSX (11 and 12), and runs only on Intel-based macs.
The tests run fine when running locally on an M1.
Given that the dev team is mostly on newer M1 Macs, without immediate access to an Intel-based mac to reproduce the problem, we are going to "solve" the test failure by removing MacOS as a test platform target. Given the error message, I also strongly suspect the problem is in the Github Actions environment rather than actually being a Mac-related issue.
Creating this issue so we can circle back if we ever do decide it's important to get unit tests running on Intel macs, again.
There are currently 3 separate READMEs. None of these give the right instructions for setup, testing, or usage.
Should have placeholder modules for the major areas (db, indexing, api) to unblock feature development.
The features we needed (filtering) have now been merged in upstream, so we no longer need our fork.
Mirroring what I ran in the JSON version, try it native.