chroma-core / chroma
the AI-native open-source embedding database
Home Page: https://www.trychroma.com/
License: Apache License 2.0
In several places, we convert UUID objects back and forth between UUIDs and strings. This isn't great, and it's hard to keep track of which form they should be in, and when.
We should add types to make sure they are what we expect, and then refactor to do this conversion as little as possible, or preferably not at all.
An example:
Line 115 in ddaabeb
It's ambiguous here what type collection_uuid should be.
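One way to pin this down, sketched below with a hypothetical `ensure_uuid` helper (not an existing Chroma function): normalize at the API boundary once, so everything internal can assume `uuid.UUID` and the back-and-forth conversion disappears.

```python
import uuid
from typing import Union


def ensure_uuid(value: Union[str, uuid.UUID]) -> uuid.UUID:
    """Normalize a collection identifier to a uuid.UUID at the boundary,
    so internal read/write paths never have to guess which form they hold."""
    if isinstance(value, uuid.UUID):
        return value
    return uuid.UUID(value)
```

With the type annotation in place, a type checker flags any remaining spot that passes a raw string deeper into the stack.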
Currently we store data in the COCO JSON format. That looks like this:
{
  "annotations": [{
    "bbox": [-0.88869536, -2.7152133, 127.68482208, 62.74983978],
    "category_id": 43,
    "category_name": "knife"
  }]
}
More here on that format: https://roboflow.com/formats/coco-json
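For reference, reading the format above takes only the stdlib; `load_coco_annotations` is a hypothetical helper matching the sample shape shown (with `category_name` inline), and `bbox` follows the COCO `[x, y, width, height]` convention.

```python
import json


def load_coco_annotations(path):
    """Read a COCO-style JSON file and return (category_name, bbox) pairs.
    bbox is [x, y, width, height] per the COCO convention."""
    with open(path) as f:
        data = json.load(f)
    return [(a["category_name"], a["bbox"]) for a in data["annotations"]]
```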
Questions
Next steps
Review various data formats. Look further at roboflow, see if the labeling providers have some documentation. Consider both vision as well as NLP applications.
chroma-client is mainly responsible for the public API and for ferrying data to the backend.
chroma-server is mainly responsible for storing data and running computations.
Inside chroma there are 2 wolves:
Wolf B is fairly obvious... there is chroma-client and chroma-server, and they work together.
Wolf A however... does chroma-client have extra functionality to handle the in-memory use case? Or is there separate code that is shared between chroma-server and chroma-client?
I think some code has to be shared... the code for "doing the maths" - and possibly more, things like data formats, etc. How should this be structured?
Chroma uses strict requirement pins, which means that pip will only install it as part of an application whose requirement pins are exactly the same. As Chroma is a framework that will (hopefully!) be included in many other applications, having strict pins severely limits that uptake by making it automatically incompatible with applications that have even a minor version difference of something as common as FastAPI, requests, or Pydantic.
If it's of interest: at Prefect we found this article very helpful in thinking through the implications of pinning versions, and we have generally left our frameworks (which are intended to be used in other software) pinned only to lower bounds that we know are compatible with our usage.
This version is meant to get a handle on the external and internal APIs, not yet to define all the specifics of what inputs they take, if they do single vs batch mode, etc.
...and the index, and saves it to the db.chroma directory at the level the Python function was executed from.
...and regenerate distance values? On every new batch of datapoints?
Collections should store their UUIDs and pass them through read/write paths rather than requiring every path to look up collection_name -> collection_uuid.
Without real auditing, I already know the counts are off in the objects set. I don't plan to go crazy with the audit, but just some sanity checks before asking Anton and Jeff to review.
The new files will be converted into parquet for testing.
One of the requirements for the package is uuid, a package from 2006 supporting Python 2. uuid is now a built-in library, so it should be removed from requirements.
Based on our conversations so far, write up a possible round-trip customer data plan for MVP.
Should include:
Currently if you try to get nearest neighbors before building an index, you get a cryptic error message. We have the state to give users a great error message, so do this.
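A minimal sketch of the shape this could take (the `NoIndexError` name and function signature are hypothetical, not the current API): check the state we already have and raise an actionable message before the low-level query ever runs.

```python
class NoIndexError(Exception):
    """Raised when a query arrives before any index has been built."""


def get_nearest_neighbors(index, embedding, n_results=10):
    """Fail with a helpful, actionable message instead of a cryptic
    low-level error when the index hasn't been built yet."""
    if index is None:
        raise NoIndexError(
            "No index found. Call create_index() after adding embeddings, "
            "then retry get_nearest_neighbors()."
        )
    return index.knn_query(embedding, k=n_results)
```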
Any production service needs a way to configure regular backups. If we go with duckdb/parquet
this could be as simple as a bucket policy to create an archive of any new or updated files every 24h.
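If we instead drive the archive from our side, the incremental piece is just "which files changed in the window"; `files_changed_since` below is a hypothetical stdlib-only helper, with the actual upload left to a bucket policy or a sync tool.

```python
import os
import time


def files_changed_since(root, seconds=24 * 3600):
    """Collect parquet files modified within the window; these are the
    candidates for the next incremental backup archive."""
    cutoff = time.time() - seconds
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if name.endswith(".parquet") and os.path.getmtime(path) >= cutoff:
                changed.append(path)
    return changed
```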
So far we've only explored the simple in-process version. Consider multi-process, server-side.
Unless we want to continue on relying on the user to manually trigger the formation of the index, and running other downstream analytics, we want to have an internal worker/queue system for running these ourselves based on triggers and schedules.
The current client-server bridge uses http+json+rest because it was easy, but certainly not because it was fast or ideal. Explore alternatives to increase speed.
The core algorithms have a lot of redundant computation we could avoid by modularizing them further and caching results as they become available.
https://github.com/chroma-core/chroma/blob/jeff/end-to-end-v1/chroma-client/tests/test_client.py#L18
I need to stub out the server to test the client...
How to do this? @levand any ideas?
One concern about using DuckDB and parquet is maintaining correctness even when potentially many requests per second are coming in to add new embeddings to the production data space.
The other concern is multiple users in the org querying or pulling data from a service at the same time.
Will this work? Will there be collisions?
#87 introduces the core algorithms, but representative sampling using class labels is currently broken.
This is probably because hnswlib filtering doesn't let us get a connected graph, and so ANN search fails. We should understand how / when this fails. There are possibly easy fixes, i.e. generating a separate index per dataset per class, or increasing the edge factor, but the real solution is to understand how to use hnsw properly in our context.
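To make the "separate index per class" workaround concrete, here is a sketch of the partitioning idea. The bucket here is just a list scanned brute-force so the example is self-contained; in practice each bucket would be its own hnswlib index, which sidesteps the connectivity problem because no filtering happens inside a single graph.

```python
from collections import defaultdict


def build_per_class_indexes(embeddings, labels):
    """Partition embeddings by class label so each class gets its own
    'index' (a plain list here; an hnswlib index per bucket in practice)."""
    buckets = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        buckets[label].append(emb)
    return dict(buckets)


def nearest_in_class(indexes, query, label):
    """Brute-force nearest neighbour within one class bucket."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(indexes[label], key=lambda emb: sq_dist(emb, query))
```

The trade-off is memory and build time scaling with the number of classes, which is why understanding hnsw's filtering behavior properly is still the real fix.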
We don't know exactly what users want yet in a frontend, but here are some very general ideas
Down the road
Step 1 would be to figure out a sketch of how we would package it up.
One of the known potential issues with using hnswlib vs other more "full-featured" options is updating the index without recomputing the whole index.
This is actually important because updating the index is a core part of what we do.
Deleting from an index is also useful because users may upload data accidentally and want to remove it without starting completely over.
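One common way ANN libraries handle deletion without a rebuild is tombstoning: mark ids as deleted, filter them at query time, and only rebuild periodically. The sketch below shows the idea with a self-contained brute-force store (hnswlib's mark_deleted is similar in spirit); the class and its API are illustrative, not Chroma's.

```python
class SoftDeleteIndex:
    """Tombstone-style deletion: mark ids as deleted and filter them out
    of query results, rebuilding the underlying index only periodically."""

    def __init__(self):
        self._items = {}       # id -> embedding
        self._deleted = set()  # tombstoned ids

    def add(self, item_id, embedding):
        self._items[item_id] = embedding

    def delete(self, item_id):
        self._deleted.add(item_id)  # no rebuild needed

    def nearest(self, query):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        live = ((i, e) for i, e in self._items.items() if i not in self._deleted)
        return min(live, key=lambda pair: sq_dist(pair[1], query))[0]
```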
Because of query_dataframe. It should be overridden.
We think the changes we wanted are now upstream. Test!
get_nearest_neighbor can return NaN for the labels if we happen to query unlabeled datasets. This breaks FastAPI's JSON serialization with the error ValueError: Out of range float values are not JSON compliant.
We should replace the NaNs or handle them accordingly. The current workaround when using the server is to not query unlabeled datasets in get_nearest_neighbor calls, like so - which is likely what is desired anyway:
chroma_client.get_nearest_neighbors(embedding, n_results=N, where={'dataset':'training'})
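The server-side fix could be as small as sanitizing the response before it hits serialization; `sanitize_for_json` below is a hypothetical helper, mapping NaN to None (JSON null).

```python
import math


def sanitize_for_json(obj):
    """Recursively replace NaN floats with None so the JSON encoder
    never sees an out-of-range float value."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, list):
        return [sanitize_for_json(v) for v in obj]
    if isinstance(obj, dict):
        return {k: sanitize_for_json(v) for k, v in obj.items()}
    return obj
```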
space_key is dumb. The idea in theory makes sense... there is an embedding space, and there is a key to denote that unique embedding space.
But I think it's very confusing, and DX includes good naming! (hardest problem blah blah)
Other candidates:
environment
scope
Depending on our ANN lib (e.g. Hnsw) we'll have an artifact computed for a set of rows. That has to:
Obvious, so users don't have to send unencrypted packets over the wire.
More available here: https://fastapi.tiangolo.com/deployment/concepts/
If we stay with parquet, I imagine we will want to support moving these files off the local machines they are built on and off to object storage on whatever cloud the user is using.
If I'm using create_collection in a notebook cell (or in another Python function), the code will fail on the second run because the collection already exists. This forces users to change their code from create to get collection, or to use an or statement like this:
repos = chroma_client.get_collection(name="my_repos") or chroma_client.create_collection(name="my_repos")
Inspired by Rails ActiveRecord's find_or_create_by, Chroma could support a get_or_create function that creates the collection OR returns it if it already exists.
If the collection exists, Chroma prints a notice. This will help avoid issues in which users write into someone else's collection, or write data into a collection thinking it's brand new when it already has data in it.
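A client-side version could look like the sketch below. It assumes get_collection raises when the collection is missing, which may not match the current client's behavior; the function name and error handling are illustrative only.

```python
def get_or_create_collection(client, name):
    """Return the existing collection, or create it. Prints a notice when
    it already existed so users don't silently write into old data."""
    try:
        collection = client.get_collection(name=name)
    except Exception:  # assumed: get_collection raises when missing
        return client.create_collection(name=name)
    print(f"Collection '{name}' already exists; returning it.")
    return collection
```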
Right now Chroma is built as a "single-space" embedding store. Meaning it stores 1 embedding space at a time (from 1 layer, 1 trained model, 1 app).
Think about what it would take to make it "multi-store".
If we stick with duckdb+parquet, this probably also means various kinds of parquet partitioning.
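For example, hive-style partitioning on the embedding space would give each space its own directory, queryable and backable-up independently. pyarrow's write_to_dataset with partition_cols produces exactly this layout; the helper below just computes the path shape as a dependency-free illustration (the column name model_space is an assumption).

```python
import os


def partition_path(root, model_space, filename):
    """Hive-style partition path, e.g. what
    pyarrow.parquet.write_to_dataset(table, root, partition_cols=["model_space"])
    would produce for one partition value."""
    return os.path.join(root, f"model_space={model_space}", filename)
```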
https://fastapi.tiangolo.com/deployment/server-workers/
https://fastapi.tiangolo.com/deployment/docker/#official-docker-image-with-gunicorn-uvicorn
Would need to think about how this would affect things being in-memory and making sure those could be shared.
Useful to make sure that a remote Chroma is as expected or whether a new feature is available yet.
We need to be able to report on how many users (anonymously) are using Chroma. This is going to be an opt-out part of the setup flow. We should make it very easy for the user to tell what we are sending back so they can verify we are only sending back very lightweight fully-anonymized usage and no user data ever.
We may want to add telemetry to both the client and the server.
If we do add telemetry to the client, we need to make sure that our requests are non-blocking.
To enable opt-out we are going to have to add a .env
file to the application.
Should we send events directly to the downstream data store or use our own tracking domain? eg https://telemetry.trychroma.com
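The opt-out plus non-blocking requirements could both be handled with a couple of lines; the env var name CHROMA_TELEMETRY and the send_event helper below are assumptions for illustration, and the actual transport (a POST to the tracking domain) is passed in as sender.

```python
import os
import threading


def telemetry_enabled():
    """Opt-out via environment / .env: CHROMA_TELEMETRY=0 disables reporting."""
    return os.environ.get("CHROMA_TELEMETRY", "1") != "0"


def send_event(event, sender):
    """Fire-and-forget: run the network call on a daemon thread so
    telemetry can never block or crash a user request."""
    if not telemetry_enabled():
        return None
    t = threading.Thread(target=sender, args=(event,), daemon=True)
    t.start()
    return t
```

Keeping the event payload a plain dict also makes it trivial to log exactly what would be sent, which helps users verify that only anonymized usage data leaves their machine.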
The current random sampler is biased toward images with more embeddings in them because we sample from embeddings, not images.
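Sampling images first, then embeddings within each sampled image, removes that bias; the sketch below assumes a mapping from image id to its embeddings, which is a hypothetical shape for our data.

```python
import random


def sample_embeddings_unbiased(embeddings_by_image, k, seed=None):
    """Sample k images uniformly first, then one embedding per sampled
    image, so images with many embeddings are not over-represented."""
    rng = random.Random(seed)
    image_ids = rng.sample(list(embeddings_by_image), k)
    return [rng.choice(embeddings_by_image[i]) for i in image_ids]
```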
If we generated something like a custom quality score for a giant set of embeddings... do we update in place (discouraged in ClickHouse), or put the scores into a new table?
A user reported this bug.
Make sure the client respects overriding the model space! It wasn't being respected for many functions.
Currently the API has a bunch of functions the user shouldn't call. The one that stands out most to me is where_with_model_space, which is purely a utility function.
Currently I've marked all those I think should be removed with "
When the user deploys this service, they will want to:
We should offer some smart defaults around this.
The default behavior is clearing the directory on exit instead of persisting it:
Lines 319 to 321 in c4bce6d
As opposed to the doc:
By default Chroma uses an in-memory database, which gets persisted on exit and loaded on start (if it exists). This is fine for many experimental / prototyping workloads, limited by your machine's memory.
Suggestion:
- Return plain Python types - list, dict, etc. Why? Because we do not want to put the burden on the user to learn Apache Arrow, or to add extra dependencies to their application like Pandas if they aren't using them right now.
- Duckdb and Clickhouse both support direct import and query of Apache Arrow files, and I assume data structures too, though this needs to be tested.
- Offer fetch, fetch_arrow, fetch_df, fetch_numpy to give users more flexibility in the format they get back. The client or API would handle this transform, probably at that endpoint... all wrapped around fetch_arrow if we use that internally.
- We use the client mainly in the deployed context, but we also want to support direct import of chroma_core for in-memory notebook usage... we will want to offer these various fetch_* options at both layers.
- We might start with just insert_arrow and fetch_arrow, however.
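The single-transform-point idea above could look like the dispatch below. It assumes the internal result is a pyarrow.Table (to_pylist, to_pandas, and column(...).to_numpy() are the real pyarrow conversion calls); the fetch signature itself is hypothetical.

```python
def fetch(table, format="python"):
    """Everything wraps the Arrow result; the endpoint converts to the
    format the caller asked for. `table` is assumed to be a pyarrow.Table
    (or anything exposing the same conversion methods)."""
    if format == "arrow":
        return table
    if format == "python":
        return table.to_pylist()
    if format == "df":
        return table.to_pandas()
    if format == "numpy":
        return {name: table.column(name).to_numpy() for name in table.column_names}
    raise ValueError(f"unknown format: {format}")
```

Keeping Arrow as the internal currency means the pandas and numpy dependencies are only imported when a user actually asks for those formats.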
however.Github Actions uses older versions of OSX (11 and 12), and runs only on Intel-based macs.
The tests run fine when running locally on an M1.
Given that the dev team is mostly on newer M1 Macs, without immediate access to an Intel-based mac to reproduce the problem, we are going to "solve" the test failure by removing MacOS as a test platform target. Given the error message, I also strongly suspect the problem is in the Github Actions environment rather than actually being a Mac-related issue.
Creating this issue so we can circle back if we ever do decide it's important to get unit tests running on Intel macs, again.
There are currently 3 separate READMEs. None of these give the right instructions for setup, testing, or usage.
Should have placeholder modules for the major areas (db, indexing, api) to unblock feature development.
The features we needed (filtering) have now been merged in upstream, so we no longer need our fork.
Mirroring what I ran in the JSON version, try it native.