
chroma-core / chroma

the AI-native open-source embedding database

Home Page: https://www.trychroma.com/

License: Apache License 2.0

Python 32.39% Dockerfile 0.16% Shell 0.30% TypeScript 7.30% HTML 0.02% JavaScript 0.06% Jupyter Notebook 12.47% Makefile 0.06% Go 9.90% HCL 0.02% Rust 36.91% C++ 0.27% Starlark 0.15%
document-retrieval embeddings llms

chroma's People

Contributors

adjectiveallison, alabasteraxe, alw3ys, atroyn, beggers, cakecrusher, codetheweb, dglazkov, floleuerer, fr0th, grishick, hammadb, ibratoev, ishiihara, jeffchuber, laserbear, levand, naynaly10, nicolasgere, nlsfnr, perryrobinson, russell-pollari, sai-suraj-27, sanketkedia, satyam-79, shivankar-p, swyxio, tazarov, tonisives, weiligu


chroma's Issues

UUIDs and Strings

In several places, we convert UUID objects back and forth to and from strings. This isn't great, and it makes it hard to keep track of which form a value should be in at any given point.

We should add types to make sure they are what we expect, and then refactor to try to do this conversion as little as possible, or preferably, not at all.

An example:

collection_uuid, embedding=embeddings, metadata=metadatas, documents=documents, ids=ids

It's ambiguous here what type collection_uuid should be.
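One way to resolve the ambiguity is to convert exactly once, at the API boundary, and annotate everything internal with uuid.UUID. A minimal sketch (function names here are hypothetical, not the actual Chroma API):

```python
# Hypothetical sketch: enforce UUID types at the API boundary so internal
# code never has to guess whether it holds a str or a uuid.UUID.
import uuid

def as_uuid(value: "str | uuid.UUID") -> uuid.UUID:
    """Accept either form once, at the edge; everything inside uses uuid.UUID."""
    return value if isinstance(value, uuid.UUID) else uuid.UUID(value)

def add_embeddings(collection_uuid: uuid.UUID, embeddings: list) -> None:
    # Internal functions can assert the invariant instead of converting.
    assert isinstance(collection_uuid, uuid.UUID)
```

With this pattern, type checkers flag any call site that still passes a string past the boundary.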

Data formatting/storage for inference and label data

Currently we store data in the COCO JSON format, which looks like this:

{
	"annotations": [{
		"bbox": [-0.88869536, -2.7152133, 127.68482208, 62.74983978],
		"category_id": 43,
		"category_name": "knife"
	}]
}

More here on that format: https://roboflow.com/formats/coco-json
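For reference, reading the annotation fields shown above needs nothing beyond the standard library; this is a minimal sketch using only the keys in the example (real COCO files carry more):

```python
# Minimal sketch of reading the COCO-style annotations shown above
# (field names taken from the example; real COCO files have more keys).
import json

doc = json.loads("""
{
  "annotations": [{
    "bbox": [-0.88869536, -2.7152133, 127.68482208, 62.74983978],
    "category_id": 43,
    "category_name": "knife"
  }]
}
""")

for ann in doc["annotations"]:
    x, y, w, h = ann["bbox"]  # COCO bboxes are [x, y, width, height]
    print(ann["category_name"], ann["category_id"], w * h)
```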

Questions

  • What data format will our users have their results in, natively, wherever they are running inference/logging?
  • What properties actually matter to us? Should things be broken out into their own columns? Should we be able to store and return to the user exactly what they gave us, even so?

Next steps
Review various data formats. Look further at roboflow, see if the labeling providers have some documentation. Consider both vision as well as NLP applications.

Discussion: in-memory and chroma client-server --> sharing code?

chroma-client is mainly responsible for the public API and for ferrying data to the backend.
chroma-server is mainly responsible for storing data and running computations.

Inside chroma there are 2 wolves:

  • Wolf A: chroma running entirely in the python session of the user
  • Wolf B: chroma running in a client-server way where the user sends data to a session managed outside of their current python session.

Wolf B is fairly obvious... there is chroma-client and chroma-server, and they work together.

Wolf A, however... does chroma-client have extra functionality to handle the in-memory use case? Or is there separate code that is shared between chroma-server and chroma-client?

I think some code has to be shared... the code for "doing the maths" - and possibly more code, things like data formats, etc. Possibly more. How should this be structured?
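One possible shape (all names here are hypothetical, not existing Chroma modules): a shared core package owns the math and data formats, and both wolves are thin layers over it:

```python
# Hypothetical structure: a shared core holds "the maths" and data formats;
# both entry points delegate to it rather than duplicating logic.
class Core:
    """Shared by both modes: distance computations, data formats, etc."""
    def nearest(self, query, points):
        # toy stand-in for the real math
        return min(points, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))

class InMemoryClient:
    """Wolf A: runs entirely in the user's Python session."""
    def __init__(self):
        self.core = Core()

class RemoteClient:
    """Wolf B: ferries data to a server, which owns its own Core."""
    def __init__(self, url):
        self.url = url
```

The design question then reduces to where the Core package lives and which of the two entry points depend on it.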

Relax strict requirements pins?

Chroma uses strict requirement pins, which means that pip will only install it as part of an application whose requirement pins are exactly the same. As Chroma is a framework that will (hopefully!) be included in many other applications, having strict pins severely limits that uptake by making it automatically incompatible with applications that have even a minor version difference of something as common as FastAPI, requests, or Pydantic.

If it's of interest: at Prefect we found this article to be very helpful in thinking through the implications of pinning versions, and we have generally left our frameworks (which are intended to be used in other software) pinned only to lower bounds that we know are compatible with our usage.
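Concretely, the difference in a requirements file looks like this (version numbers below are illustrative, not Chroma's actual pins):

```
# strict pin (current style) — conflicts with any app on a different patch version
fastapi==0.85.1

# lower-bound pin (proposed style) — any version at or above a known-good floor
fastapi>=0.85
```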

API Design

This version is meant to get a handle on the external and internal APIs, not yet to define all the specifics of what inputs they take, if they do single vs batch mode, etc.

External API

  • init - set metadata to be used for logging and fetching for this session
  • log - log new embeddings and their metadata
  • fetch - get data to label

Internal API

  • init - loads db if exists
  • load database
  • save database - run every time the database changes so we never rely on in-memory?
  • load index
  • save index
  • generate ANN index - runs hnswlib
  • use ANN - mhb and other calculations use the ANN index for speed
  • run mhb - runs Mahalanobis distance
  • generate distances - generates distances based on mhb and the index and saves them to the db
  • save new embedding+metadata datapoints to db
  • [future] run umap
  • internal set of things that log does
  • internal set of things that fetch does

File store proposal

  • store the db and index in a .chroma directory at the level the python function was executed from

Questions

  • when do we regenerate the index? - on every new batch of datapoints?
  • when do we re-run mhb and regenerate distance values? - on every new batch of datapoints?

A Collection should store its UUID

Collections should store their UUIDs and pass them through read/write paths rather than every path requiring a lookup from collection_name -> collection_uuid.
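A sketch of the shape this could take (field and method names assumed for illustration): resolve name -> UUID once at construction, then thread the stored UUID through every call.

```python
# Hypothetical sketch: a Collection that stores its UUID at construction,
# so read/write paths never need a name -> UUID lookup.
import uuid
from dataclasses import dataclass, field

@dataclass
class Collection:
    name: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def add(self, store: dict, embeddings: list) -> None:
        # no name lookup here — the collection already knows its UUID
        store.setdefault(self.id, []).extend(embeddings)
```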

Audit and resolve coco JSON export integrity

Without real auditing, I already know the counts are off in the objects set. I don't plan to go crazy with the audit, but just some sanity checks before asking Anton and Jeff to review.

The new files will be converted into parquet for testing.

uuid package is for Python 2

One of the requirements for the package is uuid, which is a package from 2006 that provided UUID support for Python 2. uuid is now a built-in standard library module.

Should be removed from requirements 🙏
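To confirm: on any Python 3 interpreter, uuid imports with no third-party install at all, so dropping the requirements entry is safe.

```python
# uuid ships with the standard library (since Python 2.5) — no PyPI
# distribution needed, so the requirements.txt entry can simply go.
import uuid  # standard library, not the 2006 backport

print(uuid.uuid4())  # works with no extra install
```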

Document hand rolled (parquet/duckdb/hnswlib) customer cycle proposal

Based on our conversations so far, write up a possible round-trip customer data plan for MVP.

Should include:

  • customer data logging
  • transfer data to chroma
  • build ovoid indexes on training data
  • build ANN indexes on training data
  • ingest prod data
  • compute recommendations for prod -> training inclusion

DB Backups

Any production service needs a way to configure regular backups. If we go with duckdb/parquet this could be as simple as a bucket policy to create an archive of any new or updated files every 24h.

Worker/Queues

Unless we want to continue relying on the user to manually trigger the formation of the index and the running of other downstream analytics, we want an internal worker/queue system for running these ourselves based on triggers and schedules.

Move away from JSON/REST

The current client-server bridge uses http+json+rest because it was easy, but certainly not because it was fast or ideal. Explore alternatives to increase speed.

Make core algorithms more efficient

The core algorithms have a lot of redundant computation we could avoid by modularizing them further and caching results as they become available.

Multi-user

One concern about using DuckDB and parquet is maintaining correctness even when potentially many requests per second are coming in to add new embeddings to the production data space.

The other concern is multiple users in the org querying or pulling data from a service at the same time.

Will this work? Will there be collisions?

Fix class-based cluster representative sampling

#87 introduces the core algorithms, but representative sampling using class labels is currently broken.

This is probably because hnswlib filtering doesn't let us get a connected graph, and so ANN search fails. We should understand how / when this fails. There are possibly easy fixes, i.e. generating a separate index per dataset per class, or increasing the edge factor, but the real solution is to understand how to use hnsw properly in our context.
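The "separate index per class" workaround can be sketched as follows. This uses brute force in place of hnswlib so the idea is self-contained and dependency-free; the real fix would build one small hnswlib index per (dataset, class), sidestepping the disconnected-graph problem filtering can cause.

```python
# Sketch of the "separate index per class" idea: partition points by label,
# then search only within the requested class's partition.
from collections import defaultdict

def build_class_indexes(points, labels):
    """One 'index' (here, just a list) per class label."""
    indexes = defaultdict(list)
    for p, c in zip(points, labels):
        indexes[c].append(p)
    return indexes

def nearest_in_class(indexes, query, cls):
    """Brute-force stand-in for an ANN query against the class's index."""
    return min(indexes[cls],
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
```

The trade-off is index-build and memory overhead proportional to the number of classes, which is why understanding hnsw's filtering behavior properly remains the real solution.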

chroma-ui

We don't know exactly what users want yet in a frontend, but here are some very general ideas

  • View what "spaces" you have
  • View underlying data in the browser
  • Run SQL queries against those datasets in the browser

Down the road

  • View projections

Step 1 would be to figure out a sketch of how we would package it up.

Updating the index

One of the known potential issues with using hnswlib vs other more "full-featured" options is updating the index without recomputing the whole index.

This is actually important because updating the index is a core part of what we do.

Deleting from an index is also useful because users may upload data accidentally and want to remove it without starting completely over.

"get_nearest_neighbor" can return nan neighbor classes - breaks server implementation

get_nearest_neighbor can return nan for the labels if we happen to query unlabeled datasets. This breaks fastapi's JSON serialization with an error ValueError: Out of range float values are not JSON compliant.

We should replace the nans or handle them accordingly. The current workaround when using the server is to exclude unlabeled datasets from get_nearest_neighbor calls, like so - which is likely what is desired anyway.

chroma_client.get_nearest_neighbors(embedding, n_results=N, where={'dataset':'training'})
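On the server side, one way to handle the nans (rather than avoid them) is to map them to None before the response is built, since None serializes to JSON null. A sketch, not the actual fix:

```python
# Make result labels JSON-safe before FastAPI serializes them:
# NaN -> None, which becomes null in the JSON response.
import json
import math

def sanitize(labels):
    return [None if isinstance(x, float) and math.isnan(x) else x
            for x in labels]

json.dumps(sanitize(["knife", float("nan")]))  # no ValueError
```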

Rename space_key

space_key is dumb. The idea in theory makes sense... there is an embedding space, and there is a key to denote that unique embedding space.

But I think it's very confusing, and DX includes good naming! (hardest problem, blah blah)

Other candidates

  • environment
  • scope
  • ....

Prototype caching for ANN indexes

Depending on our ANN lib (e.g. Hnsw) we'll have an artifact computed for a set of rows. That has to:

  • live on disk
  • be associated to the input set
  • load into memory for live queries
  • unload after disuse
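The load/unload lifecycle above is essentially an LRU cache keyed by the input set. A sketch under assumed names (the real artifact would be an hnswlib index file on disk, and load_fn would deserialize it):

```python
# Sketch of ANN-index caching: indexes live on disk, load on demand,
# and the least recently used one is unloaded when capacity is exceeded.
from collections import OrderedDict

class IndexCache:
    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self._live = OrderedDict()  # input-set key -> loaded index

    def get(self, key, load_fn):
        if key in self._live:
            self._live.move_to_end(key)        # mark as recently used
        else:
            self._live[key] = load_fn(key)     # load from disk
            if len(self._live) > self.capacity:
                self._live.popitem(last=False)  # unload least recently used
        return self._live[key]
```

Association to the input set falls out of the key choice, e.g. a hash of the row IDs that produced the index.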

Data storage for parquet files

If we stay with parquet, I imagine we will want to support moving these files off the local machines they are built on and off to object storage on whatever cloud the user is using.

Add get_or_create support

Current issue:

if I'm using create_collection in a notebook cell (or in another Python function), the code will fail on the second run because the collection already exists. This forces users to change their code from create_collection to get_collection, or to use an or statement like this:

repos = chroma_client.get_collection(name="my_repos") or chroma_client.create_collection(name="my_repos")

Proposed solution

Inspired by Rails ActiveRecord's find_or_create_by, Chroma could support a get_or_create function that creates the collection OR returns it if it already exists.

If the collection exists, Chroma prints a notice. This will help avoid issues in which users write into someone else's collection, or write into a collection thinking it's brand new when it already has data in it.
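As a sketch of the proposed behavior (the helper name and the assumption that get_collection raises on a missing name are mine, not the settled API):

```python
# Hypothetical get_or_create helper: try the lookup first, create on miss,
# and print the notice the proposal asks for when the collection exists.
def get_or_create_collection(client, name):
    try:
        collection = client.get_collection(name=name)
        print(f"Notice: collection {name!r} already exists; reusing it.")
        return collection
    except Exception:  # assumed: the client raises when the name is unknown
        return client.create_collection(name=name)
```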

"Multi-space"

Right now Chroma is built as a "single-space" embedding store. Meaning it stores 1 embedding space at a time (from 1 layer, 1 trained model, 1 app).

Think about what it would take to make it "multi-space".

If we stick with duckdb+parquet this probably also means various kinds of parquet partitioning.

Telemetry back to Chroma

We need to be able to report on how many users (anonymously) are using Chroma. This is going to be an opt-out part of the setup flow. We should make it very easy for the user to tell what we are sending back so they can verify we are only sending back very lightweight fully-anonymized usage and no user data ever.

We may want to add telemetry to both the client and the server?

If we do add telemetry to the client - we need to make sure that our requests are not blocking.

To enable opt-out we are going to have to add a .env file to the application.

Should we send events directly to the downstream data store or use our own tracking domain? eg https://telemetry.trychroma.com

Storing custom generated scores

If we generated something like a custom quality score for a giant set of embeddings... do we update in place (discouraged in ClickHouse) or put the scores into a new table?

Remove non-user functions from the API

Currently the API has a bunch of functions the user shouldn't call. The one that stands out most to me is where_with_model_space which is purely a utility function.

Currently I've marked all those I think should be removed with "⚠️ This method should not be used directly." in the docstring.

Metrics/Logs/Tracers for the User

When the user deploys this service, they will want to:

  • capture some level of logging
  • understand performance of the service
  • capture any bugs that are happening

We should offer some smart defaults around this.

Making default behavior saving to disk instead of clearing it

The default behavior is clearing directory on exit instead of persisting it:

def __del__(self):
    print("Exiting: Cleaning up .chroma directory")
    self._idx.reset()

As opposed to the doc:

By default Chroma uses an in-memory database, which gets persisted on exit and loaded on start (if it exists). This is fine for many experimental / prototyping workloads, limited by your machine's memory.

Suggestion:

  1. Clarify that default behavior is clearing directory in the doc
  2. Or better, make saving to disk the default behavior.
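A sketch of option 2 (class and attribute names mirror the snippet above but are otherwise assumed): register the persist step with atexit, which fires more reliably than __del__ during interpreter shutdown.

```python
# Sketch: persist on interpreter exit instead of clearing the directory.
# (self._idx.save is a hypothetical save call mirroring _idx.reset above.)
import atexit

class Chroma:
    def __init__(self, path: str = ".chroma"):
        self.path = path
        atexit.register(self._persist)  # more reliable than __del__

    def _persist(self):
        print(f"Exiting: persisting index to {self.path}")
        # self._idx.save(self.path)     # hypothetical save call
```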

Interface formats

  • We want to make our "default" apis use python generic data structures like list, dict, etc. Why? because we do not want to put the burden on the user to learn Apache Arrow or to add extra dependencies to their application like Pandas if they aren't using them right now.
  • When users want a speed up - they can move to a more opinionated and optimized data structure.
  • That being said, we can still use a format internally and standardize around it.
  • Apache Arrow is attractive as that candidate
  • Arrow also gives us direct access to Apache Arrow Flight and Apache Arrow Flight SQL
  • DuckDB and ClickHouse both support direct import and query of Apache Arrow files, and I assume data structures too, though this needs to be tested.
  • We could add other endpoints to our client or API like fetch, fetch_arrow, fetch_df, fetch_numpy to give users more flexibility in the format they get back. The client or API would handle this transform probably at that endpoint... all wrapped around fetch_arrow if we use that internally.
  • Because users will mainly use the client in the deployed context, but we also want to support direct import of chroma_core for in-memory notebook usage... we will want to offer these various fetch_* options at both layers.
  • The client can always talk to the API via insert_arrow, fetch_arrow however.
  • There is the additional question of moving data over-the-wire and what format we use. With Arrow, we could use Arrow Flight (which uses protobufs under the hood, iirc), or we could use something else.

Unit tests fail in GitHub Actions on the macOS platform

GitHub Actions uses older versions of macOS (11 and 12), and runs only on Intel-based Macs.

The tests run fine when running locally on an M1.

Given that the dev team is mostly on newer M1 Macs, without immediate access to an Intel-based Mac to reproduce the problem, we are going to "solve" the test failure by removing macOS as a test platform target. Given the error message, I also strongly suspect the problem is in the GitHub Actions environment rather than actually being a Mac-related issue.

Creating this issue so we can circle back if we ever decide it's important to get unit tests running on Intel Macs again.

Fix up READMEs

There are currently 3 separate READMEs. None of these give the right instructions for setup, testing, or usage.

Move to upstream HNSWlib

The features we needed (filtering) have now been merged in upstream, so we no longer need our fork.
