This repository contains the pre-generated embedding data used in the Automata project. The Automata project requires embeddings for various aspects of its operation, and these are kept in this repository for easy management and versioning.
The embeddings are stored in a ChromaSymbolEmbeddingVectorDatabase, which is a customized data store based on the Chroma vector database. This database stores key-value pairs, where the key is a symbol from the codebase and the value is an embedding vector that represents this symbol in a high-dimensional space.
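The key-to-vector layout described above can be pictured with a small in-memory stand-in. This is only a sketch: a plain dict takes the place of Chroma, the symbol strings and three-dimensional vectors are made up for illustration, and cosine_similarity and most_similar are helpers written for this example, not part of Automata.

```python
import math

# Hypothetical symbol keys mapped to toy 3-dimensional vectors.
# Real embeddings have far more dimensions.
embeddings = {
    "automata.example.parse": [0.9, 0.1, 0.0],
    "automata.example.tokenize": [0.8, 0.2, 0.1],
    "automata.example.render": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(key, store):
    """Return the other key whose vector is closest to store[key]."""
    query = store[key]
    candidates = (k for k in store if k != key)
    return max(candidates, key=lambda k: cosine_similarity(query, store[k]))


# Look up the symbol whose embedding lies closest to the "parse" symbol.
nearest = most_similar("automata.example.parse", embeddings)
```

Storing one vector per symbol is what makes this kind of nearest-neighbour lookup possible: symbols with similar meaning end up with vectors that point in similar directions.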
This repository is included as a Git submodule in the main Automata repository. To include the embeddings in your local copy of Automata, clone the repository with the --recurse-submodules option:
git clone --recurse-submodules https://github.com/emrgnt-cmplxty/automata.git
If you've already cloned the repository, you can fetch the submodule with:
git submodule update --init --recursive
In the Automata code, the embeddings are accessed using the ChromaSymbolEmbeddingVectorDatabase class. For example, to get an embedding for a particular symbol, use the get() method:
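# embedding_database is an already-initialized ChromaSymbolEmbeddingVectorDatabase
# Convert the symbol string into the key used by the database
key = parse_symbol('your_symbol_here')
# Fetch the stored embedding for that symbol
embedding = embedding_database.get(key)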
You can also add, update, and batch process entries using the methods provided by ChromaSymbolEmbeddingVectorDatabase.
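To make the add, update, and batch operations concrete, here is a minimal in-memory stand-in. It is not the Automata class: InMemoryEmbeddingStore and its method names (add, update, batch_add, batch_get) are assumptions chosen for this sketch, and the real ChromaSymbolEmbeddingVectorDatabase may use different names and signatures.

```python
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


class InMemoryEmbeddingStore:
    """Hypothetical stand-in for a symbol-to-embedding store."""

    def __init__(self) -> None:
        self._data: Dict[str, Vector] = {}

    def add(self, key: str, vector: Vector) -> None:
        """Insert a new entry; refuse to overwrite an existing key."""
        if key in self._data:
            raise KeyError(f"{key!r} already exists; use update() instead")
        self._data[key] = list(vector)

    def update(self, key: str, vector: Vector) -> None:
        """Replace the vector of an existing entry."""
        if key not in self._data:
            raise KeyError(f"{key!r} not found; use add() instead")
        self._data[key] = list(vector)

    def get(self, key: str) -> Vector:
        """Return the vector stored under key."""
        return self._data[key]

    def batch_add(self, entries: Iterable[Tuple[str, Vector]]) -> None:
        """Insert several (key, vector) pairs in one call."""
        for key, vector in entries:
            self.add(key, vector)

    def batch_get(self, keys: Iterable[str]) -> List[Vector]:
        """Return the vectors for several keys, in the order requested."""
        return [self._data[key] for key in keys]


store = InMemoryEmbeddingStore()
store.batch_add([("symbol_a", [0.1, 0.2]), ("symbol_b", [0.3, 0.4])])
store.update("symbol_a", [0.5, 0.6])
vectors = store.batch_get(["symbol_a", "symbol_b"])
```

Keeping add and update separate, as this sketch does, makes accidental overwrites visible as errors; whether the real class enforces the same distinction should be checked against its source.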
This repository currently provides embeddings for the following projects:
- Automata (automata)
- LlamaIndex (llama_index)
- langchain (langchain)
- chroma (chromadb)
- pandas (pandas)
- flask (src/flask)
- scikit-learn (sklearn)
- tensorflow (tensorflow)
In each case, the embeddings are stored in the collection named in parentheses.
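When choosing which collection to open, the pairs in the list above can be kept in a small lookup table. In this sketch the collection names are copied from that list, while COLLECTIONS and collection_for are names invented for the example and do not exist in Automata.

```python
# Project name -> collection name, copied from the list above.
COLLECTIONS = {
    "Automata": "automata",
    "LlamaIndex": "llama_index",
    "langchain": "langchain",
    "chroma": "chromadb",
    "pandas": "pandas",
    "flask": "src/flask",
    "scikit-learn": "sklearn",
    "tensorflow": "tensorflow",
}


def collection_for(project: str) -> str:
    """Return the collection name for a project, matching case-insensitively."""
    lookup = {name.lower(): coll for name, coll in COLLECTIONS.items()}
    try:
        return lookup[project.lower()]
    except KeyError:
        raise ValueError(f"No embeddings available for {project!r}") from None


flask_collection = collection_for("Flask")
```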
We welcome contributions to the embedding data. If you've generated new embeddings that you believe would be useful to the Automata project, please submit a pull request. Be sure to include details about how the embeddings were generated.
Before submitting a pull request, please make sure your changes pass all existing tests. Add new tests for any new functionality or to cover any areas you think are lacking.
For detailed instructions on contributing, please see the CONTRIBUTING.md file in the main Automata repository.
The embeddings and associated code are released under the Apache License 2.0. Please see the LICENSE file for details.