Join the project community on our server!
Cake is a pure Rust implementation of distributed Llama3 inference based on Candle. The goal of the project is to run big (70B+) models by repurposing consumer hardware into a heterogeneous cluster of iOS, macOS, Linux and Windows devices.
This is experimental code.
The idea is to shard the transformer blocks across multiple devices so that inference can run on models that wouldn't normally fit in the GPU memory of a single device. Inferences over contiguous transformer blocks on the same worker are batched to minimize latency due to data transfer.
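The batching idea can be illustrated with a minimal sketch. It is written in Python for brevity and is not Cake's actual code: the `plan_hops` function and the owner names are hypothetical. It collapses a per-block owner list into contiguous runs, so each run costs one network round trip instead of one per block.

```python
from itertools import groupby


def plan_hops(owners):
    """Collapse a per-block owner list into (owner, first_block, last_block) runs."""
    hops = []
    index = 0
    for owner, run in groupby(owners):
        count = len(list(run))
        hops.append((owner, index, index + count - 1))
        index += count
    return hops


# Hypothetical assignment of 32 transformer blocks; "master" means the
# block runs locally, any other name is a remote worker.
owners = ["worker0"] * 6 + ["worker1"] * 11 + ["master"] * 3 + ["worker2"] * 12
hops = plan_hops(owners)

for owner, first, last in hops:
    print(f"{owner}: blocks {first}-{last} in a single batch")
```

With this assignment the 32 blocks are served in 4 hops rather than 32 separate transfers.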
Run a worker node:
# --model: model path (read below on how to optimize model size for workers)
# --mode: run as worker
# --name: worker name in topology file
# --topology: topology file
# --address: bind address
cake-cli --model /path/to/Meta-Llama-3-8B \
    --mode worker \
    --name worker0 \
    --topology topology.yml \
    --address 0.0.0.0:10128
Run a master node:
cake-cli --model /path/to/Meta-Llama-3-8B \
--topology topology.yml
Where topology.yml determines which layers are served by whom:
linux_server_1:
  host: 'linux_server.host:10128'
  description: 'NVIDIA Titan X Pascal (12GB)'
  layers:
    - 'model.layers.0-5'

linux_server_2:
  host: 'linux_server2.host:10128'
  description: 'NVIDIA GeForce 3080 (10GB)'
  layers:
    - 'model.layers.6-16'

iphone:
  host: 'iphone.host:10128'
  description: 'iPhone 15 Pro Max'
  layers:
    - 'model.layers.17'

ipad:
  host: 'ipad.host:10128'
  description: 'iPad'
  layers:
    - 'model.layers.18-19'

macbook:
  host: 'macbook.host:10128'
  description: 'M1 Max'
  layers:
    - 'model.layers.20-31'
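To make the layer notation concrete, here is a minimal Python sketch of how entries such as 'model.layers.0-5' could be expanded and checked. It assumes the range is inclusive, which is inferred from the example above rather than taken from Cake's source. The `expand_layers` and `layer_owners` helpers are hypothetical, and the topology is written as a plain dict instead of being parsed from YAML.

```python
import re


def expand_layers(spec):
    """Expand 'model.layers.0-5' into single layer names; pass single layers through."""
    match = re.fullmatch(r"(.+\.)(\d+)-(\d+)", spec)
    if match is None:
        return [spec]
    prefix, start, stop = match.group(1), int(match.group(2)), int(match.group(3))
    if stop < start:
        raise ValueError(f"invalid range in {spec!r}")
    return [f"{prefix}{i}" for i in range(start, stop + 1)]


def layer_owners(topology):
    """Map every layer name to the worker serving it, rejecting double assignments."""
    owners = {}
    for name, node in topology.items():
        for spec in node["layers"]:
            for layer in expand_layers(spec):
                if layer in owners:
                    raise ValueError(f"{layer} assigned to both {owners[layer]} and {name}")
                owners[layer] = name
    return owners


# Same layout as the topology above, reduced to the 'layers' field.
topology = {
    "linux_server_1": {"layers": ["model.layers.0-5"]},
    "linux_server_2": {"layers": ["model.layers.6-16"]},
    "iphone": {"layers": ["model.layers.17"]},
    "ipad": {"layers": ["model.layers.18-19"]},
    "macbook": {"layers": ["model.layers.20-31"]},
}

owners = layer_owners(topology)
print(len(owners), "layers assigned;", "model.layers.17 ->", owners["model.layers.17"])
```

Under that reading, the five workers in the example cover layers 0 through 31 with no overlap.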
As a memory and disk space optimization, you might want to give the worker only the data it actually needs from the model instead of the whole folder, in which case you can use the cake-split-model utility. For instance, to generate a smaller version of the llama3 safetensors, you can run:
# --model-path: source model to split
# --topology: topology file
# --output: output folder where all the worker data bundles will be saved
cake-split-model --model-path path/to/Meta-Llama-3-8B \
    --topology path/to/topology.yml \
    --output output-folder-name
This will create a smaller folder containing only the required layer tensors and the topology file for the specific worker. Remember to also copy the other model contents (config.json, tokenizer.json, etc.) into the worker bundle before deploying it.
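As a rough illustration of what "only the required layer tensors" means, the Python sketch below filters a tensor-name-to-file map down to the layers one worker serves. It is not how cake-split-model is implemented. The `tensors_for_worker` helper and the sample `weight_map` entries are hypothetical, modeled on the `weight_map` found in the index file of sharded safetensors checkpoints.

```python
def tensors_for_worker(weight_map, layers):
    """Keep only the tensors whose name falls under one of the worker's layers."""
    # The trailing dot prevents 'model.layers.1' from also matching 'model.layers.17'.
    prefixes = tuple(layer + "." for layer in layers)
    return {name: shard for name, shard in weight_map.items() if name.startswith(prefixes)}


# Hypothetical excerpt of a tensor name -> shard file mapping.
weight_map = {
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
    "model.layers.17.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
    "lm_head.weight": "model-00004-of-00004.safetensors",
}

iphone_tensors = tensors_for_worker(weight_map, ["model.layers.17"])
for name, shard in sorted(iphone_tensors.items()):
    print(f"{name} <- {shard}")
```

Tensors outside the transformer blocks, such as the embeddings and the output head in this sample, are not selected for the worker.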
OS | Architectures | Acceleration | Status |
---|---|---|---|
GNU/Linux | arm, arm64, x86_64 | - | ✔️ |
GNU/Linux | arm, arm64, x86_64 | CUDA | ✔️ |
GNU/Linux | arm, arm64, x86_64 | BLAS | ✔️ |
macOS | intel | - | ✔️ |
macOS | aarch64 | - | ✔️ |
macOS | aarch64 | Metal | ✔️ |
Android | arm, arm64, x86_64 | - | ✔️ |
Android | arm, arm64, x86_64 | CUDA | untested |
iOS / iPadOS | aarch64 | - | ✔️ |
iOS / iPadOS | aarch64 | Metal | 90% done, WIP |
Web | - | WebGPU | in theory possible, not done |
Released under the GPL 3 license. To see the licenses of the project dependencies, install cargo-license with cargo install cargo-license and then run cargo license.