Comments (5)
Naively quantized bloom 6b on CPU: http://nora:8800/notebooks/decentralized/jheuristic/bloom_test/cpu-qint-matmul.ipynb
source: https://gist.github.com/justheuristic/47830c9ddfd45889894e69d4f45ce233
from petals.
On client-side computations
Since we plan to run embeddings/logits on the client side, we need to compute them efficiently.
Embedding lookup is dirt cheap, but logits are more complicated:
computing the final logits on a Colab CPU takes over a minute per token.
Solution 1: use fast approximate KNN
- over 99% of the probability mass is held by the ~100 most likely tokens
- use HNSW to find the top-100 tokens with the highest dot product against the final hidden state
- use FAISS or ScaNN for fast nearest-neighbor search
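To make the KNN idea concrete, here is a minimal numpy sketch of restricted-vocabulary decoding: score tokens by dot product, keep the 100 best, and softmax over only those. The brute-force scoring step below is exactly what an HNSW index (via FAISS or ScaNN) would approximate without scanning the full vocabulary; all sizes here are illustrative, not BLOOM's real dimensions.

```python
import numpy as np

# Illustrative sizes only (BLOOM-6B actually uses hidden size 4096
# and a ~250k-token vocabulary).
vocab_size, hidden = 250_000, 64
rng = np.random.default_rng(0)
emb = rng.standard_normal((vocab_size, hidden)).astype(np.float32)  # output embedding matrix
h = rng.standard_normal(hidden).astype(np.float32)                  # final hidden state

# Brute-force scoring -- this full matmul is the step an HNSW index would replace.
scores = emb @ h
top = np.argpartition(scores, -100)[-100:]  # indices of the 100 largest logits

# Softmax restricted to the top-100 candidates; per the observation above,
# this captures almost all of the true probability mass.
top_scores = scores[top]
probs = np.exp(top_scores - top_scores.max())
probs /= probs.sum()
```

Sampling the next token then reduces to rng.choice(top, p=probs) over 100 candidates instead of the full vocabulary.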
Solution 2: just use a GPU
- Colab T4 in fp16: 30 ms per token (no longer a bottleneck)
- kudesnik M40 (~2x a Colab K80): 67 ms, still very much acceptable
- caveat: GPUs might not always be available
Current opinion: use GPUs, think about a fast CPU mode later.
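For reference, the GPU path is just one half-precision matmul against the output embedding matrix. A hedged torch sketch (shapes are illustrative; it falls back to fp32 on CPU, since fp16 matmul is primarily a GPU feature):

```python
import torch

# Run the logit matmul in fp16 on GPU when available; fp32 on CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Illustrative sizes, smaller than BLOOM's real ~250k x 4096 output matrix.
vocab_size, hidden = 50_000, 1024
emb = torch.randn(vocab_size, hidden, device=device, dtype=dtype)
h = torch.randn(1, hidden, device=device, dtype=dtype)

with torch.no_grad():
    logits = h @ emb.T  # (1, vocab_size); this is the ~30 ms/token step on a T4
    next_token = int(logits.argmax(dim=-1))
```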
Copied the BLOOM implementation from huggingface/transformers@ca2a55e
Their attention code is spectacularly bad, see #1
Next steps:
- push individual bloom layers to huggingface hub
- implement BloomBlock as hivemind.moe.server.ExpertBackend
As of 8221469:
- Pushing a model to the hub is handled via python -m cli.convert_model --many_args_here; see the README for a usage example
- The server can run forward, backward and inference passes over BLOOM blocks; see the README for instructions on how to start a server