Model | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
---|---|---|---|---|
ResNet-50 (fps) | 20 | 2,070 | 7,200 | 10,000 |
BERT-Large (sen/s) | 12 | 362 | 406 | 410 |
Falcon7B-decode (t/s) | 32 | 135 | 135 | 140 |
ViT (fps) | 8 | 430 | 643 | 1700 |
T5 small (sen/s) | 140 | | | |
Bloom (sen/s) | 70 | | | |
U-Net | coming soon | | | |
[1] - Observed from the host. Includes dispatch overhead and kernel execution time.
[2] - Ignoring host overhead. Kernel execution time only.
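The two metrics differ only in where the timestamps are taken: end-to-end throughput is measured from the host around the whole dispatch, while device throughput counts kernel execution only. A minimal host-side sketch of the first measurement, where `run_batch` is a hypothetical stand-in for dispatching one batch of inference (not a real API):

```python
import time

def measure_throughput(run_batch, batch_size, iters=100):
    """Time `run_batch` from the host and report samples per second.

    Everything between the two timestamps counts, so this corresponds to
    end-to-end throughput [1]: host dispatch overhead plus kernel time.
    Device throughput [2] would instead be derived from on-device timers.
    """
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

# Stand-in workload: sleep 1 ms to emulate one dispatched batch.
fps = measure_throughput(lambda: time.sleep(0.001), batch_size=20)
print(f"end-to-end throughput: {fps:.0f} samples/s")
```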
Model | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
---|---|---|---|---|---|
Falcon7B-decode (t/s/u) | 129th | 32 | 9.9 | 13.5 | 21 |
Mistral-7B-decode (t/s/u) | 33rd | 32 | 7.9 | 10.9 | 21 |
Mamba-2.8B-decode (t/s/u) | any | 32 | 1.7 | 2.0 | 17 |
Stable Diffusion 1.4 512x512 | coming soon | 1 | | | |
[3] - Throughput for generating the i-th token of a sequence while the kv_cache already holds i-1 entries.
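Footnote [3] describes the steady-state decode step: token i attends over the i-1 cached key/value rows, then appends its own. A toy illustration of that invariant, where `decode_step` and `kv_cache` are illustrative names only, not part of any real API:

```python
def decode_step(token, kv_cache):
    # The i-th token attends over the i-1 entries already in the cache
    # (stubbed here as a length check; a real step would run attention).
    context_len = len(kv_cache)
    # Append this step's key/value so the cache grows by one row per token.
    kv_cache.append(token)
    return context_len

kv_cache = []
for i, token in enumerate(range(1, 6), start=1):
    seen = decode_step(token, kv_cache)
    assert seen == i - 1  # the i-th token sees i-1 cached rows
```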
Model | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
---|---|---|---|---|---|
LLaMA-2-70B-decode (t/s/u) | 129th | 32 | 0.95 | 8.4 | 20 |
LLaMA-3-70B-decode (t/s/u) | 129th | 32 | 0.95 | 7.7 | 20 |
Falcon40B-decode | coming soon | | | | |
Mixtral7Bx8-decode | coming soon | | | | |
ResNet50 (data parallel) | coming soon | | | | |
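The t/s/u unit in the decode tables is tokens per second per user; since the batch column is the number of concurrent users, the aggregate device rate is simply per-user throughput times batch size. A quick check against the table values (helper name is illustrative):

```python
def aggregate_tokens_per_sec(t_s_u, batch):
    # t/s/u is per-user decode throughput; multiplying by the number of
    # concurrent users (the batch size) gives aggregate tokens per second.
    return t_s_u * batch

print(aggregate_tokens_per_sec(9.9, 32))   # Falcon7B end-to-end: 316.8 t/s
print(aggregate_tokens_per_sec(0.95, 32))  # LLaMA-2-70B end-to-end: 30.4 t/s
```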
```python
import ttnn
import torch

with ttnn.manage_device(device_id=0) as device:
    # Create two host tensors; b broadcasts across a's rows.
    a = torch.ones((5, 7))
    b = torch.ones((1, 7))
    # Move both tensors to the device as bfloat16 tiles.
    a = ttnn.from_torch(a, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    b = ttnn.from_torch(b, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    # Eltwise add runs on the device, with PyTorch-style broadcasting.
    output = a + b
    # Copy the result back to a torch tensor on the host.
    output = ttnn.to_torch(output)
    print(output)
```
TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.
Get started with simple kernels.