tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.

License: Apache License 2.0

Makefile 0.32% C++ 46.41% Python 43.02% Shell 0.30% C 3.32% Starlark 0.01% HTML 0.34% Assembly 0.02% Jupyter Notebook 6.05% CMake 0.21%
ml falcon llama llm low-level-programming metal mistral mixtral resnet stable-diffusion

tt-metal's Introduction

ttnn logo

TT-NN is a Python & C++ neural network OP library.


Grayskull (GS) Models

| Model | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
|---|---|---|---|---|
| ResNet-50 (fps) | 20 | 2,070 | 7,200 | 10,000 |
| BERT-Large (sen/s) | 12 | 362 | 406 | 410 |
| Falcon7B-decode (t/s) | 32 | 135 | 135 | 140 |
| ViT (fps) | 8 | 430 | 643 | 1,700 |
| T5 small (sen/s) | | coming soon | | 140 |
| Bloom (sen/s) | | coming soon | | 70 |
| U-Net | | coming soon | | |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time.

[2] - Ignoring host overhead. Kernel execution time only.

Wormhole (WH) Models

| Model | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
|---|---|---|---|---|---|
| Falcon7B-decode (t/s/u) | 129th | 32 | 9.9 | 13.5 | 21 |
| Mistral-7B-decode (t/s/u) | 33rd | 32 | 7.9 | 10.9 | 21 |
| Mamba-2.8B-decode (t/s/u) | any | 32 | 1.7 | 2.0 | 17 |
| Stable Diffusion 1.4 512x512 | | 1 | coming soon | | |

[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.

T3000 (2x4 mesh of WHs) Models

| Model | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
|---|---|---|---|---|---|
| LLaMA-2-70B-decode (t/s/u) | 129th | 32 | 0.95 | 8.4 | 20 |
| LLaMA-3-70B-decode (t/s/u) | 129th | 32 | 0.95 | 7.7 | 20 |
| Falcon40B-decode | | | coming soon | | |
| Mixtral7Bx8-decode | | | coming soon | | |
| ResNet50 (data parallel) | | | coming soon | | |

Using TT-NN ops and tensors

```python
import ttnn
import torch

# Open device 0 for the duration of the with-block.
with ttnn.manage_device(device_id=0) as device:
    a = torch.ones((5, 7))
    b = torch.ones((1, 7))

    # Move the tensors onto the device as bfloat16 in tiled layout.
    a = ttnn.from_torch(a, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    b = ttnn.from_torch(b, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)

    # The broadcast add runs on the device; bring the result back to torch.
    output = a + b
    output = ttnn.to_torch(output)

print(output)
```

TT-Metalium logo

TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.

Getting started

Get started with simple kernels.

tt-metal's People

Contributors

abhullar-tt, acejkov, aliutt, arakhmati, ashayestehmanesh, banekg, cglagovich, coffeeandcheesecake, davorchap, dongjin-na, drjessop, eyonland, farbabi, kkwong10, mikevin920, mo-tenstorrent, muthutt, mywoodstock, nemanjagrujic, npetrovic-tenstorrent, pgkeller, tarafdartt, tt-aho, tt-billteng, tt-brianliu, tt-dma, tt-nshanker, tt-rkim, umadevimcw, yugaott


tt-metal's Issues

release: automate API docs for kernel and host

Description

  • Coordinate with the release schedule; @tt-rkim will write an early version
  • Automate Doxygen generation for all functions, with name, description, and an argument table (argument name, description, required, type, valid range)
    • All sections in the Host API and kernel API (which should be in individual files for each) must be covered
  • Insert the generated tables into the Sphinx RST
  • Ensure performance isn't compromised by the FORCE_INLINE trick
  • Decide whether to embed doxygen_build into build/html of the Sphinx docs
  • Automate the C++ Doxygen -> XML -> Sphinx pipeline, possibly via the Makefile (see the sketch after this list)
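A minimal sketch of what that automation could look like, assuming the docs live under docs/ and Doxygen is configured to emit XML; every path, filename, and target here is an assumption, not the repo's actual layout:

```python
# Hypothetical Doxygen -> XML -> Sphinx driver; all paths are assumptions.
import subprocess
from pathlib import Path

DOXYFILE = Path("docs/Doxyfile")           # assumed Doxyfile location
XML_OUT = Path("docs/doxygen_build/xml")   # assumed Doxygen XML output dir
SPHINX_SRC = Path("docs/source")           # assumed Sphinx source dir
SPHINX_HTML = Path("docs/build/html")      # assumed Sphinx HTML output dir

def run(cmd):
    # check=True raises on failure so CI catches a broken docs build.
    subprocess.run(cmd, check=True)

def build_docs():
    run(["doxygen", str(DOXYFILE)])  # emit XML for host + kernel APIs
    assert XML_OUT.is_dir(), "Doxygen produced no XML output"
    # Something like Breathe would consume the XML from Sphinx's conf.py;
    # here we only invoke sphinx-build on the RST tree.
    run(["sphinx-build", "-b", "html", str(SPHINX_SRC), str(SPHINX_HTML)])

if __name__ == "__main__":
    build_docs()
```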

BERT Tiny Demo

Successfully run BERT-tiny inference on a random input and validate against PyTorch.

Everything should run on Grayskull except for the embedding layer, which can for the time being stay on the CPU until integer tensor support is added.

To bring up BERT-tiny, the following ops need to be available (a host-side illustration follows the list):

  • Untilize, in order to transpose data in row-major format (required for (seq len, num att heads, hid dim / num att heads) -> (num att heads, seq len, hid dim / num att heads), since the number of attention heads is not divisible by 32)
  • Row-major transpose, to perform the permutation mentioned above
  • Row-major reshape, which requires a rebanking operation
  • Tilize, for converting back to tiles for subsequent math operations
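For reference, a host-side PyTorch sketch of the permutation in question; the concrete BERT-tiny shapes used here (sequence length 128, 2 attention heads, hidden dim 128) are assumptions for illustration:

```python
import torch

# Assumed BERT-tiny shapes: seq len 128, 2 attention heads, hidden dim 128.
seq_len, num_heads, hid_dim = 128, 2, 128
x = torch.randn(seq_len, num_heads, hid_dim // num_heads)

# (seq len, num att heads, hid dim / num att heads)
#   -> (num att heads, seq len, hid dim / num att heads)
# num_heads == 2 is not divisible by the 32-wide tile, so on device this
# permute must happen in row-major form: untilize -> transpose -> tilize.
y = x.permute(1, 0, 2).contiguous()
print(y.shape)  # torch.Size([2, 128, 64])
```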

perf: introduce performance regression infra to repo

Description

We need a way of tracking performance regressions. This means specific tests in both LL-Buda and LLRT that run workloads similar to the current correctness/functionality tests, but with performance thresholds.

Mainly in terms of:

  • time
  • memory usage (maybe)
  • power utilization (maybe)

A more holistic approach will be useful for parts of the stack, and we can include a refactoring of the current testing infra.
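A minimal sketch of the kind of time-threshold assertion such a test could make; the workload and the budget below are placeholders, not real thresholds:

```python
import time

def check_perf(fn, threshold_s):
    """Fail if fn() takes longer than threshold_s seconds of wall-clock time."""
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    assert elapsed < threshold_s, (
        f"perf regression: {elapsed:.4f}s exceeds budget of {threshold_s}s"
    )

# Usage: wrap the op invocation of an existing correctness test.
check_perf(lambda: sum(range(1_000_000)), threshold_s=0.5)
```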

infra: add git pre-commit and commit-msg hooks

Description

We need to start enforcing some commit hooks for pre-commit and commit-msg. This would help us keep the repo cleaner, not just the files but also the commit messages.

Commit message template proposal

A possible convention proposed by @DrJessop would be

```
[title]
<ISSUE NUMBER if applicable, else N/A>
<MR addressing if applicable, else N/A>
<Description>
```

This would likely have to be a Python script activated in the virtualenv that pre-commit sets up.
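A sketch of such a commit-msg hook; the exact validation rules below (issue as #N, MR as !N, at least four lines) are assumptions extrapolated from the proposal:

```python
#!/usr/bin/env python3
# Hypothetical commit-msg hook enforcing the template proposed above.
import re
import sys

def validate(msg: str) -> bool:
    lines = msg.strip().splitlines()
    if len(lines) < 4:
        return False  # title, issue, MR, and description are all required
    title, issue, mr = lines[0], lines[1], lines[2]
    return (
        bool(title)
        and re.fullmatch(r"#\d+|N/A", issue) is not None
        and re.fullmatch(r"!\d+|N/A", mr) is not None
    )

if __name__ == "__main__":
    # Git passes the path of the commit message file as the first argument.
    with open(sys.argv[1]) as f:
        if not validate(f.read()):
            sys.stderr.write("commit message does not match the template\n")
            sys.exit(1)
```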

release: fix remaining release script tech debt

Description

There are various things we need to clean up in the release script:

  • Ensure that all LFS files are tracked
  • Ensure that all LFS files are pushed
  • Implement deletion of files and submodules in dst that are no longer tracked
  • Release notes generation
  • Semantic versioning
  • Branch name should be an argument throughout the code; right now it's just a global constant of main
  • Assert that return codes are valid for fetch, pull, and push (see the sketch after this list)
  • Pay back the debt of doing git add device ... we need a nicer way of doing this with the git command, since LFS has no native API in GitPython
  • Parameterize release_name; it's currently just alpha1
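Since GitPython has no native LFS API, one option is shelling out and asserting return codes via check=True; a hedged sketch, where the remote and branch names are assumptions:

```python
import subprocess

def git(*args):
    # check=True raises on a non-zero return code, covering the
    # "assert return codes" item for fetch, pull, and push.
    subprocess.run(["git", *args], check=True)

def push_release(branch="main"):
    git("fetch", "origin")
    git("pull", "origin", branch)
    git("lfs", "push", "origin", branch)  # push all LFS objects for the ref
    git("push", "origin", branch)
```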

risc_rw_speed_banked_dram hang on tttest

It hangs in the write stage on tttest; I added some debug prints to the host code. It finishes the read stage, starts the write stage, and then hangs.


```
❯ ./build/test/llrt/tests/test_run_risc_rw_speed_banked_dram
                   Test | INFO     | num cmd line args: 1
                   Test | INFO     | Using default test arguments
                   Test | INFO     | Test arguments: buffer_size = 409600, num_repetitions = 10000, transaction_size = 512
                 Device | INFO     | SOC descriptors loaded /mnt/localhome/dcapalija/fw-dma-test-2/device/grayskull_120_arch.yaml
                 Device | INFO     | Network descriptor loaded
Detected 1 PCI device
Opening TT_PCI_INTERFACE_ID 0 for netlist target_device_id: 0
PCIEIntfId   0x0
VID:DID      0x1e52:0xfaca
SubVID:SubID 0x1e52:0x3
BSF          c1:0:0
BAR          0x18000000000  size: 48MB
HARVESTING DISABLED = 0x0 (memory: 0x0 logic: 0x0)
HARVESTING DISABLED = 0x0 (memory: 0x0 logic: 0x0)
Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1202 MHz
                  Verif | INFO     | Cores to use: (x=5,y=4)
                  Verif | INFO     |
                  Verif | INFO     | Starting read speed test
                  Verif | INFO     | BRISC time: 0.77506221s
                  Verif | INFO     | Bytes read: 4096000000, GB read: 3.814697265625
                  Verif | INFO     | Read speed GB/s: 4.921794942918195
                  Verif | INFO     | Done read speed test
                  Verif | INFO     |
                  Verif | INFO     | Starting write speed test
```

infra: create a test harness to be used in tt-metal tests

Description

Spurred by earlier sentiments and comments in !53, we want to introduce a test harness that will handle a lot of the boilerplate for us: creating the device, some config setup, and some debug helpers, e.g. logging the AI CLK speed. We can even move things like tt_gdb into it, fully housing its usage and giving us a direct place to pinpoint any test infra.

This will also help with CI development as in #59 because tests will be uniform and nothing will jump out at us when developing CI.

As we move away from llrt tests, we will want a test harness for all TT-Metal tests.
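A minimal sketch of what such a harness could look like as a pytest fixture, reusing the ttnn.manage_device API shown earlier in this README; the test body and shapes are placeholders:

```python
import pytest
import torch
import ttnn

@pytest.fixture
def device():
    # Centralize device creation/teardown so individual tests carry no
    # boilerplate; debug helpers (AI CLK logging, tt_gdb hooks, ...) could
    # live here in one place.
    with ttnn.manage_device(device_id=0) as dev:
        yield dev

def test_add(device):
    a = ttnn.from_torch(torch.ones((32, 32)), device=device,
                        dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    out = ttnn.to_torch(a + a)
    assert out.shape == (32, 32)
```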

Add documentation for new noc apis

Add documentation on:

  • noc_semaphore_set
  • noc_semaphore_wait
  • noc_semaphore_inc
  • noc_semaphore_set_multicast
  • noc_async_write_multicast

Also, add documentation on NoC IDs, VCs, CMD_BUFs, and how write requests are ordered.

llrt + ll_buda: expose target_ai_clk param to both llrt and ll_buda, currently default 0

Description

Note: for 0.2, not important for 0.1

Commit 383ffdbe1f33ec9d0f067f85fcb9d972773a51ce on master exposed an issue with target_ai_clk in tt_cluster.hpp. Up to this point we assumed the compiler would set it to 0, thereby not attempting to set the AI CLK to an invalid value; however, at this commit and onward, target_ai_clk starts being initialized with garbage, e.g. 20k+.

A quick fix for 0.1 was to set it to 0 by default. However, this value can't be changed. We need to make it a configurable param in both llrt and ll_buda, with a default of 0.

Relevant people include @abhullar-tt

Perf_Mon: Measure perf of matmul kernels

  • Monitoring the performance of the improvements brought in for the matmul test, including multicast on in0 and in1 and a sub-block-based writer.
  • Improving the writer by moving to sub-block writes to the CB instead of per-tile writes.

Clean-up comments in ll_buda regarding loading hexs / blanks

Loading blanks has a comment saying that we're taking the RISCs out of reset, but that's not the case:


```cpp
// Take device out of reset
const llrt::TensixRiscsOptions riscs_options = llrt::TensixRiscsOptions::ALL_RISCS;
llrt::internal_::load_blank_kernel_to_all_worker_cores_with_exceptions(
    cluster, pcie_slot, riscs_options, worker_cores);
```

Clean-up reset comments in llrt

We're now doing the de-assert properly: every core has a valid HEX, and only then is it taken out of reset. This comment / TODO should be updated to say that.


```cpp
// TODO: de-asserting reset properly
//  this deasserts reset for all BRISCs (on all devices, all cores), but not other RISC processors (NCRISC, TRISC)
// even though it deasserts reset for all the BRISCs, we are only loading  BRISC for a single core ("core")
// this is unsafe, since BRISCs for which we haven't loaded FW are now running garbage out of their L1
// proper solution:
// a) load dummy BRISC FW to unused cores, and keep using the function that de-asserts all BRISCs (easier, we can load blank kernel and disable NCRISC loading)
// b) de-assert reset only for used BRISCs (needs a new deassert function w/ a list of core to de-assert) (harder)
void deassert_brisc_reset_for_all_chips_all_cores(tt_cluster *cluster) {
    cluster->deassert_risc_reset();
    log_debug(tt::LogLLRuntime, "deasserted reset for all BRISCs");
}
```

sweep_tests: Improvements for running sweep tests

Features/TODOs:

  • Add a test config for running a small number of random samples across all available op tests (see the sketch after this list)
  • Add an option to specify a hard-coded list of shapes (e.g. for edge cases)
  • Remove the old path of running tests via standalone Python scripts
  • Add a README
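A hypothetical sketch of what the config for the first two items could look like; all op names, shapes, and counts are placeholders:

```python
import random

SWEEP_CONFIG = {
    "num_random_samples": 8,              # small random sample per op test
    "ops": ["add", "matmul", "softmax"],  # placeholder op names
    # Hard-coded shapes for known edge cases:
    "extra_shapes": [(1, 1, 32, 32), (1, 1, 32, 2048)],
}

def sample_shapes(cfg):
    # Random tile-aligned shapes, plus the hard-coded edge cases.
    random_shapes = [
        (1, 1, random.randrange(32, 1024, 32), random.randrange(32, 1024, 32))
        for _ in range(cfg["num_random_samples"])
    ]
    return random_shapes + cfg["extra_shapes"]

print(sample_shapes(SWEEP_CONFIG))
```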
