Model | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
---|---|---|---|---|
ResNet-50 (fps) | 20 | 2,070 | 7,200 | 10,000 |
BERT-Large (sen/s) | 12 | 362 | 406 | 410 |
Falcon7B-decode (t/s) | 32 | 135 | 135 | 140 |
ViT (fps) | 8 | 430 | 643 | 1700 |
T5 small (sen/s) | 140 | | | |
Bloom (sen/s) | 70 | | | |
U-Net | coming soon | | | |
[1] - Observed from the host. Includes dispatch overhead and kernel execution time.
[2] - Ignoring host overhead. Kernel execution time only.
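The two metrics differ only in where the timestamps are taken: end-to-end throughput is measured from the host around the whole dispatch, while device throughput counts kernel execution only. A minimal host-side sketch of the first measurement, where `run_batch` is a hypothetical stand-in for dispatching one batch of inference (not a real API):

```python
import time

def measure_throughput(run_batch, batch_size, iters=100):
    """Time `run_batch` from the host and report samples per second.

    Everything between the two timestamps counts, so this corresponds to
    end-to-end throughput [1]: host dispatch overhead plus kernel time.
    Device throughput [2] would instead be derived from on-device timers.
    """
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

# Stand-in workload: sleep 1 ms to emulate one dispatched batch.
fps = measure_throughput(lambda: time.sleep(0.001), batch_size=20)
print(f"end-to-end throughput: {fps:.0f} samples/s")
```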
Model | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
---|---|---|---|---|---|
Falcon7B-decode (t/s/u) | 129th | 32 | 9.9 | 13.5 | 21 |
Mistral-7B-decode (t/s/u) | 33rd | 32 | 7.9 | 10.9 | 21 |
Mamba-2.8B-decode (t/s/u) | any | 32 | 1.7 | 2.0 | 17 |
Stable Diffusion 1.4 512x512 | coming soon | 1 | | | |
[3] - Throughput for generating the i-th token of a sequence while the kv_cache already holds i-1 entries.
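Footnote [3] describes the steady-state decode step: token i attends over the i-1 cached key/value rows, then appends its own. A toy illustration of that invariant, where `decode_step` and `kv_cache` are illustrative names only, not part of any real API:

```python
def decode_step(token, kv_cache):
    # The i-th token attends over the i-1 entries already in the cache
    # (stubbed here as a length check; a real step would run attention).
    context_len = len(kv_cache)
    # Append this step's key/value so the cache grows by one row per token.
    kv_cache.append(token)
    return context_len

kv_cache = []
for i, token in enumerate(range(1, 6), start=1):
    seen = decode_step(token, kv_cache)
    assert seen == i - 1  # the i-th token sees i-1 cached rows
```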
Model | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
---|---|---|---|---|---|
LLaMA-2-70B-decode (t/s/u) | 129th | 32 | 0.95 | 8.4 | 20 |
LLaMA-3-70B-decode (t/s/u) | 129th | 32 | 0.95 | 7.7 | 20 |
Falcon40B-decode | coming soon | | | | |
Mixtral7Bx8-decode | coming soon | | | | |
ResNet50 (data parallel) | coming soon | | | | |
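The t/s/u unit in the decode tables is tokens per second per user; since the batch column is the number of concurrent users, the aggregate device rate is simply per-user throughput times batch size. A quick check against the table values (helper name is illustrative):

```python
def aggregate_tokens_per_sec(t_s_u, batch):
    # t/s/u is per-user decode throughput; multiplying by the number of
    # concurrent users (the batch size) gives aggregate tokens per second.
    return t_s_u * batch

print(aggregate_tokens_per_sec(9.9, 32))   # Falcon7B end-to-end: 316.8 t/s
print(aggregate_tokens_per_sec(0.95, 32))  # LLaMA-2-70B end-to-end: 30.4 t/s
```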
```python
import ttnn
import torch

with ttnn.manage_device(device_id=0) as device:
    # Create two host tensors; b broadcasts across a's rows.
    a = torch.ones((5, 7))
    b = torch.ones((1, 7))
    # Move both tensors to the device as bfloat16 tiles.
    a = ttnn.from_torch(a, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    b = ttnn.from_torch(b, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    # Eltwise add runs on the device, with PyTorch-style broadcasting.
    output = a + b
    # Copy the result back to a torch tensor on the host.
    output = ttnn.to_torch(output)
    print(output)
```
TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.
Get started with simple kernels.