Coder Social home page Coder Social logo

Comments (10)

cglagovichTT avatar cglagovichTT commented on July 19, 2024

I was able to repro this segfault on sjc-snva-t3002.

Config:

  • Llama3
  • decode-only
  • greedy decode
image
2024-07-02 16:45:32.806 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_25_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1,│······
 32, 32]), dtype BFLOAT16, layout TILE                                                                                                                                                                                                │······
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │······
                                                                                                                                                                                                                                      │······
Thread 223 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │······
[Switching to Thread 0x7ffe82fd7700 (LWP 773289)]                                                                                                                                                                                     │······
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                         │······
(gdb) bt                                                                                                                                                                                                                              │······
#0  0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                     │······
#1  0x00007fff88875f17 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │······
#2  0x00007fff88873b82 in tt::tt_metal::EnqueueDeallocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Allocator&, unsigned int, tt::tt_metal::BufferType, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so       │······
#3  0x00007fff89071d70 in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda(tt::tt_metal::Device*)#1}, std::__1│······
::allocator<{lambda(tt::tt_metal::Device*)#1}>, void (tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device*&&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                         │······
#4  0x00007fff8907200f in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda()#1}, std::__1::allocator<{lambda()│······
#1}>, void ()>::operator()() () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                               │······
#5  0x00007fff8881429b in tt::WorkExecutor::run_worker() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                   │······
#6  0x00007fff8881450b in void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (tt::WorkExecutor::*)(), tt::WorkEx│······
ecutor*> >(void*) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                          │······
#7  0x00007ffff7db5609 in start_thread (arg=<optimized out>) at pthread_create.c:477                                                                                                                                                  │······
#8  0x00007ffff7eef353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95                                                                                                                                                    │······
(gdb) list                                                                                                                                                                                                                            │······
1       <built-in>: No such file or directory.                                                                                                                                                                                        │······
(gdb)

from tt-metal.

cglagovichTT avatar cglagovichTT commented on July 19, 2024

When I run with async disabled, I see a variety of errors.

Case 1:
image

2024-07-02 17:59:25.628 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_109_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·, 32, 128]), dtype BFLOAT16, layout TILE                                                                                                                                                                                              │·
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │·
                                                                                                                                                                                                                                      │·Thread 218 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │·
[Switching to Thread 0x7ffe857dc700 (LWP 820677)]                                                                                                                                                                                     │·
0x00007fff88833ff5 in tt::tt_metal::allocator::FreeList::update_left_aligned_allocated_block_connections(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, boost::local_shared_ptr<tt::tt_metal::allocator::FreeList:│·
:Block>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                   │·
(gdb) bt                                                                                                                                                                                                                              │·
#0  0x00007fff88833ff5 in tt::tt_metal::allocator::FreeList::update_left_aligned_allocated_block_connections(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, boost::local_shared_ptr<tt::tt_metal::allocator::FreeL│·
ist::Block>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                               │·#1  0x00007fff8883432f in tt::tt_metal::allocator::FreeList::allocate_slice_of_free_block(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, unsigned long, unsigned long) ()                                         │·   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#2  0x00007fff88834754 in tt::tt_metal::allocator::FreeList::allocate(unsigned long, bool, unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                  │·
#3  0x00007fff88837075 in tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so      │·#4  0x00007fff88838f17 in tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>) ()                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#5  0x00007fff8883905f in tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>) ()                                  │·
   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#6  0x00007fff8887398e in tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                       │·#7  0x00007fff888760c4 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │·#8  0x00007fff88873a85 in tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                              │·
#9  0x00007fff88822c63 in tt::tt_metal::Buffer::allocate() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                 │·#10 0x00007fff888215d8 in tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool│·) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                          │·#11 0x00007fff8905c6a8 in std::__1::shared_ptr<tt::tt_metal::Buffer> std::__1::allocate_shared[abi:ue170006]<tt::tt_metal::Buffer, std::__1::allocator<tt::tt_metal::Buffer>, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt│·::tt_metal::BufferType const&, void>(std::__1::allocator<tt::tt_metal::Buffer> const&, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt::tt_metal::BufferType const&) ()                                                      │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#12 0x00007fff88fa6e4a in tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::│·__1::optional<tt::tt_metal::ShardSpecBuffer> const&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                       │·
#13 0x00007fff8906e84a in tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) ()                                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#14 0x00007fff88c37418 in tt::tt_metal::operation::generic_create_output_tensors<tt::operations::primary::Matmul> () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                          │·#15 0x00007fff88c2f1ba in tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&) const ()

from tt-metal.

cglagovichTT avatar cglagovichTT commented on July 19, 2024

case 2: deallocate
image

2024-07-02 18:42:19.973 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_rot_mat_decode_129_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 32, │·
128, 128]), dtype BFLOAT16, layout TILE                                                                                                                                                                                               │·
2024-07-02 18:42:19.977 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_129_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·
, 32, 160]), dtype BFLOAT16, layout TILE                                                                                                                                                                                              │·
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │·
                                                                                                                                                                                                                                      │·
Thread 219 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │·
[Switching to Thread 0x7ffe86fdf700 (LWP 838272)]                                                                                                                                                                                     │·
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                         │·
(gdb) list                                                                                                                                                                                                                            │·
1       <built-in>: No such file or directory.                                                                                                                                                                                        │·
(gdb) bt                                                                                                                                                                                                                              │·
#0  0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                     │·
#1  0x00007fff88875f17 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │·
#2  0x00007fff88873b82 in tt::tt_metal::EnqueueDeallocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Allocator&, unsigned int, tt::tt_metal::BufferType, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so       │·
#3  0x00007fff89071d70 in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda(tt::tt_metal::Device*)#1}, std::__1│·
::allocator<{lambda(tt::tt_metal::Device*)#1}>, void (tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device*&&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                         │·
#4  0x00007fff8907200f in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda()#1}, std::__1::allocator<{lambda()│·
#1}>, void ()>::operator()() () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                               │·
#5  0x00007fff8881429b in tt::WorkExecutor::run_worker() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                   │·
#6  0x00007fff8881450b in void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (tt::WorkExecutor::*)(), tt::WorkEx│·
ecutor*> >(void*) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                          │·
#7  0x00007ffff7db5609 in start_thread (arg=<optimized out>) at pthread_create.c:477                                                                                                                                                  │·
#8  0x00007ffff7eef353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

from tt-metal.

cglagovichTT avatar cglagovichTT commented on July 19, 2024

Case 3: search_first
image

2024-07-02 18:57:17.853 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_149_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·
, 32, 160]), dtype BFLOAT16, layout TILE                                                                                                                                                                                              │·
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │·
                                                                                                                                                                                                                                      │·
Thread 219 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │·
[Switching to Thread 0x7ffe86fdf700 (LWP 857393)]                                                                                                                                                                                     │·
\0x00007fff88833d13 in tt::tt_metal::allocator::FreeList::search_first(unsigned long, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                │·
(gdb) bt                                                                                                                                                                                                                              │·
#0  0x00007fff88833d13 in tt::tt_metal::allocator::FreeList::search_first(unsigned long, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                             │·
#1  0x00007fff888347a9 in tt::tt_metal::allocator::FreeList::allocate(unsigned long, bool, unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                  │·#2  0x00007fff888378d5 in tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so      │·
#3  0x00007fff88839777 in tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>) ()                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·
#4  0x00007fff888398bf in tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>) ()                                  │·
   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#5  0x00007fff888741ee in tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                       │·
#6  0x00007fff88876924 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │·
#7  0x00007fff888742e5 in tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                              │·
#8  0x00007fff88822c73 in tt::tt_metal::Buffer::allocate() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                 │·
#9  0x00007fff888215e8 in tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool│·
) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                          │·
#10 0x00007fff8905c6a8 in std::__1::shared_ptr<tt::tt_metal::Buffer> std::__1::allocate_shared[abi:ue170006]<tt::tt_metal::Buffer, std::__1::allocator<tt::tt_metal::Buffer>, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt│·::tt_metal::BufferType const&, void>(std::__1::allocator<tt::tt_metal::Buffer> const&, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt::tt_metal::BufferType const&) ()                                                      │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#11 0x00007fff88fa6e4a in tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::│·
__1::optional<tt::tt_metal::ShardSpecBuffer> const&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                       │·#12 0x00007fff8906e84a in tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) ()                                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#13 0x00007fff88c37418 in tt::tt_metal::operation::generic_create_output_tensors<tt::operations::primary::Matmul> () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                          │·
#14 0x00007fff88c2f1ba in tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&) const ()

from tt-metal.

cglagovichTT avatar cglagovichTT commented on July 19, 2024

I was not able to repro this segfault with async queues disabled.

In one of the deallocate segfaults in a worker thread, I see that the main thread is involved in sending a tensor to device.
image

This made me wonder if this code pattern is the culprit:

            rot_mats = ttnn.as_tensor(
                rot_mat,
                dtype=ttnn.bfloat16,
                layout=ttnn.TILE_LAYOUT,
                device=self.device_mesh,
                cache_file_name=cache_name(f"rot_mat_decode_{start_pos}"),
                memory_config=self.model_config["DRAM_MEMCFG"],
                mesh_mapper=ReplicateTensorToMesh(self.device_mesh),
            )
            rot_mats = ttnn.to_device(rot_mats, self.device_mesh)

The to_device should be unnecessary but not incorrect. I ran the test again with this call removed, but the segfaults did not go away.

from tt-metal.

cglagovichTT avatar cglagovichTT commented on July 19, 2024

Repro instructions:

  • branch: cglagovich/9837
  • Build in release mode
  • set CPU frequency governor to ondemand (seems to help with repro)
gdb --args python -m pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]

Expected output:

--Type <RET> for more, q to quit, c to continue without paging--

Thread 224 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe827d6700 (LWP 369024)]
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so
(gdb) q

from tt-metal.

tstescoTT avatar tstescoTT commented on July 19, 2024

On a new T3000 machine to get the first run to 2816 tokens generated in a single sequence I got 6 crashes:

2024-07-03-Kuaishou-ttsmi
  • OS: Ubuntu 20.04
  • tt-kmd: 1.27.1
  • firmward bundle: 80.8.12.0
  • tt-metal commit: a053bc8

I did a soft reset tt-smi -r 0,1,2,3 after each crash and reran the first run script.

crash 1:

(python_env) user@66a27c372dce:~/tt-metal-llama3-70b/src$ python tt_metal_impl/demo/demo_llama3_first_run_4k.py
...
2024-07-03 11:26:09.820 | INFO     | __main__:run_decode:199 - Loop 5

free(): invalid pointer
Aborted (core dumped)

2:

(python_env) user@66a27c372dce:~/tt-metal-llama3-70b/src$ python tt_metal_impl/demo/demo_llama3_first_run_4k.py
...
2024-07-03 11:33:32.714 | INFO     | __main__:run_decode:199 - Loop 88

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
info:
Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
backtrace:
 --- tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata)
 --- tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool)
 --- tt::tt_metal::Buffer::allocate()
 --- tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x648958) [0x7f063a443958]
 --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::__1::op
tional<tt::tt_metal::ShardSpecBuffer> const&)
 --- tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
 --- /tt-metal/build/lib/libtt_eager.so(_ZN2tt8tt_metal9operation29generic_create_output_tensorsINS_10operations7primary6MatmulEEENS1_21program_output_helperIT_Xsr18has_create_programIS7_EE5valueEE4typeERKS7_RKNSt3__
16vectorINS0_6TensorENSC_9allocatorISE_EEEENSC_8optionalINS0_8DataTypeEEENS0_6LayoutERKNSK_INS0_12MemoryConfigEEE+0x178) [0x7f063a047928]
 --- tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&) const
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ee299) [0x7f0639fe9299]
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::detail::run_device_operation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Te
nsor>>>(std::__1::reference_wrapper<tt::tt_metal::CommandQueue>, tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt
::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std
::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::run<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(tt::tt_metal::ope
ration::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::v
ector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::_
_1::optional<tt::tt_metal::Tensor>>> const&, unsigned char)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ed88f) [0x7f0639fe888f]
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ecf7d) [0x7f0639fe7f7d]
 --- tt::tt_metal::operation::launch_op(std::__1::function<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> (std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::T
ensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>,
 std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)>&&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocato
r<tt::tt_metal::Tensor>>&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>>, std::__1::vector<std::__1::optional<tt::tt_metal::Tens
or>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>>, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1eb994) [0x7f0639fe6994]
 --- /tt-metal/build/lib/libtt_eager.so(+0x24b709) [0x7f063a046709]
 --- /tt-metal/build/lib/libtt_eager.so(+0x516e39) [0x7f063a311e39]
 --- /tt-metal/build/lib/libtt_eager.so(+0x517b6f) [0x7f063a312b6f]
 --- /tt-metal/build/lib/libtt_metal.so(+0x1579eb) [0x7f0639c319eb]
 --- /tt-metal/build/lib/libtt_metal.so(+0x157c5b) [0x7f0639c31c5b]
 --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f06a66c0609]
 --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f06a67fa353]

Aborted (core dumped)

crash 3 (same stack trace as above):

2024-07-03 11:40:37.964 | INFO     | __main__:run_decode:199 - Loop 127

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception

crash 4 (same stack trace as above):

2024-07-03 11:49:00.270 | INFO     | __main__:run_decode:199 - Loop 489

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception

crash 5:

2024-07-03 11:57:17.461 | INFO     | __main__:run_decode:199 - Loop 847

2024-07-03 11:57:17.573 | INFO     | __main__:run_decode:199 - Loop 848

Segmentation fault (core dumped)

crash 6 (hang)

2024-07-03 12:13:12.789 | INFO     | __main__:run_decode:199 - Loop 913

2024-07-03 12:13:12.903 | INFO     | __main__:run_decode:199 - Loop 914

^C^C^C^C^CTerminated

Rerunning after this crash got to 2816 tokens and gets to the known issue #9839. This completes the first run and generation for 2k context is relatively reliable.

from tt-metal.

tt-asaigal avatar tt-asaigal commented on July 19, 2024

Hey @tstescoTT would you mind running with this commit cherry-picked: 4558673. It resolved the segfault for me locally

from tt-metal.

smehtaTT avatar smehtaTT commented on July 19, 2024

@tstescoTT - can you help repro and confirm?

from tt-metal.

smehtaTT avatar smehtaTT commented on July 19, 2024

@mbahnasTT confirms tested - can be closed

from tt-metal.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.