[Bug Report] Result of a high-dimension matmul is corrupted when program cache is enabled, the result is assigned to an already allocated tensor, and the 2nd matrix is transposed; corruption starts at index 10240 of the result tensor (tt-metal, 7 comments, CLOSED)

marty1885 commented on August 20, 2024

from tt-metal.

Comments (7)

TT-BrianLiu commented on August 20, 2024

Hey Marty, I will take a look at this. We haven't done much testing with our matmul APIs from the C++ side, so thanks for pointing out this issue!


marty1885 commented on August 20, 2024

@TT-BrianLiu No problem. I have uploaded the example code into a self-contained repository on my GitHub. Please let me know if the instructions in the README are not clear. I just tried again on c52e153 and I am experiencing the same issue. LMK if you cannot replicate it on your machine.

https://github.com/marty1885/ttnn-matmul-corruption-demo


TT-BrianLiu commented on August 20, 2024

Can you provide the full test file with the includes and show how you built and ran the test?


TT-BrianLiu commented on August 20, 2024

Thank you! I will try running the test.


TT-BrianLiu commented on August 20, 2024

I was able to repro it and I pushed the test here: jedi


TT-BrianLiu commented on August 20, 2024

I figured out your issue. Our matmuls support either [B, 1, M, K] x [1, 1, K, N] (bcast_batch=True) or [B, 1, M, K] x [B, 1, K, N] (bcast_batch=False). In this test, the batches for input 1 (m2) and input 2 (m1) don't match, so this matmul should actually assert out, but it doesn't. Instead, our matmul treats this as the second bcast_batch=False case and reads input 2 (m1) as a full [20, 1, 256, 32] tensor when only [2, 1, 256, 32] is actually allocated.

So, the fix is simple. I will add the missing asserts for the matmul variants that lack them, but let me explain what you're seeing in your tests. I will leave your test below for future reference, since I will remove it when I merge the fix. I also removed everything that is not relevant (e.g. the transpose and the extra allocation of m3).

  • The output tile where corruption begins exactly corresponds to the batch dim of m2. In your original test, you had [10, 1, 256, 32], so the 11th tile was being corrupted. I switched it to [2, 1, 256, 32] and then the 3rd tile started being corrupted. This is good evidence that the matmul is reading in garbage inputs for input 2 (m1).
  • The reason you see corrupted values only when program cache is enabled is that program binaries are dumped in DRAM, and the tilize_with_zero_padding op inside make_random_tensor also defaults to DRAM for the output. Tensors and binaries are allocated bottom-up in DRAM, so when the matmul reads memory above the allocated space for input 2 (m1), it reads non-zero values from the program binaries. When program cache isn't enabled, it reads either sensible (but wrong) values of input 1 (if you allocate m2, then m1) or zeros (if you allocate m1, then m2). If you enable program cache and let the test print instead of aborting, you will see some more sensible (but wrong) values later, once it gets past the region of memory where the program binaries are stored.
  • If you switch the output of make_random_tensor to L1 by passing in an L1 memory config (see code), you will see the same result regardless of whether program cache is enabled. This is because the incorrect region the matmul reads from for input 2 (m1) is now in L1, and that region is unaffected by program cache. You will still see incorrect values that change depending on the order of allocation for m1 and m2, just not complete garbage.
auto MEMORY_CONFIG = MemoryConfig{.memory_layout = tt::tt_metal::TensorMemoryLayout::INTERLEAVED, .buffer_type = BufferType::L1};
return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()), MEMORY_CONFIG);

Reference code:

#include "tensor/host_buffer/functions.hpp"
#include "tensor/types.hpp"
#include "tt_dnn/op_library/auto_format.hpp"
#include <cstddef>
#include <tt_eager/tensor/tensor.hpp>
#include <ttnn/core.hpp>
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/device.hpp>
#include <tt_dnn/op_library/fully_connected/fully_connected_op.hpp>
#include <tt_dnn/op_library/eltwise_unary/eltwise_unary_op.hpp>
#include <tt_dnn/op_library/transpose/transpose_op.hpp>

#include "common/bfloat16.hpp"
#include "tt_dnn/op_library/composite/composite_ops.hpp"
#include "tt_dnn/op_library/tilize/tilize_op.hpp"
#include <ttnn/operations/matmul.hpp>
#include <tt_dnn/op_library/update_cache/update_cache_op.hpp>

#include <cmath>    // std::isnan, std::abs
#include <cstdlib>  // abort
#include <vector>
#include <iostream>

static tt::tt_metal::Tensor make_random_tensor(tt::tt_metal::Shape s)
{
    static int seed = 42;
    auto b = tt::tt_metal::owned_buffer::create(
        create_random_vector_of_bfloat16_native(
            s[0] * s[1] * s[2] * s[3] * 2, 2, seed++, -1));
    tt::tt_metal::Tensor t(OwnedStorage{std::move(b)}, s
        , tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR);
    auto MEMORY_CONFIG = MemoryConfig{.memory_layout = tt::tt_metal::TensorMemoryLayout::INTERLEAVED, .buffer_type = BufferType::DRAM};  // Switch to L1
    return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()), MEMORY_CONFIG);
}

int main() {
    auto device = &ttnn::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device); // Program cache HAS to be enabled if m1 is in DRAM; otherwise, test will "pass" with bad results either way

    auto m2 = make_random_tensor(tt::tt_metal::Shape{20, 1, 32, 256});
    auto m1 = make_random_tensor(tt::tt_metal::Shape{2, 1, 256, 32});

    auto m3 = ttnn::operations::matmul::matmul(m2, m1, std::nullopt);

    std::vector<bfloat16> buf(m3.shape().volume());
    tt::tt_metal::memcpy(buf.data(), m3);
    std::cout << "Total ele: " << buf.size() << std::endl;
    for(size_t i = 0; i < buf.size(); i++) {
        if(i % 8 != 0) continue; // Print every 8th element of each row
        if(i % 32 == 0)
            std::cout << std::endl;
        if(i % 1024 == 0)
            std::cout << std::endl;
        std::cout << buf[i].to_float() << " ";

        // Detect NaN and corrupted values
        if(std::isnan(buf[i].to_float()) || std::abs(buf[i].to_float()) > 100) {
            std::cerr << "NaN or corrupted value detected at index " << i << std::endl;
            abort();
        }
    }

    std::cout << std::endl;

    device->close();
}


TT-BrianLiu commented on August 20, 2024

Added the appropriate checks here: #10013

