[Bug Report] Result of a high-dimension matmul is corrupted when program cache is enabled, the result is assigned to an already allocated tensor, and the 2nd matrix is transposed; corruption starts at index 10240 of the result tensor (tt-metal, 7 comments, CLOSED)

marty1885 commented on August 20, 2024

from tt-metal.

Comments (7)

TT-BrianLiu commented on August 20, 2024

Hey Marty, I will take a look at this. We haven't done much testing with our matmul APIs from the C++ side, so thanks for pointing out this issue!


marty1885 commented on August 20, 2024

@TT-BrianLiu No problem. I have uploaded the example code into a self-contained repository on my GitHub. Please let me know if the instructions in the README are not clear. I just tried again on c52e153 and I am experiencing the same issue. LMK if you cannot replicate it on your machine.

https://github.com/marty1885/ttnn-matmul-corruption-demo


TT-BrianLiu commented on August 20, 2024

Can you provide the full test file with the includes and show how you built and ran the test?


TT-BrianLiu commented on August 20, 2024

Thank you! I will try running the test.


TT-BrianLiu commented on August 20, 2024

I was able to repro it and I pushed the test here: jedi


TT-BrianLiu commented on August 20, 2024

I figured out your issue. Our matmuls support either [B, 1, M, K] x [1, 1, K, N] (bcast_batch=True) or [B, 1, M, K] x [B, 1, K, N] (bcast_batch=False). In this test, the batches for input 1 (m2) and input 2 (m1) don't match, so this matmul should actually assert out, but it doesn't. Instead, our matmul treats this as the second bcast_batch=False case and reads input 2 (m1) as a full [20, 1, 256, 32] tensor when only [2, 1, 256, 32] is actually allocated.

So, the fix is simple. I will add the missing asserts for the matmul variants that lack them, but let me explain what you're seeing in your tests. I will leave your test below for future reference, since I will remove it when I merge the fix. I also removed everything that is not relevant (e.g. the transpose and the extra allocation of m3).

  • The output tile where corruption begins exactly corresponds to the batch dim of m2. In your original test, you had [10, 1, 256, 32], so the 11th tile was being corrupted. I switched it to [2, 1, 256, 32] and then the 3rd tile started being corrupted. This is good evidence that the matmul is reading in garbage inputs for input 2 (m1).
  • The reason you see corrupted values only when program cache is enabled is that program binaries are dumped in DRAM, and the tilize_with_zero_padding op inside make_random_tensor also defaults to DRAM for the output. Tensors and binaries are allocated bottom-up in DRAM, so when the matmul reads memory above the allocated space for input 2 (m1), it reads non-zero values from the program binaries. When program cache isn't enabled, it reads either sensible (but wrong) values of input 1 (if you allocate m2, then m1) or zeros (if you allocate m1, then m2). If you enable program cache and let the test print instead of aborting, you will see some more sensible (but wrong) values later, once it gets past the region of memory where the program binaries are stored.
  • If you switch the output of make_random_tensor to L1 by passing in an L1 memory config (see code), you will see the same result regardless of whether program cache is enabled. This is because the incorrect region the matmul reads from for input 2 (m1) is now in L1, and that region is unaffected by program cache. You will still see incorrect values that change depending on the order of allocation for m1 and m2, just not complete garbage.
auto MEMORY_CONFIG = MemoryConfig{.memory_layout = tt::tt_metal::TensorMemoryLayout::INTERLEAVED, .buffer_type = BufferType::L1};
return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()), MEMORY_CONFIG);

Reference code:

#include "tensor/host_buffer/functions.hpp"
#include "tensor/types.hpp"
#include "tt_dnn/op_library/auto_format.hpp"
#include <cstddef>
#include <tt_eager/tensor/tensor.hpp>
#include <ttnn/core.hpp>
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/device.hpp>
#include <tt_dnn/op_library/fully_connected/fully_connected_op.hpp>
#include <tt_dnn/op_library/eltwise_unary/eltwise_unary_op.hpp>
#include <tt_dnn/op_library/transpose/transpose_op.hpp>

#include "common/bfloat16.hpp"
#include "tt_dnn/op_library/composite/composite_ops.hpp"
#include "tt_dnn/op_library/tilize/tilize_op.hpp"
#include <ttnn/operations/matmul.hpp>
#include <tt_dnn/op_library/update_cache/update_cache_op.hpp>

#include <cmath>    // std::isnan, std::abs
#include <cstdlib>  // abort
#include <vector>
#include <iostream>

static tt::tt_metal::Tensor make_random_tensor(tt::tt_metal::Shape s)
{
    static int seed = 42;
    auto b = tt::tt_metal::owned_buffer::create(
        create_random_vector_of_bfloat16_native(
            s[0] * s[1] * s[2] * s[3] * 2, 2, seed++, -1));
    tt::tt_metal::Tensor t(OwnedStorage{std::move(b)}, s
        , tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR);
    auto MEMORY_CONFIG = MemoryConfig{.memory_layout = tt::tt_metal::TensorMemoryLayout::INTERLEAVED, .buffer_type = BufferType::DRAM};  // Switch to L1
    return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()), MEMORY_CONFIG);
}

int main() {
    auto device = &ttnn::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device); // Program cache HAS to be enabled if m1 is in DRAM; otherwise, test will "pass" with bad results either way

    auto m2 = make_random_tensor(tt::tt_metal::Shape{20, 1, 32, 256});
    auto m1 = make_random_tensor(tt::tt_metal::Shape{2, 1, 256, 32});

    auto m3 = ttnn::operations::matmul::matmul(m2, m1, std::nullopt);

    std::vector<bfloat16> buf(m3.shape().volume());
    tt::tt_metal::memcpy(buf.data(), m3);
    std::cout << "Total ele: " << buf.size() << std::endl;
    for(size_t i = 0; i < buf.size(); i++) {
        if(i % 8 != 0) continue; // Print every 8th element of each row
        if(i % 32 == 0)
            std::cout << std::endl;
        if(i % 1024 == 0)
            std::cout << std::endl;
        std::cout << buf[i].to_float() << " ";

        // Detect NaN and corrupted values
        if(std::isnan(buf[i].to_float()) || std::abs(buf[i].to_float()) > 100) {
            std::cerr << "NaN or corrupted value detected at index " << i << std::endl;
            abort();
        }
    }

    std::cout << std::endl;

    device->close();
}


TT-BrianLiu commented on August 20, 2024

Added the appropriate checks here: #10013

