
Comments (34)

mkatliar avatar mkatliar commented on August 24, 2024 1

@giaf I would appreciate it if you did independent benchmarking. I have created a separate repo with my benchmark code here: https://github.com/mkatliar/blasfeo_benchmark. It may be a good place to start from and to add your independent benchmarks.


mkatliar avatar mkatliar commented on August 24, 2024

I am not sure whether the issue is related to the details of how the time is measured in the benchmark. But the difference is surprising, and I have not seen such strange results when benchmarking non-BLASFEO code.


mkatliar avatar mkatliar commented on August 24, 2024

With the --v=2 benchmark option I can get some more info.

Command:

./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --v=2

Output:

[...]
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 1
-- LOG(2): Ran in 1.1109e-05/1.1244e-05
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 10
-- LOG(2): Ran in 9.1714e-05/9.1772e-05
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 100
-- LOG(2): Ran in 0.000906435/0.000906326
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 1000
-- LOG(2): Ran in 0.00914121/0.00914061
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 10000
-- LOG(2): Ran in 0.0923856/0.0924196
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687447/0.687481
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687436/0.687472
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687106/0.687139
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.252108/0.25214
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.252166/0.252203
BM_gemm_blasfeo/10/10/10              9073 ns         9073 ns        75769
BM_gemm_blasfeo/10/10/10              9073 ns         9073 ns        75769
BM_gemm_blasfeo/10/10/10              9069 ns         9068 ns        75769
BM_gemm_blasfeo/10/10/10              3328 ns         3327 ns        75769
BM_gemm_blasfeo/10/10/10              3329 ns         3328 ns        75769
BM_gemm_blasfeo/10/10/10_mean         6774 ns         6774 ns            5
BM_gemm_blasfeo/10/10/10_median       9069 ns         9068 ns            5
BM_gemm_blasfeo/10/10/10_stddev       3146 ns         3146 ns            5

You can see that the benchmark was run with an increasing iteration count to determine the required number of iterations, which came out as 75769. Then it ran 5 batches of 75769 calls to gemm each, where the first three batches executed in about 0.69 s and the last two in about 0.25 s.


mkatliar avatar mkatliar commented on August 24, 2024

The issue is not reproduced if BLASFEO is compiled with LA = REFERENCE instead of LA = HIGH_PERFORMANCE.


mkatliar avatar mkatliar commented on August 24, 2024

The prefetch instructions are my prime suspect:

ROUTINE ======================== inner_kernel_dgemm_nn_4x4_lib4
16669307243 16669307243 (flat, cum) 26.66% of Total
         .          .      56e30: cmp    $0x0,%r10d                       ;inner_kernel_dgemm_nn_4x4_lib4 
         .          .      56e34: jle    570a0 <inner_kernel_dgemm_nn_4x4_lib4+0x270>
         .          .      56e3a: vxorpd %ymm4,%ymm4,%ymm4
         .          .      56e3e: vmovapd %ymm4,%ymm5
  14330999   14330999      56e42: vmovapd %ymm4,%ymm6                     ;inner_kernel_dgemm_nn_4x4_lib4 ??
         .          .      56e46: vmovapd %ymm4,%ymm7                     ;inner_kernel_dgemm_nn_4x4_lib4 
         .          .      56e4a: vmovapd (%r11),%ymm13
         .          .      56e4f: cmp    $0x4,%r10d
         .          .      56e53: jle    56f59 <inner_kernel_dgemm_nn_4x4_lib4+0x129>
  14337096   14337096      56e59: nopl   0x0(%rax)                        ;inner_kernel_dgemm_nn_4x4_lib4 ??
  34239958   34239958      56e60: prefetcht0 (%r12,%r13,2)
  86003288   86003288      56e65: prefetcht0 0x40(%r12,%r13,2)
  39820550   39820550      56e6b: vbroadcastsd (%r12),%ymm12
1414449993 1414449993      56e71: vfmadd231pd %ymm13,%ymm12,%ymm0
 273971468  273971468      56e76: vmovapd 0x20(%r11),%ymm14
  78039046   78039046      56e7c: vbroadcastsd 0x20(%r12),%ymm12


mkatliar avatar mkatliar commented on August 24, 2024

I commented out all prefetcht0 instructions in the BLASFEO code and rebuilt BLASFEO, but the issue is still reproduced.


giaf avatar giaf commented on August 24, 2024

Is the issue happening for HIGH_PERFORMANCE targets INTEL_SANDY_BRIDGE, INTEL_CORE and AMD_BULLDOZER?


mkatliar avatar mkatliar commented on August 24, 2024

The important difference between the benchmark runs is where the memory is allocated. I found out that the alignment of the array data has a dramatic effect on the timing.

Below are the results from benchmark runs with different alignment of the allocated memory (the last parameter). You can see that when the alignment is 0x20 the time varies a lot, whereas with 0x1000 it does not. (A sketch of an aligned-allocation helper of the kind used here is given after the table.)

--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_gemm_blasfeo/2/2/2/32                     123 ns          123 ns     10000000
BM_gemm_blasfeo/2/2/2/32                    33.5 ns         33.5 ns     10000000
BM_gemm_blasfeo/2/2/2/32                    80.4 ns         80.4 ns     10000000
BM_gemm_blasfeo/2/2/2/32                    80.4 ns         80.4 ns     10000000
BM_gemm_blasfeo/2/2/2/32                     210 ns          210 ns     10000000
BM_gemm_blasfeo/2/2/2/32_mean                105 ns          105 ns            5
BM_gemm_blasfeo/2/2/2/32_median             80.4 ns         80.4 ns            5
BM_gemm_blasfeo/2/2/2/32_stddev             66.4 ns         66.4 ns            5
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096_mean             33.6 ns         33.6 ns            5
BM_gemm_blasfeo/2/2/2/4096_median           33.6 ns         33.6 ns            5
BM_gemm_blasfeo/2/2/2/4096_stddev          0.015 ns        0.015 ns            5
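
For reference, the aligned allocation used in the benchmark is presumably something along these lines (a minimal sketch assuming C++17 std::aligned_alloc; the actual alignedAlloc helper in the benchmark code may differ):

#include <cstdlib>

// Allocate n doubles at an address with the requested alignment (a power of two).
// std::aligned_alloc requires the size to be a multiple of the alignment, so round it up.
double * alignedAlloc(std::size_t n, std::size_t alignment)
{
    std::size_t const bytes = (n * sizeof(double) + alignment - 1) / alignment * alignment;
    return static_cast<double *>(std::aligned_alloc(alignment, bytes));
}

// Usage: double * A = alignedAlloc(m * k, 0x1000);  /* ... */  std::free(A);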


giaf avatar giaf commented on August 24, 2024

For the BLASFEO API, 32 bytes is the minimum alignment to avoid segfaults, but 64 bytes is the assumed alignment for performance. However, the performance deviations should not be that large.

On top of that, what you benchmark is not just the call to the linear algebra routine, but also the memory allocation. And in your script the memory allocation happens differently for the BLASFEO case (alignedAlloc) and for the CBLAS case (std::make_unique). Additionally, depending on when all these small memory chunks are garbage collected, this may pollute the memory managed by the allocation system quite a lot.

In summary, it is not clear to me whether the difference in computational time comes from the execution of the BLASFEO routine or from the memory allocation routine.


mkatliar avatar mkatliar commented on August 24, 2024

Here is the updated benchmark, where alignment is added as an extra parameter.
gemm-benchmark.zip

The results from the run

./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --benchmark_report_aggregates_only=true

are below:

-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
BM_gemm_blasfeo/2/2/2/32_mean              80.2 ns         80.2 ns            5
BM_gemm_blasfeo/2/2/2/32_median            80.8 ns         80.8 ns            5
BM_gemm_blasfeo/2/2/2/32_stddev            31.4 ns         31.4 ns            5
BM_gemm_blasfeo/3/3/3/32_mean               194 ns          194 ns            5
BM_gemm_blasfeo/3/3/3/32_median             214 ns          214 ns            5
BM_gemm_blasfeo/3/3/3/32_stddev             164 ns          164 ns            5
BM_gemm_blasfeo/5/5/5/32_mean               460 ns          460 ns            5
BM_gemm_blasfeo/5/5/5/32_median             376 ns          376 ns            5
BM_gemm_blasfeo/5/5/5/32_stddev             239 ns          239 ns            5
BM_gemm_blasfeo/10/10/10/32_mean           3584 ns         3584 ns            5
BM_gemm_blasfeo/10/10/10/32_median         3775 ns         3775 ns            5
BM_gemm_blasfeo/10/10/10/32_stddev          285 ns          285 ns            5
BM_gemm_blasfeo/20/20/20/32_mean            514 ns          514 ns            5
BM_gemm_blasfeo/20/20/20/32_median          515 ns          515 ns            5
BM_gemm_blasfeo/20/20/20/32_stddev         4.86 ns         4.86 ns            5
BM_gemm_blasfeo/30/30/30/32_mean          10297 ns        10297 ns            5
BM_gemm_blasfeo/30/30/30/32_median         6358 ns         6358 ns            5
BM_gemm_blasfeo/30/30/30/32_stddev         9572 ns         9572 ns            5
BM_gemm_blasfeo/2/2/2/4096_mean             726 ns          726 ns            5
BM_gemm_blasfeo/2/2/2/4096_median           726 ns          726 ns            5
BM_gemm_blasfeo/2/2/2/4096_stddev         0.432 ns        0.431 ns            5
BM_gemm_blasfeo/3/3/3/4096_mean             729 ns          729 ns            5
BM_gemm_blasfeo/3/3/3/4096_median           729 ns          729 ns            5
BM_gemm_blasfeo/3/3/3/4096_stddev         0.096 ns        0.093 ns            5
BM_gemm_blasfeo/5/5/5/4096_mean            2475 ns         2475 ns            5
BM_gemm_blasfeo/5/5/5/4096_median          2475 ns         2475 ns            5
BM_gemm_blasfeo/5/5/5/4096_stddev         0.321 ns        0.321 ns            5
BM_gemm_blasfeo/10/10/10/4096_mean         3774 ns         3774 ns            5
BM_gemm_blasfeo/10/10/10/4096_median       3774 ns         3774 ns            5
BM_gemm_blasfeo/10/10/10/4096_stddev       2.18 ns         2.18 ns            5
BM_gemm_blasfeo/20/20/20/4096_mean          507 ns          507 ns            5
BM_gemm_blasfeo/20/20/20/4096_median        507 ns          507 ns            5
BM_gemm_blasfeo/20/20/20/4096_stddev      0.024 ns        0.025 ns            5
BM_gemm_blasfeo/30/30/30/4096_mean         4270 ns         4270 ns            5
BM_gemm_blasfeo/30/30/30/4096_median       4270 ns         4270 ns            5
BM_gemm_blasfeo/30/30/30/4096_stddev      0.447 ns        0.467 ns            5

You can see that when the alignment is 32 the standard deviation of the timings is huge, while with alignment 4096 it is very small.

However, surprisingly, with the large alignment the performance is often many times worse than with the small one; compare:

BM_gemm_blasfeo/2/2/2/32_mean              80.2 ns         80.2 ns            5
BM_gemm_blasfeo/2/2/2/4096_mean             726 ns          726 ns            5
...
BM_gemm_blasfeo/5/5/5/32_mean               460 ns          460 ns            5
BM_gemm_blasfeo/5/5/5/4096_mean            2475 ns         2475 ns            5

This is very puzzling.


mkatliar avatar mkatliar commented on August 24, 2024

On top of that, what you benchmark is not just the call to the linear algebra routine, but also the memory allocation.

No, it is not. It benchmarks only the call to the routine. If you look in the source, only what is inside the for (auto _ : state) loop is benchmarked. The rest is just preparation of the benchmark (which happens once per batch, but is not timed).
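
For illustration, the structure of the benchmark is roughly as follows (a minimal sketch under my own assumptions, not the actual benchmark source; the BLASFEO calls blasfeo_memsize_dmat, blasfeo_create_dmat, blasfeo_pack_dmat and blasfeo_dgemm_nn and the header names are what I assume is being used):

#include <benchmark/benchmark.h>

#include <blasfeo_common.h>
#include <blasfeo_d_aux.h>
#include <blasfeo_d_blas.h>

#include <cstdlib>
#include <vector>

// Everything outside the for (auto _ : state) loop is setup and is not timed.
static void BM_gemm_blasfeo(benchmark::State& state)
{
    int const m = state.range(0), n = state.range(1), k = state.range(2);
    std::size_t const alignment = state.range(3);

    // Round a size up to a multiple of the alignment, as required by std::aligned_alloc.
    auto const padded = [alignment](std::size_t size) { return (size + alignment - 1) / alignment * alignment; };

    // Allocate memory for the packed matrices with the requested alignment (not timed).
    void * mem_A = std::aligned_alloc(alignment, padded(blasfeo_memsize_dmat(m, k)));
    void * mem_B = std::aligned_alloc(alignment, padded(blasfeo_memsize_dmat(k, n)));
    void * mem_C = std::aligned_alloc(alignment, padded(blasfeo_memsize_dmat(m, n)));

    blasfeo_dmat A, B, C;
    blasfeo_create_dmat(m, k, &A, mem_A);
    blasfeo_create_dmat(k, n, &B, mem_B);
    blasfeo_create_dmat(m, n, &C, mem_C);

    // Fill A and B with some data (not timed).
    std::vector<double> a(m * k, 1.0), b(k * n, 1.0);
    blasfeo_pack_dmat(m, k, a.data(), m, &A, 0, 0);
    blasfeo_pack_dmat(k, n, b.data(), k, &B, 0, 0);

    // Only the body of this loop is timed: C = 1.0 * A * B + 0.0 * C.
    for (auto _ : state)
        blasfeo_dgemm_nn(m, n, k, 1.0, &A, 0, 0, &B, 0, 0, 0.0, &C, 0, 0, &C, 0, 0);

    std::free(mem_A);
    std::free(mem_B);
    std::free(mem_C);
}
BENCHMARK(BM_gemm_blasfeo)->Args({10, 10, 10, 32})->Args({10, 10, 10, 4096});
BENCHMARK_MAIN();

Google Benchmark picks the iteration count of the loop automatically, which is what the calibration shown in the --v=2 log above corresponds to.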


giaf avatar giaf commented on August 24, 2024

OK, sorry, I didn't get how that smart auto-benchmarking loop works. Alright then: memory allocation is not benchmarked, only the linear algebra routine call.


mkatliar avatar mkatliar commented on August 24, 2024

It seems that it is not the alignment itself that matters, but the actual address of the data arrays. This is confirmed by the following test, where the alignment is kept at 0x20 bytes but the arrays are reused between different runs of the benchmark with the same sizes:

-----------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
BM_gemm_blasfeo_reuse_memory/2/2/2/32_mean             124 ns          124 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_median           124 ns          124 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_stddev         0.061 ns        0.061 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_mean             557 ns          557 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_median           557 ns          557 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_stddev         0.658 ns        0.661 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_mean            2506 ns         2505 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_median          2506 ns         2506 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_stddev         0.373 ns        0.379 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_mean         4280 ns         4280 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_median       4280 ns         4280 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_stddev       1.19 ns         1.19 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_mean          511 ns          511 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_median        511 ns          511 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_stddev      0.149 ns        0.147 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_mean         1641 ns         1641 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_median       1641 ns         1641 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_stddev      0.540 ns        0.538 ns            5

You can see that keeping the location of the data fixed removes the variability of the timings.

However, one would expect the same performance regardless of the location of the data, which is not the case.

Interestingly, you can also see from the results that the time does not necessarily increase with the size of the problem. This is probably caused by the large variation of the time due to the different data locations.


mkatliar avatar mkatliar commented on August 24, 2024

Although re-running the benchmarks gives very similar results, running them in reverse order gives very different results, probably due to a changed placement of the memory chunks:

------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
BM_gemm_blasfeo_reuse_memory/30/30/30/32_mean         9420 ns         9420 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_median       9427 ns         9427 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_stddev       36.4 ns         36.5 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_mean          508 ns          508 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_median        507 ns          507 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_stddev      0.355 ns        0.354 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_mean         4781 ns         4781 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_median       4781 ns         4780 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_stddev       2.00 ns         1.97 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_mean            2297 ns         2297 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_median          2297 ns         2297 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_stddev         0.048 ns        0.043 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_mean             341 ns          341 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_median           341 ns          341 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_stddev         0.026 ns        0.025 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_mean             423 ns          423 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_median           423 ns          423 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_stddev         0.037 ns        0.037 ns            5


mkatliar avatar mkatliar commented on August 24, 2024

The current version of the benchmark sources: gemm-benchmark.zip


mkatliar avatar mkatliar commented on August 24, 2024

The same behavior is observed on a machine with an AMD Bulldozer CPU and BLASFEO built with TARGET=X64_AMD_BULLDOZER.


mkatliar avatar mkatliar commented on August 24, 2024

For syscop members, the benchmark code is accessible in the tmpc repo:

git clone [email protected]:mikhail.katliar/tmpc.git -b 44-mpipm
cd tmpc/bench/blasfeo
make
./bench

You need to install Google Benchmark first; see https://github.com/google/benchmark#installation


mkatliar avatar mkatliar commented on August 24, 2024

An explanation of the difference between BM_gemm_blasfeo and BM_gemm_blasfeo_reuse_memory.

The benchmark is run as follows. For every combination of parameters (m, n, k, alignment), a loop in which the function is called (let's call it a batch) is run several times. Then the mean, median, and standard deviation across the batches are computed. A sketch of the reuse-memory variant is given after the list.

  • BM_gemm_blasfeo:
    Before running every batch, new memory chunks are allocated.

  • BM_gemm_blasfeo_reuse_memory:
    All batches corresponding to the same set of parameters (m, n, k, alignment) use the same memory chunks.
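
A sketch of the reuse-memory variant under the same assumptions (again not the actual source; only the benchmark function itself is shown, registration and main being as in the previous sketch): the matrices for a given (m, n, k, alignment) combination are created on first use and cached, so every batch with the same parameters operates on the same addresses.

#include <benchmark/benchmark.h>

#include <blasfeo_common.h>
#include <blasfeo_d_aux.h>
#include <blasfeo_d_blas.h>

#include <cstdlib>
#include <map>
#include <tuple>
#include <vector>

struct GemmData
{
    blasfeo_dmat A, B, C;
};

// One set of matrices per (m, n, k, alignment) combination, created on first use
// and reused by every subsequent batch with the same parameters.
static GemmData & dataFor(int m, int n, int k, std::size_t alignment)
{
    static std::map<std::tuple<int, int, int, std::size_t>, GemmData> cache;

    auto const [it, inserted] = cache.try_emplace(std::make_tuple(m, n, k, alignment));
    if (inserted)
    {
        auto const alloc = [alignment](int rows, int cols) {
            std::size_t size = blasfeo_memsize_dmat(rows, cols);
            size = (size + alignment - 1) / alignment * alignment;  // std::aligned_alloc needs a multiple of the alignment
            return std::aligned_alloc(alignment, size);             // kept for the whole process lifetime in this sketch
        };
        blasfeo_create_dmat(m, k, &it->second.A, alloc(m, k));
        blasfeo_create_dmat(k, n, &it->second.B, alloc(k, n));
        blasfeo_create_dmat(m, n, &it->second.C, alloc(m, n));

        // Initialize the data once.
        std::vector<double> a(m * k, 1.0), b(k * n, 1.0), c(m * n, 0.0);
        blasfeo_pack_dmat(m, k, a.data(), m, &it->second.A, 0, 0);
        blasfeo_pack_dmat(k, n, b.data(), k, &it->second.B, 0, 0);
        blasfeo_pack_dmat(m, n, c.data(), m, &it->second.C, 0, 0);
    }
    return it->second;
}

static void BM_gemm_blasfeo_reuse_memory(benchmark::State & state)
{
    int const m = state.range(0), n = state.range(1), k = state.range(2);
    GemmData & d = dataFor(m, n, k, state.range(3));

    for (auto _ : state)  // only this loop is timed
        blasfeo_dgemm_nn(m, n, k, 1.0, &d.A, 0, 0, &d.B, 0, 0, 0.0, &d.C, 0, 0, &d.C, 0, 0);
}
BENCHMARK(BM_gemm_blasfeo_reuse_memory)->Args({2, 2, 2, 32});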


giaf avatar giaf commented on August 24, 2024

This makes it clear that the performance penalty comes from the choice of the specific memory address.

I would like to clarify that your chosen alignment, 32 bytes, is the minimum for correct behavior (i.e. to avoid segmentation faults), but it is not the recommended one in the BLASFEO API. As you can see e.g. here https://github.com/giaf/blasfeo/blob/master/auxiliary/blasfeo_stdlib.c#L36, if you use the BLASFEO functionality to allocate aligned memory, the address is aligned to typical cache line boundaries, i.e. 64 bytes.

Having a memory address aligned to 32 bytes but not to 64 bytes is expected to slightly increase the computational cost: the matrix then occupies more cache lines in memory, and therefore more memory has to be moved through the cache hierarchy.

But the effect seen in some of your tests is much larger than the expected one. A possible explanation is that sometimes you also hit memory page boundaries, or something like that.
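
For completeness, a sketch of what allocating through the BLASFEO helpers would look like (assuming the functions blasfeo_malloc_align, blasfeo_memsize_dmat and blasfeo_create_dmat and the header names below; please double-check against the installed headers):

#include <blasfeo_common.h>
#include <blasfeo_d_aux.h>
#include <blasfeo_stdlib.h>

// Create an m x n blasfeo_dmat on memory allocated by BLASFEO itself,
// so that the recommended (cache-line) alignment is used.
void create_aligned_dmat(int m, int n, blasfeo_dmat * sA, void ** mem)
{
    blasfeo_malloc_align(mem, blasfeo_memsize_dmat(m, n));  // aligned allocation provided by BLASFEO
    blasfeo_create_dmat(m, n, sA, *mem);
}

// ... when done: blasfeo_free_align(*mem);

If I remember correctly, blasfeo_allocate_dmat combines these two steps into a single call.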


mkatliar avatar mkatliar commented on August 24, 2024

The same behavior is observed with an alignment of 0x40 bytes.


mkatliar avatar mkatliar commented on August 24, 2024

"Sometimes you also hit memory page boundaries" would also occur in the BM_gemm_cblas, but we don't see such effect.


giaf avatar giaf commented on August 24, 2024

would also apply to BM_gemm_cblas, but we don't see such an effect there.

That depends on the implementation. E.g. OpenBLAS typically packs all data into internally allocated buffers (which are likely properly aligned), so unaligned data is accessed only once.

Conversely, BLASFEO (and especially the BLASFEO API) operates natively on some or all of the matrix arguments, so unaligned data is accessed more often.


mkatliar avatar mkatliar commented on August 24, 2024

But do you agree that the described behavior of BLASFEO should be fixed?


giaf avatar giaf commented on August 24, 2024

You can try out MKL instead of OpenBLAS, for the 'NN' and 'NT' variants of the Fortran BLAS API (so check which the corresponding variants are through the CBLAS API; the correspondence is sketched below).

Depending on the outcome, we can speak about "fixes".
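
For reference, the 'NN' and 'NT' variants of the Fortran dgemm correspond to the following CBLAS calls (a sketch assuming column-major layout; the cblas.h header name may differ between BLAS distributions):

#include <cblas.h>

// Fortran dgemm('N', 'N', ...): C = alpha*A*B + beta*C
void gemm_nn(int m, int n, int k, double alpha, double const * A, int lda,
             double const * B, int ldb, double beta, double * C, int ldc)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}

// Fortran dgemm('N', 'T', ...): C = alpha*A*B^T + beta*C
void gemm_nt(int m, int n, int k, double alpha, double const * A, int lda,
             double const * B, int ldb, double beta, double * C, int ldc)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}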


mkatliar avatar mkatliar commented on August 24, 2024

Well, once we know about this issue, one can no longer say anything meaningful about BLASFEO performance. Instead of saying "this function executes in 30 ns" one should say "this function executes in between 30 and 300, or maybe 500 ns, depending on where you allocate memory, and we don't know how it depends on that".


giaf avatar giaf commented on August 24, 2024

Before undertaking a complete reimplementation of BLASFEO (since the choices about internal memory formats are at its very foundations), I will personally investigate the issue with independent benchmarks.

If there is some "bug" to be fixed (i.e. something specific and non-fundamental which triggers the behavior), it will be fixed; but if this only comes from the implementation choices, I guess a note about providing a minimum alignment for maximum performance will have to do.

In some benchmarks I did in the past, I encountered some performance degradation in MKL if memory is not aligned, so if MKL is fine with that, I'm fine too ;)

Thanks for the deep investigation of the issue, and I'll keep you updated on the next steps!


mkatliar avatar mkatliar commented on August 24, 2024

Results with MKL BLAS (Fortran API, although I don't see why the Fortran API vs. the CBLAS API would make a difference):

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_gemm_blas/2/2/2_mean             155 ns          155 ns            5
BM_gemm_blas/2/2/2_median           155 ns          155 ns            5
BM_gemm_blas/2/2/2_stddev         0.037 ns        0.036 ns            5
BM_gemm_blas/3/3/3_mean             162 ns          162 ns            5
BM_gemm_blas/3/3/3_median           162 ns          162 ns            5
BM_gemm_blas/3/3/3_stddev         0.023 ns        0.020 ns            5
BM_gemm_blas/5/5/5_mean             186 ns          186 ns            5
BM_gemm_blas/5/5/5_median           186 ns          186 ns            5
BM_gemm_blas/5/5/5_stddev         0.018 ns        0.019 ns            5
BM_gemm_blas/10/10/10_mean          265 ns          265 ns            5
BM_gemm_blas/10/10/10_median        265 ns          265 ns            5
BM_gemm_blas/10/10/10_stddev      0.018 ns        0.018 ns            5
BM_gemm_blas/20/20/20_mean          727 ns          727 ns            5
BM_gemm_blas/20/20/20_median        729 ns          729 ns            5
BM_gemm_blas/20/20/20_stddev       2.33 ns         2.33 ns            5
BM_gemm_blas/30/30/30_mean         1738 ns         1738 ns            5
BM_gemm_blas/30/30/30_median       1738 ns         1738 ns            5
BM_gemm_blas/30/30/30_stddev       1.58 ns         1.59 ns            5


giaf avatar giaf commented on August 24, 2024

As investigated here mkatliar/blasfeo_benchmark#4 (comment), the issue is due to the presence of denormals in the memory passed to BLASFEO to create the matrices. For matrices whose size is not a multiple of the minimum kernel size (usually 4x4), the matrix padded up to a multiple of 4x4 is actually used for computation, and the extra computed elements are discarded when writing the result from registers into memory. Therefore, if there are denormals in the "dark" (padding) part of the memory of that padded matrix, they do not affect the correctness of the result, but they may severely slow down the computations, as in this case.

As to why there are denormals in the memory at all (and consistently across different test machines and OSs), this may be due to what Google Benchmark itself does.
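
Assuming this analysis is correct, two caller-side workarounds come to mind (a sketch of my own, not something the BLASFEO API prescribes): zero the whole memory chunk before creating the matrix on it, or enable flush-to-zero / denormals-are-zero on x86. Note that the second option changes floating-point behavior for denormal values process-wide.

#include <blasfeo_common.h>
#include <blasfeo_d_aux.h>

#include <cstring>
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// Workaround 1: zero the whole memory chunk (including the padding of the packed
// storage format) before wrapping it into a blasfeo_dmat, so no stale denormals remain.
void create_zeroed_dmat(int m, int n, blasfeo_dmat * sA, void * mem)
{
    std::memset(mem, 0, blasfeo_memsize_dmat(m, n));
    blasfeo_create_dmat(m, n, sA, mem);
}

// Workaround 2: treat denormals as zero on x86, so operations on leftover
// denormals in the padding are not slow (this changes denormal semantics!).
void enable_ftz_daz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}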


mkatliar avatar mkatliar commented on August 24, 2024

It is BLASFEO's responsibility to ensure that the matrix structure is properly functional after blasfeo_create_dmat(). The client's responsibility is to allocate the memory and properly set the elements of the matrices.

I don't consider this issue "Closed". To fix it, blasfeo_create_dmat() needs to be changed such that the padding elements are initialized to 0.


giaf avatar giaf commented on August 24, 2024

We will write a "best practice" in the BLASFEO repo, but what you propose is not the solution.

If you know what you are doing, you can already put the data in the memory and then pass it to blasfeo_create_dmat, which simply puts a BLASFEO struct around it so that the BLASFEO API routines can be called on it. We use this trick in some of our projects (as I know what to do, since I wrote it). Setting the "padding" memory to zero would break this.


giaf avatar giaf commented on August 24, 2024

On top of that, blasfeo_create_dmat is very lightweight and does not touch the data at all, also for performance reasons. E.g. in acados, the workspace is allocated once, but BLASFEO matrices are repeatedly created on that memory. The memory allocation happens "off-line", and therefore it is cheap to zero the memory out there, while the creation of the BLASFEO structs happens "on-line" and therefore should be cheap by design.
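
A sketch of that usage pattern as I understand it (my own illustration; the actual acados code may differ): the workspace is allocated and zeroed once off-line, and the cheap blasfeo_create_dmat call then wraps matrices around that same memory repeatedly on-line.

#include <blasfeo_common.h>
#include <blasfeo_d_aux.h>
#include <blasfeo_stdlib.h>

#include <cstring>

// Off-line: allocate the workspace once and zero it; doing this here is cheap
// because it happens only once.
void * workspace_create(int m, int n)
{
    void * mem = nullptr;
    blasfeo_malloc_align(&mem, blasfeo_memsize_dmat(m, n));
    std::memset(mem, 0, blasfeo_memsize_dmat(m, n));
    return mem;
}

// On-line: repeatedly wrap a blasfeo_dmat around the existing workspace memory.
// blasfeo_create_dmat only fills in the struct fields and does not touch the data,
// which is why it is cheap enough for the hot path.
void online_iteration(int m, int n, void * workspace)
{
    blasfeo_dmat A;
    blasfeo_create_dmat(m, n, &A, workspace);
    // ... use A with BLASFEO routines ...
}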


mkatliar avatar mkatliar commented on August 24, 2024

If you know what you are doing, you can already put the data in the memory and then pass it to blasfeo_create_dmat, which simply puts a BLASFEO struct around it so that the BLASFEO API routines can be called on it.

"What to do" depends on the internal storage format of the matrices and the panel size. This can change depending on the BLASFEO target architecture and the internal implementation of BLASFEO, which as we know can undergo changes. The layout of the allocated memory is not documented and is unknown to the library user. The user should be abstracted from this knowledge, and this is the purpose of the blasfeo_dmat struct and the related functions.

The way you do it creates potential problems for potentially many users in many places. Instead of solving the problem in one place and making the user safe (and saving his of her lifetime), you create these potential problems, for the sake of vague performance considerations.


giaf avatar giaf commented on August 24, 2024

"What to do" depends on the internal storage format of the matrices and the panel size. This can change depending on the BLASFEO target architecture and the internal implementation of BLASFEO, which as we know can undergo changes. The layout of the allocated memory is not documented and is unknown to the library user. The user should be abstracted from this knowledge, and this is the purpose of the blasfeo_dmat struct and the related functions.

Sure, it is undocumented exactly because it is unsafe. But nonetheless, since we are the developers of BLASFEO, we can use knowledge about its internals in our own software.


giaf avatar giaf commented on August 24, 2024

The way you do it creates potential problems for potentially many users in many places. Instead of solving the problem in one place and making the user safe (and saving his or her lifetime), you create these potential problems for the sake of vague performance considerations.

Just to be clear, the results computed by the routine are CORRECT. This does not create potential problems for potentially many users in many places.

Computational speed can be lower if some good-practice rules are not followed. We were aware of slight performance drops in the case of unaligned memory, but they are on the order of a few percent, and therefore not of interest to the normal user.

We were not aware of the large performance penalty (on the order of 10x or more) that can arise from having denormals in the padding memory. Now that we are aware of it, we will add a note to make sure that other users can avoid this issue.

We thank you for your performance issue report and for the effort and help in finding the reason for it. The lifetime you invested there will not be wasted.

