@giaf I would appreciate it if you did some independent benchmarking. I have created a separate repo with my benchmark code here: https://github.com/mkatliar/blasfeo_benchmark. It may be a good place to start from and to add your independent benchmarks.
from blasfeo.
I am not sure whether or not the issue is related to the details of how the time is measured in the benchmark. But the difference is surprising, and I have not seen such strange results when benchmarking non-BLASFEO code.
With the --v=2 benchmark option I can get some more info.
Command:
./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --v=2
Output
[...]
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 1
-- LOG(2): Ran in 1.1109e-05/1.1244e-05
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 10
-- LOG(2): Ran in 9.1714e-05/9.1772e-05
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 100
-- LOG(2): Ran in 0.000906435/0.000906326
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 1000
-- LOG(2): Ran in 0.00914121/0.00914061
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 10000
-- LOG(2): Ran in 0.0923856/0.0924196
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687447/0.687481
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687436/0.687472
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687106/0.687139
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.252108/0.25214
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.252166/0.252203
BM_gemm_blasfeo/10/10/10 9073 ns 9073 ns 75769
BM_gemm_blasfeo/10/10/10 9073 ns 9073 ns 75769
BM_gemm_blasfeo/10/10/10 9069 ns 9068 ns 75769
BM_gemm_blasfeo/10/10/10 3328 ns 3327 ns 75769
BM_gemm_blasfeo/10/10/10 3329 ns 3328 ns 75769
BM_gemm_blasfeo/10/10/10_mean 6774 ns 6774 ns 5
BM_gemm_blasfeo/10/10/10_median 9069 ns 9068 ns 5
BM_gemm_blasfeo/10/10/10_stddev 3146 ns 3146 ns 5
You can see that the benchmark was run with an increasing iteration count to determine the required number of iterations, which came out to 75769. Then a batch of 75769 calls to gemm was run 5 times, where the first three batches executed in about 0.69 s and the last two in about 0.25 s.
The issue is not reproduced if BLASFEO is compiled with LA = REFERENCE instead of LA = HIGH_PERFORMANCE.
I suspect the prefetch instructions:
ROUTINE ======================== inner_kernel_dgemm_nn_4x4_lib4
16669307243 16669307243 (flat, cum) 26.66% of Total
. . 56e30: cmp $0x0,%r10d ;inner_kernel_dgemm_nn_4x4_lib4
. . 56e34: jle 570a0 <inner_kernel_dgemm_nn_4x4_lib4+0x270>
. . 56e3a: vxorpd %ymm4,%ymm4,%ymm4
. . 56e3e: vmovapd %ymm4,%ymm5
14330999 14330999 56e42: vmovapd %ymm4,%ymm6 ;inner_kernel_dgemm_nn_4x4_lib4 ??
. . 56e46: vmovapd %ymm4,%ymm7 ;inner_kernel_dgemm_nn_4x4_lib4
. . 56e4a: vmovapd (%r11),%ymm13
. . 56e4f: cmp $0x4,%r10d
. . 56e53: jle 56f59 <inner_kernel_dgemm_nn_4x4_lib4+0x129>
14337096 14337096 56e59: nopl 0x0(%rax) ;inner_kernel_dgemm_nn_4x4_lib4 ??
34239958 34239958 56e60: prefetcht0 (%r12,%r13,2)
86003288 86003288 56e65: prefetcht0 0x40(%r12,%r13,2)
39820550 39820550 56e6b: vbroadcastsd (%r12),%ymm12
1414449993 1414449993 56e71: vfmadd231pd %ymm13,%ymm12,%ymm0
273971468 273971468 56e76: vmovapd 0x20(%r11),%ymm14
78039046 78039046 56e7c: vbroadcastsd 0x20(%r12),%ymm12
I commented out all prefetcht0 instructions in the BLASFEO code and rebuilt BLASFEO, but the issue is still reproduced.
Is the issue happening for the HIGH_PERFORMANCE targets INTEL_SANDY_BRIDGE, INTEL_CORE, and AMD_BULLDOZER?
The important difference between the benchmark runs is where the memory is allocated. I found out that the alignment of the array data has a dramatic effect on the time.
Below are the results from the benchmark run with different allocated-memory alignments (the last parameter). You can see that when the alignment is 0x20 the time varies a lot; when it is 0x1000, it does not vary.
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_gemm_blasfeo/2/2/2/32 123 ns 123 ns 10000000
BM_gemm_blasfeo/2/2/2/32 33.5 ns 33.5 ns 10000000
BM_gemm_blasfeo/2/2/2/32 80.4 ns 80.4 ns 10000000
BM_gemm_blasfeo/2/2/2/32 80.4 ns 80.4 ns 10000000
BM_gemm_blasfeo/2/2/2/32 210 ns 210 ns 10000000
BM_gemm_blasfeo/2/2/2/32_mean 105 ns 105 ns 5
BM_gemm_blasfeo/2/2/2/32_median 80.4 ns 80.4 ns 5
BM_gemm_blasfeo/2/2/2/32_stddev 66.4 ns 66.4 ns 5
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096_mean 33.6 ns 33.6 ns 5
BM_gemm_blasfeo/2/2/2/4096_median 33.6 ns 33.6 ns 5
BM_gemm_blasfeo/2/2/2/4096_stddev 0.015 ns 0.015 ns 5
For the BLASFEO API, 32 bytes is the minimum alignment needed to avoid segfaults, but 64 bytes is the alignment assumed for performance. Still, performance deviations should not be that large.
On top of that, what you benchmark is not just the call to the linear algebra routine, but also the memory allocation. And in your script the memory allocation happens differently for the BLASFEO case (alignedAlloc) and for the CBLAS case (std::make_unique). Additionally, depending on when all these small memory chunks are freed, this may pollute the memory managed by the allocator quite a lot.
In summary, it is not clear to me whether the difference in computational time comes from the execution of the BLASFEO routine or from the memory allocation routine.
Here is the updated benchmark, where alignment is added as an extra parameter.
gemm-benchmark.zip
The results from the run
./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --benchmark_report_aggregates_only=true
are below:
-------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------
BM_gemm_blasfeo/2/2/2/32_mean 80.2 ns 80.2 ns 5
BM_gemm_blasfeo/2/2/2/32_median 80.8 ns 80.8 ns 5
BM_gemm_blasfeo/2/2/2/32_stddev 31.4 ns 31.4 ns 5
BM_gemm_blasfeo/3/3/3/32_mean 194 ns 194 ns 5
BM_gemm_blasfeo/3/3/3/32_median 214 ns 214 ns 5
BM_gemm_blasfeo/3/3/3/32_stddev 164 ns 164 ns 5
BM_gemm_blasfeo/5/5/5/32_mean 460 ns 460 ns 5
BM_gemm_blasfeo/5/5/5/32_median 376 ns 376 ns 5
BM_gemm_blasfeo/5/5/5/32_stddev 239 ns 239 ns 5
BM_gemm_blasfeo/10/10/10/32_mean 3584 ns 3584 ns 5
BM_gemm_blasfeo/10/10/10/32_median 3775 ns 3775 ns 5
BM_gemm_blasfeo/10/10/10/32_stddev 285 ns 285 ns 5
BM_gemm_blasfeo/20/20/20/32_mean 514 ns 514 ns 5
BM_gemm_blasfeo/20/20/20/32_median 515 ns 515 ns 5
BM_gemm_blasfeo/20/20/20/32_stddev 4.86 ns 4.86 ns 5
BM_gemm_blasfeo/30/30/30/32_mean 10297 ns 10297 ns 5
BM_gemm_blasfeo/30/30/30/32_median 6358 ns 6358 ns 5
BM_gemm_blasfeo/30/30/30/32_stddev 9572 ns 9572 ns 5
BM_gemm_blasfeo/2/2/2/4096_mean 726 ns 726 ns 5
BM_gemm_blasfeo/2/2/2/4096_median 726 ns 726 ns 5
BM_gemm_blasfeo/2/2/2/4096_stddev 0.432 ns 0.431 ns 5
BM_gemm_blasfeo/3/3/3/4096_mean 729 ns 729 ns 5
BM_gemm_blasfeo/3/3/3/4096_median 729 ns 729 ns 5
BM_gemm_blasfeo/3/3/3/4096_stddev 0.096 ns 0.093 ns 5
BM_gemm_blasfeo/5/5/5/4096_mean 2475 ns 2475 ns 5
BM_gemm_blasfeo/5/5/5/4096_median 2475 ns 2475 ns 5
BM_gemm_blasfeo/5/5/5/4096_stddev 0.321 ns 0.321 ns 5
BM_gemm_blasfeo/10/10/10/4096_mean 3774 ns 3774 ns 5
BM_gemm_blasfeo/10/10/10/4096_median 3774 ns 3774 ns 5
BM_gemm_blasfeo/10/10/10/4096_stddev 2.18 ns 2.18 ns 5
BM_gemm_blasfeo/20/20/20/4096_mean 507 ns 507 ns 5
BM_gemm_blasfeo/20/20/20/4096_median 507 ns 507 ns 5
BM_gemm_blasfeo/20/20/20/4096_stddev 0.024 ns 0.025 ns 5
BM_gemm_blasfeo/30/30/30/4096_mean 4270 ns 4270 ns 5
BM_gemm_blasfeo/30/30/30/4096_median 4270 ns 4270 ns 5
BM_gemm_blasfeo/30/30/30/4096_stddev 0.447 ns 0.467 ns 5
You can see that when the alignment is 32 the standard deviation of the time is huge, whereas with alignment 4096 it is very small.
However, surprisingly, with the large alignment the performance is often many times worse than with the small one; compare:
BM_gemm_blasfeo/2/2/2/32_mean 80.2 ns 80.2 ns 5
BM_gemm_blasfeo/2/2/2/4096_mean 726 ns 726 ns 5
...
BM_gemm_blasfeo/5/5/5/32_mean 460 ns 460 ns 5
BM_gemm_blasfeo/5/5/5/4096_mean 2475 ns 2475 ns 5
This is very puzzling.
On top of that, what you benchmark is not just the call to the linear algebra routine, but also the memory allocation.
No, it does not. It benchmarks only the call to the routine. If you look at the source, only what is inside the for (auto _ : state) loop is timed. The rest is just preparation of the benchmark (which happens once per batch and is not timed).
OK, sorry, I didn't get how that smart auto-benchmarking loop works. Alright then, memory allocation is not benchmarked, only the linear algebra routine call.
It seems that it is not the alignment itself that matters, but the actual address of the data arrays. This is confirmed by the following test, where the alignment is kept at 0x20 bytes, but the arrays are reused between runs of the benchmark with the same sizes:
-----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------
BM_gemm_blasfeo_reuse_memory/2/2/2/32_mean 124 ns 124 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_median 124 ns 124 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_stddev 0.061 ns 0.061 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_mean 557 ns 557 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_median 557 ns 557 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_stddev 0.658 ns 0.661 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_mean 2506 ns 2505 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_median 2506 ns 2506 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_stddev 0.373 ns 0.379 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_mean 4280 ns 4280 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_median 4280 ns 4280 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_stddev 1.19 ns 1.19 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_mean 511 ns 511 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_median 511 ns 511 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_stddev 0.149 ns 0.147 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_mean 1641 ns 1641 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_median 1641 ns 1641 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_stddev 0.540 ns 0.538 ns 5
You can see that keeping the location of the data fixed removes the variability of the timings.
However, one would expect the same performance regardless of the location of the data, which is not the case.
Interestingly, you can also see from the results that the time does not necessarily increase with the size of the problem. This is probably caused by the large variation of the time due to different data locations.
Although re-running the benchmarks gives very similar results, running them in reverse order gives very different results, probably due to the changed placement of the memory chunks:
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------
BM_gemm_blasfeo_reuse_memory/30/30/30/32_mean 9420 ns 9420 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_median 9427 ns 9427 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_stddev 36.4 ns 36.5 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_mean 508 ns 508 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_median 507 ns 507 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_stddev 0.355 ns 0.354 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_mean 4781 ns 4781 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_median 4781 ns 4780 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_stddev 2.00 ns 1.97 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_mean 2297 ns 2297 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_median 2297 ns 2297 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_stddev 0.048 ns 0.043 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_mean 341 ns 341 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_median 341 ns 341 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_stddev 0.026 ns 0.025 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_mean 423 ns 423 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_median 423 ns 423 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_stddev 0.037 ns 0.037 ns 5
The current version of the benchmark sources: gemm-benchmark.zip
The same behavior is observed on a machine with an AMD Bulldozer CPU and BLASFEO built with TARGET=X64_AMD_BULLDOZER.
For syscop members, the benchmark code is accessible in the tmpc repo:
git clone [email protected]:mikhail.katliar/tmpc.git -b 44-mpipm
cd tmpc/bench/blasfeo
make
./bench
You need to install Google Benchmark first, see https://github.com/google/benchmark#installation
An explanation of the difference between BM_gemm_blasfeo and BM_gemm_blasfeo_reuse_memory:
The benchmark is run as follows. For every combination of parameters (m, n, k, alignment), a loop in which the function is called (let's call it a batch) is run several times. Then the mean, median, and standard deviation across batches are computed.
- BM_gemm_blasfeo: before running every batch, new memory chunks are allocated.
- BM_gemm_blasfeo_reuse_memory: all batches corresponding to the same set of parameters (m, n, k, alignment) use the same memory chunks.
This makes it clear that the performance penalty comes from the choice of the specific memory address.
I would like to clarify that your chosen alignment, 32 bytes, is the minimum needed for correct behavior (i.e. to avoid segmentation faults), but it is not the one recommended by the BLASFEO API. As you can see e.g. here https://github.com/giaf/blasfeo/blob/master/auxiliary/blasfeo_stdlib.c#L36, if you use the BLASFEO functionality to allocate aligned memory, the address is aligned to typical cache line boundaries, i.e. 64 bytes.
Having a memory address aligned to 32 bytes but not to 64 bytes is expected to slightly increase the computational cost, since the matrix then occupies more cache lines, and therefore more memory has to be moved through the cache hierarchy.
But the effect seen in some of your tests is much larger than the expected one. A possible explanation is that sometimes you also hit memory page boundaries, or something like that.
The same behavior is observed with an alignment of 0x40 bytes.
"Sometimes you also hit memory page boundaries" would also occur in BM_gemm_cblas, but we don't see such an effect there.
would also occur in the BM_gemm_cblas, but we don't see such effect.
That depends on the implementation. E.g. OpenBLAS typically packs all data into internally allocated buffers (likely properly aligned), so unaligned data is accessed only once.
Conversely, BLASFEO (and especially the BLASFEO API) operates natively on some or all of the matrix arguments, so unaligned data is accessed more often.
But do you agree that the described behavior of BLASFEO should be fixed?
You can try out MKL instead of OpenBLAS, for the variants 'NN' and 'NT' in the Fortran BLAS API (so check which the corresponding variants are in the CBLAS API).
Depending on the outcome, we can talk about "fixes".
Well, once we know about this issue, one can no longer say anything meaningful about BLASFEO performance. Instead of saying "this function executes in 30 ns", one has to say "this function executes in between 30 and 300, or maybe 500 ns, depending on where you allocate memory, and we don't know how it depends on that".
Before undertaking a complete reimplementation of BLASFEO (since the choices about internal memory formats are at its very foundations), I will investigate the issue personally and with independent benchmarks.
If there is some "bug" to be fixed (i.e. something specific and non-fundamental that triggers the behavior), it will be fixed; but if this comes purely from the implementation choices, I guess a note about providing a minimum alignment for maximum performance will have to do.
In some benchmarks I did in the past, I encountered some performance degradation in MKL when memory is not aligned, so if MKL is fine with that, I'm fine too ;)
Thanks for the deep investigation of the issue, and I'll keep you updated on the next steps!
Results with MKL BLAS (Fortran API, although I don't see why the Fortran API vs. the CBLAS API would make a difference):
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
BM_gemm_blas/2/2/2_mean 155 ns 155 ns 5
BM_gemm_blas/2/2/2_median 155 ns 155 ns 5
BM_gemm_blas/2/2/2_stddev 0.037 ns 0.036 ns 5
BM_gemm_blas/3/3/3_mean 162 ns 162 ns 5
BM_gemm_blas/3/3/3_median 162 ns 162 ns 5
BM_gemm_blas/3/3/3_stddev 0.023 ns 0.020 ns 5
BM_gemm_blas/5/5/5_mean 186 ns 186 ns 5
BM_gemm_blas/5/5/5_median 186 ns 186 ns 5
BM_gemm_blas/5/5/5_stddev 0.018 ns 0.019 ns 5
BM_gemm_blas/10/10/10_mean 265 ns 265 ns 5
BM_gemm_blas/10/10/10_median 265 ns 265 ns 5
BM_gemm_blas/10/10/10_stddev 0.018 ns 0.018 ns 5
BM_gemm_blas/20/20/20_mean 727 ns 727 ns 5
BM_gemm_blas/20/20/20_median 729 ns 729 ns 5
BM_gemm_blas/20/20/20_stddev 2.33 ns 2.33 ns 5
BM_gemm_blas/30/30/30_mean 1738 ns 1738 ns 5
BM_gemm_blas/30/30/30_median 1738 ns 1738 ns 5
BM_gemm_blas/30/30/30_stddev 1.58 ns 1.59 ns 5
As investigated in mkatliar/blasfeo_benchmark#4 (comment), the issue is due to the presence of denormals in the memory passed to BLASFEO to create the matrices. For matrices whose size is not a multiple of the minimum kernel size (usually 4x4), a matrix padded up to a multiple of 4x4 is actually used for the computation, and the extra computed elements are discarded when the result is written from registers into memory. Therefore, if there are denormals in the "dark" (padding) part of the memory of the padded matrix, they do not affect the correctness of the result, but they may severely slow down the computation, as in this case.
As to why there are denormals in the memory at all (and consistently across different test machines and OSs), this may be due to what Google Benchmark itself does.
It is BLASFEO's responsibility to ensure that the matrix structure is properly functional after blasfeo_create_dmat(). The client's responsibility is to allocate the memory and to properly set the elements of the matrices.
I don't consider this issue "Closed". To fix it, blasfeo_create_dmat() needs to be changed such that the padding elements are initialized to 0.
We will write a "best practice" note in the BLASFEO repo, but what you propose is not the solution.
If you know what you are doing, you can put the data in memory beforehand and then pass it to blasfeo_create_dmat, which simply wraps a BLASFEO struct around it so that BLASFEO API routines can be called. We use this trick in some of our projects (where I know what to do, since I wrote it). Setting the "padding" memory to zero would break this.
On top of that, blasfeo_create_dmat is very lightweight and does not touch the data at all, also for performance reasons. E.g. in acados, the workspace is allocated once, but BLASFEO matrices are repeatedly created in that memory. The memory allocation happens "off-line", and therefore it is cheap to zero the memory out there, while the BLASFEO struct creation also happens "on-line" and therefore should be cheap by design.
If you know what you are doing, you can put the data in memory beforehand and then pass it to blasfeo_create_dmat, which simply wraps a BLASFEO struct around it so that BLASFEO API routines can be called.

"What to do" depends on the internal storage format of the matrices and on the panel size. These can change depending on the BLASFEO target architecture and on the internal implementation of BLASFEO, which, as we know, can undergo changes. The layout of the allocated memory is not documented and is unknown to the library user. The user should be abstracted from this knowledge; that is the purpose of the blasfeo_dmat struct and the related functions.
The way you do it creates potential problems for potentially many users in many places. Instead of solving the problem in one place and making the user safe (and saving his or her lifetime), you create these potential problems for the sake of vague performance considerations.
"What to do" depends on the internal storage format of the matrices and on the panel size. These can change depending on the BLASFEO target architecture and on the internal implementation of BLASFEO, which, as we know, can undergo changes. The layout of the allocated memory is not documented and is unknown to the library user. The user should be abstracted from this knowledge; that is the purpose of the blasfeo_dmat struct and the related functions.

Sure, it is undocumented exactly because it is unsafe. But nonetheless, since we are the developers of BLASFEO, we can use knowledge about its internals in our own software.
The way you do it creates potential problems for potentially many users in many places. Instead of solving the problem in one place and making the user safe (and saving his or her lifetime), you create these potential problems for the sake of vague performance considerations.

Just to be clear, the results computed by the routine are CORRECT. This does not create potential problems for potentially many users in many places.
Computational speed can be lower if some good-practice rules are not followed. We were aware of slight performance drops in the case of non-aligned memory, but they are on the order of a few percent, and therefore not of interest to the normal user.
We were not aware of the large performance penalty (on the order of 10x or more) that can arise from having denormals in the padding memory. Now that we are aware of it, we will add a note to make sure that other users can avoid this issue.
We thank you for your performance issue report and for the effort and help in finding its cause. The lifetime you invested here will not be wasted.