
Comments (17)

syoyo commented on June 10, 2024

I think PR #14 would solve this issue.

from embree-aarch64.

maikschulze commented on June 10, 2024

Hi,

I've briefly tested the current master state (36ad817) with my replication of the buildbench tutorial on my Android aarch64 smartphone (OnePlus 3T).

Unfortunately, the situation has not improved for this particular performance defect:
LOW_QUALITY: 0.74 seconds, 1.34 Mprims/s, 266 SAH build quality
MEDIUM_QUALITY: 0.55 seconds, 1.81 Mprims/s, 249 SAH build quality
HIGH_QUALITY: 1.86 seconds, 0.54 Mprims/s, 249 SAH build quality

The low-quality builder is still too slow. Please note that BUILD_IOS was not enabled. I will try to repeat this test, either on iOS or by enabling this flag on Android wherever applicable.

syoyo commented on June 10, 2024

Can you please test with the neon-fix branch? It addresses a NEON issue (#17) by backporting the BUILD_IOS code path (some NEON fixes/improvements by @pchang0414).

I will also try to run buildbench on our Jetson AGX Xavier. @maikschulze Which scene data did you use for benchmarking?

maikschulze commented on June 10, 2024

I made a mistake in my comment: I meant to refer to my replication of the bvh_builder tutorial, which synthesizes geometry. Sorry for the confusion.

I will have a look at the other branch. It seems you have already done the work I intended to do. Thanks :)

maikschulze commented on June 10, 2024

I've tested the neon-fix branch (7863a1c) with the internal tasking system and obtained:

LOW_QUALITY: 0.82 seconds, 1.22 Mprims/s, 266 SAH build quality
MEDIUM_QUALITY: 0.50 seconds, 2.02 Mprims/s, 249 SAH build quality
HIGH_QUALITY: 1.39 seconds, 0.72 Mprims/s, 249 SAH build quality

syoyo commented on June 10, 2024

Here is the result of bvh_builder on a Jetson AGX Xavier (ARMv8 Processor rev 0 (v8l), 8 cores) using the neon-fix branch.

gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)

$CMAKE_BIN \
  -DCMAKE_BUILD_TYPE=Release \
  -DEMBREE_ARM=On \
  -DEMBREE_ADDRESS_SANITIZER=Off \
  -DCMAKE_INSTALL_PREFIX=$HOME/local/embree3 \
  -DCMAKE_C_COMPILER=gcc \
  -DCMAKE_CXX_COMPILER=g++ \
  -DEMBREE_ISPC_SUPPORT=Off \
  -DEMBREE_TASKING_SYSTEM=Internal \
  -DEMBREE_TUTORIALS=On \
  -DEMBREE_MAX_ISA=SSE2 \
  -DEMBREE_RAY_PACKETS=Off \
  ..
Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 467.245ms, 4.92247 Mprims/s, sah = 363.265 [DONE]
iteration 1: building BVH over 2300000 primitives, 189.316ms, 12.149 Mprims/s, sah = 363.265 [DONE]
iteration 2: building BVH over 2300000 primitives, 199.916ms, 11.5048 Mprims/s, sah = 363.265 [DONE]
iteration 3: building BVH over 2300000 primitives, 174.691ms, 13.1661 Mprims/s, sah = 363.265 [DONE]
iteration 4: building BVH over 2300000 primitives, 178.087ms, 12.915 Mprims/s, sah = 363.265 [DONE]
iteration 5: building BVH over 2300000 primitives, 196.053ms, 11.7315 Mprims/s, sah = 363.265 [DONE]
iteration 6: building BVH over 2300000 primitives, 216.538ms, 10.6217 Mprims/s, sah = 363.265 [DONE]
iteration 7: building BVH over 2300000 primitives, 163.449ms, 14.0717 Mprims/s, sah = 363.265 [DONE]
iteration 8: building BVH over 2300000 primitives, 203.061ms, 11.3267 Mprims/s, sah = 363.265 [DONE]
iteration 9: building BVH over 2300000 primitives, 199.604ms, 11.5228 Mprims/s, sah = 363.265 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 793.211ms, 2.89961 Mprims/s, sah = 340.853 [DONE]
iteration 1: building BVH over 2300000 primitives, 436.282ms, 5.27182 Mprims/s, sah = 340.853 [DONE]
iteration 2: building BVH over 2300000 primitives, 440.721ms, 5.21872 Mprims/s, sah = 340.853 [DONE]
iteration 3: building BVH over 2300000 primitives, 430.462ms, 5.3431 Mprims/s, sah = 340.853 [DONE]
iteration 4: building BVH over 2300000 primitives, 447.181ms, 5.14333 Mprims/s, sah = 340.853 [DONE]
iteration 5: building BVH over 2300000 primitives, 429.659ms, 5.35308 Mprims/s, sah = 340.853 [DONE]
iteration 6: building BVH over 2300000 primitives, 368.533ms, 6.24096 Mprims/s, sah = 340.853 [DONE]
iteration 7: building BVH over 2300000 primitives, 380.974ms, 6.03716 Mprims/s, sah = 340.853 [DONE]
iteration 8: building BVH over 2300000 primitives, 387.172ms, 5.94051 Mprims/s, sah = 340.853 [DONE]
iteration 9: building BVH over 2300000 primitives, 412.654ms, 5.57368 Mprims/s, sah = 340.853 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1400.94ms, 1.64176 Mprims/s, sah = 339.742 [DONE]
iteration 1: building BVH over 2300000 primitives, 1122.35ms, 2.04927 Mprims/s, sah = 339.742 [DONE]
iteration 2: building BVH over 2300000 primitives, 999.972ms, 2.30006 Mprims/s, sah = 339.742 [DONE]
iteration 3: building BVH over 2300000 primitives, 862.534ms, 2.66656 Mprims/s, sah = 339.742 [DONE]
iteration 4: building BVH over 2300000 primitives, 811.949ms, 2.83269 Mprims/s, sah = 339.742 [DONE]
iteration 5: building BVH over 2300000 primitives, 834.805ms, 2.75513 Mprims/s, sah = 339.742 [DONE]
iteration 6: building BVH over 2300000 primitives, 821.769ms, 2.79884 Mprims/s, sah = 339.742 [DONE]
iteration 7: building BVH over 2300000 primitives, 796.84ms, 2.8864 Mprims/s, sah = 339.742 [DONE]
iteration 8: building BVH over 2300000 primitives, 754.551ms, 3.04817 Mprims/s, sah = 339.742 [DONE]
iteration 9: building BVH over 2300000 primitives, 736.78ms, 3.12169 Mprims/s, sah = 339.742 [DONE]

clang 9.0.0

$CMAKE_BIN \
  -DCMAKE_BUILD_TYPE=Release \
  -DEMBREE_ARM=On \
  -DEMBREE_ADDRESS_SANITIZER=Off \
  -DCMAKE_INSTALL_PREFIX=$HOME/local/embree3 \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DEMBREE_ISPC_SUPPORT=Off \
  -DEMBREE_TASKING_SYSTEM=Internal \
  -DEMBREE_TUTORIALS=On \
  -DEMBREE_MAX_ISA=SSE2 \
  -DEMBREE_RAY_PACKETS=Off \
  ..
Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 330.425ms, 6.96073 Mprims/s, sah = 363.265 [DONE]
iteration 1: building BVH over 2300000 primitives, 217.422ms, 10.5785 Mprims/s, sah = 363.265 [DONE]
iteration 2: building BVH over 2300000 primitives, 145.361ms, 15.8227 Mprims/s, sah = 363.265 [DONE]
iteration 3: building BVH over 2300000 primitives, 264.551ms, 8.69398 Mprims/s, sah = 363.265 [DONE]
iteration 4: building BVH over 2300000 primitives, 214.144ms, 10.7404 Mprims/s, sah = 363.265 [DONE]
iteration 5: building BVH over 2300000 primitives, 215.484ms, 10.6737 Mprims/s, sah = 363.265 [DONE]
iteration 6: building BVH over 2300000 primitives, 207.744ms, 11.0713 Mprims/s, sah = 363.265 [DONE]
iteration 7: building BVH over 2300000 primitives, 217.249ms, 10.5869 Mprims/s, sah = 363.265 [DONE]
iteration 8: building BVH over 2300000 primitives, 205.477ms, 11.1935 Mprims/s, sah = 363.265 [DONE]
iteration 9: building BVH over 2300000 primitives, 205.345ms, 11.2007 Mprims/s, sah = 363.265 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 771.947ms, 2.97948 Mprims/s, sah = 340.853 [DONE]
iteration 1: building BVH over 2300000 primitives, 482.536ms, 4.76648 Mprims/s, sah = 340.853 [DONE]
iteration 2: building BVH over 2300000 primitives, 388.218ms, 5.9245 Mprims/s, sah = 340.853 [DONE]
iteration 3: building BVH over 2300000 primitives, 387.673ms, 5.93283 Mprims/s, sah = 340.853 [DONE]
iteration 4: building BVH over 2300000 primitives, 376.233ms, 6.11323 Mprims/s, sah = 340.853 [DONE]
iteration 5: building BVH over 2300000 primitives, 373.749ms, 6.15386 Mprims/s, sah = 340.853 [DONE]
iteration 6: building BVH over 2300000 primitives, 371.707ms, 6.18767 Mprims/s, sah = 340.853 [DONE]
iteration 7: building BVH over 2300000 primitives, 372.158ms, 6.18017 Mprims/s, sah = 340.853 [DONE]
iteration 8: building BVH over 2300000 primitives, 351.09ms, 6.55103 Mprims/s, sah = 340.853 [DONE]
iteration 9: building BVH over 2300000 primitives, 359.239ms, 6.40242 Mprims/s, sah = 340.853 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1334.42ms, 1.7236 Mprims/s, sah = 339.742 [DONE]
iteration 1: building BVH over 2300000 primitives, 789.567ms, 2.91299 Mprims/s, sah = 339.742 [DONE]
iteration 2: building BVH over 2300000 primitives, 758.453ms, 3.03249 Mprims/s, sah = 339.742 [DONE]
iteration 3: building BVH over 2300000 primitives, 776.326ms, 2.96267 Mprims/s, sah = 339.742 [DONE]
iteration 4: building BVH over 2300000 primitives, 813.967ms, 2.82567 Mprims/s, sah = 339.742 [DONE]
iteration 5: building BVH over 2300000 primitives, 807.859ms, 2.84703 Mprims/s, sah = 339.742 [DONE]
iteration 6: building BVH over 2300000 primitives, 713.595ms, 3.22312 Mprims/s, sah = 339.742 [DONE]
iteration 7: building BVH over 2300000 primitives, 714.242ms, 3.2202 Mprims/s, sah = 339.742 [DONE]
iteration 8: building BVH over 2300000 primitives, 753.975ms, 3.0505 Mprims/s, sah = 339.742 [DONE]
iteration 9: building BVH over 2300000 primitives, 665.994ms, 3.45348 Mprims/s, sah = 339.742 [DONE]

At least there is no performance degradation with either gcc or clang on the Jetson AGX (ARM A72(?) core).
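For comparing runs like the gcc and clang logs above, it may help to reduce each section to a single number. The following is only a rough sketch: it assumes the log format shown in this thread, recomputes the rate as primitives divided by seconds, and by default ignores iteration 0 as a warm-up (the warm-up treatment and the names `summarize` and `SAMPLE` are my own assumptions, not part of the tutorial).

```python
import re
from statistics import median

# Matches lines such as:
# "iteration 1: building BVH over 2300000 primitives, 189.316ms, 12.149 Mprims/s, ..."
LINE_RE = re.compile(
    r"iteration (\d+): building BVH over (\d+) primitives, ([\d.]+)ms"
)

def summarize(log_text, skip_warmup=True):
    """Return {section name: median Mprims/s}, recomputed from prims and ms."""
    rates = {}
    section = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.endswith("BVH build:"):
            # e.g. "Low quality BVH build:" -> "Low quality"
            section = line[: -len(" BVH build:")]
            rates[section] = []
            continue
        m = LINE_RE.match(line)
        if m and section is not None:
            iteration = int(m.group(1))
            prims = int(m.group(2))
            seconds = float(m.group(3)) / 1000.0
            if skip_warmup and iteration == 0:
                continue  # assumption: first iteration includes warm-up cost
            rates[section].append(prims / seconds / 1e6)
    return {name: median(vals) for name, vals in rates.items() if vals}

# Three lines taken from the gcc run above.
SAMPLE = """Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 467.245ms, 4.92247 Mprims/s, sah = 363.265 [DONE]
iteration 1: building BVH over 2300000 primitives, 189.316ms, 12.149 Mprims/s, sah = 363.265 [DONE]
iteration 2: building BVH over 2300000 primitives, 199.916ms, 11.5048 Mprims/s, sah = 363.265 [DONE]
"""

print(summarize(SAMPLE))
```

The median is used instead of the mean because single iterations in these logs occasionally spike, and the median is less sensitive to that.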

maikschulze commented on June 10, 2024

Thank you very much for posting the benchmark results. I will take a closer look at my "version" of the test and check other HW as well. So far, I don't have numbers for my iOS devices.

syoyo commented on June 10, 2024

And here is the result from a Pixel4 + Termux. I have created another branch, non-glfw (https://github.com/lighttransport/embree-aarch64/tree/non-glfw), which builds bvh_builder without the glfw dependency.

clang 9.0.1

iteration 0: building BVH over 2300000 primitives, 372.282ms, 6.17811 Mprims/s, sah = 363.227 [DONE]
iteration 1: building BVH over 2300000 primitives, 248.929ms, 9.23958 Mprims/s, sah = 363.227 [DONE]
iteration 2: building BVH over 2300000 primitives, 267.801ms, 8.58847 Mprims/s, sah = 363.227 [DONE]
iteration 3: building BVH over 2300000 primitives, 262.416ms, 8.76471 Mprims/s, sah = 363.227 [DONE]
iteration 4: building BVH over 2300000 primitives, 265.919ms, 8.64924 Mprims/s, sah = 363.227 [DONE]
iteration 5: building BVH over 2300000 primitives, 261.571ms, 8.79303 Mprims/s, sah = 363.227 [DONE]
iteration 6: building BVH over 2300000 primitives, 272.91ms, 8.42769 Mprims/s, sah = 363.227 [DONE]
iteration 7: building BVH over 2300000 primitives, 260.978ms, 8.813 Mprims/s, sah = 363.227 [DONE]
iteration 8: building BVH over 2300000 primitives, 267.578ms, 8.59562 Mprims/s, sah = 363.227 [DONE]
iteration 9: building BVH over 2300000 primitives, 264.649ms, 8.69075 Mprims/s, sah = 363.227 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 711.938ms, 3.23062 Mprims/s, sah = 340.895 [DONE]
iteration 1: building BVH over 2300000 primitives, 595.141ms, 3.86463 Mprims/s, sah = 340.895 [DONE]
iteration 2: building BVH over 2300000 primitives, 663.287ms, 3.46758 Mprims/s, sah = 340.895 [DONE]
iteration 3: building BVH over 2300000 primitives, 639.961ms, 3.59397 Mprims/s, sah = 340.895 [DONE]
iteration 4: building BVH over 2300000 primitives, 608.185ms, 3.78175 Mprims/s, sah = 340.895 [DONE]
iteration 5: building BVH over 2300000 primitives, 588.025ms, 3.9114 Mprims/s, sah = 340.895 [DONE]
iteration 6: building BVH over 2300000 primitives, 867.439ms, 2.65148 Mprims/s, sah = 340.895 [DONE]
iteration 7: building BVH over 2300000 primitives, 665.757ms, 3.45471 Mprims/s, sah = 340.895 [DONE]
iteration 8: building BVH over 2300000 primitives, 751.099ms, 3.06218 Mprims/s, sah = 340.895 [DONE]
iteration 9: building BVH over 2300000 primitives, 698.145ms, 3.29444 Mprims/s, sah = 340.895 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1282.62ms, 1.7932 Mprims/s, sah = 339.806 [DONE]
iteration 1: building BVH over 2300000 primitives, 1331.65ms, 1.72718 Mprims/s, sah = 339.806 [DONE]
iteration 2: building BVH over 2300000 primitives, 1307.26ms, 1.7594 Mprims/s, sah = 339.806 [DONE]
iteration 3: building BVH over 2300000 primitives, 1374.44ms, 1.67341 Mprims/s, sah = 339.806 [DONE]
iteration 4: building BVH over 2300000 primitives, 1512.5ms, 1.52066 Mprims/s, sah = 339.806 [DONE]
iteration 5: building BVH over 2300000 primitives, 1470.11ms, 1.56451 Mprims/s, sah = 339.806 [DONE]
iteration 6: building BVH over 2300000 primitives, 1468.75ms, 1.56596 Mprims/s, sah = 339.806 [DONE]
iteration 7: building BVH over 2300000 primitives, 1307.38ms, 1.75924 Mprims/s, sah = 339.806 [DONE]
iteration 8: building BVH over 2300000 primitives, 1333.34ms, 1.725 Mprims/s, sah = 339.806 [DONE]
iteration 9: building BVH over 2300000 primitives, 1321.41ms, 1.74057 Mprims/s, sah = 339.806 [DONE]

So the performance issue may come from the Android NDK build configuration.

The NDK toolchain adds some extra compiler flags:

https://android.googlesource.com/platform/ndk/+/master/build/cmake/android.toolchain.cmake#449

which may affect the performance.
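One way to see which flags a given toolchain actually passes is to configure with CMake's `CMAKE_EXPORT_COMPILE_COMMANDS=ON` and inspect the generated `compile_commands.json`. Below is a minimal sketch of such a check. The prefix list, the helper names and the sample entries are my own illustration; the flags in `SAMPLE_ENTRIES` are made up and are not what the NDK really adds.

```python
import json
import shlex

# Prefixes of flags that commonly influence the speed of generated code.
INTERESTING_PREFIXES = ("-O", "-g", "-march", "-mcpu", "-mtune", "-f", "-D")

def flags_of(entry):
    """Return the argument list of one compile_commands.json entry."""
    if "arguments" in entry:
        return list(entry["arguments"])
    return shlex.split(entry["command"])

def interesting_flags(entries):
    """Collect distinct performance-relevant flags, in first-seen order."""
    seen = []
    for entry in entries:
        for arg in flags_of(entry):
            if arg.startswith(INTERESTING_PREFIXES) and arg not in seen:
                seen.append(arg)
    return seen

def only_in_first(first, second):
    """Flags present in the first build but missing from the second."""
    return [flag for flag in first if flag not in second]

def load(path):
    """Load a compile_commands.json file produced by CMake."""
    with open(path) as f:
        return json.load(f)

# Made-up entries, only to show the two shapes an entry can have.
SAMPLE_ENTRIES = [
    {"directory": "/tmp/build", "file": "a.cpp",
     "command": "clang++ -O2 -DNDEBUG -fno-limit-debug-info -c a.cpp -o a.o"},
    {"directory": "/tmp/build", "file": "b.cpp",
     "arguments": ["clang++", "-O2", "-march=armv8-a", "-c", "b.cpp"]},
]

print(interesting_flags(SAMPLE_ENTRIES))
```

Running `interesting_flags(load(...))` on the NDK build and on the Termux build and comparing the two lists with `only_in_first` would show whether the toolchain file really changes anything performance-relevant.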

syoyo commented on June 10, 2024

non-glfw branch built with the Android NDK (r21), run on a Pixel4
(Push the built binary to /data/local/tmp, add +x, then execute it.)

# Uses the ANDROID_SDK_ROOT environment variable
ANDROID_NDK_ROOT=$ANDROID_SDK_ROOT/ndk-bundle

# CMake 3.6 or later required.
CMAKE_BIN=cmake

rm -rf build-android
mkdir build-android
cd build-android

$CMAKE_BIN -G Ninja -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK_ROOT/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_NATIVE_API_LEVEL=24 \
  -DANDROID_ARM_MODE=arm \
  -DANDROID_ARM_NEON=TRUE \
  -DANDROID_STL=c++_static \
  -DEMBREE_ARM=On \
  -DEMBREE_ISPC_SUPPORT=Off \
  -DEMBREE_TASKING_SYSTEM=Internal \
  -DEMBREE_TUTORIALS=On \
  -DEMBREE_MAX_ISA=SSE2 \
  -DEMBREE_RAY_PACKETS=Off \
  ..

cd ..
~
$ LD_LIBRARY_PATH=. ./bvh_builder
Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 320.482ms, 7.17669 Mprims/s, sah = 363.227 [DONE]
iteration 1: building BVH over 2300000 primitives, 253.758ms, 9.06376 Mprims/s, sah = 363.227 [DONE]
iteration 2: building BVH over 2300000 primitives, 271.481ms, 8.47205 Mprims/s, sah = 363.227 [DONE]
iteration 3: building BVH over 2300000 primitives, 280.434ms, 8.20158 Mprims/s, sah = 363.227 [DONE]
iteration 4: building BVH over 2300000 primitives, 261.063ms, 8.81014 Mprims/s, sah = 363.227 [DONE]
iteration 5: building BVH over 2300000 primitives, 279.063ms, 8.24187 Mprims/s, sah = 363.227 [DONE]
iteration 6: building BVH over 2300000 primitives, 260.827ms, 8.8181 Mprims/s, sah = 363.227 [DONE]
iteration 7: building BVH over 2300000 primitives, 252.432ms, 9.11136 Mprims/s, sah = 363.227 [DONE]
iteration 8: building BVH over 2300000 primitives, 262.714ms, 8.75477 Mprims/s, sah = 363.227 [DONE]
iteration 9: building BVH over 2300000 primitives, 251.372ms, 9.14978 Mprims/s, sah = 363.227 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 621.555ms, 3.7004 Mprims/s, sah = 340.895 [DONE]
iteration 1: building BVH over 2300000 primitives, 648.118ms, 3.54874 Mprims/s, sah = 340.895 [DONE]
iteration 2: building BVH over 2300000 primitives, 642.976ms, 3.57712 Mprims/s, sah = 340.895 [DONE]
iteration 3: building BVH over 2300000 primitives, 562.866ms, 4.08623 Mprims/s, sah = 340.895 [DONE]
iteration 4: building BVH over 2300000 primitives, 632.965ms, 3.63369 Mprims/s, sah = 340.895 [DONE]
iteration 5: building BVH over 2300000 primitives, 587.982ms, 3.91168 Mprims/s, sah = 340.895 [DONE]
iteration 6: building BVH over 2300000 primitives, 637.619ms, 3.60717 Mprims/s, sah = 340.895 [DONE]
iteration 7: building BVH over 2300000 primitives, 701.085ms, 3.28063 Mprims/s, sah = 340.895 [DONE]
iteration 8: building BVH over 2300000 primitives, 547.128ms, 4.20377 Mprims/s, sah = 340.895 [DONE]
iteration 9: building BVH over 2300000 primitives, 651.321ms, 3.53129 Mprims/s, sah = 340.895 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1366.26ms, 1.68342 Mprims/s, sah = 339.806 [DONE]
iteration 1: building BVH over 2300000 primitives, 1221.68ms, 1.88265 Mprims/s, sah = 339.806 [DONE]
iteration 2: building BVH over 2300000 primitives, 1293.02ms, 1.77878 Mprims/s, sah = 339.806 [DONE]
iteration 3: building BVH over 2300000 primitives, 1244.98ms, 1.84742 Mprims/s, sah = 339.806 [DONE]
iteration 4: building BVH over 2300000 primitives, 1229.1ms, 1.87129 Mprims/s, sah = 339.806 [DONE]
iteration 5: building BVH over 2300000 primitives, 1267.59ms, 1.81447 Mprims/s, sah = 339.806 [DONE]
iteration 6: building BVH over 2300000 primitives, 1304.91ms, 1.76258 Mprims/s, sah = 339.806 [DONE]
iteration 7: building BVH over 2300000 primitives, 1373.47ms, 1.67459 Mprims/s, sah = 339.806 [DONE]
iteration 8: building BVH over 2300000 primitives, 1497.47ms, 1.53592 Mprims/s, sah = 339.806 [DONE]
iteration 9: building BVH over 2300000 primitives, 1228.78ms, 1.87178 Mprims/s, sah = 339.806 [DONE]

syoyo commented on June 10, 2024

The same binary as used for the Pixel4, run on a ZenFone Max (M2) (Snapdragon 632, A53 cores)

Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1473.86ms, 1.56053 Mprims/s, sah = 363.227 [DONE]
iteration 1: building BVH over 2300000 primitives, 1187.6ms, 1.93669 Mprims/s, sah = 363.227 [DONE]
iteration 2: building BVH over 2300000 primitives, 1193.86ms, 1.92652 Mprims/s, sah = 363.227 [DONE]
iteration 3: building BVH over 2300000 primitives, 1168.3ms, 1.96867 Mprims/s, sah = 363.227 [DONE]
iteration 4: building BVH over 2300000 primitives, 1300.34ms, 1.76877 Mprims/s, sah = 363.227 [DONE]
iteration 5: building BVH over 2300000 primitives, 1525.81ms, 1.50739 Mprims/s, sah = 363.227 [DONE]
iteration 6: building BVH over 2300000 primitives, 1484.5ms, 1.54935 Mprims/s, sah = 363.227 [DONE]
iteration 7: building BVH over 2300000 primitives, 1217.63ms, 1.88892 Mprims/s, sah = 363.227 [DONE]
iteration 8: building BVH over 2300000 primitives, 1680.47ms, 1.36867 Mprims/s, sah = 363.227 [DONE]
iteration 9: building BVH over 2300000 primitives, 1207.49ms, 1.90477 Mprims/s, sah = 363.227 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 2875.91ms, 0.799748 Mprims/s, sah = 340.895 [DONE]
iteration 1: building BVH over 2300000 primitives, 4150.64ms, 0.554132 Mprims/s, sah = 340.895 [DONE]
iteration 2: building BVH over 2300000 primitives, 2784.03ms, 0.82614 Mprims/s, sah = 340.895 [DONE]
iteration 3: building BVH over 2300000 primitives, 2826.1ms, 0.813842 Mprims/s, sah = 340.895 [DONE]
iteration 4: building BVH over 2300000 primitives, 2791.85ms, 0.823826 Mprims/s, sah = 340.895 [DONE]
iteration 5: building BVH over 2300000 primitives, 3473.95ms, 0.66207 Mprims/s, sah = 340.895 [DONE]
iteration 6: building BVH over 2300000 primitives, 2739.15ms, 0.839677 Mprims/s, sah = 340.895 [DONE]
iteration 7: building BVH over 2300000 primitives, 2735.06ms, 0.840931 Mprims/s, sah = 340.895 [DONE]
iteration 8: building BVH over 2300000 primitives, 3475.67ms, 0.661744 Mprims/s, sah = 340.895 [DONE]
iteration 9: building BVH over 2300000 primitives, 2811.84ms, 0.81797 Mprims/s, sah = 340.895 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 5607.71ms, 0.410149 Mprims/s, sah = 339.806 [DONE]
iteration 1: building BVH over 2300000 primitives, 5948.25ms, 0.386668 Mprims/s, sah = 339.806 [DONE]
iteration 2: building BVH over 2300000 primitives, 5399.26ms, 0.425984 Mprims/s, sah = 339.806 [DONE]
iteration 3: building BVH over 2300000 primitives, 5374.93ms, 0.427912 Mprims/s, sah = 339.806 [DONE]
iteration 4: building BVH over 2300000 primitives, 5705.12ms, 0.403147 Mprims/s, sah = 339.806 [DONE]
iteration 5: building BVH over 2300000 primitives, 5565.67ms, 0.413247 Mprims/s, sah = 339.806 [DONE]
iteration 6: building BVH over 2300000 primitives, 5938.19ms, 0.387324 Mprims/s, sah = 339.806 [DONE]
iteration 7: building BVH over 2300000 primitives, 5233.09ms, 0.439511 Mprims/s, sah = 339.806 [DONE]
iteration 8: building BVH over 2300000 primitives, 5612.24ms, 0.409819 Mprims/s, sah = 339.806 [DONE]
iteration 9: building BVH over 2300000 primitives, 5466.35ms, 0.420756 Mprims/s, sah = 339.806 [DONE]

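So, apparently the relative performance of the three quality levels is as expected even on A53 cores: the low-quality build is still the fastest and the high-quality build the slowest.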

maikschulze commented on June 10, 2024

Hi,

After dissecting various commits and compiler settings, none of which helped at all, I deployed my test application to my colleague's Android device. The issue disappeared!

An identical aarch64 binary results in the following measurements:

Xiaomi Mi 9T Pro (Snapdragon 855 CPU)

Time iteration (seconds): 0.025513
Time iteration (seconds): 0.015728
Time iteration (seconds): 0.014720
Time iteration (seconds): 0.014457
Time iteration (seconds): 0.016226
Time iteration (seconds): 0.014536
Time iteration (seconds): 0.014560
Time iteration (seconds): 0.014105
Time iteration (seconds): 0.015324
Time iteration (seconds): 0.014603
Low Quality Build Time (seconds): 0.161452
Low Quality Build Rate (Mprims/s): 6.193801

Low Quality Build Quality (SAH): 265.793854
Time iteration (seconds): 0.029731
Time iteration (seconds): 0.030952
Time iteration (seconds): 0.031231
Time iteration (seconds): 0.027482
Time iteration (seconds): 0.034549
Time iteration (seconds): 0.034750
Time iteration (seconds): 0.039156
Time iteration (seconds): 0.032344
Time iteration (seconds): 0.034419
Time iteration (seconds): 0.032151
Medium Quality Build Time (seconds): 0.326960
Medium Quality Build Rate (Mprims/s): 3.058480

Medium Quality Build Quality (SAH): 249.085358
Time iteration (seconds): 0.081488
Time iteration (seconds): 0.070334
Time iteration (seconds): 0.050496
Time iteration (seconds): 0.059259
Time iteration (seconds): 0.050183
Time iteration (seconds): 0.055322
Time iteration (seconds): 0.058191
Time iteration (seconds): 0.052493
Time iteration (seconds): 0.053196
Time iteration (seconds): 0.058377
High Quality Build Time (seconds): 0.589467
High Quality Build Rate (Mprims/s): 1.696447

High Quality Build Quality (SAH): 248.987671

OnePlus 3T (Snapdragon 821 CPU, from my previous tests)

Time iteration (seconds): 0.139856
Time iteration (seconds): 0.106543
Time iteration (seconds): 0.092181
Time iteration (seconds): 0.100685
Time iteration (seconds): 0.071749
Time iteration (seconds): 0.072782
Time iteration (seconds): 0.072107
Time iteration (seconds): 0.072892
Time iteration (seconds): 0.077812
Time iteration (seconds): 0.069211
Low Quality Build Time (seconds): 0.877871
Low Quality Build Rate (Mprims/s): 1.139120

Low Quality Build Quality (SAH): 265.860352
Time iteration (seconds): 0.047721
Time iteration (seconds): 0.059651
Time iteration (seconds): 0.057134
Time iteration (seconds): 0.047721
Time iteration (seconds): 0.049204
Time iteration (seconds): 0.048780
Time iteration (seconds): 0.054862
Time iteration (seconds): 0.053493
Time iteration (seconds): 0.052970
Time iteration (seconds): 0.061554
Medium Quality Build Time (seconds): 0.533335
Medium Quality Build Rate (Mprims/s): 1.874994

Medium Quality Build Quality (SAH): 249.218781
Time iteration (seconds): 0.130922
Time iteration (seconds): 0.126150
Time iteration (seconds): 0.127971
Time iteration (seconds): 0.146135
Time iteration (seconds): 0.154689
Time iteration (seconds): 0.140980
Time iteration (seconds): 0.188261
Time iteration (seconds): 0.157469
Time iteration (seconds): 0.211994
Time iteration (seconds): 0.205527
High Quality Build Time (seconds): 1.590357
High Quality Build Rate (Mprims/s): 0.628790

High Quality Build Quality (SAH): 248.880981

Personally, I'm surprised by this result because the relative floating-point/integer performance ratios of the two devices almost match in GeekBench 5. Another factor may be the different threading mechanisms due to OS and hardware. I would argue that, given the positive results from you, @syoyo, and from my colleague's device, the priority of this issue has decreased a lot.

I will invest into porting my test app to iOS and measure the mileage on A13 chips.

syoyo commented on June 10, 2024

@maikschulze Thanks for the performance report.

It looks like the issue is related to older-generation CPUs with a big.LITTLE architecture (Snapdragon 821). I have an Xperia X Performance (Snapdragon 820), so I will try to run a benchmark soon.
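To check whether a device really has heterogeneous core clusters, one could group the cores by their maximum frequency. This is a rough diagnostic sketch, assuming a Linux/Android system that exposes `cpufreq` under `/sys/devices/system/cpu` and lets the process read it (not guaranteed on every device); the function names are my own.

```python
import glob
import os
import re
from collections import defaultdict

def read_max_freqs(root="/sys/devices/system/cpu"):
    """Map CPU index -> max frequency in kHz, for CPUs that expose cpufreq."""
    freqs = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq", "cpuinfo_max_freq")
    for path in glob.glob(pattern):
        m = re.search(r"cpu(\d+)", path[len(root):])
        try:
            with open(path) as f:
                freqs[int(m.group(1))] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or offline core; skip it
    return freqs

def clusters(freqs):
    """Group CPU indices by max frequency, fastest cluster first."""
    groups = defaultdict(list)
    for cpu, khz in freqs.items():
        groups[khz].append(cpu)
    return [(khz, sorted(cpus)) for khz, cpus in sorted(groups.items(), reverse=True)]

# Prints one line per cluster; prints nothing if cpufreq is not readable.
for khz, cpus in clusters(read_max_freqs()):
    print(f"{khz / 1e6:.2f} GHz: cores {cpus}")
```

If more than one cluster shows up, rerunning the benchmark with the process restricted to a single cluster would be one way to test whether scheduling across clusters is what hurts the low-quality builder.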

> I will invest into porting my test app to iOS and measure the mileage on A13 chips.

Good! A13 should run faster than Snapdragon 855!

syoyo commented on June 10, 2024

Here is the result from the Xperia X Performance (Snapdragon 820):

134|SOV33:/data/local/tmp $ LD_LIBRARY_PATH=. ./bvh_builder
Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1844.27ms, 1.24711 Mprims/s, sah = 363.227 [DONE]
iteration 1: building BVH over 2300000 primitives, 1516.41ms, 1.51674 Mprims/s, sah = 363.227 [DONE]
iteration 2: building BVH over 2300000 primitives, 1507.61ms, 1.52559 Mprims/s, sah = 363.227 [DONE]
iteration 3: building BVH over 2300000 primitives, 1566.18ms, 1.46854 Mprims/s, sah = 363.227 [DONE]
iteration 4: building BVH over 2300000 primitives, 1511.98ms, 1.52118 Mprims/s, sah = 363.227 [DONE]
iteration 5: building BVH over 2300000 primitives, 1609.48ms, 1.42903 Mprims/s, sah = 363.227 [DONE]
iteration 6: building BVH over 2300000 primitives, 1515.24ms, 1.51791 Mprims/s, sah = 363.227 [DONE]
iteration 7: building BVH over 2300000 primitives, 1532.6ms, 1.50072 Mprims/s, sah = 363.227 [DONE]
iteration 8: building BVH over 2300000 primitives, 1503.24ms, 1.53003 Mprims/s, sah = 363.227 [DONE]
iteration 9: building BVH over 2300000 primitives, 1504.22ms, 1.52903 Mprims/s, sah = 363.227 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1151.75ms, 1.99697 Mprims/s, sah = 340.895 [DONE]
iteration 1: building BVH over 2300000 primitives, 1114.04ms, 2.06457 Mprims/s, sah = 340.895 [DONE]
iteration 2: building BVH over 2300000 primitives, 1116.13ms, 2.0607 Mprims/s, sah = 340.895 [DONE]
iteration 3: building BVH over 2300000 primitives, 1100ms, 2.09092 Mprims/s, sah = 340.895 [DONE]
iteration 4: building BVH over 2300000 primitives, 1105.78ms, 2.07998 Mprims/s, sah = 340.895 [DONE]
iteration 5: building BVH over 2300000 primitives, 1118.25ms, 2.05679 Mprims/s, sah = 340.895 [DONE]
iteration 6: building BVH over 2300000 primitives, 1146.07ms, 2.00686 Mprims/s, sah = 340.895 [DONE]
iteration 7: building BVH over 2300000 primitives, 1135.19ms, 2.0261 Mprims/s, sah = 340.895 [DONE]
iteration 8: building BVH over 2300000 primitives, 1129.74ms, 2.03587 Mprims/s, sah = 340.895 [DONE]
iteration 9: building BVH over 2300000 primitives, 1106.5ms, 2.07862 Mprims/s, sah = 340.895 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 4017.36ms, 0.572516 Mprims/s, sah = 339.806 [DONE]
iteration 1: building BVH over 2300000 primitives, 4046.4ms, 0.568407 Mprims/s, sah = 339.806 [DONE]
iteration 2: building BVH over 2300000 primitives, 4088.83ms, 0.562509 Mprims/s, sah = 339.806 [DONE]
iteration 3: building BVH over 2300000 primitives, 4474.04ms, 0.514077 Mprims/s, sah = 339.806 [DONE]
iteration 4: building BVH over 2300000 primitives, 3881.43ms, 0.592565 Mprims/s, sah = 339.806 [DONE]
iteration 5: building BVH over 2300000 primitives, 3927.56ms, 0.585605 Mprims/s, sah = 339.806 [DONE]
iteration 6: building BVH over 2300000 primitives, 3890.86ms, 0.591128 Mprims/s, sah = 339.806 [DONE]
iteration 7: building BVH over 2300000 primitives, 3894.5ms, 0.590577 Mprims/s, sah = 339.806 [DONE]
iteration 8: building BVH over 2300000 primitives, 4020.9ms, 0.572012 Mprims/s, sah = 339.806 [DONE]
iteration 9: building BVH over 2300000 primitives, 3996.46ms, 0.575509 Mprims/s, sah = 339.806 [DONE]

LOW_QUALITY is slower than MEDIUM_QUALITY, as observed on the OnePlus 3T (Snapdragon 821). So the situation appears to be processor-specific (especially the Snapdragon 82x series).

maikschulze commented on June 10, 2024

Hi,

I've found time to port my benchmarks to other platforms. As a baseline for measuring the performance and quality of the code changes from the last months, I've compiled the state a9ab7e6. As a next step, I will test the newer contributions and report the results of my comparison.

Here are the measured BVH building rates for the state from June 2019:

OnePlus 3T (Snapdragon 821 CPU) @ Android arm64 NEON
Low Quality Build Rate (Mprims/s): 1.170872
Medium Quality Build Rate (Mprims/s): 1.373946
High Quality Build Rate (Mprims/s): 0.514601

Apple iPhone XS (A12 CPU) @ iOS arm64 NEON
Low Quality Build Rate (Mprims/s): 5.299200
Medium Quality Build Rate (Mprims/s): 3.624712
High Quality Build Rate (Mprims/s): 2.121203

Apple iPad Air (A12 CPU) @ iOS arm64 NEON
Low Quality Build Rate (Mprims/s): 5.232715
Medium Quality Build Rate (Mprims/s): 3.664790
High Quality Build Rate (Mprims/s): 2.060475

Apple iPad Pro (A12X CPU) @ iOS arm64 NEON
Low Quality Build Rate (Mprims/s): 6.667206
Medium Quality Build Rate (Mprims/s): 4.490212
High Quality Build Rate (Mprims/s): 2.712091

Google Pixelbook (i5-7Y57 CPU) @ Android x64 SSE2
Low Quality Build Rate (Mprims/s): 3.106385
Medium Quality Build Rate (Mprims/s): 1.424219
High Quality Build Rate (Mprims/s): 0.836109

Apple MacBook Pro (i9-9880H CPU) @ Windows 10 x64 SSE2
Low Quality Build Rate (Mprims/s): 14.665057
Medium Quality Build Rate (Mprims/s): 7.881895
High Quality Build Rate (Mprims/s): 5.461523

Apparently, the performance problem of the low-quality builder remains specific to the older gen chips such as Snapdragon 820 and Snapdragon 821. A12 chips do not show any performance problem either. I would therefore argue to close this issue and consider the originally raised issue a hardware problem, not a problem of the code base.
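To make the anomaly explicit, the ratio of the low-quality to the medium-quality build rate can be computed per device. A small sketch using the numbers listed above (the helper names are mine, and the threshold of 1.0 is simply "low is not faster than medium"):

```python
# Build rates in Mprims/s, copied from the list above: (low, medium, high).
RATES = {
    "OnePlus 3T (Snapdragon 821)": (1.170872, 1.373946, 0.514601),
    "iPhone XS (A12)": (5.299200, 3.624712, 2.121203),
    "iPad Air (A12)": (5.232715, 3.664790, 2.060475),
    "iPad Pro (A12X)": (6.667206, 4.490212, 2.712091),
    "Pixelbook (i5-7Y57)": (3.106385, 1.424219, 0.836109),
    "MacBook Pro (i9-9880H)": (14.665057, 7.881895, 5.461523),
}

def low_to_medium_ratio(rates):
    """Ratio of low-quality to medium-quality build rate per device."""
    return {name: low / medium for name, (low, medium, _high) in rates.items()}

def anomalous(rates):
    """Devices where the low-quality builder is not faster than the medium one."""
    return [name for name, r in low_to_medium_ratio(rates).items() if r <= 1.0]

for name, ratio in low_to_medium_ratio(RATES).items():
    print(f"{name}: low/medium = {ratio:.2f}")
print("anomalous:", anomalous(RATES))
```

On every device except the OnePlus 3T the ratio is well above 1, which is what one would expect from a builder that trades tree quality for build speed.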

syoyo commented on June 10, 2024

@maikschulze Thanks for the benchmark!

> I would therefore argue to close this issue and consider the originally raised issue a hardware problem, not a problem of the code base.

So can I close this issue?

maikschulze commented on June 10, 2024

> So can I close this issue?

Yes, please.

syoyo commented on June 10, 2024

> Yes, please.

Thanks. I'll close the issue, since it looks like it's a HW architectural issue.
