
Comments (18)

henryrov commented on August 29, 2024 3

> Now NuttX CoreMark is really close to Debian CoreMark!

I tested the same fixes on Ox64, and now it's also really close to Buildroot (1104 with -O2 vs 1141).
I'll go ahead and close this issue now. Thanks everyone for your help!

from nuttx.

pkarashchenko commented on August 29, 2024 2

This description reminds me of a project I worked on about 10 years ago, with an AM335x-based device. I'll skip most of that investigation, but in the end we found that our embedded system wasn't enabling caching when configuring the MMU regions, so I'd suggest looking at that as the first point of investigation.

from nuttx.

patacongo commented on August 29, 2024 1

> I wanted to test this, but booting without the MMU isn't currently supported on the BL808.

And if it is like the ARM MMU, which I am more familiar with, disabling the MMU also requires disabling the caches, since the MMU controls the cacheable properties of each mapped region.

from nuttx.

lupyuen commented on August 29, 2024 1

@henryrov It's possible that we're flushing the MMU Cache too often: "MMU Cache for T-Head C906". Sorry, the docs for the BL808 SoC and T-Head C906 CPU are lacking, so I have trouble guessing the correct MMU Cache settings. We might need to tweak them.

BL808 SoC is not officially supported by Linux / Debian Mainline, so it might be hard to figure out how Linux handles the MMU. Maybe that's why the SBC Makers (Sipeed, Pine64) are moving away from Bouffalo Lab BL808 to Sophgo SG2000 / SG2002, which has Mainline Linux Support.

(BTW: I'm not sure about Bouffalo Lab's future plans for BL808? It seems to have disappeared from their website)

UPDATE: We have enabled Strong Ordering in the MMU, which might cause performance issues. We might need to tweak it: T-Head C906 Strong Ordering

from nuttx.

henryrov commented on August 29, 2024 1

A few findings:

> It's possible that we're flushing the MMU Cache too often

I tested this by removing the call to mmu_flush_cache, but this didn't seem to affect coremark or the for loop at all.

> We have enabled Strong Ordering in the MMU, which might cause performance issues.

I timed the for loop with different combinations of the shareable and strong order flags, but again this didn't seem to make a difference.
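For readers who want to reproduce this kind of measurement, here is a minimal sketch of a timed NOP loop. It is not the code used in this thread: the function name `time_nop_loop` is made up for illustration, it assumes a GCC or Clang toolchain (for the inline `nop`), and it uses the standard C `clock()` call, which a kernel running before or right after `mmu_enable` would most likely have to replace with a read of a hardware timer or cycle counter.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: execute `iterations` NOP instructions and return
 * the CPU time consumed, in microseconds.
 *
 * The inline asm statement is volatile, so the compiler cannot remove
 * the loop body even at -O2.
 */
static uint64_t time_nop_loop(uint32_t iterations)
{
  clock_t start = clock();

  for (uint32_t i = 0; i < iterations; i++)
    {
      __asm__ __volatile__("nop");
    }

  clock_t end = clock();

  /* Convert clock ticks to microseconds. */
  return (uint64_t)(end - start) * 1000000u / CLOCKS_PER_SEC;
}
```

Running the same loop with the same iteration count at two points (for example before and after the MMU is enabled) and comparing the two results shows whether instruction fetch itself became slower, independent of the scheduler or system calls.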

> Maybe that's why the SBC Makers (Sipeed, Pine64) are moving away from Bouffalo Lab BL808 to Sophgo SG2000 / SG2002, which has Mainline Linux Support.

In that case, maybe we could learn something from testing the SG2000? Since it also uses the C906, it might be worth checking if it behaves similarly to the BL808 in NuttX, and if the performance difference is as large compared to Linux.

from nuttx.

henryrov commented on August 29, 2024 1

> @henryrov Yep sure! I'll run the NOP Loop before and after initing the MMU on SG2000. How do I run the benchmark for NuttX vs Linux?

That's great! You can enable coremark on NuttX through menuconfig under Application Configuration -> Benchmark Applications. I don't know much about Linux on the SG2000, but what I ended up doing for the BL808 was cross compiling coremark with the buildroot toolchain externally and moving the compiled binary to my SD card. Maybe if there's enough hardware support it might be easier to get the source code and then compile directly on the board?

from nuttx.

henryrov commented on August 29, 2024 1

> We have a fix for the MMU Delay, we need to tell the MMU that the Kernel Text, Data and Heap are Cacheable. Otherwise the MMU won't cache them!

Nice! I tested a similar change on the BL808 and it also fixed the NOP loop difference. It also increased CoreMark slightly, from 18 to 19.

> Thanks Henry for tracking down the MMU Delay! I'll upstream the Kernel Fix to Ox64 and SG2000 real soon.

No problem, thank you for your help!

from nuttx.

lupyuen commented on August 29, 2024 1

Thanks @henryrov for testing on Ox64! I configured the MMU to cache User Text and Data (for NuttX Apps):

Now NuttX CoreMark is really close to Debian CoreMark!

I'll upstream the fixes. Thanks again :-)

FYI: SG2000 NuttX CoreMark is 1,758 with default settings -Os and -g. So -O2 really makes a difference! How I compiled CoreMark for -O2:

rm ../apps/benchmarks/coremark/*.o
## Edit arch/risc-v/src/common/Toolchain.defs
## Change `ARCHOPTIMIZATION += -Os` to `ARCHOPTIMIZATION += -O2`
## Change `ARCHOPTIMIZATION += -g`  to `ARCHOPTIMIZATION +=`
## Note: NuttX Kernel won't boot with `-O2` (why?)

from nuttx.

pkarashchenko commented on August 29, 2024 1

@lupyuen could you please file an issue for the -O2 kernel compilation problem that you mentioned above?

from nuttx.

pkarashchenko commented on August 29, 2024 1

I'm not sure which optimization flags the Debian build used. -O3 or -Ofast may still give some extra execution speed at the cost of code size.

from nuttx.

acassis commented on August 29, 2024

Hey @henryrov really interesting discovery! In fact I was expecting NuttX to be faster than Linux.

"Houston we have a problem!!!!"

@lupyuen did you notice it before?

@xiaoxiang781216 @raiden00pl @pkarashchenko @masayuki2009 @patacongo any idea?

from nuttx.

acassis commented on August 29, 2024

Hi @pkarashchenko, makes sense! I remember that when I disabled cache support in the Linux kernel, the boot process was really slow.

Normally we don't pay much attention to this on NuttX because it always boots in milliseconds, even when the cache is disabled. So this kind of benchmark comparison is very important. I think we need HW CI that runs benchmarks to catch regressions.

from nuttx.

patacongo commented on August 29, 2024

> Hey @henryrov really interesting discovery! In fact I was expecting NuttX to be faster than Linux.
>
> "Houston we have a problem!!!!"
>
> @lupyuen did you notice it before?
>
> @xiaoxiang781216 @raiden00pl @pkarashchenko @masayuki2009 @patacongo any idea?

Possibly related to Issue #3355

A lot has changed since #3355 but re-assessing the system call utilization would also be a good starting point.

from nuttx.

patacongo commented on August 29, 2024

The realtime scheduler could also be a cause of reduced performance in a comparison with Linux benchmarks.

Linux defaults to SCHED_OTHER which is tuned for data throughput and Linux has some of the best throughput times available. It minimizes context switching and "ages" threads to assure that each gets a shot at the CPU (after a delay). So everything makes good progress with minimum context switching overhead.

SCHED_OTHER will not support real-time behavior.

Real-time RTOSs, on the other hand, do not typically support SCHED_OTHER. Several other schedulers are available for real-time behavior. SCHED_FIFO is the only one specified by POSIX and can be used, for example, to support Minimum Latency Scheduling. That behavior depends on the strict priority scheduling of SCHED_FIFO. SCHED_FIFO is super responsive to the point of being "goosey". It can easily lose throughput due to context switch "storms": better response at the expense of reduced overall throughput and higher rates of context switches.

Low priority threads can also be blocked indefinitely.
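To make the policy difference concrete, here is a minimal sketch of how a POSIX program can request SCHED_FIFO for a new thread instead of inheriting the default policy. The helper name `init_fifo_attr` is hypothetical; the `pthread_attr_*` and `sched_*` calls are standard POSIX. It only prepares the attributes and does not create a thread.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: fill `attr` so that a thread created with it asks
 * for strict-priority SCHED_FIFO scheduling at `priority`.
 *
 * Returns 0 on success, or the first non-zero error code from the
 * pthread_attr_* calls.
 */
static int init_fifo_attr(pthread_attr_t *attr, int priority)
{
  struct sched_param param = { .sched_priority = priority };
  int ret;

  ret = pthread_attr_init(attr);
  if (ret != 0)
    {
      return ret;
    }

  /* Use the policy set below instead of inheriting the creator's. */
  ret = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
  if (ret == 0)
    {
      ret = pthread_attr_setschedpolicy(attr, SCHED_FIFO);
    }

  /* Set the priority after the policy so it is checked against the
   * SCHED_FIFO range.
   */
  if (ret == 0)
    {
      ret = pthread_attr_setschedparam(attr, &param);
    }

  if (ret != 0)
    {
      pthread_attr_destroy(attr);
    }

  return ret;
}
```

A valid priority can be obtained from `sched_get_priority_min(SCHED_FIFO)` and `sched_get_priority_max(SCHED_FIFO)`. On Linux, passing these attributes to `pthread_create` usually needs real-time privileges, which is one reason ordinary Linux benchmark runs end up under SCHED_OTHER.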

Issue #3355 is a more likely cause of the performance issue.

from nuttx.

pkarashchenko commented on August 29, 2024

@henryrov what is the NuttX score with MMU disabled?

from nuttx.

henryrov commented on August 29, 2024

> The realtime scheduler could also be a cause of reduced performance in a comparison with Linux benchmarks.

I agree that this could impact benchmark results somewhat, but I don't think it fully explains the difference here, especially since the difference I saw in the for loop tests was before the scheduler was initialized (assuming the issues are related), and running the loop again after starting the scheduler performs the same as it does immediately after mmu_enable.

> @henryrov what is the NuttX score with MMU disabled?

I wanted to test this, but booting without the MMU isn't currently supported on the BL808.

from nuttx.

lupyuen commented on August 29, 2024

> Since it also uses the C906, it might be worth checking if it behaves similarly to the BL808 in NuttX, and if the performance difference is as large compared to Linux

@henryrov Yep sure! I'll run the NOP Loop before and after initing the MMU on SG2000. How do I run the benchmark for NuttX vs Linux?

from nuttx.

lupyuen commented on August 29, 2024

@henryrov Here are the CoreMark Results for SG2000 (Milk-V Duo S), NuttX vs Debian. Yep the results look similar to Ox64 BL808 NuttX, since SG2000 NuttX is nearly identical to Ox64 NuttX:

SG2000 NuttX CoreMark -Os: 16

SG2000 NuttX CoreMark -O2: 21

  • Only CoreMark was compiled with -O2. Kernel won't boot with -O2 (why?)

SG2000 Debian CoreMark -O2: 2,470

I'll do more analysis of the NOP Loop before and after initing SG2000 MMU. Thanks!

(FYI: I thought it might be due to the OpenSBI System Timer Interrupt triggered too often, but nope it makes no difference when I disabled the interrupt)

UPDATE: We have a fix for the MMU Delay, we need to tell the MMU that the Kernel Text, Data and Heap are Cacheable. Otherwise the MMU won't cache them!
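As a rough illustration of what "telling the MMU a region is Cacheable" means, here is a sketch of building a RISC-V Sv39 leaf page table entry with extra memory-attribute flags. This is not the NuttX patch. The low permission bits follow the standard Sv39 layout, but the `PTE_EXT_*` names are hypothetical and their bit positions (60, 62, 63) are an assumption about the T-Head C906 extension that should be checked against the C906 manual. The physical address in the usage note is made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard RISC-V Sv39 leaf PTE bits. */
#define PTE_VALID  (UINT64_C(1) << 0)
#define PTE_READ   (UINT64_C(1) << 1)
#define PTE_WRITE  (UINT64_C(1) << 2)
#define PTE_EXEC   (UINT64_C(1) << 3)

/* ASSUMPTION: T-Head C906 extended attributes in the top PTE bits.
 * Hypothetical names, unverified bit positions.
 */
#define PTE_EXT_SHAREABLE     (UINT64_C(1) << 60)
#define PTE_EXT_CACHEABLE     (UINT64_C(1) << 62)
#define PTE_EXT_STRONG_ORDER  (UINT64_C(1) << 63)

/* Attribute sets for the sketch: normal RAM is cached, device I/O is
 * strongly ordered and never cached.
 */
#define ATTR_RAM  (PTE_EXT_SHAREABLE | PTE_EXT_CACHEABLE)
#define ATTR_IO   (PTE_EXT_SHAREABLE | PTE_EXT_STRONG_ORDER)

/* Build a leaf PTE for a 4 KiB-aligned physical address.  In Sv39 the
 * physical page number (paddr >> 12) is stored starting at bit 10.
 */
static uint64_t make_leaf_pte(uint64_t paddr, uint64_t perms, uint64_t attrs)
{
  uint64_t ppn = paddr >> 12;
  return (ppn << 10) | perms | attrs | PTE_VALID;
}

/* True if the entry asks the MMU to cache accesses to this page. */
static bool pte_is_cacheable(uint64_t pte)
{
  return (pte & PTE_EXT_CACHEABLE) != 0;
}
```

Under these assumptions, kernel text would be mapped with something like `make_leaf_pte(0x80200000, PTE_READ | PTE_EXEC, ATTR_RAM)`. If the RAM mappings are created without the cacheable attribute, every instruction fetch and data access goes to memory uncached, which would match the slowdown seen right after the MMU is enabled.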

CoreMark is now 17, up slightly from 16 earlier. NuttX Apps are also having the same MMU Delay, I'll check the MMU Flags for NuttX Apps:

Thanks Henry for tracking down the MMU Delay! I'll upstream the Kernel Fix to Ox64 and SG2000 real soon.

from nuttx.
