Comments (21)

traxanos avatar traxanos commented on June 3, 2024 1

I don't understand why I should switch off the interrupt. It must be enough that the processing in the interrupt is interrupted.

I have now tried this with just the mask and the error seems to be gone. The uptime is already more than 1 day.

    ethernet_arch_lwip_gpio_mask();
    const bool state = KNX_NETIF.isLinked();
    ethernet_arch_lwip_gpio_unmask();
    return state;

I had thought twice that it had hung up again. But a reconnect via showed me the running console with the corresponding uptime.

from arduino-pico.

traxanos avatar traxanos commented on June 3, 2024 1

For the sake of completeness: It's running :D Thank you.

earlephilhower avatar earlephilhower commented on June 3, 2024

Please provide a MCVE so this could be reproduced elsewhere. IRQ mode is stable AFAIK and I've personally had the AdvancedWebServer running for ~24hrs with 3 different browsers refreshing every couple seconds w/o incident, and others have used it as well in their own testing (i.e. the ESP32 WIFI driver port recently added support for it).

earlephilhower avatar earlephilhower commented on June 3, 2024

Also, what exact Ethernet device is being used? One way a core would be stuck in the IRQ handler would be that the IRQ line never gets deasserted by the Ethernet adapter. If it's a shared line or noise, things could get stuck w/the IRQ asserted and the CPU continually calling the IRQ handler.

traxanos avatar traxanos commented on June 3, 2024

I was expecting this answer. I don't know how to do it. I have a huge framework here, and the cause seems to be nothing simple. I have already switched off all my own DMAs and interrupts.
What I also noticed is that I keep losing the link sporadically.

The hardware is a W5500 -> https://github.com/OpenKNX/OpenKNX/wiki/REG1-Eth

traxanos avatar traxanos commented on June 3, 2024

The problem also occurs with other hardware such as https://github.com/OpenKNX/OpenKNX/wiki/REG1-Base-IP

In addition, we monitor the loop runtime. Our loops do not run longer than 6ms. Nevertheless, I keep getting messages about runtimes of >100ms.

0d 00:19:16: Common: Warning: The loop took longer than usual (146 >= 100)

earlephilhower avatar earlephilhower commented on June 3, 2024

I thought the core was hung in an IRQ loop, so how would the loop timer ever output anything?

... It looks as if it is only running in interrupt mode.

Again, w/o any code or way of reproducing there's not much we can do here other than guess.

Are you calling any raw LWIP functions in your code? If so, you need to protect those calls with the proper mutex or you could end up w/re-entrancy which will really mess up LWIP. The included libraries all have their calls protected (AFAIK!) so if you're just using WiFiClient or WiFiUDP then this probably isn't an issue.

Is there any way of seeing where your excess time is being spent in loop? I'm scratching my head here because if there was a deadlock somewhere in the LWIP code it would never advance. If there was a timeout it would be on the order of 5,000 milliseconds, not 10s-100s of ms.

traxanos avatar traxanos commented on June 3, 2024

> I thought the core was hung in an IRQ loop, so how would the loop timer ever output anything?

The loop warning appears regardless of the error, but only if the interrupt mode is active. Therefore I suspect a connection.

> Again, w/o any code or way of reproducing there's not much we can do here other than guess.

If I could do that, I would find the error myself and could tell you what needs to be fixed :)

> Are you calling any raw LWIP functions in your code?

I check whether the link is connected and, if the connection was lost, I make the DHCP call myself. Unfortunately, none of this is done by the system itself.

The network handling is implemented in this module:
https://github.com/OpenKNX/OFM-Network/blob/v1/src/NetworkModule.cpp

In summary:

I check every 500 ms whether the link is active with

    KNX_NETIF.isLinked();

When the state changes to true, I start DHCP to renew the address:

        netif_set_link_up(KNX_NETIF.getNetIf());
        if (_useStaticIP)
            netif_set_ipaddr(KNX_NETIF.getNetIf(), _staticLocalIP);
        else
            dhcp_network_changed_link_up(KNX_NETIF.getNetIf());

When the state changes to false, I remove the current address:

        netif_set_ipaddr(KNX_NETIF.getNetIf(), 0);
        netif_set_link_down(KNX_NETIF.getNetIf());

In principle, I would prefer the stack to handle these basic functions, but so far we have always had to do it ourselves.

> Is there any way of seeing where your excess time is being spent in loop?

Our calls in the loop are limited to max. 6 ms. Then there is a return to main. In my opinion, the long time can only be caused by the interrupt.

earlephilhower avatar earlephilhower commented on June 3, 2024

Those calls are not protected and you may end up re-entering LWIP which gives undefined behavior. Can't say it's causing your problem, but it's not safe in general.

It's simple to add the infra to protect them. Look at cores/lwip_wrap.cpp and the lib/platform_wrap.txt file. Using those templates it's pretty simple to add in the calls you're doing.

traxanos avatar traxanos commented on June 3, 2024

You've already helped me with that :D

That means that if I temporarily leave the handling out, the error should be gone. If that turns out to be the case, you could think about making the interrupt path safe.

earlephilhower avatar earlephilhower commented on June 3, 2024

> Our calls in the loop are limited to max. 6ms. Then there is a return to main. in my opinion, the long time can only be caused by the interrupt.

Possibly, but the thing is the IRQ mode calls the exact same handler (LwipIntfDev<RawDev>::handlePackets) as the async_context one does. It just doesn't poll every 20ms; instead it reads as soon as the HW says there's a packet:

    template<class RawDev>
    void LwipIntfDev<RawDev>::_irq(void *param) {
        LwipIntfDev *d = static_cast<LwipIntfDev*>(param);
        ethernet_arch_lwip_begin();
        d->handlePackets();
        sys_check_timeouts();
        ethernet_arch_lwip_end();
    }

    static void ethernet_timeout_reached(__unused async_context_t *context, __unused async_at_time_worker_t *worker) {
        assert(worker == &ethernet_timeout_worker);
        __ethernet_timeout_reached_calls++;
        ethernet_arch_lwip_gpio_mask(); // Ensure non-polled devices won't interrupt us
        for (auto handlePacket : _handlePacketList) {
            handlePacket.second();
        }
    #if defined(ARDUINO_RASPBERRY_PI_PICO_W)
        if (!rp2040.isPicoW()) {
            sys_check_timeouts();
        }
    #else
        sys_check_timeouts();
    #endif
        ethernet_arch_lwip_gpio_unmask();
    }

The packet handlers will try and read up to 10 packets for all HW:

    template<class RawDev>
    err_t LwipIntfDev<RawDev>::handlePackets() {
        int pkt = 0;
        while (1) {
            if (++pkt == 10)
                // prevent starvation
            {
                return ERR_OK;
            }
            // ... (excerpt ends here)

earlephilhower avatar earlephilhower commented on June 3, 2024

> you've already helped me with that :D

Actually, the handling may need to change since now you need to disable the GPIO interrupts as well as grab a mutex. See about adding an ethernet_arch_lwip_gpio_mask call before taking the mutex and an ethernet_arch_lwip_gpio_unmask call after releasing it. Again, I don't imagine that code gets called much so it may not be related to your issue here, but better safe than sorry...

earlephilhower avatar earlephilhower commented on June 3, 2024

One other thing, thinking about it, are you getting a packet storm? It's 2024 so I hope you're not on a hub, but with the IRQ mode it may be possible if you send 100s of packets at high speed that the 10-packet-per-IRQ call would end up being called over and over.

In async_context mode, you will get at max 10 packets every polling period (20ms by default). If your HW gets 20 packets in 20ms, the HW will throw away half of them.

In IRQ mode, if after pulling 10 packets out of the HW the IRQ gets re-asserted, the IRQ will be called again almost immediately. You'd get all 20 packets (assuming LWIP buffers available) but spend 2x the processing time doing so, of course.

traxanos avatar traxanos commented on June 3, 2024

I was planning to do it this way.

    ethernet_arch_lwip_begin();
    return KNX_NETIF.isLinked();
    ethernet_arch_lwip_end();

Should I call ethernet_arch_lwip_begin plus ethernet_arch_lwip_gpio_mask, or the mask instead?

> I hope you're not on a hub

No, it is modern network equipment :D

earlephilhower avatar earlephilhower commented on June 3, 2024

That's not going to call the gpio masking, so you'll want to add in the calls manually. (Also you'll not want to return before unlocking it. 😆 ) But, again, it seems like those calls will be very infrequently made.

traxanos avatar traxanos commented on June 3, 2024

Unfortunately, removing the calls didn't help. The device still hung, though only after about 6-7 hours; it had never lasted that long before.

What I also noticed during the test is that our warning

0d 01:11:06: Common:                  Warning: The loop took longer than usual (133 >= 100)
0d 01:21:06: Common:                  Warning: The loop took longer than usual (132 >= 100)
0d 01:31:06: Common:                  Warning: The loop took longer than usual (140 >= 100)

is displayed exactly every 10 minutes, and by that I mean exactly 10 minutes.

earlephilhower avatar earlephilhower commented on June 3, 2024

There are no built-in 600-second timers AFAIK, so I can't really help you there from the core side.

Off the top of my head, what is your DHCP lease lifetime? Could you be requesting and receiving a new (same) lease at 10m intervals?

I guess one thing I would have to say is that the polled networking might be okay for a soft-realtime system, but the IRQ one would not. As mentioned before, the IRQ will try and process all packets, meaning that if you get a packet storm the CPU usage is unbounded as every packet will at least be attempted to be read. For the polled/async_context version you're guaranteed no more than 10 packets per polling period, so there is a soft upper bound for the time spent doing it. Any more would be thrown away by the HW.

You could look at checking the packets being processed every loop. Look at the LwipIntfDev<>::packets{received,sent}() method.

You could also do some instrumentation on the IRQ side, tracking the delta rp2040.getCycleCount64() from IRQ start to finish.

traxanos avatar traxanos commented on June 3, 2024

The lag doesn't really bother me. Our platform uses DMA and interrupts for time-critical work, so if the loop sometimes takes a few ms longer, it has no effect.

DHCP does not seem to be the reason. During the lag there are no DHCP requests on the network, and the lease times don't match either.

Currently the problem can be reproduced quite well by calling isLinked.

This one doesn't work.

    ethernet_arch_lwip_begin();
    const bool state = KNX_NETIF.isLinked();
    ethernet_arch_lwip_end();
    return state;

but if I understand correctly, I have to do this so that the interrupt is skipped.

    ethernet_arch_lwip_gpio_mask();
    const bool state = KNX_NETIF.isLinked();
    ethernet_arch_lwip_gpio_unmask();
    return state;

Or should I do both?

earlephilhower avatar earlephilhower commented on June 3, 2024

You'll want to disable the GPIO interrupt then take the lwip mutex (ethernet_arch_lwip_gpio_mask(); ethernet_arch_lwip_begin()) and the reverse order when done. There may be a better way of centralizing these steps (i.e. move it silently into the isLinked method) if this does turn out to be the issue.

earlephilhower avatar earlephilhower commented on June 3, 2024

In this case, it's not LWIP but SPI which you're protecting against re-entrancy.

    bool isLinked() {
        ethernet_arch_lwip_begin();
        auto ret = wizphy_getphylink() == PHY_LINK_ON;
        ethernet_arch_lwip_end();
        return ret;
    }

That call in the function does SPI operations. If you get a packet-avail IRQ on the GPIO while that SPI is running you're going to start another SPI operation in the middle of an ongoing one. At best the internal SPI object state will be destroyed. At worst, it'll completely confuse the W5500 chip and you'd need a reset/power cycle to clear it up.

It's a bug in the w5500 driver, I would say. The mask call should be added before the lwip mutex is taken, and the unmask call after it is released. I need to verify the other devices, too, now that a generic failure mode was found.

earlephilhower avatar earlephilhower commented on June 3, 2024

The same bug is also in the ENC28J60 driver, but not in the W5100 one (because there is no link register to read, so it always returns true).
