Coder Social home page Coder Social logo

Comments (14)

lorenzodonini avatar lorenzodonini commented on August 15, 2024 2

@utsavanand2 The error you're seeing occurs when a charging station connects to the server, while the server already holds an open connection with the same charging station. The error occurs on websocket level btw.

Normally, this shouldn't happen, but we decided to handle this edge case, since duplicate IDs can lead to other issues instead.

If there's network issues and the websocket close message never gets through to the other side, but for some reason the new connect message does, then:

  1. the connection gets accepted by the server
  2. the server kills the connection immediately afterwards, leading to the client logging disconnected from serverwebsocket: close 1008 (policy violation): a connection with this ID already exists

I don't remember why it was decided to send a 1008 error instead of denying a websocket upgrade in the first place, but there was probably a reason. I'm looking through documentation and RFCs to get wiser.

from ocpp-go.

lorenzodonini avatar lorenzodonini commented on August 15, 2024 1

The issue described in #141 is completely separate and not linked to this at all.

#133 might indeed be related though. I'll look into this further and hopefully find a root cause soon.

from ocpp-go.

lorenzodonini avatar lorenzodonini commented on August 15, 2024

Hey,

the initialization snippet looks good to me, although 8s seems a bit low.
The timeout you are setting simply makes sure an outgoing request is canceled, if no response is received from the charging station within those 8 seconds. This allows you to overcome the issue where charging stations don't reply quickly enough (e.g. because of network issues), but it also means that message is lost. If the charging station eventually replies, your CSMS will never process that response.

The only case, where automatic retransmission is currently supported, is if the websocket layer recognizes that the other endpoint disconnected. This is typically achieved with the ping-pong mechanism. The ocpp-j layer currently has no other way of knowing, whether the other endpoint is having network issues, which is why the timeout is meant to be a last resort.

It's very hard to just guess the cause of the issue, since you're apparently dealing with intermittent network connections.
Some information that would help:

  1. Are you using websocket ping-pong in the first place? If so, what's the ping-pong timeout?
  2. Are you manually retransmitting messages, when they're canceled due to the timeout?
  3. Can you provide a log dump of both affected endpoints (the part where the issue arises)? If you have debug logs enabled on the CSMS, that should also provide more information about the message flow

from ocpp-go.

utsavanand2 avatar utsavanand2 commented on August 15, 2024

Hi @lorenzodonini, I switched back to the previous release where we did have some issues with the stations with network issues, but we didn't face a complete downtime for all charging stations. But with the new release of v0.16.0, even the 30 second timeout is giving us this trouble, at this point my belief is the issue is not with the timeout but rather us switching from v0.16.0 from v0.15.0 because we switched to the custom dispatcher the same time we upgraded to the new release.

1.> No we're not using websocket ping-pong, we just rely on the heartbeats to keep the connection alive.
2.> No we do not retransmit the messages.
3.> I do have the logs but unfortunately the logs are not on the debug level, so they are not of much use except the fact that the quantity of logs drastically reduces over a time period of 24 hrs after which it seems like it continues to run but doesn't accept any new requests.

I have built on top of your examples for the charge-point and I tried connecting to the production server when I started seeing this issue, pasting the logs of the charge-point-emulator below: (the server here is using the 30second default)

tsavanand@Utsavs-MacBook-Pro charge-point-simulator % ./charge-point-simulator 
INFO[2022-08-21T16:41:55+05:30] connected to central system at ws://csms.daheimladen.com 
INFO[2022-08-21T16:42:25+05:30] <nil> ocpp message (2596996162): GenericError - request timed out, no response received from server 
INFO[0040] sending status notification                  
INFO[0050] sending status notification                  
ERRO[2022-08-21T16:42:55+05:30] disconnected from serverread tcp 192.168.0.208:53483->35.246.163.54:80: i/o timeout 
INFO[0060] sending status notification                  
ERRO[2022-08-21T16:43:00+05:30] disconnected from serverwebsocket: close 1008 (policy violation): a connection with this ID already exists 
INFO[0070] sending status notification                  
ERRO[2022-08-21T16:43:06+05:30] disconnected from serverwebsocket: close 1008 (policy violation): a connection with this ID already exists 
ERRO[2022-08-21T16:43:11+05:30] disconnected from serverwebsocket: close 1008 (policy violation): a connection with this ID already exists 
INFO[0080] sending status notification                  
ERRO[2022-08-21T16:43:17+05:30] disconnected from serverwebsocket: close 1008 (policy violation): a connection with this ID already exists 
ERRO[2022-08-21T16:43:22+05:30] disconnected from serverwebsocket: close 1008 (policy violation): a connection with this ID already exists 
INFO[0090] sending status notification                  
INFO[0100] sending status notification                  
INFO[0110] sending status notification                  
INFO[0120] sending status notification                  
INFO[0130] sending status notification                  
INFO[2022-08-21T16:44:08+05:30] ocpp message (2854263694): GenericError - request timed out, no response received from server 

I'll switch our csms service back to v0.15.0, I believe this should fix our problems until we can find out and resolve the issue as it was the last working release for us. If I find anything of interest, I would share it with you here.

I also want to know if you have any plans for opening up your repository to Github Sponsors, as I use this project in production, and being an open source contributor in the past myself for some cloud native projects, I know for a fact that maintaining open source projects is hard, companies benefit from it with no intent of giving back (atleast most of the times). So I would love to pitch in personally as a Github Sponsor for this project ❤️
Thanks a lot for your time and efforts.

from ocpp-go.

lorenzodonini avatar lorenzodonini commented on August 15, 2024
INFO[2022-08-21T16:42:25+05:30] <nil> ocpp message (2596996162): GenericError - request timed out, no response received from server 

if I understand this right, the charge point sent a request to the CSMS, which never replied, triggering the timeout? Can you verify, whether the CSMS ever received the request?

Btw to enable debug logs you can do this:

ws.SetLogger(logger)
ocppj.SetLogger(logger)
// the logger needs to be set to debug level

Any logger that adheres to the interface defined in logging/log.go will do.

If you could provide a detailed message flow between the two endpoints with debug logs enabled, I may be able to identify where the issue is arising at least.

I will definitely have another look at the changes introduced in v0.16 and double-check if something broke. I hope v0.15 serves you well in the meantime.

I'm very happy to know you're using this in production. Yes, maintaning OSS can be tough and I do plan on opening the repo up to Github Sponsors. Before I do so, I just wanted to polish a few things (i.e. publish a documentation site, make sure nothing major is broken). Stay tuned.

from ocpp-go.

utsavanand2 avatar utsavanand2 commented on August 15, 2024

Issue #133 looks similar. Also is there a relation with the issue #141 ?

from ocpp-go.

utsavanand2 avatar utsavanand2 commented on August 15, 2024

Hi @lorenzodonini Upon reading the logs from here and conducting some experiments with the library and my dummy ocpp client, I noticed that we get this error when the charging station client abruptly crashes and disconnects from the server:

ERRO[2022-10-01T01:09:46+05:30] disconnected from serverwebsocket: close 1008 (policy violation): a connection with this ID already exists

Could this be the reason for the issue we're facing? I see that this check was in v0.15.0 as well and is not new in v0.16.0, but could be that for some other change in v0.16.0 is causing us to hit this policy violation more easily.

In #133 the author of the issue also says "after getted BootNotification request more than 10 times from chargepoint ,ths ocpp server is stucked". I'm carefully reviewing the changes in the code myself, but I would really appreciate your thoughts on this as well :))

I have another query regarding the changes which might relate to our issue:
I was not able to understand the need to increase the size of the requestChannel from 1 to 20 in v0.16.0
Screen Shot 2022-10-01 at 2 51 39 PM

from ocpp-go.

utsavanand2 avatar utsavanand2 commented on August 15, 2024

@lorenzodonini Yes I saw the check on the websocket file where a connection to the same ID was ressponded with the aforementioned code. But again I'm a bit puzzled because this check was there in v0.15.0 and around 3000 connected stations are working fine with it in our production environment.
In the last months, I've tried to rollout v0.16.0 about 2 to 3 times and after 24-48 hrs the stations would not respond to RFID cards, or RemoteStart and Stops.

The last time the ocpp server became unresponsive with v0.16.0, I ran the ocpp-go client library against my production server and the logs I mentioned above is what I received.
I have enabled debug logs on the staging area where some stations are running without any issues with happy users, but the same is causing issues when I promote this library update to production with around 3000 active charging stations.

from ocpp-go.

utsavanand2 avatar utsavanand2 commented on August 15, 2024

Hi @lorenzodonini :)) Good afternoon, I'll be publishing the beta version of the csms server to production, which means we will again go from v0.15.0 to v0.16.0, this time with debug logs enabled. The motive is to get some debug logs which will help us find the issue to the problem.

I've also forked your repo and added some error logs that might help if we encounter errors like last time (for e.g., disconnected from serverwebsocket: close 1008 (policy violation): a connection with this ID already exists)
https://github.com/utsavanand2/ocpp-go/commit/8b82ac15fa7c5563c942b112ddf3b16e18bc0aca

I've decided to do this promotion on Friday evening and monitor it through the weekend.
Is there any other section of the code or file which you would like me to pay attention to and add more debug or error level logs which might aid you as well?

from ocpp-go.

utsavanand2 avatar utsavanand2 commented on August 15, 2024

@lorenzodonini I ran the update on the production environment multiple times since Friday, I have the debug logs so should I post them here?
One more question, have you tried using the ocpp-go client library as a tool for loadtesting the ocpp server? I will be trying to create multiple instances of the charge point.
I do have a couple of observations, the error disconnected from serverwebsocket: close 1008 (policy violation): a connection with this ID already exists seems to be coming after the charge point has connected, did not get a response from the CSMS, so probably tries to reconnect again without dropping the connection, but again this is not the primary concern, the concern is that the server seems to give up quickly when a lot of charging stations try to connect.

from ocpp-go.

utsavanand2 avatar utsavanand2 commented on August 15, 2024

@lorenzodonini I ran some tests yesterday by keeping the library at v0.16.0 but replacing the dispatcher file with the version used in v0.15.0 which doesn't use the context timeout, and it worked in production :)

from ocpp-go.

lorenzodonini avatar lorenzodonini commented on August 15, 2024

@utsavanand2 Thanks for collecting some debug logs. Feel free to upload them to any cloud and either send me a link via email or paste it here (maybe obfuscate sensitive data, if any).

I didn't do an extensive load testing, but scaling a sample setup to 100 clients seems to work just fine in my tests. I expect there is some race condition or incorrect state handling that leads to the issue you're experiencing.

Narrowing it down to the new context timeouts definitely helps. I'll have a deeper look asap and hope to have a fix soon.

from ocpp-go.

utsavanand2 avatar utsavanand2 commented on August 15, 2024

Hi @lorenzodonini thanks a lot! Can you please drop me an email at [email protected], I would love to discuss this more in detail, and drop a link to the logs via email :)

from ocpp-go.

utsavanand2 avatar utsavanand2 commented on August 15, 2024

@lorenzodonini I can be wrong but I was wondering if we should handle the case for when the context is cancelled in waitForTimeout

func (d *DefaultServerDispatcher) waitForTimeout(clientID string, clientCtx clientTimeoutContext) {
and delete the pending request as we do it for the case when we get an elapsed timeout?

from ocpp-go.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.