NoCurrentConnection,about stompgem/stomp

Comments (43)

ripienaar commented on July 28, 2024

I just noticed 1.2.5 is out and some work was done with socket, sorry if this was fixed there - will update and keep an eye on it, though for me it happens like once a month

from stomp.

ripienaar commented on July 28, 2024

confirmed to still happen on 1.2.5, whenever there's some bad networking with lots of packet loss, timeouts and such - things seem to go wrong and it will never auto reconnect then.

from stomp.

gmallard commented on July 28, 2024

Hmmmm .... can you please take a look at the comments in issue #47, and also at the commit that closed 47?

I tested that by:

a) Establishing a connection
b) Attempt a publish every so often
c) Kill the broker
d) Verify @closed is true
e) Restart the broker and verify that a connection is reestablished

I am missing something here I think. The retries should all be synchronous in the sense they take place immediately if the connection fails. If max retry counts are exceeded then you are done. The connection is dead.

So I do not understand your comment about "other methods" taking care of reconnection attempts. Are multiple threads involved in what you are doing?

Right now I have absolutely no idea how to recreate this.

I will review that fix for issue #47, and sanity check that again.

from stomp.

ripienaar commented on July 28, 2024

What I mean by other methods taking care of reconnection is best shown by the code at https://github.com/stompgem/stomp/blob/master/lib/stomp/connection.rb#L361-370

This is effectively saying:

def publish(destination, message, headers = {})
   # raise if not connected
   # prepare message
   # transmit message via a call to socket
end

The last step - transmit message - will call the socket method, and the socket method will attempt to reconnect if the connection failed

However if for whatever reason the @closed property is set to true by the time you enter this method it will raise in the first step, never attempting to call socket and so never kicking in the reconnect logic again.

I might be missing something, but it seems to me that in some circumstances this happens: @closed is set to true when it enters these methods, and from then on it's just raising forever instead of reconnecting.
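
The failure mode described here can be sketched with a toy class. This is not the gem's code: ToyConnection, its socket method, mark_closed! and ClosedError are hypothetical stand-ins. The only thing taken from the discussion is the ordering, where publish raises on @closed before it ever reaches the socket call that would reconnect.

```ruby
# Toy model of the guard described above. All names are hypothetical
# stand-ins, not the stomp gem's implementation.
class ClosedError < RuntimeError; end

class ToyConnection
  attr_reader :reconnect_attempts

  def initialize
    @closed = false
    @reconnect_attempts = 0
  end

  # Simulate whatever sets @closed during a network event.
  def mark_closed!
    @closed = true
  end

  # Stand-in for the socket method: the only place a reconnect is attempted.
  def socket
    @reconnect_attempts += 1
    @closed = false
    :socket
  end

  def publish(destination, message)
    raise ClosedError, "no current connection" if @closed # raise if not connected
    frame = "SEND #{destination} #{message}"               # prepare message
    socket                                                 # transmit via socket
    frame
  end
end

conn = ToyConnection.new
conn.publish("/queue/a", "one") # works, and goes through socket
conn.mark_closed!

failures = 0
3.times do
  begin
    conn.publish("/queue/a", "two")
  rescue ClosedError
    failures += 1 # the guard fires, so socket is never reached
  end
end
```

Every publish after mark_closed! fails at the guard, and reconnect_attempts stays where it was, which is the "forever raising" behaviour in miniature.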

I cannot recreate by forcibly restarting things or killing brokers, it seems to happen only during bad network conditions like high packet loss, routing errors etc so would be hard to debug.

Some behaviour has changed in 1.2.5 and I am seeing new logs now, so I guess I need to monitor it more and wait patiently for my next network event before I can really get into it. It's a really difficult thing to try to reproduce :(

from stomp.

gmallard commented on July 28, 2024

Yep, I understand what you are saying about a "raise if not connected".

I just reran my "broker kill" test, and that functions as expected.

When you connect are you using:

:max_reconnect_attempts => ????

from stomp.

ripienaar commented on July 28, 2024

I set max attempts to 0

from stomp.

gmallard commented on July 28, 2024

With max recon attempts == 0, you should see a "retry forever" behavior somewhere. I am not understanding why it seems that is not the case.

Do you have any logs from one of these events that you could share?

Is the connection shared across multiple threads?

from stomp.

ripienaar commented on July 28, 2024

yes, it's shared across threads. One is almost constantly in c.receive while others publish.

I'll try to introduce packet loss this weekend and see if I can reproduce it and run my system at debug for the time.

from stomp.

gmallard commented on July 28, 2024

Given that, your description makes a lot more sense to me.

I suspect one thread, likely the receiver thread is in a retry loop.

The publisher threads do not 'know' that, and they see the 'raise if not connected' behavior.

Do I understand correctly: you first observed this using base 1.2.4 ?

If yes, then I am still somewhat baffled.

This went in between 1.2.4 and 1.2.5:

270512b

Comments on that commit?

from stomp.

ripienaar commented on July 28, 2024

Yes, 1.2.4 and probably earlier did this. I upgraded to 1.2.5 just last night and will have to see if that improves things

from stomp.

gmallard commented on July 28, 2024

Given the differences between 1.2.4 and 1.2.5, I do not hold out much hope for an improvement .......

from stomp.

gmallard commented on July 28, 2024

Your receiver thread ...... what ack mode is the subscription using? Does the receive logic check for 'ERROR' frames and handle them somehow? I would like to know if your receiver has seen an 'ERROR' frame from AMQ that indicates the connection is absolutely dead.

I have been playing, and have one hacked example where AMQ will actively refuse a reconnect because it has not handled an ACK during shutdown / kill. This particular example reconnects fine to Apollo and RabbitMQ.

Sunday afternoon ..... I am going to the pub for a while.

from stomp.

ripienaar commented on July 28, 2024

I do log but am not seeing anything related to error frames now. I have it logging at a higher level now and will just have to see; it's unfortunately over SSL so tcpdump isn't much use

doesnt happen that often for me, will just have to wait it out

from stomp.

gmallard commented on July 28, 2024

When this kind of network / broker meltdown occurs, I really need to know if one of your threads is in an infinite retry loop. I believe that should be true, given :max_reconnect_attempts => 0.

The gem originally implemented the 'reliable' concept with 2557ad1 on 20060429. Quite some time ago.

Given your description of the application's design and implementation, I doubt that any gem version could have ever gracefully handled the kind of network / broker behavior you are reporting.

IMO, at a bare minimum, your publish threads need to check @closed before attempting to 'conn.publish'. And do something sane if the check is true. Go in to a sleep / retry loop maybe?
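
A minimal sketch of that suggestion, assuming only that the connection object responds to closed? and publish. The helper name publish_when_open, its attempts and delay parameters, and the FakeConnection used to exercise it are illustrative assumptions, not part of the gem.

```ruby
# Hypothetical helper: wait for the connection to report open before publishing.
# Returns true if the message was handed to publish, false if we gave up.
def publish_when_open(conn, destination, message, attempts: 5, delay: 0.01)
  attempts.times do
    unless conn.closed?
      conn.publish(destination, message)
      return true
    end
    sleep(delay) # connection is closed: back off and check again
  end
  false
end

# Stand-in connection that reports closed for the first few checks.
class FakeConnection
  attr_reader :sent

  def initialize(closed_checks)
    @closed_checks = closed_checks
    @sent = []
  end

  def closed?
    @closed_checks -= 1
    @closed_checks >= 0
  end

  def publish(destination, message, headers = {})
    @sent << [destination, message]
  end
end

recovering = FakeConnection.new(2) # closed twice, then open
delivered = publish_when_open(recovering, "/queue/a", "hello")

dead = FakeConnection.new(100)     # never reopens within the attempts
gave_up = publish_when_open(dead, "/queue/a", "hello", attempts: 3)
```

The check and the publish are not atomic, so with a connection shared across threads a caller would presumably still need to rescue whatever error publish raises.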

tcpdump and other low level tools .... may introduce behavior that does not match reality. I am sure you are aware of that ......

from stomp.

ripienaar commented on July 28, 2024

tcpdumping a connection really shouldn't be affecting it; on unix at least this is completely unrelated to the app, unless the app specifically changes its behaviour in the presence of a promisc network

anyway, I've run this specific program with the same design for 3 years now on much worse networks, and only since 1.2.x have I started noticing it getting into a state where it can't reconnect

I really can't add much of use at this point until it happens again and I get better logs

from stomp.

gmallard commented on July 28, 2024

Hmmmmm ... since 1.2.x.

That is interesting.

There was a lot of logic added between 1.1.10 and 1.2.0 of course. It was an extensive change.

That was all of the initial Stomp 1.1 protocol level logic, as well as other changes. Others were:

f9c3a36 Raise on duplicate subscribe for reliable connections.
7dd9397 Raise if called after Connection#disconnect or Client#close.
3d5b9a2 Add connect timeout override with hashed parameters.

And possibly:

3d5b9a2 Add connect timeout override with hashed parameters.

If you can run a custom build and test again, I suggest trying a 'revert' of 7dd9397 in particular. It would be an interesting experiment.

from stomp.

gmallard commented on July 28, 2024

When Stomp 1.1 started, the initial thinking behind 7dd9397 was that it would prevent gem users from 'shooting themselves in the foot'.

I still think that reasoning is sound.

However in the next version I will provide a way for that behavior to be nullified on a per connection basis.

from stomp.

ripienaar commented on July 28, 2024

Had one machine do it again today. It was a pretty low spec VMware based VM, it was having quite high CPU and its network connection was having issues. The logs I got were:

activemq.rb:131:in `on_ssl_connectfail' SSL session creation with stomp+ssl://x@x:6165 failed: execution expired
activemq.rb:116:in `on_miscerr' Unexpected error on connection stomp+ssl://x@x:6165: transmit to x failed: SSL_write:: bad write retry
activemq.rb:116:in `on_miscerr' Unexpected error on connection stomp+ssl://x@x:6165: transmit to x failed: SSL_write:: bad write retry

And after this it was game over, so I do think this is multi-threading related. Either an option to nullify it or a supported way to attempt a reconnect would be fine. In my case I don't care if these messages get sent when the network is lossy or the machine is struggling; I mostly care that the connection recovers on its own when things return to normal.

from stomp.

gmallard commented on July 28, 2024

It is unclear to me that nullifying the checks for closed? is actually going to help in your case. Nevertheless expect a trial gem shortly.

from stomp.

gmallard commented on July 28, 2024

The first log message that contains 'execution expired' has me wondering.... I am now considering the possibility that this line of code:

ssl.connect

should perhaps be something like:

Timeout::timeout(@connect_timeout, Stomp::Error::SocketOpenTimeout) do
  ssl.connect
end

Please confirm/deny: you are running with

:connect_timeout => 0

Please add to your logging code for next time and show me:

parms[:ssl_exception]
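
A small sketch of how such a wrapper behaves, using Ruby's standard Timeout module. SocketOpenTimeout here is a local stand-in for Stomp::Error::SocketOpenTimeout, with_connect_timeout is a hypothetical helper, and the sleep calls stand in for ssl.connect. One point worth keeping in mind: Timeout.timeout runs the block with no deadline when given 0 or nil, so with :connect_timeout => 0 a wrapper like this would not bound the connect at all.

```ruby
require "timeout"

# Local stand-in for Stomp::Error::SocketOpenTimeout.
class SocketOpenTimeout < RuntimeError; end

# Hypothetical helper: run the block under a deadline. A value of 0 (or nil)
# means no deadline, which is how Timeout.timeout treats those values.
def with_connect_timeout(seconds)
  Timeout.timeout(seconds, SocketOpenTimeout) { yield }
end

# A slow "connect" is interrupted once the deadline passes.
timed_out = begin
  with_connect_timeout(0.05) { sleep 1; :connected }
rescue SocketOpenTimeout
  :timed_out
end

# With 0 the block simply runs to completion.
unbounded = with_connect_timeout(0) { sleep 0.1; :connected }
```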

from stomp.

ripienaar commented on July 28, 2024

I am running with connect_timeout 0 yes, will add that logging

from stomp.

ripienaar commented on July 28, 2024

Oh, I do log the ssl_exception it's in the logs above:

Log.error("SSL session creation with #{stomp_url(params)} failed: #{params[:ssl_exception]}")

from stomp.

gmallard commented on July 28, 2024

lol. Sorry, I really meant show me this:

parms[:ssl_exception].backtrace.join("\n")

from stomp.

ripienaar commented on July 28, 2024

OK :)

from stomp.

gmallard commented on July 28, 2024

You have described .... a total network meltdown as far as I can see.

Presumably you want retries to occur ..... well, forever essentially.

Consider: Under the following conditions:

a) Stomp 1.1+ connection
b) Client is sending heartbeats

nothing is likely to work very well. If a heartbeat send is actually required, the current heartbeat send code (a separate thread) does:

1) Obtain the transmit_semaphore lock (misnamed)
2) Writes a "\n" to the connection
3) If an exception is detected, optionally log it, and re-raise

Kaboom. You are done.

I need to think about that case I guess..... but you also should.
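
The heartbeat steps listed above can be modelled roughly as follows. This is a sketch, not the gem's heartbeat code: send_heartbeat, BrokenIO and the array used as a log are hypothetical, and a plain Mutex stands in for the transmit_semaphore lock.

```ruby
require "stringio"

# Hypothetical model of one heartbeat send: take the lock, write "\n",
# and on failure log (optionally) and re-raise.
def send_heartbeat(io, lock, log = nil)
  lock.synchronize do
    begin
      io.write("\n")
    rescue StandardError => e
      log << "heartbeat send failed: #{e.message}" if log
      raise
    end
  end
end

# Stand-in for a socket whose write fails, as during a network meltdown.
class BrokenIO
  def write(_data)
    raise IOError, "bad write retry"
  end
end

lock = Mutex.new
good = StringIO.new
send_heartbeat(good, lock) # healthy connection: one newline goes out

log = []
heartbeat_thread = Thread.new do
  Thread.current.report_on_exception = false # keep the demo output quiet
  loop do
    send_heartbeat(BrokenIO.new, lock, log)
    sleep 0.01
  end
end

# join re-raises the exception that killed the heartbeat thread
thread_error = begin
  heartbeat_thread.join
  nil
rescue IOError => e
  e
end
```

The first failed write ends the loop: the exception is logged once, re-raised, and the thread is gone, so nothing in this model ever tries again.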

from stomp.

ripienaar commented on July 28, 2024

So I am still not running 1.1 here - activemq bug - but will get there. It's possible that the keep alive thread will try to reconnect when running in that mode, I did not check.

But yes, it's a total network meltdown and later recovery. For me and others I know who use it, the stomp gem is amazing because you can just tell it to connect and forget about it: it always retries, never gets into a place it can't retry from, and basically allows you to build software you can forget about. This behavior changes that.

Of course when the network is out there's nothing to be done, messages fail to send etc, totally kewl - but it recovers on its own without programmer/admin/user interaction after the network event and that was a major bonus of this gem - apart from that it speaks STOMP :P

from stomp.

gmallard commented on July 28, 2024

AMQ 5.7.0 seems flaky to me. I have tests that have been running for years that fail when using it. Tests that have nothing to do with this discussion.

from stomp.

gmallard commented on July 28, 2024

Adding a documentation note at this time. I have been unable to recreate this using test strategies described below.

Strategy 1: Broker begins in a stable network environment. Multiple long running multi-threaded publishers and consumers are started. A script that loops endlessly is started that does:

sleep n seconds
broker stop
sleep m seconds
broker start

Strategy 2: Broker begins in a stable network environment. Multiple long running multi-threaded publishers and consumers are started. A script that loops endlessly is started that does:

sleep x seconds
sudo iptables ..... # start dropping packets
sleep y seconds
sudo iptables .... # do not drop packets

Strategy 3: A combination / merge of Strategy 1 and 2.

Using Strategy 3 with poorly (or well?) chosen sleep values, it is possible to encounter situations where a client will take a significant amount of time to actually reconnect. However, in all cases a successful reconnect does eventually happen.

Am certainly open to new suggestions for recreating this.

Note: during this effort I did uncover the bug that reconnect_delay was not reinitialized upon a successful reconnect. That is fixed, but only in the latest gem version. See: 9099371
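
To illustrate why a delay that is not reinitialized matters, here is a sketch of a reconnect delay that grows across failed attempts. The ReconnectDelay class, its parameter names and the exponential growth are assumptions made for the example, not the gem's internals.

```ruby
# Hypothetical backoff tracker. The class and parameter names are
# illustrative and are not the gem's internals.
class ReconnectDelay
  attr_reader :current

  def initialize(initial: 0.01, multiplier: 2.0, max: 30.0)
    @initial = initial
    @multiplier = multiplier
    @max = max
    @current = initial
  end

  # Called after each failed attempt: returns the delay to sleep for,
  # then grows the delay for the next failure.
  def next_delay
    delay = @current
    @current = [@current * @multiplier, @max].min
    delay
  end

  # Called after a successful reconnect. Skipping this step is the kind of
  # bug described above: later outages would start from an inflated delay.
  def reset
    @current = @initial
  end
end

backoff = ReconnectDelay.new(initial: 1.0, multiplier: 2.0, max: 8.0)
delays = 5.times.map { backoff.next_delay } # first outage
backoff.reset                               # reconnect succeeded
after_reset = backoff.next_delay            # second outage starts small again
```

Without the reset call, the second outage in this model would begin at the capped delay, which would fit the observation that some clients take a significant amount of time to reconnect.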

from stomp.

ripienaar commented on July 28, 2024

I am still alive, been traveling a lot so not had time to devote to this.

However I now have 2 hosts that do this daily, so at least I have somewhere that I can play and try various things. Upgraded to your latest release, set the closed_check option and will monitor.

I should probably chat with you about my general approach re connection/sending/sharing connection between threads at some point as I've found a few more weird things and I think it's because I am just doing it wrong or based on wrong assumptions

from stomp.

gmallard commented on July 28, 2024

I will continue to monitor this traffic of course.

As noted in my previous post I have been totally unable to recreate ..... what you are apparently seeing.

We can certainly discuss your approach. From the somewhat minimal description you have provided ..... at first blush I see nothing inherently wrong with it.

From a multi-thread perspective it seems there are several basic approaches:

1) Single connection, shared across threads
2) Multiple connections, one per thread
3) A hybrid of 1) and 2)

I think your approach is using option 1).

from stomp.

gmallard commented on July 28, 2024

Does this app use failover, like

hash = {
  :hosts => [
    # First connect is to remotehost1
    {:login => "login1", :passcode => "passcode1", :host => "remotehost1", :port => 61612, :ssl => true},
    # First failover connect is to remotehost2
    {:login => "login2", :passcode => "passcode2", :host => "remotehost2", :port => 61613, :ssl => false},

  ], .....

????

from stomp.

ripienaar commented on July 28, 2024

not in mine, but earlier this week at a client I saw failover also not working in 1.2.x where it worked with the same code in 1.1.x.

Once I am back from my travels I'll attempt to reproduce that with a simple bit of code; I think that should be easier than this one from what I saw

from stomp.

gmallard commented on July 28, 2024

One more question - since I am on a fishing expedition - are these connections straight TCP or SSL or either?

from stomp.

ripienaar commented on July 28, 2024

yes, ssl - the failover issue I had was also with SSL fwiw but I have not started on reproducing that one

from stomp.

gmallard commented on July 28, 2024

I want to try and set up an environment here that is as close to yours as possible. So:

a) What version of AMQ ??
b) Is AMQ running with ?needClientAuth=true ??
c) Are there any .... unusual ... AMQ configuration parameters specified ??
d) Ruby version - one more time please :-)

from stomp.

gmallard commented on July 28, 2024

Another Q. What is your transport scheme? I am thinking:

stomp+nio+ssl:// ........

but want to confirm.

from stomp.

ripienaar commented on July 28, 2024

XML file @ http://p.devco.net/268/ I just removed the auth and acl bits - these connections are to the verified_stompssl port

ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]

from stomp.

gmallard commented on July 28, 2024

AMQ Version ????

Might be relevant ......

from stomp.

ripienaar commented on July 28, 2024

sorry - it's your much hated 5.7, though I had the same in 5.6 iirc.

My 2 machines that were doing this daily for like a week have now stopped doing it all on their own, so frustrating not to be able to gather info or means of replication from them

from stomp.

gmallard commented on July 28, 2024

So ..... I am thinking I should close this. It can be re-opened at any time of course.

Do you have any updates?

from stomp.

ripienaar commented on July 28, 2024

no, i had a week where it happened like every few hours in november, then it just stopped and never came back - infuriating.

close it, I'll open again if i ever get proper data

from stomp.

gmallard commented on July 28, 2024

I will close it for now.

Ya' know .... my day to day is in a shop with a very large .NET and Java infrastructure. We use all sorts of tech to do failure detection, fail-over, and as a side effect load balancing.

I think you have some kind of infrastructure problem. Many possibilities: flaky NICs, flaky switches, unexpected network changes (routing table updates maybe?), ???? The possibilities are almost endless.

Just my opinion at present.

Regards, Guy

from stomp.

ripienaar commented on July 28, 2024

totally - my machines are at lots of small ISPs and communicate over the internet - it's guaranteed to suck and have issues. Stomp gem pretty much always retries solidly which is why i like it :)

from stomp.
