Comments (43)
I jus tnoticed 1.2.5 is out and some work was done with socket, sorry if this was fixed there - will update and keep an eye on it though for me it happens like once a month
from stomp.
confirmed to still happen on 1.2.5, whenever there's some bad networking with lots of packet loss, timeouts and such - things seem to go wrong and it will never auto reconnect then.
from stomp.
Hmmmm .... can you please take a look at the comments in issue #47, and also at the commit that closed 47?
I tested that by:
a) Establishing a connection
b) Attempt a publish every so often
c) Kill the broker
d) Verify @closed is true
e) Restart the broker and verify that a connection is reestablished
I am missing something here I think. The retries should all be synchronous in the sense they take place immediately if the connection fails. If max retry counts are exceeded then you are done. The connection is dead.
So I do not understand your comment about "other methods" taking care of reconnection attempts. Are multiple threads involved in what you are doing?
Right now I have absolutely no idea how to recreate this.
I will review that fix for issue #47, and sanity check that again.
from stomp.
What I mean by other connections taking of reconnection is given the code https://github.com/stompgem/stomp/blob/master/lib/stomp/connection.rb#L361-370
This is effectively saying:
def publish(destination, message, headers = {}) # raise if not connected # prepare message # transmit message via a call to socket end
The last step - transmit message - will call the socket method, and the socket method will attempt to reconnect if the connection failed
However if for whatever reason the @closed property is set to true by the time you enter this method it will raise in the first step, never attempting to call socket and so never kicking in the reconnect logic again.
I might be missing something but it seems to me that in some circumstances this happens, @closed is set to true when it enters these messages and then forever its just raising instead of reconnecting.
I cannot recreate by forcibly restarting things or killing brokers, it seems to happen only during bad network conditions like high packet loss, routing errors etc so would be hard to debug.
Some behaviour have changed in 1.2.5 and I am seeing new logs now so I guess I need to monitor it more and wait patiently for my next network event before I can really get into it. It's a really difficult thing to try to reproduce :(
from stomp.
Yep, I understand what you are saying about a "raise if not connected".
I just reran my "broker kill" test, and that functions as expected.
When you connect are you using:
:max_reconnect_attempts => ????
from stomp.
I set max attempts to 0
from stomp.
With max recon attempts == 0, you should see a "retry forever" behavior somewhere. I am not understanding why it seems that is not the case.
Do you have any logs from one of these events that you could share?
It the connection shared across multiple threads ?
from stomp.
yes, its shared across threads. One is almost constantly in c.receive while others publish.
I'll try to introduce packet loss this weekend and see if I can reproduce it and run my system at debug for the time.
from stomp.
Given that, your description makes a lot more sense to me.
I suspect one thread, likely the receiver thread is in a retry loop.
The publisher threads do not 'know' that, and they see the 'raise if not connected' behavior.
Do I understand correctly: you first observed this using base 1.2.4 ?
If yes, then I am still somewhat baffled.
This went in between 1.2.4 and 1.2.5:
Comments on that commit?
from stomp.
Yes, 1.2.4 and probably earlier did this, i just last night upgraded to 1.2.5 and will have to see if that improves things
from stomp.
Given the differences between 1.2.4 and 1.2.5, I do not hold out much hope for an improvement .......
from stomp.
Your receiver thread ...... what ack mode is the subscription using? Does the receive logic check for 'ERROR' frames and handle them somehow? I would like to know if your receiver has seen an 'ERROR' frame from AMQ that indicates the connection is absolutely dead.
I have been playing, and have one hacked example where AMQ will actively refuse a reconnect because it has not handled an ACK during shutdown / kill. This particular example reconnects fine to Apollo and RabbitMQ.
Sunday afternoon ..... I am going to the pub for a while.
from stomp.
I do log but did not seeing anythig related to error frames now. I have it logging at a higher level now and will just hve to see, its unfortunately over SSL so tcpdump isnt much use
doesnt happen that often for me, will just have to wait it out
from stomp.
When this kind of network / broker meltdown occurs, I really need to know if one of your threads is in a infinite retry loop. I believe that should be true, given :max_reconnect_attempts => 0.
The gem originally implemented the 'reliable' concept with 2557ad1 on 20060429. Quite some time ago.
Given your description of the application's design and implementation, I doubt that any gem version could have ever gracefully handled the kind of network / broker behavior you are reporting.
IMO, at a bare minimum, your publish threads need to check @closed before attempting to 'conn.publish'. And do something sane if the check is true. Go in to a sleep / retry loop maybe?
tcpdump and other low level tools .... may introduce behavior that does not match reality. I am sure you are aware of that ......
from stomp.
tcpdumping a connection really shouldnt be affecting it, on unix at least this is completely unrelated to the app unless the apps specifically changes its behaviour in the presence of a promisc network
anyway, I've run this specific program with the same design for 3 years now on much worse networks and only started noticing a situation where it gets in a situation it cant reconnect since 1.2.x
I really cant add much at this point of use till it happens again and i get better logs
from stomp.
Hmmmmm ... since 1.2.x.
That is interesting.
There was a lot of logic added between 1.1.10 and 1.2.0 course. It was an extensive change.
That was all of the initial Stomp 1.1 protocol level logic, as well as other changes. Others were:
f9c3a36 Raise on duplicate subscribe for reliable connections.
7dd9397 Raise if called after Connection#disconnect or Client#close.
3d5b9a2 Add connect timeout override with hashed parameters.
And possibly:
3d5b9a2 Add connect timeout override with hashed parameters.
If you can run a custom build and test again, I suggest trying a 'revert' of 7dd9397 in particular. It would be an interesting experiment.
from stomp.
When Stomp 1.1 started, the initial thinking behind 7dd9397 was that it would prevent gem users from 'shooting themselves in the foot'.
I still think that reasoning is sound.
However in the next version I will provide a way for that behavior to be nullified on a per connection basis.
from stomp.
Had one machine do it again today. It was a pretty low spec vmware based VM and it was having quite high CPU and it's network connection was having issues, logs I got was:
activemq.rb:131:in `on_ssl_connectfail' SSL session creation with stomp+ssl://x@x:6165 failed: execution expired activemq.rb:116:in `on_miscerr' Unexpected error on connection stomp+ssl://x@x:6165: transmit to x failed: SSL_write:: bad write retry activemq.rb:116:in `on_miscerr' Unexpected error on connection stomp+ssl://x@x:6165: transmit to x failed: SSL_write:: bad write retry
And after this it was game over, so I do think this is multi threaded related. Either option to nullify it or a support way to attempt reconnect would be fine, in my case I dont care if these messages get sent when the network is lossy or the machine is struggling I mostly care that the connection recovers on its own when things return to normal.
from stomp.
It is unclear to me that nullifying the checks for closed? is actually going to help in your case. Nevertheless expect a trial gem shortly.
from stomp.
The first log message that contains 'execution expired' has me wondering.... I am now considering the possibility that this line of code:
ssl.connect
should perhaps be something like:
Timeout::timeout(@connect_timeout, Stomp::Error::SocketOpenTimeout) do
ssl.connect
end
Please confirm/deny: you are running with
:connect_timeout => 0
Please add to your logging code for next time and show me:
parms[:ssl_exception]
from stomp.
I am running with connect_timeout 0 yes, will add that logging
from stomp.
Oh, I do log the ssl_exception it's in the logs above:
Log.error("SSL session creation with #{stomp_url(params)} failed: #{params[:ssl_exception]}")
from stomp.
lol. Sorry, I really meant show me this:
parms[:ssl_execption].backtrace.join("\n")
from stomp.
OK :)
from stomp.
You have described .... a total network meltdown as far as I can see.
Presumably you want retries to occur ..... well, forever essentially.
Consider: Under the following conditions:
- Stomp 1.1+ connection
- Client is sending heartbeats
nothing is likely to work very well. If a heartbeat send is actually required, the current heartbeat send code (a separate thread) does:
- Obtain the transmit_semaphore lock (misnamed)
- Writes a "\n" to the connection
- If an exception is detected, optionally log it, and re-raise
Kaboom. You are done.
I need to think about that case I guess..... but you also should.
from stomp.
So I am still not running 1.1 here - activemq bug - but will get there. It's possible that the keep alive thread will try to reconnect when running in that mode, I did not check.
But yes, its a total network meltdown and later recovery. For me and others I know who use it the stomp gem is amazing because you can just tell it to connect and forget about it, it always retries, never gets into a place it cant retry and just basically allow you to build software you can forget about. This behavior changes that.
Ofcourse when the network is out there's nothing to be done, message fail to send etc, totally kewl - but it recovers on its own without programmer/admin/user interaction after the network event and that was a major bonus of this gem - apart from that it speaks STOMP :P
from stomp.
AMQ 5.7.0 seems flaky to me. I have tests that have been running for years that fail when using it. Tests that have nothing to do with this discussion.
from stomp.
Adding a documentation note at this time. I have been unable to recreate this using test strategies described below.
Strategy 1: Broker begins in a stable network environment. Multiple long running multi-threaded publishers and consumers are started. A script that loops endlessly is started that does:
sleep n seconds
broker stop
sleep m seconds
broker start
Strategy 2: Broker begins in a stable network environment. Multiple long running multi-threaded publishers and consumers are started. A script that loops endlessly is started that does:
sleep x seconds
sudo iptables ..... # start dropping packets
sleep y seconds
sudo iptables .... # do not drop packets
Strategy 3: A combination / merge of Strategy 1 and 2.
Using Strategy 3 with poorly (or well?) chosen sleep values, it is possible to encounter situations where a client will take a significant amount of time to actually reconnect. However, in all cases a successful reconnect does eventually happen.
Am certainly open to new suggestions for recreating this.
Note: during this effort I did uncover the bug that reconnect_delay was not reinitialized upon a successful reconnect. That is fixed, but only in the latest gem version. See: 9099371
from stomp.
I am still alive, been traveling a lot so not had time to devote to this.
However I now have 2 hosts who does this daily so at least I have somewhere that I can play and try various things. Upgraded to your latest release, set the closed_check option and will monitor.
I should probably chat with you about my general approach re connection/sending/sharing connection between threads at some point as I've found a few more weird things and I think it's because I am just doing it wrong or based on wrong assumptions
from stomp.
I will continue to monitor this traffic of course.
As noted in my most previous post I have been totally unable to recreate ..... what you are apparently seeing.
We can certainly discuss your approach. From the somewhat minimal description you have provided ..... at first blush I see nothing inherently wrong with it.
From a multi-thread perspective it seems there are several basic approaches:
- Single connection, shared across threads
- Multiple connections, one per thread
- A hybrid of 1) and 2)
I think your approach is using option 1).
from stomp.
Does this app use failover, like
hash = {
:hosts => [
# First connect is to remotehost1
{:login => "login1", :passcode => "passcode1", :host => "remotehost1", :port => 61612, :ssl => true},
# First failover connect is to remotehost2
{:login => "login2", :passcode => "passcode2", :host => "remotehost2", :port => 61613, :ssl => false},
], .....
????
from stomp.
not in mine, but earlier this week at a client I saw failover also not working in 1.2.x where it worked with the same code in 1.1.x.
Once I am back from my travels I'll attempt to reproduce that with a simple bit of code I think that should be easier than this one from what I saw
from stomp.
One more question - since I am on a fishing expedition - are these connections straight TCP or SSL or either?
from stomp.
yes, ssl - the failover issue I had was also with SSL fwiw but I have not started on reproducing that one
from stomp.
I want to try and set up an environment here that is as close to yours as possible. So:
a) What version of AMQ ??
b) Is AMQ running with ?needClientAuth=true ??
c) Are there any .... unusual ... AMQ configuration parameters specified ??
d) Ruby version - one more time please :-)
from stomp.
Another Q. What is your transport scheme? I am thinking:
stomp+nio+ssl:// ........
but want to confirm.
from stomp.
XML file @ http://p.devco.net/268/ I just removed the auth and acl bits - these connections are to the verified_stompssl port
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]
from stomp.
AMQ Version ????
Might be relevant ......
from stomp.
sorry - it your much hated 5.7, though i had the same in 5.6 iirc.
My 2 machines that were doing this daily for like a week has now stop doing it all on their own, so frustrating not to be able to gather info or means of replication from them
from stomp.
So ..... I am thinking I should close this. It can be re-opened at any time of course.
Do you have any updates?
from stomp.
no, i had a week where it happened like every few hours in november, then it just stopped and never came back - infuriating.
close it, I'll open again if i ever get proper data
from stomp.
I will close it for now.
Ya' know .... my day to day is in a shop with a very large .NET and Java infrastructure. We use all sorts of tech to do failure detection, fail-over, and as a side effect load balancing.
I think you have some kind of infrastructure problem. (Many possibilities: flaky NICs. flaky switches, unexpected network changes (routing table updates maybe?), ???? the possibilities are almost endless.
Just my opinion at present.
Regards, Guy
from stomp.
totally - my machines are at lots of small ISPs and communicate over the internet - it's guaranteed to suck and have issues. Stomp gem pretty much always retries solidly which is why i like it :)
from stomp.
Related Issues (20)
- Unexpected ACK received for message-id HOT 7
- multihost failover not really working HOT 1
- netio.rb:47:in `block (2 levels) in _receive': Connected, header read is nil, is this really a Stomp Server? (Stomp::Error::StompServerError) HOT 4
- Nasty-looking exception under jRuby HOT 3
- 1.4.5 gives me a ServerFrameNameError HOT 6
- OpenSSL Deprecations HOT 2
- New Release request HOT 3
- Code for ENV keys not in release HOT 3
- Support Stomp+SSL Url Strings in non-failover case. HOT 4
- Header decode shortcoming HOT 1
- stdout polluted with stack trace HOT 1
- Amazon ActiveMQ drops connection for long running listeners HOT 3
- rspec3 - doubles or partial doubles from rspec-mocks outside .... HOT 6
- Warning: constant ::Fixnum is deprecated HOT 1
- Provide tagged 1.4.9 release HOT 1
- Is there any method to check if subscription is present? HOT 1
- Subscribers are not receiving all messages that we are publishing HOT 9
- Man pages Debian.
- URL options not supported when passing URL to client
- What range of ruby versions are supported?
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from stomp.