[jetty-users] WebSocket Write Timeout Exceptions under load
I am using Jetty as a stand-alone app server in a production system. The servers in question primarily handle WebSocket traffic. We chose Jetty because, in my analysis, it seems to be the strongest high-performance non-blocking I/O implementation in the Java world. But... I have run across a strange error/behavior I can't explain, and I was hoping for any insight people might have.
Here is my best effort to describe the behavior without information overload.
When we have a solid number of WebSockets open on a server (say 5-10K) and we send a message to all of them every second (so a total of 5-10K messages going out each second), there seems to be a threshold where we start getting the following error from connection.sendMessage(stringMsg), where connection is an org.eclipse.jetty.websocket.WebSocket.Connection:
java.io.IOException: Write timeout
If you are familiar with the Write timeout, you know it happens when the buffer in WebSocketGenerator is full and Jetty blocks while it writes to the channel. If it blocks longer than MaxIdleTime, it throws the Write timeout error. This is the most annoying error, because the thread we have sending the message is tied up for the whole length of MaxIdleTime. We also see the occasional ClosedChannelException, but that mostly happens within a few milliseconds, so it's less annoying.
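To illustrate what I mean about the sending thread being tied up: below is a minimal sketch (not our production code) of fanning the sends out to a bounded pool, so that a write blocked for MaxIdleTime only stalls its own worker instead of the broadcasting thread. The Conn interface is just a stand-in for the two methods of Jetty 7's WebSocket.Connection used here, and the Broadcaster name is made up for the example.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Broadcaster {
    // Stand-in for the relevant slice of org.eclipse.jetty.websocket.WebSocket.Connection
    interface Conn {
        void sendMessage(String msg) throws IOException;
        void disconnect();
    }

    private final ExecutorService pool = Executors.newFixedThreadPool(16);

    /** Send msg to every connection; a socket whose write fails is dropped. */
    public void broadcast(List<Conn> conns, final String msg) {
        for (final Conn c : conns) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        c.sendMessage(msg);   // may block until MaxIdleTime expires
                    } catch (IOException e) { // "Write timeout" lands here
                        c.disconnect();       // hard-close; don't trust isOpen() afterwards
                    }
                }
            });
        }
    }

    public void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With this shape, one stuck socket costs one pool thread for MaxIdleTime rather than stalling the whole fan-out loop.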
Even more annoying: in both cases, we find that connection.isOpen() will usually still return true after these errors occur.
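Because isOpen() can't be trusted after a failure, we have been leaning toward treating any IOException on send as fatal rather than retrying. A sketch of that policy (Registry is a hypothetical stand-in for our registry objects, and Conn mirrors part of Jetty 7's WebSocket.Connection):

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Registry {
    // Stand-in for the relevant slice of org.eclipse.jetty.websocket.WebSocket.Connection
    interface Conn {
        void sendMessage(String msg) throws IOException;
        boolean isOpen();
        void disconnect();
    }

    private final Map<String, Conn> conns = new ConcurrentHashMap<String, Conn>();

    public void add(String id, Conn c) { conns.put(id, c); }
    public int size() { return conns.size(); }

    /**
     * Returns true if the message went out. Any IOException (including
     * "Write timeout") is treated as fatal: we do NOT retry based on
     * isOpen(), because it often still reports true after a timed-out write.
     */
    public boolean send(String id, String msg) {
        Conn c = conns.get(id);
        if (c == null) return false;
        try {
            c.sendMessage(msg);
            return true;
        } catch (IOException e) {
            conns.remove(id); // evict first so nothing retries on this socket
            c.disconnect();   // hard-close; a graceful close() might block again
            return false;
        }
    }
}
```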
Anybody have any ideas after the above explanation?
Here's some more info:
- We are using Jetty 7.6.4.
- MaxIdleTime in our jetty.xml is 300 seconds, but we set MaxIdleTime to 60 seconds for each connection in onOpen, so the Write timeout occurs after we have waited 60 seconds.
- When sending around 5,000 msg/second, there are 0 errors; all connection.sendMessage() calls return in under 5 seconds, and 99.99+% of them take 0-3 ms.
- When sending around 10,000 msg/second, there is a small number of errors, say 1 per minute. 99.99+% of the messages still take 0 to a few ms, but there are 3-5 messages each minute that take 45 seconds or more, many of them above 60 seconds, and those throw the exception. Note: if we raise MaxIdleTime to 300 seconds, the number of exceptions barely decreases, maybe by a few.
- After these exceptions, isOpen() still often returns true... so if we retry sending the message, the second try also times out. The WebSocket has simply stopped. So, as you can see, this is not any kind of linear build-up due to overload. At 5,000/second, all WebSockets can run for hours and none of them fail (note we run these tests with servers and WebSocket test-client machines in Amazon AWS, so the network is solid)... yet at 10,000/second we have failed WebSockets every minute, while 99.99% of all messages, and 99% of all WebSockets, still receive each message in just a few milliseconds.
- When we add more WebSockets or increase the message rate, we also add more test-client machines, so we don't believe the problem is on the client side.
- CPU and memory on the server are healthy during the tests. CPU runs 30-50% during both the 5,000 and 10,000 msg/second tests, and memory usage is low. Most memory goes to our registry objects that hold the handles to the WebSockets, and 10K of them is nothing in memory terms. We are running on 4-core boxes with 7GB of RAM.
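For completeness, here is roughly how the per-connection MaxIdleTime override mentioned above is applied in onOpen. This is only a sketch: the Conn stub stands in for Jetty 7's WebSocket.Connection.setMaxIdleTime(int), and ChatSocket is a made-up handler name.

```java
public class ChatSocket {
    // Stand-in for org.eclipse.jetty.websocket.WebSocket.Connection.setMaxIdleTime(int ms)
    interface Conn {
        void setMaxIdleTime(int ms);
    }

    // Per-connection 60s, overriding the 300s default from jetty.xml
    static final int IDLE_MS = 60 * 1000;

    // Corresponds to WebSocket.onOpen(Connection)
    public void onOpen(Conn connection) {
        connection.setMaxIdleTime(IDLE_MS);
    }
}
```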
So, from these details, does anybody have a good explanation for this kind of behavior? Is it expected? Any ideas on what we could tune to try to get rid of this problem, or at least reduce the number of "hung" WebSockets?
Thanks in advance for any ideas.