I recently stumbled upon an issue with AWS ELB in which slow server code, combined with the ELB fronting it, can cause automatic retries to occur every minute. This has the potential to let a limited spike in usage escalate into a much longer period of sluggishness on the site.
Why would the retry occur? ELB has a default idle timeout of 60 seconds. When that timer expires, it hastily closes the socket connections on both of its ends without first giving a proper HTTP 50x error response to its client, and the client's TCP stack then decides that it needs to try again.
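To make the failure mode concrete, here is a minimal sketch, assuming the "slow server code" is simply an HTTP handler that takes longer than the ELB's 60-second idle timeout; the handler, port, and 75-second sleep are all made up for illustration.

```python
import time
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

class SlowHandler(BaseHTTPRequestHandler):
    """A stand-in for slow server code: the response takes longer than the
    ELB's 60-second idle timeout, so the ELB gives up on the connection
    before any bytes of the response have been written."""

    def do_GET(self):
        time.sleep(75)                       # pretend to be expensive work
        body = b"finally done\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), SlowHandler).serve_forever()
```

Put something like this behind an ELB and hit it with curl: with the old behaviour the connection is simply cut at the 60-second mark and, per the issue described above, a second copy of the request may then arrive at the backend while the first is still grinding away.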
What made it interesting is what happens when the request originates from inside a corporate network: a joint debugging session with the Amazon folks, involving packet captures on both the ELB and the originating client, showed that the retry arrived at the ELB as if it came from its client, even though the originating client never sent it. When the same test was run over the public internet, on the other hand, the retry showed up clearly in the originating client's packet capture at the TCP layer. So the sensible conclusion we reached with the Amazon folks is that, in the first case, some corporate network element (a NAT router?) issued the retry.
Originally Amazon support said it would take a month or two for the issue to be fixed on the ELB side. Surprisingly, they quickly came back with an invitation to jointly test their fix in our environment. Sure enough, with the new ELB returning a 504 (gateway timeout) to the client when the idle timeout was hit, the retries stopped.
Until the fix is rolled out publicly, if the retries are worth mitigating in your system, one option is to work with Amazon support to lengthen the ELB idle timeout. Depending on who gets your support ticket, they will likely also suggest increasing the idle timeout on the backend to at least 1 second over the ELB idle timeout, since ELB prefers to be the one dropping the connection rather than the one being dropped: when it closes first, it can shut the TCP connection down more gracefully. TCP keepalive behavior on the backend is another thing they will likely suggest you tweak.
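For illustration only, here is a rough sketch of what that advice can look like on the backend, assuming the ELB idle timeout has been raised to 120 seconds (a made-up figure) and a plain Python HTTP server as the backend; a real deployment would apply the equivalent settings in its own web server or framework.

```python
import socket
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

ELB_IDLE_TIMEOUT = 120                       # assumed value agreed with AWS support
BACKEND_IDLE_TIMEOUT = ELB_IDLE_TIMEOUT + 1  # stay open at least 1 s longer than the ELB

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"            # keep connections alive between requests
    timeout = BACKEND_IDLE_TIMEOUT           # per-connection idle timeout, in seconds

    def setup(self):
        super().setup()
        # Enable TCP keepalive probes on each accepted connection so dead
        # peers are eventually noticed and the socket reclaimed.
        self.connection.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

The point of the extra second is just to decide who sends the FIN first: with the backend's idle timeout slightly longer than the ELB's, it is always the ELB that initiates the close.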
The real solution, obviously, is to fix the damn code, whether that involves smarter algorithms, better caching, or doing things asynchronously. And of course, it had to take me six paragraphs to get to that :-)