The Phantom (Latency) Menace
Recollection of a production issue causing high latency in a Node application with little observability. Covers initial symptoms, potential avenues of investigation, and ultimately the solution.
A long time ago in a satellite office far, far away...
There exists a cloud native Node service where low latency is critical. It is the foundation of our team's services and the only Node project... everything else is not sensitive to latency and written in Java.
One day, while testing another service that consumes this service, things are slow. Real slow. Tens of seconds to complete a request levels of slow. I felt a disturbance in the force and unplanned work rapidly approaching.
Darth Sidious the Application
The application has a simple purpose: shuttle a small payload of data to and from DynamoDB based on the incoming request and log the results for later downstream consumption. As such, the typical flow through the system is to read from DynamoDB, write a log entry, and send back a response.
With no significant CI/CD pipeline, the release deltas are normally substantial. This issue appeared right around a release, further clouding situation.
To frame all this, the service was hastily built in about a month due to a third party vendor shutting down. No one on the team had any prior professional experience with Node. As a result, shortcuts were taken to reach the deadline. There are minimal unit tests, some integration tests, poor runtime observability, and little meaningful application logging. Of course, metrics are also minimal too - whatever Cloudwatch picks up for the load balancer and EC2 instances. Basically, once it is deployed on AWS, we are at the mercy of Jeff Bezos for it to keep working.
All together this put us in a difficult situation. A service disruption with a known issue is bad but a painfully slow service with no observability is arguably worse.
Fear is the path to the dark side.
Uncertain as to the source of the problem, we began to question anything that seemed like a reasonable avenue:
Something in the release? Should we roll back?
Maybe its infrastructure? The load balancer? Maybe its something in the EC2 instances?
Is it something in the configs? They haven't changed in months.
Requests are not being dropped, so the network is okay, right?
Could this be load related? How? Our load testing throws more than the live traffic.
Why does it only happen at certain times of the day?
Did an I/O limit get hit?
The logs tell us nothing useful, why is there so little logging in this application?
Much like the Dark side clouding the Jedi Council, we were unable to see the true cause. We blindly grasped at any plausible cause. Unfocused, we were.
CPU, memory, and network utilization all proved nominal. Latency spiked the worst during peak traffic and there was increased resource usage. However, it was not proportional to the increase in latency. Cloudwatch showed our DynamoDB read and write times were normal as well. It was an insidious issue.
Remember, your focus determines your reality.
For a week, with growing desperation and fear, we tried to get the latency situation under control. War rooms were convened. Food was ordered. Whiteboards were used. Rollbacks to a known good version occurred. Nothing worked. Eventually management started talking about hiring a Node consultant to fix the situation.
Near the end of the first week we started to question the network connections. The hypothesis at the time was that we simply had too many requests coming in per second. The peak volume of requests served is where latency spiked the hardest. Perhaps opening and closing connections that frequently would cause a delay for the next request to wait until a connection was available? Attempts at varying the number of instances running proved the latency spike triggered sooner with fewer instances. In hindsight, a solid clue but it was lost in the shuffle of all the other clues. This was shelved in favor of other areas to explore.
Week two brought out the application profiling and flame charts. The Clinic.js suite of tools proved useful, but no one had profiled this application before so there was no baseline of "normal". Additionally, we couldn't run the profiler against production. In the end, the profiling did not reveal much for the latency situation as local developer instances could not generate enough traffic to cause latency. Similarly, our load testing tools were not fruitful.
Eventually we began a deep dive on Node and how connections are formed. And then we found it. Node has http.Agent, in it you can specify a variety of options about connection persistence. With time running out and our backs against the wall we began fiddling with the various Agent options. Setting
keepAlive to true brought our latency levels to historic lows. We were all stunned. How could a single configuration change not only resolve the issue but improve application performance to such a degree? Latency was now far lower than it ever was before. Response times dropped to single digits.
keepAlive should have a negligible impact on incoming requests from customers, right?
Mind tricks don't work on me
Throughout all this we did not consider the latency from the application to DynamoDB. The Cloudwatch metrics for DynamoDB are for the read and write times, not the time from starting the code trying to access DynamoDB until it returns a result.
http.Agent documentation highlights a key point:
Whether [the socket connection] is destroyed or pooled depends on the
The application was processing well over a 100 requests per second and destroying at least twice as many connections to the database. We suspect, at a certain level, this became too inefficient and caused the application to wait. With
keepAlive enabled, the application no longer spends time or resources opening and closing socket connections.
As for how we could missed this, we brought in a software engineer from another team for a week during the initial month when the application was created. He had prior Node experience and wrote the DynamoDB related code where the Agent configs were set. As no one else had prior Node experience, no one knew about
keepAlive. Sadly, the same engineer left the company prior to this latency issue.
Despite the struggles and simple solution, we gained a large number of insights about our application, highlighted areas to improve, and features to implement. Yet, after all that time spent looking for a solution it felt like the answer was a cheap trick - it wasn't a satisfying ending.
Photo by Patrick McManaman on Unsplash