How We Learned to Stop Worrying and Scale Production

Adam Ortyl
Published in Thanx
Jul 26, 2019 · 7 min read

As a rapidly growing SaaS company, we sometimes encounter performance challenges as we scale. A few months back, we saw degraded performance and outages across our platform, which fell short of the expectations we set for ourselves as an engineering team. While we were able to get the issue under control quickly, it was clear that we had some work ahead of us to make sure it wouldn’t happen again. Here is what we learned from this incident.

What happened?

This is what we saw when we jumped in to investigate:

  • Heroku was full of purple bars — we were getting hit with waves of timeouts.
  • New Relic had a mountain of yellow — our MySQL database was having some issues.
  • CloudWatch indicated that our DB connections were pretty spiky.

Many services and workers shared this database, and it was clearly getting hammered when all of these sources were busy. When we investigated in depth, we found table locks and slower-than-normal queries happening across the board. Working backward, it was clear to us that this database was the source of our heartache, and we concluded that our usage had outpaced its capabilities.

So we knew what happened. Two paths were clear to us: we could either lessen the load on the database or we could throw some money at it. Either way, the issues should go away. But why did it happen? Understanding why the issue occurred would help shed light on a longer-term solution. At a high level, we’ve been rapidly expanding our platform horizontally. Yesteryear’s Rails monolith has been torn apart in favor of microservices. We pulled non-critical logic out of the critical path and built a fleet of workers. All of these services and all of these workers were set up to dynamically scale based on load.

In the end, we realized that we went about scaling wrong. We configured these services and workers as if they were scaling in isolation. Do we have a large email campaign that needs to go out? Sure, scale up to one hundred email workers! Do a lot of users need their rewards updated? Yup, one hundred and fifty reward workers should get that done in an appropriate amount of time. Could our platform handle those two hundred and fifty workers? Sure. But what happens if we are scaling up ten different workers at the same time? Twenty? You can see the problem. We scaled the workers as if they worked independently, but they didn’t. They shared resources, and when all workers were firing on all cylinders, there just weren’t enough of those shared resources to go around.
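To make the failure mode concrete, here is the kind of back-of-the-envelope check we could have run. It is only a sketch: the worker names, dyno counts, and connection limits below are hypothetical, not our actual configuration.

    # Hypothetical numbers: if every worker type hits its scaling ceiling at
    # once, does the shared database have enough connections to go around?
    DB_MAX_CONNECTIONS = 1_000

    WORKER_POOLS = {
      email_workers:  { max_dynos: 100, db_connections_per_dyno: 5 },
      reward_workers: { max_dynos: 150, db_connections_per_dyno: 5 },
      export_workers: { max_dynos: 80,  db_connections_per_dyno: 5 },
      # ...one entry per worker type that can scale independently
    }

    worst_case = WORKER_POOLS.values.sum do |pool|
      pool[:max_dynos] * pool[:db_connections_per_dyno]
    end

    puts "Worst-case connection demand: #{worst_case} of #{DB_MAX_CONNECTIONS}"
    warn "Combined scaling ceilings oversubscribe the database!" if worst_case > DB_MAX_CONNECTIONS

Run against realistic ceilings, a check like this makes it obvious when the sum of “independent” scaling decisions exceeds what the shared database can actually serve.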

We regret allowing our growing pains to affect our customers. The silver lining is that we learned some really impactful lessons that helped strengthen our engineering organization and the Thanx platform.

Tips, Lessons Learned, and Some All Too Obvious Things

Fix The Problem

Sometimes it is easy to lose sight of the short-term solutions when everyone’s pitching in on solving the big-picture items. We, unfortunately, fell into this trap. We keyed in early on a possible solution: lowering the number of DB connections/requests. But instead, we quickly got lost debating database connection limits, trying to understand the finer details of what could have caused the issues, and found ourselves traveling down side roads to bike sheds. We discussed detailed plans for database upgrades and architectural changes. But we never ended up fixing the problem — at least not until it happened again the following morning.

Invest in Infrastructure

We’re talking about time, not money. Spend some time understanding the limitations of your current architecture and what steps you would need to take to move your platform forward. Keep your infrastructure up-to-date.

In our particular case, not staying on top of database upgrades contributed significantly to our issue. Our RDS instances were still provisioned with magnetic drives. AWS’s tagline for magnetic drives is that its platform “supports magnetic storage for backward compatibility”. Solid State Drives (SSDs), on the other hand, are more cost-effective, offer lower latency, are burstable, and deliver better overall performance. There is no reason we shouldn’t have already been on SSDs, but it took an outage to finally convince us to spend the time and effort to upgrade.
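For what it’s worth, changing the storage type on RDS is a small API call; the hard part is planning for it. The snippet below is only a sketch using the aws-sdk-rds Ruby gem, with a made-up instance identifier rather than a record of exactly what we ran.

    require "aws-sdk-rds"

    rds = Aws::RDS::Client.new(region: "us-east-1")

    # Switch from magnetic ("standard") storage to General Purpose SSD ("gp2").
    # The instance identifier here is hypothetical.
    rds.modify_db_instance(
      db_instance_identifier: "production-primary",
      storage_type: "gp2",
      apply_immediately: false  # let the change run in the next maintenance window
    )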

Since the upgrade, we’ve seen significant performance improvements — latency improved by 50%! Had we invested in this upgrade earlier, there might not have been a reason to write this blog post.

The DB upgrade was really obvious

Prioritize Scale

This is what we wish we had done (and what we subsequently implemented) to better prepare for the future.

Understand your pipeline.

As a company, we are moving upmarket and, as a result, are targeting and onboarding new customers larger than any currently on our platform. In the six months leading up to the incident:

  • The number of requests our services processed daily had increased by nearly 2x.
  • The number of emails we were sending a month had grown by almost 5x.

We should have created a projection for this growth and used that information to stress test. Doing so would have helped us understand how our production environment would react under these conditions and helped us mitigate some of these issues. We would have learned that some of the auto-scaling mechanisms we put in place to help with scale weren’t appropriately bounded, and that scaling many of them at once would cause problems.
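One way to bound things, sketched below with made-up names and numbers, is to scale each queue against a shared budget rather than in isolation, so the combined scaling decisions can never oversubscribe the database.

    GLOBAL_WORKER_BUDGET = 300  # the most concurrent workers the shared DB can tolerate

    # Clamp the desired worker counts so their total never exceeds the budget.
    def bounded_scale(desired_by_queue, budget = GLOBAL_WORKER_BUDGET)
      total = desired_by_queue.values.sum
      return desired_by_queue if total <= budget

      # Over budget: shrink every queue proportionally.
      desired_by_queue.transform_values { |n| (n * budget / total.to_f).floor }
    end

    bounded_scale(email: 100, rewards: 150, exports: 120)
    # => { email: 81, rewards: 121, exports: 97 }  (299 total, under the budget of 300)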

Isolate the critical path.

For us, and probably many of you, the critical path is customer-facing requests. That might be an end-user pulling their rewards into their mobile app, or a merchant-user previewing the number of customers their campaign will target. These requests should not be slowed down by increased activity on non-critical paths. A preventative measure we recently started focusing on was moving asynchronous processing (data exports, data health checks, etc.) away from our master database instance and onto a read replica — a copy of our master database that doesn’t accept writes.
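As a rough illustration of the read-replica piece, here is what routing a non-critical job to the replica can look like in a Rails app using Rails 6’s multiple-database support with a reading role configured; the job class and query below are hypothetical, not our actual code.

    # Hypothetical export job; assumes a replica configured under the
    # :reading role in database.yml (Rails 6+).
    class NightlyDataExport
      def run
        ActiveRecord::Base.connected_to(role: :reading) do
          # Heavy reads hit the replica, so the primary stays free to serve
          # customer-facing requests on the critical path.
          Purchase.where(created_at: 1.day.ago..Time.current).find_each do |purchase|
            export_row(purchase)
          end
        end
      end
    end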

Ownership.

Assign ownership. Empower ownership. Sign off on ownership. Do what you need to do to enable someone to move your platform forward. In our experience, most engineers know where the bottlenecks in their system are. The problem isn’t identifying them, or even putting together a solid plan to refactor; it’s that start-ups often focus on adding features at the cost of supporting platform needs.

To address this at Thanx we’ve recently established the Platform Team — a group of engineers who take ownership of the scalability, performance, and uptime of our platform.

Know your tools.

Tooling is pretty subjective, but it makes sense to have tools in place to alert us as issues arise, as well as ones that provide a window into performance diagnostics. Our tool stack included the following: Sentry, CloudWatch, PagerDuty, New Relic, and Loggly. Our original modus operandi was to set up alerts through PagerDuty when something went wrong and the on-call engineer needed to intervene. Once alerted, we would then use these tools to piece together what happened.

We do things a little differently now. While we still have alarms in place to handle fires, we better use our tools to actively monitor performance. In New Relic, we’ve identified key transactions critical to our operation and have added alerts for degraded performance. Within RDS we’ve set up Performance Insights, and with the improved visibility into performance, we have the ability to tune the database as needed. We’ve taken to identifying poor performance before it becomes an issue and, to that end, we’ve started using Scout. Scout has helped us identify and address slow endpoints. In particular, it easily uncovers n+1 queries and significant memory bloat.
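As an example of what Scout surfaces, here is the shape of a typical n+1 query and its eager-loaded fix; the models and scope are illustrative, not pulled from our codebase.

    # N+1: one query for the users, plus one query per user for their rewards.
    User.recent.each do |user|
      user.rewards.to_a  # triggers a separate query for every user
    end

    # Eager loading collapses this into two queries regardless of user count.
    User.recent.includes(:rewards).each do |user|
      user.rewards.size  # served from the preloaded association
    end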

All in all, we now feel like we have the right tools, and an ample understanding of our tools, to prevent similar issues from happening again.

Post-Mortems

We use post-mortem meetings to discuss an issue after it has been resolved and identify the underlying problems that brought it about. The goal is to establish measures that prevent issues from happening again. So if an incident occurs, conduct a post-mortem. The 5 Whys technique helps us direct our post-mortems and gives us a simple framework for uncovering the root causes of our problems, while keeping post-mortems from devolving into blame-fests. The result is more ‘how do we succeed in the future?’ and less ‘how did you let this happen?’.

Red indicates things to improve

That is the whiteboard from our post-mortem for this incident. In that meeting alone, we identified most of the shortcomings and lessons described above.

A Final Note

One of our core behaviors at Thanx is Focus On What Matters. If you take any of these lessons to heart, make sure you hold fast to this one. You need to invest in your platform when it matters. Invest too early and you’ll waste time and effort that could be better spent on the product; invest too late and you might find yourself writing your own blog post.
