On the Merits of Combining Benchmarking, Loadtesting, and Integration Testing

A developer’s musings on DRYing up the required bits of keeping a pulse on the pre-prod health of a live services game.

Tim Mendez
7 min read · Oct 31, 2023

Background

I’m the Lead Principal Director of Executive Platform Software Development Engineering and Architecture at Wonderstorm where we’re creating a live services game set in the world of The Dragon Prince TV show.

We’re building out our platform using Pragma, a platform-as-a-service (highly recommend).

Origins of Motivations

Benchmarking

TL;DR Needed to know how long things took and when they changed.

The platform for our internal playtests was running on a very inexpensive EC2 instance not much more powerful than a Raspberry Pi with a monolithic architecture.

As the game grew over time, with more uncached content delivered to players at login, more CPU cycles spent calculating which items to grant on first and subsequent logins, and more item updates to apply at login, we were gradually inching closer to the allocated hardware's limit.

More importantly, we didn’t have any saved measure of performance when high-impact (in a CPU or memory sense) features were implemented, updated, or removed. Several times a seemingly innocuous PR would be merged in, and:

  • our login payload would grow twice as large
  • certain types of instanced item updates would spin 20x more than they needed to
  • the list of what requirements players met based on current progression would be recalculated more frequently than required

Elmo, bringer of chaos

These things often took a while to notice, and by then it was more difficult to find the offending commit.

During my time at Amazon, we would manually gather changes to things like bundle size and include them in the CR, which was helpful, but not as nice as it could have been if automated. The manual approach wasn't terribly realistic here, with our prolific design team committing content directly to trunk.

Every once in a while I would be using Postman to debug a feature on a hosted instance, running through a login scenario, and I'd make a mental note of how large our login payload was growing and how long it took to turn around. At one point I created an epic for benchmarking so I could automate that tracking. It was slotted into our backlog and inevitably pushed down further as more important (at the time) feature work was created.

Integration Testing

TL;DR Needed to move fast safely.

Aight so I don’t like writing integration tests.

David Rose saying Ew

I love unit tests and believe very strongly in them, but we also fully smoketest our game when developing features, so it’s hard for us to justify the time it takes to write integration tests. As we get closer to launch, the importance of regression-catching integration tests looms a bit more. Once live, we can’t have a game-breaking bug on the platform side — such a thing would be catastrophic.

However, as we get closer to ship, the urgency of other feature work also increases. At the same time, we need to stop nuking the databases whenever we make huge changes to items in players' inventories, hero experience level curves, etc., which means writing meticulous migration scripts instead. Throw in an extra ~25% of time spent on robust integration tests, and suddenly turnaround time for features has increased quite a bit.

We run into the paradox of needing to move faster at the very time we deliver features the slowest.

All that being said, we need to write loadtesting scenarios anyway…

Loadtesting

TL;DR Needed to loadtest because of course.

Not much to add here: if you don't loadtest a live services game, which is bound to get a huge spike of traffic right at launch, you will fall over, because you never discovered the bottlenecks that only show up under load.

Enter: Benchamin

The benchmarking epic we had slotted into a sprint had gotten kicked so far down the road you couldn't see it anymore without scrolling. Another internal playtest happens and a lot of players have trouble logging in. We've finally pushed our t3a.micro to the limit, with just ~80 players hitting the login endpoint at the same time. We do some digging and find the box's resources spiking when everyone tried to log in at once, so we bump the instance to a c5.4xlarge. Playtests run smoothly again, but this is a good chance to prioritize benchmarking so we don't run into any more surprises.

Logo for Benchamin

I write a quick and dirty proof of concept of a benchmarking script to start collecting login data and name it Benchamin. The first iteration is:

  1. Operator authenticates, which allows admin-level controls
  2. Operator creates a player account
  3. Player authenticates
  4. Player hits the login endpoint
  5. CLI output of how long it took and the bundle size
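
For the curious, that first pass looked roughly like the Python sketch below. The endpoint paths, auth payloads, and credentials here are placeholders rather than Pragma's actual API; the shape of the script is the point.

```python
import time
import requests

BASE_URL = "https://daily.example.com"  # placeholder shard URL, not our real one


def benchmark_login():
    # 1. Operator authenticates (admin-level controls); endpoint path is illustrative
    operator = requests.post(f"{BASE_URL}/v1/operator/auth",
                             json={"key": "OPERATOR_KEY"}).json()
    op_headers = {"Authorization": f"Bearer {operator['token']}"}

    # 2. Operator creates a player account
    player = requests.post(f"{BASE_URL}/v1/operator/create-player",
                           headers=op_headers,
                           json={"displayName": "benchamin-test"}).json()

    # 3. Player authenticates
    session = requests.post(f"{BASE_URL}/v1/player/auth",
                            json={"playerId": player["id"]}).json()
    player_headers = {"Authorization": f"Bearer {session['token']}"}

    # 4. Player hits the login endpoint, timed
    start = time.perf_counter()
    response = requests.post(f"{BASE_URL}/v1/player/login", headers=player_headers)
    elapsed_ms = (time.perf_counter() - start) * 1000

    # 5. CLI output: how long it took and how big the bundle was
    print(f"login took {elapsed_ms:.1f} ms, payload was {len(response.content)} bytes")


if __name__ == "__main__":
    benchmark_login()
```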

Things look promising, and I run it over a few days to keep an eye on the numbers.

We find and fix the bug that was causing slow logins, and get to talking about how we haven’t spent any calories on loadtesting yet. Pragma has done some initial loadtesting, but doesn’t yet have a generic solution that we can apply to our game. I figure we can expand Benchamin to encompass miniature loadtests, so I rewrite it a bit to be:

  1. Benchamin connects to a shard (environment) specified via CLI
  2. Operator authenticates
  3. Operator creates N player accounts
  4. Players authenticate
  5. Players hit the login endpoint sequentially, and then simultaneously
  6. CLI output of min, max, median, and average login times for each method, with the percentage increase or decrease compared to the previous run
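
The stats half of that rewrite is a small amount of code. Below is a minimal sketch, assuming a login(player) helper that returns elapsed milliseconds (like the one above) and a local JSON file holding the previous run's numbers for the percentage comparison; the real tool targets whichever shard was passed on the CLI.

```python
import json
import statistics
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

HISTORY = Path("benchamin_history.json")  # previous run's numbers, for the % deltas


def summarize(label, times_ms, previous):
    stats = {
        "min": min(times_ms),
        "max": max(times_ms),
        "median": statistics.median(times_ms),
        "average": statistics.mean(times_ms),
    }
    for name, value in stats.items():
        delta = ""
        if previous and name in previous:
            pct = (value - previous[name]) / previous[name] * 100
            delta = f" ({pct:+.1f}% vs previous run)"
        print(f"{label} {name}: {value:.1f} ms{delta}")
    return stats


def run_loadtest(players, login):
    # Sequential pass: one login at a time
    sequential_times = [login(p) for p in players]

    # Simultaneous pass: everyone hits the login endpoint at once
    with ThreadPoolExecutor(max_workers=len(players)) as pool:
        simultaneous_times = list(pool.map(login, players))

    previous = json.loads(HISTORY.read_text()) if HISTORY.exists() else {}
    results = {
        "sequential": summarize("sequential", sequential_times, previous.get("sequential")),
        "simultaneous": summarize("simultaneous", simultaneous_times, previous.get("simultaneous")),
    }
    HISTORY.write_text(json.dumps(results, indent=2))
```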

We run it against another t3a.micro instance with an old commit that had the increased login time bug and are able to reproduce the initial problem with failed logins. Hooray! 🎉

Task failed successfully alert box

I take a quick detour to plop it up in the cloud with Slack integration. We have another internal tool that posts the status of every shard to a #shard-status channel in Slack when a deployment is successful, so we have Benchamin read posts to #shard-status and, if the post is for our Daily shard, run a new benchmark/mini-loadtest against it. The results are then posted to Slack so we have an easily accessible history of our benchmarks.
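
The Slack glue is simpler than it sounds. Here's a rough sketch assuming slack_bolt/slack_sdk; the channel names, the text match on the deploy post, and the run_benchmark/format_results helpers are our own placeholders, not anything Slack or Pragma provides.

```python
import os

from slack_bolt import App

# Hypothetical imports standing in for the CLI tool's internals, reused here
from benchamin import run_benchmark, format_results

app = App(token=os.environ["SLACK_BOT_TOKEN"],
          signing_secret=os.environ["SLACK_SIGNING_SECRET"])


@app.event("message")
def on_shard_status(event, client):
    # Only react to successful-deploy posts for the Daily shard
    # (the wording matched here is whatever our deploy bot happens to post)
    text = event.get("text", "")
    if "daily" not in text.lower() or "deploy" not in text.lower():
        return

    results = run_benchmark(shard="daily")

    # Post the benchmark/mini-loadtest summary back to Slack for history
    client.chat_postMessage(channel="#benchamin-results", text=format_results(results))


if __name__ == "__main__":
    app.start(port=3000)
```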

If we see an anomalous rise in the time or size of a request, we can check the previous day's commits for anything that looks fishy and spot and fix problems before they go too far. Developers can still run the CLI tool locally to make sure they don't blow things up before submitting a PR.

It’s great to have a lil loadtesting tool that slams login with a recorded benchmark history, but we have work scheduled to break our services out into independent, horizontally and vertically scalable instances. That would leave a bunch of services not covered by any sort of loadtesting. So, the next step is to expand Benchamin along the player’s journey so it hits the other services.

  1. Operator authenticates
  2. Operator creates N player accounts
  3. Players authenticate
  4. Players hit the login endpoint sequentially, and then simultaneously
  5. Players pick up an item
  6. Players equip the item
  7. Players pick up a quest
  8. Players get into matchmaking with unique heroes
  9. Players are matchmade and gameservers are spun up…

…hold up.

Did I just trick myself into writing integration tests?

In order for Benchamin to successfully execute these scenarios, it has to expect the player to have the item in their inventory in order to equip it, players’ heroes must exist and be at certain levels, matchmaking must matchmake heroes into parties the way we expect, etc.
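
To make that concrete, here's a hypothetical scenario step in the same sketch style as before. Every api.* helper, item ID, and field name is a placeholder for whatever client wraps our endpoints, but it shows how each loadtest step doubles as a happy-path assertion.

```python
def scenario_equip_and_quest(api, player):
    # Each step is timed for the benchmark/loadtest side and asserted for the
    # integration-test side; if any expectation breaks, the whole run fails loudly.
    item = api.grant_item(player, item_id="starter-sword")
    inventory = api.get_inventory(player)
    assert item["id"] in {entry["id"] for entry in inventory}, "item never landed in inventory"

    api.equip_item(player, item["id"])
    loadout = api.get_loadout(player)
    assert loadout["weapon"] == item["id"], "equip did not stick"

    quest = api.accept_quest(player, quest_id="tutorial-01")
    assert quest["status"] == "IN_PROGRESS", "quest pickup failed"
```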

With every scenario we write, we get a 3x ROI — benchmarks for how long the thing takes, loadtests to see how many of the thing can happen at once, and integration tests to ensure you can do the thing.

It’s not a true replacement for integration tests, but happy path is covered!

What’s Next?

We would like to open source Benchamin eventually, but right now it’s in a state where it’s too heavily tailored to our game.

It’s not very realistic loadtesting yet, since it only uses two modes, sequential and simultaneous, and only focuses on one endpoint at a time. Next up is a realistic simulation mode, where players are logging in while other players are equipping things and others are entering matchmaking, performing random requests at varied intervals. Then we can have both a worst-case report and a typical-case report.
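
A minimal sketch of what that simulation mode might look like, with made-up action weights and the same placeholder api client as above:

```python
import random
import time

# Candidate actions and rough weights for a "typical" traffic mix;
# both the weights and the action names are placeholders we'd tune from real data
ACTIONS = [
    (0.50, "login"),
    (0.20, "equip_item"),
    (0.20, "pickup_quest"),
    (0.10, "enter_matchmaking"),
]


def simulate_player(api, player, duration_s=300):
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        # Pick a random action, weighted toward what players do most often
        action = random.choices([name for _, name in ACTIONS],
                                weights=[weight for weight, _ in ACTIONS])[0]
        getattr(api, action)(player)          # e.g. api.login(player)
        time.sleep(random.uniform(0.5, 5.0))  # varied think time between requests
```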

Tracking over time in Slack is nice, but there aren’t any pretty charts yet. Throwing something like that in should be fairly simple and would let us read the most important benchmarks at a glance: median synchronous login time and bundle size over time, for both first and subsequent logins.
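
Assuming each run appends a line of JSON to a history file, a first pass at those charts could be as small as the sketch below (the field names are whatever we decide to record, not an existing format).

```python
import json
import matplotlib.pyplot as plt

# e.g. {"date": "2023-10-31", "median_login_ms": 210, "bundle_bytes": 48213}
with open("benchamin_runs.jsonl") as f:
    runs = [json.loads(line) for line in f]

dates = [r["date"] for r in runs]
fig, (ax_time, ax_size) = plt.subplots(2, 1, sharex=True)
ax_time.plot(dates, [r["median_login_ms"] for r in runs])
ax_time.set_ylabel("median login (ms)")
ax_size.plot(dates, [r["bundle_bytes"] for r in runs])
ax_size.set_ylabel("bundle size (bytes)")
fig.autofmt_xdate()
fig.savefig("benchamin_trends.png")
```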

Thanks for reading ✌️
