Wintermod - The First 48 Hours (An Apology)
The past 48 hours have been fun and at times completely stressful, but overall I think it has been a great success. However, that being said.. I would like to apologize to our players and to 10gbps.io for some miscalculations on my part that resulted in quite slow downloads between 09:00 UTC and 16:30 UTC. To players: you don’t deserve slow download speeds for a special event such as wintermod. To 10gbps.io: it wasn’t fair for us to have boasted your services as being able to handle the mod downloads when we forecasted much lower loads. Your server fought well in our losing battle.
I want to clear up a few things and make sure that everyone knows:
1. The story of how this happened in the first place.
2. What measures we attempted to take during the slow speeds and what we learned from them.
3. How truly badass the server was that 10gbps.io provided us for distributing wintermod.
So, let’s look into things and get nerdy for a bit. Buckle up. It’s gonna be a ride.
To start on our journey, planning for wintermod begins in October / November for me. I must look around for capable services (servers) for the CDN (Content Delivery Network). In 2015 we used a fleet of ~20-30 servers @ 100mbps each. A cool 2-3 gbps (or so we thought) of bandwidth was at our fingertips. While this worked, we soon found out that some of the servers would be capped at 24mbps… and just randomly go offline. No good. We struggled through for the next few days but eventually(™) everyone was able to download the mod. One of the important lessons we learned from that year was.. SPLIT UP THE FILES! We wasted an ENORMOUS amount of bandwidth because players got frustrated with the ~600MB file size. Players would start the download, get to some number like 75%.. cancel and try again later. This meant we sent much more data out than we needed to.
In 2016 we had some CDN creation under our belt, knew we had to split up the wintermod, and had a brand new launcher that would help facilitate downloading and installing the mod for players. Win-win, right? No, not so fast. We did have more bandwidth that year: if memory serves, around 15 servers @ 800mbps, or a cool 12gbps. Nice! What went wrong? Checksums. We split the files up into smaller chunks, and it turns out when you’re seeding data to all of the 15 CDN nodes, some of those files can end up a tad bit corrupt. Great. We ended up needing to write some seeding scripts to ensure the CDN servers got all of the files players needed for the wintermod, and that the md5 checksum of what they downloaded matched the master CDN. After a few tense hours of wrestling with the servers, we had a nice little fleet. The release went “ok” but we made a mistake.. We launched during high load and kicked off all 5,000 players who were online. We saw an enormous peak of usage, maxing out the 12gbps we had due to .. well. 5,000 people trying to download all at once. Oops.
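For the curious, the kind of check those seeding scripts performed can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the actual scripts: the manifest format (file name mapped to the expected md5 hex digest from the master copy) and both function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of(path, block_size=1 << 20):
    """Return the md5 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_bad_files(directory, manifest):
    """Return manifest names that are missing or fail the md5 check.

    `manifest` is a hypothetical {file name: expected md5} mapping
    built from the master copy of the files.
    """
    bad = []
    for name, expected in manifest.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of(path) != expected:
            bad.append(name)
    return sorted(bad)
```

Anything `find_bad_files` returns for a node would need to be seeded again before that node goes into rotation.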
So, 2017 comes along. We got this, right? Well, sorta. The provider from 2016 has been experiencing issues with servers being out of stock. We had three options: order 20+ servers as they showed up in stock between October and now and pay an incredible rate for what would be idle servers, find another solution elsewhere (with very limited options), or not release wintermod at all. That last option was to be avoided at all costs.
While comparing services, I reached out to 10gbps.io letting them know what our mod is about and to see if they’d be able to help us at all with a possible sponsorship deal. To my surprise they agreed and provided us a 10gbps capable server. This was incredible. They quite literally saved -christmas- wintermod!
Once the wintermod was updated to support ETS2 1.30, our plan was to release during a low player population time so the late night truckers who were on wouldn’t be stuck with long download times and we could fix any issues that would come up without disrupting too many. After all, we DID learn from last year, right?! Right. Let’s figure we’re seeing a peak of 10,000 players; looking at stats.truckersmp.com we can see that on a given day, the online number of players goes up by ~1,000 per hour. One would think this would mean we should expect ~1,000 downloads per hour. Wintermod is 700MB or.. ~700GB per hour. After some napkin math that means with a 10gbps connection to the internet at worst, 1,000 players downloading the wintermod should equate to a download time of 10 minutes if everyone is downloading at their max speed and the server is sending at full 10gbps. Awesome!
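The napkin math above is easy to redo in Python. This is just a sketch of the arithmetic: the function name is made up, it uses decimal units (1 GB = 1000 MB, 1 byte = 8 bits), and it assumes the link runs flat out the whole time, which is the optimistic case.

```python
def transfer_minutes(downloads, file_mb, link_gbps):
    """Minutes needed to push `downloads` copies of a file through one link."""
    total_gigabits = downloads * file_mb * 8 / 1000
    return total_gigabits / link_gbps / 60


# 1,000 downloads of a 700 MB mod over a 10 Gbps link
print(round(transfer_minutes(1000, 700, 10), 1))  # 9.3
```

The catch is how little headroom that leaves. Anything much beyond roughly 6x that hourly demand takes more than an hour to push out, so a 10gbps link would fall behind and stay behind.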
This is where the fun begins.
01:42 UTC. Let’s release! Only 2900 truckers were online. The download server was tuned up with all engines running. MWL4 was sleepy and I was overly excited.
We started to watch the download server and ensured that players were getting the mod correctly.. And damn fast might I add! Between 1:42 and 2:08, if you downloaded the mod, you would not be allowed into a server due to version mismatch. This was planned.
02:08 UTC. Servers are restarted. All ~1700 truckers online are kicked off. As you can see on the graph, we BARELY touched 9gbps which slowly decreased and started our true downward slope at 2:40. Awesome!
05:33 UTC. Time for bed and a tweet. Things are looking good, traffic is steady and low, well within ranges.
05:34 UTC. This is where things get hilarious. Remember that low traffic I just tweeted about? Yeah. It turns out 05:34 UTC is the exact time when virtual truckers start to wake up. Let’s look at a graph of 05:30 -> 09:00.
Yikes… what’s next?
Oh yeah… truckers cap the poor server at its max 10gbps… FOR HOURS.
13:53 UTC. Time to wake up.
Always in a good mood early in the morning. :)
Yep. Solid Hours of 10gbps. Something isn’t right here..
Yeah, that number was supposed to be like.. under 10,000. As you can see from my frantic typing, it wasn’t. SO~! How do we fix this? ADD MORE SERVERS! Right?! Yeah. Let’s throw 10 servers at the problem, 800mbps each. Let’s fire up csshx and get to work.
14:52 UTC. Servers ready to go. Throw ’em in rotation.
A few minutes pass.. DNS propagates. We’re truckin’ now! ~17gbps!
Yeee buddy. Take that you silly bandwidth problem.
15:24 UTC. I’ve saved the day, time to go home. Wait… So.. did the CDN get everyone caught up? And now the load issue is resolved? That’s odd.. Why isn’t our 10gbps server being utilized anymore?
This is either really good, or really bad. Hey, that scrollbar on the side of your browser still goes down, this blogpost can’t end here, right? Right.
15:30 UTC. The Dawning. I just added 10 800mbps servers into a Round Robin cluster with a secondary static A record pointing to our 10gbps server. Why doesn’t.. Oh.. right.. We’re now sending our beefy 10gbps server only 1/10th of the traffic.. Heh.. oops. (Seriously, looking back now I feel absolutely foolish I didn’t foresee this, but there’s no better way to learn/realize new things than trial by fire!) Now, at this point, by deduction, we can “safely” assume that our traffic pull EASILY could have been in the range of 20-30gbps, since we just cut that server’s share to 1/10th and it is STILL pushing out ~3gbps. Yikes. Our maximum is now below what the 10gbps server was doing on its own.. Let’s undo this silly idea.
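Here is a toy model of why an evenly split pool of uneven servers backfires. It is a simplification under assumptions: it treats the pool as 11 equally weighted records (the ten 800mbps nodes plus the 10gbps box, where the text above rounds the share to 1/10th), and it ignores DNS caching and everything else that makes real round robin messier. The function names are made up.

```python
def pool_ceiling_gbps(capacities_gbps):
    """Most total traffic an evenly split pool carries before one server saturates.

    With plain round robin every server gets the same 1/n share, so the
    smallest server fills up first and sets the ceiling for the whole pool.
    """
    return min(capacities_gbps) * len(capacities_gbps)


def even_share_gbps(total_gbps, server_count):
    """Traffic each server is asked to carry under an even split."""
    return total_gbps / server_count


pool = [10.0] + [0.8] * 10  # one 10gbps box plus ten 800mbps CDN nodes
print(round(pool_ceiling_gbps([10.0]), 1))         # 10.0
print(round(pool_ceiling_gbps(pool), 1))           # 8.8
print(round(even_share_gbps(30, len(pool)), 1))    # 2.7
```

In this model the mixed pool tops out below the big server alone, and a 30gbps demand would hand the 10gbps box about 2.7gbps, which lines up with the ~3gbps it was still pushing.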
15:32 UTC. DNS changes take effect again, our 10gbps server is now taking most of the new requests, the CDN servers that were in rotation are finishing up the downloads going to the users who connected to them.
15:45 UTC. Stress at an all-time high. Options for preventing this in the future were discussed. TL;DR: have some fancy HAProxy load balancing, with the server taking the brunt of the requests issuing 302 redirects to a fleet of weighted CDN servers (always send requests to the least utilized server). But that’ll have to be after testing and research. Maybe next year.
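The "always send requests to the least utilized server" idea boils down to something like the sketch below. It is only an illustration in Python, not HAProxy configuration: the host names, the way utilization is measured (current mbps over capacity mbps), and the function names are all assumptions.

```python
def least_utilized(servers):
    """Pick the host with the lowest current/capacity ratio.

    `servers` maps a host name to (current_mbps, capacity_mbps).
    """
    return min(servers, key=lambda host: servers[host][0] / servers[host][1])


def redirect_response(servers, path):
    """Build the (status, Location) pair the front server could answer with."""
    return 302, "http://" + least_utilized(servers) + path


# Hypothetical fleet: two 800mbps nodes and the 10gbps box
fleet = {
    "cdn1.example.com": (700, 800),
    "cdn2.example.com": (200, 800),
    "big.example.com": (9500, 10000),
}
print(redirect_response(fleet, "/files/data/core_ets2mp.dll"))
# (302, 'http://cdn2.example.com/files/data/core_ets2mp.dll')
```

Comparing ratios rather than raw traffic is what keeps a small node from being treated the same as the big one.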
15:46 UTC. After reading this and laughing, I realized it’s time to step away from the computer. This is a computer game where you drive trucks, I should not be letting this get me so worked up. I will let things even out on the CDN fiasco, get some food, try to relax and most importantly clear my mind.
15:49 UTC. While pouring water into my oatmeal, it strikes me. I can do the 302 trick, but it doesn’t have to be ‘high tech’ with weighted rotation. Let’s just let Cloudflare do that under a different subdomain. It can’t be that easy, can it? Let’s try to offload the core_ets2mp.dll file to a single CDN server...
16:01 UTC. Eat and research.
16:05 UTC. WHAT THE HELL?
16:05 UTC. KAT YOU IDIOT! Some CDN server snuck back into the DNS rotation. GET OUTTA HERE!
16:06 UTC. Traffic begins recovering.
16:09 UTC. BEHOLD! THE POWER TO SAVE THE SNOW!
(To anyone requesting the file /files/data/core_ets2mp.dll, tell em’ to go down the road to http://downloads.ets2mp.com and find the file there.)
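The real change was an nginx config, but the logic of that rule is tiny. Here it is as a Python stand-in, a sketch under assumptions: the `route` function and the set of offloaded paths are illustrative, and keeping the same path on the other host is a guess rather than something stated above.

```python
OFFLOAD_HOST = "http://downloads.ets2mp.com"
# Files handed off to the other host; more can be added later.
OFFLOADED_PATHS = {"/files/data/core_ets2mp.dll"}


def route(path):
    """Return (status, location) for a request path.

    Offloaded files get a 302 pointing at the other host; everything
    else is served locally, so there is no redirect location.
    """
    if path in OFFLOADED_PATHS:
        return 302, OFFLOAD_HOST + path
    return 200, None
```

A 302 costs the big server only a few hundred bytes per request, while the file itself is served from somewhere else.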
16:20 UTC. Reload nginx with the new config…
Mother of god.. 404mbps for one 13MB dll file..
16:?? Add a few other files to offset the load to our fleet.
We’ve recovered. Sometime in here, around 16:40, our load took a downward trend. We’re heading into the peak of player numbers (~19:00 UTC). I think we can call this done. Disable the 302’s, let the 10gbps server do its work; shut down the 10 CDN servers.
And there we have it. Lessons have been learned. My takeaway is this:
1. There will be things that you try to plan for that will not go as expected. This is ok. The important thing is to try to remain as calm as possible and work towards a solution.
2. Take a break once in a while. Clearing your head by doing something as silly sounding as pouring water to make oatmeal can do beneficial things.
3. Don’t be an idiot and decide to load balance uneven servers when sleepy and stressed out.
4. Simple solutions can do wonders.
I again want to thank 10gbps.io for providing us with the server, which did its part so very well even though our peak load seems to have been 3x what it could handle. I also want to thank you, our players, for holding out with us as we wrestled with balancing what ended up being ~7x the load we were expecting for a 24 hour period. As of writing this, we have served out around 108,000 wintermod downloads. That’s ~82TB of data in 48 hours. Not too bad for a free mod!
If you’d REALLY like to see what the download server can do now that it’s not being abused with more than we planned, press F1 with the launcher open to clear out your local files and try to download it again! :)