Tech Design at Riot
My name’s Cam Dunn, and I’m the Tech Director for League. I’ve always found it interesting how much we don’t know about human history simply because no one made a record at the time. Who first invented the hammer? Hammers with handles have been around for tens of thousands of years, but nobody knows who made the first one for sure. Sometimes, history feels like we’re collectively waking up after a drunken night out and asking, “My god, what did we do last night?”
Riot came after the invention of language, but still we suffer from some of the same problems—during those first formative years, everything happened so quickly that sometimes we forgot to write it down or capture it clearly. I remember the first time I visited a Riot data center, where I saw a single rack of running servers completely surrounded by others that’d been powered down. When I asked about it, the answer I got was, “Yeah, we don’t actually know what’s running on those machines, so we can’t turn them off.”
We did eventually identify those machines so we could shut them down, but the experience left me thinking: we need a way of querying services, even if we don’t know what they are. We need a standard for service identification. But how to best establish such a standard?
Here at Riot, we highly value autonomy; in fact, we describe our teams as “highly aligned, highly autonomous.” That level of autonomy means that any kind of formal approval process for technical designs or standards, the kind where “The Man” (pictured below) needs to put his or her rubber stamp on your design, wouldn’t work here. But that doesn’t mean we’re left with a free-for-all: we have technical standards and engineering best practices like you’d expect. And to get alignment, we use a Request for Comments (RFC) process.
The Man, in his/her various forms
RFCs aren’t something we invented: internet standards are all RFCs, and communities like Python use a system called PEPs, which is very similar. We borrowed the name but tweaked the process for our own needs.
Our general approach for RFCs is simple: let’s say you’re about to write a new system or make some non-trivial changes to an existing system. You know there are lots of great engineers across Riot, including many who might be experts in your problem domain. So you write up a broad proposal about what you’re about to do and send it around to the entire engineering team to ask for their comments. As an example, to avoid “The Case Of The Mystery Servers” (the example I started this post with), I wrote up an RFC proposing that every service expose its ID through a standard mechanism, so you can find out what anything is… even if you don’t know what it is.
I then gave my RFC a unique ID and tagged it with metadata to help other engineers who might be interested find it. Tags can include team or organization name, or the domain to which the RFC directly applies. These additional details help with maintenance and avoid overload. While everyone has access to every RFC, this system makes picking out just the ones you care about much easier. My Mystery Servers RFC ended up being RFC100, and I tagged it with “SOA,” “API,” and “Standard,” so people who cared about those things could take note and start adding their comments.
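To make the tagging idea concrete, here is a minimal sketch of how an RFC library could be filtered by tag. It is not our internal tooling: the `Rfc` record, its field names, and the `find_by_tags` helper are hypothetical, and the second library entry is a made-up placeholder.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rfc:
    """Hypothetical record for one RFC; the field names are illustrative only."""
    rfc_id: int
    title: str
    tags: frozenset = field(default_factory=frozenset)


def find_by_tags(library, wanted):
    """Return RFCs that carry at least one of the wanted tags (case-insensitive)."""
    wanted_lower = {tag.lower() for tag in wanted}
    return [
        rfc for rfc in library
        if wanted_lower & {tag.lower() for tag in rfc.tags}
    ]


library = [
    # RFC100 and its three tags come from the post.
    Rfc(100, "Query and Control RPC Pattern", frozenset({"SOA", "API", "Standard"})),
    # Placeholder entry, invented purely so the filter has something to exclude.
    Rfc(101, "Placeholder proposal", frozenset({"Build"})),
]

api_rfcs = find_by_tags(library, ["api"])
```

With this data, someone who only cares about API topics gets RFC100 back and never sees the placeholder.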
As you get feedback on your proposal, the choice is up to you on what to do with it. This isn’t an approval process; this is about getting valuable advice. Generally, people ignore some comments but amend the design to take others into account. Sometimes, people don’t get any actionable feedback and the design stands exactly as-is. Other times, the author realizes through feedback that the idea sucks and scraps it entirely. The decision stays in the hands of the original engineer.
For RFC100, I got lots of great feedback. Engineers stronger in API design pointed out aspects of my proposal that were too vague or too game-specific and suggested improvements. For example, my background is in game technology, and I originally proposed that every server would expose its frame rate, which is a videogame concept that doesn’t necessarily apply to more traditional web services. So we tweaked it to a more applicable, general metrics endpoint.
In terms of feedback, RFC100 isn’t a unique example. Every time I’ve written an RFC, someone at Riot has noticed improvements or pointed out potential pitfalls in my proposed approach. Maybe they’re a 20-year veteran who has written a similar system five times at previous gigs. Or, maybe they’re a younger engineer who has never even heard of the problem space but still picked up on some gap in my proposal. If I’d just started writing the code, I would’ve ended up with a much worse result.
Feedback is invaluable, but how does something become a standard? If everyone’s just doing whatever they want, how the hell does anything work together? The RFC process also allows Rioters to come to agreement on technical standards. We do this through a process called “adoption,” whereby teams (or groups of teams, or whole products) can opt in to adopting an RFC.
To adopt an RFC simply means that we expect everyone in the adopting scope to follow the guidance given in the RFC if it makes sense for them. Sometimes they won’t have to do anything: maybe the RFC only applies to Java development, and they write C++. Or maybe the RFC talks about gathering performance data from game servers, but they work on internal tools. That’s okay. Good judgment is the determining factor here. But in general, if something’s adopted at “riot.lol” scope, it then applies to everyone in League of Legends Engineering. Generally, by the time we would consider adopting something across the whole project, each team has individually opted in for their scope, so there are no surprises.
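One way to picture adoption scopes is as dotted, hierarchical names, where adopting at a parent scope covers everything beneath it. The sketch below assumes that model; only “riot.lol” appears in this post, so the team scope names and the `is_covered` helper are hypothetical.

```python
def is_covered(team_scope, adopted_scopes):
    """Return True if the team's scope, or any parent of it, has adopted the RFC.

    Assumes scopes are dotted names, e.g. "riot.lol.gameplay" sits under "riot.lol".
    """
    parts = team_scope.split(".")
    # Build every prefix of the scope: "riot", "riot.lol", "riot.lol.gameplay".
    prefixes = {".".join(parts[:i]) for i in range(1, len(parts) + 1)}
    return bool(prefixes & set(adopted_scopes))


adopted = {"riot.lol"}                                        # adopted for all of League
covered = is_covered("riot.lol.gameplay", adopted)            # hypothetical team under riot.lol
not_covered = is_covered("riot.internal-tools", adopted)      # hypothetical scope outside riot.lol
```

A check like this only answers whether a team falls inside an adopting scope. Whether the guidance actually makes sense for that team is still a judgment call.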
As might be expected, any process like this can face some suspicion and needs continual tweaking. Over the past three years, we’ve iterated to simplify and improve upon the initial process we put in place. For example, when we started, each RFC had a list of “stakeholders” who were the “important-sounding” people the author thought might care about the topic. However, trying to pick out a list of other people who you, the author, think will care about a given RFC comes with pitfalls that ultimately make it less valuable. We now prefer a self-service model whereby potential stakeholders self-identify, rather than relying on the author to know everyone at Riot who’d be interested in their proposal. RFC100 was pretty broadly applicable, so a bunch of engineers writing services jumped in to comment. This is just one example of how we continue to tweak the process to make it fit for purpose.
Riot RFC Library
We now (circa October 2015) have over 425 RFCs in our internal library, across every aspect of our tech stack. Most of these are “what do you think?” style proposals, with the author looking for feedback on their approach. Some of these are “standards” RFCs, which define things like communication protocols or coding style guides. Some have a handful of comments, while others have hundreds. Some we abandoned, and others we adopted across the company as a Riot-wide standard.
In the end, my RFC100 ended up a Riot-wide standard called “Query and Control RPC Pattern.” It defines standard endpoints for things like service identification. So now you can ask any Riot service for its ID, and there’s no chance that we’ll have servers running because we’re “not sure what it is, but it might be important.” Future visitors to our data centers will not be left scratching their heads about mysteriously running servers.
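For a feel of what “ask any service for its ID” could look like, here is a minimal sketch using only the Python standard library. It is not the actual Query and Control RPC Pattern: the `/id` path, the JSON fields in `SERVICE_INFO`, and the helper names are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical identity fields; the real standard's schema is not described in this post.
SERVICE_INFO = {"id": "example.mystery-service", "version": "0.0.1"}


def respond(path):
    """Map a request path to (HTTP status, JSON body bytes)."""
    if path == "/id":  # hypothetical endpoint path
        return 200, json.dumps(SERVICE_INFO).encode("utf-8")
    return 404, json.dumps({"error": "unknown endpoint"}).encode("utf-8")


class IdentityHandler(BaseHTTPRequestHandler):
    """Serves the identity endpoint over plain HTTP GET."""

    def do_GET(self):
        status, body = respond(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=0):
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), IdentityHandler)
```

Calling `make_server(8080).serve_forever()` would start the sketch locally, and a GET on `/id` would return the identity JSON. The point is that the question “what are you?” has the same shape for every service, so you can ask it without knowing anything else about the machine.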
And thanks to RFCs, Riot engineers are scratching their heads about fewer things every day, now that we have a way of writing things down that works for us. Nobody at Riot wants to be “The Man,” but everyone recognizes the value of sharing our technical designs and standards. The RFC process gives us a decentralized and self-service way of doing this.