Key Conjurer: Our Policy of Least Privilege

Hi, my name is Reza Nikoopour and I’m a security engineer on the Security team at Riot. My team is responsible for securing Riot infrastructure wherever we’re deployed - whether that means internal or external data centers or clouds. We provide cloud security guidance to the rest of Riot, and we’re responsible for Key Conjurer, our open source AWS API programmatic access solution.

Key Conjurer uses AWS STS to create temporary AWS API credentials for accessing our AWS infrastructure programmatically. This solves the problem of having permanent credentials with 24/7 access to Riot AWS infrastructure on our developers’ machines. Permanent credentials present massive security concerns for an organization because they are difficult to manage, track, and rotate properly.

In this article, I’ll walk you through the problems that prompted us to build Key Conjurer, our iterations (including the technical details and final result), and the impacts of our solutions.

The Problem with Permanent Credentials

Managing permanent credentials at scale is notoriously difficult. While handling access on an account is easy with few users, as accounts grow it becomes more difficult to scale proper access management. Credentials aren’t rotated properly, ownership becomes difficult to track, and permission sets grow over time. This results in untracked keys which aren’t regularly reviewed and rarely have permissions reduced. It becomes a serious challenge to easily tell who has access to what, and even harder to take corrective action when something isn’t right.

Security Tech Debt

All of this makes it difficult for security teams to understand what level of access users have versus what they actually need. We didn’t have insight into who had access, because historically we hadn’t controlled IAM users directly; they were managed at the local account level. This was a massive problem because security needs to be able to track who has access to resources in our AWS environment.

Something Needed to Change

At Riot, when we first got into the cloud, we had a shared account model. This enabled us to move quickly, but moving quickly introduced a lot of security debt for the future. For example, at first we gave devs unfettered access as they needed it to resolve player pain as fast as possible. This left us with tons of users holding many different levels of AWS access, which is nearly impossible to manage.

Whoops, looks like someone embedded an AWS API key in their code...

While the speed gain helped us provide better player experiences, it became difficult to properly manage our permanent AWS API credentials. A few years after our move to AWS, one of our developers accidentally pushed their API credentials to a public GitHub repository. Within a few minutes of the credentials being public, bad actors were able to spin up 1,283 AWS spot instances in our environment. We subsequently detected the new instances and terminated them.

We were extremely lucky that the attackers were noisy smash-and-grab hackers, so we were able to resolve the issue quickly. This made us realize we had to prioritize figuring out how to implement the policy of least privilege so that we could operate securely in the cloud without impacting developer speed. In other words, we would have to limit programmatic access to services within our AWS accounts to only what was needed.

Building Our Solution

Restricting access to only those who absolutely need it can be an unpopular change. Fortunately, the whole organization aligned with us switching to this policy, which we knew was key for building safer systems. As you can read in a previous post, security at Riot had gone through a cultural evolution. This made it easier to get alignment around our solutions because we had focused on gathering feedback and building trust and communication.

While going through these iterations, we discovered two key aspects of a successful solution.

- It needed to be something developers would actually use.

- It needed to hit our bar for high-quality security.

Creating A Usable Product

Our initial solution for removing permanent credentials (let’s call it “Version 1”) technically worked, but it had drawbacks that made adoption difficult. When we implemented our solution, AWS STS credentials were valid for a maximum of 1 hour. This caused pain for our devs because requesting AWS access every hour slowed their velocity. Several teams adopted the solution, but it didn’t resonate with teams that had heavy use cases, such as running build jobs. We knew we had to prioritize making a credentials system that was not only secure, but that made the lives of our developers easier and enabled them to work without having to repeatedly authenticate. At that point, we had reduced our total permanent AWS API keys by 10%.

As soon as AWS introduced multi-hour AWS STS credentials, we refactored our application. Once the changes were implemented, we immediately saw an uptick in adoption as devs realized how much more convenient this iteration was - resulting in a 75% reduction of our total permanent AWS API keys.

Creating A Safe Product

This surfaced other issues with how the application was architected. The service ran on a long-lived EC2 instance, which meant we also had to manage the OS and the full stack underneath it. That work was outside the scope of permanent credential management.

This led us to our current iteration: Key Conjurer. The biggest difference is that we’ve moved to a serverless architecture. Now we have on-demand infrastructure handling requests… which safely disappears into the abyss once it’s done.

Key Conjurer

This is the story I’m most excited to tell: how we took this service from 15 total users to 200 monthly users.

A Quick Timeline

In 2017 we released the original service, Version 1, and shortly thereafter we encouraged everyone to delete their own API credentials. We deleted somewhere around 10% of credentials. That was a good start, but we knew we could do more to limit the number of permanent AWS API credentials. We listened to our customers and discovered how big a pain point the 1-hour limit was.

In 2018, while building Key Conjurer with its multi-hour credential capacity, we documented access across the organization. We collected every AWS API key in our environment, put them into a spreadsheet, and asked all gatekeepers to help us understand why they were needed. We now understood the use case for every permanent credential and could make determinations on which ones could be replaced with temporary credentials.

Current Status

As soon as we delivered the 8-hour temporary credentials, we saw a spike in adoption. After working with the organization, we were able to delete 75% of our permanent credentials. While we haven’t quite reached our goal of zero permanent credentials, our attack surface has been greatly reduced. We went from all devs having 24/7 access to AWS infrastructure to a more on-demand model of access with standardized permissions.

The 2018 key cleanup was the second proudest point of my career (the first was watching my students graduate).

Architecture

Now that you have an idea of where we are and how we got here, I’m going to tell you a bit about how Key Conjurer looks when you get deep into its guts.

The current model is made up of three components: the API, the Web UI, and the CLI.

The API

This is a serverless application running in AWS Lambda, fronted by API Gateway. It provides the authentication and authorization layer and issues the temporary credentials.

The Web UI

This is the browser front-end, and it’s served via CloudFront backed by an S3 bucket. Any users requesting temporary credentials via the browser see Key Conjurer through this view.

The CLI

Since developers don’t actually use web browsers for anything except StackOverflow, we built a command-line version for them to leverage. Riot has hybrid infrastructure with Windows, Mac, and Linux computers, so we built the CLI in Go, which lets us easily compile for each platform.
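One place where platform differences show up in a tool like this is handing credentials to the user's shell. The following is a minimal sketch, not the actual Key Conjurer CLI: the `exportLines` helper and the `Credentials` type are hypothetical, while the three environment variable names are the standard ones read by the AWS CLI and SDKs. It assumes PowerShell on Windows and a POSIX shell elsewhere.

```go
package main

import (
	"fmt"
	"runtime"
)

// Credentials is a hypothetical holder for one temporary session.
type Credentials struct {
	AccessKeyID     string
	SecretAccessKey string
	SessionToken    string
}

// exportLines (hypothetical helper) renders the credentials as shell
// statements for the given operating system name, as reported by runtime.GOOS.
func exportLines(goos string, c Credentials) []string {
	vars := [][2]string{
		{"AWS_ACCESS_KEY_ID", c.AccessKeyID},
		{"AWS_SECRET_ACCESS_KEY", c.SecretAccessKey},
		{"AWS_SESSION_TOKEN", c.SessionToken},
	}
	lines := make([]string, 0, len(vars))
	for _, v := range vars {
		if goos == "windows" {
			// PowerShell syntax (assumed default shell on Windows).
			lines = append(lines, fmt.Sprintf("$Env:%s = '%s'", v[0], v[1]))
		} else {
			// POSIX shell syntax for macOS and Linux.
			lines = append(lines, fmt.Sprintf("export %s='%s'", v[0], v[1]))
		}
	}
	return lines
}

func main() {
	// Placeholder values; real credentials would come from the API.
	creds := Credentials{
		AccessKeyID:     "ASIAEXAMPLE",
		SecretAccessKey: "example-secret",
		SessionToken:    "example-token",
	}
	for _, line := range exportLines(runtime.GOOS, creds) {
		fmt.Println(line)
	}
}
```

The same source builds for each platform by setting the target at compile time, for example `GOOS=windows GOARCH=amd64 go build`.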

Leveraging Security At Riot

The only reason this whole process of standardization and iteration actually worked within the org is that security is embraced here. We try not to make our developers’ lives difficult. If devs are telling us that our product isn’t convenient, we listen to why and prioritize that feedback.

At Riot, we often say that we don’t want to be the security team in the corner saying “no.” We want to be the team that works with other teams to make sure they understand the risks and helps them decide whether to fix or accept them. If you’d like to hear more about our cultural transformation as a security team, check out this previous article by Mark Hillick, Jason Clark, and David Rook.

A huge part of our earlier efforts was trying to capture what developers actually needed to drive adoption. The fact that our users actively engaged with our solution helped us make something they'd actually want to use, and today we are pleased to see engineers throughout Riot adopting our tool.

Looking Forward

Today, we’re open-sourcing Key Conjurer. We’re excited to share our tool with the world and would love others to test it out for themselves.

So what’s next?

We still have around 200 keys left because they’re needed for service-to-service use cases. We’re currently working on providing a solution for this so developers will no longer have to provide permanent API credentials to services outside of AWS. This will help us continue to follow our policy of least privilege.

Thanks for reading! If you have any questions or comments, feel free to comment below.

Posted by Reza Nikoopour