Gamification to the Rescue

What is Gamification?

Gamification is the practice of adding game-like elements to non-game contexts. It’s a way to make work more fun and engaging, and it can be a powerful tool for improving motivation and productivity.

It sounds completely crazy, but it’s actually a really effective way to get people to do things they might not otherwise want to do. It’s all about tapping into our natural desire for competition, achievement, and recognition.

A Story of Gamification

A few years ago I was working at an organization that was cutting edge in its field. So much so that it had to invent much of the technology it used.

If you know something can be done, it’s really easy to do it, and do it well. If you’re having to feel your way in the dark, it’s hard to see the problems until they’re deeply embedded in the system. In our case we were so busy on the “can it even be done” that we weren’t looking at whether we were doing the basics well.

One major area of concern was vulnerabilities. We had a lot of them, but with such a backlog of features, tech debt, and bugs, we just never seemed to find time to get to them. We had plenty of scanning tools, but they were all just sitting there, unused.

We tried running “Vulnerability Sprints” - a week where we would focus on fixing vulnerabilities. It was a good idea, but it just didn’t work. We would get a few done, but then we would get distracted by other things, and the momentum would be lost.

I was running a small SRE team trying to work on making our systems work better with each other. Vulnerabilities were a problem, but they weren’t my problem. I had my own backlog of work to do, and I just didn’t have the time to focus on them.

Then a minor situation occurred. It was nothing earth-shattering, and nothing was at risk, but it was the first time anything had gained enough visibility to make clear that vulnerabilities were a problem we couldn’t ignore. We had to do something that not only got on top of the problem, but lasted.

My team didn’t have much capacity, but I was asked to look into it. I wrote some scripts to pull stats from our scanners and match them up to various tags on our cloud infrastructure. It was by no means perfect, but it was good enough to get a rough idea of where the vulnerabilities were coming from.
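The matching logic was simple: roll each scanner finding up to the team named in the resource’s tags. A minimal sketch of the idea in Python, with made-up data shapes and tag names rather than our real scanner output:

```python
from collections import Counter, defaultdict

# Hypothetical findings: in reality these came from the scanners' APIs.
# Each finding names the cloud resource it was found on; each resource
# carries an "owner" tag identifying the team responsible for it.
findings = [
    {"resource": "api-server-1", "severity": "critical"},
    {"resource": "api-server-1", "severity": "high"},
    {"resource": "batch-worker-3", "severity": "high"},
]
resource_tags = {
    "api-server-1": {"owner": "payments"},
    "batch-worker-3": {"owner": "data-eng"},
}

def vulns_by_team(findings, resource_tags):
    """Roll findings up to the owning team via resource tags."""
    counts = defaultdict(Counter)
    for finding in findings:
        tags = resource_tags.get(finding["resource"], {})
        # Imperfect by design: anything untagged gets lumped together.
        team = tags.get("owner", "untagged")
        counts[team][finding["severity"]] += 1
    return counts

counts = vulns_by_team(findings, resource_tags)
```

The “untagged” bucket matters: it gives visibility into how much infrastructure can’t be attributed to anyone at all.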

I wrote a simple scheduled script that would pull the stats and push the data into our company reporting tools. I set up a dashboard that would show the number of vulnerabilities, and how they were trending over time. I also set up some alerts that would notify us when we had a certain number of vulnerabilities.
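The alerting piece boiled down to comparing the rolled-up counts against thresholds. A sketch of that logic only, with invented threshold values; the real alerts went through our company reporting tools:

```python
# Hypothetical thresholds, purely for illustration.
ALERT_THRESHOLDS = {"critical": 1, "high": 50}

def alerts_for(severity_counts, thresholds=ALERT_THRESHOLDS):
    """Return one message per severity level that breaches its threshold."""
    return [
        f"{sev}: {severity_counts.get(sev, 0)} findings (threshold {limit})"
        for sev, limit in thresholds.items()
        if severity_counts.get(sev, 0) >= limit
    ]

# 3 criticals breaches the threshold of 1; 12 highs stays under 50.
msgs = alerts_for({"critical": 3, "high": 12})
```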

And the tumbleweeds kept on tumbling. No one cared. We had the data, we had the tools, but no one was using them.

Instead I modified the script to generate a set of simple HTML charts. I used them to create a set of (worst case first) leaderboards: essentially the same data sliced and diced in different ways. I didn’t know what was going to work, so I made a few that seemed like they might be useful. I had the script send the report to the company-wide email distribution list.
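A worst-case-first leaderboard is just the per-team counts sorted with criticals dominating and highs breaking ties, rendered as a table. Something along these lines, with illustrative names throughout:

```python
def leaderboard_html(team_counts):
    """Render a worst-case-first leaderboard as a minimal HTML table.

    team_counts: {team: {"critical": n, "high": n}} — the shape of the
    per-team rollup described earlier; the teams here are made up.
    """
    # Sort worst first: criticals dominate, highs break ties.
    ranked = sorted(
        team_counts.items(),
        key=lambda kv: (kv[1].get("critical", 0), kv[1].get("high", 0)),
        reverse=True,
    )
    rows = "".join(
        f"<tr><td>{team}</td><td>{c.get('critical', 0)}</td>"
        f"<td>{c.get('high', 0)}</td></tr>"
        for team, c in ranked
    )
    return (
        "<table><tr><th>Team</th><th>Critical</th><th>High</th></tr>"
        + rows + "</table>"
    )

html = leaderboard_html({
    "payments": {"critical": 12, "high": 80},
    "data-eng": {"critical": 3, "high": 200},
})
```

The same function, fed differently-sliced counts (by service, by environment, by account), produces each of the alternative leaderboards.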

On day one we had:

  • Critical: 6237
  • High: 64598

The email went out at 3am, and by the time I logged in at 8:30 I had a few Slack messages from people asking about the report.

By the end of the day I had more people talking, and some genuinely angry that I was attempting to make them look bad.

My response to everyone was the same - “I’m not trying to make anyone look bad, I’m just making sure we all know where we stand.”

Over the next few days I had engineers reaching out asking where I got the data from, what they could do to fix problems, and so on. It was a little overwhelming. By the end of the week I was regretting it, but then the CTO reached out and couldn’t believe what a huge reaction there had been. Despite various attempts in the past, this was the most noise the problem had ever generated.

Among the various conversations, my team identified a few key areas that were causing the most problems, the biggest being container repositories. Teams created images but never cleaned up old ones. The vulnerability scanners were still scanning them, though, and they were still showing up in the reports. We had a lot of old images just sitting there, generating a lot of noise. Engineers complained that they weren’t a risk and shouldn’t be counted, but we had to draw the line somewhere, so I pushed back.

I did, however, keep telling people that I still wasn’t asking them to do anything they didn’t want to do.

I spent the first couple of days of the next week writing a tool that scanned a repository, looked at all the places its images could be used, and deleted images that weren’t needed any more. It could be run in dry-run mode, and it could be set on a schedule. By the time the third email went out, a few teams had it running in dry-run mode, and by the time the fourth went out, a few teams had it running in production.
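The core of the tool was deciding which image tags were safe to delete: keep anything still referenced anywhere, plus a small buffer of the newest images. A simplified sketch; the real tool walked our deploy configs and registry APIs to find what was in use, and everything here is illustrative:

```python
def images_to_delete(repo_images, images_in_use, keep_latest=3):
    """Return image tags that are safe to delete.

    repo_images: tags ordered newest-first, as a registry listing might be.
    images_in_use: tags referenced by any running workload or manifest.
    Keeps anything in use, plus the newest `keep_latest` tags as a buffer.
    """
    keep = set(images_in_use) | set(repo_images[:keep_latest])
    return [tag for tag in repo_images if tag not in keep]

def clean_repository(repo_images, images_in_use, dry_run=True):
    """Delete (or, in dry-run mode, merely report) unneeded images."""
    doomed = images_to_delete(repo_images, images_in_use)
    for tag in doomed:
        if dry_run:
            print(f"[dry-run] would delete {tag}")
        else:
            print(f"deleting {tag}")  # real tool called the registry API here
    return doomed

# v10, v9, v8 are kept as the newest three; v7 is kept because it's in use.
doomed = clean_repository(
    ["v10", "v9", "v8", "v7", "v6", "v5"],
    images_in_use=["v10", "v7"],
)
```

Defaulting to dry-run was what let teams adopt it incrementally: run it, read the report, then flip the flag.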

Eventually we replaced my naive script with a better tool that captured more data and was more accurate, but by the time I stopped sending my report emails we had:

  • Critical: 298
  • High: 12417

Those numbers weren’t perfect, but to put them into context: we never told anyone to deal with vulnerabilities, we just published the top 10 worst offenders each week. We reduced Criticals to roughly 4.8% of the original count (298 of 6237), and Highs to 19.2% (12417 of 64598).

The Power of Gamification

The key was that we never told anyone to do anything, and we didn’t ask anyone to de-prioritize work they were already planning.

We, as a central team, provided tools to do a lot of the heavy lifting, but we didn’t force anyone to use them.

We just made sure everyone knew where they stood, and let them decide what to do about it. We let them decide how to prioritize it, and we let them decide how to fix it.

After the Game

Once things were at a manageable level, we introduced a new CI/CD module that blocked deployments if they had any critical vulnerabilities at all, or more than 10 high vulnerabilities. We also helped teams set up Dependabot to keep their dependencies up to date. This stopped the flow of new vulnerabilities into production, but if we had done it on day one the whole company would have crumbled trying to deal with it. Instead we got on top of the problem, and then put in guard-rails to stop it from happening again.
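The gate itself was nothing clever: fail the pipeline if the scan report breaches the thresholds. A minimal sketch of the check; the thresholds match those above, and how the counts are read from the scanner’s report is omitted:

```python
# No criticals at all, and at most 10 highs.
MAX_CRITICAL = 0
MAX_HIGH = 10

def gate(critical, high):
    """Return True if the build may deploy, False if it must be blocked."""
    return critical <= MAX_CRITICAL and high <= MAX_HIGH

# In CI this would read counts from the scanner's report and exit nonzero
# on failure, e.g.: sys.exit(0 if gate(critical, high) else 1)
ok = gate(critical=0, high=4)       # under both thresholds: deploy
blocked = gate(critical=1, high=0)  # one critical: blocked
```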

It may have taken time, but it was a sustainable solution that lasted. We didn’t just fix the problem, we fixed it and then stopped it from happening again.