Kolide

Heroku’s Managed Service Helps Kolide Do More with a Smaller Team

After an expensive experiment with Kubernetes, Kolide can execute faster on Heroku.

Data breaches are bad news, both for the people exposed and for the company at the center. While it's tempting to blame hackers, many leaks are due to human error. The traditional solution has been to lock down employee devices so that such mistakes are impossible. However, that makes it harder for people to do their jobs.

Kolide takes a different approach. Rather than restricting employees, it works with them. Every few seconds, Kolide checks their devices for potential security issues. If it finds something, Kolide's Slack app alerts the employee and suggests a solution. Kolide also keeps the IT department informed with a security dashboard that offers a bird's-eye view of the company's entire device fleet.

An example of the Kolide Slack bot showing a user how to improve their security

Having launched their product on Kubernetes, the Kolide team recognized that the effort of managing their infrastructure was taking resources away from product development. Since switching to Heroku, the Kolide team can focus entirely on delivering value to their customers.

Kubernetes required an entire team just to manage it

When Kolide launched its first SaaS product in 2017, Kubernetes was the exciting technology of the moment. Kolide had ambitious plans for hypergrowth, and Kubernetes seemed like the perfect fit: it promised to make scaling easy by creating a dedicated instance of the Kolide platform for each customer.

However, the reality was both harder and costlier than the Kolide team had expected. While Kubernetes provided the basics of orchestrating containers, the Kolide team found that they had to create a great deal of custom tooling just to deploy and manage their software. In particular, rolling out updated code to each customer’s individual namespaces took substantial engineering effort. All of that time spent managing Kubernetes was time taken away from building their own product.

Missed opportunity was not the only cost that Kubernetes brought. Out of an engineering team of 16 people, four full-time Site Reliability Engineers (SREs) were needed to look after Kolide’s Kubernetes infrastructure. And the cost of hosting all those containers, many of them for customers on free trials, became a significant monthly expense.

Bringing version one of Kolide to market gave the Kolide team deep insights into the security needs of their customers. Over time, the team realized they could better serve their customers and their customers’ employees by taking a more collaborative approach to security. That would mean creating an entirely new offering. However, after burning budget on Kubernetes infrastructure, the Kolide team had limited resources. Building the new product would take a new approach, so the Kolide team turned to Heroku, which would allow them to focus their efforts entirely on feature development.

“Heroku’s managed services mean we don’t need to hire dedicated Site Reliability Engineers to run our infrastructure while we validate our app for product/market fit. Instead, we can focus on delivering value to our customers.” Jason Meller, Founder & CEO, Kolide

Kolide’s infrastructure on Heroku

Today, Kolide’s approach to employee device security relies on a hybrid infrastructure, split between Heroku and the employee devices themselves. On each employee device is an open source agent named “osquery” whose job is to produce a report on potential security risks. Every few seconds, osquery checks in with Kolide’s Heroku-hosted API to deliver its most recent report and also to receive instructions on what to include in its next report. For example, if a security researcher discovers a new application vulnerability, Kolide would instruct osquery to check for its presence.

A collection of tiles showing different security data

Kolide has two distinct Ruby on Rails applications running on Heroku. The front-line application is Kolide's Device API server. With thousands of employee devices relaying data tens of thousands of times per day, Kolide's Device server runs across several Heroku Dynos in order to handle the volume of requests. Once device data arrives, Kolide must also check it for any notable security risks and then act on anything it finds. When an issue is discovered, Kolide's primary Heroku-hosted app comes into play. This app does double duty: it serves the web UI dashboard for IT/Security administrators, and it relays information to Slack's API to run Kolide's Slack app, which alerts employees to any issues and offers advice on how to solve them.

Heroku Redis enables Kolide to handle millions of API calls

Heroku's managed data services were a significant factor in why the Kolide team chose to build their revised infrastructure on Heroku. Not only are there on average over 17,000 daily data uploads for each device, but onboarding a large new customer can mean an immediate leap of millions more reports added to the daily total.
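To put those figures in perspective, here is a rough back-of-the-envelope calculation in Ruby. Only the 17,000 uploads per device per day comes from the figures above; the 1,000-device customer is a made-up illustration, and the method names are hypothetical.

```ruby
# Back-of-the-envelope load estimate. Only UPLOADS_PER_DEVICE_PER_DAY comes
# from the article; the fleet size used below is a made-up illustration.
SECONDS_PER_DAY = 24 * 60 * 60
UPLOADS_PER_DEVICE_PER_DAY = 17_000

# Average gap between two uploads from a single device, in seconds.
def seconds_between_uploads(uploads_per_day = UPLOADS_PER_DEVICE_PER_DAY)
  SECONDS_PER_DAY.to_f / uploads_per_day
end

# Total reports per day produced by a fleet of the given size.
def daily_reports(device_count, uploads_per_day = UPLOADS_PER_DEVICE_PER_DAY)
  device_count * uploads_per_day
end

# The same daily total expressed as an average request rate.
def average_requests_per_second(device_count)
  daily_reports(device_count).to_f / SECONDS_PER_DAY
end

hypothetical_new_customer_devices = 1_000
puts format("One upload roughly every %.1f seconds per device", seconds_between_uploads)
puts "Extra reports per day: #{daily_reports(hypothetical_new_customer_devices)}"
puts format("Extra average load: %.0f requests/second",
            average_requests_per_second(hypothetical_new_customer_devices))
```

At that rate each device reports about once every five seconds, and a hypothetical customer with 1,000 devices would add 17 million reports to the daily total.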

To keep their API server responsive under high throughput, the Kolide engineering team designed the service to close each API call as quickly as possible. Their original plan was to use Apache Kafka on Heroku to queue inbound device data for asynchronous processing. However, they found it simpler to use Heroku Redis to ingest the data from API calls and then hand it to background Rails jobs, run by Sidekiq, for analysis. Results from the analysis are then written to Kolide's Heroku Postgres database. With this lean setup, Kolide is successfully processing nearly a billion jobs per month.
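The idea of closing each API call quickly and deferring the analysis can be sketched in a few lines of Ruby. This is a minimal illustration of the general pattern, not Kolide's actual code: an in-memory queue stands in for Heroku Redis, a plain thread stands in for a Sidekiq worker, an array stands in for Heroku Postgres, and the names receive_report, analyze, and the disk_encrypted check are all hypothetical.

```ruby
require "json"

# In-memory stand-ins, for illustration only. In the setup the article
# describes, the queue would live in Heroku Redis, the worker would be a
# Sidekiq job, and results would be written to Heroku Postgres.
INGEST_QUEUE = Queue.new   # thread-safe FIFO from Ruby's core library
RESULTS = []               # stands in for a database table
RESULTS_LOCK = Mutex.new

# API side: do the minimum work (serialize and enqueue), then return at once.
def receive_report(device_id, report)
  INGEST_QUEUE << JSON.generate("device_id" => device_id, "report" => report)
  { status: 202 } # "Accepted": the analysis happens later
end

# Hypothetical check: flag a device whose report says the disk is unencrypted.
def analyze(payload)
  data = JSON.parse(payload)
  issues = []
  issues << "disk_not_encrypted" if data["report"]["disk_encrypted"] == false
  { device_id: data["device_id"], issues: issues }
end

# Worker side: drain the queue in the background and store the findings.
def start_worker
  Thread.new do
    while (payload = INGEST_QUEUE.pop) != :stop
      finding = analyze(payload)
      RESULTS_LOCK.synchronize { RESULTS << finding }
    end
  end
end

worker = start_worker
receive_report("laptop-1", { "disk_encrypted" => false })
receive_report("laptop-2", { "disk_encrypted" => true })
INGEST_QUEUE << :stop   # sentinel so the sketch terminates
worker.join
RESULTS.each { |finding| puts finding.inspect }
```

The design choice this illustrates is that the request path never waits on the analysis: the caller gets an acknowledgement as soon as the report is queued, and the number of workers can be scaled separately from the number of API processes.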

An example of a live SQL query in Kolide

Just as the API server must remain responsive, the Slack application needs to be quick to respond to user interaction. While web applications use progress bars and other visual cues to indicate that something is happening, users of a chat app expect instantaneous feedback when they perform an action. As such, the Kolide team has a strict requirement that their Slack bot must respond immediately to user queries. Heroku is vital in meeting that expectation. In busier times, the Kolide team can easily increase the number of dynos serving their Slack application and then spin them back down in quieter times.

While their dashboard application receives only a small amount of traffic compared to their API server, Heroku’s simple scaling model ensures that new customers put no noticeable strain on the application.

“Heroku’s expertise with Postgres and Redis means that we can take on more customers without having to pause and think about how to handle the increase in demand on our data layer.” Jason Meller, Founder & CEO, Kolide

Heroku’s built-in expertise lets Kolide build more, faster

For the Kolide team, Heroku is more than just a way to deploy their apps. It is a lever that lets them deliver more value to their customers faster and with a smaller team. Switching to Heroku means that the Kolide team can focus on iterating quickly on their product without needing a team of dedicated engineers to manage operations and infrastructure.

The Kolide web UI

In moving from Kubernetes to Heroku, Kolide has saved thousands of dollars in monthly platform fees, continued to provide excellent service to customers despite a reduced engineering headcount, and cut their time to market. By working with Heroku, Kolide has built a platform that protects employee devices worldwide.


Inside Kolide on Heroku

Kolide serves thousands of requests per second with two Ruby on Rails apps running on Heroku Dynos, backed by Heroku Postgres and Heroku Redis. They queue inbound data using Redis for asynchronous processing and use Sidekiq to run background Ruby on Rails jobs. Their front-line app is an API server that communicates with devices across the globe, and it is supported by a second app that delivers both the web dashboard and the Slack app.


Listen to the Code[ish] podcast featuring Jason Meller: “Cybersecurity”.