Engineering

Hello RedBeat: A New Celery Beat Scheduler

Engineering
Last Updated: May 02, 2017
Marc Sibson

The Heroku Connect team ran into problems with existing task-scheduling libraries. Because of that, we wrote RedBeat, a Celery scheduler that stores scheduled tasks and runtime metadata in Redis. We’ve also open-sourced it so others can use it. Here is the story of why and how we created RedBeat.

Why We Created the RedBeat Celery Scheduler

Heroku Connect makes heavy use of Celery to synchronize data between Salesforce and Heroku Postgres. Celery is an asynchronous task queue that lets us schedule and queue jobs for execution by a background worker process.…

Sockets in a Bind

Engineering
Last Updated: March 30, 2017
Lex Neva

Back on August 11, 2016, Heroku experienced increased routing latency in the EU region of the common runtime. While the official follow-up report describes what happened and what we've done to avoid this in the future, we found the root cause to be puzzling enough to require a deep dive into Linux networking.

The following is a write-up by SRE member Lex Neva (what's SRE?) and routing engineer Fred Hebert (now a Heroku alumnus) of an interesting Linux networking "gotcha" they discovered while working on incident 930.

The Incident

Our monitoring systems paged…

How We Found and Fixed a Filesystem Corruption Bug

Engineering
Last Updated: February 15, 2017
Owen Jacobson

As part of our commitment to security and support, we periodically upgrade the stack image, so that we can install updated package versions, address security vulnerabilities, and add new packages to the stack. Recently we had an incident during which some applications running on the Cedar-14 stack image experienced higher than normal rates of segmentation faults and other “hard” crashes for about five hours. Our engineers tracked down the cause of the error to corrupted dyno filesystems caused by a failed stack upgrade. The sequence of events leading up to this failure, and the technical details of the…

Pulling the Thread on Kafka’s Compacted Topics

Engineering
Last Updated: January 11, 2017
Tom Crayford

At Heroku, we're always working towards improving operational stability with the services we offer. As we recently launched Apache Kafka on Heroku, we've been increasingly focused on hardening Apache Kafka, as well as our automation around it. This particular improvement in stability concerns Kafka's compacted topics, which we haven't talked about before. Compacted topics are a powerful and important feature of Kafka, and as of 0.9, provide the capabilities supporting a number of important features.

Meet the Bug

The bug we had been seeing is that an internal thread that's used by Kafka to…

How We Sped up SNI TLS Handshakes by 5x

Engineering
Last Updated: December 22, 2016
Fred Hebert

During the development of the recently released Heroku SSL feature, a lot of work was carried out to stabilize the system and improve its speed. In this post, I will explain how we managed to improve the speed of our TLS handshakes by 4-5x.

The initial reports of speed issues were sent our way by beta customers who were unhappy about the low level of performance. This was understandable since, after all, we were not greenfielding a solution for which nothing existed, but actively trying to provide an alternative to the SSL Endpoint add-on, which is provided by…

Handling Very Large Tables in Postgres Using Partitioning

Engineering
Last Updated: June 10, 2024
Rimas Silkaitis

One of the interesting patterns that we’ve seen, as a result of managing one of the largest fleets of Postgres databases, is one or two tables growing at a rate that’s much larger and faster than the rest of the tables in the database. In terms of absolute numbers, a table that grows sufficiently large is on the order of hundreds of gigabytes to terabytes in size. Typically, the data in this table tracks events in an application or is analogous to an application log. Having a table of this size isn’t a problem in and of itself,…

Apache Kafka 0.10

Engineering
Last Updated: June 03, 2024
Tom Crayford

At Heroku, we're always striving to provide the best operational experience with the services we offer. As we’ve recently launched Heroku Kafka, we were excited to help out with testing of the latest release of Apache Kafka, version 0.10, which landed earlier this week. While testing Kafka 0.10, we uncovered what seemed like a 33% throughput drop relative to the prior release. As others have noted, “it’s slow” is the hardest problem you’ll ever debug, and debugging this turned out to be very tricky indeed. We had to dig deep into Kafka’s configuration and operation to uncover what…

Heroku Metrics

Engineering
Last Updated: May 26, 2016
Andrew Gwozdziewycz

For almost two years now, the Heroku Dashboard has provided a metrics page to display information about memory usage and CPU load for all of the dynos running an application. Additionally, we've been providing aggregate error metrics, as well as metrics from the Heroku router about incoming requests: average and P95 response time, counts by status, etc.

Almost all of this information is being slurped out of an application's log stream via the Log Runtime Metrics labs feature. For applications that don't have this flag enabled, which is most applications on the platform, the relevant logs are still…

Simulate Third-Party Downtime

Engineering
Last Updated: March 01, 2016
Damien Mathieu

I spend most of my time at Heroku working on our support tools and services; help.heroku.com is one such example. Heroku's help application depends on the Platform API to, amongst other things, authenticate users, authorize or deny access, and fetch user data.

So, what happens to tools and services like help.heroku.com during a platform incident? They must remain available to both agents and customers—regardless of the status of the Platform API. There is simply no substitute for communication during an outage.

To ensure this is the case, we use api-maintenance-sim, an app we recently open-sourced, to…

Speeding up Sprockets

Engineering
Last Updated: February 22, 2016
Richard Schneeman

The asset pipeline is the slowest part of deploying a Rails app. How slow? On average, it's over 20x slower than installing dependencies via $ bundle install. Why so slow? In this article, we're going to take a look at some of the reasons the asset pipeline is slow and how we were able to get a 12x performance improvement on some apps with Sprockets version 3.3+.

The Rails asset pipeline uses the sprockets library to take your raw assets such as javascript or Sass files and pre-build minified, compressed assets that are ready to be served…
