Mission Bay Conference Center
1675 Owens St
San Francisco, CA
United States

Apr 25–28, 2017
10:00am - 6:00pm PDT

  • Heroku sponsored
  • Heroku presenter
  • PostgreSQL
  • Kafka
  • Data Services

DataEngConf is the first technical conference that bridges the gap between data scientists, data engineers and data analysts. Conference talks focus on examples of real-world architectures, data pipelines and plumbing systems, and applied, practical examples of data science.

Software Engineer Jeff Chao is presenting the following talk at this year's event:

Beyond 50,000 Partitions: How Heroku Operates and Pushes the Limits of Kafka at Scale

At Heroku, we offer Apache Kafka as a service to a large number of distinct users, each with a varying number of use cases. Our goal is to provide this service to users while removing many of the operational headaches that come with running infrastructure like this at scale.

While we run a robust service today, it took a lot of effort to get here. Early on, we uncovered a variety of failure scenarios. One such problem surfaced when a large number of concurrent admin commands, such as topic or consumer group creation, caused a cascading broker failure. Our users would then not only observe enormous latencies around these admin commands, but also see a considerable drop in throughput until our infrastructure automatically recovered. It was important for us to address this because it would only get worse as our user base grew.

Once we overcame this hurdle, we later discovered another problem: brokers would fail when a large number of users constantly exceeded their quotas. This was in part a result of our decision to use quotas to ensure our users could coexist without disrupting one another. However, it turned out to be another case of cascading broker failures, since the underlying quota implementation, based on traffic shaping, had a memory leak that caused out-of-memory errors one broker at a time.

In this talk, we will cover the stories above in greater detail and share more anecdotes of how we discovered and addressed interesting behaviors and gotchas as we scaled our infrastructure to ensure a robust and trustworthy service for our growing user base.