Who are you?
We’re looking for someone who’s interested in the oversight of complex distributed systems: how they work, how they can work better, how we even know if they’re working at all. Since the hard problems in computing are the human problems, we’re also looking for someone who’s into improving inter-team collaboration, from a technical and personal point of view. You thrive on finding inefficiencies and influencing others to make performance improvements, and you have moved on from the day-to-day coding and development of applications.
This is a good role for a generalist with one or more areas of focus or special interest. There are lots of career paths that might lead you here! You could come from product management, technical program management, operations, development, technical writing, sales engineering, developer relations, or others we haven’t thought of.
- Join the on-call rotation as Incident Commander and serve as an after-hours point of escalation within the Leadership pager rotation (which you would help define). We’ll train you in the incident command system, but you should have some experience responding to critical production outages.
- Interest in complex distributed systems and familiarity with how the internet and web applications work. You don’t have to have built a data center or run a large cloud service, but you do need to be familiar with the OSI model or equivalents and be able to talk about ways to make a system more resistant to failure.
- Willing to work on a distributed (currently all-remote) team spanning multiple time zones. None of us lives in the same place or currently works out of the San Francisco headquarters, and all of us are experienced remote workers.
About Heroku Production Engineering Governance
Heroku, a subsidiary of Salesforce, operates the world’s largest Platform As A Service (PaaS), continuously delivering millions of apps with a high volume of deploys per day. Heroku’s vision is for developers to focus on their applications and leave operations to us. Our work environment is collaborative, flexible and fun. We’re focused on technical and operational excellence and customer success.
The Production Engineering Governance team cares about the holistic health of the Heroku platforms. Our primary focus is serving as Incident Commander on call. We own the incident response process, which means we’re responsible for making sure it’s effective and updating it to address changing needs. We consult on best practices with engineering teams throughout Salesforce. We work on how engineering work gets done - what processes engineers follow, how we know we’re following them, and how we can tell if those processes are working.
What’s this job like?
This job can be done from anywhere in the world. You can work at a Salesforce office or from home, and you can work flexible hours. We do have meetings, of course, but we schedule them based on the time zones of the folks who need to attend, and we record them and share the recordings with people who can’t be present or awake at that time.
On the job, you’ll champion the reliability of Heroku production services. That means:
- Respond to production incidents as an Incident Commander.
- Follow up on production incidents with after action reviews, contributing factor analysis, incident response analysis, and remediation plans.
- Facilitate retrospectives and analyze system failures, design, and operation practices for platform components operating on AWS.
- With the rest of the team, you’ll own the incident response process, review it for effectiveness, and update it for changing conditions.
- Oversee the process for communicating maintenance with customer impact, including setting maintenance blackout periods and improving change management processes.
- Make it easy to do the right thing in high-stakes situations with guidelines, processes, and policies and oversee their organizational adoption.
- Collaborate with multiple engineering teams and engineering leadership to create organizational change in support of improving operational excellence.
- Ensure that Heroku engineering is continuously raising our standard of operational excellence by giving guidance on testing, eliminating, automating, and planning for known and anticipated failure scenarios.
- Connect and collaborate with geographically distributed teams and people of various backgrounds.
- Partner directly with our most important vendors to ensure the reliability of our service.
If you’re interested in these things, you might work on:
- Designing and conducting incident command system training for new hires, both incident commanders and responders.
- Operational finance: understanding and tracking the cost of operating the Heroku platform and planning responses to changes in vendor pricing or offerings.
How do I know if I should apply?
If you are interested in or have experience with any of the following topics, you should apply!
- Design of engineering and collaborative processes for actual people, not ideal people who never make mistakes (see Kathy Sierra’s discussion of humans vs humanoids)
- Large scale service-oriented infrastructure and the design of scalable, highly available systems in the real world
- Product management for software, platforms, or infrastructure as a service
- Performance characteristics of distributed systems
- Cloud services like Amazon Web Services
- Heroku, RESTful web services, Linux, Ruby, and Rails
- Virtualization and containerization (Xen, LXC, cgroups, Docker, Kubernetes)
The role requires periodic travel to the United States, approximately four times per year.
Apply for Performance Engineer, EMEA. You will be taken to the listing on Salesforce’s career site.