Location: US Remote - We are a highly distributed team looking for candidates comfortable working remotely.
About Heroku Lifecycle
Heroku, a subsidiary of Salesforce, operates the world’s largest Platform as a Service (PaaS), continuously delivering millions of apps with a high volume of deploys per day. Heroku's vision is for developers to focus on their applications and leave operations to us.
Our team is an established development team within Heroku, responsible for the systems that support application development, build, and deployment on the Heroku platform. This includes features like continuous integration, continuous deployment, auto-scaling, application metrics, pipelines, and review apps.
We are hiring an SRE to join our team and guide us through hardening and supporting our production services. Because you will be part of the Lifecycle team, we will look to you for guidance on SRE best practices, multiplying the team's skills, and improving our operational posture. You will also align with a nascent SRE team that is currently developing its charter, and you will be an essential part of the feedback loop as that team defines itself and grows.
Who are you?
We’re looking for people who are interested in complex distributed systems: how they work, how they can work better, and how we even know if they’re working at all. We need someone who's spent time working as a developer (writing code with a team to fix operational issues or build features), but who has also spent time on operational concerns (investigating production incidents, creating or updating monitoring and alerting plans for production systems, or investigating performance issues, for instance).
You don't need to have “SRE” in your job title in order to have appropriate skills for this position. You might come from a DevOps environment or have been one of a handful of engineers in a shop so small that everyone does a little of everything. The important thing is that you have experience in both writing code and maintaining systems, and that you're willing to do both of those things in the future. If you're stronger in one area than the other, that's okay.
Be sure to read or skim the Site Reliability Engineering (https://landing.google.com/sre/books/) book, which we are modeling our team structure around.
You need to have worked with complex distributed systems and be familiar with how the internet and web applications work. You don’t have to have built a datacenter or run a large cloud service at a major provider, but you do need to have used cloud services. Running LAN infrastructure or doing client-side system administration is not enough for this role.
You are willing to work on a distributed team (the majority work from home) spanning multiple time zones. Prior remote work experience is not required, as many of us learned how to work remotely on the job.
You are comfortable reading and writing code with a team in at least one of Ruby, Go, Python, or Erlang. It's fine if you know more than one of those languages and/or other languages, but these are the four most important languages at Heroku. We need people who are comfortable with them and open to switching between them.
What's this job like?
This job is open to people anywhere in North America. You can work at a Salesforce office or work from home. Aspects of the role include:
- Actively developing Service Level Objectives (SLOs) for the platform's most critical services where none currently exist, and defining minimum standards for service health metrics
- Preparing for the SRE Entrance process, through which we hand off the operation of a production service to the SRE team. We expect your role will help define this process and handoff.
- The team will be on call for multiple production services. This includes:
  - Responding to pages generated by automated monitoring and alerting
  - Responding to pages created manually by other engineers and support personnel
  - Joining an incident response team as a Subject Matter Expert and working with other SMEs and an Incident Commander to resolve the issue (we'll train you for this)
Other likely projects include:
- Automated data and service management tooling
- Instrumenting for observability for troubleshooting
- Implementing performance improvements
- Hardening for resilience in the face of operational events and customer behavior
How do I know if I should apply?
If you have experience with any of the following topics, you should apply!
- AWS services like EC2, ELB, EKS, S3 (or their Azure or GCP equivalents; OpenStack experience is fine too)
- Databases and big data stores, especially Postgres or Kafka
- REST APIs
- Monitoring, instrumentation, or observability
- Standard parts of a web app's stack, such as TCP/IP, DNS, HTTP, etc.
- Cloud computing patterns (and how they differ from using physical hardware)
Salesforce.com and Salesforce.org are Equal Employment Opportunity and Affirmative Action Employers. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status. Headhunters and recruitment agencies may not submit resumes/CVs through this Web site or directly to managers. Salesforce.com and Salesforce.org do not accept unsolicited headhunter and agency resumes. Salesforce.com and Salesforce.org will not pay fees to any third-party agency or company that does not have a signed agreement with Salesforce.com or Salesforce.org.
Pursuant to the San Francisco Fair Chance Ordinance and the Los Angeles Fair Chance Initiative for Hiring, Salesforce will consider for employment qualified applicants with arrest and conviction records.
Apply for Sr., Lead, Principal, Site Reliability Engineer, Lifecycle, Heroku. You will be taken to the listing on Salesforce’s career site.