Site Reliability Engineer (India)


Who are you?

We’re looking for people who are interested in complex distributed systems: how they work, how they can work better, and how we even know if they’re working at all. We need someone who's spent time working as a developer (writing code with a team to fix operational issues or build features), but who has also spent time on operational concerns (for instance, investigating production incidents, creating or updating monitoring and alerting plans for production systems, or investigating performance issues).

You don't need to have “SRE” in your job title in order to have appropriate skills for this position. You might come from a DevOps environment or have been one of a handful of engineers in a shop so small that everyone does a little of everything. The important thing is that you have experience in both writing code and maintaining systems, and that you're willing to do both of those things in the future. If you're stronger in one area than the other, that's okay.

Requirements

  • You need to have worked with complex distributed systems and be familiar with how the internet and web applications work. You don’t have to have built a datacenter or run a large cloud service at a major provider, but you do need to have used cloud services. Running LAN infrastructure or doing client-side system administration is not enough for this role.
  • Willing to join a 24/7 on-call rotation as Incident Commander (we’ll train you; no experience required).
  • Willing to work on a distributed (currently all-remote) team spanning multiple time zones. None of us currently lives in the same place or works out of the San Francisco headquarters; all of us are experienced remote workers.
  • Comfortable reading and writing code with a team in at least one of Ruby, Go, or Python. It's fine if you know more than one of those languages and/or other languages, but those three are the most important languages at Heroku. We need people who are comfortable with them and open to switching between them.
  • Interested in doing the cross-organizational interpersonal work of breaking down cultural silos and increasing efficiency or velocity.

About our SRE team

We partner with Engineering teams across the Salesforce Platform, continuously delivering millions of apps with a high volume of deploys per day. The overarching vision of our platform is for developers to focus on their applications and leave operations to us.

Our Site Reliability Engineering team’s current focus involves integrating across the Salesforce Platform and serving as a liaison between Engineering, Product, and/or Marketing teams. The overall goal is to improve the stability and reliability of our services and components by reducing toil and raising maturity standards in a meaningful way. We aim to achieve a healthy, sustainable infrastructure that all Salesforce customers can trust.

What's this job like?

You can work at a Salesforce office or work from home. Here is more information about our goals/work:

  • Currently we own and manage the full Incident Management System lifecycle within our platform
  • All SREs serve as Incident Commanders on a weekly 24/7 pager rotation
  • No two days are the same: we work in a fast-paced, interrupt-driven part of the organization
  • We are working to mature our SRE Engagement work, which supports the overarching mission to reduce toil within identified components and services
  • The team will potentially be on call for multiple production services. This includes:
    • Responding to pages generated by automated monitoring and alerting
    • Responding to pages created manually by other engineers and support personnel
    • Joining an incident response team as a Subject Matter Expert and working with other SMEs and an Incident Commander to resolve the issue (we'll train you for this)

How do I know if I should apply?

If you have experience with any of the following topics, you should apply!

  • Incident Management and Retrospective Facilitation
  • Containers and container management technologies such as LXC, Docker, and Kubernetes
  • AWS services like EC2, ELB, DynamoDB, and S3 (or their Azure or GCP equivalents; OpenStack experience is fine too)
  • Databases and big data stores, especially Postgres or Kafka
  • REST APIs
  • Load balancing technologies, including L4 or L7 routing and CDNs
  • Monitoring, instrumentation, or observability
  • Standard parts of a web app's stack, such as TCP/IP, DNS, HTTP, etc.
  • Cloud computing patterns (and how they're different from using hardware)
  • Infrastructure as code (Terraform, Chef, Puppet, Ansible, CloudFormation, etc.)

Apply for Site Reliability Engineer (India). You will be taken to the listing on Salesforce’s career site.