Site Reliability Engineer


Heroku SRE - Operations Analytics Engineer (PST)

Who are you?

We’re looking for people who are interested in complex distributed systems: how they work, how they can work better, and how we even know if they’re working at all. You are interested in analyzing the data behind incidents to help solve operational problems and streamline processes. We need someone who has spent time working as a developer (writing code with a team to fix operational issues or build features) but who has also spent time on operational concerns (for instance, investigating production incidents, creating or updating monitoring and alerting plans for production systems, or investigating performance issues), and who understands the trends and patterns that affect the operational health of systems.

You might come from a data analysis or DevOps background, or have been one of a handful of engineers in a shop so small that everyone does a little of everything. The important thing is that you have experience in both writing code and maintaining systems, and that you're willing to do both of those things in the future. If you're stronger in one area than the other, that's okay.

Requirements

  • Operational mindset: someone who continuously seeks improvements and greater efficiency over the long term, using data-driven development as the cornerstone.
  • Willing to join an on-call rotation as Incident Commander (we’ll train you; no experience required). The rotation is 1 week on the pager and 3 weeks off.
  • Willing to work on a distributed (currently all-remote) team spanning multiple time zones. None of us currently lives in the same place or works out of the San Francisco headquarters; all of us are experienced remote workers.
  • Create reports for internal teams and stakeholders. Collaborate with team members to collect and analyze data.
  • Build dashboards, graphs, and other methods to visualize data.
  • Comfortable reading and writing code with a team in at least one of Ruby, Go, or Python. It's fine if you know more than one of those languages and/or other languages. We need people who are comfortable with them, and open to switching between them.
  • Interested in doing the cross-organizational interpersonal work of breaking down cultural silos and increasing efficiency or velocity.

About our OPAL (OPerations AnaLytics) team

This is a newly rebranded team within our platform.

We partner with Engineering teams across the Salesforce Platform. Our team strives to ensure customer success through continuous service improvement, driven by incident management analysis, and promotes cross-organizational collaboration. Our success is measured in fewer repeat incidents, faster resolution time, and recorded service availability that meets or exceeds the company SLA.

What's this job like?

You can work at a Salesforce office or work from home. Here is more information about our goals/work:

  • Currently, we own and manage the full Incident Management System lifecycle within our platform
  • Serve as Incident Commanders
  • No two days are the same: we work in a fast-paced, interrupt-driven part of the organization that is always evolving. As a new team, we expect things to keep improving and evolving over time.
  • Incident reporting, analytics, data mining, and incident management are the primary expectations and deliverables for this team
  • Prepare and facilitate the weekly engineering-wide Operations Review meeting
  • Collaborate and pair with team members to build/refactor internal tools to automate and optimize internal team workflows

How do I know if I should apply?

If you have experience with any of the following topics, you should apply!

  • Incident Management and Retrospective Facilitation
  • Coding skills in languages such as SQL, Python, and/or R
  • Analytical and problem-solving skills
  • Cloud computing patterns (and how they differ from using hardware)
  • Experience with statistical software (e.g., Stata, SPSS)
  • Knowledge of data gathering, cleaning, and transforming techniques
  • Reporting and data visualization skills using software like Tableau
  • Understanding of data warehousing and ETL techniques
  • Proficiency in Microsoft Excel
  • Ability to set and meet deadlines
  • Ability to work in high-pressure situations
  • Technical writing skills
  • Excellent attention to detail
  • Strong written/verbal communication skills
  • Ability to QA and troubleshoot data
