Responsibilities:
- Strategic Leadership: Develop and execute a company-wide reliability engineering strategy, including the management of Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- System Resiliency: Implement chaos engineering practices to identify and remediate system weaknesses. Design and maintain robust disaster recovery and business continuity plans.
- Observability: Lead the development of our observability platform, spanning logs, tracing, and alerting, to provide comprehensive insight into system performance.
- Automation: Champion the adoption of infrastructure-as-code and leverage automation to streamline operations and eliminate manual tasks.
- Incident Management: Establish and refine incident response processes, including a blameless post-mortem culture, to learn from every event and prevent recurrence.
- Collaboration: Partner closely with product, development, and operations teams to embed reliability throughout the entire software development lifecycle.
Qualifications:
- Proven experience in reliability engineering or a similar role, with 8+ years of total experience and at least 3 years in a leadership or management position.
- Strong understanding of system architecture, design principles, and cloud platforms (AWS, Azure, or GCP) with a proven track record of working on large-scale distributed systems.
- Proficiency in programming or scripting languages (Python, Go, or Java) for automation and tool development.
- Familiarity with monitoring tools (Prometheus, Grafana, Datadog) and incident management systems.
- Effective communication skills to convey complex technical concepts to non-technical stakeholders.
- Adaptability to learn and implement new technologies and tools as required.