Responsibilities:

Strategic Leadership: Develop and execute a company-wide reliability engineering strategy, including the management of Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
System Resiliency: Implement chaos engineering practices to identify and strengthen system weaknesses. Design and maintain robust disaster recovery and business continuity plans.
Observability: Lead the development of our observability platform, spanning logs, tracing, and alerting, to provide comprehensive insight into system performance.
Automation: Champion the adoption of infrastructure-as-code and leverage automation to streamline operations and eliminate manual tasks.
Incident Management: Establish and refine incident response processes, including a blameless post-mortem culture, to learn from every event and prevent recurrence.
Collaboration: Partner closely with product, development, and operations teams to embed reliability throughout the entire software development lifecycle.

Qualifications:

Proven experience in reliability engineering or a similar role: 8+ years of total experience, including at least 3 years in a leadership or management position.
Strong understanding of system architecture, design principles, and cloud platforms (AWS, Azure, or GCP) with a proven track record of working on large-scale distributed systems.
Proficiency in at least one programming language (Python, Go, or Java) for automation and tool development.
Familiarity with monitoring tools (Prometheus, Grafana, Datadog) and incident management systems.
Effective communication skills to convey complex technical concepts to non-technical stakeholders.
Ability to learn and adopt new technologies and tools as required.

Head of Reliability Engineering