Responsibilities:

  • Strategic Leadership: Develop and execute a company-wide reliability engineering strategy, including the management of Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
  • System Resiliency: Implement chaos engineering practices to identify and strengthen system weaknesses. Design and maintain robust disaster recovery and business continuity plans.
  • Observability: Lead the development of our observability platform to provide comprehensive insight into system performance through logs, tracing, and alerting.
  • Automation: Champion the adoption of infrastructure-as-code and leverage automation to streamline operations and eliminate manual tasks.
  • Incident Management: Establish and refine incident response processes, including a blameless post-mortem culture, to learn from every event and prevent recurrence.
  • Collaboration: Partner closely with product, development, and operations teams to embed reliability throughout the entire software development lifecycle.

Qualifications:

  • Proven experience in reliability engineering or a similar role: 8+ years of total experience, including at least 3 years in a leadership or management position.
  • Strong understanding of system architecture, design principles, and cloud platforms (AWS, Azure, or GCP) with a proven track record of working on large-scale distributed systems.
  • Proficiency in at least one programming language (Python, Go, or Java) for automation and tool development.
  • Familiarity with monitoring tools (Prometheus, Grafana, Datadog) and incident management systems.
  • Effective communication skills to convey complex technical concepts to non-technical stakeholders.
  • Adaptability to learn and implement new technologies and tools as required.