Location:
Remote, with willingness to travel overseas

Core Responsibilities

Strategic Design & Architecture Planning

Lead the end-to-end architecture design of overseas AI compute clusters, covering compute, network, storage, and liquid-cooling systems.

Deeply understand clients' AI workload requirements and translate them into advanced, reliable, and scalable technical solutions.

End-to-End Construction & Delivery Management

Take full responsibility for the entire lifecycle of overseas, multi-thousand-GPU AI cluster deployments, from planning and equipment procurement through acceptance testing, installation, cabling, and commissioning to final go-live.

Lead and continuously optimize cluster deployment processes to ensure delivery within strict timelines and budget constraints.

Coordinate with data center facility teams, hardware vendors, and liquid-cooling suppliers to ensure seamless integration across all stages.

Operations Management & Incident Response

Build and lead a high-performing, multicultural operations team overseas; establish a 24/7 operations framework, standard operating procedures (SOPs), and emergency response protocols.

Develop comprehensive monitoring, alerting, logging, and performance analysis platforms for full observability and health management of the clusters.

Serve as the senior escalation point for complex, high-impact technical issues; lead root cause analysis and drive systematic improvements.

Client & Technical Interface

Act as the technical authority interfacing directly with clients' engineering teams, delivering technical presentations, POC support, and deep-dive technical discussions.

Ensure cluster services meet or exceed client SLA expectations, strengthening customer satisfaction and long-term partnerships.

Operations Efficiency & Cost Optimization

Continuously improve the operational efficiency of clusters, focusing on key metrics such as power usage effectiveness (PUE), water usage effectiveness (WUE), and compute utilization.

Manage the operations department's budget; pursue cost optimization opportunities while maintaining service excellence.

Qualifications

Education & Experience:

Bachelor's degree or above in Computer Science, Electrical Engineering, or a related field.

Minimum 10 years of experience in large-scale data center or HPC/AI cluster operations and management.

Overseas Project Experience:

Proven track record in the successful delivery of advanced AI compute clusters or hyperscale data centers abroad.

Deep understanding of overseas project operational models, compliance requirements, and cultural differences.

Architecture Expertise:

Proficient in AI cluster architectures (e.g., NVIDIA DGX/SuperPOD, GPU-as-a-Service).

Strong understanding of InfiniBand/RoCE networking and distributed storage systems.

Liquid Cooling Technology:

Hands-on experience deploying or operating immersion or cold-plate liquid-cooled clusters.

Familiarity with the principles, operational challenges, and risks of these liquid-cooling technologies.

Systems Operations:

Expertise in Linux environments, cluster schedulers (e.g., Slurm, Kubernetes), monitoring tools (e.g., Prometheus, Grafana), and automation and scripting tools (e.g., Ansible, Python).

Leadership:

Minimum 5 years of management experience leading technical teams.

Ability to build, lead, and motivate high-performing engineering teams in multicultural environments.

Customer Orientation:

Excellent communication and presentation skills for effective technical discussions with internal and external stakeholders.

Language Proficiency:

Fluent English communication skills (both written and spoken) are preferred.

Director of AI Cluster Operation and Maintenance