Sr System / Shanghai

The Role
Compute is the most important driver in accelerating the maturation of AI-enabled products. Today, Tesla is at the forefront of creating meaningful real-world products using AI. We design, build, and run large-scale GPU clusters that enable our teams to build better products faster. We are an extremely small team, and the work of every member carries an immense amount of weight. Working with the team, you will build performance-testing tools, health-check tools, tools for better metric collection, and other fun projects.

Responsibilities
You'll be working in a cross-functional and highly versatile team that designs, implements, and maintains HPC technical stacks.

Leverage and improve upon existing cluster management solutions to ensure rapid deployment and scalability.

Ensure the reliability of the existing systems to guarantee uptime and availability of core foundational services.

Influence architectural decisions with a focus on security, scalability, and high performance. Work with engineering teams to understand which metrics are useful to collect, and implement such monitoring and alerting with existing monitoring solutions.

Improve root cause analysis and corrective action for problems large and small: identify patterns and design task automation.

Help develop automated tools that collect information users can directly use to perform root cause analysis of issues in their job submissions.

Organize and document implemented solutions for long-term information retention in our internal ticketing and documentation system.

Take part in a 24x7 on-call rotation.

Required Qualifications
Experience with cluster deployment and operations on Linux distributions (Ubuntu/RHEL).

Advanced experience with configuration management systems such as Ansible.

Demonstrable knowledge of TCP/IP, RoCE, Linux internals, filesystems, disk/storage technologies, and storage protocols.

Experience designing and deploying mid- to large-scale InfiniBand networks.

Proficiency in a high-level programming or scripting language (Python, Go, Bash).

Experience with containers and container orchestration (Docker, Kubernetes).

Familiarity with Prometheus, Grafana, or Splunk for monitoring and alerting.

Experience administering HPC workload managers (SLURM, BCM, etc.).

Experience with high-throughput, low-latency networks and GPU-based computing systems.

Fluency in reading, writing, and speaking English.

Preferred Qualifications
Previous experience in a large-scale data center running HPC workloads.

Experience with parallel filesystems.

Bachelor's degree in Computer Science, Electrical Engineering, Mathematics, or a related field, plus 5+ years of equivalent experience, or evidence of exceptional ability related to the position.

This job application may involve an interview with an interviewer outside of Tesla China. By completing your application, you agree that Tesla may provide your application information to overseas interviewers at Tesla, Inc. for recruitment purposes. For more details and contact information, please see here.