The Role
Tesla is seeking a full-time IT AI/ML DevOps Engineer to join the Information Technology Department at Tesla Gigafactory Shanghai, with a strategic focus on building and scaling our next-generation AIOps and MLOps platform. As AI becomes increasingly central to our internal systems - particularly in powering the GenAI platform - we need an expert who can bridge the gap between AI research and production-grade infrastructure. This role will be responsible for designing and operating a unified, scalable AI model lifecycle platform that supports end-to-end workflows from training to deployment, with a strong emphasis on high-performance LLM inference, fine-tuning automation, RAG-as-a-Service, and hybrid inference gateway architecture.
This hire will directly enable faster, more reliable delivery of AI capabilities across Tesla's enterprise systems in China.
Responsibilities
- Design, build, and maintain a scalable MLOps platform to streamline the full lifecycle of AI models - from training and versioning to deployment and monitoring.
- Develop and optimize LLM inference pipelines using high-performance frameworks such as vLLM, TensorRT-LLM, or TGI (Text Generation Inference) on large-scale GPU clusters (a minimal vLLM sketch follows this list).
- Build a hybrid inference gateway platform that seamlessly integrates on-premise GPU models with cloud-based LLM APIs, enabling intelligent routing, load balancing, and cost-performance optimization (see the routing sketch below).
- Implement LLM fine-tuning pipelines (e.g., LoRA, QLoRA) with automated data preprocessing, distributed training orchestration, and checkpoint management (see the LoRA sketch below).
- Operationalize Retrieval-Augmented Generation (RAG) as a Service, including integration and management of vector databases such as Pinecone, Milvus, or Weaviate (see the retrieval sketch below).
- Establish comprehensive observability and monitoring solutions for AI systems using Prometheus, Grafana, OpenTelemetry, and custom metrics dashboards (see the metrics sketch below).
- Collaborate closely with AI scientists and application engineers to optimize models for production (e.g., quantization, pruning, distillation) and ensure smooth model-to-platform handoff.
- Define KPIs and SLAs for model serving performance, latency, throughput, and reliability to measure business impact.
- Drive CI/CD automation for AI workloads using GitLab CI, Jenkins, ArgoCD, or similar tools, ensuring reproducibility and auditability.
- Champion DevOps best practices within AI teams and mentor engineers on scalable, secure, and maintainable AI infrastructure patterns.
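To ground the inference-pipeline responsibility, here is a minimal sketch of offline batch inference with vLLM. The model name and sampling settings are illustrative placeholders rather than a prescribed configuration; a production deployment would more likely run vLLM's OpenAI-compatible server behind the gateway.

```python
# Minimal offline inference with vLLM; the model and parameters are
# illustrative placeholders, and a CUDA-capable GPU is assumed.
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across GPUs when one card
# cannot hold the weights; 1 keeps the sketch single-GPU.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# vLLM batches prompts internally (continuous batching), which is
# where most of its throughput advantage comes from.
outputs = llm.generate(["Summarize the deployment runbook."], params)
for out in outputs:
    print(out.outputs[0].text)
```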
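The hybrid gateway's core job is a per-request routing decision. The sketch below shows one plausible policy; the backend names, limits, and cost figures are hypothetical stand-ins for real health checks, queue telemetry, and token accounting.

```python
# Hypothetical routing policy for a hybrid inference gateway.
# Backends, limits, and the saturation threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    max_context: int           # largest prompt accepted, in tokens
    cost_per_1k_tokens: float  # 0.0 for the already-paid-for GPU pool
    healthy: bool = True

ON_PREM = Backend("onprem-vllm", max_context=8192, cost_per_1k_tokens=0.0)
CLOUD = Backend("cloud-llm-api", max_context=128_000, cost_per_1k_tokens=0.5)

def route(prompt_tokens: int, onprem_queue_depth: int) -> Backend:
    """Prefer the on-prem GPU pool; spill to the cloud API when the
    prompt is too long, the pool is saturated, or the pool is down."""
    if (ON_PREM.healthy
            and prompt_tokens <= ON_PREM.max_context
            and onprem_queue_depth < 32):  # saturation threshold (assumed)
        return ON_PREM
    return CLOUD

# A long-context request overflows to the cloud backend.
print(route(prompt_tokens=32_000, onprem_queue_depth=4).name)  # cloud-llm-api
```

A real gateway would layer retries, per-tenant quotas, and cost/latency feedback loops on top of this decision function.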
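For the fine-tuning pipelines, a parameter-efficient core with Hugging Face PEFT might look like the following. The base model and LoRA hyperparameters are placeholders; a real pipeline would wrap this in distributed training orchestration and checkpoint management.

```python
# Attaching LoRA adapters with Hugging Face PEFT; the model name and
# hyperparameters are placeholders, not a recommended recipe.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_cfg = LoraConfig(
    r=16,                    # adapter rank: capacity vs. memory trade-off
    lora_alpha=32,           # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base
# From here the model drops into a standard transformers Trainer loop,
# with only the small adapter weights saved at each checkpoint.
```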
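The retrieval half of RAG-as-a-Service reduces to embedding, indexing, and nearest-neighbor search. Below is a hedged sketch using Milvus Lite via pymilvus plus a sentence-transformers embedder; the collection name, documents, and models are illustrative only, and the pymilvus and milvus-lite packages are assumed to be installed.

```python
# Embed, index, and search document chunks with Milvus Lite.
# Collection name, documents, and models are illustrative.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
client = MilvusClient("rag_demo.db")  # local Milvus Lite database file

client.create_collection(collection_name="docs", dimension=384)

docs = ["GPU nodes are drained before kernel upgrades.",
        "Model rollbacks reuse the previous registry tag."]
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": embedder.encode(d).tolist(), "text": d}
          for i, d in enumerate(docs)],
)

# Retrieve the chunk most relevant to the question; a RAG service
# would splice these hits into the LLM prompt as grounding context.
hits = client.search(
    collection_name="docs",
    data=[embedder.encode("How do we roll back a model?").tolist()],
    limit=1,
    output_fields=["text"],
)
print(hits[0][0]["entity"]["text"])
```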
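Custom metrics underpin the observability responsibility. A minimal sketch with the official prometheus_client library follows; the metric names, labels, and port are assumptions for illustration.

```python
# Exposing model-serving metrics for Prometheus to scrape.
# Metric names, labels, and the port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "Inference requests served",
                   ["model", "backend"])
LATENCY = Histogram("llm_request_latency_seconds",
                    "End-to-end inference latency", ["model"])

def serve_one(model: str = "demo-llm") -> None:
    with LATENCY.labels(model).time():          # records request duration
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference
    REQUESTS.labels(model, "onprem").inc()

if __name__ == "__main__":
    start_http_server(9100)  # metrics at http://localhost:9100/metrics
    while True:
        serve_one()
```

Grafana dashboards and SLO alerts would then be built over these series, for example the p99 of llm_request_latency_seconds.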
Requirements
- Educational Background: Bachelor's degree or above in Computer Science, Artificial Intelligence, Software Engineering, or related disciplines.
- Work Experience:
- Minimum 5 years of DevOps/SRE experience; experience in MLOps and AI infrastructure (such as model serving, pipeline automation, and monitoring) and in deploying large-scale models for inference services is highly preferred.
- Proven track record in deploying and operating LLMs or deep learning models in production environments.
- Technical Competencies:
- Strong proficiency in Python, with experience writing scalable data pipelines, model serving scripts, and automation tools.
- Deep expertise in Kubernetes (K8s) and Docker, including managing GPU-accelerated workloads via Kubeflow, NVIDIA GPU Operator, or custom operators.
- Hands-on experience with MLOps platforms and tools such as MLflow, Kubeflow, BentoML, Seldon Core, or Vertex AI.
- Experience building high-throughput, low-latency model serving systems using vLLM, TGI, Triton Inference Server, or similar technologies.
- Familiarity with vector databases (e.g., Pinecone, Milvus, FAISS, Weaviate) and their integration into RAG pipelines.
- Solid understanding of CI/CD for machine learning, including automated testing, model registry, and canary/blue-green deployments.
- Experience with infrastructure-as-code (IaC) tools such as Terraform, Ansible, or Pulumi for reproducible environment provisioning.
- Knowledge of model optimization techniques such as quantization (INT8, FP8, GGUF), pruning, and distillation to improve inference efficiency (a quantization sketch follows this section).
- Understanding of hybrid cloud architectures, capable of designing systems that integrate on-premise AI infrastructure with public cloud LLM services (e.g., Alibaba Cloud, AWS, Azure).
- Familiarity with service mesh, API gateways (e.g., Kong, Istio), and authentication mechanisms for secure model access.
- Soft Skills:
- Highly self-motivated with the ability to drive complex projects independently in a fast-paced environment.
- Excellent communication and cross-functional collaboration skills — able to work effectively with AI researchers, backend engineers, and business stakeholders.
- Strong analytical mindset with a passion for solving system-level challenges in scalability, reliability, and performance.
- Proactive problem solver with a focus on operational excellence and continuous improvement.
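As a concrete example of the model-optimization requirement above, the sketch below loads a model with 8-bit weight quantization via transformers and bitsandbytes. The model name is a placeholder; a CUDA GPU and the bitsandbytes package are assumed.

```python
# Loading a causal LM with INT8 weights via bitsandbytes; the model
# name is a placeholder and a CUDA GPU is assumed.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(load_in_8bit=True)  # quantize weights at load

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_cfg,
    device_map="auto",  # place layers across the available GPUs
)
# Roughly halves weight memory versus FP16, usually with minor quality
# loss; serving stacks like vLLM and TensorRT-LLM offer similar
# quantized paths natively.
```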
Position Overview:
Tesla's Information Technology department (location: Tesla Gigafactory Shanghai) is hiring a full-time IT AI/ML DevOps Engineer focused on building and scaling our next-generation AIOps and MLOps platform. As AI is applied ever more deeply in core enterprise systems (especially the GenAI platform), we urgently need an engineering expert who can remove the barriers between AI R&D and production deployment. The role owns the end-to-end MLOps stack, from model training, version management, and automated deployment through to high-performance inference services, and will lead the construction of a hybrid inference gateway platform that combines on-premise GPUs with cloud LLM APIs to deliver low-latency, high-throughput enterprise AI services. As a key driver of the GenAI platform and future AI services, this hire will significantly improve the delivery efficiency and stability of AI capabilities.
Responsibilities:
- Design, build, and maintain a scalable MLOps platform covering the full AI model lifecycle: training, version control, deployment, and monitoring.
- Develop and optimize large language model (LLM) inference pipelines on large-scale GPU clusters using frameworks such as vLLM, TensorRT-LLM, and TGI.
- Build a hybrid inference gateway platform that combines on-premise GPU models with cloud LLM APIs, providing intelligent routing, load balancing, and cost-performance optimization.
- Build automated LLM fine-tuning pipelines supporting parameter-efficient methods such as LoRA and QLoRA, covering data preprocessing, distributed training, and checkpoint management.
- Deliver Retrieval-Augmented Generation as a service (RAG-as-a-Service), integrating and operating mainstream vector databases such as Pinecone, Milvus, and Weaviate.
- Safeguard the observability and stability of AI systems through Prometheus, Grafana, OpenTelemetry, and in-house monitoring solutions.
- Collaborate with AI scientists and application engineers on model optimization (quantization, pruning, distillation) to improve inference efficiency and resource utilization.
- Support the high-performance model serving needs of the GenAI CN platform, ensuring low-latency, high-concurrency service.
- Define key performance indicators (KPIs) and service level agreements (SLAs) for AI model serving to quantify business value and system performance.
- Implement CI/CD automation for AI workflows using tools such as GitLab CI, Jenkins, and ArgoCD, ensuring reproducibility and auditability.
- Promote DevOps best practices within AI teams and coach engineers in scalable, secure, and maintainable AI infrastructure design patterns.
Requirements:
- Education: Bachelor's degree or above in Computer Science, Artificial Intelligence, Software Engineering, or a related field.
- Work Experience:
- At least 5 years of DevOps/SRE experience, with experience in MLOps, AI infrastructure, or large-scale model inference services.
- Hands-on project experience deploying and operating large language models (LLMs) or deep learning models in production.
- Technical Skills:
- Proficient in Python; able to independently build scalable data-processing pipelines, model serving scripts, and automation tools.
- Deep command of Kubernetes and Docker, with practical experience managing GPU-accelerated workloads via Kubeflow, NVIDIA GPU Operator, or custom controllers.
- Familiar with mainstream MLOps platforms and toolchains such as MLflow, Kubeflow, BentoML, and Seldon Core.
- Real-world experience building high-concurrency, low-latency model inference systems with vLLM, TGI, Triton Inference Server, or similar.
- Familiar with vector databases (e.g., Pinecone, Milvus, FAISS, Weaviate) and their production integration into RAG systems.
- Command of CI/CD for machine learning, including automated model testing, model registries, canary releases, and version rollback.
- Proficient with infrastructure-as-code (IaC) tools such as Terraform, Ansible, or Pulumi for standardized environments and rapid delivery.
- Understands model compression and acceleration techniques such as quantization (INT8/FP8/GGUF), pruning, and distillation, and can apply them in real scenarios.
- Able to design hybrid cloud architectures that integrate on-premise GPU clusters with public cloud LLM services such as Alibaba Cloud, AWS, and Azure.
- Familiar with service meshes, API gateways (e.g., Kong, Istio), and authentication mechanisms to keep model services secure and manageable.
- Soft Skills:
- Strong self-motivation; able to independently drive complex projects to completion in a fast-paced environment.
- Excellent communication and cross-team collaboration; works effectively with AI scientists, backend engineers, and business stakeholders.
- Systems-level thinking with a focus on scalability, reliability, and performance optimization.
- Proactively identifies problems and drives continuous improvement, pursuing operational excellence and platform stability.