The Role
Tesla is seeking a full-time IT AI/ML DevOps Engineer to join the Information Technology Department at Tesla Gigafactory Shanghai, with a strategic focus on building and scaling our next-generation AIOps and MLOps platform. As AI becomes increasingly central to our internal systems - particularly in powering the GenAI platform - we need an expert who can bridge the gap between AI research and production-grade infrastructure. This role will be responsible for designing and operating a unified, scalable AI model lifecycle platform that supports end-to-end workflows from training to deployment, with a strong emphasis on high-performance LLM inference, fine-tuning automation, RAG-as-a-Service, and hybrid inference gateway architecture.
This hire will directly enable faster, more reliable delivery of AI capabilities across Tesla's enterprise systems in China.
Responsibilities
- Design, build, and maintain a scalable MLOps platform to streamline the full lifecycle of AI models - from training and versioning to deployment and monitoring.
- Develop and optimize LLM inference pipelines using high-performance frameworks such as vLLM, TensorRT-LLM, or TGI (Text Generation Inference) on large-scale GPU clusters (a minimal vLLM sketch follows this list).
- Build a hybrid inference gateway platform that seamlessly integrates on-premise GPU models with cloud-based LLM APIs, enabling intelligent routing, load balancing, and cost-performance optimization (see the routing sketch below).
- Implement LLM fine-tuning pipelines (e.g., LoRA, QLoRA) with automated data preprocessing, distributed training orchestration, and checkpoint management (see the LoRA sketch below).
- Operationalize Retrieval-Augmented Generation (RAG) as a Service, including integration and management of vector databases such as Pinecone, Milvus, or Weaviate (see the retrieval sketch below).
- Establish comprehensive observability and monitoring solutions for AI systems using Prometheus, Grafana, OpenTelemetry, and custom metrics dashboards (see the metrics sketch below).
- Collaborate closely with AI scientists and application engineers to optimize models for production (e.g., quantization, pruning, distillation) and ensure smooth model-to-platform handoff.
- Define KPIs and SLAs for model serving performance, latency, throughput, and reliability to measure business impact.
- Drive CI/CD automation for AI workloads using GitLab CI, Jenkins, ArgoCD, or similar tools, ensuring reproducibility and auditability.
- Champion DevOps best practices within AI teams and mentor engineers on scalable, secure, and maintainable AI infrastructure patterns.
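To ground the inference-pipeline responsibility, here is a minimal sketch of offline batch inference with vLLM. The model name and sampling settings are illustrative placeholders rather than a prescribed configuration; a production deployment would more likely run vLLM's OpenAI-compatible server behind the gateway.

```python
# Minimal offline inference with vLLM; the model and parameters are
# illustrative placeholders, and a CUDA-capable GPU is assumed.
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across GPUs when one card
# cannot hold the weights; 1 keeps the sketch single-GPU.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# vLLM batches prompts internally (continuous batching), which is
# where most of its throughput advantage comes from.
outputs = llm.generate(["Summarize the deployment runbook."], params)
for out in outputs:
    print(out.outputs[0].text)
```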
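The hybrid gateway's core job is a per-request routing decision. The sketch below shows one plausible policy; the backend names, limits, and cost figures are hypothetical stand-ins for real health checks, queue telemetry, and token accounting.

```python
# Hypothetical routing policy for a hybrid inference gateway.
# Backends, limits, and the saturation threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    max_context: int           # largest prompt accepted, in tokens
    cost_per_1k_tokens: float  # 0.0 for the already-paid-for GPU pool
    healthy: bool = True

ON_PREM = Backend("onprem-vllm", max_context=8192, cost_per_1k_tokens=0.0)
CLOUD = Backend("cloud-llm-api", max_context=128_000, cost_per_1k_tokens=0.5)

def route(prompt_tokens: int, onprem_queue_depth: int) -> Backend:
    """Prefer the on-prem GPU pool; spill to the cloud API when the
    prompt is too long, the pool is saturated, or the pool is down."""
    if (ON_PREM.healthy
            and prompt_tokens <= ON_PREM.max_context
            and onprem_queue_depth < 32):  # saturation threshold (assumed)
        return ON_PREM
    return CLOUD

# A long-context request overflows to the cloud backend.
print(route(prompt_tokens=32_000, onprem_queue_depth=4).name)  # cloud-llm-api
```

A real gateway would layer retries, per-tenant quotas, and cost/latency feedback loops on top of this decision function.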
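For the fine-tuning pipelines, a parameter-efficient core with Hugging Face PEFT might look like the following. The base model and LoRA hyperparameters are placeholders; a real pipeline would wrap this in distributed training orchestration and checkpoint management.

```python
# Attaching LoRA adapters with Hugging Face PEFT; the model name and
# hyperparameters are placeholders, not a recommended recipe.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_cfg = LoraConfig(
    r=16,                    # adapter rank: capacity vs. memory trade-off
    lora_alpha=32,           # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base
# From here the model drops into a standard transformers Trainer loop,
# with only the small adapter weights saved at each checkpoint.
```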
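The retrieval half of RAG-as-a-Service reduces to embedding, indexing, and nearest-neighbor search. Below is a hedged sketch using Milvus Lite via pymilvus plus a sentence-transformers embedder; the collection name, documents, and models are illustrative only, and the pymilvus and milvus-lite packages are assumed to be installed.

```python
# Embed, index, and search document chunks with Milvus Lite.
# Collection name, documents, and models are illustrative.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
client = MilvusClient("rag_demo.db")  # local Milvus Lite database file

client.create_collection(collection_name="docs", dimension=384)

docs = ["GPU nodes are drained before kernel upgrades.",
        "Model rollbacks reuse the previous registry tag."]
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": embedder.encode(d).tolist(), "text": d}
          for i, d in enumerate(docs)],
)

# Retrieve the chunk most relevant to the question; a RAG service
# would splice these hits into the LLM prompt as grounding context.
hits = client.search(
    collection_name="docs",
    data=[embedder.encode("How do we roll back a model?").tolist()],
    limit=1,
    output_fields=["text"],
)
print(hits[0][0]["entity"]["text"])
```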
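Custom metrics underpin the observability responsibility. A minimal sketch with the official prometheus_client library follows; the metric names, labels, and port are assumptions for illustration.

```python
# Exposing model-serving metrics for Prometheus to scrape.
# Metric names, labels, and the port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "Inference requests served",
                   ["model", "backend"])
LATENCY = Histogram("llm_request_latency_seconds",
                    "End-to-end inference latency", ["model"])

def serve_one(model: str = "demo-llm") -> None:
    with LATENCY.labels(model).time():          # records request duration
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference
    REQUESTS.labels(model, "onprem").inc()

if __name__ == "__main__":
    start_http_server(9100)  # metrics at http://localhost:9100/metrics
    while True:
        serve_one()
```

Grafana dashboards and SLO alerts would then be built over these series, for example the p99 of llm_request_latency_seconds.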
Requirements
- Educational Background: Bachelor's degree or above in Computer Science, Artificial Intelligence, Software Engineering, or related disciplines.
- Work Experience:
- Minimum 5 years of DevOps/SRE experience; experience in MLOps and AI infrastructure (such as model serving, pipeline automation, and monitoring) and in deploying large-scale models for inference services is highly preferred.
- Proven track record in deploying and operating LLMs or deep learning models in production environments.
- Technical Competencies:
- Strong proficiency in Python, with experience writing scalable data pipelines, model serving scripts, and automation tools.
- Deep expertise in Kubernetes (K8s) and Docker, including managing GPU-accelerated workloads via Kubeflow, NVIDIA GPU Operator, or custom operators.
- Hands-on experience with MLOps platforms and tools such as MLflow, Kubeflow, BentoML, Seldon Core, or Vertex AI.
- Experience building high-throughput, low-latency model serving systems using vLLM, TGI, Triton Inference Server, or similar technologies.
- Familiarity with vector databases (e.g., Pinecone, Milvus, FAISS, Weaviate) and their integration into RAG pipelines.
- Solid understanding of CI/CD for machine learning, including automated testing, model registry, and canary/blue-green deployments.
- Experience with infrastructure-as-code (IaC) tools such as Terraform, Ansible, or Pulumi for reproducible environment provisioning.
- Knowledge of model optimization techniques such as quantization (INT8, FP8, GGUF), pruning, and distillation to improve inference efficiency (a quantization sketch follows this section).
- Understanding of hybrid cloud architectures, capable of designing systems that integrate on-premise AI infrastructure with public cloud LLM services (e.g., Alibaba Cloud, AWS, Azure).
- Familiarity with service mesh, API gateways (e.g., Kong, Istio), and authentication mechanisms for secure model access.
- Soft Skills:
- Highly self-motivated with the ability to drive complex projects independently in a fast-paced environment.
- Excellent communication and cross-functional collaboration skills — able to work effectively with AI researchers, backend engineers, and business stakeholders.
- Strong analytical mindset with a passion for solving system-level challenges in scalability, reliability, and performance.
- Proactive problem solver with a focus on operational excellence and continuous improvement.
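As a concrete example of the model-optimization requirement above, the sketch below loads a model with 8-bit weight quantization via transformers and bitsandbytes. The model name is a placeholder; a CUDA GPU and the bitsandbytes package are assumed.

```python
# Loading a causal LM with INT8 weights via bitsandbytes; the model
# name is a placeholder and a CUDA GPU is assumed.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(load_in_8bit=True)  # quantize weights at load

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_cfg,
    device_map="auto",  # place layers across the available GPUs
)
# Roughly halves weight memory versus FP16, usually with minor quality
# loss; serving stacks like vLLM and TensorRT-LLM offer similar
# quantized paths natively.
```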
Position Overview:
Tesla's Information Technology department (location: Tesla Gigafactory Shanghai) is hiring a full-time IT AI/ML DevOps Engineer focused on building and scaling our next-generation AIOps and MLOps platform. As AI is applied ever more deeply in core enterprise systems (especially the GenAI platform), we urgently need an engineering expert who can remove the barriers between AI R&D and production deployment. The role owns the end-to-end MLOps stack, from model training, version management, and automated deployment through to high-performance inference services, and will lead the construction of a hybrid inference gateway platform that combines on-premise GPUs with cloud LLM APIs to deliver low-latency, high-throughput enterprise AI services. As a key driver of the GenAI platform and future AI services, this hire will significantly improve the delivery efficiency and stability of AI capabilities.
Responsibilities:
- Design, build, and maintain a scalable MLOps platform covering the full AI model lifecycle: training, version control, deployment, and monitoring.
- Develop and optimize large language model (LLM) inference pipelines on large-scale GPU clusters using frameworks such as vLLM, TensorRT-LLM, and TGI.
- Build a hybrid inference gateway platform that combines on-premise GPU models with cloud LLM APIs, providing intelligent routing, load balancing, and cost-performance optimization.
- Build automated LLM fine-tuning pipelines supporting parameter-efficient methods such as LoRA and QLoRA, covering data preprocessing, distributed training, and checkpoint management.
- Deliver Retrieval-Augmented Generation as a service (RAG-as-a-Service), integrating and operating mainstream vector databases such as Pinecone, Milvus, and Weaviate.
- Safeguard the observability and stability of AI systems through Prometheus, Grafana, OpenTelemetry, and in-house monitoring solutions.
- Collaborate with AI scientists and application engineers on model optimization (quantization, pruning, distillation) to improve inference efficiency and resource utilization.
- Support the high-performance model serving needs of the GenAI CN platform, ensuring low-latency, high-concurrency service.
- Define key performance indicators (KPIs) and service level agreements (SLAs) for AI model serving to quantify business value and system performance.
- Implement CI/CD automation for AI workflows using tools such as GitLab CI, Jenkins, and ArgoCD, ensuring reproducibility and auditability.
- Promote DevOps best practices within AI teams and coach engineers in scalable, secure, and maintainable AI infrastructure design patterns.
Requirements:
- Education: Bachelor's degree or above in Computer Science, Artificial Intelligence, Software Engineering, or a related field.
- Work Experience:
- At least 5 years of DevOps/SRE experience, with experience in MLOps, AI infrastructure, or large-scale model inference services.
- Hands-on project experience deploying and operating large language models (LLMs) or deep learning models in production.
- Technical Skills:
- Proficient in Python; able to independently build scalable data-processing pipelines, model serving scripts, and automation tools.
- Deep command of Kubernetes and Docker, with practical experience managing GPU-accelerated workloads via Kubeflow, NVIDIA GPU Operator, or custom controllers.
- Familiar with mainstream MLOps platforms and toolchains such as MLflow, Kubeflow, BentoML, and Seldon Core.
- Real-world experience building high-concurrency, low-latency model inference systems with vLLM, TGI, Triton Inference Server, or similar.
- Familiar with vector databases (e.g., Pinecone, Milvus, FAISS, Weaviate) and their production integration into RAG systems.
- Command of CI/CD for machine learning, including automated model testing, model registries, canary releases, and version rollback.
- Proficient with infrastructure-as-code (IaC) tools such as Terraform, Ansible, or Pulumi for standardized environments and rapid delivery.
- Understands model compression and acceleration techniques such as quantization (INT8/FP8/GGUF), pruning, and distillation, and can apply them in real scenarios.
- Able to design hybrid cloud architectures that integrate on-premise GPU clusters with public cloud LLM services such as Alibaba Cloud, AWS, and Azure.
- Familiar with service meshes, API gateways (e.g., Kong, Istio), and authentication mechanisms to keep model services secure and manageable.
- Soft Skills:
- Strong self-motivation; able to independently drive complex projects to completion in a fast-paced environment.
- Excellent communication and cross-team collaboration; works effectively with AI scientists, backend engineers, and business stakeholders.
- Systems-level thinking with a focus on scalability, reliability, and performance optimization.
- Proactively identifies problems and drives continuous improvement, pursuing operational excellence and platform stability.