Software Engineer, AI and DL Kernel Libraries

NVIDIA · Shanghai, China

We're looking for outstanding AI systems software engineers to develop groundbreaking technologies across the inference systems software stack. Our team builds core AI systems software that accelerates high-impact workloads on NVIDIA GPUs, from deep learning primitives and kernel libraries to LLM inference runtimes, serving abstractions, and code generation technologies. As a member of the team, you will help design, build, optimize, and ship production-quality software that powers NVIDIA's AI software stack.

This role spans both foundational library engineering and next-generation inference systems work, with opportunities to contribute across the stack from low-level kernels and performance primitives to serving runtimes and developer-facing abstractions. You may work on GPU-accelerated deep learning primitives, efficient attention kernel implementations, LLM serving components, just-in-time compilation systems, software abstractions, and performance-critical runtime infrastructure for large language models, agents, and other advanced AI workloads. You will collaborate with world-class engineers across deep learning software, compilers, GPU architecture, and open-source inference ecosystems, and your work will directly impact NVIDIA's AI platform and the performance of real-world workloads at scale.

What you'll be doing:

  • Develop production-quality software that ships as part of NVIDIA's AI software stack, including cuDNN, FlashInfer, and optimized support for large language model inference workloads.

  • Innovate and develop new AI systems technologies for efficient inference, with a focus on performance, scalability, maintainability, and usability.

  • Design, implement, and optimize kernels for high-impact AI workloads across LLM inference, generative AI, computer vision, autonomous driving, and recommender systems.

  • Design and implement extensible software abstractions for deep learning libraries, LLM serving engines, and runtime systems.

  • Build and improve just-in-time compilation, code generation, and runtime technologies for performance-critical GPU workloads.

  • Analyze workload performance, tune current software, and propose improvements to future software and hardware-software interfaces.

  • Collaborate closely with engineers across deep learning frameworks, libraries, kernels, compilers, and GPU architecture teams at NVIDIA.

  • Contribute to open-source communities and ecosystem integrations where relevant, including projects such as FlashInfer, vLLM, and SGLang.

What we need to see:

  • Master's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.

  • 3+ years of relevant industry, research, or systems software development experience in machine learning, deep learning systems, compilers, or GPU software. More experience is expected for senior-level candidates.

  • Strong programming skills in C/C++ and Python, with hands-on experience developing high-performance software.

  • Solid experience with CUDA development and GPU programming fundamentals.

  • Strong experience developing or using deep learning frameworks and model formats such as PyTorch, JAX, TensorFlow, or ONNX.

  • Good understanding of linear algebra, performance analysis, profiling, and code optimization.

  • Experience designing software abstractions, APIs, or higher-level system architecture for performance-sensitive systems.

  • Familiarity with modern machine learning and inference system trends, especially around LLMs and generative AI.

  • Senior candidates are expected to have strong experience in GPU kernel development and performance optimization, especially with CUDA C/C++, cuTile, Triton, or similar technologies.

Ways to stand out from the crowd:

  • Hands-on experience with inference engines and runtimes such as vLLM, SGLang, MLC, TensorRT-LLM, or similar systems.

  • Background in domain-specific compilers, code generation, or library solutions for LLM inference and training.

  • Expertise in machine learning compilers or IR systems such as MLIR, Apache TVM, TensorIR, or related technologies.

  • Practical experience with GPU performance modeling, computer architecture, or accelerator-oriented software design.

  • Open-source project ownership or meaningful contributions in deep learning systems, compilers, kernels, or inference infrastructure.

Software pay context

Based on 7,865 disclosed Software salaries on RoleSuite, the role pays a median of $156K/year, with most offers between $123K and $197K (10th–90th percentile: $101K–$233K).
