AI Product Performance Engineer
Advanced Micro Devices
- Beijing
- Permanent
- Full-time
Key Responsibilities:
- High-Performance Kernel Development: Design, implement, and optimize high-performance GPU kernels for AI/ML workloads to maximize hardware utilization.
- Performance Optimization: Analyze and optimize kernel execution for latency and throughput, addressing bottlenecks in memory bandwidth, instruction latency, and thread divergence.
- Workload Analysis: Evaluate the end-to-end performance impact of individual kernels on full-stack AI models, ensuring that micro-optimizations translate to application-level speedups.
- Profiling & Tuning: Utilize advanced GPU profiling tools (e.g., ROCm Profiler, PyTorch Profiler) to identify performance cliffs, pipeline stalls, and memory hierarchy inefficiencies.
- Architecture Adaptation: Tailor implementation strategies to leverage specific features of modern GPU architectures (e.g., Matrix Cores, HBM characteristics).
- Framework Integration: Collaborate with software stack teams to expose optimized kernels within high-level frameworks and inference engines.
Qualifications:
- GPU Architecture Mastery: In-depth understanding of the underlying architecture of modern GPUs, including streaming multiprocessors/compute units (SMs/CUs), the memory hierarchy (registers, shared memory, L1/L2 cache, HBM), and warp/wavefront execution models.
- Kernel Programming Expertise: Strong proficiency in C++ and parallel computing, with extensive hands-on experience in NVIDIA CUDA or AMD HIP kernel programming.
- Performance Engineering: Demonstrated ability to debug and profile complex GPU workloads, interpreting low-level metrics to drive architecture-aware optimizations.
- Systems Knowledge: Familiarity with asynchronous execution, stream management, and host-device memory transfers.
- Python DSLs & Triton: Experience implementing kernels using OpenAI Triton or other Python-based DSLs for agile kernel development and auto-tuning.
- Inference Engine Experience: Hands-on experience integrating custom kernels into large-scale inference frameworks such as vLLM, SGLang, or TensorRT-LLM.
- Deep Learning Frameworks: Familiarity with writing custom extensions or operators for PyTorch (C++/CUDA extensions).
- Hardware Agnosticism: Experience porting kernels between NVIDIA and AMD architectures or working with cross-platform HPC libraries.
Academic Credentials:
- BS required; MS preferred, with several years of relevant industry experience.
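As an illustration of the memory-bandwidth analysis this role involves, the sketch below applies the standard roofline model to decide whether a kernel is compute- or memory-bound. The hardware figures (peak FLOP/s and HBM bandwidth) are placeholder assumptions for illustration, not the specs of any particular GPU:

```python
def attainable_gflops(flops, bytes_moved, peak_gflops, bandwidth_gbs):
    """Roofline estimate: achievable throughput is capped by either
    peak compute or arithmetic intensity times memory bandwidth."""
    intensity = flops / bytes_moved            # FLOPs per byte
    return min(peak_gflops, intensity * bandwidth_gbs)

# Element-wise FP32 add: 1 FLOP per element, 12 bytes moved
# (two 4-byte loads plus one 4-byte store).
n = 1 << 20
flops = n
bytes_moved = 12 * n

# Placeholder hardware figures (assumptions, not real GPU specs):
PEAK_GFLOPS = 50_000     # 50 TFLOP/s peak FP32 throughput
BANDWIDTH_GBS = 2_000    # 2 TB/s HBM bandwidth

perf = attainable_gflops(flops, bytes_moved, PEAK_GFLOPS, BANDWIDTH_GBS)
bound = "memory-bound" if perf < PEAK_GFLOPS else "compute-bound"
print(f"{perf:.1f} GFLOP/s attainable ({bound})")
```

Under these assumed numbers the vector add sits far below the compute roof (about 167 GFLOP/s attainable), which is why optimizing such kernels centers on reducing bytes moved (fusion, lower-precision storage) rather than on raw FLOP throughput.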