Software Development Engineer
Advanced Micro Devices
- Shanghai
- Permanent
- Full-time
- Deep Learning & LLM Framework Optimization: Optimize major DL/LLM frameworks (PyTorch, vLLM, SGLang) for AMD GPUs and contribute improvements upstream.
- Model-Aware Implementation: Build features that interact closely with LLMs and multimodal architectures (e.g., Llama, Qwen-VL, Wan), requiring understanding of attention mechanisms, cross-modal fusion, KV caching, and quantization.
- Performance-Conscious Coding: Write efficient, scalable code while considering memory usage, concurrency, and bottlenecks in multi-GPU environments.
- Profiling: Use profiling tools to evaluate the impact of your changes, identify regressions, and validate performance improvements as part of the development cycle.
- End-to-End Performance Engineering: Perform comprehensive profiling to identify bottlenecks and implement system, memory, and communication optimizations across multi-GPU and multi-node setups.
- Compiler & Pipeline Acceleration: Leverage compiler technologies and graph compilers to enhance the full deep learning and inference pipeline.
- Research & Advanced Techniques: Prototype and integrate emerging optimization methods such as speculative decoding and weight-only quantization into production systems.
- Cross-Team & Open-Source Collaboration: Collaborate with internal GPU library teams and open-source maintainers to align improvements and ensure seamless upstream integration.
- Software Engineering Excellence: Apply robust engineering practices to deliver maintainable, reliable, and production-quality performance optimizations.
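Several of the responsibilities above center on attention mechanisms and KV caching in autoregressive decoding. The following is a minimal, framework-agnostic sketch of single-head scaled dot-product attention over a growing KV cache; the class and function names are illustrative, not taken from PyTorch, vLLM, or SGLang.

```python
# Sketch: single-head scaled dot-product attention with a KV cache,
# as used in autoregressive LLM decoding. Illustrative only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Grows by one (key, value) row per decoded token."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend(q, cache):
    # q: (head_dim,) query for the current token; attends over all cached positions.
    scores = cache.keys @ q / np.sqrt(q.shape[-1])  # (seq_len,)
    weights = softmax(scores)                       # (seq_len,)
    return weights @ cache.values                   # (head_dim,)

# Simulate a few decode steps: append the new token's K/V, then attend.
rng = np.random.default_rng(0)
d = 8
cache = KVCache(d)
for _ in range(4):
    k, v, q = rng.normal(size=(3, d))
    cache.append(k, v)
    out = attend(q, cache)
print(out.shape)  # (8,)
```

The cache trades memory for compute: each step reuses previously computed keys and values instead of re-encoding the whole sequence, which is exactly why KV-cache memory usage dominates multi-GPU inference planning.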
- Software Engineering Skills: Familiarity with Python; familiarity with C++ or async programming is a plus.
- Understanding of LLM and multimodal model concepts: Knowledge of transformer architectures, attention mechanisms, MoE, KV caching, quantization (FP8/FP4), vision-language alignment, and inference pipelines (e.g., image + text input handling).
- Linux development environment: Comfortable using command-line tools, Git, and standard debugging/profiling utilities.
- End-to-End LLM Performance Engineering: Experience with profiling and diagnosing compute, memory, and communication bottlenecks across multi-GPU and multi-node environments.
- Software Engineering Excellence & Community Contribution is a plus: Solid Python/C++ coding skills, experience with debugging and testing practices, a proven ability to deliver maintainable performance-critical software, and a track record of open-source contributions with strong self-motivation.
- GPU Kernel Development & Optimization is a plus: Knowledge of tuning high-performance GPU kernels for AMD GPUs using HIP, CUDA, ASM, and tools like CK, CUTLASS, and Triton.
- Compiler & System-Level Optimization is a plus: Foundational knowledge of LLVM, ROCm, and compiler-driven techniques for improving kernel and system performance.
- Model Architectures & Optimization Expertise: Knowledge of multimodal models (e.g., Qwen-VL, Qwen-Image-Edit, Wan) or diffusion-based generative models.
- Development Skills: Exposure to GPU computing (ROCm, CUDA) or performance profiling tools (e.g., PyTorch Profiler).
- Distributed Systems Experience: Experience with distributed inference for large-scale models (e.g., Tensor Parallel, Pipeline Parallel).
- Bachelor’s in Computer Science, Computer Engineering, Electrical Engineering, or a related field.
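The weight-only quantization mentioned above can be sketched in a few lines. Production systems on AMD GPUs would use FP8/FP4 formats with fused dequantization kernels; plain int8 is used here only because it is easy to demonstrate in NumPy, and all names are illustrative.

```python
# Sketch: symmetric per-row weight-only quantization (int8 stand-in for
# the FP8/FP4 formats used in real deployments). Illustrative only.
import numpy as np

def quantize_rows(w):
    """Quantize each row of w to int8 with its own scale factor."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # per-row scale
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Weights are stored compressed and expanded on the fly at matmul time.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, scale = quantize_rows(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()
print(q.dtype, max_err)  # int8 storage; rounding error bounded by scale / 2
```

The per-row scale keeps the quantization error proportional to each row's magnitude, which is the same reason per-channel or per-group scaling is preferred over a single global scale in practice.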