
AI Model Training Engineer
- Beijing
- Permanent
- Full-time
This is an engineering-focused role centered on improving the performance, stability, and scalability of distributed training systems. You will work closely with internal model and platform teams to push the boundaries of generative AI model training.
Key Responsibilities:
- Participate in the development and maintenance of AMD’s internal training framework.
- Optimize distributed training pipelines and parallelism strategies (data parallelism, tensor parallelism, pipeline parallelism, ZeRO, etc.); a minimal data-parallel sketch follows this list.
- Improve communication scheduling and kernel overlap to reduce training latency and maximize GPU utilization.
- Tune the performance of core operators using HIP/CUDA and low-level profiling tools.
- Integrate and adapt open-source training frameworks such as Megatron-LM, TorchTitan, and DeepSpeed.
- Support internal model training workloads with performance, reliability, and scalability improvements.
- Collaborate across teams to investigate and resolve system-level bottlenecks in large-scale training.
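To give a concrete flavor of this work, here is a minimal data-parallel training sketch using PyTorch's torch.distributed and DistributedDataParallel. It assumes a `torchrun` launch; the model, data, and hyperparameters are illustrative only, not part of any internal framework:

```python
# Minimal data-parallel sketch. Launch with: torchrun --nproc_per_node=N train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL backend on CUDA builds; ROCm builds of PyTorch route this to RCCL.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()  # toy model for illustration
    # DDP buckets gradients and overlaps their all-reduce with the backward pass.
    model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device="cuda")  # synthetic batch
        loss = model(x).square().mean()
        opt.zero_grad(set_to_none=True)
        loss.backward()  # gradient all-reduce overlaps with remaining compute here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Tensor parallelism, pipeline parallelism, and ZeRO-style sharding follow the same pattern of partitioning work across ranks, but split parameters, layers, or optimizer state rather than the batch.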
Qualifications:
- Solid engineering background and familiarity with end-to-end deep learning training workflows.
- Hands-on experience with training framework internals (e.g., Megatron-LM, TorchTitan, DeepSpeed, FairScale).
- Strong debugging and performance analysis skills (profiling, tracing, etc.).
- Understanding of distributed training techniques such as data parallelism, tensor parallelism, pipeline parallelism, and ZeRO optimization.
- Excellent communication and cross-functional collaboration skills.
Preferred Experience:
- Experience with large-scale model training (e.g., LLMs, ViTs, MoE).
- Hands-on experience with CUDA or HIP kernel development.
- Familiarity with communication libraries such as NCCL/RCCL and techniques like kernel overlap (a minimal sketch follows this list).
- Prior involvement in high-performance ML infrastructure projects.
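As an illustration of the communication/compute overlap mentioned above, here is a minimal sketch using torch.distributed's asynchronous collectives. It assumes a `torchrun` launch; the tensor sizes and the "gradient" are arbitrary stand-ins:

```python
# Overlap an asynchronous all-reduce (NCCL on CUDA, RCCL on ROCm) with compute.
# Launch with: torchrun --nproc_per_node=N overlap.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

grad = torch.randn(16 * 1024 * 1024, device="cuda")  # stand-in gradient buffer
# async_op=True returns a handle; the collective runs on NCCL's internal stream.
work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)

# Independent compute on the default stream proceeds while the collective runs.
activations = torch.randn(4096, 4096, device="cuda")
activations = activations @ activations

work.wait()  # synchronize only when the reduced gradient is actually needed
grad /= dist.get_world_size()

dist.destroy_process_group()
```

Production training frameworks schedule many such collectives against dedicated streams and interleave them with compute kernels; this sketch shows only the basic handle-and-wait pattern behind that technique.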