AI Training Reliability Engineer
Advanced Micro Devices
- Beijing
- Permanent
- Full-time
- Own reliability governance (standards, runbooks, SLIs/SLOs) and deliver KPI improvements (goodput/badput).
- Productionize fast recovery paths: fault detection, isolation, membership change, and continuation without stop-the-world restarts.
- Establish fault-injection/chaos and regression gates to prevent reliability regressions (GPU/NIC/node, comms, storage, maintenance).
- Drive day-to-day incident response and root-cause analysis, converting learnings into preventative fixes.
- Strong software and systems engineering skills; able to debug complex distributed failures end-to-end (Linux, networking, concurrency).
- Hands-on large-scale distributed training experience (PyTorch Distributed/torchrun; common parallelism patterns).
- Solid accelerator fundamentals and operational debugging (GPU/NPU, drivers/runtime, profiling tooling).
- RDMA networking and collective communication fundamentals (all-reduce/all-gather/all-to-all) and related failure modes.
- TorchFT (or similar) per-step fault tolerance / checkpointless recovery experience.
- Experience with large cluster operations and automated remediation (health checks, drain/replace, topology-aware placement).
- Training stability hardening experience (hang watchdogs, NaN/Inf containment, OOM/memory fragmentation mitigation).
- Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent.