Senior AI Engineer
AstraZeneca
- Beijing
- Permanent
- Full-time
- Design and validate multi-node multi-GPU training templates (DDP, FSDP) for NVIDIA H20 GPUs
- Build operational runbooks covering common failure modes, checkpointing, recovery
- Establish baseline performance benchmarks (throughput, step time, scaling efficiency)
- Optimize data loading pipelines to eliminate I/O bottlenecks in distributed settings
- Define training method standards: naming conventions, experiment configuration, model registry requirements, reproducibility criteria
- Create scheduling policies: GPU quota rules, priority tiers, job templates for the center's Kubernetes/Run:AI platform
- Establish compute triage process: how science teams request and receive GPU allocation
- MLOps: experiment tracking, model registry, CI/CD for ML, Kubernetes
- Streamline non-AI pipeline and workflow dependencies
- Build reusable fine-tuning pipeline templates for protein language models and scientific AI workloads
- Optimize training code for H20 GPUs to improve efficiency and throughput
- Collaborate with NVIDIA on hardware-specific optimizations
- Translate Discovery team workload needs into infrastructure requirements for IT
- Operate as "business owner" for AI compute, while IT operates as "system owner"
- Participate in weekly coordination meetings across Discovery, AISI, and IT
- 5+ years of experience in distributed deep learning training (DDP, FSDP, DeepSpeed, or equivalent)
- Strong PyTorch expertise with production-grade model training
- Experience with GPU workload optimization and multi-node cluster management
- Kubernetes job scheduling experience (Kubeflow, Slurm, Run:AI, or equivalent)
- Experience setting AI/ML engineering standards for teams (not just personal projects)
- Hands-on experience optimizing Transformer-based models (e.g., FlashAttention or related optimizations)
- Ability to work full-time in Beijing
- Parameter-efficient fine-tuning methods (QLoRA, LoRA, adapters)
- Reinforcement learning training infrastructure
- AWS China or Alibaba Cloud experience
- NVIDIA H20 or H100-series GPU familiarity
- Biopharma domain knowledge (molecular simulation, protein folding, drug discovery)
- Experience working across organizational boundaries (e.g., AI platform team serving multiple science groups)