Senior AI Infrastructure Engineer

IO TECH SOLUTIONS LIMITED

Beijing
Permanent
Full-time

17 days ago

Responsibilities

1. Full-Stack AI Infrastructure Architecture & Development:
- Build a full-stack, Kubernetes-based AI infrastructure system for quantitative scenarios that unifies the management of heterogeneous computing resources (e.g., GPU pooling).
- Integrate high-performance communication layers (e.g., RDMA) and drive the unified development of AI training/inference platforms and GPU operations and maintenance platforms.
- Streamline the end-to-end workflow from resource scheduling to model deployment, improving system efficiency and stability.

2. Intelligent Computing Power Scheduling System Design:
- Design a global scheduling mechanism that supports multiple task types and priority strategies, leveraging the capabilities of the Volcano scheduler.
- Lead the customization and maintenance of Volcano and core Operators, optimizing elastic scaling and resource utilization for the dynamic demands of quantitative tasks.

3. Hardware-Software Co-Optimization & System Reliability:
- Develop an intermediate layer that bridges the underlying hardware (GPU/networking/storage) and AI frameworks (PyTorch/TensorFlow).
- Build elastic GPU resource pools, fault self-healing mechanisms, and unified observability platforms (e.g., monitoring dashboards).
- Ensure fast iteration and high availability of large-scale model training through performance tuning and automated operations.

4. Technical Foresight & Architecture Evolution:
- Drive long-term AI Infra roadmap planning, anticipating quantitative business needs in computing scale, training efficiency, and cost control.
- Explore and validate cutting-edge architectures (e.g., heterogeneous computing fusion, compute-storage separation, Serverless AI) to strengthen infrastructure capabilities and technical competitive advantage.

Qualifications

1. Bachelor's/Master's degree in Computer Science or a related field and 5-10 years of experience, with strong self-motivation and the execution ability to identify and resolve technical bottlenecks.
2. Deep expertise in AI infrastructure: Kubernetes, GPU resource management, RDMA/high-performance networking, and the design and deployment of large-scale distributed AI systems.
3. Proficient in Golang/Python, with solid system programming and automation skills. Priority given to candidates with experience in Volcano/Kueue schedulers, K8s Operator development, or open-source contributions.
4. Familiar with core resource scheduling principles and GPU lifecycle management (allocation, isolation, elasticity, fault tolerance), and able to design high-availability, low-latency strategies for quantitative tasks.
5. Knowledge of mainstream AI frameworks (PyTorch/TensorFlow), with experience in training/inference performance optimization and in cross-team collaboration on framework-infrastructure co-optimization.
6. Preferred: Experience in FinTech/quantitative AI infrastructure, an understanding of business-critical computing demands, and the ability to drive cross-team collaboration and value delivery.