Senior AI Infrastructure Engineer
IO TECH SOLUTIONS LIMITED
- Beijing
- Permanent
- Full-time
- Build a full-stack, Kubernetes-based AI infrastructure system for quantitative scenarios, unifying the management of heterogeneous computing resources (e.g., GPU pooling).
- Integrate high-performance communication layers (e.g., RDMA) and drive the unified development of the AI training/inference platform and the GPU operations and maintenance platform.
- Streamline the end-to-end workflow from resource scheduling to model deployment, enhancing system efficiency and stability.
- Design a global scheduling mechanism that supports multiple task types and priority strategies, leveraging the Volcano scheduler's capabilities.
- Lead the customization and maintenance of Volcano and core Operators, optimizing elastic scaling and resource utilization in response to the dynamic demands of quantitative tasks.
- Develop an intermediate layer bridging underlying hardware (GPU/networking/storage) and AI frameworks (PyTorch/TensorFlow).
- Build GPU elastic resource pools, fault self-healing mechanisms, and unified observability platforms (e.g., monitoring dashboards).
- Ensure fast iteration and high availability for large-scale model training through performance tuning and automated operations.
- Drive long-term AI infrastructure roadmap planning, anticipating the quantitative business's needs in computing scale, training efficiency, and cost control.
- Explore and validate cutting-edge architectures (e.g., heterogeneous computing fusion, compute-storage separation, Serverless AI) to strengthen infrastructure capabilities and build a lasting technical advantage.