
Site Reliability Engineer (SRE)
- Shanghai
- Permanent
- Full-time
- Design and maintain scalable, fault-tolerant systems for AWS/Alicloud.
- Implement monitoring, alerting, and automation tools (Prometheus, Grafana, K8s).
- Optimize infrastructure for high availability and minimal latency.
- Lead incident response, root cause analysis, and post-mortem documentation.
- Partner with Development teams to embed reliability practices.
- Advocate for automation, observability, and performance engineering.
- Forecast resource needs and plan for scaling infrastructure.
- Identify and mitigate risks to ensure service stability.
- Education: Bachelor's in Computer Science/Engineering or equivalent.
- Cloud Proficiency: AWS/Alicloud with experience in containerization (Docker, K8s, ACK).
- OS Knowledge: Solid understanding of operating systems (Linux preferred) and networking fundamentals.
- Automation Tools: Terraform, Ansible, Jenkins, Git, Harness.
- Programming: Python, Go, Java, JavaScript/TypeScript, or similar languages.
- Monitoring: Prometheus, Grafana, ELK Stack, etc.
- Problem-Solving: Strong debugging skills and ability to work under pressure.
- Communication: Fluent English (written/verbal) for cross-team collaboration.
- SRE/DevOps experience in high-traffic environments (preferred).
- AWS Certified DevOps Engineer, CKA, or equivalent certifications (preferred).
- Experience with microservices architecture and large-scale distributed systems (preferred).