We deliver resilient Kubernetes and Linux platforms optimized for GPU scheduling and large-scale automation in AI compute environments. As a Mid-Level DevOps Engineer, you will operate Kubernetes (including Volcano) and Linux GPU clusters, automate workflows with Python and UNIX shell scripting, and partner with a client-facing delivery team. Apply to help build reliable, high-throughput compute platforms for advanced AI workloads.
Responsibilities
- Deploy, configure, and run GPU-enabled Kubernetes clusters and standalone Linux compute environments while keeping scheduling and performance optimized
- Implement and manage Volcano job scheduling, including queue setup, Pod execution, GPU allocation, and namespace quota enforcement
- Administer Kubernetes end to end, including namespaces, RBAC, resource quotas, and workload isolation approaches
- Develop and maintain Python and shell automation to simplify job submission, resource provisioning, and system reporting
- Collaborate with orchestration, optimization, and observability teams to boost scheduling efficiency, improve capacity utilization, and streamline researcher workflows
- Monitor infrastructure health and resource utilization, supplying data and feedback for optimization and reporting needs
- Identify opportunities to improve infrastructure, tooling, and automation workflows to raise performance, scalability, and usability
- Ensure operational processes deliver a smooth and efficient experience for researchers running diverse AI and computational workloads
Requirements
- 2+ years of hands-on DevOps or infrastructure engineering experience in complex, large-scale environments
- Deep expertise in Kubernetes administration and orchestration, including namespaces, Pod scheduling/distribution, persistent volume claims (PVCs), NFS, and resource quota management
- Practical experience using the Volcano scheduler for GPU job execution, queue configuration, and workload prioritization integrated with Kubernetes
- Proven ability to run GPU cluster environments in Kubernetes and on standalone Linux compute nodes
- Advanced Python scripting skills for infrastructure automation, plus proficiency in UNIX shell scripting (e.g., Bash)
- Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management
- Solid understanding of infrastructure automation and orchestration concepts and related tooling
- Fluent English communication skills (spoken and written) for direct client interaction
Nice to have
- Knowledge of Helm package management for Kubernetes applications
- Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki
- Skills in Infrastructure as Code tools such as Terraform
- Background in multi-cloud Kubernetes environments including Amazon EKS and Google GKE
- Understanding of Azure Networking including VPN, ExpressRoute, and network security
- Familiarity with AI-assisted coding tools such as GitHub Copilot, ChatGPT, and Claude
- Experience with hybrid (cloud and on-premises) scheduling and resource optimization