We are hiring a Middle DevOps Engineer to run Kubernetes GPU orchestration with Volcano and keep Linux compute platforms stable for AI and research teams. You will automate day-to-day operations with Python and UNIX shell scripting, tune scheduling and quotas, and work in a client-facing delivery setup. Apply now to help build efficient, dependable compute infrastructure.
Responsibilities
- Provision and support GPU-capable Kubernetes clusters plus independent Linux compute nodes to maximize scheduling effectiveness and system performance
- Operate Volcano scheduling by configuring queues, managing the Pod lifecycle, allocating GPU resources, and applying namespace quota controls
- Maintain Kubernetes environments by managing namespaces, RBAC, resource quotas, and workload isolation mechanisms
- Automate operational workflows by writing and updating Python and shell scripts for job submission, resource allocation, and monitoring
- Partner with orchestration, optimization, and observability teams to improve scheduling performance, utilization, and researcher outcomes
- Analyze and report on infrastructure health and resource usage to drive continuous optimization
- Implement upgrades to infrastructure, tooling, and automation to improve scalability, performance, and user experience
- Assist with operational processes that ensure researchers have an effective environment for AI and computational projects
Requirements
- 2+ years of hands-on experience in DevOps or infrastructure engineering for complex, large-scale environments
- Strong knowledge of Kubernetes operations, including namespaces, Pod placement and balancing, PVCs, NFS, and resource quota management
- Practical experience operating Volcano for GPU workloads, including queue creation, priority handling, and Kubernetes integration
- Demonstrated experience managing GPU clusters across Kubernetes and standalone Linux setups used for high-performance computing
- Advanced ability in Python scripting to automate infrastructure tasks, job processing, and monitoring workflows
- Solid command of UNIX shell scripting (Bash or similar) to automate system routines and improve operations
- Strong Linux administration skills with troubleshooting, performance tuning, and configuration management experience
- Deep understanding of automation and orchestration concepts and tools for reliable, scalable infrastructure
- Excellent English communication skills (spoken and written) for direct interaction with clients and cross-functional teams
Nice to have
- Helm experience for Kubernetes application packaging and releases
- Observability knowledge with Prometheus, Grafana, and Loki for infrastructure monitoring
- Terraform familiarity for Infrastructure as Code and cloud resource automation
- Experience with Amazon EKS and Google GKE in multi-cloud Kubernetes setups
- Azure networking skills including VPN, ExpressRoute, and network security
- Use of AI coding assistants such as GitHub Copilot, ChatGPT, and Claude to boost code quality and productivity
- Knowledge of hybrid scheduling and optimization across cloud and on-premises compute