We are seeking a Lead DevOps Engineer to design, operate, and continuously improve the AWS platform that powers a custom VDI platform and a cloud playtesting/streaming platform. This is primarily an individual-contributor role that requires strong ownership and the ability to work independently while collaborating with one other team member and with customer stakeholders. You will be responsible for infrastructure-as-code, container platforms, automation, CI/CD standardization, cost and performance optimization (including GPU instances), and leading troubleshooting during platform-wide degradations.
Responsibilities
- Design, build, and maintain AWS infrastructure using Terraform
- Manage Terraform workflows and remote state using HashiCorp Cloud Platform (HCP)
- Own the infrastructure lifecycle, including provisioning, upgrades, decommissioning and operational hygiene
- Operate the ECS clusters that run the microservices supporting the platforms
- Operate the EKS clusters that host GitHub Actions runners, including required platform customizations
- Right-size and tune GPU-enabled EC2 capacity to balance user experience with strict cloud cost controls
- Continuously assess scaling behavior, utilization and performance bottlenecks
- Implement and maintain AWS Lambda functions for automation such as cleanup tasks, on-demand provisioning and operational workflows
- Standardize and optimize GitHub Actions pipelines for Terraform plan/apply workflows, infrastructure releases and container image build/publish/deploy processes
- Lead troubleshooting and restoration efforts for platform-wide issues such as VDI session drops, authentication issues and machine/storage failures
- Coordinate incident resolution across teams through investigation, mitigation and follow-up actions
- Create and maintain runbooks, operational documentation and onboarding materials
Requirements
- 5+ years of experience in DevOps or platform engineering roles
- Expertise in AWS infrastructure design, provisioning and lifecycle management
- Proficiency in Terraform and HashiCorp Cloud Platform (HCP)
- Skills in container orchestration with ECS and EKS
- Knowledge of GPU-enabled EC2 capacity right-sizing, cost management and performance tuning
- Competency in AWS Lambda for event-driven automation
- Background in CI/CD standardization with GitHub Actions pipelines
- Capability to lead reliability engineering, troubleshooting and incident resolution
- High ownership and accountability with the ability to work independently and deliver without close supervision
- Strong troubleshooting and systems thinking, remaining calm and structured during incidents
- Clear communication with both technical and non-technical stakeholders
- Practical prioritization in a Kanban environment, balancing planned work against urgent interruptions
- English proficiency at B2 level or higher
Nice to have
- Familiarity with Amazon GameLift Streams
- Understanding of streaming and playtesting platform needs
- Skills in triaging urgent ad-hoc requests outside the standard Kanban flow