We are seeking a Senior Site Reliability Engineer to ensure the operational excellence and reliability of our production services. This role combines core SRE responsibilities with a specialization in generative AI technologies, focusing on AWS infrastructure, Kubernetes orchestration and observability platforms to support mission-critical systems.
Participation in the on-call support rotation is required for this role. Each engineer covers one calendar week approximately once per month.
Responsibilities
- Provide operational support for production services, including on-call rotation and major incident handling
- Define, monitor and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure reliability
- Manage and operate AWS infrastructure, particularly Kubernetes clusters, using Infrastructure as Code
- Ensure the reliability and performance of microservices and event-driven architectures
- Manage, tune and optimize search and observability platforms, with a specific focus on OpenSearch performance
- Conduct root cause analysis (RCA) and drive problem management to prevent recurring issues
- Take ownership of production environments and reliability outcomes
- Collaborate with engineering teams to embed a reliability mindset across the organization
Requirements
- 3+ years of experience in Site Reliability Engineering or related operational roles
- Expertise in AWS services including EC2, EKS and ECS
- Proficiency in Amazon Bedrock and OpenSearch
- Knowledge of IAM and AWS infrastructure management
- Skills in Infrastructure as Code using Terraform
- Background in container orchestration with Kubernetes
- Familiarity with observability tools such as Instana, CloudWatch and ELK
- Understanding of microservices, APIs and event-driven processing
- Strong root cause analysis (RCA) and problem management skills
- Competency in SLO/SLI definition and reliability engineering practices
- Upper-Intermediate English language proficiency (B2)
Nice to have
- Familiarity with Ansible for configuration management