We are looking for a Senior Site Reliability Engineer to join our dynamic and growing team supporting the Customer Last Mile (CLM) area and Order Services. In this role, you will bring deep expertise in AWS Bedrock and OpenSearch (index and performance tuning) to ensure the reliability, scalability, and performance of our critical microservices ecosystem.
Responsibilities
- Own production environments, including on-call coverage and major incident handling
- Lead root cause analysis and drive problem management to closure
- Define and maintain SLOs/SLIs while promoting a reliability-first mindset across teams
- Operate and optimize containerized workloads in AWS (Kubernetes on EKS, as well as ECS)
- Manage infrastructure as code using Terraform and Ansible
- Implement and maintain monitoring, alerting, and observability solutions with Instana, CloudWatch, and ELK
- Perform log analysis, alert hygiene, and capacity planning
- Support reliability patterns for CLM microservices, including APIs and async/event-driven processing
- Tune and maintain AWS Bedrock workloads and OpenSearch indexes for optimal performance
- Apply secure-by-design principles across all infrastructure and services
- Drive automation-first practices, documentation, and cross-team collaboration
- Participate in the on-call support rotation, covering one calendar week approximately once per month
Requirements
- 3+ years of experience in Site Reliability Engineering or related operations roles
- Expertise in AWS Bedrock and OpenSearch with a focus on index and performance tuning
- Proficiency in AWS fundamentals, including EC2, EKS/ECS, and IAM/networking
- Background in Kubernetes operations at production scale
- Skills in infrastructure as code with Terraform
- Competency in observability tooling such as Instana, CloudWatch, and ELK
- Understanding of microservices reliability patterns, APIs, and async/event-driven processing
- Knowledge of SLO/SLI definition, RCA methodologies, and problem management practices
- Familiarity with secure-by-design principles and operational security
- Ability to take production ownership, including on-call duties and major incident response
- Strong collaboration and documentation skills, and an automation-first mindset
- English proficiency at a B2 level to ensure effective communication and documentation
Nice to have
- Experience using Ansible for configuration management
- Demonstrated experience with advanced capacity planning and alert hygiene practices
- Experience tuning large-scale search and AI/ML platform workloads