Senior Site Reliability Engineer

EPAM·Ukraine·Remote·1 month ago

We are looking for a Senior Site Reliability Engineer to join our growing team supporting the Customer Last Mile (CLM) area and Order Services. In this role, you will bring deep expertise in AWS Bedrock and OpenSearch (index and performance tuning) to ensure the reliability, scalability, and performance of our critical microservices ecosystem.

Responsibilities

Own production environments, including on-call coverage and major incident handling
Lead root cause analysis and drive problem management to closure
Define and maintain SLOs/SLIs while promoting a reliability-first mindset across teams
Operate and optimize Kubernetes and container workloads in AWS (EKS/ECS)
Manage infrastructure as code using Terraform and Ansible
Implement and maintain monitoring, alerting, and observability solutions with Instana, CloudWatch, and ELK
Perform log analysis, alert hygiene, and capacity planning
Support reliability patterns for CLM microservices, including APIs and async/event-driven processing
Tune and maintain AWS Bedrock and OpenSearch indexes for optimal performance
Apply secure-by-design principles across all infrastructure and services
Drive automation-first practices, documentation, and cross-team collaboration
Participate in the on-call support rotation, covering one calendar week approximately once per month

Requirements

3+ years of experience in Site Reliability Engineering or related operations roles
Expertise in AWS Bedrock and OpenSearch with a focus on index and performance tuning
Proficiency in AWS fundamentals, including EC2, EKS/ECS, and IAM/networking
Background in Kubernetes operations at production scale
Skills in infrastructure as code with Terraform
Competency in observability tooling such as Instana, CloudWatch, and ELK
Understanding of microservices reliability patterns, APIs, and async/event-driven processing
Knowledge of SLO/SLI definition, RCA methodologies, and problem management practices
Familiarity with secure-by-design principles and operational security
Capability to handle production ownership, on-call duties, and major incident response
Strong collaboration and documentation skills, and an automation-first mindset
English proficiency at a B2 level to ensure effective communication and documentation