Lead Site Reliability Engineer

EPAM·Ukraine·Remote·2 months ago

We are seeking a highly skilled and motivated Lead Site Reliability Engineer to oversee the reliability, scalability, and security of our cloud-native identity and profile management platform, enabling personalized experiences across various digital touchpoints.

Responsibilities

Ensure system reliability, availability, and performance
Automate infrastructure and operational processes using IaC tools like Terraform and CDK (TypeScript)
Develop and maintain CI/CD pipelines using Jenkins and GitHub Actions
Set up and enhance observability with Prometheus, Grafana, and OpenSearch
Define and monitor SLOs, SLIs, and error budgets
Lead incident response, perform root cause analysis, and drive post-mortem reviews
Support Kubernetes deployments and manage Helm charts
Drive scalability and capacity planning efforts
Optimize cloud infrastructure costs while maintaining performance
Ensure security and compliance across systems
Provide documentation and mentorship to foster team growth
Participate in a 24/7 on-call support rotation, estimated at one week per month

Requirements

5+ years of experience in Site Reliability Engineering or related fields
Knowledge of AWS, including serverless services (Lambda, Step Functions, EventBridge), IAM, CloudWatch, and networking services
Proficiency in IaC tools such as Terraform and CDK (TypeScript)
Expertise in CI/CD tools like Jenkins and GitHub Actions
Competency in monitoring and observability tools such as Prometheus, Grafana, and OpenSearch
Background in Kubernetes and Helm for container orchestration
Capability to lead incident management and operational excellence
Strong communication skills and fluency in English (B2+)