Lead Site Reliability Engineer

EPAM·Ukraine·Remote·1 month ago

We are seeking a highly skilled Lead Site Reliability Engineer to join our team in driving system reliability, scalability, and performance in complex cloud and containerized environments.

This is a unique opportunity to lead critical infrastructure initiatives, foster operational excellence, and collaborate across teams to achieve business objectives.

Responsibilities

Design comprehensive monitoring and logging systems using tools like DataDog, Dynatrace, Prometheus, Grafana, Zabbix, and ELK to ensure robust observability
Define and manage SLIs and SLOs to measure and enhance system performance, reliability, and scalability
Lead root cause analysis during incident responses, ensure detailed postmortem evaluations, and develop long-term preventive strategies
Implement infrastructure as code (IaC) using Terraform and cloud CLI (AWS, Azure, GCP) for streamlined management and consistency
Automate workflows and CI/CD pipelines leveraging tools such as Jenkins (Groovy SDK), GitLab CI, and Azure DevOps
Manage containerized environments with expertise in Docker and Kubernetes orchestration for seamless application deployment
Collaborate with engineering and DevOps teams to standardize observability practices and proactively address issues before they escalate
Lead and facilitate post-incident reviews and operational drills to identify areas for improvement and increase system resilience
Provide optional on-call support for rapid issue resolution and to maintain system stability

Requirements

Residence in Ukraine, with remote work eligibility limited to candidates based within the country
Advanced proficiency in scripting automations with Python, Go, Bash, or PowerShell
Strong knowledge of monitoring systems and tools like Prometheus, Grafana, DataDog, Dynatrace, Zabbix, or ELK
Experience with cloud platforms (AWS, Azure, or GCP) and expertise in IaC with Terraform
Solid understanding of configuration management systems like Ansible
Background in automating CI/CD pipelines and delivery lifecycles using Jenkins, GitLab CI, and Azure DevOps
Practical experience deploying and orchestrating applications in Docker and Kubernetes environments
Exceptional problem-solving capability for incident reconstruction and identifying root causes
Proven track record in leading post-incident reviews and operational improvement exercises
Strong collaboration skills to work effectively with engineering teams and stakeholders to maintain reliability and performance
English level B2 or higher

Nice to have

Knowledge of advanced security and compliance strategies in observable environments
Familiarity with chaos engineering approaches for resilience and fault tolerance testing
Experience integrating observability into development workflows to accelerate issue resolution
Familiarity with additional cloud monitoring services like AWS CloudWatch, Azure Monitor, or GCP Operations Suite
