We are seeking a highly skilled Lead Site Reliability Engineer to join our team in driving system reliability, scalability, and performance in complex cloud and containerized environments.
This is a unique opportunity to lead critical infrastructure initiatives, foster operational excellence, and collaborate across teams to achieve business objectives.
Responsibilities
- Design comprehensive monitoring and logging systems using tools like Datadog, Dynatrace, Prometheus, Grafana, Zabbix, and ELK to ensure robust observability
- Define and manage SLIs and SLOs to measure and enhance system performance, reliability, and scalability
- Lead root cause analysis during incident responses, ensure detailed postmortem evaluations, and develop long-term preventive strategies
- Implement infrastructure as code (IaC) using Terraform and cloud CLIs (AWS, Azure, GCP) for streamlined, consistent infrastructure management
- Automate workflows and CI/CD pipelines leveraging tools such as Jenkins (Groovy DSL), GitLab CI, and Azure DevOps
- Manage containerized environments with expertise in Docker and Kubernetes orchestration for seamless application deployment
- Collaborate with engineering and DevOps teams to standardize observability practices and proactively address issues before they escalate
- Lead and facilitate post-incident reviews and operational drills to identify areas for improvement and increase system resilience
- Participate in optional on-call support hours to resolve issues rapidly and maintain system stability
Requirements
- Residence in Ukraine; remote work is available only to candidates based in the country
- Advanced proficiency in scripting and automation with Python, Go, Bash, or PowerShell
- Strong knowledge of monitoring systems and tools like Prometheus, Grafana, Datadog, Dynatrace, Zabbix, or ELK
- Experience with cloud platforms (AWS, Azure, or GCP) and expertise in IaC with Terraform
- Solid understanding of configuration management systems like Ansible
- Background in automating CI/CD pipelines and delivery lifecycles using Jenkins, GitLab CI, and Azure DevOps
- Practical experience deploying and orchestrating applications in Docker and Kubernetes environments
- Exceptional problem-solving capability for incident reconstruction and identifying root causes
- Proven track record in leading post-incident reviews and operational improvement exercises
- Strong collaboration skills to work effectively with engineering teams and stakeholders to maintain reliability and performance
- English level B2 or higher
Nice to have
- Knowledge of advanced security and compliance strategies for observability environments
- Familiarity with chaos engineering approaches for resilience and fault tolerance testing
- Experience integrating observability into development workflows to accelerate issue resolution
- Familiarity with additional cloud monitoring services like AWS CloudWatch, Azure Monitor, or Google Cloud's operations suite