Senior Operational Intelligence Developer

EPAM · Ukraine · Remote · 2 months ago

We are seeking a Senior Operational Intelligence Developer to join our team and focus on maintaining and enhancing the Elastic & Observability Platform deployed across GCP and Elastic Cloud. The role involves managing platform operations, developing self-service capabilities, and collaborating with stakeholders to ensure optimal performance and reliability.

The successful candidate will participate in an on-call rotation dedicated to monitoring platform health and functionality. Weekday on-call duty covers business hours (Monday to Friday, 09:00–18:00), while weekend on-call consists of one 48-hour shift every four weeks. Weekend on-call is passive by default: action is required only if issues arise that affect platform health or performance.

Responsibilities

Ensure the availability, functionality, performance, and security of observability and search platforms in alignment with business SLAs
Provide incident response and resolution as the first point of escalation during on-call periods
Manage platform documentation, SOPs, and operational guidelines
Coordinate with internal stakeholders and vendors for installation, upgrades, and operational requirements
Design and develop platform features and self-service capabilities for customers
Deliver proofs-of-concept to improve platform operations, such as integrating AI-driven enhancements or Kubernetes migration
Maintain and evolve Infrastructure-as-Code automation for platform deployment and lifecycle management
Deploy, operate, and maintain scalable, highly available Elastic clusters
Plan and execute upgrades of Elastic Beats, Logstash, and other components, in coordination with the Image Factory team
Manage SSL certificate rotations, cluster capacity planning, cost optimization, and performance tuning
Configure and manage the ELK stack at all layers, including ingestion, indexing, and query performance
Implement alerting workflows, including Kibana Rules, Watchers, and PagerDuty integrations
Support data ingestion, enrichment, backup, and restoration processes

Requirements

Proven expertise in the implementation, operation, and maintenance of Elastic clusters, with at least 3 years of experience in related roles
Solid understanding of Infrastructure-as-Code and automation tools, including Terraform, Ansible, and Jenkins CI, paired with Python scripting
Advanced troubleshooting and problem-solving skills to diagnose and resolve complex technical issues
Strong communication skills to convey technical concepts to both technical and non-technical stakeholders
English proficiency at B2 level or higher