We are seeking a Senior Operational Intelligence Developer to join our team and maintain and enhance the Elastic & Observability Platform deployed across GCP and Elastic Cloud. The role involves managing platform operations, developing self-service capabilities, and collaborating with stakeholders to keep the platform performant and reliable.
The successful candidate will participate in an on-call rotation dedicated to monitoring platform health and functionality. Weekday on-call duty covers business hours (Monday to Friday, 09:00–18:00), and weekend on-call consists of one 48-hour shift every four weeks. Weekend on-call is passive by default: action is required only if issues arise that affect platform health or performance.
Responsibilities
- Ensure the availability, functionality, performance, and security of observability and search platforms in alignment with business SLAs
- Provide incident response and resolution as the first point of escalation during on-call periods
- Manage platform documentation, SOPs, and operational guidelines
- Coordinate with internal stakeholders and vendors for installation, upgrades, and operational requirements
- Design and develop platform features and self-service capabilities for customers
- Deliver proofs-of-concept that improve platform operations, such as AI-driven enhancements or a migration to Kubernetes
- Maintain and evolve Infrastructure-as-Code automation for platform deployment and lifecycle management
- Deploy, operate, and maintain scalable, highly available Elastic clusters
- Plan and execute upgrades of Elastic Beats, Logstash, and other components, in coordination with the Image Factory team
- Manage SSL/TLS certificate rotation, cluster capacity planning, cost optimization, and performance tuning
- Configure and manage the ELK stack at all layers, including ingestion, indexing, and query performance
- Implement alerting workflows, including Kibana Rules, Watchers, and PagerDuty integrations
- Support data ingestion, enrichment, backup, and restoration processes
Requirements
- Proven expertise in the implementation, operation, and maintenance of Elastic clusters, with at least 3 years of experience in related roles
- Solid understanding of Infrastructure-as-Code and automation tools, including Terraform, Ansible, and Jenkins CI, along with Python scripting skills
- Advanced troubleshooting and problem-solving skills to diagnose and resolve complex technical issues
- Strong communication skills to convey technical concepts to both technical and non-technical stakeholders
- English proficiency at B2 level or higher
Nice to have
- Familiarity with chargeback automation and Elastic Synthetics enhancements
- Understanding of AI-driven observability enhancements or Kubernetes migration
- Background in integrating Uptrends and PagerDuty with Elastic components