Lead Observability & SLI Engineer

EPAM·Poland·Remote·yesterday

We are seeking a Lead Observability & SLI Engineer to design and implement observability and Service Level Indicators (SLIs) for real-time distributed platforms. The role focuses on engineering meaningful telemetry, embedding SLI checks into CI/CD, and turning metrics, logs and traces into actionable reliability insights. This is an engineering role, not a traditional Ops or monitoring-setup position.

Responsibilities

Define and validate SLIs/SLOs for real-time platform services (EFX / RDD / ECB)
Embed SLI checks and observability gates into CI/CD and GitOps workflows
Build end-to-end platform insights by correlating metrics, logs and traces
Improve telemetry instrumentation across distributed services
Support incident analysis and root cause identification using telemetry data
Deliver production-ready observability components together with SRE teams

Requirements

5+ years of hands-on experience with SLIs/SLOs (p95/p99 latency, error rates, error budgets)
Deep understanding of observability signals (metrics, logs, traces) and how they work together
Background in integrating observability into automated pipelines (CI/CD, GitOps)
Expertise in OpenTelemetry, Prometheus, Grafana or similar tools such as Datadog
Cloud-native proficiency in Kubernetes, containers, Terraform and Helm
Strong system-level reasoning and troubleshooting skills