We are seeking a highly skilled and motivated Lead Site Reliability Engineer to oversee the reliability, scalability, and security of our cloud-native identity and profile management platform, enabling personalized experiences across various digital touchpoints.
Responsibilities
- Ensure system reliability, availability, and performance
- Automate infrastructure and operational processes using IaC tools like Terraform and CDK (TypeScript)
- Develop and maintain CI/CD pipelines using Jenkins and GitHub Actions
- Set up and enhance observability with Prometheus, Grafana, and OpenSearch
- Define and monitor SLIs, SLOs, and error budgets
- Lead incident response, perform root cause analysis, and drive post-mortem reviews
- Support Kubernetes deployments and manage Helm charts
- Drive scalability and capacity planning efforts
- Optimize cloud infrastructure costs while maintaining performance
- Ensure security and compliance across systems
- Provide documentation and mentorship to foster team growth
- Participate in a 24/7 on-call support rotation, estimated at one week per month
Requirements
- 5+ years of experience in Site Reliability Engineering or a related field
- Knowledge of AWS, including serverless services (Lambda, Step Functions, EventBridge), IAM, CloudWatch, and networking services
- Proficiency in IaC tools such as Terraform and CDK (TypeScript)
- Expertise in CI/CD tools like Jenkins and GitHub Actions
- Competency in monitoring and observability tools such as Prometheus, Grafana, and OpenSearch
- Background in Kubernetes and Helm for container orchestration
- Ability to lead incident management and drive operational excellence
- Strong communication skills and fluency in English (B2+)
Nice to have
- Familiarity with GitHub Actions for building CI/CD pipelines
- Understanding of Helm chart management
- Skills in TypeScript