We are seeking a Site Reliability Engineer with a strong programming background to join our Cloud Security and Infrastructure (CSI) team.
CSI provides a single point of entry to enable identity, branding and compliance, as well as a single point of management for provisioning, monitoring, security and operational support. The ideal candidate will bring hands-on expertise in containerization, orchestration and observability to help build and maintain reliable, scalable systems.
Responsibilities
- Create and manage applications, containerize them and run them using open-source container management tools such as Docker or Podman
- Interpret container logs and trace specific events for troubleshooting purposes
- Create and manage Kubernetes resource manifests for deployment into Kubernetes clusters (e.g., a local Kind cluster, or GKE/AKS on a cloud provider)
- Deploy Prometheus agents to monitor infrastructure and application behavior
- Raise and manage alerts based on observability data
- Support provisioning, monitoring, security and operational tasks across distributed systems
- Implement and maintain CI/CD pipelines and GitOps-based continuous deployment workflows
- Collaborate with cross-functional teams to ensure system reliability and performance
Requirements
- At least 2 years of hands-on programming experience
- Proficiency in at least one scripting language
- Hands-on expertise in Kubernetes and Linux
- Knowledge of at least one cloud provider, with experience in Microsoft Azure
- Familiarity with Prometheus or a similar monitoring agent, and a strong grasp of observability fundamentals
- Experience with Azure DevOps CI/CD pipelines and/or GitOps packaging and continuous deployment tools such as Helm and Argo CD
- Capability to troubleshoot distributed systems
- Background in Terraform for infrastructure as code
- Strong communication skills in English (B2+ level)
Nice to have
- Familiarity with Azure DevOps
- Knowledge of Google Cloud Platform
- Expertise in Istio
- Proficiency in Prometheus