Site Reliability Engineer

EPAM·Argentina, Mexico·Remote·1 week ago

We are seeking a Site Reliability Engineer with a strong programming background to join our Cloud Security and Infrastructure (CSI) team.

CSI provides a single point of entry to enable identity, branding and compliance, as well as a single point of management for provisioning, monitoring, security and operational support. The ideal candidate will bring hands-on expertise in containerization, orchestration and observability to help build and maintain reliable, scalable systems.

Responsibilities

Create and manage applications, containerize them and run them using open-source container management tools such as Docker or Podman
Interpret container logs and trace specific events for troubleshooting purposes
Create and manage Kubernetes resource manifests for deployment to clusters (e.g., a local kind cluster, or GKE/AKS on a cloud provider)
Deploy Prometheus agents to monitor infrastructure and application behavior
Raise and manage alerts based on observability data
Support provisioning, monitoring, security and operational tasks across distributed systems
Implement and maintain CI/CD pipelines and GitOps-based continuous deployment workflows
Collaborate with cross-functional teams to ensure system reliability and performance

Requirements

At least 2 years of hands-on programming experience
Proficiency in at least one scripting language
Hands-on expertise in Kubernetes and Linux
Knowledge of at least one cloud provider, with experience in Microsoft Azure
Familiarity with Prometheus or a similar monitoring agent and strong fundamentals of observability
Skills in Azure DevOps CI/CD pipelines and/or GitOps packaging and continuous deployment tools such as Helm and ArgoCD
Capability to troubleshoot distributed systems
Background in Terraform for infrastructure as code
Fluent communication skills in English at a B2+ level