Lead Site Reliability Engineer

EPAM·Argentina, Colombia, Brazil, Chile·Remote·yesterday

We are seeking an experienced Lead Site Reliability Engineer to spearhead our infrastructure reliability initiatives and guide a team of talented engineers. In this role, you will shape technical strategy, mentor team members and drive operational excellence across our cloud-based platforms and distributed services.

Responsibilities

Lead the design and evolution of resilient, scalable infrastructure across multiple cloud providers
Mentor and guide a team of engineers, fostering technical growth and the adoption of best practices
Define reliability standards, SLOs and operational policies for production environments
Architect automation frameworks to streamline deployments and infrastructure management
Oversee CI/CD strategy and ensure efficient software delivery workflows
Coordinate incident response efforts and lead post-mortem analyses to prevent recurrence
Partner with engineering leadership to align reliability goals with business priorities
Champion observability practices to enhance system visibility and proactive issue detection
Provide technical direction for microservices and event-driven architecture initiatives
Evaluate emerging tools and technologies to enhance the reliability ecosystem
Drive capacity planning, cost optimization and performance tuning across platforms