EPAM Vietnam is hiring a Senior Site Reliability Engineer to support and stabilize a complex, business-critical environment. This is a hands-on, high-ownership role responsible for production incidents, releases, monitoring, alerting and operational excellence.
You will work across Linux, Windows, SQL Server, CI/CD, Kubernetes and Azure while supporting both modern cloud workloads and legacy business-critical systems.
Responsibilities
- Own production incidents end-to-end, from triage to fix and follow-up
- Troubleshoot Linux and Windows systems, services and databases
- Operate and improve monitoring and alerting tools
- Support batch workflows and schedulers
- Work across production and disaster recovery environments
- Improve runbooks, alert quality and operational processes
Requirements
- Strong experience in production operations, SRE or infrastructure support
- Proven expertise in troubleshooting Linux and Windows production systems, plus operational knowledge of Microsoft SQL Server diagnostics
- Experience with CI/CD pipelines and deployments (e.g., Octopus Deploy, TeamCity and Git/Bitbucket)
- Proficiency in monitoring and alerting tools (e.g., Prometheus and Grafana)
- Familiarity with batch scheduling tools (e.g., Control-M and TeamCity) and messaging systems (e.g., RabbitMQ)
- Working knowledge of Kubernetes and Azure cloud environments
- Clear communication skills for incident management and stakeholder interaction
- Strong sense of ownership, sound judgment in escalation and a proactive approach to production reliability