About the project
Andersen is hiring an SRE Engineer with Spanish/Portuguese to improve the reliability and performance of a large-scale cloud platform, ensuring high availability, fast incident resolution, and stable operations.
The customer is a large international organization operating in the insurance and risk protection domain, providing a wide range of services for both individual and corporate clients. The customer has a strong global presence, delivers diversified insurance solutions across multiple lines of business, and is known for its mature operational model, financial stability, and well-established expertise in service delivery and claims management.
The project focuses on maintaining and improving the reliability, stability, and operational efficiency of a large-scale cloud-based platform. The team follows Site Reliability Engineering (SRE) principles to ensure high system availability, fast incident resolution, and minimal customer impact.
Responsibilities
- Maintaining high system availability and reliability across cloud-based environments.
- Designing and implementing monitoring, logging, and alerting solutions.
- Defining and managing SLIs, SLOs, and error budgets.
- Leading incident response, root cause analysis, and post-mortems.
- Automating operational tasks to reduce manual intervention.
- Improving CI/CD pipelines and deployment reliability.
- Building self-healing and auto-remediation solutions.
- Partnering with engineering teams to improve application resilience.
- Participating in capacity planning, scaling, and disaster recovery planning.
- Promoting SRE best practices across development teams.
Requirements
- 5+ years of experience as a Site Reliability Engineer / Incident Management Engineer.
- Strong experience in incident escalation.
- Experience with Azure cloud platforms.
- Experience with Kubernetes administration (AKS / EKS / GKE).
- Experience with containerization technologies (Docker).
- Experience with Infrastructure as Code for X+ years (Terraform preferred).
- Understanding of high-availability architectures, auto-scaling, and disaster recovery strategies.
- Experience with monitoring and APM tools for X+ years (Dynatrace, Datadog, Prometheus, Azure Monitor, etc.).
- Experience with log aggregation systems (ELK, Loki, Splunk, etc.).
- Experience with distributed tracing solutions (OpenTelemetry preferred).
- Experience with alert configuration, tuning, and reduction of alert fatigue.
- Experience defining and tracking SLIs and SLOs.
- Level of English – Intermediate+ or above.
- Level of Spanish/Portuguese – Upper-Intermediate or above.
Why join us
- Experience working with leaders in FinTech, Healthcare, Retail, Telecom, and other industries. Andersen cooperates with businesses such as Samsung, Siemens, Johnson & Johnson, BNP Paribas, Ryanair, Mercedes, TUI, Verivox, Allianz, T-Systems, and others.
- The opportunity to change projects and/or develop expertise in an interesting business domain.
- Guarantee of professional, financial, and career growth! The company has introduced mentoring and onboarding systems for every new employee.
- The opportunity to earn up to an additional 1,000 USD per month by participating in the company's activities.
- Access to the corporate training portal, where the company's entire knowledge base is collected and constantly updated.
- Vibrant corporate life (parties / pizza days / PlayStation / fruit / coffee / snacks / movies).
- Certification compensation (AWS, PMP, etc.).
- Referral program.
- English courses.
- Private health insurance and compensation for sports activities.