Join a high-impact engineering team building resilience frameworks across cloud-native platforms. You will design, execute, and evolve chaos experiments that safeguard platform reliability and drive the development of autonomous, AI-powered testing pipelines at scale.
Join EPAM to engineer solutions that matter. From AI to cloud transformation, you'll collaborate with top-tier innovators, gain autonomy to explore your ideas, and grow your skills in a culture built for tech excellence.
You will work on an IoT platform that handles millions of devices.
Req# 1008466132
Responsibilities
- Design and manage chaos engineering tests using Azure Chaos Studio, analyzing platform architecture to identify failure domains and strengthen system resilience
- Maintain and enhance existing LitmusChaos test suites across Kubernetes environments, ensuring consistent coverage and accuracy across all platforms
- Build comprehensive test suites by integrating LitmusSDK, Azure Management SDK, Chaos SDK and Kubernetes SDK to automate and scale chaos experiments
- Lead HA/DR testing initiatives across all environments, operating independently to validate high availability and disaster recovery readiness
- Establish and standardize chaos engineering frameworks across AKS and EKS platforms, enabling scalable and repeatable resilience practices organization-wide
- Integrate AI-driven capabilities into the chaos engineering pipeline to enable touchless experiment creation, automated execution and continuous validation
Requirements
- Hands-on experience with Kubernetes orchestration platforms including AKS or EKS, with deep understanding of container-based infrastructure and cloud-native architecture
- Proficiency in chaos engineering tools including LitmusChaos and Azure Chaos Studio, with demonstrated experience building and maintaining structured test suites
- Experience with Istio service mesh for traffic management, observability and resilience configuration within microservices environments
- Practical experience with LitmusSDK, Azure Management SDK, Chaos SDK and Kubernetes SDK
- Proven ability to conduct HA/DR testing and to work autonomously, with minimal oversight, across complex multi-environment cloud platforms