We are looking for a Senior DevOps / Site Reliability Expert ready to own the reliability and operational maturity of a production AI platform.
You will be the engineering foundation that keeps agentic content workflows running at scale, ensuring services are observable, deployments are automated, and infrastructure is reproducible. Working closely with AI and full-stack engineers, you will shape how the team builds, ships, and operates, with particular focus on the reliability challenges unique to LLM workloads: cost, latency, and non-deterministic failure modes. This is a role for someone who takes pride in building systems that others depend on.
Responsibilities
- Own the reliability, scalability, and performance of the platform's services running on Azure Container Apps and AWS ECS
- Build and maintain CI/CD pipelines (GitHub Actions) for automated build, test, and deployment across multiple microservices, including Docker image management, registries, and deployment config
- Implement and manage infrastructure as code (Terraform, Bicep, or ARM templates) across Azure and AWS
- Set up and maintain observability — monitoring, alerting, logging, and dashboards (New Relic, Langfuse, CloudWatch)
- Manage Azure Service Bus, Blob Storage, Key Vault, and Container Apps configurations
- Ensure security best practices — secret management, image scanning, vulnerability remediation
- Implement auto-scaling, load balancing, and cost optimisation for AI workloads
- Support incident response and establish runbooks for production services
- Collaborate with AI engineers to optimise LLM API usage, token costs, and latency
Requirements
- 3+ years of SRE, DevOps, or platform engineering experience
- Hands-on expertise in Azure (Container Apps, Service Bus, Key Vault, Blob Storage) and Azure OpenAI resource management
- Proficiency in infrastructure as code using Terraform, Bicep, or ARM templates
- Background in CI/CD pipeline design and maintenance (GitHub Actions preferred)
- Skills in Docker and container orchestration, with Kubernetes experience as a strong plus
- Competency in monitoring and observability with New Relic or an equivalent tool; LLM observability experience is a plus
- Understanding of security practices, including secret management, vulnerability scanning, and image hardening
- Ability to script in Python or Bash for automation
- Daily use of AI development tools (Cursor, Claude Code, Copilot)
- Excellent command of written and spoken English (B2+ level)
Nice to have
- Familiarity with Amazon Web Services (ECS, S3, Aurora, CloudWatch)