Senior AI Platform Engineer (Domino)

EPAM · Spain · Remote, Office · yesterday

We're looking for a Senior AI Platform Engineer (Domino) to join our team remotely, with occasional onsite visits to Barcelona, Spain. In this role, you will design, build and optimize next-generation AI/ML platforms that enable enterprise-scale experimentation, model lifecycle management and production deployment in a secure, high-availability environment. You will work within the AWS cloud ecosystem, using Domino Data Lab as a core platform component while integrating with enterprise data solutions and applying MLOps best practices.

This role combines technical expertise and architectural insight, giving you the opportunity to influence platform strategy while delivering automation, scalability and compliance to accelerate data science and AI initiatives across R&D, commercial functions and operations.

Responsibilities

Define and implement enterprise AI platform architecture, including experimentation, training, model registry, CI/CD and observability components
Build and maintain reusable services, APIs and automation for scalable platform adoption
Administer and optimize Domino Data Lab for multi-tenant and multi-region usage
Lead integration of the AI platform with enterprise data pipelines, orchestrators and security frameworks
Drive cost optimization, performance tuning and GPU/CPU resource planning for distributed training and inference
Support the development of model pipelines and tooling that streamline experimentation-to-production workflows
Apply DevOps/MLOps practices using Infrastructure as Code for automation and compliance
Ensure robust security, identity management, encryption and regulatory compliance in collaboration with cybersecurity and data privacy teams
Research and drive new capabilities in LLM operations, including RAG patterns, vector databases and safety mechanisms
Foster technical best practices and mentor engineering teams to improve platform maturity

Requirements

Proven hands-on experience with Domino Data Lab administration and customization
Strong background in AWS or an equivalent cloud ecosystem (compute, storage, networking, IAM, governance)
Experience deploying and managing EKS clusters, including networking, storage classes, operators, GPU workloads and service mesh
Advanced Python programming skills, including automation and platform tooling development
Proficiency with Infrastructure as Code (e.g., Terraform, CloudFormation)
Experience implementing MLOps frameworks for model lifecycle management and reproducibility
Familiarity with distributed processing and big data tools (e.g., Apache Spark)
Understanding of security best practices and compliance standards in regulated environments
Background in LLM operations and multi-agent orchestration preferred
Excellent communication skills and the ability to translate technical concepts for diverse audiences
Degree in Computer Science, Engineering, or a related field