We're looking for a Senior AI Platform Engineer (Domino) to join our team remotely, with occasional onsite visits to Barcelona, Spain. In this role, you will design, build and optimize next-generation AI/ML platforms that enable enterprise-scale experimentation, model lifecycle management and production deployment in a secure, high-availability environment. You will work within the AWS cloud ecosystem, using Domino Data Lab as a core platform component while integrating with enterprise data solutions and applying MLOps best practices.
This role combines technical expertise and architectural insight, giving you the opportunity to influence platform strategy while delivering automation, scalability and compliance to accelerate data science and AI initiatives across R&D, commercial functions and operations.
Responsibilities
- Define and implement enterprise AI platform architecture, including experimentation, training, model registry, CI/CD and observability components
- Build and maintain reusable services, APIs and automation for scalable platform adoption
- Administer and optimize Domino Data Lab for multi-tenant and multi-region usage
- Lead integration of the AI platform with enterprise data pipelines, orchestrators and security frameworks
- Drive cost optimization, performance tuning and GPU/CPU resource planning for distributed training and inference
- Support the development of model pipelines and tooling that streamline experimentation-to-production workflows
- Apply DevOps/MLOps practices using Infrastructure as Code for automation and compliance
- Ensure robust security, identity management, encryption and regulatory compliance in collaboration with cybersecurity and data privacy teams
- Research and drive new capabilities in LLM operations, including RAG patterns, vector databases and safety mechanisms
- Foster technical best practices and mentor engineering teams to improve platform maturity
Requirements
- Proven hands-on experience with Domino Data Lab administration and customization
- Strong background in AWS or an equivalent cloud ecosystem (compute, storage, networking, IAM, governance)
- Experience deploying and managing EKS clusters, including networking, storage classes, operators, GPU workloads and service mesh
- Advanced Python programming skills, including automation and platform tooling development
- Proficiency with Infrastructure as Code (e.g., Terraform, CloudFormation)
- Experience implementing MLOps frameworks for model lifecycle management and reproducibility
- Familiarity with distributed processing and big data tools (e.g., Apache Spark)
- Understanding of security best practices and compliance standards in regulated environments
- Background in LLM operations and multi-agent orchestration preferred
- Excellent communication skills and ability to translate technical concepts for diverse audiences
- Degree in Computer Science, Engineering, or a related field
Nice to have
- Exposure to GxP life sciences environments and governance processes
- Knowledge of AI safety, token-aware scaling and session management
- Familiarity with cost/performance optimization strategies for AI workloads
- Contributions to internal platform strategy and improvement roadmaps