We are seeking a Senior DevOps Engineer to enhance our high-performance computing services and collaborate closely with the scientific community to optimize research computing.
Join our team to build and operate cutting-edge HPC capabilities using automation and infrastructure-as-code. Apply now to contribute to innovative computational solutions in a dynamic environment.
Responsibilities
- Design, implement, and maintain robust platform infrastructure using Infrastructure as Code tools such as Terraform
- Develop, deliver, and operate research computing services and applications
- Apply Site Reliability Engineering principles to manage HPC service deployment, monitoring, and incident response
- Solve complex technical problems related to HPC services and user applications
- Manage large-scale HPC, HTC, or BC computing environments for optimal performance
- Collaborate with scientific users to tailor HPC resources to research needs
- Automate deployment processes to ensure consistency across HPC infrastructure
- Maintain and administer large-scale cluster workload-management software such as Slurm, LSF, or Grid Engine
- Develop and maintain monitoring dashboards using tools like Grafana and Prometheus
- Work within a DevOps team environment following agile methodologies
- Operate and use virtualized private cloud platforms such as OpenStack
- Administer large-scale parallel filesystems including Weka, GPFS, or Lustre
- Use configuration management tools like Ansible, Salt, or Puppet to manage IT operations
- Develop scripts and tools for HPC and DevOps platform operations using Bash and Python
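As a concrete illustration of the Bash/Python tooling work described in the bullets above, here is a minimal sketch of a script that summarizes Slurm node states from `sinfo`-style output. The helper name and the sample data are illustrative assumptions, not part of any existing codebase.

```python
# Minimal sketch: summarize node states from output shaped like
# `sinfo -h -N -o '%n %t'` (node name, node state per line).
# The function name and sample text are hypothetical examples.
from collections import Counter


def summarize_node_states(sinfo_output: str) -> dict:
    """Count nodes per state from two-column `sinfo`-style text."""
    counts = Counter()
    for line in sinfo_output.strip().splitlines():
        parts = line.split()
        if len(parts) == 2:
            _node, state = parts
            counts[state] += 1
    return dict(counts)


sample = """\
node001 idle
node002 alloc
node003 alloc
node004 down
"""
print(summarize_node_states(sample))  # {'idle': 1, 'alloc': 2, 'down': 1}
```

In day-to-day operations a script like this would read live `sinfo` output (e.g. via `subprocess`) and feed the counts into a monitoring pipeline; parsing is kept separate here so the logic is testable without a running cluster.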
Requirements
- 3+ years of experience with DevOps processes and automation using Infrastructure as Code tools such as Terraform
- Hands-on experience operating or engineering large-scale HPC or similar computing environments
- Proven expertise in Linux system administration including TCP/IP networking and storage subsystems
- Experience administering large-scale cluster management software such as Slurm, LSF, or Grid Engine
- Knowledge of configuration management tools like Ansible, Salt, or Puppet
- Experience working in agile DevOps teams
- Ability to build and maintain monitoring dashboards and alerting using tools such as Grafana and Prometheus
- Experience with scripting languages such as Bash and Python for automation and tool development
- Strong experience managing virtualized private cloud environments like OpenStack
- Degree in a scientific discipline, or equivalent experience in computationally intensive scientific data analysis
- Proven ability to manage relationships with third-party suppliers
- Upper-intermediate proficiency in English (B2+)
Nice to have
- Experience with container technologies such as LXD, Singularity, Docker, or Kubernetes
- Operation and configuration experience with public cloud platforms like AWS, Azure, or GCP
- Experience with HashiCorp tools such as Vault, Consul, and Nomad
- Development experience with programming languages such as Java, C++, Python, Ruby, or Perl
- Experience with parallel filesystems like Weka, GPFS, or Lustre