HPC Network Engineering Manager - AI Infrastructure

EPAM · Argentina, Colombia, Mexico, Brazil, Chile · Remote · 5 days ago

We are seeking an HPC Network Engineering Manager - AI Infrastructure to guide architecture and technical direction for AI research and Kubernetes-based GPU infrastructure. You will steer standards for InfiniBand/RDMA, Ethernet, Kubernetes networking, SmartNIC/DPU, and observability across large programs while mentoring senior engineers. Join us to shape reliable, scalable network platforms for massive distributed AI workloads. Apply now.

Responsibilities

Define and own a multi-year architectural vision and roadmap for InfiniBand/RDMA and high-speed Ethernet fabrics supporting massive GPU clusters and distributed AI/LLM workloads across the client portfolio
Govern evaluation and standardization of cluster network topologies such as Fat-tree, Clos, Rail-optimized, and Dragonfly, and set decision frameworks aligned to scale, performance, and cost constraints
Establish and enforce engineering standards for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
Drive strategic performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training, and oversee resolution of the hardest systemic performance issues
Define the reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and lead adoption across programs
Own strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align rollout with the broader infrastructure roadmap
Define enterprise network observability strategy, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methods
Provide technical leadership and mentorship to lead and principal engineers across networking, Kubernetes, storage, GPU infrastructure, observability, and AI research teams to drive cross-functional alignment
Serve as the principal technical authority in executive stakeholder forums by shaping direction, negotiating program trade-offs, and ensuring delivery of reliable, scalable network platforms across engagements
Contribute to the engineering community through thought leadership, internal practice building, and representation at industry events

Requirements

9+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, including 5+ years focused on HPC, AI/ML, or GPU cluster networking and 3+ years of demonstrated technical leadership at the program or portfolio level
Proven track record defining architecture and governing delivery for InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-sensitive distributed compute environments
Authoritative expertise in host-side networking (NICs, drivers, firmware) plus PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with demonstrated ability to set enterprise standards and uplift engineering practices
Deep understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, with ability to drive workload-network co-design at scale
Authoritative knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration, with experience defining reference architectures
Expert-level mastery of RDMA networking, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at very large scale
Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics, with ability to define repeatable diagnostic methodologies for broader teams
Demonstrated ownership of network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
Outstanding leadership, mentoring, stakeholder management, and executive communication skills, with proven experience leading multiple engineering teams, influencing C-level client architecture decisions, and driving alignment across research and platform stakeholders
English language proficiency at an Advanced level (C1)

Nice to have

Hands-on architectural and strategic experience with Azure Networking, Ethernet, and GPGPU/GPU technologies
Authoritative command of Grafana and Prometheus, plus network administration experience, including defining observability standards across an engineering organization
Proven ability to set strategy, govern, and scale Infrastructure as Code practices across multiple teams and programs
Proficiency in Python and UNIX shell scripting for automation, tooling, and improving engineering productivity
Track record of thought leadership through conference talks, publications, patents, or open-source contributions in the HPC/AI networking domain