We are seeking an HPC Network Engineering Manager - AI Infrastructure to guide architecture and technical direction for AI research and Kubernetes-based GPU infrastructure. You will steer standards for InfiniBand/RDMA, Ethernet, Kubernetes networking, SmartNIC/DPU, and observability across large programs while mentoring senior engineers. Join us to shape reliable, scalable network platforms for massive distributed AI workloads. Apply now.
Responsibilities
- Define and own a multi-year architectural vision and roadmap for InfiniBand/RDMA and high-speed Ethernet fabrics supporting massive GPU clusters and distributed AI/LLM workloads across the client portfolio
- Govern evaluation and standardization of cluster network topologies such as Fat-tree, Clos, Rail-optimized, and Dragonfly, and set decision frameworks aligned to scale, performance, and cost constraints
- Establish and enforce engineering standards for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
- Drive strategic performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training, and oversee resolution of the hardest systemic performance issues
- Define the reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and lead adoption across programs
- Own strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align rollout with the broader infrastructure roadmap
- Define enterprise network observability strategy, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methods
- Provide technical leadership and mentorship to lead and principal engineers across networking, Kubernetes, storage, GPU infrastructure, observability, and AI research teams to drive cross-functional alignment
- Act as the principal technical authority in executive stakeholder forums by shaping direction, negotiating program trade-offs, and ensuring delivery of reliable, scalable network platforms across engagements
- Contribute to the engineering community through thought leadership, internal practice building, and representation at industry events
Requirements
- 9+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 5+ years focused on HPC, AI/ML, or GPU cluster networking and 3+ years of demonstrated technical leadership at the program or portfolio level
- Proven track record defining architecture and governing delivery for InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-sensitive distributed compute environments
- Authoritative expertise in host-side networking (NICs, drivers, firmware) plus PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with demonstrated ability to set enterprise standards and uplift engineering practices
- Deep understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, with ability to drive workload-network co-design at scale
- Authoritative knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration, with experience defining reference architectures
- Expert-level mastery of RDMA networking, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at very large scale
- Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics, with ability to define repeatable diagnostic methodologies for broader teams
- Demonstrated ownership of network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
- Outstanding leadership, mentoring, stakeholder management, and executive communication skills, with proven experience leading multiple engineering teams, influencing C-level client architecture decisions, and driving alignment across research and platform stakeholders
- English language proficiency at an Advanced level (C1)
Nice to have
- Hands-on architectural and strategic experience with Azure Networking, Ethernet, and GPGPU/GPU technologies
- Authoritative command of Grafana and Prometheus, plus network administration experience that includes defining observability standards across an engineering organization
- Proven ability to set strategy, govern, and scale Infrastructure as Code practices across multiple teams and programs
- Proficiency in Python and UNIX shell scripting for automation, tooling, and improving engineering productivity
- Track record of thought leadership through conference talks, publications, patents, or open-source contributions in the HPC/AI networking domain