As a Lead ML Systems Engineer, you will own the architecture, performance, and scalability of Krisp Cloud’s real-time Voice AI serving infrastructure.
You will be responsible for transforming state-of-the-art research models into highly optimized, reliable, and cost-efficient production systems that power latency-sensitive, mission-critical Voice AI services.
This role sits at the intersection of machine learning, distributed systems, GPU performance engineering, and large-scale infrastructure, and requires deep systems thinking and long-term architectural ownership.
What you'll do
Model Serving & Production Performance
- Prototype, implement, and benchmark critical components of the serving stack.
- Architect and implement inference and serving strategies that define how models are packaged, deployed, replicated, batched, scheduled, and optimized under real-time constraints.
- Partner with Research and Platform teams to drive deep performance optimization across runtime, precision (FP16/INT8/FP8), batching strategies, and GPU execution.
- Design scaling behavior under variable real-time load (burst handling, replica strategy, workload partitioning).
- Establish observability standards across inference services (latency metrics, GPU profiling, tracing, performance regression detection).
- Lead root cause analysis of systemic performance regressions and implement structural improvements.
- Partner closely with MLOps and Platform teams to operationalize infrastructure while retaining architectural ownership of the serving layer.
Technical Leadership
- Drive alignment between model design and production constraints, ensuring research translates into performant, scalable, cost-effective systems.
- Mentor senior engineers through design reviews, deep technical discussions, and hands-on collaboration.
- Shape the long-term architectural direction for Voice AI serving infrastructure through both implementation and strategic design.
What we're looking for
Experience
- 5+ years building performance-critical backend or distributed systems.
- Hands-on experience deploying and operating ML inference systems in production environments.
- Experience working on latency-sensitive or real-time services.
- Demonstrated ownership of significant system components or architectural decisions in production environments.
- Track record of improving performance, scalability, or cost efficiency of production systems.
Technical Depth
- Strong systems background (distributed systems, networking, concurrency, performance engineering).
- Hands-on experience deploying and optimizing GPU-based inference systems in production (TensorRT or similar runtimes; graph optimization, precision tuning, memory optimization, CUDA-level profiling).
- Strong experience with high-performance transformer/LLM inference engines (e.g., vLLM), including continuous batching, KV cache optimization, and throughput tuning.
- Deep understanding of modern transformer inference optimizations (e.g., memory-efficient attention mechanisms and KV caching strategies).
- Experience with model serving frameworks (e.g., Triton, Ray Serve, or custom high-performance serving stacks).
- Experience with quantization (INT8/FP16/FP8), ONNX optimization, and advanced batching strategies.
- Hands-on GPU profiling and performance tuning (memory fragmentation, utilization optimization, latency reduction).
- Strong programming skills in Python and/or C++.
- Experience with Docker, Kubernetes, and cloud-native deployment architectures.
Nice to Have
- Experience optimizing ASR or TTS systems for real-time production workloads.
- Experience with streaming inference and low-latency (<200ms) systems.
- Experience building cost-efficient inference infrastructure at scale.
- Familiarity with CUDA internals or custom kernel optimization.