As a Lead ML Systems Engineer, you will own the architecture, performance, and scalability of Krisp Cloud’s real-time Voice AI serving infrastructure.
You will be responsible for transforming state-of-the-art research models into highly optimized, reliable, and cost-efficient production systems that power latency-sensitive, mission-critical Voice AI services.
This role sits at the intersection of machine learning, distributed systems, GPU performance engineering, and large-scale infrastructure, and requires deep systems thinking and long-term architectural ownership.
What you'll do
Model Serving & Production Performance
- Prototype, implement, and benchmark critical components of the serving stack.
- Architect and implement inference and serving strategies that define how models are packaged, deployed, replicated, batched, scheduled, and optimized under real-time constraints.
- Partner with Research and Platform teams to drive deep performance optimization across runtime, precision (FP16/INT8/FP8), batching strategies, and GPU execution.
- Design scaling behavior under variable real-time load (burst handling, replica strategy, workload partitioning).
- Establish observability standards across inference services (latency metrics, GPU profiling, tracing, performance regression detection).
- Lead root cause analysis of systemic performance regressions and implement structural improvements.
- Partner closely with MLOps and Platform teams to operationalize infrastructure while retaining architectural ownership of the serving layer.
Technical Leadership
- Drive alignment between model design and production constraints, ensuring research translates into performant, scalable, cost-effective systems.
- Mentor senior engineers through design reviews, deep technical discussions, and hands-on collaboration.
- Shape the long-term architectural direction for Voice AI serving infrastructure through both implementation and strategic design.
What we're looking for
Experience
- 5+ years building performance-critical backend or distributed systems.
- Hands-on experience deploying and operating ML inference systems in production environments.
- Experience working on latency-sensitive or real-time services.
- Demonstrated ownership of significant system components or architectural decisions in production environments.
- Track record of improving performance, scalability, or cost efficiency of production systems.
Technical Depth
- Strong systems background (distributed systems, networking, concurrency, performance engineering).
- Hands-on experience deploying and optimizing GPU-based inference systems in production (TensorRT or similar runtimes; graph optimization, precision tuning, memory optimization, CUDA-level profiling).
- Strong experience with high-performance transformer/LLM inference engines (e.g., vLLM), including continuous batching, KV cache optimization, and throughput tuning.
- Deep understanding of modern transformer inference optimizations (e.g., memory-efficient attention mechanisms and KV caching strategies).
- Experience with model serving frameworks (e.g., Triton, Ray Serve, or custom high-performance serving stacks).
- Experience with quantization (INT8/FP16/FP8), ONNX optimization, and advanced batching strategies.
- Hands-on GPU profiling and performance tuning (memory fragmentation, utilization optimization, latency reduction).
- Strong programming skills in Python and/or C++.
- Experience with Docker, Kubernetes, and cloud-native deployment architectures.
Nice to Have
- Experience optimizing ASR or TTS systems for real-time production workloads.
- Experience with streaming inference and low-latency (<200ms) systems.
- Experience building cost-efficient inference infrastructure at scale.
- Familiarity with CUDA internals or custom kernel optimization.