Our client is the world’s largest broadline food distributor, specializing in food and non-food products for restaurants, healthcare, educational facilities, lodging, and more. The company serves more than 600,000 clients in 90+ countries and operates approximately 330 distribution facilities worldwide.
As part of the AI Native Dev Adoption POD, you will merge the skills of a Senior Infrastructure Architect with an AI Ops Specialist. Your mission is to build, configure, and scale smart, self-healing systems that automate triage, rightsizing, and system remediation.
Essential functions
- Build Intelligent Agents: Design and implement specialized AI Ops agents, including Root Cause Analysis (RCA) triage agents, Autonomous Remediation agents, FinOps rightsizing systems, and APIC onboarding loops.
- Orchestrate Cluster Automation: Configure alert ingestion services, cross-domain signal ingestion, dynamic runbook configuration, and signal-plan-prove loops per cluster.
- Lead Application Deployment: Drive per-cluster adoption by onboarding legacy enterprise applications into the autonomous operations loop, moving each app from standard telemetry to human-in-the-loop (HITL) operation and ultimately to fully bounded autonomous remediation.
- Collaborate and Document: Co-build runbook curation strategies directly with the client’s operations engineering teams. Establish, validate, and document clear boundaries for safe system autonomy.
- Advance AI SDLC Maturity: Work within a dedicated, high-performing POD alongside a Platform Architect, Product Owner, AI Harness Engineer, and RAG Data Engineer to elevate the organization's overall AI SDLC practices.
Qualifications
- Senior-Level SRE/Architecture Expertise: Proven engineering background at a Senior or Lead level with strong architectural design capabilities in complex enterprise environments.
- Advanced Observability Stack: Deep, hands-on experience with modern cloud observability providers and monitoring ecosystems (Datadog, Splunk, Grafana, Loki, or Prometheus).
- ITSM Integration: Solid familiarity with enterprise incident management frameworks and automated communication systems (PagerDuty, ServiceNow, Slack, Teams).
- Automation & Signal Exposure: Exposure to or experience working with alert ingestion engines, automated response tooling, and pattern matching inside large log data structures.
- Language Proficiency: Strong professional English communication skills (both spoken and written) to work closely with global distributed teams.
We offer
- Opportunity to work on bleeding-edge projects
- Work with a highly motivated and dedicated team
- Competitive salary
- Flexible schedule & a hybrid working model
- Benefits package (medical insurance, sports)
- Corporate social events
- Professional development opportunities
- Well-equipped office
About us
Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization, and customer experience.
Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.