We are looking for an experienced Software Engineer in Performance Infrastructure to join our team. This position is part of a high-impact project for a world-renowned global tech leader and AI innovator. You will be focused on optimizing large-scale machine learning workloads for next-generation hardware within a High-Performance Computing (HPC) and Compiler Infrastructure environment. Your core mission will be to manage the end-to-end health, precision, and reliability of performance benchmarking pipelines, working at the intersection of automated infrastructure and performance engineering.
Essential functions
Responsibilities:
- Performance Analysis & Validation: Evaluate results from automated benchmarking suites to detect and analyze shifts in performance and related metrics.
- Root-Cause Analysis: Perform deep-dive root-cause analysis on bisection results to identify specific code changes responsible for performance regressions.
- Infrastructure Automation: Develop and maintain Python-based tooling for benchmark automation, hardware configuration management, and automated data recovery.
- System Debugging: Troubleshoot failures within the benchmarking pipeline, including script errors, environment misconfigurations, and resource allocation issues in distributed clusters.
- Data Pipelines & Dashboards: Maintain and enhance data pipelines and visualization tools to ensure high-fidelity performance metrics are consistently available for engineering teams.
- Technical Documentation: Develop and maintain engineering playbooks and best practices to improve consistency in performance testing and incident investigation.
Qualifications
Minimum requirements:
- Strong proficiency in Python for systems automation, data processing, and integration.
- Hands-on experience with SQL for querying large datasets and managing performance metrics.
- Deep knowledge of Linux/Unix environments, shell scripting (Bash), and command-line development.
- Exceptional analytical and problem-solving skills with the ability to debug complex system-level issues.
- Clear written communication skills for documenting technical investigations and collaborating across globally distributed teams.
Would be a plus
- Practical experience with distributed build and test systems (e.g., Bazel / CMake).
- Strong familiarity with CI/CD pipelines and automated regression testing.
- Basic understanding of hardware accelerators (GPUs) or machine learning frameworks (e.g., JAX, PyTorch, TensorFlow).
- Background in Performance Engineering or Site Reliability Engineering (SRE).
We offer
- Opportunity to work on bleeding-edge projects
- Work with a highly motivated and dedicated team
- Competitive salary
- Flexible schedule
- Benefits package: medical insurance, sports
- Corporate social events
- Professional development opportunities
- Well-equipped office
About us
Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation.
A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization, and customer experience.
Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.