We are looking for an experienced Site Reliability Engineer to join the RingCentral Operations Observability team. In this role, you will be responsible for the availability and performance of our home-built Monitoring Platform and infrastructure.
Our team provides the mission-critical operational insights used across RingCentral, managing everything from high-scale data collection to our proprietary alert correlation and processing engine. You will play a crucial role in ensuring the reliability and uptime of these systems by identifying bottlenecks, automating recovery, and proactively scaling the environment. The ideal candidate is a Linux-focused SRE who enjoys working on custom-built internal products and has a strong background in distributed systems, containerization, and data-driven observability.
Responsibilities:
- Maintain and Support Platform Availability: Act as the primary owner for the uptime and health of our internal monitoring and alerting infrastructure.
- Incident Management: Represent the team in global incident resolution and participate in a sustainable on-call rotation.
- Evolution of Custom Tooling: Make changes and improvements to the monitoring stack to meet evolving business needs.
- Lifecycle Integration: Collaborate with Dev and Ops teams to integrate our custom observability solutions into the global software development lifecycle.
- Capacity Management: Stay ahead of growth requirements in a high-concurrency, fast-growing SaaS environment.
- Code-Level Contributions: Actively work with the team’s codebase (Go/Python) to extend system integrations and automate routine operational "toil."
- Auditing & Standards: Conduct regular assessments of the monitoring systems to ensure they meet performance benchmarks and security standards.
Skills:
- Experience: 4+ years as an SRE or Systems Engineer in a production environment.
- Linux Expertise: Strong Linux administration and performance tuning skills.
- Problem Solving: A methodical approach to troubleshooting complex, distributed system failures.
- Programming: Experience with at least one language (Go or Python preferred), sufficient to work in our custom-built codebase.
- Observability Mindset: Deep understanding of the monitoring domain, SaaS telemetry, and alerting theory.
- Cloud Platforms: Hands-on experience with AWS, GCP, or a similar provider.
- Scalability: Proven experience running systems in large-scale, heterogeneous environments (a major plus).
- Communication: Ability to work with globally distributed teams and communicate technical issues clearly.
Preferred technology stack:
- OS: Linux (CentOS/RedHat/Oracle Linux).
- Languages: Go, Python, JavaScript/TypeScript.
- Cloud & Containers: AWS, Kubernetes, Docker.
- Data Pipelines: Message brokers, distributed logs, and time-series databases (TSDBs).
- Observability: Custom Alert Processors, Zabbix, Prometheus, Grafana.
- Databases: ClickHouse, VictoriaMetrics, MongoDB, PostgreSQL.
- Automation: Ansible, Terraform, GitLab CI, ArgoCD.
Qualifications:
- B.S. in Computer Engineering, Computer Science, or a related field with 5+ years of relevant experience.
We offer:
- Well-coordinated professional team
- Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and strong opportunities for self-realization and professional and career growth
- Additional Health and Life Insurance Package
- Employee Assistance Program
- 25 vacation days