SRE Engineer (Monitoring Tools)

JettyCloud·Remote·Posted yesterday

We are looking for an experienced Site Reliability Engineer to join the RingCentral Operations Observability team. In this role, you will be responsible for the availability and performance of our in-house Monitoring Platform and its infrastructure.

Our team provides the mission-critical operational insights used across RingCentral, managing everything from high-scale data collection to our proprietary alert correlation and processing engine. You will play a crucial role in ensuring the reliability and uptime of these systems by identifying bottlenecks, automating recovery, and proactively scaling the environment. The ideal candidate is a Linux-focused SRE who enjoys working on custom-built internal products and has a strong background in distributed systems, containerization, and data-driven observability.

Responsibilities:

Maintain and Support Platform Availability: Act as the primary owner for the uptime and health of our internal monitoring and alerting infrastructure.
Incident Management: Represent the team in global incident resolution and participate in a sustainable on-call rotation.
Evolution of Custom Tooling: Make changes and improvements to the monitoring stack to meet evolving business needs.
Lifecycle Integration: Collaborate with Dev and Ops teams to integrate our custom observability solutions into the global software development lifecycle.
Capacity Management: Stay ahead of growth requirements in a high-concurrency, fast-growing SaaS environment.
Code-Level Contributions: Actively work with the team’s codebase (Go/Python) to extend system integrations and automate routine operational "toil."
Auditing & Standards: Conduct regular assessments of the monitoring systems to ensure they meet performance benchmarks and security standards.

Skills:

Experience: 4+ years as an SRE or Systems Engineer in a production environment.
Linux Expertise: Strong Linux administration and performance tuning skills.
Problem Solving: A methodical approach to troubleshooting complex, distributed system failures.
Programming: Experience with at least one language (Go or Python preferred) to interact with our custom-built codebase.
Observability Mindset: Deep understanding of the monitoring domain, SaaS telemetry, and alerting theory.
Cloud Platforms: Experience with cloud platforms (AWS, GCP, or similar).
Scalability: Proven experience running systems in large-scale, heterogeneous environments (a major plus).
Communication: Ability to work with globally distributed teams and communicate technical issues clearly.

Preferred technology stack:

OS: Linux (CentOS/RedHat/Oracle Linux).
Languages: Go, Python, JavaScript/TypeScript.
Cloud & Containers: AWS, Kubernetes, Docker.
Data Pipelines: Experience with message brokers, distributed logs, and TSDBs.
Observability: Custom Alert Processors, Zabbix, Prometheus, Grafana.
Databases: ClickHouse, VictoriaMetrics, MongoDB, PostgreSQL.
Automation: Ansible, Terraform, GitLab CI, ArgoCD.

Qualifications:

B.S. in Computer Engineering, Computer Science, or a related field with 5+ years of relevant experience.

We offer:

A well-coordinated professional team
Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and strong opportunities for personal, professional, and career growth
Additional Health and Life Insurance Package
Employee Assistance Program
25 vacation days