About RingCentral Video
RingCentral Video (RCV) is a robust AI-powered video conferencing and collaboration platform for teams of any scale, covering the entire video communications lifecycle.
The platform supports video meetings, conferences, hybrid workspaces, and meeting room integration, ensuring a seamless experience whether participants are working online, in the office, or in a hybrid format. AI capabilities include automatic transcriptions, instant meeting summaries, contextual notes, live captions, and intelligent noise reduction, making every meeting productive and inclusive.
Position Overview
As a Site Reliability Engineer for RCV, you'll be responsible for the reliability and performance of the video communications platform. You'll take part in incident management, proactively close observability gaps, support software delivery, ensure that changes move safely and predictably from development to production, and help build self-healing infrastructure. To achieve this, we are looking for a responsible and proactive engineer.
Responsibilities
- Manage geo-distributed cloud infrastructure on AWS and EKS, using IaC (Terraform) and GitOps (FluxCD) to ensure scalability;
- Participate in an on-call rotation of 2 weeks on (12 hours per day, in primary/backup roles) followed by 3 weeks off, to ensure continuous production support and timely response to operational needs;
- Participate in service capacity planning, software performance analysis, and system configuration;
- Design, consult on, re-platform, and refactor observability for the current cloud infrastructure (Prometheus, Grafana, VictoriaMetrics, centralized logging and alerting);
- Participate in release management, working closely with development teams to implement GitOps principles in release processes and manage CI/CD pipelines using GitLab CI;
- Conduct blameless post-mortems to learn from incidents and prevent their recurrence;
- Develop and test disaster recovery plans and runbooks to ensure business continuity;
- Implement security best practices and controls within the infrastructure to meet compliance standards and prepare for audits.
Requirements
- Cloud & Infrastructure: Work with AWS production environments - read and write Terraform configurations, understand IaC principles;
- Kubernetes: Manage Kubernetes clusters - troubleshoot pod failures, set resource limits, work with scaling, understand networking;
- CI/CD: Create and maintain CI/CD pipelines (GitLab CI preferred);
- Observability: Manage monitoring stacks (Prometheus, Grafana) - write PromQL queries, create dashboards, configure effective alerts;
- Troubleshooting: Debug performance issues in distributed systems - analyze network traces, read application logs for root cause analysis;
- Performance: Identify and eliminate bottlenecks - interpret metrics, optimize resource allocation and costs;
- Incident Management: Participate in incident response - quickly isolate problems, coordinate with other teams through war rooms/incident channels, document event timelines.
Nice to Have
- A reliability-oriented mindset with a focus on designing and building resilient architectures;
- In-depth troubleshooting skills - ability to use and implement profiling tools (mostly APM);
- Previous SRE experience or knowledge, giving you a heightened awareness of what data to collect, how to display it, and how users can benefit from it;
- A deep understanding of Kubernetes. This is one of our core tools, and the better you understand it, the more value you can get from it;
- Hands-on practice with Istio/Gloo;
- Knowledge of scripting languages such as Python or Go;
- Understanding of the principles and limitations of caching mechanisms (Redis);
- Experience with message queues (Strimzi Kafka);
- Familiarity with SQL and NoSQL database management systems (Aurora, DocumentDB).
What We Offer
- Well-coordinated professional team
- Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and strong opportunities for self-realization and professional and career growth
- Additional Health and Life Insurance Package
- Employee Assistance Program
- 25 vacation days