Site Reliability Engineer (Telco team)

JettyCloud · Remote · Office · 1 month ago

We are looking for an experienced SRE — someone who takes ownership, solves problems independently, and treats production systems with care and respect.
You’ll join a team that keeps business-critical telephony and communication services running with 99.999% availability. We need someone who not only reacts to incidents but also anticipates them — who improves systems, automates routine tasks, and helps shape how the team works.

Responsibilities

Support and maintain Linux-based servers and telephony services in production.
Investigate and resolve incidents in a high-load, distributed environment.
Participate in on-call shifts and ensure the stability of systems under strict SLAs.
Analyze service performance, reliability, and architecture bottlenecks; propose improvements.
Work with development teams to safely deliver and validate changes before production deployment.
Contribute ideas and help evolve team processes, automation, and monitoring practices.

Requirements

Strong experience with UNIX/Linux systems and using the CLI for troubleshooting.
Good understanding of networking protocols, including SIP.
Strong hands-on experience with Kubernetes (k8s) and containerized environments.
Proven track record of working in production environments, with a careful and methodical approach to changes (testing before deployment, rollback planning, risk mitigation).
Understanding of high-availability systems, fault tolerance, and performance optimization.
Experience automating tasks with Python, Golang, or Shell scripts.
Mindset of an SRE: you treat operations as an engineering discipline and continuously look for ways to make systems more reliable and efficient.
Good command of English (B2 or higher) — ability to communicate effectively with distributed international teams (both written and spoken).

Would be a plus

Deep expertise in one or more areas (please highlight your strengths in your application).
Hands-on experience with Kamailio, Apache Kafka, Nginx, ZeroMQ.
Experience with AWS/EKS, Terraform, and Ansible for deployment and infrastructure automation.
Experience with CI/CD pipelines (e.g., GitLab CI, Jenkins, ArgoCD).
Knowledge of monitoring stacks like Zabbix, TICK, ELK, Grafana.

What you’ll get

Work in a strong, experienced SRE team that maintains global infrastructure across multiple regions.
Hands-on experience debugging Java and C++ applications in large distributed systems (Kafka, ZooKeeper, Kamailio, Nginx, etc.).
Opportunity to influence how the team works — your ideas for tools, automation, or process improvements will be heard and implemented.
Real experience achieving five-nines availability (99.999%) in production.
Continuous learning, complex technical challenges, and a supportive environment.

We offer:

Well-coordinated professional team
Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and strong opportunities for self-realization and professional and career growth
Additional Health and Life Insurance Package
Employee Assistance Program
25 vacation days
ReBenefit Platform Account
This role requires on-site presence at our office 4 days a week to support effective collaboration and teamwork.