We are looking for an experienced SRE — someone who takes ownership, solves problems independently, and treats production systems with care and respect.
You’ll join a team that keeps business-critical telephony and communication services running with 99.999% availability. We need someone who not only reacts to incidents but also anticipates them — who improves systems, automates routine tasks, and helps shape how the team works.
Responsibilities
- Support and maintain Linux-based servers and telephony services in production.
- Investigate and resolve incidents in a high-load, distributed environment.
- Participate in on-call shifts and ensure the stability of systems under strict SLAs.
- Analyze service performance, reliability, and architecture bottlenecks; propose improvements.
- Work with development teams to safely deliver and validate changes before production deployment.
- Contribute ideas and help evolve team processes, automation, and monitoring practices.
Requirements
- Strong experience with UNIX/Linux systems and using the CLI for troubleshooting.
- Good understanding of networking protocols, including SIP.
- Strong hands-on experience with Kubernetes (k8s) and containerized environments.
- Proven track record of working in production environments, with a careful and methodical approach to changes (testing before deployment, rollback planning, risk mitigation).
- Understanding of high-availability systems, fault tolerance, and performance optimization.
- Experience automating tasks with Python, Golang, or shell scripts.
- Mindset of an SRE: you treat operations as an engineering discipline and continuously look for ways to make systems more reliable and efficient.
- Good command of English (B2 or higher) — ability to communicate effectively with distributed international teams (both written and spoken).
Would be a plus
- Deep expertise in one or more areas (please highlight your strengths in your application).
- Hands-on experience with Kamailio, Apache Kafka, Nginx, ZeroMQ.
- Experience with AWS/EKS, Terraform, and Ansible for deployment and infrastructure automation.
- Experience with CI/CD pipelines (e.g., GitLab CI, Jenkins, ArgoCD).
- Knowledge of monitoring stacks such as Zabbix, TICK, ELK, and Grafana.
What you’ll get
- Work in a strong, experienced SRE team that maintains global infrastructure across multiple regions.
- Hands-on experience debugging Java and C++ applications in large distributed systems (Kafka, ZooKeeper, Kamailio, Nginx, etc.).
- Opportunity to influence how the team works — your ideas for tools, automation, or process improvements will be heard and implemented.
- Real experience achieving five-nines availability (99.999%) in production.
- Continuous learning, complex technical challenges, and a supportive environment.
We offer
- Well-coordinated professional team
- Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and great opportunities for self-realization, professional development, and career growth
- Additional Health and Life Insurance Package
- Employee Assistance Program
- 25 vacation days
- ReBenefit Platform Account
This role requires on-site presence at our office 4 days a week to support effective collaboration and teamwork.