Lead Site Reliability Engineer

EPAM · salary not specified · location not specified · company website · published June 5, 2026

Company: EPAM

Source: company website

Published: June 5, 2026

Salary: not specified

Job description

We are seeking a Lead Site Reliability Engineer with substantial expertise in enhancing the reliability, availability, performance and scalability of production environments. The right candidate will bring a strong software engineering mindset paired with deep operational knowledge, cloud expertise, automation capabilities and practical incident management experience.
This position centers on engineering dependable systems, minimizing operational toil, strengthening observability and supporting engineering teams in delivering services that align with established reliability targets.
Responsibilities
Architect and deliver solutions that enhance system reliability, availability and performance
Establish and track SLIs, SLOs and error budgets
Develop automation that eliminates manual operational effort and recurring tasks
Enhance monitoring, logging, tracing and alerting capabilities
Engage in incident response, root cause investigation and postmortems
Partner with development teams to strengthen service resilience and operability
Maintain production systems and assist in resolving complex technical problems
Contribute to capacity planning, performance tuning and disaster recovery efforts
Advocate reliability engineering practices across teams
Requirements
5+ years of experience in SRE, DevOps, Platform Engineering or Production Engineering roles
At least 1 year of relevant leadership experience
Practical experience operating production systems at scale
Familiarity with cloud platforms including AWS, Azure or GCP
Deep knowledge of observability tooling covering monitoring, logging, tracing and alerting
Proven experience with incident management, postmortems and root cause analysis
Solid scripting or programming abilities in Python, Go, Bash or comparable languages
Working experience with Linux systems, networking and distributed systems fundamentals
Familiarity with containers and orchestration platforms including Docker and Kubernetes
Sound understanding of CI/CD, automation and Infrastructure as Code
Excellent problem-solving abilities and the capacity to perform under pressure
Proficient English communication skills (B2 level or higher)
Nice to have
Background in defining SLIs, SLOs and error budgets
Hands-on experience with Prometheus, Grafana, Datadog, New Relic, Splunk, ELK or comparable tools
Familiarity with Terraform or other IaC technologies
Exposure to chaos engineering or resilience testing
Experience with high-availability systems and disaster recovery planning
Certifications in cloud or Kubernetes

Skills

site reliability engineering
devops
platform engineering
cloud
python.core
linux
docker
kubernetes
ci/cd
infrastructure as code development and maintenance
prometheus
grafana
