Senior Site Reliability Engineer

EPAM · salary not specified · location not specified · company website · posted June 5, 2026

Job description

We are looking for a skilled Senior Site Reliability Engineer to deliver advanced support and reliability engineering for critical cloud-based systems. The role focuses on ensuring reliability, performance and observability across AWS environments, with strong emphasis on Kubernetes, advanced monitoring, database expertise and distributed systems such as Kafka. The position involves incident response, proactive reliability improvements, automation and collaboration with engineering teams to strengthen system resilience.
Responsibilities
Design, implement and maintain observability for AWS Cloud and Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, Fluent Bit, OpenSearch, CloudWatch, CloudTrail, Athena and other modern tooling
Monitor and troubleshoot EKS, Aurora RDS (PostgreSQL) and other AWS infrastructure at an advanced level
Implement automated remediations and self-healing mechanisms
Participate in incident response, root-cause analysis and postmortems
Implement security measures impacting cluster reliability (IAM, network policies, Config)
Support and maintain current AWS infrastructure
Collaborate with L3 teams to escalate, troubleshoot and resolve operational issues
Requirements
3+ years of experience in site reliability engineering or advanced support roles
Expert-level proficiency in Grafana, Prometheus and OpenSearch
Expertise in OpenTelemetry, Fluent Bit, CloudWatch and CloudTrail
Strong understanding of distributed tracing, metrics pipelines and log aggregation
Advanced troubleshooting and operational experience with EKS, RDS (PostgreSQL) and MSK (Kafka)
Knowledge of AWS networking (VPC, security groups, route tables) and IAM (roles and policies)
Strong understanding of AWS networking, security, scaling and reliability patterns
Advanced Kubernetes knowledge in operations, debugging, networking and scaling
Strong background in incident response, RCA, postmortems and SLA management
Scripting skills in Bash or Python, with automation of cloud operations, observability integrations and incident recovery
Excellent structured problem-solving skills, strong communication across technical and non-technical teams, and comfort working in a fast-paced Agile environment
Nice to have
Familiarity with AKS (Kubernetes), Azure Monitor, Application Insights and Log Analytics
Knowledge of Cosmos DB and PostgreSQL on Azure
Expertise in Azure DevOps
Proficiency in Terraform and Argo CD

Skills

devops
amazon web services
grafana
incident management (itsm)
kubernetes
opentelemetry
prometheus
aws cloudtrail
aws iam
aws x-ray
amazon cloudwatch
amazon elastic kubernetes service
