Senior DevOps Engineer

EPAM · salary not specified · location not specified · source: company website · published June 5, 2026

Job description

We are seeking a Senior DevOps Engineer to support production Kubernetes-based systems for a large tech company focused on infrastructure that powers AI research.
This role combines site reliability engineering, observability and SQL production support responsibilities, with a strong emphasis on monitoring, metrics, dashboards and operational excellence. The ideal candidate will work closely with existing engineering and research teams to ensure system reliability, troubleshoot production issues and continuously improve visibility into system health and performance within an Azure Stack environment.
Responsibilities
  • Build, maintain and continuously enhance observability solutions, including dashboards and visualizations using Grafana or similar monitoring tools
  • Define, implement and manage metrics, SLIs, SLOs and alerting strategies to ensure reliability and visibility across production systems
  • Provide business-hours operational support for Kubernetes-based production environments, covering basic troubleshooting, log analysis and metric-driven investigations
  • Support and troubleshoot SQL-based systems as part of production operations, assisting with issue analysis and performance investigations
  • Analyze incidents and system behaviors to identify root causes, contribute to post-incident reviews and recommend improvements to monitoring and reliability practices
  • Collaborate closely with engineering, platform and research teams to improve observability standards, operational processes and overall system reliability
  • Contribute to documentation, knowledge sharing and continuous improvement initiatives within the team
Requirements
  • A minimum of 3 years of relevant professional experience
  • Proven background in Site Reliability Engineering (SRE), DevOps, Production Support or similar roles supporting production systems
  • Hands-on experience with observability and monitoring stacks such as Grafana, Prometheus, Elastic Stack, Datadog or equivalent tools
  • Solid understanding of Linux systems, combined with strong troubleshooting and log analysis skills
  • Practical experience supporting Kubernetes-based environments in production
  • Experience providing SQL production support, including query troubleshooting and basic performance analysis
  • Proficiency in scripting with Python, Bash or similar languages for automation and operational tasks
  • Ability to analyze incidents, uncover root causes and contribute to continuous improvement initiatives
  • Strong communication and collaboration skills to work effectively with distributed and cross-functional teams
  • Excellent oral and written communication skills in English at a B2+ level or higher
Nice to have
  • Experience working with APIs and integration patterns to connect services and support system interoperability
  • Familiarity with databases, including administration, optimization and production-level support
  • Background in Infrastructure as Code development and maintenance for automating the provisioning and configuration of environments
  • Hands-on experience with Microsoft Azure for managing cloud resources and deploying production workloads

Skills

  • devops
  • kubernetes
  • linux
  • monitoring tools
  • observability and troubleshooting in distributed systems
  • sql
  • scripting languages
  • apis and integration
  • bash
  • databases
  • elastic stack
  • grafana