DevOps Engineer

EPAM · salary not specified · location not specified · company website · published June 5, 2026

Job description

We are looking for a DevOps Engineer to help maintain production Kubernetes-based systems for a major technology company that specializes in infrastructure supporting AI research.
This position brings together site reliability engineering, observability and SQL production support duties, with a clear focus on monitoring, metrics, dashboards and operational excellence. The right candidate will partner with established engineering and research teams to uphold system reliability, resolve production issues and steadily strengthen visibility into system health and performance across an Azure Stack environment.
Responsibilities
  • Design, maintain and progressively improve observability solutions, including dashboards and visual reports built with Grafana or comparable monitoring tools
  • Set up, implement and oversee metrics, SLIs, SLOs and alerting approaches to guarantee reliability and transparency across production systems
  • Deliver business-hours operational support for Kubernetes-based production environments, involving initial troubleshooting, log review and metric-based investigations
  • Assist with SQL-based systems as part of production operations, contributing to issue examination and performance diagnostics
  • Examine incidents and system behavior to pinpoint root causes, take part in post-incident reviews and suggest enhancements for monitoring and reliability practices
  • Work hand in hand with engineering, platform and research teams to raise observability standards, refine operational processes and strengthen overall system stability
  • Add to documentation, knowledge-sharing activities and ongoing improvement initiatives within the team
Requirements
  • At least 2 years of relevant hands-on professional experience
  • Demonstrated track record in Site Reliability Engineering (SRE), DevOps, Production Support or equivalent roles working with production systems
  • Practical exposure to observability and monitoring stacks including Grafana, Prometheus, Elastic Stack, Datadog or similar tools
  • Strong command of Linux systems, supported by solid troubleshooting and log analysis capabilities
  • Working experience supporting Kubernetes-based environments in production settings
  • Background in delivering SQL production support, including query troubleshooting and basic performance diagnostics
  • Confident scripting skills in Python, Bash or similar languages for automation and day-to-day operational activities
  • Capability to investigate incidents, determine underlying causes and drive continuous improvement efforts
  • Effective communication and teamwork skills for working successfully with distributed and cross-functional teams
  • Proficient English communication skills, both spoken and written, at a B2+ level or higher
Nice to have
  • Experience handling APIs and integration patterns to link services together and enable system interoperability
  • Knowledge of databases, covering administration, tuning and production-level support activities
  • Exposure to Infrastructure as Code development and maintenance for automating environment provisioning and configuration
  • Practical experience using Microsoft Azure to manage cloud resources and run production workloads

Skills

  • devops
  • kubernetes
  • linux
  • monitoring tools
  • observability and troubleshooting in distributed systems
  • sql
  • scripting languages
  • apis and integration
  • bash
  • databases
  • elastic stack
  • grafana