Senior/Lead AI DevOps/SRE

EPAM · salary not specified · location not specified · company website · published June 5, 2026

Job description

We are seeking an experienced Lead AI DevOps/SRE to join our team.
In this role, you will work closely with data scientists and software developers to integrate our AI deployments smoothly and improve their operational efficiency. You will deploy, maintain, and scale our AI solutions, including LLMs and RAG systems.
As a key team member, you will lead both traditional DevOps work and MLOps practices, helping our AI initiatives succeed across the organization.
Responsibilities
  • Implement and maintain CI/CD pipelines for AI and machine learning projects, ensuring robust deployment strategies and continuous integration
  • Monitor and ensure the reliability, availability, and performance of AI applications, particularly those involving LLMs and RAG
  • Collaborate with AI research teams to operationalize machine learning models and systems efficiently
  • Develop and enforce best practices for version control, configuration management, and testing of AI-driven software solutions
  • Utilize MLOps tools such as Kubeflow, MLflow, or TensorFlow Extended (TFX) to streamline the machine learning lifecycle from experimentation to production
  • Implement monitoring solutions that track both system metrics and model performance to facilitate proactive issue resolution
  • Participate in on-call rotations to support the operational health of critical systems, employing SRE principles to meet service-level objectives (SLOs) and reduce downtime
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related field
  • Proven experience as a DevOps Engineer or SRE, with a strong background in software development and automation
  • Knowledge of Generative AI Operations
  • Expertise in deploying and managing LLMs, including techniques such as RAG
  • Proficiency with CI/CD tools (Jenkins, GitLab CI, CircleCI) and infrastructure as code (Terraform, Ansible)
  • Solid knowledge of containerization and container orchestration technologies (Docker, Kubernetes)
  • Familiarity with MLOps tools and practices to support machine learning lifecycle management
Nice to have
  • Experience with cloud services (AWS, GCP, Azure), particularly in AI/ML deployments
  • Background in monitoring tools like Prometheus, Grafana, and the ELK stack
  • Understanding of Python, particularly in data science and machine learning contexts
  • Certification in Kubernetes, AWS/GCP/Azure, or similar technologies

Skills

  • Generative AI Operations
  • DevOps
  • CI/CD
  • Jenkins
  • GitLab
  • Ansible
  • Kubernetes
  • Docker
  • AWS
  • Python