Senior Site Reliability Engineer, Reliability Team - USDS
TikTok · salary not specified · San Jose, California, United States of America · posted April 6, 2026
Job Description
Team Introduction
The Site Reliability Engineering (SRE) team at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. Our team is dedicated to ensuring that TikTok’s core services remain stable, efficient, and resilient at a global scale. We focus on enhancing the observability and operability of our infrastructure, using data-driven insights to safeguard business stability 24/7.
Responsibilities
As a Site Reliability Engineer, you will be responsible for the end-to-end reliability of our production ecosystem. You will balance traditional SRE functions—such as automation and performance tuning—with a specialized focus on disaster recovery and rapid incident response.
- System Design & Optimization: Participate in the full lifecycle of high-concurrency distributed systems. Collaborate with development teams to ensure services are designed for scalability, reliability, and high availability.
- Automation & Efficiency: Build and maintain robust automation tools to eliminate "toil," streamline service deployments, and manage infrastructure as code.
- Observability & Monitoring: Develop and refine monitoring, alerting, and logging systems, along with the SLIs and SLOs they track, to provide deep visibility into service health and performance.
- Disaster Recovery (DR) & Resilience: Lead the design, implementation, and execution of global disaster recovery drills. You will simulate complex failure scenarios and validate failover mechanisms to ensure the platform remains operational under extreme conditions.
- Incident Management & Response: Serve as a key responder for high-priority production incidents. You will coordinate cross-functional "war rooms," drive technical troubleshooting, and lead the path to service restoration.
- Continuous Improvement: Facilitate blameless post-mortems and perform root-cause analysis (RCA). You will transform incident insights into engineering requirements to harden our systems against future outages.
- Capacity Planning: Manage resource allocation and resolve performance bottlenecks so the platform can handle organic growth and massive traffic surges.
Requirements
Minimum Qualifications:
- Bachelor’s degree in Computer Science or a related technical field, or equivalent practical experience.
- Proficiency in one or more programming languages (e.g., Go, Python, Java, or C++).
- Strong understanding of Linux system internals, networking (TCP/IP, DNS, Load Balancing), and distributed systems.
- Experience managing containerized environments (e.g., Kubernetes, Docker).
Preferred Qualifications:
- Proven experience in a high-traffic production environment with a focus on incident response and site stability.
- Hands-on experience with Disaster Recovery strategies, including multi-region failover and data consistency in distributed databases.
- Familiarity with observability and monitoring tools.
- Experience with Infrastructure as Code.