Tech Lead Site Reliability Engineer, TikTok Generalized Arch USTO

TikTok · salary not specified · San Jose, California, United States of America · company website · published August 12, 2025

Company: TikTok

Source: company website

Published: August 12, 2025

Salary: not specified

Job description

TikTok’s Generalized Architecture US Tech and Operations team is dedicated to ensuring that TikTok’s core services run stably, efficiently, and cost-effectively at global scale. We focus on enhancing the observability and operability of our infrastructure and services, using data-driven insights to safeguard business stability 24/7.
Responsibilities:
- Ensure the stability and reliability of TikTok’s core services; respond quickly to production incidents and build mechanisms and platforms to continuously improve incident handling efficiency.
- Define and maintain system quality SLAs through continuous, comprehensive data operations; identify and manage system risks to improve reliability, scalability, and performance.
- Participate in TikTok’s disaster recovery initiatives, including risk assessment, disaster recovery design, capacity planning, and contingency plan development, to strengthen system resilience and fault tolerance.
- Develop and accumulate best practices, tools, and frameworks for operations and maintenance; provide guidance on system architecture design and component selection; produce high-quality technical and operational documentation.
Requirements:
Minimum Qualifications
- Bachelor’s degree or above in Computer Science or a related field.
- Solid foundation in computer science and software engineering, with understanding of operating systems (especially Linux), storage systems, and network I/O principles.
- Proficiency in one or more programming languages, such as Python, Go, Java, PHP, C, or C++.
- Strong problem-solving skills with a systematic approach, effective communication abilities, and a strong sense of ownership and responsibility.
Preferred Qualifications
- 5+ years of relevant experience in a large-scale internet or cloud-based business environment.
- Hands-on experience building AI-powered tools to improve SRE/operations efficiency (e.g., intelligent runbooks, automated incident response, anomaly detection, or self-healing systems).

Skills

Go
Linux
Python
Java
PHP
C++
