Big Data Site Reliability Engineering (SRE) Manager

Cupertino, CA 95014
  • Job Code: 200208059
Summary

Posted: Mar 25, 2021

Weekly Hours: 40

Role Number: 200208059

Imagine what you could do here! At Apple, great ideas have a way of becoming phenomenal products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish!

Apple's Applied Machine Learning team has built systems for a number of large-scale data science applications. We work on many high-impact projects that serve various Apple lines of business. We use the latest open source technology, and as committers on some of these projects, we are pushing the envelope. Working with multiple lines of business, we manage many streams of Apple-scale data. We bring it all together and extract the value. We do all this with an outstanding group of software engineers, data scientists, SRE/DevOps engineers, and managers.

Key Qualifications

  • We are looking for 10 years of management experience leading a team of engineers.
  • We wish to see a hands-on manager who loves troubleshooting complex performance and scale problems.
  • Excellent problem-solving, critical-thinking, and communication skills; able to lead by example to motivate and challenge the team to deliver their best.
  • 5+ years of experience in Hadoop-based technologies: HDFS/YARN cluster administration, Hive, Spark.
  • Strong experience leading cross-functional initiatives and providing thought leadership.
  • Ability to zoom in and zoom out to clear up ambiguity and set a clear path forward.
  • Experience managing Hadoop/YARN clusters with thousands of nodes and tens of petabytes of data running tens of thousands of jobs.
  • A passion for automation, creating tools using Python, Java, or other JVM languages.
  • Strong expertise in troubleshooting complex production issues.
  • The candidate should be adept at prioritizing multiple issues in a high-pressure environment.
  • Should be able to understand sophisticated architectures and be comfortable working with multiple teams.
  • Ability to conduct performance analysis and troubleshoot large-scale distributed systems.
  • Should be highly proactive with a keen focus on improving the uptime and availability of our mission-critical services.

Description

We are seeking a hands-on SRE manager who has experience managing large big data environments spread across thousands of nodes and petabytes of data.

You have grown into leadership roles after proving your technical skills in individual contributor roles, but you still enjoy hands-on work when the situation calls for it.

You have designed and built large data environments for availability, security, and reliability.

You keep yourself informed about the choices and trade-offs as new technology evolves in the big data landscape.

You have an eye for talent, and you hire and grow your engineers by mentoring and challenging them.

You will collaborate across many teams to deliver on projects related to the big data platform and data pipelines, and provide SRE support for the reliability of these managed services.

You will have a significant opportunity to influence and shape our big data platform strategy and data products as we work on the next generation of our architecture, platform, and processes.

Education & Experience

BS in computer science with 7-10 years of experience, or MS with 5-7 years of experience, or related experience.

Additional Requirements

  • Experience with running infrastructure in AWS and Kubernetes.
  • Experience building and operating large-scale Hadoop/Spark/Kafka data infrastructure used for machine learning in a production environment.
  • Experience in tuning complex Hive and Spark queries.
  • Expertise in debugging Hadoop/Spark/Hive issues using NameNode, DataNode, NodeManager, and Spark executor logs.
  • Experience in capacity management on multi-tenant Hadoop clusters.
  • Experience in workflow and data pipeline orchestration (Airflow, Oozie, Jenkins).
  • Experience in Jupyter-based notebook infrastructure.
