AI Cluster Software Development and Validation Engineer

Intel
Hillsboro, OR 97123
  • Job Code
    JR0189251
Job Description

In this role you will join a highly dynamic team building a best-in-class cluster, with the fastest scale-up and scale-out, to support the AI workloads of today and the future on Intel.

You will join a team to pull together software components and systems to validate and deliver a complete cluster/paradigm that can morph dynamically to the needs of AI workloads.
This cluster will need to have all the requisite frameworks, libraries, system drivers, and software to visualize results. Beyond that, telemetry of all parts of the stack (system, network/fabric, storage) will need to be collected and consolidated in dashboards, where you'll analyze performance issues so that you can fine-tune the cluster based on AI workload needs and bottlenecks.

Because this is a brand new endeavor, you will find components missing from the software stacks, or elements that aren't fully functional. For these you will collaborate with partner teams and external providers (e.g. network/fabric, storage, and software vendors) to request features and file bugs. At the same time, you'll have the opportunity to create shim layers or adopt workarounds so that you can keep moving quickly.

From a software development perspective, you may need to develop a variety of solutions like:

  • Scripts to extract telemetry and validate the performance of cluster components and the cluster as a whole
  • Methods to ingest and visualize the performance of the cluster to find problems
  • Automatic analysis of telemetry data to anticipate and/or address performance problems
  • Shim layers to account for capabilities that don't exist yet to support AI workloads
  • Containerization of workloads
  • Implementation of software components (especially if open source) to address interoperability gaps


To do this well you will need these superpowers:

  • System/Cluster and software performance analysis skills
  • Agility and creativity in finding solutions to performance and interoperability issues
  • Strong discipline to deliver flexible and reliable reference designs that customers will want to use
  • Great technical curiosity


Your integration and testing work will address many combinations of hardware components: Ethernet or fabric connectivity, or other node connectivity (GPU direct RDMA); multiple storage options; and both scale-out (across systems) and scale-up (within systems).


Qualifications

Minimum qualifications are required to be initially considered for this position. Preferred qualifications are in addition to the minimum requirements and are considered a plus factor in identifying top candidates.

Minimum Qualifications:


Master's degree in Computer Science, Computer Engineering or any other related field and 3+ years of experience OR PhD degree in Computer Science, Computer Engineering or any other related field.

Experience in:

  • 3+ years of Linux experience supporting complex servers
  • 3+ years of experience managing heterogeneous clusters with high performance fabric interconnects and high-speed storage capabilities
  • 3+ years of experience with workload and system performance analysis in complex clusters
  • 3+ years of experience in AI software workloads


Preferred Qualifications:

Experience in:

  • Programming in at least one of the following languages (C, Python or Bash)
  • Managing cluster systems with 100+ nodes
  • Gigabit Ethernet
  • High performance fabric interconnects
  • Managing AI and HPC clusters with discrete GPUs (Nvidia, AMD or Intel)
  • Containers (Singularity, Podman, Charliecloud, Docker, Kubernetes, others)
  • Administering high performance cluster file systems (Lustre, GPFS, DDN, others)
  • Supporting AI frameworks (TensorFlow, others)
  • MPI libraries, preferably Intel MPI
  • AI applications and using AI frameworks
  • Containerization as it pertains to HPC / AI workloads
  • Provisioning capabilities (Kubernetes, OpenShift, etc.)
  • Collecting and analyzing telemetry in all parts of the HW/SW stack in a cluster
  • oneDNN, oneVPL, oneMKL, oneCCL

Inside this Business Group

Intel Architecture, Graphics, and Software (IAGS) brings Intel's technical strategy to life. We have embraced the new reality of competing at a product and solution level, not just a transistor one. We take pride in reshaping the status quo and thinking exponentially to achieve what's never been done before. We've also built a culture of continuous learning and persistent leadership that provides opportunities to practice until perfection and filter ambitious ideas into execution.



Other Locations

US, Arizona, Phoenix; US, California, Santa Clara; US, New Mexico, Albuquerque


Intel Corporation will require all new U.S. employees to be fully vaccinated for Covid-19 as a condition of hire unless they have an approved accommodation in place under applicable law. Newly hired employees will be required to provide proof of vaccination prior to their start date.



Posting Statement

All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genetic information, military and veteran status, marital status, pregnancy, gender, gender expression, gender identity, sexual orientation, or any other characteristic protected by local law, regulation, or ordinance.
