High Performance Computing (HPC) Administrator - Remote

RTN 2 (Raytheon IDMS)
Tucson, AZ 85701 Work Remotely
This is a multi-level position based on qualifications as they relate to the skills, experience, and requirements for this position.

The RMD Digital Technology organization is looking for a high performance computing (HPC) administrator to support our machine learning (ML) platforms and ecosystems. This candidate would work with a product team, including data scientists and machine learning engineers, to support ML products and services for our Manufacturing and Engineering customers.
The role includes running and maintaining the current environment, including incident and change management as well as capacity and release management associated with the development and execution of the ML solutions. A successful candidate will have a solid understanding of operating systems and databases as well as some exposure to HPC environments utilizing graphics processing units (GPUs). Basic programming skills covering UNIX shell scripting and Python are also required. Some service support may require team members to be available outside standard business hours.
Responsibilities
  • Assist in day-to-day operations, including incident management, working with the team to resolve issues with the infrastructure and systems.
  • Develop and maintain automation and orchestration software and scripting to assist with Machine Learning code release management.
  • Prioritize and efficiently manage deployment and configuration tasks
  • Monitor the health of machines and help with maintenance as needed. This includes capacity management: planning for growth and executing on those plans.
  • Serve as liaison to our outsourced datacenter support and any other operational level agreement (OLA) partners in support of the platform
  • Ownership and support of the Security Support Plan (SSP) and annual authorization to proceed (ATO) certification.
  • Coordinate all patches, both OS-related and application-related.
  • Communicate planned and unplanned outages to the user base
  • Support of the containerized ecosystem
  • Support of workflow processes for:
    • Data ingest (Python)
    • API provisioning service
    • Scheduling (HTCondor)
    • Watchdog jobs monitoring for data and performing data ingest
  • Support of hardware-specific configurations, tuning, and GPUs
  • Database patching and versioning coordination with datacenter support
  • Coordination of changes/maintenance to Schema/table (Production) with datacenter support
  • Monitor system logs associated with the application aspects of the system
Required Education, Experience and Skills:
  • Bachelor's degree in IT or STEM and 2 years of experience, or a Master's degree in IT or STEM and 0 years of experience; in lieu of a degree, 8 years of additional experience is required.
  • Experience with networking concepts: routing, switching, firewalls, load balancers, proxy services, and protocols (TCP, UDP, HTTP, TLS, SIP, SMTP, SNMP, LDAP)
  • Experience with Linux Server operating systems
  • Experience with enterprise-standard database systems (e.g. Oracle, Microsoft SQL Server)
  • Experience working both independently and in a team-oriented, collaborative environment is essential
  • Experience in Unix shell scripting, Python, PL/SQL, PowerShell, and Bash
  • Application development experience using tools such as Jenkins, Gradle/Ant, SVN/Git, Artifactory, Automation
  • Experience with computer hardware and architecture, specifically high performance computing (HPC) utilizing GPUs
Desired Education, Experience and Skills:
  • Customer service focused with excellent communication (written and verbal) and interpersonal skills. Must be able to effectively work with customers, coworkers, vendors and management
  • Working knowledge of containers and/or orchestration platforms (e.g. Docker, Singularity, Kubernetes, Rancher)
  • Familiarity with agile methodology, ideally Scaled Agile Framework (SAFe)

    This position requires either a U.S. Person or a Non-U.S. Person who is eligible to obtain any required Export Authorization.

178844
