Type of Requisition:
RegularClearance Level Must Currently Possess:
NoneClearance Level Must Be Able to Obtain:
NonePublic Trust/Other Required:
NoneJob Family:
IT Infrastructure and OperationsJob Qualifications:
Skills:
High Performance Computing (HPC), lustre, Portable Batch System (PBS), Python Software Development, Red Hat Enterprise Linux (RHEL)Certifications:
NoneExperience:
10 + years of related experienceUS Citizenship Required:
YesJob Description:
GDIT is looking for an HPC Technical Lead that will provide deep technical expertise and leadership for the WCOSS production environment, ensuring stability, performance, and scalability of NOAA’s operational HPC systems. This role focuses on technical execution, troubleshooting, and guiding the team in implementing best practices for HPC operations.
Key Responsibilities
- Technical Leadership
- Serve as the primary technical authority for all HPC operational matters.
- Lead and drive root‑cause analysis and problem resolution for critical incidents across compute, storage, and interconnect components.
- System Performance & Reliability
- Monitor, analyze, and optimize performance across compute nodes, interconnects, and storage subsystems.
- Conduct proactive health checks and performance tuning to ensure system readiness for 24×7 mission‑critical NOAA workloads.
- Change & Configuration Management
- Lead technical planning and readiness reviews for all system upgrades, patches, and enhancements.
- Maintain configuration baselines and ensure compliance with security requirements (e.g., RMF/STIG).
- Collaboration & Customer Technical Interface
- Act as the primary technical point of contact for NWS/NOAA for detailed HPC discussions, system behavior, and operational issues.
- Coordinate with NOAA scientific teams to understand modeling workload needs and optimize HPC resources accordingly.
Required Skills
- 10+ years of hands‑on HPC systems administration and troubleshooting experience, including Cray, SGI, or comparable large‑scale systems.
- Extensive experience supporting Federal HPC environments, demonstrating readiness for NOAA/NWS operational environments.
- Deep Linux expertise, including SLES, RHEL, and CentOS across multiple HPC platforms.
- Strong technical experience with HPC storage (e.g., Lustre), interconnects (e.g., InfiniBand), and performance tuning of large‑scale computing systems.
- Proven leadership of HPC technical teams, including mentoring and directing system administrators and engineers supporting very large core.
- Demonstrated success performing root cause analysis, escalated troubleshooting, and incident recovery in production HPC environments.
- Experience implementing STIG/RMF security controls across HPC systems and applying DoD‑grade configuration compliance.
- Excellent communication skills, capable of translating complex technical issues to customers and stakeholders.
Preferred Qualifications
- Prior experience supporting NOAA/NWS operational HPC systems, especially in real‑time weather or climate modeling environments.
- Experience designing or improving HPC system architectures, monitoring frameworks, or performance analysis pipelines.
- Experience presenting technical findings to scientific, engineering, or federal customer groups.
Scheduled Weekly Hours:
40Travel Required:
Less than 10%Telecommuting Options:
RemoteWork Location:
Any Location / RemoteAdditional Work Locations:
