GCP SuperComputer Solution Support

This role is for a GCP SuperComputer Solution Support contractor. It is remote (US only) with a contract length of 1 year; the pay rate is not listed. Key skills include GCP, programming, documentation, API integration testing, and HPC.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
-
🗓️ - Date discovered
August 20, 2025
🕒 - Project duration
Unknown
-
🏝️ - Location type
Remote
-
📄 - Contract type
Unknown
-
🔒 - Security clearance
Unknown
-
📍 - Location detailed
United States
-
🧠 - Skills detailed
#ETL (Extract, Transform, Load) #Security #Compliance #Cloud #API (Application Programming Interface) #GCP (Google Cloud Platform) #ML (Machine Learning) #Documentation #Integration Testing #Monitoring #Storage #Programming #Deployment #AI (Artificial Intelligence)
Role description
GCP Supercomputer Solutions Support
Remote (US Only)
Contract

1. Project Overview
Google is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions. The supplier will be responsible for supporting and enhancing two key product areas: Cluster Toolkit and HyperCompute Cluster Service (HCS). This work involves a combination of ongoing operational tasks, testing, documentation, and specific development deliverables.

2. Scope of Work & Deliverables
The supplier will be responsible for the services and deliverables detailed below.

2.1. Ongoing Maintenance
● The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work.

2.2. Cluster Toolkit
Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.
Ongoing Responsibilities:
● Stability Testing: Test the stability of new products, beginning with A3U. This includes:
   ○ Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
   ○ Setting up and running pairwise tests to identify and report bad nodes.
● Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:
   ○ Monitoring daily failure chats and flake tools.
   ○ Reporting on failures and performing advanced handling, such as creating new bug reports and categorizations.
● Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:
   ○ Gathering existing documents and identifying information gaps.
   ○ Creating new documentation and updating existing materials.
   ○ Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.
● Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
● Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates.
Key Deliverables:
● HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
● Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.

2.3. HyperCompute Cluster Service (HCS)
HCS is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.
Key Deliverables:
● API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:
   ○ HypercomputeClusters: Create, Delete, Update, Get, and List requests and responses.
   ○ Network: NetworkInitialize params.
   ○ Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
   ○ Compute: Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition.
● Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:
   ○ Creating a cluster that consumes a reservation.
   ○ Creating a cluster with a new network and new storage.
   ○ Creating a cluster using a pre-existing network and storage created both outside of HCS and by a previous HCS deployment.
   ○ Destroying all components of an HCS-created cluster.
   ○ Destroying a cluster while leaving the network and storage intact.
   ○ Updating a Slurm cluster to add a new reservation to both new and existing partitions.

About Ampstek
Ampstek is a global IT solutions partner serving clients across North America, Europe, APAC, LATAM, and MEA.
We specialize in delivering talent and technology solutions for enterprise-level digital transformation, trading systems, data services, and regulatory compliance.

Contact: Snehil Mishra
📧 snehil@ampstek.com
📞 Desk: 609-360-2673 Ext. 125
🔗 LinkedIn
🌐 www.ampstek.com