

GCP SuperComputer Solution Support
This is a remote (US only) contract role for GCP Supercomputer Solution Support, with a contract length of 1 year and an undisclosed pay rate. Key skills include GCP, programming, documentation, API integration testing, and HPC.
Country: United States
Currency: $ USD
Day rate: Unknown
Date discovered: August 20, 2025
Project duration: Unknown
Location type: Remote
Contract type: Unknown
Security clearance: Unknown
Location detailed: United States
Skills detailed: #ETL (Extract, Transform, Load) #Security #Compliance #Cloud #API (Application Programming Interface) #GCP (Google Cloud Platform) #ML (Machine Learning) #Documentation #Integration Testing #Monitoring #Storage #Programming #Deployment #AI (Artificial Intelligence)
Role description
GCP Supercomputer Solutions Support
Remote (US Only)
Contract
1. Project Overview
Google is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions. The supplier will be responsible for supporting and enhancing two key product areas: Cluster Toolkit and HyperCompute Cluster Service (HCS). This work involves a combination of ongoing operational tasks, testing, documentation, and specific development deliverables.
2. Scope of Work & Deliverables
The supplier will be responsible for the services and deliverables detailed below.
2.1. Ongoing Maintenance
- The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work.
2.2. Cluster Toolkit
Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.
Ongoing Responsibilities:
- Stability Testing: Test the stability of new products, beginning with A3U. This includes:
  - Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
  - Setting up and running pairwise tests to identify and report bad nodes.
- Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:
  - Monitoring daily failure chats and flake tools.
  - Reporting on failures and performing advanced handling, such as creating new bug reports and categorizations.
- Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:
  - Gathering existing documents and identifying information gaps.
  - Creating new documentation and updating existing materials.
  - Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.
- Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
- Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates.
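For candidates unfamiliar with pairwise node testing, the idea behind the stability-testing responsibility can be sketched roughly as follows. This is an illustrative sketch only, not the team's actual tooling: the function name find_suspect_nodes and the pair_passes callback are hypothetical, and the callback stands in for whatever actually launches an NCCL test on two nodes through Slurm. The sketch tests every pair of nodes and flags a node as suspect when every pair it took part in failed.

```python
from collections import Counter
from itertools import combinations


def find_suspect_nodes(nodes, pair_passes):
    """Return nodes for which every pairwise test they took part in failed.

    pair_passes(a, b) is a hypothetical callback that runs a two-node test
    and returns True on success. It is an assumption of this sketch, not a
    real Cluster Toolkit or NCCL API.
    """
    runs = Counter()      # number of pairwise tests each node took part in
    failures = Counter()  # number of those tests that failed
    for a, b in combinations(nodes, 2):
        ok = pair_passes(a, b)
        for node in (a, b):
            runs[node] += 1
            if not ok:
                failures[node] += 1
    return sorted(n for n in nodes if runs[n] > 0 and failures[n] == runs[n])


if __name__ == "__main__":
    # Simulated cluster where any pair that includes "n2" fails.
    bad = {"n2"}
    nodes = ["n0", "n1", "n2", "n3"]
    print(find_suspect_nodes(nodes, lambda a, b: a not in bad and b not in bad))
```

Testing all pairs grows quadratically with cluster size, so a real setup would more likely run rounds of disjoint pairs and re-pair only the failing nodes. With only two nodes, a single failure cannot tell which node is at fault.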
Key Deliverables:
- HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
- Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.
2.3. HyperCompute Cluster Service (HCS)
HCS is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.
Key Deliverables:
- API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:
  - HypercomputeClusters: Create, Delete, Update, Get, and List requests and responses.
  - Network: NetworkInitialize params.
  - Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
  - Compute: Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition.
- Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:
  - Creating a cluster that consumes a reservation.
  - Creating a cluster with a new network and new storage.
  - Creating a cluster using a pre-existing network and storage created both outside of HCS and by a previous HCS deployment.
  - Destroying all components of an HCS-created cluster.
  - Destroying a cluster while leaving the network and storage intact.
  - Updating a Slurm cluster to add a new reservation to both new and existing partitions.
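To give a feel for the kind of lifecycle coverage described above, here is a minimal sketch of an integration-style check. It runs against an in-memory fake, not the real HCS API: FakeClusterService, its methods, and the keep_network flag are all hypothetical names invented for this illustration. It exercises Create, Get, List, Update, and Delete, and covers two of the listed journeys in miniature: destroying a cluster while leaving its network intact, then reusing that pre-existing network.

```python
from dataclasses import dataclass, field


@dataclass
class FakeClusterService:
    """In-memory stand-in used only for illustration; NOT the real HCS API."""
    clusters: dict = field(default_factory=dict)
    networks: set = field(default_factory=set)

    def create(self, name, network):
        if name in self.clusters:
            raise ValueError(f"cluster {name!r} already exists")
        self.networks.add(network)  # creates the network if it is new
        self.clusters[name] = {"network": network, "partitions": []}
        return self.clusters[name]

    def get(self, name):
        return self.clusters[name]

    def list_clusters(self):
        return sorted(self.clusters)

    def update(self, name, partitions):
        self.clusters[name]["partitions"] = list(partitions)
        return self.clusters[name]

    def delete(self, name, keep_network=False):
        cluster = self.clusters.pop(name)
        if not keep_network:
            self.networks.discard(cluster["network"])


def run_lifecycle_checks(svc):
    """Exercise the cluster lifecycle and return the names of checks that passed."""
    passed = []
    svc.create("c1", network="net-a")
    assert svc.get("c1")["network"] == "net-a"
    passed.append("create/get")
    assert svc.list_clusters() == ["c1"]
    passed.append("list")
    svc.update("c1", partitions=["debug"])
    assert svc.get("c1")["partitions"] == ["debug"]
    passed.append("update")
    svc.delete("c1", keep_network=True)
    assert svc.list_clusters() == [] and "net-a" in svc.networks
    passed.append("delete keeps network")
    svc.create("c2", network="net-a")  # reuse the pre-existing network
    svc.delete("c2")
    assert "net-a" not in svc.networks
    passed.append("delete all")
    return passed


if __name__ == "__main__":
    print(run_lifecycle_checks(FakeClusterService()))
```

Real integration tests would replace the fake with calls to the deployed service and would also need to cover the storage, compute, and reservation parameters listed above.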
About Ampstek
Ampstek is a global IT solutions partner serving clients across North America, Europe, APAC, LATAM, and MEA. We specialize in delivering talent and technology solutions for enterprise-level digital transformation, trading systems, data services, and regulatory compliance.
Contact:
Snehil Mishra
Email: snehil@ampstek.com
Desk: 609-360-2673 Ext. 125
Website: www.ampstek.com