GCP SuperComputer Solution Support

⭐ - Featured Role | Apply direct with Data Freelance Hub

This role is for a GCP SuperComputer Solution Support contractor. It is remote (US only), with a contract length of 1 year; the pay rate is not listed. Key skills include GCP, programming, documentation, API integration testing, and HPC.

🌎 - Country

United States

💱 - Currency

$ USD

💰 - Day rate

🗓️ - Date discovered

August 20, 2025

🕒 - Project duration

Unknown

🏝️ - Location type

Remote

📄 - Contract type

Unknown

🔒 - Security clearance

Unknown

📍 - Location detailed

United States

🧠 - Skills detailed

#ETL (Extract, Transform, Load) #Security #Compliance #Cloud #API (Application Programming Interface) #GCP (Google Cloud Platform) #ML (Machine Learning) #Documentation #Integration Testing #Monitoring #Storage #Programming #Deployment #AI (Artificial Intelligence)

Role description

GCP Supercomputer Solutions Support
Remote (US Only)
Contract

1. Project Overview

Google is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions. The supplier will be responsible for supporting and enhancing two key product areas: Cluster Toolkit and HyperCompute Cluster Service (HCS). This work involves a combination of ongoing operational tasks, testing, documentation, and specific development deliverables.

2. Scope of Work & Deliverables

The supplier will be responsible for the services and deliverables detailed below.

2.1. Ongoing Maintenance

● The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work.

2.2. Cluster Toolkit

Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.

Ongoing Responsibilities:

● Stability Testing: Test the stability of new products, beginning with A3U. This includes:
○ Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
○ Setting up and running pairwise tests to identify and report bad nodes.
● Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:
○ Monitoring daily failure chats and flake tools.
○ Reporting on failures and performing advanced handling, such as creating new bug reports and categorizations.
● Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:
○ Gathering existing documents and identifying information gaps.
○ Creating new documentation and updating existing materials.
○ Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.
● Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
● Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates.

Key Deliverables:

● HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
● Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.

2.3. HyperCompute Cluster Service (HCS)

HCS is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.

Key Deliverables:

● API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:
○ HypercomputeClusters: Create, Delete, Update, Get, and List requests and responses.
○ Network: NetworkInitialize params.
○ Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
○ Compute: Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition.
● Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:
○ Creating a cluster that consumes a reservation.
○ Creating a cluster with a new network and new storage.
○ Creating a cluster using a pre-existing network and storage created both outside of HCS and by a previous HCS deployment.
○ Destroying all components of an HCS-created cluster.
○ Destroying a cluster while leaving the network and storage intact.
○ Updating a Slurm cluster to add a new reservation to both new and existing partitions.

About Ampstek

Ampstek is a global IT solutions partner serving clients across North America, Europe, APAC, LATAM, and MEA. We specialize in delivering talent and technology solutions for enterprise-level digital transformation, trading systems, data services, and regulatory compliance.

Contact: Snehil Mishra
📧 snehil@ampstek.com
📞 Desk: 609-360-2673 Ext. 125
🌐 www.ampstek.com
