This document outlines the deployment steps for provisioning an A3 Mega
(a3-megagpu-8g) Google Kubernetes Engine (GKE) cluster that is ideal for running
large-scale artificial intelligence (AI) and machine learning (ML) training
workloads.
Before you begin
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a
Cloud Shell
session starts and displays a command-line prompt. Cloud Shell is a shell environment
with the Google Cloud CLI
already installed and with values already set for
your current project. It can take a few seconds for the session to initialize.
To identify the regions and zones where the a3-megagpu-8g machine type is
available, run the following command:
gcloud compute machine-types list --filter="name=a3-megagpu-8g"
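If you only want the zone names, you can narrow the output with a format flag. This is an optional variation on the preceding command, not a required step:
# List only the zones that offer a3-megagpu-8g, one per line.
gcloud compute machine-types list --filter="name=a3-megagpu-8g" --format="value(zone)" | sort -u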
Ensure that you have enough GPU quota. Each a3-megagpu-8g machine has
8 NVIDIA H100 80GB GPUs attached, so your quota in the selected region must
cover at least 8 NVIDIA H100 80GB GPUs for each node in your cluster. To view
your quota, see View the quotas for your project; if you don't have enough,
request a higher quota.
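If you want a quick look at your H100-related quota from the CLI, one option is to dump the region's quota list and search it. This is an optional, hedged check; the authoritative view is the quotas page in the console, and the exact quota metric name for A3 Mega GPUs may differ from the search term used here:
# Dump the region's quotas as JSON and show entries that mention H100 (example region shown).
gcloud compute regions describe us-central1 --format="json(quotas)" | grep -B 1 -A 1 H100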
Ensure that you have enough Filestore quota. You need a minimum of
10,240 GiB of zonal (also known as high scale SSD) capacity.
If you don't have enough quota,
request a quota increase.
Overview
To deploy the cluster, you must complete the following:
Install Cluster Toolkit.
Create a reservation or get a reservation name from your Technical Account Manager (TAM).
Create a cluster.
Clean up resources created by Cluster Toolkit.
Install Cluster Toolkit
From the CLI, complete the following steps:
Install dependencies.
Set up Cluster Toolkit.
Create a reservation
If you don't have a reservation provided by a Technical Account Manager (TAM),
we recommend creating a reservation. For more information, see Choose a
reservation type.
Reservations incur ongoing costs even after the GKE cluster is destroyed. To
manage your costs, we recommend the following options:
Track spending by using budget alerts.
Delete reservations when you're done with them. To delete a reservation, see
delete your reservation.
To create a reservation, run the gcloud compute reservations create command and
ensure that you specify the --require-specific-reservation flag.
gcloud compute reservations create RESERVATION_NAME \
    --require-specific-reservation \
    --project=PROJECT_ID \
    --machine-type=a3-megagpu-8g \
    --vm-count=NUMBER_OF_VMS \
    --zone=ZONE
Replace the following:
RESERVATION_NAME: a name for your reservation.
PROJECT_ID: your project ID.
NUMBER_OF_VMS: the number of VMs needed for the cluster.
ZONE: a zone that has a3-megagpu-8g machine types. To find supported zones for
a specific VM machine type, see Regions and zones.
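As an illustration, a reservation for a two-node cluster might look like the following. The project, reservation name, and zone shown here are hypothetical placeholders, not recommendations; use a zone that appears in your machine-types list output:
# Hypothetical example: reserve 2 a3-megagpu-8g VMs in an illustrative zone.
gcloud compute reservations create a3mega-reservation-example \
    --require-specific-reservation \
    --project=my-example-project \
    --machine-type=a3-megagpu-8g \
    --vm-count=2 \
    --zone=us-central1-c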
Create a cluster
Use the following instructions to create a cluster using
Cluster Toolkit.
Note: If you create multiple clusters using these same cluster blueprints,
ensure that all VPC and subnet names are unique per project to prevent errors.
After you have installed the Cluster Toolkit, ensure that you are in the
Cluster Toolkit directory. To go to the main Cluster Toolkit blueprint's
working directory, run the following command from the CLI.
cd cluster-toolkit
Create a Cloud Storage bucket to store the state of the Terraform deployment:
gcloud storage buckets create gs://BUCKET_NAME \
    --default-storage-class=STANDARD \
    --project=PROJECT_ID \
    --location=COMPUTE_REGION_TERRAFORM_STATE \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following variables:
BUCKET_NAME: the name of the new Cloud Storage bucket.
PROJECT_ID: your Google Cloud project ID.
COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want
to store the state of the Terraform deployment.
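Optionally, you can confirm that the bucket exists and that versioning is enabled before moving on. This is a hedged verification sketch, not part of the published procedure:
# Inspect the bucket metadata; check that versioning is reported as enabled.
gcloud storage buckets describe gs://BUCKET_NAME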
Update the blueprint deployment file. In the
examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml
file, fill in the following settings in the terraform_backend_defaults
and vars sections to match the specific values for your deployment:
DEPLOYMENT_NAME: a unique name for the
deployment. If the deployment name isn't unique within a project,
cluster creation fails.
BUCKET_NAME: the name of the
Cloud Storage bucket you created in the previous step.
PROJECT_ID: your Google Cloud project ID.
COMPUTE_REGION: the compute region for the
cluster.
COMPUTE_ZONE: the compute zone for the node pool
of A3 Mega machines.
NODE_COUNT: the number of A3 Mega nodes in your
cluster.
IP_ADDRESS/SUFFIX: the IP
address range that you want to allow to connect to the cluster. This CIDR
block must include the IP address of the machine that calls Terraform. For
more information, see How authorized networks
work.
To get the IP address for your host machine, run the following command:
curl ifconfig.me
For example, if the command returns 203.0.113.25, you can allow only that
machine by specifying 203.0.113.25/32.
For the extended_reservation field, use one of the following,
depending on whether you want to target specific
blocks in a reservation
when provisioning the node pool:
To place the node pool anywhere in the reservation, provide the
name of your reservation (RESERVATION_NAME).
To target a specific block within your reservation, use the
reservation and block names in the following format:
RESERVATION_NAME/reservationBlocks/BLOCK_NAME
If you don't know which blocks are available in your reservation, see
View a reservation topology.
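To make the preceding settings concrete, a filled-in deployment file might look roughly like the following sketch. The exact key names come from the examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml file in your checkout and may differ from the assumed names shown here; all values are placeholders:
# Hypothetical sketch of gke-a3-megagpu-deployment.yaml after editing (key names assumed).
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: my-example-tf-state           # BUCKET_NAME
vars:
  deployment_name: a3mega-training-01     # DEPLOYMENT_NAME, unique per project
  project_id: my-example-project          # PROJECT_ID
  region: us-central1                     # COMPUTE_REGION
  zone: us-central1-c                     # COMPUTE_ZONE
  static_node_count: 2                    # NODE_COUNT
  authorized_cidr: 203.0.113.25/32        # IP_ADDRESS/SUFFIX
  extended_reservation: a3mega-reservation-example  # RESERVATION_NAME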
To modify advanced blueprint settings, edit the
examples/gke-a3-megagpu/gke-a3-megagpu.yaml file.
Deploy the blueprint to provision the GKE infrastructure
using A3 Mega machine types:
cd ~/cluster-toolkit
./gcluster deploy -d \
    examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml \
    examples/gke-a3-megagpu/gke-a3-megagpu.yaml
When prompted, select (A)pply to deploy the blueprint.
The blueprint creates VPC networks, service accounts, a cluster, and a
node pool.
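After the deployment finishes, you can optionally confirm that the cluster and its nodes are up. This is a hedged verification sketch rather than part of the published procedure; replace the placeholders with your own values:
# List clusters in the project to find the cluster created from DEPLOYMENT_NAME.
gcloud container clusters list --project=PROJECT_ID
# Fetch credentials, then check that the A3 Mega nodes registered with the cluster.
gcloud container clusters get-credentials CLUSTER_NAME --region=COMPUTE_REGION --project=PROJECT_ID
kubectl get nodes -o wide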
Clean up resources created by Cluster Toolkit
To avoid recurring charges for the resources used on this page, clean up the
resources provisioned by Cluster Toolkit, including the
VPC networks and GKE cluster:
cd ~/cluster-toolkit
./gcluster destroy CLUSTER_NAME/
Replace CLUSTER_NAME with the name of your cluster.
For clusters created with Cluster Toolkit, the cluster name is based on the
DEPLOYMENT_NAME name.
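Destroying the cluster doesn't remove the reservation or the Terraform state bucket, and reservations keep incurring charges while they exist. The following commands are a hedged sketch of that additional cleanup; run them only if you created these resources for this deployment and no longer need them:
# Delete the reservation created earlier (reservations are billed while they exist).
gcloud compute reservations delete RESERVATION_NAME --zone=ZONE --project=PROJECT_ID
# Remove the Terraform state bucket and its contents, if you no longer need the state.
gcloud storage rm --recursive gs://BUCKET_NAME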
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-03 UTC."],[[["\u003cp\u003eThis document details how to deploy an A3 Mega (\u003ccode\u003ea3-megagpu-8g\u003c/code\u003e) GKE cluster, which is optimized for large-scale AI and ML training workloads.\u003c/p\u003e\n"],["\u003cp\u003eBefore deploying the cluster, you must verify the availability of the \u003ccode\u003ea3-megagpu-8g\u003c/code\u003e machine type and confirm you have sufficient GPU and Filestore quotas, including at least 8 NVIDIA H100 80GB GPUs and 10,240 GiB of zonal Filestore capacity.\u003c/p\u003e\n"],["\u003cp\u003eThe deployment process involves installing the Cluster Toolkit, creating a reservation for the required VMs, fetching your machine's IP address, updating the blueprint file with your project details, and then building the Cluster Toolkit binary.\u003c/p\u003e\n"],["\u003cp\u003eTo provision the cluster, run the \u003ccode\u003egke-a3-megagpu.yaml\u003c/code\u003e blueprint file using the Cluster Toolkit, which will take around 5-10 minutes to complete.\u003c/p\u003e\n"],["\u003cp\u003eTo avoid unnecessary costs, after use, you can destroy the GKE cluster and its resources by running the command \u003ccode\u003e./gcluster destroy gke-a3-megagpu.yaml\u003c/code\u003e from Cloud Shell, and to ensure no ongoing charges, you should also delete any reservations created during this process.\u003c/p\u003e\n"]]],[],null,["This document outlines the deployment steps for provisioning an A3 Mega\n(`a3-megagpu-8g`) Google Kubernetes Engine (GKE) cluster that is ideal for running\nlarge-scale artificial intelligence (AI) and machine learning (ML) training\nworkloads.\n\nBefore you begin\n\n1.\n\n\n In the Google Cloud console, activate Cloud Shell.\n\n [Activate Cloud Shell](https://console.cloud.google.com/?cloudshell=true)\n\n\n At the bottom of the Google Cloud console, a\n [Cloud Shell](/shell/docs/how-cloud-shell-works)\n session starts and displays a command-line prompt. Cloud Shell is a shell environment\n with the Google Cloud CLI\n already installed and with values already set for\n your current project. It can take a few seconds for the session to initialize.\n\n \u003cbr /\u003e\n\n2. Identify the regions and zones where the `a3-megagpu-8g` machine type is\n available, run the following command:\n\n ```\n gcloud compute machine-types list --filter=\"name=a3-megagpu-8g\"\n ```\n3. Ensure that you have enough GPU quotas. Each `a3-megagpu-8g` machine has\n 8 H100 80GB GPUs attached, so you'll need at least 8 NVIDIA H100 80GB GPUs\n in your selected region.\n\n 1. To view quotas, see [View the quotas for your project](/docs/quotas/view-manage). In the filter_list **Filter** field, select **Dimensions (e.g. location)** and specify [`gpu_family:NVIDIA_H100_MEGA`](/compute/resource-usage#gpu_quota).\n 2. If you don't have enough quota, [request a higher quota](/view-manage#requesting_higher_quota).\n4. Ensure that you have enough Filestore quota. 
You need a minimum of\n 10,240 GiB of zonal (also known as high scale SSD) capacity.\n If you don't have enough quota,\n [request a quota increase](/filestore/docs/requesting-quota-increases).\n\nOverview\n\nTo deploy the cluster, you must complete the following:\n\n1. Install Cluster Toolkit\n2. Create a reservation or get a reservation name from your [Technical Account Manager (TAM)](/tam)\n3. Create a cluster\n4. Clean up resources created by Cluster Toolkit\n\nInstall Cluster Toolkit\n\nFrom the CLI, complete the following steps:\n\n1. Install [dependencies](/cluster-toolkit/docs/setup/install-dependencies).\n2. Set up [Cluster Toolkit](/cluster-toolkit/docs/setup/configure-environment).\n\nCreate a reservation\n\nIf you don't have a reservation provided by a\n[Technical Account Manager (TAM)](/tam), we recommend creating a reservation.\nFor more information, see [Choose a reservation type](/compute/docs/instances/choose-reservation-type).\n\nReservations incur ongoing costs even after the GKE cluster is destroyed. To\nmanage your costs, we recommend the following options:\n\n- Track spending by using [budget alerts](/billing/docs/how-to/budgets).\n- Delete reservations when you're done with them. To delete a reservation, see [delete your reservation](/compute/docs/instances/reservations-delete).\n\nTo create a reservation, run the\n[`gcloud compute reservations create` command](/sdk/gcloud/reference/compute/reservations/create)\nand ensure that you specify the `--require-specific-reservation` flag. \n\n```\ngcloud compute reservations create RESERVATION_NAME \\\n --require-specific-reservation \\\n --project=PROJECT_ID \\\n --machine-type=a3-megagpu-8g \\\n --vm-count=NUMBER_OF_VMS \\\n --zone=ZONE\n```\n\nReplace the following:\n\n- \u003cvar translate=\"no\"\u003eRESERVATION_NAME\u003c/var\u003e: a name for your reservation.\n- \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: your project ID.\n- \u003cvar translate=\"no\"\u003eNUMBER_OF_VMS\u003c/var\u003e: the number of VMs needed for the cluster.\n- \u003cvar translate=\"no\"\u003eZONE\u003c/var\u003e: a zone that has `a3-megagpu-8g` machine types. To find supported zones for a specific VM machine type, see [Regions and zones](/compute/docs/regions-zones).\n\nCreate a cluster\n\nUse the following instructions to create a cluster using [Cluster Toolkit](/cluster-toolkit/docs/overview).\n| **Note:** If you create multiple clusters using these same cluster blueprints, ensure that all VPC and subnet names are unique per project to prevent errors.\n\n1. After you have installed the Cluster Toolkit, ensure that you are in the\n Cluster Toolkit directory. To go to the main Cluster Toolkit blueprint's\n working directory, run the following command from the CLI.\n\n ```\n cd cluster-toolkit\n ```\n2. 
Create a Cloud Storage bucket to store the state of the Terraform deployment:\n\n gcloud storage buckets create gs://\u003cvar translate=\"no\"\u003eBUCKET_NAME\u003c/var\u003e \\\n --default-storage-class=STANDARD \\\n --project=\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e \\\n --location=\u003cvar translate=\"no\"\u003eCOMPUTE_REGION_TERRAFORM_STATE\u003c/var\u003e \\\n --uniform-bucket-level-access\n gcloud storage buckets update gs://\u003cvar translate=\"no\"\u003eBUCKET_NAME\u003c/var\u003e --versioning\n\n Replace the following variables:\n - \u003cvar translate=\"no\"\u003eBUCKET_NAME\u003c/var\u003e: the name of the new Cloud Storage bucket.\n - \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: your Google Cloud project ID.\n - \u003cvar translate=\"no\"\u003eCOMPUTE_REGION_TERRAFORM_STATE\u003c/var\u003e: the compute region where you want to store the state of the Terraform deployment.\n3. Update the blueprint deployment file. In the\n [`examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml`](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml)\n file, fill in the following settings in the `terraform_backend_defaults`\n and `vars` sections to match the specific values for your deployment:\n\n - \u003cvar translate=\"no\"\u003eDEPLOYMENT_NAME\u003c/var\u003e: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.\n - \u003cvar translate=\"no\"\u003eBUCKET_NAME\u003c/var\u003e: the name of the Cloud Storage bucket you created in the previous step.\n - \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: your Google Cloud project ID.\n - \u003cvar translate=\"no\"\u003eCOMPUTE_REGION\u003c/var\u003e: the compute region for the cluster.\n - \u003cvar translate=\"no\"\u003eCOMPUTE_ZONE\u003c/var\u003e: the compute zone for the node pool of A3 Mega machines.\n - \u003cvar translate=\"no\"\u003eNODE_COUNT\u003c/var\u003e: the number of A3 Mega nodes in your cluster.\n - \u003cvar translate=\"no\"\u003eIP_ADDRESS\u003c/var\u003e`/`\u003cvar translate=\"no\"\u003eSUFFIX\u003c/var\u003e: The IP\n address range that you want to allow to connect with the cluster. This CIDR\n block must include the IP address of the machine to call Terraform. For\n more information, see [How authorized networks\n work](/kubernetes-engine/docs/concepts/network-isolation#how_authorized_networks_work).\n To get the IP address for your host machine, run the following command.\n\n curl ifconfig.me\n\n - For the `extended_reservation` field, use one of the following,\n depending on whether you want to target specific\n [blocks](/ai-hypercomputer/docs/terminology#block) in a reservation\n when provisioning the node pool:\n\n - To place the node pool anywhere in the reservation, provide the name of your reservation (\u003cvar translate=\"no\"\u003eRESERVATION_NAME\u003c/var\u003e).\n - To target a specific block within your reservation, use the\n reservation and block names in the following format:\n\n \u003cvar translate=\"no\"\u003eRESERVATION_NAME\u003c/var\u003e/reservationBlocks/\u003cvar translate=\"no\"\u003eBLOCK_NAME\u003c/var\u003e\n\n If you don't know which blocks are available in your reservation,\n see [View a reservation\n topology](/ai-hypercomputer/docs/request-capacity#view-capacity-topology).\n4. 
To modify advanced blueprint settings, edit the\n [`examples/gke-a3-megagpu/gke-a3-megagpu.yaml`](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a3-megagpu/gke-a3-megagpu.yaml)\n file.\n\n5. Deploy the blueprint to provision the GKE infrastructure\n using A3 Mega machine types:\n\n cd ~/cluster-toolkit\n ./gcluster deploy -d \\\n examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml \\\n examples/gke-a3-megagpu/gke-a3-megagpu.yaml\n\n6. When prompted, select **(A)pply** to deploy the blueprint.\n\n - The blueprint creates VPC networks, service accounts, a cluster, and a nodepool.\n\nClean up resources created by Cluster Toolkit\n\nTo avoid recurring charges for the resources used on this page, clean up the\nresources provisioned by Cluster Toolkit, including the\nVPC networks and GKE cluster: \n\n cd ~/cluster-toolkit\n ./gcluster destroy \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e/\n\nReplace \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e with the name of your cluster.\nFor the clusters created with Cluster Toolkit, the cluster names\nare based on the \u003cvar translate=\"no\"\u003eDEPLOYMENT_NAME\u003c/var\u003e name."]]