Cloud TPU inference
Serving refers to the process of deploying a trained machine learning model to a
production environment, where it can be used for inference. Inference is
supported on TPU v5e and newer versions. Latency SLOs are a priority for serving.
This document discusses serving a model on a single-host TPU. TPU slices with
8 or fewer chips have one TPU VM or host and are called single-host TPUs.
Get started
You will need a Google Cloud account and project to use Cloud TPU. For more
information, see Set up a Cloud TPU environment.
You need to request the following quota for serving on TPUs:
On-demand v5e resources: TPUv5 lite pod cores for serving per project per zone
Preemptible v5e resources: Preemptible TPU v5 lite pod cores for serving per project per zone
On-demand v6e resources: TPUv6 cores per project per zone
Preemptible v6e resources: Preemptible TPUv6 cores per project per zone
There is no v6e quota specific to serving.
For more information about TPU quota, see TPU quota.
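After quota is granted, you can create a single-host TPU for serving. The following is a minimal sketch using the google-cloud-tpu Python client library; the project, zone, node name, accelerator type, and runtime version are placeholders, so check the Cloud TPU documentation for the values that match your setup.

    # Minimal sketch: create a single-host v5e TPU for serving with the
    # google-cloud-tpu client library (pip install google-cloud-tpu).
    # Project, zone, node name, accelerator type, and runtime version are
    # placeholders; verify them against the Cloud TPU documentation.
    from google.cloud import tpu_v2

    client = tpu_v2.TpuClient()

    node = tpu_v2.Node(
        accelerator_type="v5litepod-8",          # single-host v5e slice (8 chips)
        runtime_version="v2-alpha-tpuv5-lite",   # confirm the current v5e runtime name
    )

    operation = client.create_node(
        request=tpu_v2.CreateNodeRequest(
            parent="projects/my-project/locations/us-west4-a",  # placeholder project and zone
            node_id="my-serving-tpu",                           # placeholder node name
            node=node,
        )
    )
    print(operation.result())  # blocks until the TPU VM is ready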
Serve LLMs using JetStream
JetStream is a throughput- and memory-optimized engine for large language model
(LLM) inference on XLA devices (TPUs). You can use JetStream with JAX and
PyTorch/XLA models. For an example of using JetStream to serve a JAX LLM, see
JetStream MaxText inference on v6e TPU.
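A running JetStream server exposes a gRPC endpoint that clients send decode requests to. The sketch below only illustrates that pattern; the proto module path, stub, message, and field names are assumptions based on the JetStream repository and can differ between versions, so treat it as an outline rather than a working client.

    # Illustrative sketch of querying a running JetStream server over gRPC.
    # The proto module path, stub name, request fields, and port below are
    # assumptions and may differ by JetStream version.
    import grpc

    from jetstream.core.proto import jetstream_pb2, jetstream_pb2_grpc  # assumed module path

    channel = grpc.insecure_channel("localhost:9000")        # assumed server address and port
    stub = jetstream_pb2_grpc.OrchestratorStub(channel)      # assumed service stub name

    request = jetstream_pb2.DecodeRequest(                   # assumed message and field names
        text_content=jetstream_pb2.DecodeRequest.TextContent(text="What is a TPU?"),
        max_tokens=64,
    )

    # Decode is assumed to be a server-streaming RPC that yields generated text.
    for response in stub.Decode(request):
        print(response)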
Serve LLMs using vLLM
vLLM is an open-source library designed for fast inference and serving of large
language models (LLMs). You can use vLLM with PyTorch/XLA. For an example of
using vLLM to serve a PyTorch LLM, see Serve an LLM using TPU Trillium on GKE with vLLM.
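As a quick orientation, the following is a minimal offline-inference sketch with vLLM's Python API, assuming a TPU-enabled vLLM build is installed on the TPU VM; the model name is a placeholder, and the linked GKE tutorial covers the full serving setup.

    # Minimal offline-inference sketch with vLLM's Python API. Assumes a
    # TPU-enabled vLLM build is installed on the TPU VM; the model name is a
    # placeholder, so substitute one you have access to.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", max_model_len=2048)
    sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain Cloud TPU serving in one sentence."], sampling_params)
    for output in outputs:
        print(output.outputs[0].text)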
Profiling
After setting up inference, you can use profilers to analyze the performance and
TPU utilization. For more information about profiling, see:
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-28 UTC."],[],[],null,["# Cloud TPU inference\n===================\n\n| **Note:** If you are new to Cloud TPUs, see [Introduction to Cloud TPU](/tpu/docs/intro-to-tpu).\n\nServing refers to the process of deploying a trained machine learning model to a\nproduction environment, where it can be used for inference. Inference is\nsupported on TPU v5e and newer versions. Latency SLOs are a priority for serving.\n\nThis document discusses serving a model on a *single-host* TPU. TPU slices with\n8 or less chips have one TPU VM or host and are called *single-host* TPUs.\n\nGet started\n-----------\n\nYou will need a Google Cloud account and project to use Cloud TPU. For more\ninformation, see [Set up a Cloud TPU environment](/tpu/docs/setup-gcp-account).\n\nYou need to request the following quota for serving on TPUs:\n\n- On-demand v5e resources: `TPUv5 lite pod cores for serving per project per zone`\n- Preemptible v5e resources: `Preemptible TPU v5 lite pod cores for serving per project per zone`\n- On-demand v6e resources: `TPUv6 cores per project per zone`\n- Preemptible v6e resources: `Preemptible TPUv6 cores per project per zone`\n\n| **Note:** There is no v6e quota specific to serving.\n\nFor more information about TPU quota, see [TPU quota](/tpu/docs/quota).\n\nServe LLMs using JetStream\n--------------------------\n\nJetStream is a throughput and memory optimized engine for large language model\n(LLM) inference on XLA devices (TPUs). You can use JetStream with JAX and\nPyTorch/XLA models. For an example of using JetStream to serve a JAX LLM, see\n[JetStream MaxText inference on v6e TPU](/tpu/docs/tutorials/LLM/jetstream-maxtext-inference-v6e).\n\nServe LLM models with vLLM\n--------------------------\n\nvLLM is an open-source library designed for fast inference and serving of large\nlanguage models (LLMs). You can use vLLM with PyTorch/XLA. For an example of\nusing vLLM to serve a PyTorch LLM, see [Serve an LLM using TPU Trillium on GKE with vLLM](/kubernetes-engine/docs/tutorials/serve-vllm-tpu).\n\nProfiling\n---------\n\nAfter setting up inference, you can use profilers to analyze the performance and\nTPU utilization. For more information about profiling, see:\n\n- [Profiling on Cloud TPU](/tpu/docs/profile-tpu-vm)\n\n- [TensorFlow profiling](https://www.tensorflow.org/guide/profiler)\n\n- [PyTorch profiling](/tpu/docs/pytorch-xla-performance-profiling-tpu-vm)\n\n- [JAX profiling](https://jax.readthedocs.io/en/latest/profiling.html#profiling-jax-programs)"]]