This page describes how to perform supervised fine-tuning on open models such as Llama 3.1.
Supported tuning methods
Low-Rank Adaptation (LoRA): LoRA is a parameter-efficient tuning method that adjusts only a subset of parameters. It is more cost-efficient and requires less training data than full fine-tuning. On the other hand, full fine-tuning has higher quality potential because it adjusts all parameters.
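In the Vertex AI SDK for Python, the choice between the two methods maps to the tuning_mode parameter of the training call shown later on this page; a minimal illustration, using the values from the SDK sample below:
# Illustrative only: these are the tuning_mode values accepted by the
# sft.preview_train call shown in the "Create tuning job" section.
tuning_mode = "PEFT_ADAPTER"  # LoRA: parameter-efficient, lower cost
# tuning_mode = "FULL"        # Full fine-tuning: adjusts all parameters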
Supported models
meta/llama3_1@llama-3.1-8b
meta/llama3_1@llama-3.1-8b-instruct
meta/llama3-2@llama-3.2-1b-instruct: supports only full fine-tuning
meta/llama3-2@llama-3.2-3b-instruct: supports only full fine-tuning
meta/llama3-3@llama-3.3-70b-instruct
Before you begin
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Verify that billing is enabled for your Google Cloud project.
Enable the Vertex AI and Cloud Storage APIs.
import os
import time
import uuid

import vertexai
from google.cloud import aiplatform
from vertexai.preview.tuning import sft, SourceModel

# Replace with your project ID and region.
PROJECT_ID = "your-project-id"
REGION = "us-central1"

vertexai.init(project=PROJECT_ID, location=REGION)
Prepare dataset for tuning
A training dataset is required for tuning. We also recommend that you prepare an optional validation dataset if you'd like to evaluate your tuned model's performance.
Your dataset must be in one of the following supported JSON Lines (JSONL) formats, where each line contains a single tuning example.
Turn-based chat format
{"messages": [
{"content": "You are a chatbot that helps with scientific literature and generates state-of-the-art abstracts from articles.",
"role": "system"},
{"content": "Summarize the paper in one paragraph.",
"role": "user"},
{"content": " Here is a one paragraph summary of the paper:\n\nThe paper describes PaLM, ...",
"role": "assistant"}
]}
Upload your JSONL files to Cloud Storage.
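As a minimal sketch of writing the JSONL file and uploading it with the Cloud Storage client library (the bucket name and object path below are placeholder assumptions, not values from this page):
import json
from google.cloud import storage

# One training example per line in the JSONL file.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a chatbot that helps with scientific literature and generates state-of-the-art abstracts from articles."},
        {"role": "user", "content": "Summarize the paper in one paragraph."},
        {"role": "assistant", "content": "Here is a one paragraph summary of the paper: ..."},
    ]},
]
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the file to Cloud Storage (bucket and object names are assumptions).
client = storage.Client(project=PROJECT_ID)  # PROJECT_ID from the setup code above
bucket = client.bucket("your-tuning-bucket")
bucket.blob("datasets/train.jsonl").upload_from_filename("train.jsonl")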
Create tuning job
You can tune from a model that has the same architecture as one of the supported base models. This could be either a custom model checkpoint from a repository such as Hugging Face or a previously tuned model from a Vertex AI tuning job. This lets you continue tuning a model that has already been tuned.
Cloud Console
You can initiate fine-tuning in the following ways:
Go to the model card, click Fine tune, and choose Managed tuning.
or
Go to the Tuning page and click Create tuned model.
Fill out the parameters and click Start tuning. This starts a tuning job, which you can see in the Tuning page under the Managed tuning tab.
Once the tuning job has finished, you can view the information about the tuned model in the Details tab.
Vertex AI SDK for Python
Replace the parameter values with your own and then run the following code to create a tuning job:
sft_tuning_job = sft.preview_train(
    source_model=SourceModel(
        base_model="meta/llama3_1@llama-3.1-8b",
        # Optional: a folder containing either a custom model checkpoint or a previously tuned model
        custom_base_model="gs://{STORAGE-URI}",
    ),
    tuning_mode="FULL",  # FULL or PEFT_ADAPTER
    epochs=3,
    train_dataset="gs://{STORAGE-URI}",  # JSONL file
    validation_dataset="gs://{STORAGE-URI}",  # JSONL file
    output_uri="gs://{STORAGE-URI}",
)
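If you want your script to wait for the job to complete before continuing, one option is to poll the returned job object. This is a sketch that assumes the object returned by sft.preview_train exposes the same refresh() and has_ended helpers as other Vertex AI tuning job types:
# Poll the tuning job until it has ended (assumes refresh()/has_ended are
# available on the returned job object; time is imported in the setup section).
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()
print(sft_tuning_job.state)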
When the job finishes, the model artifacts for the tuned model are stored in the <output_uri>/postprocess/node-0/checkpoints/final folder.
Deploy tuned model
You can deploy the tuned model to a Vertex AI endpoint. You can also export the tuned model from Cloud Storage and deploy it elsewhere.
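If you export the artifacts instead of deploying on Vertex AI, here is a minimal sketch of downloading the final checkpoint files with the Cloud Storage client (the bucket name and prefix are assumptions modeled on the output path above):
from google.cloud import storage

bucket_name = "your-tuning-bucket"                               # assumed bucket
prefix = "tuning-output/postprocess/node-0/checkpoints/final/"   # assumed prefix

client = storage.Client()
for blob in client.list_blobs(bucket_name, prefix=prefix):
    # Download each checkpoint file into the current directory.
    blob.download_to_filename(blob.name.rsplit("/", 1)[-1])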
Cloud Console
To deploy the tuned model to a Vertex AI endpoint:
Go to the Model Garden page and click Deploy model with custom weights.
Fill out the parameters and click Deploy.
Vertex AI SDK for Python
Deploy a G2 machine using a prebuilt container:
from vertexai.preview import model_garden
MODEL_ARTIFACTS_STORAGE_URI = "gs://{STORAGE-URI}/postprocess/node-0/checkpoints/final"

model = model_garden.CustomModel(
    gcs_uri=MODEL_ARTIFACTS_STORAGE_URI,
)

# Deploy the model to an endpoint using GPUs. The deployment incurs costs.
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
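The inference example in the next section references the deployed endpoint by its resource name. Assuming deploy() returns a standard aiplatform.Endpoint object, you can print it for later use:
# Assumption: model.deploy() returns a google.cloud.aiplatform.Endpoint.
print(endpoint.resource_name)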
Get an inference
Once deployment succeeds, you can send requests to the endpoint with text prompts. Note that the first few prompts will take longer to execute.
# Load the deployed endpoint
endpoint = aiplatform.Endpoint("projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}")

prompt = "Summarize the following article. Article: Preparing a perfect risotto requires patience and attention to detail. Begin by heating butter in a large, heavy-bottomed pot over medium heat. Add finely chopped onions and minced garlic to the pot, and cook until they're soft and translucent, about 5 minutes. Next, add Arborio rice to the pot and cook, stirring constantly, until the grains are coated with the butter and begin to toast slightly. Pour in a splash of white wine and cook until it's absorbed. From there, gradually add hot chicken or vegetable broth to the rice, stirring frequently, until the risotto is creamy and the rice is tender with a slight bite. Summary:"

# Define the input to the prediction call
instances = [
    {
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": 1,
        "raw_response": True,
    },
]

# Request the prediction
response = endpoint.predict(
    instances=instances
)

for prediction in response.predictions:
    print(prediction)
For more details on getting inferences from a deployed model, see Get an online inference. Notice that managed open models use the chat.completions method instead of the predict method used by deployed models. For more information on getting inferences from managed models, see Make a call to a Llama model.
Limits and quotas
Quota is enforced on the number of concurrent tuning jobs. Every project comes with a default quota to run at least one tuning job. This is a global quota, shared across all available regions and supported models. If you want to run more jobs concurrently, you need to request additional quota for Global concurrent managed OSS model fine-tuning jobs per project.
Pricing
You are billed for tuning based on pricing for Model tuning. You are also billed for related services, such as Cloud Storage and Vertex AI Prediction.
Learn about Vertex AI pricing, Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
What's next
Tune an open model