This page describes how to perform supervised fine-tuning on open models such as Llama 3.1.
Supported tuning methods
Low-Rank Adaptation (LoRA): LoRA is a parameter-efficient tuning method that adjusts only a subset of parameters. It is more cost-efficient and requires less training data than full fine-tuning. On the other hand, full fine-tuning has higher quality potential because it adjusts all parameters.
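In the Vertex AI SDK for Python, the choice between the two methods maps to the tuning_mode parameter of the training call shown later on this page; a minimal illustration, using the values from the SDK sample below:
# Illustrative only: these are the tuning_mode values accepted by the
# sft.preview_train call shown in the "Create tuning job" section.
tuning_mode = "PEFT_ADAPTER"  # LoRA: parameter-efficient, lower cost
# tuning_mode = "FULL"        # Full fine-tuning: adjusts all parameters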
Supported models
meta/llama3_1@llama-3.1-8b
meta/llama3_1@llama-3.1-8b-instruct
meta/llama3-2@llama-3.2-1b-instruct: supports only full fine-tuning
meta/llama3-2@llama-3.2-3b-instruct: supports only full fine-tuning
meta/llama3-3@llama-3.3-70b-instruct
Before you begin
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Verify that billing is enabled for your Google Cloud project.
Enable the Vertex AI and Cloud Storage APIs.
import os
import time
import uuid

import vertexai
from google.cloud import aiplatform
from vertexai.preview.tuning import sft, SourceModel

# Replace with your project ID and region.
PROJECT_ID = "your-project-id"
REGION = "us-central1"

vertexai.init(project=PROJECT_ID, location=REGION)
Prepare dataset for tuning
A training dataset is required for tuning. We also recommend that you prepare an optional validation dataset if you'd like to evaluate your tuned model's performance.
Your dataset must be in one of the following supported JSON Lines (JSONL) formats, where each line contains a single tuning example.
Turn-based chat format
{"messages": [
{"content": "You are a chatbot that helps with scientific literature and generates state-of-the-art abstracts from articles.",
"role": "system"},
{"content": "Summarize the paper in one paragraph.",
"role": "user"},
{"content": " Here is a one paragraph summary of the paper:\n\nThe paper describes PaLM, ...",
"role": "assistant"}
]}
Upload your JSONL files to Cloud Storage.
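As a minimal sketch of writing the JSONL file and uploading it with the Cloud Storage client library (the bucket name and object path below are placeholder assumptions, not values from this page):
import json
from google.cloud import storage

# One training example per line in the JSONL file.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a chatbot that helps with scientific literature and generates state-of-the-art abstracts from articles."},
        {"role": "user", "content": "Summarize the paper in one paragraph."},
        {"role": "assistant", "content": "Here is a one paragraph summary of the paper: ..."},
    ]},
]
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the file to Cloud Storage (bucket and object names are assumptions).
client = storage.Client(project=PROJECT_ID)  # PROJECT_ID from the setup code above
bucket = client.bucket("your-tuning-bucket")
bucket.blob("datasets/train.jsonl").upload_from_filename("train.jsonl")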
Create tuning job
You can tune from a model that has the same architecture as one of the supported base models. This could be either a custom model checkpoint from a repository such as Hugging Face or a previously tuned model from a Vertex AI tuning job. This lets you continue tuning a model that has already been tuned.
Cloud Console
You can initiate fine-tuning in the following ways:
Go to the model card, click Fine tune, and choose Managed tuning.
or
Go to the Tuning page and click Create tuned model.
Fill out the parameters and click Start tuning. This starts a tuning job, which you can see in the Tuning page under the Managed tuning tab.
Once the tuning job has finished, you can view the information about the tuned model in the Details tab.
Vertex AI SDK for Python
Replace the parameter values with your own and then run the following code to create a tuning job:
sft_tuning_job = sft.preview_train(
    source_model=SourceModel(
        base_model="meta/llama3_1@llama-3.1-8b",
        # Optional: a folder containing either a custom model checkpoint or a previously tuned model
        custom_base_model="gs://{STORAGE-URI}",
    ),
    tuning_mode="FULL",  # FULL or PEFT_ADAPTER
    epochs=3,
    train_dataset="gs://{STORAGE-URI}",  # JSONL file
    validation_dataset="gs://{STORAGE-URI}",  # JSONL file
    output_uri="gs://{STORAGE-URI}",
)
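If you want your script to wait for the job to complete before continuing, one option is to poll the returned job object. This is a sketch that assumes the object returned by sft.preview_train exposes the same refresh() and has_ended helpers as other Vertex AI tuning job types:
# Poll the tuning job until it has ended (assumes refresh()/has_ended are
# available on the returned job object; time is imported in the setup section).
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()
print(sft_tuning_job.state)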
When the job finishes, the model artifacts for the tuned model are stored in the <output_uri>/postprocess/node-0/checkpoints/final folder.
Deploy tuned model
You can deploy the tuned model to a Vertex AI endpoint. You can also export the tuned model from Cloud Storage and deploy it elsewhere.
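If you export the artifacts instead of deploying on Vertex AI, here is a minimal sketch of downloading the final checkpoint files with the Cloud Storage client (the bucket name and prefix are assumptions modeled on the output path above):
from google.cloud import storage

bucket_name = "your-tuning-bucket"                               # assumed bucket
prefix = "tuning-output/postprocess/node-0/checkpoints/final/"   # assumed prefix

client = storage.Client()
for blob in client.list_blobs(bucket_name, prefix=prefix):
    # Download each checkpoint file into the current directory.
    blob.download_to_filename(blob.name.rsplit("/", 1)[-1])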
Cloud Console
To deploy the tuned model to a Vertex AI endpoint:
Go to the Model Garden page and click Deploy model with custom weights.
Fill out the parameters and click Deploy.
Vertex AI SDK for Python
Deploy a G2 machine using a prebuilt container:
from vertexai.preview import model_garden
MODEL_ARTIFACTS_STORAGE_URI = "gs://{STORAGE-URI}/postprocess/node-0/checkpoints/final"

model = model_garden.CustomModel(
    gcs_uri=MODEL_ARTIFACTS_STORAGE_URI,
)

# Deploy the model to an endpoint using GPUs. The deployment incurs costs.
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
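The inference example in the next section references the deployed endpoint by its resource name. Assuming deploy() returns a standard aiplatform.Endpoint object, you can print it for later use:
# Assumption: model.deploy() returns a google.cloud.aiplatform.Endpoint.
print(endpoint.resource_name)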
Get an inference
Once deployment succeeds, you can send requests to the endpoint with text prompts. Note that the first few prompts will take longer to execute.
# Load the deployed endpoint
endpoint = aiplatform.Endpoint("projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}")

prompt = "Summarize the following article. Article: Preparing a perfect risotto requires patience and attention to detail. Begin by heating butter in a large, heavy-bottomed pot over medium heat. Add finely chopped onions and minced garlic to the pot, and cook until they're soft and translucent, about 5 minutes. Next, add Arborio rice to the pot and cook, stirring constantly, until the grains are coated with the butter and begin to toast slightly. Pour in a splash of white wine and cook until it's absorbed. From there, gradually add hot chicken or vegetable broth to the rice, stirring frequently, until the risotto is creamy and the rice is tender with a slight bite. Summary:"

# Define the input to the prediction call
instances = [
    {
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": 1,
        "raw_response": True,
    },
]

# Request the prediction
response = endpoint.predict(
    instances=instances
)

for prediction in response.predictions:
    print(prediction)
For more details on getting inferences from a deployed model, see Get an online inference. Notice that managed open models use the chat.completions method instead of the predict method used by deployed models. For more information on getting inferences from managed models, see Make a call to a Llama model.
Limits and quotas
Quota is enforced on the number of concurrent tuning jobs. Every project comes with a default quota to run at least one tuning job. This is a global quota, shared across all available regions and supported models. If you want to run more jobs concurrently, you need to request additional quota for Global concurrent managed OSS model fine-tuning jobs per project.
Pricing
You are billed for tuning based on pricing for Model tuning. You are also billed for related services, such as Cloud Storage and Vertex AI Prediction.
Learn about Vertex AI pricing, Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
What's next
Tune an open model