Video tuning
This page provides prerequisites and detailed instructions for fine-tuning
Gemini on video data using supervised learning.
Supported models
The following Gemini models support video tuning:
Gemini 2.5 Flash
Gemini 2.5 Flash-Lite
Gemini 2.5 Pro
Use cases
Fine-tuning lets you adapt base Gemini models for specialized tasks.
Here are some video use cases:
Automated video summarization: Tuning LLMs to generate concise and
coherent summaries of long videos, capturing the main themes, events, and
narratives. This is useful for content discovery, archiving, and quick
reviews.
Detailed event recognition and localization: Fine-tuning allows LLMs to
identify and pinpoint specific actions, events, or objects within a video
timeline with greater accuracy. For example, identifying all instances of a
particular product in a marketing video or a specific action in sports
footage.
Content moderation: Specialized tuning can improve an LLM's ability to
detect sensitive, inappropriate, or policy-violating content within videos,
going beyond simple object detection to understand context and nuance.
Video captioning and subtitling: While already a common application,
tuning can improve the accuracy, fluency, and context-awareness of
automatically generated captions and subtitles, including descriptions of
nonverbal cues.
Limitations
Maximum video file size: 100 MB.
This limit may not be sufficient for datasets with large video files. Some
recommended workarounds are as follows:
If only a few files are too large, omit those files from the JSONL files.
If your dataset contains many large files that can't be omitted, reduce the
visual resolution of the files. This may hurt performance.
Split the videos into chunks so that each file stays under 100 MB, and use the
chunked videos for tuning. Make sure to adjust any timestamp annotations made
against the original video to the new (chunked) video timeline. A minimal
chunking sketch is shown at the end of this section.
Maximum video length per example: 5 minutes with MEDIA_RESOLUTION_MEDIUM
and 20 minutes with MEDIA_RESOLUTION_LOW.
Dropped examples: If an example contains video that is longer than the
supported maximum length, that example
is dropped from the dataset. Dropped examples are not billed or used for training.
If more than 10% of the dataset is dropped, the job will fail with an error
message before the start of training.
Mixing different media resolutions isn't supported: The value of
mediaResolution for each example in the entire training dataset must be
consistent. All lines in the JSONL files used for training and validation
should have the same value of mediaResolution.
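If you chunk oversized videos as described in the file size workaround, the
following sketch shows one way to split a video into fixed-length segments with
ffmpeg and to map timestamps from the original video onto a chunk's timeline.
The chunk duration, file names, and helper functions are illustrative
assumptions, not part of the tuning API.

    # Sketch: split an oversized video into fixed-length chunks with ffmpeg and
    # remap annotation timestamps into each chunk's timeline. The chunk duration,
    # paths, and helper names are illustrative assumptions.
    import subprocess

    CHUNK_SECONDS = 240  # choose a duration that keeps each chunk under 100 MB

    def split_video(src_path: str, out_prefix: str, total_seconds: int) -> list[str]:
        """Cut src_path into consecutive chunks of at most CHUNK_SECONDS each."""
        chunk_paths = []
        for i, start in enumerate(range(0, total_seconds, CHUNK_SECONDS)):
            out_path = f"{out_prefix}_{i:03d}.mp4"
            subprocess.run(
                ["ffmpeg", "-y", "-ss", str(start), "-i", src_path,
                 "-t", str(CHUNK_SECONDS), "-c", "copy", out_path],
                check=True,
            )
            chunk_paths.append(out_path)
        return chunk_paths

    def remap_timestamp(original_seconds: float) -> tuple[int, str]:
        """Map a timestamp on the original video to (chunk index, offset in that chunk)."""
        chunk_index = int(original_seconds // CHUNK_SECONDS)
        offset_in_chunk = original_seconds - chunk_index * CHUNK_SECONDS
        return chunk_index, f"{offset_in_chunk:g}s"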
Dataset format
The fileUri field specifies the location of the video file used in an example.
It can be the URI for a file in a Cloud Storage bucket, or it can be a publicly
available HTTP or HTTPS URL.
The mediaResolution field is used to specify the token count per frame for
the input videos, as one of the following values:
MEDIA_RESOLUTION_LOW: 64 tokens per frame
MEDIA_RESOLUTION_MEDIUM: 256 tokens per frame
Model tuning with MEDIA_RESOLUTION_LOW is roughly four times faster than tuning
with MEDIA_RESOLUTION_MEDIUM, which typically yields only a minimal quality
improvement.
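For a rough sense of scale, the following back-of-the-envelope estimate assumes
that roughly one frame per second is sampled from the video; the sampling rate
is an assumption for this illustration, not a documented tuning guarantee.

    # Rough token estimate per video example, assuming ~1 sampled frame per second
    # (an illustrative assumption, not a documented guarantee).
    TOKENS_PER_FRAME = {"MEDIA_RESOLUTION_LOW": 64, "MEDIA_RESOLUTION_MEDIUM": 256}

    def estimate_video_tokens(duration_seconds: float, resolution: str) -> int:
        frames = int(duration_seconds)  # ~1 frame per second
        return frames * TOKENS_PER_FRAME[resolution]

    # A 5-minute (300 s) video: 19,200 tokens at LOW versus 76,800 tokens at MEDIUM.
    print(estimate_video_tokens(300, "MEDIA_RESOLUTION_LOW"))
    print(estimate_video_tokens(300, "MEDIA_RESOLUTION_MEDIUM"))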
To use only a segment of a video for training and validation, specify the
segment in the videoMetadata field. During tuning, the example is decoded using
only the portion of the specified video file between the startOffset timestamp
(the start offset, in seconds) and the endOffset timestamp (the end offset, in
seconds).
To see the generic format example, see Dataset example for Gemini.
The following sections present video dataset format examples.
JSON schema example for cases where the full video is used for training and validation
This schema is added as a single line in the JSONL file. It's shown here with
line breaks for readability.
{"contents":[{"role":"user","parts":[{"fileData":{"fileUri":"gs://<path to the mp4 video file>","mimeType":"video/mp4"},},{"text":" You are a video analysis expert. Detect which animal appears in the video.The video can only have one of the following animals: dog, cat, rabbit.\n Output Format:\n Generate output in the following JSON format:\n [{\n \"animal_name\": \"<CATEGORY>\",\n }]\n"}]},{"role":"model","parts":[{"text":"```json\n[{\"animal_name\": \"dog\"}]\n```"}]},],"generationConfig":{"mediaResolution":"MEDIA_RESOLUTION_LOW"}}
JSON schema example for cases where a video segment is used for training and validation
This schema is added as a single line in the JSONL file. It's shown here with
line breaks for readability.
{"contents":[{"role":"user","parts":[{"fileData":{"fileUri":"gs://<path to the mp4 video file>","mimeType":"video/mp4"},"videoMetadata":{"startOffset":"5s","endOffset":"25s"}},{"text":" You are a video analysis expert. Detect which animal appears in the video.The video can only have one of the following animals: dog, cat, rabbit.\n Output Format:\n Generate output in the following JSON format:\n [{\n \"animal_name\": \"<CATEGORY>\",\n }]\n"}]},{"role":"model","parts":[{"text":"```json\n[{\"animal_name\": \"dog\"}]\n```"}]},],"generationConfig":{"mediaResolution":"MEDIA_RESOLUTION_LOW"}}
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-29 UTC."],[],[],null,["# Video tuning\n\nThis page provides prerequisites and detailed instructions for fine-tuning\nGemini on video data using supervised learning.\n\nSupported models\n----------------\n\nThe following Gemini models support video tuning:\n\n- Gemini 2.5 Flash\n- Gemini 2.5 Flash-Lite\n- Gemini 2.5 Pro\n\nUse cases\n---------\n\nFine-tuning lets you adapt base Gemini models for specialized tasks.\nHere are some video use cases:\n\n- **Automated video summarization**: Tuning LLMs to generate concise and\n coherent summaries of long videos, capturing the main themes, events, and\n narratives. This is useful for content discovery, archiving, and quick\n reviews.\n\n- **Detailed event recognition and localization**: Fine-tuning allows LLMs to\n identify and pinpoint specific actions, events, or objects within a video\n timeline with greater accuracy. For example, identifying all instances of a\n particular product in a marketing video or a specific action in sports\n footage.\n\n- **Content moderation**: Specialized tuning can improve an LLM's ability to\n detect sensitive, inappropriate, or policy-violating content within videos,\n going beyond simple object detection to understand context and nuance.\n\n- **Video captioning and subtitling**: While already a common application,\n tuning can improve the accuracy, fluency, and context-awareness of\n automatically generated captions and subtitles, including descriptions of\n nonverbal cues.\n\nLimitations\n-----------\n\n- **Maximum video file size** : 100MB. This may not be sufficient for large video files. Some recommended workarounds are as follows:\n - If there are very few large files, drop those files from including those in the JSONL files.\n - If there are many large files in your dataset and cannot be ignored, reduce visual resolution of the files. This may hurt performance.\n - Chunk the videos to limit the files size to 100MB and use the chunked videos for tuning. Make sure to change any timestamp annotations corresponding to the original video to the new (chunked) video timeline.\n- **Maximum video length per example** : 5 minutes with `MEDIA_RESOLUTION_MEDIUM` and 20 minutes with `MEDIA_RESOLUTION_LOW`.\n- **Dropped examples**: If an example contains video that is longer than the supported maximum length, that example is dropped from the dataset. Dropped examples are not billed or used for training. If more than 10% of the dataset is dropped, the job will fail with an error message before the start of training.\n- **Mixing different media resolutions isn't supported** : The value of `mediaResolution` for each example in the entire training dataset must be consistent. All lines in the JSONL files used for training and validation should have the same value of `mediaResolution`.\n\nDataset format\n--------------\n\nThe `fileUri` field specifies the location of your dataset. 
It can be the URI\nfor a file in a Cloud Storage bucket, or it can be a publicly available HTTP\nor HTTPS URL.\n\nThe `mediaResolution` field is used to specify the token count per frame for\nthe input videos, as one of the following values:\n\n- `MEDIA_RESOLUTION_LOW`: 64 tokens per frame\n- `MEDIA_RESOLUTION_MEDIUM`: 256 tokens per frame\n\nModel tuning with `MEDIA_RESOLUTION_LOW` is roughly 4 times faster than the ones\ntuned with `MEDIA_RESOLUTION_MEDIUM` with minimal performance improvement.\n\nWhen a video segment is used for training and validation, the video segment\nis in the `videoMetadata` field. During tuning, this data point is decoded\nto contain information from the segment extracted from the specified video file,\nstarting from timestamp `startOffset` (the start offset, in seconds) until\n`endOffset`.\n\nTo see the generic format example, see\n[Dataset example for Gemini](/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning-prepare#dataset-example).\n\nThe following sections present video dataset format examples.\n\n### JSON schema example for cases where the full video is used for training and validation\n\nThis schema is added as a single line in the JSONL file. \n\n {\n \"contents\": [\n {\n \"role\": \"user\",\n \"parts\": [\n {\n \"fileData\": {\n \"fileUri\": \"gs://\u003cpath to the mp4 video file\u003e\",\n \"mimeType\": \"video/mp4\"\n },\n },\n {\n \"text\": \"\n You are a video analysis expert. Detect which animal appears in the\n video.The video can only have one of the following animals: dog, cat,\n rabbit.\\n Output Format:\\n Generate output in the following JSON\n format:\\n\n [{\\n\n \\\"animal_name\\\": \\\"\u003cCATEGORY\u003e\\\",\\n\n }]\\n\"\n }\n ]\n },\n {\n \"role\": \"model\",\n \"parts\": [\n {\n \"text\": \"```json\\n[{\\\"animal_name\\\": \\\"dog\\\"}]\\n```\"\n }\n ]\n },\n ],\n \"generationConfig\": {\n \"mediaResolution\": \"MEDIA_RESOLUTION_LOW\"\n }\n }\n\n### JSON schema example for cases where a video segment is used for training and validation\n\nThis schema is added as a single line in the JSONL file. \n\n {\n \"contents\": [\n {\n \"role\": \"user\",\n \"parts\": [\n {\n \"fileData\": {\n \"fileUri\": \"gs://\u003cpath to the mp4 video file\u003e\",\n \"mimeType\": \"video/mp4\"\n },\n \"videoMetadata\": {\n \"startOffset\": \"5s\",\n \"endOffset\": \"25s\"\n }\n },\n {\n \"text\": \"\n You are a video analysis expert. 
Detect which animal appears in the\n video.The video can only have one of the following animals: dog, cat,\n rabbit.\\n Output Format:\\n Generate output in the following JSON\n format:\\n\n [{\\n\n \\\"animal_name\\\": \\\"\u003cCATEGORY\u003e\\\",\\n\n }]\\n\"\n }\n ]\n },\n {\n \"role\": \"model\",\n \"parts\": [\n {\n \"text\": \"```json\\n[{\\\"animal_name\\\": \\\"dog\\\"}]\\n```\"\n }\n ]\n },\n ],\n \"generationConfig\": {\n \"mediaResolution\": \"MEDIA_RESOLUTION_LOW\"\n }\n }\n\nWhat's next\n-----------\n\n- To learn more about video tuning, see [How to fine-tune Gemini 2.5 using videos via Vertex AI](https://cloud.google.com/blog/products/ai-machine-learning/how-to-fine-tune-video-outputs-using-vertex-ai?e=48754805).\n- To learn more about the image understanding capability of Gemini, see our [Image understanding](/vertex-ai/generative-ai/docs/multimodal/image-understanding) documentation.\n- To start tuning, see [Tune Gemini models by using supervised fine-tuning](/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning)\n- To learn how supervised fine-tuning can be used in a solution that builds a generative AI knowledge base, see [Jump Start Solution: Generative AI\n knowledge base](/architecture/ai-ml/generative-ai-knowledge-base)."]]