Process ML data using Dataflow and Cloud Storage FUSE
This page describes how to use
Cloud Storage FUSE with Dataflow
to process datasets for machine learning (ML) tasks.
When working with ML tasks, you can use Dataflow to process large
datasets. However, some common ML software libraries, such as OpenCV, have
input file requirements: they expect to read files as if the files were stored
on a local disk, rather than from cloud-based storage. To work around this
requirement, pipelines can either use specialized I/O connectors for input or
download files onto the Dataflow virtual machines (VMs) before
processing. Both workarounds are frequently inefficient.
Cloud Storage FUSE avoids these workarounds by letting you mount your
Cloud Storage buckets onto the Dataflow VMs. The files in
Cloud Storage then appear as local files, so the ML software can access them
directly without downloading them first.
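For example, after a bucket is mounted, pipeline code can read files through the local mount point. The following is a minimal sketch, assuming a hypothetical bucket named my-images-bucket is mounted and that OpenCV (cv2) is installed on the worker VMs; the mount path format is described later on this page.

```python
import apache_beam as beam
import cv2  # OpenCV; must be installed on the Dataflow workers

# Hypothetical bucket name for illustration; replace with your own bucket.
MOUNT_PATH = "/var/opt/google/gcs/my-images-bucket"

class LoadImage(beam.DoFn):
    """Reads an image through the Cloud Storage FUSE mount as a local file."""

    def process(self, file_name):
        # cv2.imread expects a local path, which the FUSE mount provides.
        image = cv2.imread(f"{MOUNT_PATH}/{file_name}")
        if image is not None:
            yield file_name, image.shape

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "File names" >> beam.Create(["cat.jpg", "dog.jpg"])
        | "Load images" >> beam.ParDo(LoadImage())
        | "Print shapes" >> beam.Map(print)
    )
```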
Benefits
Using Cloud Storage FUSE for ML tasks offers the following benefits:
Input files hosted on Cloud Storage can be accessed on the
Dataflow VMs using local file system semantics.
Because the data is accessed on demand, the input files don't have to be
downloaded beforehand.
Support and limitations
To use Cloud Storage FUSE with Dataflow, you must configure worker VMs with external IP addresses so that they meet the internet access requirements.
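Dataflow workers use external IP addresses by default, so this requirement is met unless your pipeline turns them off. The following is a minimal sketch, assuming a Python pipeline: keep the use_public_ips worker option enabled rather than passing --no_use_public_ips, which would remove the workers' external IPs.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Workers need internet access to mount buckets with Cloud Storage FUSE.
# use_public_ips defaults to on; setting it explicitly documents the intent.
# Passing --no_use_public_ips instead would strip external IPs from workers.
options = PipelineOptions(use_public_ips=True)
```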
Specify buckets to use with Cloud Storage FUSE
To specify a Cloud Storage bucket to mount to a VM, use the
--experiments flag. To specify
multiple buckets, use a semicolon delimiter (;) between bucket names.
The format is as follows:
--experiments="gcsfuse_buckets=CONFIG"
Replace the following:
CONFIG: a semicolon-delimited list of
Cloud Storage entries, where each entry is one of the following:
BUCKET_NAME: A Cloud Storage bucket name.
For example, dataflow-samples. If you omit the bucket mode, the bucket
is treated as read-only.
BUCKET_NAME:MODE: A
Cloud Storage bucket name and its associated mode, where MODE is
either ro (read-only) or rw (read-write).
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-28 UTC."],[],[],null,["# Process ML data using Dataflow and Cloud Storage FUSE\n\nThis page describes how to use\n[Cloud Storage FUSE](/storage/docs/cloud-storage-fuse/overview) with Dataflow\nto process datasets for machine learning (ML) tasks.\n\nWhen working with ML tasks, Dataflow can be used for processing large\ndatasets. However, some common software libraries used for ML, like OpenCV, have\ninput file requirements. They frequently require files to be accessed as if they\nare stored on a local computer's hard drive, rather than from cloud-based\nstorage. This requirement creates difficulties and delays. As a solution,\npipelines can either use special I/O connectors for input or download files onto\nthe Dataflow virtual machines (VMs) before processing. These solutions\nare frequently inefficient.\n\nCloud Storage FUSE provides a way to avoid these inefficient solutions.\nCloud Storage FUSE lets you mount your Cloud Storage buckets onto the\nDataflow VMs. This makes the files in Cloud Storage appear as if they\nare local files. As a result, the ML software can access them directly without\nneeding to download them beforehand.\n\nBenefits\n--------\n\nUsing Cloud Storage FUSE for ML tasks offers the following benefits:\n\n- Input files hosted on Cloud Storage can be accessed in the Dataflow VM using local file system semantics.\n- Because the data is accessed on-demand, the input files don't have to be downloaded beforehand.\n\nSupport and limitations\n-----------------------\n\n- To use Cloud Storage FUSE with Dataflow, you must configure worker VMs with [external IP addresses](/dataflow/docs/guides/routes-firewall#internet_access_for) so that they meet the internet access requirements.\n\nSpecify buckets to use with Cloud Storage FUSE\n----------------------------------------------\n\nTo specify a Cloud Storage bucket to mount to a VM, use the\n[`--experiments`](/dataflow/docs/reference/pipeline-options) flag. To specify\nmultiple buckets, use a semicolon delimiter (`;`) between bucket names.\n\nThe format is as follows: \n\n --experiments=\"gcsfuse_buckets=\u003cvar translate=\"no\"\u003eCONFIG\u003c/var\u003e\"\n\nReplace the following:\n\n- \u003cvar translate=\"no\"\u003eCONFIG\u003c/var\u003e: a semicolon-delimited list of\n Cloud Storage entries, where each entry is one of the following:\n\n 1. \u003cvar translate=\"no\"\u003eBUCKET_NAME\u003c/var\u003e: A Cloud Storage bucket name.\n For example, `dataflow-samples`. If you omit the bucket mode, the bucket\n is treated as read-only.\n\n 2. 
\u003cvar translate=\"no\"\u003eBUCKET_NAME\u003c/var\u003e`:`\u003cvar translate=\"no\"\u003eMODE\u003c/var\u003e: A\n Cloud Storage bucket name and its associated mode, where `MODE` is\n either `ro` (read-only) or `rw` (read-write).\n\n For example: \n\n --experiments=\"gcsfuse_buckets=read-bucket1;read-bucket2:ro;write-bucket1:rw\"\n\n In this example, specifying the mode assures the following:\n - `gs://read-bucket1` is mounted in read-only mode.\n - `gs://read-bucket2` is mounted in read-only mode.\n - `gs://write-bucket1` is mounted in read-write mode.\n\n Beam pipeline code can access these buckets at\n `/var/opt/google/gcs/`\u003cvar translate=\"no\"\u003eBUCKET_NAME\u003c/var\u003e."]]
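Putting the pieces together, here is a minimal sketch of how the pipeline options for the preceding example might be assembled in Python. The project, region, and temp_location values are hypothetical placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                    # hypothetical project ID
    region="us-central1",                    # hypothetical region
    temp_location="gs://write-bucket1/tmp",  # hypothetical temp path
    experiments=[
        "gcsfuse_buckets=read-bucket1;read-bucket2:ro;write-bucket1:rw"
    ],
)

# Inside the pipeline, files in read-bucket1 are then readable at:
#   /var/opt/google/gcs/read-bucket1
```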