What is batch inference?

Once a machine learning model is trained, the next step is to use it to make predictions on new data. This process, known as inference, requires a strategy that depends heavily on your application's latency and throughput requirements. Batch inference, also referred to as offline inference or asynchronous processing, is a powerful and highly efficient method for generating predictions on a large volume of data when immediate, real-time responses are not required.

Batch inference defined

Batch inference is the process of using a trained machine learning model to generate predictions on a large set of observations, or a "batch," all at once.

Unlike online inference, where predictions are made on single data points as they arrive, batch inference operates on data that has been collected over a period of time. This approach prioritizes high throughput and computational efficiency over low latency. Because the processing is done offline and not in direct response to a user request, it is also known as static inference—the predictions are generated and stored for later use.

Key characteristics of batch inference

  • Asynchronous processing: Predictions are generated on a predefined schedule (for example, hourly, daily) or on demand, not in real time as new data comes in
  • High throughput: The system is optimized to process a massive number of data points in a single run, making it highly efficient
  • Cost-effectiveness: By running on a schedule, you can use compute resources when they are most available or least expensive, significantly lowering operational costs
  • Latency tolerance: The primary assumption is that the application consuming the predictions does not need an immediate answer; a delay of minutes or hours between data collection and prediction generation is acceptable

Batch inference versus online inference

Choosing between batch and online inference is a fundamental architectural decision in designing a machine learning system. Each approach serves a different purpose and is optimized for different performance characteristics.

| Feature | Batch inference | Online inference |
| --- | --- | --- |
| Data processing | Processes a large collection of data points together in a single job. | Processes a single data point or a very small group of data points as they arrive. |
| Primary optimization | High throughput and cost efficiency. | Low latency and immediate responsiveness. |
| Latency | High latency; predictions are not available immediately (minutes to hours). | Very low latency; predictions are returned in milliseconds. |
| Invocation | Triggered on a schedule (for example, a cron job) or on demand. | Triggered by a direct user request or an event in the system. |
| Compute utilization | Can use powerful compute resources for a short period, then scale down to zero. | Requires a server or endpoint to be constantly running and ready to accept requests. |
| Example use case | Generating daily product recommendations for all users of an e-commerce site. | Predicting whether a single credit card transaction is fraudulent as it happens. |
| Synonymous terms | Offline inference, asynchronous processing, static inference. | Real-time inference, synchronous processing, dynamic inference. |

How does batch inference work?

A batch inference pipeline is a structured, automated workflow that moves data from its raw state to actionable predictions. The process can be broken down into these key steps, which are typically orchestrated by a workflow manager or scheduling system.

Step 1: Data collection and storage

The process begins by accumulating data over time. This input data, which can include user activity logs, transaction records, or sensor readings, is collected from various sources and landed in a centralized storage location. This is often a data lake built on a service like Google Cloud Storage or a data warehouse like BigQuery.

Step 2: Triggering the batch job

The inference pipeline is initiated by a trigger. This trigger can be:

  • Time-based: A scheduler (like a cron job) kicks off the job at a regular interval, such as every night at 1 a.m.
  • Event-based: The job starts in response to a specific event, such as the arrival of a new data file in a Cloud Storage bucket
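
As a concrete illustration of the event-based option, the sketch below shows a Cloud Functions-style handler built with the functions-framework library and wired to a Cloud Storage "object finalized" event. The bucket layout and the run_batch_pipeline helper are hypothetical placeholders for your own pipeline-launch logic.

```python
# A minimal sketch of an event-based trigger, assuming a Cloud Functions-style
# handler built with the functions-framework library and wired to a Cloud
# Storage "object finalized" event. run_batch_pipeline is a hypothetical
# placeholder for whatever actually submits your batch job.
import functions_framework


def run_batch_pipeline(input_uri: str) -> None:
    """Hypothetical placeholder: submit the batch inference job for input_uri."""
    print(f"Submitting batch inference job for {input_uri}")


@functions_framework.cloud_event
def on_new_data_file(cloud_event):
    """Fires when a new object lands in the watched Cloud Storage bucket."""
    data = cloud_event.data
    input_uri = f"gs://{data['bucket']}/{data['name']}"
    run_batch_pipeline(input_uri)
```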

Step 3: Preprocessing the data

Once triggered, the job loads the entire batch of raw input data. It then performs necessary preprocessing and feature engineering steps to transform the data into the precise format the machine learning model expects. This can include tasks like cleaning missing values, scaling numerical features, and encoding categorical variables.
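
The sketch below illustrates this step with pandas and scikit-learn. The column names are hypothetical, and in a real pipeline you would reuse the scaler and encoders fitted during training rather than refitting them on the inference batch.

```python
# A minimal preprocessing sketch with pandas and scikit-learn. The column names
# (monthly_spend, plan_type) are hypothetical; substitute the features your
# model actually expects.
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Clean missing values.
    df["monthly_spend"] = df["monthly_spend"].fillna(0.0)

    # Scale numerical features. In production, reuse the scaler fitted during
    # training instead of refitting it on the inference batch.
    df[["monthly_spend"]] = StandardScaler().fit_transform(df[["monthly_spend"]])

    # Encode categorical variables.
    return pd.get_dummies(df, columns=["plan_type"])
```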

Step 4: Generating predictions

The system retrieves the trained machine learning model from a central repository, such as the Vertex AI Model Registry. The preprocessed batch of data is then fed to the model, which runs inference on every single observation in the set to generate a corresponding prediction.
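
A minimal sketch of this step is shown below, assuming a scikit-learn classifier serialized with joblib; the file paths and the churn-scoring use case are hypothetical placeholders.

```python
# A minimal sketch of the prediction step, assuming a scikit-learn classifier
# serialized with joblib. All file paths are hypothetical; in a real pipeline
# the model would be pulled from a registry such as the Vertex AI Model Registry.
import joblib
import pandas as pd

model = joblib.load("model/churn_classifier.joblib")
features = pd.read_parquet("staging/preprocessed_batch.parquet")

# Score every observation in the batch with one vectorized call.
scores = model.predict_proba(features)[:, 1]

results = pd.DataFrame(
    {"customer_id": features.index, "churn_probability": scores}
)
results.to_parquet("staging/predictions.parquet")
```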

Step 5: Storing the results

The output of the model—the collection of predictions—is then written to a storage system. This destination is chosen based on how the predictions will be used. Common destinations include loading the results into a BigQuery table for analysis, a Cloud SQL database for quick lookups by an application, or saving them as files in Cloud Storage.
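
For example, if the destination is BigQuery, the predictions can be loaded with the google-cloud-bigquery client; the project, dataset, and table names below are hypothetical.

```python
# A minimal sketch of loading predictions into BigQuery with the
# google-cloud-bigquery client. The project, dataset, and table names are
# hypothetical placeholders.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_id = "my-project.ml_outputs.churn_predictions"

predictions = pd.read_parquet("staging/predictions.parquet")

load_job = client.load_table_from_dataframe(
    predictions,
    table_id,
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
)
load_job.result()  # Block until the load completes.
```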

Step 6: Consuming the predictions

With the predictions now stored and ready, downstream systems can use them. A business intelligence tool might query the results to create a dashboard of predicted customer behavior. A web application's backend might load the pre-computed product recommendations to display to users, or a marketing automation platform might pull a list of customers predicted to churn to target them with a new campaign.

Benefits of batch inference

For many enterprise use cases, batch inference can offer significant advantages over real-time processing.

Cost efficiency

Batch processing allows you to optimize your use of compute resources. You can run large jobs on powerful hardware for a short period and then shut the resources down, avoiding the cost of maintaining a continuously running server.

High throughput and scalability

Batch systems are designed to scale and process terabytes of data efficiently. This makes it possible to apply complex models to very large datasets, something that might be too slow or expensive for an online system.

Simplicity of operations

Batch inference pipelines can be simpler to build and maintain than highly available, low-latency online inference systems. They are generally more resilient to transient failures and can be easily rerun if a job fails.

Enables complex feature engineering

Because batch inference is not constrained by low-latency requirements, you can perform more complex and computationally intensive feature engineering on your input data, which can often lead to more accurate models.

Better resource utilization

You can schedule batch jobs to run during off-peak hours, taking advantage of idle compute capacity and potentially lower spot pricing for virtual machines.

Use cases of batch inference

Batch inference is the preferred method for many core business processes where predictions enhance a product or service without needing to be generated in real time. This approach can be highly effective across various industries for solving large-scale data problems.

| Industry | Problem to solve | Example solution |
| --- | --- | --- |
| E-commerce and retail | Generate personalized product recommendations for the entire user base on a daily basis to ensure they are ready for fast retrieval when users visit the site. | Vertex AI Batch Predictions can run a recommendation model and load the results into a fast lookup database like Cloud SQL or Bigtable. |
| Telecommunications and SaaS | Identify which customers are at high risk of churning next month by analyzing usage patterns across the entire customer database. | BigQuery ML allows you to run a classification model directly on customer data stored in the data warehouse, with the results written to a new table for the retention team. |
| Finance and insurance | Forecast financial market trends or calculate risk scores for an entire portfolio of assets, which is a computationally intensive task performed periodically. | Vertex AI Batch Predictions can execute complex time-series models on a schedule, providing the data needed for strategic reports and dashboards. |
| Logistics and supply chain | Optimize inventory levels across hundreds of warehouses by running a complex demand forecasting simulation based on weekly sales and logistics data. | Google Kubernetes Engine (GKE) provides the custom, high-performance environment needed to run specialized simulation models with specific library and hardware requirements. |
| Healthcare | Analyze a large daily batch of medical images (such as X-rays or CT scans) to detect potential anomalies for later review by a radiologist. | GKE with GPU accelerators is ideal for running deep learning computer vision models on large sets of images, offering maximum control and performance. |
| Legal and compliance | Process and classify millions of existing documents to extract key entities, assess sentiment, and make the entire corpus searchable and analyzable. | Dataflow can be used to build a scalable NLP pipeline that preprocesses the text and runs inference, while GKE can be used for more custom model requirements. |

How to set up batch inference in Vertex AI

Vertex AI is Google Cloud's managed machine learning platform, and it provides a streamlined, serverless approach for batch inference. The process focuses on configuring a job and letting the platform handle the underlying infrastructure.

Step 1: Prepare your assets

Before starting, you need three key things:

  • A trained model, which should be uploaded to the Vertex AI Model Registry
  • Your input data, formatted as required by your model; this data should be located in either Google Cloud Storage (for example, as JSON or CSV files) or a BigQuery table
  • A destination location for the output, which will also be a Cloud Storage bucket or a BigQuery table
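
For the first of these assets, a model can be registered programmatically with the Vertex AI SDK for Python (google-cloud-aiplatform), as in the sketch below. The project, region, bucket, and serving container image are hypothetical placeholders; use a current prebuilt prediction image for your framework.

```python
# A minimal sketch of registering a trained model with the Vertex AI SDK for
# Python (google-cloud-aiplatform). The project, region, bucket, and serving
# container image are hypothetical placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="churn-classifier",
    artifact_uri="gs://my-bucket/models/churn/",  # where the saved model artifacts live
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)
print(model.resource_name)  # projects/.../locations/.../models/...
```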

Step 2: Create a batch prediction job

You can initiate the job through the Google Cloud console, the gcloud command-line tool, or programmatically using the Vertex AI SDK. When you create the job, you will provide the following configuration:

  • The specific model from the Model Registry that you want to use
  • The path to your input data and the location for your output results
  • The machine type and accelerator (for example, GPU) you want to use for the job; this allows you to balance cost and performance
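
The sketch below shows what this configuration might look like with the Vertex AI SDK for Python; the model resource name, Cloud Storage URIs, and machine type are hypothetical placeholders.

```python
# A minimal sketch of submitting a batch prediction job with the Vertex AI SDK
# for Python. The model resource name, Cloud Storage URIs, and machine type are
# hypothetical placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

batch_job = model.batch_predict(
    job_display_name="nightly-churn-scoring",
    gcs_source="gs://my-bucket/input/daily_batch.jsonl",
    gcs_destination_prefix="gs://my-bucket/output/",
    machine_type="n1-standard-4",
)  # Blocks until the job finishes because sync defaults to True.

print(batch_job.state)
```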

Step 3: Execute and monitor the job

Once you submit the job, Vertex AI takes over. It automatically provisions the compute resources you specified, runs your input data through the model, generates the predictions, and saves them to your designated output location. After the job is complete, Vertex AI automatically scales down all resources to zero, so you only pay for the computation time you use. You can monitor the job's progress and view logs directly in the Google Cloud console.

Step 4: Access and use the predictions

Once the job status shows as "Succeeded," your predictions are ready. The output files in Cloud Storage or the new table in BigQuery can now be accessed by your downstream applications, analytics tools, or BI dashboards.
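
For example, if the output was written to BigQuery, a downstream application could query it directly, as in the sketch below; the table and column names are hypothetical placeholders.

```python
# A minimal sketch of a downstream consumer querying a BigQuery output table.
# The table and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT customer_id, churn_probability
    FROM `my-project.ml_outputs.churn_predictions`
    WHERE churn_probability > 0.8
    ORDER BY churn_probability DESC
"""

high_risk_customers = client.query(query).to_dataframe()
print(high_risk_customers.head())
```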

How to set up batch inference in GKE

Using Google Kubernetes Engine (GKE) for batch inference offers maximum control and portability, making it ideal for teams with existing Kubernetes expertise or specialized requirements. The setup involves containerizing your inference logic and managing its execution with Kubernetes resources.

Step 1: Containerize the inference application. The first step is to package your prediction code into a container image.

  • Write a script (for example, in Python) that loads your trained model, reads data from a source, performs inference, and writes the results to a destination
  • Create a Dockerfile; this file defines the steps to build your container, including specifying a base image, installing dependencies (like tensorflow or pandas), and copying your model file and inference script into the image
  • Build the image and push it to a container registry like Artifact Registry
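
A minimal sketch of the inference script described in the first bullet is shown below, assuming a scikit-learn model copied into the image at build time and input/output locations passed as environment variables. All paths and bucket names are hypothetical, and reading gs:// URIs with pandas requires the gcsfs package.

```python
# A minimal sketch of the inference script packaged into the container,
# assuming a scikit-learn model copied into the image at build time and
# input/output locations passed as environment variables. All paths and bucket
# names are hypothetical; reading gs:// URIs with pandas requires gcsfs.
import os

import joblib
import pandas as pd

INPUT_URI = os.environ["INPUT_URI"]    # for example, gs://my-bucket/input/daily_batch.csv
OUTPUT_URI = os.environ["OUTPUT_URI"]  # for example, gs://my-bucket/output/predictions.csv


def main() -> None:
    # Load the model baked into the image by the Dockerfile.
    model = joblib.load("/app/model.joblib")

    # Read the input batch from Cloud Storage, score it, and write results back.
    features = pd.read_csv(INPUT_URI)
    features["prediction"] = model.predict(features)
    features.to_csv(OUTPUT_URI, index=False)


if __name__ == "__main__":
    main()
```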

Step 2: Define a Kubernetes job. Instead of a long-running deployment, you define the batch task using a Kubernetes Job or CronJob manifest (a YAML file). This file specifies:

  • The container image to use from Artifact Registry
  • The compute resources required (CPU, memory, GPUs)
  • Any necessary configurations, such as environment variables for file paths or secrets for database credentials
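
The same specification can also be defined and submitted with the official Kubernetes Python client rather than a hand-written YAML manifest, as sketched below; the image name, namespace, resource requests, and environment values are hypothetical placeholders.

```python
# A minimal sketch that defines and submits the batch Job with the official
# Kubernetes Python client rather than a hand-written YAML manifest. The image
# name, namespace, resources, and environment values are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # Use load_incluster_config() when running inside the cluster.

container = client.V1Container(
    name="batch-inference",
    image="us-central1-docker.pkg.dev/my-project/ml/batch-inference:latest",
    env=[
        client.V1EnvVar(name="INPUT_URI", value="gs://my-bucket/input/daily_batch.csv"),
        client.V1EnvVar(name="OUTPUT_URI", value="gs://my-bucket/output/predictions.csv"),
    ],
    resources=client.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "4Gi"},
        limits={"cpu": "4", "memory": "8Gi"},
    ),
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="nightly-batch-inference"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```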

Step 3: Execute the job on the GKE cluster. You apply the manifest to your GKE cluster using kubectl. GKE's control plane then schedules a pod to run your inference container on a suitable node in the cluster. For recurring tasks, a CronJob resource automatically creates a new Job based on a predefined schedule (for example, 0 2 * * * for 2 a.m. every day).

Step 4: Implement data handling and store the results. Unlike the managed Vertex AI approach, the application code inside your container is responsible for handling all data I/O. Your script must include the logic to connect to and read from the data source (for example, a Cloud Storage bucket) and write the final predictions back to your chosen destination (for example, a Cloud SQL database).
