Once a machine learning model is trained, the next step is to use it to make predictions on new data. This process, known as inference, requires a strategy that depends heavily on your application's latency and throughput requirements. Batch inference, also referred to as offline inference or asynchronous processing, is a powerful and highly efficient method for generating predictions on a large volume of data when immediate, real-time responses are not required.
Batch inference is the process of using a trained machine learning model to generate predictions on a large set of observations, or a "batch," all at once.
Unlike online inference, where predictions are made on single data points as they arrive, batch inference operates on data that has been collected over a period of time. This approach prioritizes high throughput and computational efficiency over low latency. Because the processing is done offline and not in direct response to a user request, it is also known as static inference—the predictions are generated and stored for later use.
Choosing between batch and online inference is a fundamental architectural decision in designing a machine learning system. Each approach serves a different purpose and is optimized for different performance characteristics.
| Feature | Batch inference | Online inference |
| --- | --- | --- |
| Data processing | Processes a large collection of data points together in a single job. | Processes a single data point or a very small group of data points as they arrive. |
| Primary optimization | High throughput and cost efficiency. | Low latency and immediate responsiveness. |
| Latency | High latency; predictions are not available immediately (minutes to hours). | Very low latency; predictions are returned in milliseconds. |
| Invocation | Triggered on a schedule (for example, a cron job) or on demand. | Triggered by a direct user request or an event in the system. |
| Compute utilization | Can use powerful compute resources for a short period, then scale down to zero. | Requires a server or endpoint to be constantly running and ready to accept requests. |
| Example use case | Generating daily product recommendations for all users of an e-commerce site. | Predicting whether a single credit card transaction is fraudulent as it happens. |
| Synonymous terms | Offline inference, asynchronous processing, static inference. | Real-time inference, synchronous processing, dynamic inference. |
A batch inference pipeline is a structured, automated workflow that moves data from its raw state to actionable predictions. The process can be broken down into these key steps, which are typically orchestrated by a workflow manager or scheduling system.
The process begins by accumulating data over time. This input data, which can include user activity logs, transaction records, or sensor readings, is collected from various sources and landed in a centralized storage location. This is often a data lake built on a service like Google Cloud Storage or a data warehouse like BigQuery.
The inference pipeline is initiated by a trigger. This trigger can be:
- A time-based schedule, such as a cron job that kicks off the pipeline every night or every week.
- An event, such as a new batch of raw data landing in a Cloud Storage bucket.
- A manual, on-demand run started by an operator or an upstream system.
Once triggered, the job loads the entire batch of raw input data. It then performs necessary preprocessing and feature engineering steps to transform the data into the precise format the machine learning model expects. This can include tasks like cleaning missing values, scaling numerical features, and encoding categorical variables.
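As an illustration, a minimal preprocessing sketch with pandas and scikit-learn might look like the following; the column names, file paths, and transformation choices are hypothetical.

```python
# A minimal preprocessing sketch using pandas and scikit-learn; column names,
# paths, and transformation choices are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["purchase_count", "days_since_last_visit"]
categorical_cols = ["country", "device_type"]

preprocessor = ColumnTransformer([
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # clean missing values
        ("scale", StandardScaler()),                    # scale numerical features
    ]), numeric_cols),
    ("categorical", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),  # encode categorical variables
    ]), categorical_cols),
])

# In a real pipeline you would load the transformer fitted during training and
# call transform() only; fit_transform keeps this sketch self-contained.
raw_batch = pd.read_parquet("gs://example-bucket/raw/latest/")  # needs gcsfs installed
features = preprocessor.fit_transform(raw_batch)
```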
The system retrieves the trained machine learning model from a central repository, such as the Vertex AI Model Registry. The preprocessed batch of data is then fed to the model, which runs inference on every single observation in the set to generate a corresponding prediction.
The output of the model—the collection of predictions—is then written to a storage system. This destination is chosen based on how the predictions will be used. Common destinations include loading the results into a BigQuery table for analysis, a Cloud SQL database for quick lookups by an application, or saving them as files in Cloud Storage.
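To make these two steps (scoring and storing results) concrete, here is a minimal sketch that assumes a scikit-learn classifier serialized with joblib and the pandas-gbq library for the BigQuery write; all paths, project IDs, and table names are placeholders.

```python
# A minimal scoring-and-storage sketch. Assumes a scikit-learn classifier
# serialized with joblib and pandas-gbq for the BigQuery write; paths, project
# IDs, and table names are placeholders.
import joblib
import pandas as pd

model = joblib.load("model.joblib")          # model artifact pulled from the registry
batch = pd.read_parquet("features.parquet")  # preprocessed feature batch

# Run inference on every observation in the batch.
predictions = batch.copy()
predictions["churn_score"] = model.predict_proba(batch)[:, 1]

# Write the full set of predictions to a BigQuery table for downstream use.
predictions.to_gbq(
    destination_table="analytics.daily_churn_scores",
    project_id="example-project",
    if_exists="replace",
)
```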
With the predictions now stored and ready, downstream systems can use them. A business intelligence tool might query the results to create a dashboard of predicted customer behavior. A web application's backend might load the pre-computed product recommendations to display to users, or a marketing automation platform might pull a list of customers predicted to churn to target them with a new campaign.
For many enterprise use cases, batch inference can offer significant advantages over real-time processing.
Cost efficiency
Batch processing allows you to optimize your use of compute resources. You can run large jobs on powerful hardware for a short period and then shut the resources down, avoiding the cost of maintaining a continuously running server.
High throughput and scalability
Batch systems are designed to scale and process terabytes of data efficiently. This makes it possible to apply complex models to very large datasets, something that might be too slow or expensive for an online system.
Simplicity of operations
Batch inference pipelines can be simpler to build and maintain than highly available, low-latency online inference systems. They are generally more resilient to transient failures and can be easily rerun if a job fails.
Enables complex feature engineering
Because batch inference is not constrained by low-latency requirements, you can perform more complex and computationally intensive feature engineering on your input data, which can often lead to more accurate models.
Better resource utilization
You can schedule batch jobs to run during off-peak hours, taking advantage of idle compute capacity and potentially lower spot pricing for virtual machines.
Batch inference is the preferred method for many core business processes where predictions enhance a product or service without needing to be generated in real time. This approach can be highly effective across various industries for solving large-scale data problems.
| Industry | Problem to solve | Example solution |
| --- | --- | --- |
| E-commerce and retail | Generate personalized product recommendations for the entire user base on a daily basis to ensure they are ready for fast retrieval when users visit the site. | Vertex AI Batch Predictions can run a recommendation model and load the results into a fast lookup database like Cloud SQL or Bigtable. |
| Telecommunications and SaaS | Identify which customers are at high risk of churning next month by analyzing usage patterns across the entire customer database. | BigQuery ML allows you to run a classification model directly on customer data stored in the data warehouse, with the results written to a new table for the retention team. |
| Finance and insurance | Forecast financial market trends or calculate risk scores for an entire portfolio of assets, which is a computationally intensive task performed periodically. | Vertex AI Batch Predictions can execute complex time-series models on a schedule, providing the data needed for strategic reports and dashboards. |
| Logistics and supply chain | Optimize inventory levels across hundreds of warehouses by running a complex demand forecasting simulation based on weekly sales and logistics data. | Google Kubernetes Engine (GKE) provides the custom, high-performance environment needed to run specialized simulation models with specific library and hardware requirements. |
| Healthcare | Analyze a large daily batch of medical images (such as X-rays or CT scans) to detect potential anomalies for later review by a radiologist. | GKE with GPU accelerators is ideal for running deep learning computer vision models on large sets of images, offering maximum control and performance. |
| Legal and compliance | Process and classify millions of existing documents to extract key entities, assess sentiment, and make the entire corpus searchable and analyzable. | Dataflow can be used to build a scalable NLP pipeline that preprocesses the text and runs inference, while GKE can be used for more custom model requirements. |
Vertex AI is Google Cloud's managed machine learning platform, and it provides a streamlined, serverless approach for batch inference. The process focuses on configuring a job and letting the platform handle the underlying infrastructure.
Before starting, you need three key things:
- A trained model that has been uploaded to the Vertex AI Model Registry.
- Input data in a supported format, stored either as files in Cloud Storage (for example, JSONL or CSV) or as a BigQuery table.
- A destination for the results: a Cloud Storage bucket or a BigQuery dataset where the predictions will be written.
You can initiate the job through the Google Cloud console, the gcloud command-line tool, or programmatically using the Vertex AI SDK. When you create the job, you provide the following configuration (see the sketch after this list):
- The model to use for the predictions.
- The location and format of the input data (a Cloud Storage path or a BigQuery table).
- The output destination where the predictions should be written.
- The machine type, and optionally accelerators, to provision for the job.
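For example, a minimal sketch of submitting such a job with the Vertex AI SDK for Python might look like the following; the project ID, model resource name, bucket paths, and machine type are placeholders.

```python
# A minimal sketch of creating a Vertex AI batch prediction job with the
# Vertex AI SDK for Python (google-cloud-aiplatform). Project, model, and
# bucket values are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="example-project", location="us-central1")

# Reference a model already uploaded to the Vertex AI Model Registry.
model = aiplatform.Model(
    "projects/example-project/locations/us-central1/models/1234567890"
)

batch_job = model.batch_predict(
    job_display_name="nightly-recommendations",
    gcs_source="gs://example-bucket/input/instances.jsonl",
    gcs_destination_prefix="gs://example-bucket/predictions/",
    instances_format="jsonl",
    predictions_format="jsonl",
    machine_type="n1-standard-4",
    starting_replica_count=1,
    max_replica_count=4,
    sync=False,  # submit the job and return immediately
)

batch_job.wait()        # optionally block until the job finishes
print(batch_job.state)  # inspect the final job state
```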
Once you submit the job, Vertex AI takes over. It automatically provisions the compute resources you specified, runs your input data through the model, generates the predictions, and saves them to your designated output location. After the job is complete, Vertex AI automatically scales down all resources to zero, so you only pay for the computation time you use. You can monitor the job's progress and view logs directly in the Google Cloud console.
Once the job status shows as "Succeeded," your predictions are ready. The output files in Cloud Storage or the new table in BigQuery can now be accessed by your downstream applications, analytics tools, or BI dashboards.
Using Google Kubernetes Engine (GKE) for batch inference offers maximum control and portability, making it ideal for teams with existing Kubernetes expertise or specialized requirements. The setup involves containerizing your inference logic and managing its execution with Kubernetes resources.
Step 1: Containerize the inference application. The first step is to package your prediction code into a container image.
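As a rough illustration, the container image might be built from a Dockerfile along these lines; the base image, dependency list, and script name are assumptions.

```dockerfile
# A minimal Dockerfile sketch for a batch inference container; the script name
# and dependencies are assumptions.
FROM python:3.11-slim

WORKDIR /app

# Install the inference dependencies (for example, scikit-learn, pandas, gcsfs).
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the prediction script that reads the batch, scores it, and writes results.
COPY batch_inference.py .

# The container runs the batch job to completion and then exits.
ENTRYPOINT ["python", "batch_inference.py"]
```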
Step 2: Define a Kubernetes job. Instead of a long-running deployment, you define the batch task using a Kubernetes Job or CronJob manifest (a YAML file). This file specifies:
- The container image that packages your inference code.
- The compute resources the pod needs, such as CPU, memory, and any GPU accelerators.
- The retry behavior if the job fails.
- For a CronJob, the schedule on which the job should run.
Step 3: Execute the job on the GKE cluster. You apply the manifest to your GKE cluster using kubectl. GKE's control plane then schedules a pod to run your inference container on a suitable node in the cluster. For recurring tasks, a CronJob resource automatically creates a new Job based on a predefined schedule (for example, 0 2 * * * for 2 a.m. every day).
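A sketch of such a CronJob manifest might look like the following; the image path and resource requests are placeholders.

```yaml
# A sketch of a CronJob manifest for a nightly batch inference run; the image
# path and resource requests are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-inference
spec:
  schedule: "0 2 * * *"            # run at 2 a.m. every day
  jobTemplate:
    spec:
      backoffLimit: 2              # retry a failed pod up to two times
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: inference
            image: us-docker.pkg.dev/example-project/ml/batch-inference:latest
            resources:
              requests:
                cpu: "4"
                memory: 16Gi
```

You would submit this manifest with a command such as kubectl apply -f cronjob.yaml.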
Step 4: Implement data handling and store the results. Unlike the managed Vertex AI approach, the application code inside your container is responsible for handling all data I/O. Your script must include the logic to connect to and read from the data source (for example, a Cloud Storage bucket) and write the final predictions back to your chosen destination (for example, a Cloud SQL database).
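The sketch below illustrates what that I/O logic might look like inside the container, assuming pandas with gcsfs for the Cloud Storage read, joblib for the model artifact, and SQLAlchemy with psycopg2 for the Cloud SQL write (reached through the Cloud SQL Auth Proxy); the bucket, table, column names, and connection details are placeholders.

```python
# A sketch of the data I/O inside the inference container. Assumes pandas with
# gcsfs for Cloud Storage access, joblib for the model artifact, and SQLAlchemy
# with psycopg2 for writing to Cloud SQL via the Cloud SQL Auth Proxy on
# localhost. All names and credentials are placeholders.
import joblib
import pandas as pd
import sqlalchemy

# 1. Read the accumulated input batch from Cloud Storage.
batch = pd.read_parquet("gs://example-bucket/input/latest/")

# 2. Score every observation with the trained model shipped in the image.
model = joblib.load("model.joblib")
feature_cols = [c for c in batch.columns if c != "user_id"]  # "user_id" is hypothetical
batch["prediction"] = model.predict(batch[feature_cols])

# 3. Write the predictions to Cloud SQL for fast lookups by the application.
engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://scorer:password@127.0.0.1:5432/app_db"
)
batch[["user_id", "prediction"]].to_sql(
    "daily_predictions", engine, if_exists="replace", index=False
)
```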