Use the Colab Enterprise Data Science Agent with BigQuery

The Data Science Agent (DSA) for Colab Enterprise and BigQuery lets you automate exploratory data analysis, perform machine learning tasks, and deliver insights, all within a Colab Enterprise notebook.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the BigQuery, Vertex AI, Dataform, and Compute Engine APIs.

    Enable the APIs

    For new projects, the BigQuery API is automatically enabled.

If you're new to Colab Enterprise in BigQuery, see the setup steps on the Create notebooks page.

Limitations

  • The Data Science Agent supports the following data sources:
    • CSV files
    • BigQuery tables
  • The code produced by the Data Science Agent only runs in your notebook's runtime.
  • The Data Science Agent isn't supported in projects that have enabled VPC Service Controls.
  • The first time you run the Data Science Agent in a project, initial setup can take approximately five to ten minutes. This latency occurs only once per project.

When to use the Data Science Agent

The Data Science Agent helps you with tasks ranging from exploratory data analysis to generating machine learning predictions and forecasts. You can use the DSA to:

  • Plan generation: Generate and modify a plan to complete a particular task using common tools such as Python, SQL, and BigQuery DataFrames.
  • Data exploration: Explore a dataset to understand its structure, identify potential issues like missing values and outliers, and examine the distribution of key variables using Python or SQL.
  • Data cleaning: Clean your data. For example, remove data points that are outliers.
  • Data wrangling: Convert categorical features into numerical representations using techniques like one-hot encoding or label encoding or by using BigQuery feature transformation tools. Create new features for analysis.
  • Data analysis: Analyze the relationships between different variables. Calculate correlations between numerical features and explore distributions of categorical features. Look for patterns and trends in the data.
  • Data visualization: Create visualizations such as histograms, box plots, scatter plots, and bar charts that represent the distributions of individual variables and the relationships between them. You can also create visualizations in Python for tables stored in BigQuery.
  • Feature engineering: Engineer new features from a cleaned dataset.
  • Data splitting: Split an engineered dataset into training, validation, and testing datasets.
  • Model training: Train a model by using the training data in a pandas DataFrame (X_train, y_train), a BigQuery DataFrame, or by using the BigQuery ML CREATE MODEL statement with BigQuery tables.
  • Model optimization: Optimize a model by using the validation set. Explore alternative models like DecisionTreeRegressor and RandomForestRegressor and compare their performance.
  • Model evaluation: Evaluate the best performing model on a test dataset stored in a pandas DataFrame or BigQuery DataFrame. Assess model quality, make comparisons between models, or predict model performance by using BigQuery ML model evaluation functions.
  • Model inference: Perform inference with BigQuery ML trained models, imported models, and remote models using BigQuery ML inference functions. Or use the BigFrames model.predict() method to make predictions on unseen data.
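The splitting, training, and evaluation tasks above follow a common pattern: partition the data, fit a model on the training portion, and score it on held-out data. The following is a minimal, pure-Python sketch of that pattern for illustration only; the agent itself typically generates equivalent code with scikit-learn, BigQuery DataFrames, or BigQuery ML, and the dataset here is made up:

```python
import random

random.seed(42)

# Hypothetical dataset: (feature, target) pairs with a little noise.
data = [(x, 2.0 * x + random.uniform(-1.0, 1.0)) for x in range(100)]

# Data splitting: shuffle, then cut into 70% train / 15% validation / 15% test.
random.shuffle(data)
n = len(data)
train = data[: int(n * 0.70)]
valid = data[int(n * 0.70) : int(n * 0.85)]
test = data[int(n * 0.85) :]

# Model training: a trivial baseline that always predicts the mean target.
mean_target = sum(y for _, y in train) / len(train)

# Model evaluation: mean absolute error of the baseline on the test set.
mae = sum(abs(mean_target - y) for _, y in test) / len(test)
print(f"train={len(train)} valid={len(valid)} test={len(test)} baseline MAE={mae:.1f}")
```

The validation slice is where the agent's model-optimization step would compare candidate models before the final test-set evaluation.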

Use the Data Science Agent in BigQuery

The following steps show you how to use the Data Science Agent in BigQuery.

  1. Create or open a Colab Enterprise notebook.

  2. Upload a CSV file, choose one or more BigQuery tables from the table selector, or reference a BigQuery table in your prompt.

  3. Enter a prompt that describes the data analysis you want to perform or the prototype you want to build. By default, the Data Science Agent generates Python code that uses open source libraries such as scikit-learn to accomplish complex machine learning tasks. To use a specific tool, include the following keywords in your prompt:

    • If you want to use BigQuery ML, include the "SQL" keyword.
    • If you want to use BigQuery DataFrames, include the "BigFrames" or "BigQuery DataFrames" keywords.

    For help, see the sample prompts.

  4. Examine the results.

Analyze a CSV file

To analyze a CSV file by using the Data Science Agent in BigQuery, follow these steps.

  1. Go to the BigQuery page.

    Go to BigQuery

  2. On the BigQuery Studio welcome page, under Create new, click Notebook.

    Alternatively, in the tab bar, click the drop-down arrow next to the + icon, and then click Notebook > Empty notebook.

  3. In the toolbar, click the spark (Toggle Gemini) button to open the chat dialog.

  4. Upload your CSV file.

    1. In the chat dialog, click Add files.

    2. If necessary, authorize your Google Account.

    3. In the action pane, click Upload file.

    4. Browse to the location of the CSV file, and then click Open.

    5. Beside the filename, click the More actions icon, and then choose Add to Gemini.

  5. Enter your prompt in the chat window. For example: Identify trends and anomalies in this file.

  6. Click Send. The results appear in the chat window.

  7. You can ask the agent to change the plan, or you can run it by clicking Accept & run. As the plan runs, generated code and text appear in the notebook. Click Cancel to stop.

Analyze BigQuery tables

To analyze a BigQuery table, choose one or more tables in the table selector, or provide a reference to the table in your prompt.

  1. Go to the BigQuery page.

    Go to BigQuery

  2. On the BigQuery Studio welcome page, under Create new, click Notebook.

    Alternatively, in the tab bar, click the drop-down arrow next to the + icon, and then click Notebook > Empty notebook.

  3. In the toolbar, click the spark (Toggle Gemini) button to open the chat dialog.

  4. Enter your prompt in the chat window.

  5. Choose one or more tables using the table selector:

    1. Click Add to Gemini > BigQuery tables.

    2. In the BigQuery tables window, select one or more tables in your project. You can search for tables across projects and filter tables by using the search bar.

  6. You can also reference a BigQuery table directly in your prompt. For example: "Help me perform exploratory data analysis and get insights about the data in this table: project_id:dataset.table."

    Replace the following:

    • project_id: your project ID
    • dataset: the name of the dataset that contains the table you're analyzing
    • table: the name of the table you're analyzing
  7. Click Send.

    The results appear in the chat window.

  8. You can ask the agent to change the plan, or you can run it by clicking Accept & run. As the plan runs, generated code and text appear in the notebook. For additional steps in the plan, you may be required to click Accept & run again. Click Cancel to stop.
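The project_id:dataset.table reference format used in prompts has three colon- and dot-delimited parts. This small helper is a hypothetical sketch of how that format breaks down; it is not part of the agent or of any Google API:

```python
def parse_table_ref(ref: str) -> tuple[str, str, str]:
    """Split a 'project_id:dataset.table' reference into its three parts."""
    project, _, rest = ref.partition(":")
    dataset, _, table = rest.partition(".")
    if not (project and dataset and table):
        raise ValueError(f"expected 'project_id:dataset.table', got {ref!r}")
    return project, dataset, table

print(parse_table_ref("bigquery-public-data:ml_datasets.penguins"))
```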

Sample prompts

Regardless of the complexity of the prompt that you use, the Data Science Agent generates a plan that you can refine to meet your needs.

The following examples show the types of prompts that you can use with the DSA.

Python prompts

Python code is generated by default unless you use a specific keyword in the prompt such as "BigQuery ML" or "SQL".

  • Investigate and fill missing values by using the k-Nearest Neighbors (KNN) machine learning algorithm.
  • Create a plot of salary by experience level. Use the experience_level column to group the salaries, and create a box plot for each group showing the values from the salary_in_usd column.
  • Use the XGBoost algorithm to make a model for determining the class variable of a particular fruit. Split the data into training and testing datasets to generate a model and to determine the model's accuracy. Create a confusion matrix that shows the correct and incorrect predictions for each class.
  • Forecast target_variable from filename.csv for the next six months.
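For the box-plot prompt above, the generated code typically groups rows by experience_level and computes the quartiles of salary_in_usd that a box plot is drawn from. The following stdlib-only sketch shows that grouping step; the column names come from the prompt, the sample rows are made up, and the agent itself would usually use pandas and matplotlib instead:

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical rows with the columns named in the prompt.
rows = [
    {"experience_level": "EN", "salary_in_usd": 60000},
    {"experience_level": "EN", "salary_in_usd": 72000},
    {"experience_level": "EN", "salary_in_usd": 65000},
    {"experience_level": "SE", "salary_in_usd": 120000},
    {"experience_level": "SE", "salary_in_usd": 140000},
    {"experience_level": "SE", "salary_in_usd": 155000},
]

# Group salaries by experience level.
groups = defaultdict(list)
for row in rows:
    groups[row["experience_level"]].append(row["salary_in_usd"])

# Compute the five-number summary that each box in the plot represents.
for level, salaries in sorted(groups.items()):
    q1, q2, q3 = quantiles(salaries, n=4)
    print(f"{level}: min={min(salaries)} Q1={q1} median={q2} Q3={q3} max={max(salaries)}")
```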

SQL and BigQuery ML prompts

  • Create and evaluate a classification model on bigquery-public-data.ml_datasets.census_adult_income using BigQuery SQL.
  • Using SQL, forecast the future traffic of my website for the next month based on bigquery-public-data.google_analytics_sample.ga_sessions_*. Then, plot the historical and forecasted values.
  • Group similar customers together to create targeted marketing campaigns using a KMeans model. Use three features for clustering. Then visualize the results by creating a series of 2D scatter plots. Use the table bigquery-public-data.ml_datasets.census_adult_income.
  • Generate text embeddings in BigQuery ML using the review content in bigquery-public-data.imdb.reviews.

For a list of supported models and machine learning tasks, see the BigQuery ML documentation.
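A BigQuery ML classification prompt like the first example typically leads the agent to emit a CREATE MODEL statement. The helper below is a hypothetical sketch (not an agent or Google API) that assembles such a statement so you can see the shape of the generated SQL; the model name and label column are illustrative assumptions:

```python
def build_create_model_sql(model_name: str, label_col: str, source_table: str) -> str:
    """Assemble a BigQuery ML CREATE MODEL statement for a logistic regression."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS(model_type='logistic_reg', input_label_cols=['{label_col}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )

sql = build_create_model_sql(
    "mydataset.census_model",  # hypothetical destination model
    "income_bracket",          # assumed label column in the census table
    "bigquery-public-data.ml_datasets.census_adult_income",
)
print(sql)
```

Running the resulting statement in BigQuery trains the model server-side, which is why the SQL keyword steers the agent away from generating local Python training code.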

DataFrame prompts

  • Create a pandas DataFrame for the data in project_id:dataset.table. Analyze the data for null values, and then graph the distribution of each column using an appropriate graph type: violin plots for measured values and bar plots for categories.
  • Read filename.csv and construct a DataFrame. Run analysis on the DataFrame to determine what needs to be done with the values. For example, are there missing values that need to be replaced or removed, or are there duplicate rows that need to be addressed? Use the data file to determine the distribution of the money invested in USD per city location. Graph the top 20 results using a bar graph that shows the results in descending order as Location versus Avg Amount Invested (USD).
  • Create and evaluate a classification model on project_id:dataset.table using BigQuery DataFrames.
  • Create a time series forecasting model on project_id:dataset.table using BigQuery DataFrames, and visualize the model evaluations.
  • Visualize the sales figures in the past year in BigQuery table project_id:dataset.table using BigQuery DataFrames.
  • Find the features that can best predict the penguin species from the table bigquery-public-data.ml_datasets.penguins using BigQuery DataFrames.
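The filename.csv prompt above asks for the average invested amount per location in descending order. As a stand-in for the pandas code the agent would generate, here is a stdlib-only sketch of that aggregation; the inline CSV and its column names are made up for illustration:

```python
import csv
import io

# Inline stand-in for filename.csv; real data would come from the uploaded file.
raw = """location,amount_invested_usd
London,1000000
Berlin,250000
London,3000000
Paris,600000
Berlin,750000
"""

totals, counts = {}, {}
for row in csv.DictReader(io.StringIO(raw)):
    loc = row["location"]
    totals[loc] = totals.get(loc, 0.0) + float(row["amount_invested_usd"])
    counts[loc] = counts.get(loc, 0) + 1

# Average invested amount per location, highest first.
averages = sorted(
    ((loc, totals[loc] / counts[loc]) for loc in totals),
    key=lambda pair: pair[1],
    reverse=True,
)
for loc, avg in averages:
    print(f"{loc}: {avg:,.0f}")
```

With pandas this collapses to a groupby-mean followed by sort_values; the agent's plan would also include the bar graph the prompt requests.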

Turn off Gemini in BigQuery

To turn off Gemini in BigQuery for a Google Cloud project, an administrator must turn off the Gemini for Google Cloud API. See Disabling services.

To turn off Gemini in BigQuery for a specific user, an administrator must revoke the Gemini for Google Cloud User (roles/cloudaicompanion.user) role for that user. See Revoke a single IAM role.

Pricing

During Preview, you are charged for running code in the notebook's runtime and for any BigQuery slots that you use. For more information, see Colab Enterprise pricing.

Supported regions

To view the supported regions for Colab Enterprise's Data Science Agent, see Locations.