The ML.QUANTILE_BUCKETIZE function

This document describes the ML.QUANTILE_BUCKETIZE function, which lets you break a continuous numerical feature into buckets based on quantiles.

When used in the TRANSFORM clause, the same quantiles are automatically used in prediction.

You can use this function with models that support manual feature preprocessing. For more information, see the following documents:

Syntax

ML.QUANTILE_BUCKETIZE(numerical_expression, num_buckets [, output_format]) OVER()

Arguments

ML.QUANTILE_BUCKETIZE takes the following arguments:

  • numerical_expression: the numerical expression to bucketize.
  • num_buckets: an INT64 value that specifies the number of buckets to split numerical_expression into.
  • output_format: a STRING value that specifies the output format of the bucket. Valid output formats are as follows:
    • bucket_names: returns a STRING value in the format bin_<bucket_index>. For example, bin_3. The bucket_index value starts at 1. This is the default bucket format.
    • bucket_ranges: returns a STRING value in the format [lower_bound, upper_bound) in interval notation. For example, (-inf, 2.5), [2.5, 4.6), [4.6, +inf).
    • bucket_ranges_json: returns a JSON-formatted STRING value in the format {"start": "lower_bound", "end": "upper_bound"}. For example, {"start": "-Infinity", "end": "2.5"}, {"start": "2.5", "end": "4.6"}, {"start": "4.6", "end": "Infinity"}. The inclusivity and exclusivity of the lower and upper bound follow the same pattern as the bucket_ranges option.

Output

ML.QUANTILE_BUCKETIZE returns a STRING value that contains the name of the bucket, in the format specified by the output_format argument.

Example

The following example breaks a numerical expression of five elements into three buckets:

SELECT
  f,
  ML.QUANTILE_BUCKETIZE(f, 3) OVER() AS bucket,
  ML.QUANTILE_BUCKETIZE(f, 3, "bucket_ranges") OVER() AS bucket_ranges,
  ML.QUANTILE_BUCKETIZE(f, 3, "bucket_ranges_json") OVER() AS bucket_ranges_json
FROM
  UNNEST([1,2,3,4,5]) AS f
ORDER BY f;

The output looks similar to the following:

+---+--------+---------------+------------------------------------+
| f | bucket | bucket_ranges | bucket_ranges_json                 |
|---|--------|---------------|------------------------------------|
| 1 | bin_1  | (-inf, 2)     | {"start": "-Infinity", "end": "2"} |
| 2 | bin_2  | [2, 4)        | {"start": "2", "end": "4"}         |
| 3 | bin_2  | [2, 4)        | {"start": "2", "end": "4"}         |
| 4 | bin_3  | [4, +inf)     | {"start": "4", "end": "Infinity"}  |
| 5 | bin_3  | [4, +inf)     | {"start": "4", "end": "Infinity"}  |
+---+--------+---------------+------------------------------------+

What's next