Skip to content

Conversation

elieser1101
Copy link

@elieser1101 elieser1101 commented Aug 24, 2025

Added instrumentation for promethues compatible endpoints and otel.
Added tests and e2e tests.

I went forward this ways as I was trying to get familiar with codebase. Happy to go in any direction reviewers advise.

Checklist

Part of #965

Signed-off-by: Elieser Pereira <elieser.pereiraa@gmail.com>
Signed-off-by: Elieser Pereira <elieser.pereiraa@gmail.com>
…o fit meaning

Signed-off-by: Elieser Pereira <elieser.pereiraa@gmail.com>
@elieser1101
Copy link
Author

Hi folks, trying to get some eyes here but not really sure who to tag, thanks for any pointer in advance
@JorTurFer @rickbrouwer @zroubalik tagging you 3 since helped me with previous unrelated PR kedacore/keda#6990

Comment on lines +174 to +178
func (r *HTTPScaledObjectReconciler) updatePromMetricsOnDelete(ctx context.Context, scaledObject *httpv1alpha1.HTTPScaledObject, namespacedName string) {
logger := log.FromContext(ctx, "updatePromMetricsOnDelete", namespacedName)
logger.Info("updatePromMetricsOnDelete")
metrics.RecordDeleteHTTPScaledObjectCount(namespacedName)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this is never called, wouldn't this result in always growing metric?

assert.True(t, ok, "operator_http_scaled_object_count_total is available")

requestCount := getMetricsValue(val)
assert.GreaterOrEqual(t, requestCount, float64(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the whole e2e suite is executed, multiple e2e tests can run in parallel (iirc it could be like 3?). Perhaps the test could create 100 HSOs check if the metric is equal or over 100. And then we may want to delete that 100, wait for the metric to propagate and check again if it's under some reasonably low number, e.g. 10.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, meanwhile we don't execute sequential tests (I'd like to avoid it tbh, but maybe we have to 🤔)
I guess that we can just execute both checks at once in the same tests, something like:

  • spawn 20 HTTPScaledObjects
  • check prometheus
  • check otel
  • remove the HTTPScaledObjects
  • check prometheus
  • check otel

containerPort: 8080
containerPort: 2223
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this? Personally, I'd prefer to not change an already existing port

This endpoint can be enabled by setting the `OTEL_PROM_EXPORTER_ENABLED` environment variable to `true` on the operator deployment (`true` by default) and by setting `OTEL_PROM_EXPORTER_PORT` to an unused port for the endpoint to be made avaialble on (`2223` by default).

### Configuring the OTEL HTTP exporter
When configured, the ioperator can export metrics to a OTEL HTTP collector.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When configured, the ioperator can export metrics to a OTEL HTTP collector.
When configured, the operator can export metrics to a OTEL HTTP collector.

### Configuring the OTEL HTTP exporter
When configured, the ioperator can export metrics to a OTEL HTTP collector.

The OTEL exporter can be enabled by setting the `OTEL_EXPORTER_OTLP_METRICS_ENABLED` environment variable to `true` on the operator deployment (`false` by default). When enabled the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable must also be configured so the exporter knows what collector to send the metrics to (e.g. http://opentelemetry-collector.open-telemetry-system:4318).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The OTEL exporter can be enabled by setting the `OTEL_EXPORTER_OTLP_METRICS_ENABLED` environment variable to `true` on the operator deployment (`false` by default). When enabled the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable must also be configured so the exporter knows what collector to send the metrics to (e.g. http://opentelemetry-collector.open-telemetry-system:4318).
The OTEL exporter can be enabled by setting the `OTEL_EXPORTER_OTLP_METRICS_ENABLED` environment variable to `true` on the operator deployment (`false` by default). When enabled, the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable must also be configured so the exporter knows what collector to send the metrics to (e.g. http://opentelemetry-collector.open-telemetry-system:4318).

Comment on lines +129 to +140
if metricsCfg.OtelPrometheusExporterEnabled {
// start the prometheus compatible metrics server
// serves a prometheus compatible metrics endpoint on the configured port
eg.Go(func() error {
if err := runMetricsServer(ctx, ctrl.Log, metricsCfg); !util.IsIgnoredErr(err) {
setupLog.Error(err, "could not start the Prometheus metrics server")
return err
}

return nil
})
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator is already exposing a metrics server in prom format. Let's use it instead of starting another one, you can register more metrics on that server, for example, that's what we do for KEDA -> https://github.com/kedacore/keda/blob/main/pkg/metricscollector/prommetrics.go#L161

func NewOtelMetrics(options ...metric.Option) *OtelMetrics {
ctx := context.Background()

exporter, err := otlpmetrichttp.New(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opentelemetry supports HTTP and gRCP protocols, I prefer if we support both tbh. This is how we do it in KEDA, you can use it as example https://github.com/kedacore/keda/blob/main/pkg/metricscollector/opentelemetry.go#L67-L86

provider := metric.NewMeterProvider(options...)
meter := provider.Meter(meterName)

httpScaledObjectCounter, err := meter.Int64UpDownCounter("operator_http_scaled_object_count", api.WithDescription("a counter of http_scaled_objects processed by the operator"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in opentelemetry context, dots are used instead of underscores

Suggested change
httpScaledObjectCounter, err := meter.Int64UpDownCounter("operator_http_scaled_object_count", api.WithDescription("a counter of http_scaled_objects processed by the operator"))
httpScaledObjectCounter, err := meter.Int64UpDownCounter("keda.http.scaled.object.count", api.WithDescription("a counter of HttpScaledObjects processed by the operator"))

)
meter := provider.Meter(meterName)

httpScaledObjectCounter, err := meter.Int64UpDownCounter("operator_http_scaled_object_count", api.WithDescription("a counter of http_scaled_objects processed by the operator"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
httpScaledObjectCounter, err := meter.Int64UpDownCounter("operator_http_scaled_object_count", api.WithDescription("a counter of http_scaled_objects processed by the operator"))
httpScaledObjectCounter, err := meter.Int64UpDownCounter("keda_http_scaled_object_total", api.WithDescription("a counter of http_scaled_objects processed by the operator"))

assert.True(t, ok, "operator_http_scaled_object_count_total is available")

requestCount := getMetricsValue(val)
assert.GreaterOrEqual(t, requestCount, float64(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, meanwhile we don't execute sequential tests (I'd like to avoid it tbh, but maybe we have to 🤔)
I guess that we can just execute both checks at once in the same tests, something like:

  • spawn 20 HTTPScaledObjects
  • check prometheus
  • check otel
  • remove the HTTPScaledObjects
  • check prometheus
  • check otel

@JorTurFer
Copy link
Member

Thanks a lot for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants