Prometheus Monitoring

Prometheus Metrics for Training Operator

Old Version

This page covers Kubeflow Training Operator V1. For the latest information, see the Kubeflow Trainer V2 documentation.

Follow this guide to migrate to Kubeflow Trainer V2.

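This guide explains how to monitor Kubeflow training jobs with Prometheus metrics. The Training Operator exposes these metrics, which give insight into the state of distributed machine learning workloads.

Note

Metrics are generated only when the corresponding event occurs. For example, the job-creation metric appears only after a job has been created. If a metric is not visible, the most likely reason is that its event has not happened yet.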

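Prometheus Metrics for Training Operator

The Training Operator includes a built-in /metrics endpoint that exposes Prometheus metrics. This feature is enabled by default, and basic usage requires no additional configuration.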

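Configure the Metrics Port

By default, metrics are exposed on port 8080 and can be scraped from any IP address.

To change the default port or restrict the IP address on which the metrics endpoint listens, add the metrics-bind-address argument.

For example: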

# deployment.yaml for the Training Operator
spec:
    containers:
    - command:
      - /manager
      image: kubeflow/training-operator
      name: training-operator
      ports:
      # Declared metrics port, kept in sync with --metrics-bind-address below.
      - containerPort: 8082
      - containerPort: 9443
        name: webhook-server
        protocol: TCP
      args:
      # Serve metrics on port 8082, bound to 192.168.1.100 only.
      - "--metrics-bind-address=192.168.1.100:8082"

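Explanation

--metrics-bind-address=192.168.1.100:8082 makes the metrics available on port 8082 and binds the endpoint to the IP address 192.168.1.100 only. Alternatively, use 0.0.0.0:8082 to bind the metrics endpoint to all interfaces.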
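A bind address combines a host and a port, so the port in the argument and the ports declared on the container can drift apart. The following is a minimal Python sketch for checking that before a rollout. The helper names split_bind_address and declares_port are hypothetical and not part of the Training Operator, and treating an empty host (as in ":8080") as "all interfaces" is an assumption based on the usual Go listen-address convention.

```python
from typing import Iterable, Tuple


def split_bind_address(addr: str) -> Tuple[str, int]:
    """Split a 'host:port' or ':port' bind address into (host, port).

    Assumption: an empty host means "listen on all interfaces",
    which is reported here as 0.0.0.0.
    """
    host, sep, port = addr.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"expected host:port, got {addr!r}")
    return host or "0.0.0.0", int(port)


def declares_port(bind_address: str, container_ports: Iterable[int]) -> bool:
    """Return True if the bind address port is among the declared containerPorts."""
    _, port = split_bind_address(bind_address)
    return port in set(container_ports)


if __name__ == "__main__":
    host, port = split_bind_address("192.168.1.100:8082")
    print(f"metrics endpoint listens on {host}, port {port}")
    # Ports declared on the container in the manifest (hypothetical input).
    print(declares_port("192.168.1.100:8082", [8082, 9443]))
```

A containerPort entry is primarily informational in Kubernetes, so a mismatch does not by itself block access, but keeping the two values in sync makes the manifest easier to read.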

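Access the Metrics

How you access these metrics depends on your Kubernetes setup and environment. For example, in a local environment, run the following command:

kubectl port-forward -n kubeflow deployment/training-operator 8080:8080

The metrics are then available at http://localhost:8080/metrics in the following format: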

# HELP training_operator_jobs_created_total Counts number of jobs created
# TYPE training_operator_jobs_created_total counter
training_operator_jobs_created_total{framework="tensorflow",job_namespace="kubeflow"} 7
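To read this output from a script rather than a browser, the text can be fetched and split into samples. The sketch below uses only the Python standard library and a simplified parser: it assumes the port-forward above is running on localhost:8080, it does not unescape label values, and the function names fetch_metrics and parse_metrics are illustrative. For production use, an established Prometheus client library is a better choice.

```python
import re
import urllib.request
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# One sample line looks like: metric_name{label="value",...} 7
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url: str = "http://localhost:8080/metrics") -> str:
    """Download the raw metrics text (assumes the endpoint is reachable)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> List[Sample]:
    """Parse metrics text into (name, labels, value) tuples.

    Lines starting with '#' (HELP and TYPE) and blank lines are skipped.
    """
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# The sample output shown above, used here instead of a live endpoint.
SAMPLE_TEXT = """\
# HELP training_operator_jobs_created_total Counts number of jobs created
# TYPE training_operator_jobs_created_total counter
training_operator_jobs_created_total{framework="tensorflow",job_namespace="kubeflow"} 7
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE_TEXT):
        print(name, labels, value)
```

Against a live cluster, parse_metrics(fetch_metrics()) would replace the embedded sample text.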

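List of Job Metrics

Metric name	Description	Labels
`training_operator_jobs_created_total`	Total number of jobs created	`job_namespace`, `framework`
`training_operator_jobs_deleted_total`	Total number of jobs deleted	`job_namespace`, `framework`
`training_operator_jobs_successful_total`	Total number of successful jobs	`job_namespace`, `framework`
`training_operator_jobs_failed_total`	Total number of failed jobs	`job_namespace`, `framework`
`training_operator_jobs_restarted_total`	Total number of restarted jobs	`job_namespace`, `framework`

The labels have the following meanings:

Label name	Description
`job_namespace`	The Kubernetes namespace in which the job runs
`framework`	The machine learning framework used (for example, TensorFlow or PyTorch)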
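Because every counter carries a namespace label and a framework label, the series can be grouped along either dimension. The sketch below sums counters by one label and derives a failure ratio per framework. It operates on (name, labels, value) tuples, uses the label names from the sample output shown earlier (job_namespace, framework), and treats a missing series as zero, since a counter appears only after its event has occurred. The function names and all sample values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sample = Tuple[str, Dict[str, str], float]


def total_by_label(samples: Iterable[Sample], metric: str, label: str) -> Dict[str, float]:
    """Sum the values of one metric, grouped by the given label."""
    totals: Dict[str, float] = defaultdict(float)
    for name, labels, value in samples:
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


def failure_ratio(samples: Iterable[Sample], framework: str) -> float:
    """Failed jobs divided by finished (failed + successful) jobs for one framework.

    A series that has not been emitted yet counts as zero.
    """
    samples = list(samples)

    def total(metric: str) -> float:
        return total_by_label(samples, metric, "framework").get(framework, 0.0)

    failed = total("training_operator_jobs_failed_total")
    succeeded = total("training_operator_jobs_successful_total")
    finished = failed + succeeded
    return failed / finished if finished else 0.0


# Hypothetical samples; the values are invented for this example.
SAMPLES = [
    ("training_operator_jobs_created_total",
     {"job_namespace": "kubeflow", "framework": "tensorflow"}, 7.0),
    ("training_operator_jobs_created_total",
     {"job_namespace": "team-a", "framework": "tensorflow"}, 3.0),
    ("training_operator_jobs_created_total",
     {"job_namespace": "kubeflow", "framework": "pytorch"}, 2.0),
    ("training_operator_jobs_successful_total",
     {"job_namespace": "kubeflow", "framework": "tensorflow"}, 6.0),
    ("training_operator_jobs_failed_total",
     {"job_namespace": "kubeflow", "framework": "tensorflow"}, 2.0),
]

if __name__ == "__main__":
    print(total_by_label(SAMPLES, "training_operator_jobs_created_total", "framework"))
    print(failure_ratio(SAMPLES, "tensorflow"))
```

Grouping by "job_namespace" instead of "framework" gives per-namespace totals with the same function.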

Last modified February 15, 2025: trainer: Add deprecation warning for Training Operator v1 docs (#3997) (8ad90c5)