PyTorch Training (PyTorchJob)

Using PyTorchJob to train a model with PyTorch

Old Version

This page is about **Kubeflow Training Operator V1**. For the latest information, see the Kubeflow Trainer V2 documentation.

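This page describes how to use PyTorchJob to train a machine learning model with PyTorch.

PyTorchJob is a Kubernetes custom resource for running PyTorch training jobs on Kubernetes. The Kubeflow implementation of PyTorchJob lives in the training-operator.

Note: Because of Istio automatic sidecar injection, PyTorchJob does not work in a user namespace by default. To make it run, add the annotation sidecar.istio.io/inject: "false" to disable injection, either on the PyTorchJob pods or on the namespace. For an example of adding this annotation to a yaml file, see the TFJob documentation.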
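As an illustration of where the annotation goes, here is a minimal Python sketch that adds it to the pod template of every replica in a PyTorchJob manifest held as a dictionary. The helper name disable_istio_injection and the empty container lists are hypothetical placeholders, not part of the training-operator. Note that the annotation value is the string "false", not a boolean.

```python
import copy

ISTIO_INJECT_ANNOTATION = "sidecar.istio.io/inject"


def disable_istio_injection(manifest: dict) -> dict:
    """Return a copy of a PyTorchJob manifest whose pod templates
    carry the annotation that opts out of Istio sidecar injection."""
    patched = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    replica_specs = patched["spec"]["pytorchReplicaSpecs"]
    for replica in replica_specs.values():
        template = replica.setdefault("template", {})
        metadata = template.get("metadata") or {}
        annotations = metadata.get("annotations") or {}
        # The value must be the string "false".
        annotations[ISTIO_INJECT_ANNOTATION] = "false"
        metadata["annotations"] = annotations
        template["metadata"] = metadata
    return patched


# Hypothetical, trimmed-down manifest: the container lists are placeholders.
job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "pytorch-simple", "namespace": "kubeflow"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1, "template": {"spec": {"containers": []}}},
            "Worker": {"replicas": 1, "template": {"spec": {"containers": []}}},
        }
    },
}

patched_job = disable_istio_injection(job)
```

The annotation is placed on each replica's template.metadata so that it ends up on the pods themselves rather than on the PyTorchJob object.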

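Creating a PyTorch training job

You can create a training job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file to fit your requirements.

Deploy the PyTorchJob resource to start training: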

kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/refs/heads/release-1.9/examples/pytorch/simple.yaml

You should now be able to see the created pods. Their number matches the number of replicas specified in the job.

kubectl get pods -l training.kubeflow.org/job-name=pytorch-simple -n kubeflow
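To check that the pod list is complete, the expected number of pods can be computed from the replica counts in the manifest. The sketch below is only an illustration: the helper name expected_pod_count is hypothetical, the replica counts are made-up example values, and treating a missing replicas field as 1 is an assumption rather than something this page states.

```python
def expected_pod_count(manifest: dict) -> int:
    """Sum the replica counts over all replica types (Master, Worker, ...)."""
    total = 0
    for spec in manifest["spec"]["pytorchReplicaSpecs"].values():
        replicas = spec.get("replicas")
        # Assumption: a replica type without an explicit count runs one pod.
        total += 1 if replicas is None else int(replicas)
    return total


# Illustrative replica counts: one master and three workers.
example_job = {
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1},
            "Worker": {"replicas": 3},
        }
    }
}

pod_count = expected_pod_count(example_job)
print(pod_count)
```

If kubectl get pods lists fewer pods than this number, some replicas have not been created or scheduled yet.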

On a CPU cluster, training typically takes 5-10 minutes. You can follow its progress in the logs of the master pod.

PODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=pytorch-simple,training.kubeflow.org/replica-type=master,training.kubeflow.org/replica-index=0 -o name -n kubeflow)
kubectl logs -f ${PODNAME} -n kubeflow

Monitoring a PyTorchJob

kubectl get -o yaml pytorchjobs pytorch-simple -n kubeflow

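See the status section to monitor the job status. Here is sample output for a job that completed successfully. The sample was captured from a job named pytorch-tcp-dist-mnist in the default namespace, so its names differ from those in the commands above.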

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-12-16T21:39:09Z
  generation: 1
  name: pytorch-tcp-dist-mnist
  namespace: default
  resourceVersion: "15532"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
            - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
              name: pytorch
              ports:
                - containerPort: 23456
                  name: pytorchjob-port
              resources: {}
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
            - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
              name: pytorch
              ports:
                - containerPort: 23456
                  name: pytorchjob-port
              resources: {}
status:
  completionTime: 2018-12-16T21:43:27Z
  conditions:
    - lastTransitionTime: 2018-12-16T21:39:09Z
      lastUpdateTime: 2018-12-16T21:39:09Z
      message: PyTorchJob pytorch-tcp-dist-mnist is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2018-12-16T21:39:09Z
      lastUpdateTime: 2018-12-16T21:40:45Z
      message: PyTorchJob pytorch-tcp-dist-mnist is running.
      reason: PyTorchJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: 2018-12-16T21:39:09Z
      lastUpdateTime: 2018-12-16T21:43:27Z
      message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
      reason: PyTorchJobSucceeded
      status: "True"
      type: Succeeded
  replicaStatuses:
    Master: {}
    Worker: {}
  startTime: 2018-12-16T21:40:45Z
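The conditions list in the status section can also be read programmatically. The following Python sketch is one way to do it, under the assumption that conditions are appended in chronological order, so the last entry whose status is "True" describes the current state. The function names current_condition and is_finished are hypothetical, and the sample dictionary copies only the condition types and statuses from the output above.

```python
from typing import Optional


def current_condition(status: dict) -> Optional[str]:
    """Return the type of the last condition whose status is "True",
    or None if the job has no active condition yet."""
    active = [
        cond["type"]
        for cond in status.get("conditions", [])
        if cond.get("status") == "True"
    ]
    return active[-1] if active else None


def is_finished(status: dict) -> bool:
    """A job is treated as finished once it reports Succeeded or Failed."""
    return current_condition(status) in {"Succeeded", "Failed"}


# Condition types and statuses taken from the sample output above.
sample_status = {
    "conditions": [
        {"type": "Created", "status": "True"},
        {"type": "Running", "status": "False"},
        {"type": "Succeeded", "status": "True"},
    ]
}

state = current_condition(sample_status)
print(state, is_finished(sample_status))
```

In practice the status dictionary could be obtained by running the kubectl get command with -o json instead of -o yaml and loading the result with json.loads, then passing its "status" field to these helpers.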

Next steps

Learn about distributed training in the Training Operator.
See how to run a job with gang-scheduling.


Last modified February 15, 2025: trainer: Add deprecation warning to Training Operator v1 docs (#3997) (8ad90c5)