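PaddlePaddle Training (PaddleJob)

Using PaddleJob to train a model with PaddlePaddle

This page describes how to use PaddleJob to train a machine learning model with PaddlePaddle.

PaddleJob is a Kubernetes custom resource for running PaddlePaddle training jobs on Kubernetes. The Kubeflow implementation of PaddleJob is in training-operator.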

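Note: Because of Istio automatic sidecar injection, PaddleJob does not work in a user namespace by default. To make it run, disable injection by adding the annotation sidecar.istio.io/inject: "false" to the PaddleJob pods or to the namespace. For an example of how to add this annotation to your yaml file, see the TFJob documentation.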
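As a rough illustration of the note above, the following sketch writes a PaddleJob manifest whose pod template carries the annotation. The container settings are copied from the sample job shown later on this page; the file name paddlejob-no-sidecar.yaml and the namespace my-namespace are hypothetical placeholders, and the placement under template.metadata.annotations is an assumption based on the usual pod template layout.

```shell
# Write a PaddleJob manifest whose pods opt out of Istio sidecar injection.
# "paddlejob-no-sidecar.yaml" and "my-namespace" are hypothetical placeholders.
cat > paddlejob-no-sidecar.yaml <<'EOF'
apiVersion: kubeflow.org/v1
kind: PaddleJob
metadata:
  name: paddle-simple-cpu
  namespace: my-namespace
spec:
  paddleReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            # Disables automatic sidecar injection for these pods only.
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: paddle
              image: registry.baidubce.com/paddlepaddle/paddle:2.4.0rc0-cpu
              command: ["python"]
              args: ["-m", "paddle.distributed.launch", "run_check"]
              ports:
                - containerPort: 37777
                  name: master
EOF

# Submit it once the file looks right (requires access to a cluster):
# kubectl create -f paddlejob-no-sidecar.yaml
```

Annotating the pod template keeps the change local to this job, whereas changing the namespace affects every workload in it.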

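Creating a PaddlePaddle training job

You can create a training job by defining a PaddleJob config file. See the manifest for the distributed example. You can change the config file to suit your requirements.

Deploy the PaddleJob resource to start training: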

kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/refs/heads/release-1.9/examples/paddlepaddle/simple-cpu.yaml

You should now be able to see the created pods. Their number matches the number of replicas specified in the job.

kubectl get pods -l job-name=paddle-simple-cpu -n kubeflow
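If you would rather check the pod count than read the listing by eye, a small helper along these lines may work. It is only a sketch: the function name check_paddlejob_pods is hypothetical, and it assumes the job has a single Worker replica type and that its pods carry the job-name label used in the command above.

```shell
# Compare the number of pods labelled for a job with the Worker replica
# count in its spec. The function name is hypothetical.
check_paddlejob_pods() {
  job="$1"; namespace="$2"
  # Requested replicas, read from the PaddleJob spec.
  want=$(kubectl get paddlejob "$job" -n "$namespace" \
    -o jsonpath='{.spec.paddleReplicaSpecs.Worker.replicas}')
  # Pods that currently exist for the job (any phase).
  have=$(kubectl get pods -l "job-name=$job" -n "$namespace" --no-headers \
    | wc -l | tr -d ' ')
  echo "pods found: $have, replicas requested: $want"
  # Succeeds only when the two counts are equal.
  [ "$have" -eq "$want" ]
}
```

For the sample job this would be called as check_paddlejob_pods paddle-simple-cpu kubeflow; a non-zero exit status means the counts differ or the job could not be read.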

Training takes several minutes on a CPU cluster. You can inspect the logs to follow its progress.

PODNAME=$(kubectl get pods -l job-name=paddle-simple-cpu,replica-type=worker,replica-index=0 -o name -n kubeflow)
kubectl logs -f ${PODNAME} -n kubeflow

Monitoring a PaddleJob

kubectl get -o yaml paddlejobs paddle-simple-cpu -n kubeflow

See the status section to monitor the job status. Here is sample output from a job that completed successfully.

apiVersion: kubeflow.org/v1
kind: PaddleJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"PaddleJob","metadata":{"annotations":{},"name":"paddle-simple-cpu","namespace":"kubeflow"},"spec":{"paddleReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"args":["-m","paddle.distributed.launch","run_check"],"command":["python"],"image":"registry.baidubce.com/paddlepaddle/paddle:2.4.0rc0-cpu","imagePullPolicy":"Always","name":"paddle","ports":[{"containerPort":37777,"name":"master"}]}]}}}}}}
  creationTimestamp: "2022-10-24T03:47:45Z"
  generation: 3
  name: paddle-simple-cpu
  namespace: kubeflow
  resourceVersion: "266235056"
  selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/paddlejobs/paddle-simple-cpu
  uid: 7ef4f92f-0ed4-4a35-b10a-562b79538cc6
spec:
  paddleReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - args:
            - -m
            - paddle.distributed.launch
            - run_check
            command:
            - python
            image: registry.baidubce.com/paddlepaddle/paddle:2.4.0rc0-cpu
            imagePullPolicy: Always
            name: paddle
            ports:
            - containerPort: 37777
              name: master
              protocol: TCP
status:
  completionTime: "2022-10-24T04:04:43Z"
  conditions:
  - lastTransitionTime: "2022-10-24T03:47:45Z"
    lastUpdateTime: "2022-10-24T03:47:45Z"
    message: PaddleJob paddle-simple-cpu is created.
    reason: PaddleJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-10-24T04:04:28Z"
    lastUpdateTime: "2022-10-24T04:04:28Z"
    message: PaddleJob kubeflow/paddle-simple-cpu is running.
    reason: JobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2022-10-24T04:04:43Z"
    lastUpdateTime: "2022-10-24T04:04:43Z"
    message: PaddleJob kubeflow/paddle-simple-cpu successfully completed.
    reason: JobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Worker:
      labelSelector:
        matchLabels:
          group-name: kubeflow.org
          job-name: paddle-simple-cpu
          training.kubeflow.org/job-name: paddle-simple-cpu
          training.kubeflow.org/operator-name: paddlejob-controller
          training.kubeflow.org/replica-type: Worker
      succeeded: 2
  startTime: "2022-10-24T03:47:45Z"
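Reading the full YAML is not always convenient, so the sketch below pulls a single condition out of the status section and, separately, blocks until the job succeeds. The function names paddlejob_condition and wait_for_paddlejob are hypothetical, and the 10m default timeout is an arbitrary choice; the kubectl calls themselves rely only on the standard jsonpath output format and the kubectl wait subcommand.

```shell
# Print the status ("True" or "False") of one condition type, for example
# Succeeded, from a PaddleJob. The function name is hypothetical.
paddlejob_condition() {
  job="$1"; namespace="$2"; type="$3"
  kubectl get paddlejob "$job" -n "$namespace" \
    -o jsonpath="{.status.conditions[?(@.type==\"$type\")].status}"
}

# Block until the job reports the Succeeded condition, or give up after
# the timeout (third argument, defaulting to 10m).
wait_for_paddlejob() {
  job="$1"; namespace="$2"; timeout="${3:-10m}"
  kubectl wait --for=condition=Succeeded "paddlejob/$job" \
    -n "$namespace" --timeout="$timeout"
}
```

For a job in the state shown in the sample output, paddlejob_condition paddle-simple-cpu kubeflow Succeeded should print True. Note that wait_for_paddlejob only watches for success, so a failed job keeps it waiting until the timeout expires.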
