PaddlePaddle Training (PaddleJob)
Using PaddleJob to train a model with PaddlePaddle
Old Version
This page is about Kubeflow Training Operator V1. For the latest information, see the Kubeflow Trainer V2 documentation.
Follow this guide to migrate to Kubeflow Trainer V2.
This page describes how to use PaddleJob to train a machine learning model with PaddlePaddle.
PaddleJob is a Kubernetes custom resource for running PaddlePaddle training jobs on Kubernetes. The Kubeflow implementation of PaddleJob is in training-operator.
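Note: By default, PaddleJob does not work in a user namespace because of Istio automatic sidecar injection. To make it run, disable injection by adding the annotation sidecar.istio.io/inject: "false" to the PaddleJob pods or to the namespace. For an example of how to add this annotation to your yaml file, see the TFJob documentation.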
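As a minimal sketch of the note above, the snippet below writes a PaddleJob manifest whose worker pod template carries the annotation. The container spec is copied from the sample job shown later on this page; the file name paddlejob-no-sidecar.yaml and the choice to place the annotation under template.metadata are assumptions for illustration, so compare against the TFJob documentation before relying on it.

```shell
# Write a PaddleJob manifest with Istio sidecar injection disabled on the pods.
# The file name is hypothetical; the spec mirrors the sample job on this page.
cat > paddlejob-no-sidecar.yaml <<'EOF'
apiVersion: kubeflow.org/v1
kind: PaddleJob
metadata:
  name: paddle-simple-cpu
spec:
  paddleReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            # Disables automatic Istio sidecar injection for these pods.
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: paddle
              image: registry.baidubce.com/paddlepaddle/paddle:2.4.0rc0-cpu
              command: ["python"]
              args: ["-m", "paddle.distributed.launch", "run_check"]
              ports:
                - containerPort: 37777
                  name: master
EOF
```

The manifest omits a namespace on purpose, so it could then be submitted with something like kubectl create -f paddlejob-no-sidecar.yaml -n <your-namespace>.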
Creating a PaddlePaddle training job
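You can create a training job by defining a PaddleJob config file. See the manifests for the distributed example. You may change the config file based on your requirements.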
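One common adjustment is the number of worker replicas. The helper below is a rough sketch, not part of the training-operator tooling: set_worker_replicas is a hypothetical name, and it assumes a locally saved copy of the manifest that contains a single "replicas:" line, as the simple CPU example appears to.

```shell
# Hypothetical helper: print a copy of a manifest with the replica count changed.
# Usage: set_worker_replicas <manifest-file> <count>
set_worker_replicas() {
  # Reject anything that is not a non-negative integer.
  case "$2" in
    ''|*[!0-9]*)
      echo "replica count must be a non-negative integer" >&2
      return 1
      ;;
  esac
  # Rewrite every "replicas: N" line, keeping its indentation.
  sed -E "s/^([[:space:]]*replicas:)[[:space:]]*[0-9]+[[:space:]]*$/\1 $2/" "$1"
}
```

The result goes to standard output, so it can be reviewed first or piped onward, for example set_worker_replicas simple-cpu.yaml 4 | kubectl create -f -.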
Deploy the PaddleJob resource to start training:
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/refs/heads/release-1.9/examples/paddlepaddle/simple-cpu.yaml
You should now be able to see the created pods matching the specified number of replicas.
kubectl get pods -l job-name=paddle-simple-cpu -n kubeflow
Training takes several minutes on a CPU cluster. You can inspect the logs to see training progress.
PODNAME=$(kubectl get pods -l job-name=paddle-simple-cpu,replica-type=worker,replica-index=0 -o name -n kubeflow)
kubectl logs -f ${PODNAME} -n kubeflow
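The two commands above can be wrapped into a small function so that any worker replica can be followed. This is only a sketch: paddle_worker_logs is a hypothetical helper name, and it uses the same job-name, replica-type and replica-index labels as the commands above. The sample status on this page also lists training.kubeflow.org/ prefixed labels, so if the short labels match nothing on your operator version, the prefixed forms may be worth trying (an assumption, not verified here).

```shell
# Hypothetical helper: stream logs from one worker replica of a PaddleJob.
# Usage: paddle_worker_logs <job-name> [replica-index] [namespace]
paddle_worker_logs() {
  job="$1"
  index="${2:-0}"
  ns="${3:-kubeflow}"
  # Look up the pod by the labels used in the commands above.
  pod=$(kubectl get pods -n "$ns" -o name \
    -l "job-name=${job},replica-type=worker,replica-index=${index}")
  if [ -z "$pod" ]; then
    echo "no pod found for ${job} worker ${index} in namespace ${ns}" >&2
    return 1
  fi
  # "-o name" returns "pod/<name>", which kubectl logs accepts directly.
  kubectl logs -f "$pod" -n "$ns"
}
```

For example, paddle_worker_logs paddle-simple-cpu 1 would follow the second worker in the kubeflow namespace.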
Monitoring a PaddleJob
kubectl get -o yaml paddlejobs paddle-simple-cpu -n kubeflow
See the status section to monitor the job status. Here is sample output when the job is successfully completed.
apiVersion: kubeflow.org/v1
kind: PaddleJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"PaddleJob","metadata":{"annotations":{},"name":"paddle-simple-cpu","namespace":"kubeflow"},"spec":{"paddleReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"args":["-m","paddle.distributed.launch","run_check"],"command":["python"],"image":"registry.baidubce.com/paddlepaddle/paddle:2.4.0rc0-cpu","imagePullPolicy":"Always","name":"paddle","ports":[{"containerPort":37777,"name":"master"}]}]}}}}}}
  creationTimestamp: "2022-10-24T03:47:45Z"
  generation: 3
  name: paddle-simple-cpu
  namespace: kubeflow
  resourceVersion: "266235056"
  selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/paddlejobs/paddle-simple-cpu
  uid: 7ef4f92f-0ed4-4a35-b10a-562b79538cc6
spec:
  paddleReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - args:
                - -m
                - paddle.distributed.launch
                - run_check
              command:
                - python
              image: registry.baidubce.com/paddlepaddle/paddle:2.4.0rc0-cpu
              imagePullPolicy: Always
              name: paddle
              ports:
                - containerPort: 37777
                  name: master
                  protocol: TCP
status:
  completionTime: "2022-10-24T04:04:43Z"
  conditions:
    - lastTransitionTime: "2022-10-24T03:47:45Z"
      lastUpdateTime: "2022-10-24T03:47:45Z"
      message: PaddleJob paddle-simple-cpu is created.
      reason: PaddleJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2022-10-24T04:04:28Z"
      lastUpdateTime: "2022-10-24T04:04:28Z"
      message: PaddleJob kubeflow/paddle-simple-cpu is running.
      reason: JobRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2022-10-24T04:04:43Z"
      lastUpdateTime: "2022-10-24T04:04:43Z"
      message: PaddleJob kubeflow/paddle-simple-cpu successfully completed.
      reason: JobSucceeded
      status: "True"
      type: Succeeded
  replicaStatuses:
    Worker:
      labelSelector:
        matchLabels:
          group-name: kubeflow.org
          job-name: paddle-simple-cpu
          training.kubeflow.org/job-name: paddle-simple-cpu
          training.kubeflow.org/operator-name: paddlejob-controller
          training.kubeflow.org/replica-type: Worker
      succeeded: 2
  startTime: "2022-10-24T03:47:45Z"
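Reading the full YAML is not always necessary. The sketch below shows two ways to watch only the conditions list from the status: one prints each condition as type=status, the other blocks until the Succeeded condition becomes true. The function names paddlejob_conditions and wait_for_paddlejob and the 30m default timeout are hypothetical choices for illustration. Note that kubectl wait only watches for the named condition, so a failed job would presumably run into the timeout rather than return early.

```shell
# Hypothetical helper: print each status condition as "<type>=<status>".
# Usage: paddlejob_conditions <job-name> [namespace]
paddlejob_conditions() {
  kubectl get paddlejobs "$1" -n "${2:-kubeflow}" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Hypothetical helper: block until the job reports the Succeeded condition.
# Usage: wait_for_paddlejob <job-name> [namespace] [timeout]
wait_for_paddlejob() {
  kubectl wait --for=condition=Succeeded "paddlejobs/$1" \
    -n "${2:-kubeflow}" --timeout="${3:-30m}"
}
```

For the completed job shown above, paddlejob_conditions paddle-simple-cpu would be expected to list Created, Running and Succeeded with the status values from the sample output.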
Last modified February 15, 2025: trainer: Add deprecation warning for Training Operator v1 docs (#3997) (8ad90c5)