GPU/vGPU sharing on OpenShift 4

More and more AI/ML applications are running on OpenShift/Kubernetes clusters, and most of them need GPUs. In the official NVIDIA/Kubernetes device plugin, however, GPUs are scheduled one whole card at a time, which creates a problem at the Kubernetes scheduling layer: GPU resources go to waste.

Fortunately, the community offers several solutions for this. Aliyun's, for example, is relatively simple, and its feature set is correspondingly modest. This article attempts to run Aliyun's GPU sharing solution on OpenShift 4.

Because Aliyun's solution (like most similar ones) is built around nvidia-docker, while OpenShift 4 uses CRI-O, some customization is required.

Due to time constraints, this article only gets the solution roughly working; running it properly will need more customization, which is left for follow-up project work.

Note

  • This is a scheduling-level sharing solution; it does not provide isolation between the workloads that share a GPU.

TODO

  • Verify on a real multi-GPU host.
  • Harden the security of the scheduler extender.

Video walkthrough

Deploy and run the scheduler extender

Solutions like Aliyun's all extend the Kubernetes scheduler to add capabilities on top of what it already provides. In recent OpenShift 4 releases, this scheduler extension can be enabled through configuration.

cd /data/install
cat << EOF > ./policy.cfg
    {
    "kind" : "Policy",
    "apiVersion" : "v1",
    "predicates" : [
            {"name" : "MaxGCEPDVolumeCount"},
            {"name" : "GeneralPredicates"},
            {"name" : "MaxAzureDiskVolumeCount"},
            {"name" : "MaxCSIVolumeCountPred"},
            {"name" : "CheckVolumeBinding"},
            {"name" : "MaxEBSVolumeCount"},
            {"name" : "MatchInterPodAffinity"},
            {"name" : "CheckNodeUnschedulable"},
            {"name" : "NoDiskConflict"},
            {"name" : "NoVolumeZoneConflict"},
            {"name" : "PodToleratesNodeTaints"}
            ],
    "priorities" : [
            {"name" : "LeastRequestedPriority", "weight" : 1},
            {"name" : "BalancedResourceAllocation", "weight" : 1},
            {"name" : "ServiceSpreadingPriority", "weight" : 1},
            {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
            {"name" : "NodeAffinityPriority", "weight" : 1},
            {"name" : "TaintTolerationPriority", "weight" : 1},
            {"name" : "ImageLocalityPriority", "weight" : 1},
            {"name" : "SelectorSpreadPriority", "weight" : 1},
            {"name" : "InterPodAffinityPriority", "weight" : 1},
            {"name" : "EqualPriority", "weight" : 1}
            ],
    "extenders": [
            {
              "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
              "filterVerb": "filter",
              "bindVerb":   "bind",
              "enableHttps": false,
              "nodeCacheCapable": true,
              "managedResources": [
                {
                  "name": "aliyun.com/gpu-mem",
                  "ignoredByScheduler": false
                }
              ],
              "ignorable": false
            }
          ]
    }
   
EOF
oc delete configmap -n openshift-config  scheduler-policy
oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy

oc patch Scheduler cluster --type=merge -p '{"spec":{"policy":{"name":"scheduler-policy"}}}'
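
To confirm the policy was picked up, a quick check could look like this (the namespaces below are the OpenShift 4 defaults):

# the configmap and the Scheduler CR should reference each other
oc get configmap scheduler-policy -n openshift-config
oc get scheduler cluster -o jsonpath='{.spec.policy.name}{"\n"}'

# the kube-scheduler pods are rolled out again after the change
oc get pods -n openshift-kube-scheduler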

Then we can deploy the scheduler extender:

cd /data/install
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
# replace the container image with our customized build
sed -i 's/image:.*/image: quay.io\/wangzheng422\/qimgs:gpushare-scheduler-extender-2021-02-26-1339/' gpushare-schd-extender.yaml
oc delete -f gpushare-schd-extender.yaml
oc create -f gpushare-schd-extender.yaml
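
The upstream manifest deploys the extender into kube-system and exposes it on NodePort 32766, which is what the urlPrefix above points at. Assuming those defaults, a rough sanity check:

# namespace and port per the upstream manifest defaults
oc get pods -n kube-system | grep gpushare
oc get svc -n kube-system | grep gpushare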

Add a catalog source to Operator Hub

We have customized the NVIDIA gpu-operator, so we need to add our new operator to Operator Hub.

#
cat << EOF > /data/ocp4/my-catalog.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: wzh-operator-catalog
  namespace: openshift-marketplace
spec:
  displayName: WZH Operator Catalog
  image: 'quay.io/wangzheng422/qimgs:registry-wzh-index.2021-02-28-1446'
  publisher: WZH
  sourceType: grpc
EOF
oc create -f  /data/ocp4/my-catalog.yaml
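
Once the CatalogSource is created, the marketplace operator starts an index pod for it. A quick check (names taken from the manifest above) could look like this:

oc get catalogsource wzh-operator-catalog -n openshift-marketplace
oc get pods -n openshift-marketplace | grep wzh-operator-catalog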

# to remove the catalog source later:
oc delete -f /data/ocp4/my-catalog.yaml

At this point we can find two gpu-operators in Operator Hub.

Install the gpu-operator and configure a ClusterPolicy

Click to install the one labeled nvidia & wzh.

After the installation succeeds, create the project gpu-operator-resources.
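
From the CLI, for example:

oc new-project gpu-operator-resources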

Then, in the project gpu-operator-resources, create a ClusterPolicy for the gpu-operator from the template below. Note that it depends on preparing an offline installation source (the repo-config referenced under driver.repoConfig in the template); refer to here for how to prepare it.


apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  dcgmExporter:
    nodeSelector: {}
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nvcr.io/nvidia/k8s
    securityContext: {}
    version: 'sha256:85016e39f73749ef9769a083ceb849cae80c31c5a7f22485b3ba4aa590ec7b88'
    image: dcgm-exporter
    tolerations: []
  devicePlugin:
    nodeSelector: {}
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: quay.io/wangzheng422
    securityContext: {}
    version: gpu-aliyun-device-plugin-2021-02-24-1346
    image: qimgs
    tolerations: []
    args:
      - 'gpushare-device-plugin-v2'
      - '-logtostderr'
      - '--v=5'
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
  driver:
    nodeSelector: {}
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nvcr.io/nvidia
    securityContext: {}
    repoConfig:
      configMapName: repo-config
      destinationDir: /etc/yum.repos.d
    version: 'sha256:324e9dc265dec320207206aa94226b0c8735fd93ce19b36a415478c95826d934'
    image: driver
    tolerations: []
  gfd:
    nodeSelector: {}
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nvcr.io/nvidia
    securityContext: {}
    version: 'sha256:8d068b7b2e3c0b00061bbff07f4207bd49be7d5bfbff51fdf247bc91e3f27a14'
    image: gpu-feature-discovery
    tolerations: []
    migStrategy: single
    sleepInterval: 60s
  operator:
    defaultRuntime: crio
    validator:
      image: cuda-sample
      imagePullSecrets: []
      repository: nvcr.io/nvidia/k8s
      version: 'sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2'
    deployGFD: true
  toolkit:
    nodeSelector: {}
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nvcr.io/nvidia/k8s
    securityContext: {}
    version: 'sha256:81295a9eca36cbe5d94b80732210b8dc7276c6ef08d5a60d12e50479b9e542cd'
    image: container-toolkit
    tolerations: []

At this point the gpu-operator installation is complete. Note that the device-plugin validation step does not run; this is because we customized the scheduler and nvidia.com/gpu has been replaced by aliyun.com/gpu-mem. Fixing that cleanly requires further customization, but since the system already works as expected, we leave that customization to future project work.
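
To see the effect on the node side, check the allocatable resources: the GPU node should now advertise aliyun.com/gpu-mem instead of nvidia.com/gpu. For example (substitute the actual GPU node name):

# <gpu-node-name> is a placeholder for one of your GPU worker nodes
oc describe node <gpu-node-name> | grep -E 'aliyun.com/gpu-mem|nvidia.com/gpu'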

Test it out

Let's actually test the result:

cat << EOF > /data/ocp4/gpu.test.yaml
---
kind: Deployment
apiVersion: apps/v1
metadata:
  annotations:
  name: demo1
  labels:
    app: demo1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo1
  template:
    metadata:
      labels:
        app: demo1
    spec:
      # nodeSelector:
      #   kubernetes.io/hostname: 'worker-0'
      restartPolicy: Always
      containers:
        - name: demo1
          image: "docker.io/wangzheng422/imgs:tensorrt-ljj-2021-01-21-1151"
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['ALIYUN_COM_GPU_MEM_IDX']
          resources:
            limits:
              # GiB
              aliyun.com/gpu-mem: 3

EOF
oc create -n demo -f /data/ocp4/gpu.test.yaml


Enter the test container and look at its environment variables; we can see that NVIDIA_VISIBLE_DEVICES has been set automatically.
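
For example, assuming the deployment landed in the demo project as created above:

POD=$(oc get pod -n demo -l app=demo1 -o jsonpath='{.items[0].metadata.name}')
oc exec -n demo $POD -- env | grep NVIDIA_VISIBLE_DEVICES

# the scheduler extender records its choice as annotations on the pod
oc get pod -n demo $POD -o jsonpath='{.metadata.annotations}' ; echo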

Looking at the scheduler extender logs, we can see the scheduler trying to add annotations to the pod.
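
Assuming the extender kept its upstream deployment name and namespace, the logs can be pulled with something like:

# deployment name/namespace per the upstream gpushare-schd-extender manifest
oc logs -n kube-system deploy/gpushare-schd-extender | grep -i bind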

Looking at the device-plugin logs, we can see the device plugin comparing GPU memory and picking a GPU device.
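
Assuming the gpu-operator kept its default daemonset name, the device-plugin logs can be checked with:

# daemonset name assumed to be the gpu-operator default
oc logs -n gpu-operator-resources ds/nvidia-device-plugin-daemonset --tail=100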