I am trying to run Spark Connect on Kubernetes. The namespaces on my Kubernetes cluster are as follows:

root@master-node:~# kubectl get namespaces
NAME              STATUS   AGE
default           Active   17h
kube-node-lease   Active   17h
kube-public       Active   17h
kube-system       Active   17h

I tried to start Spark Connect on Kubernetes as follows:

./spark-3.5.1-bin-hadoop3/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0 --conf spark.ui.port=4041 --driver-memory 8g --master k8s://https://172.22.0.80:6443  --conf spark.kubernetes.container.image=apache/spark-py --conf spark.kubernetes.namespace=default

Getting the pod status:

root@master-node:~# kubectl get pods -n default
NAME                                           READY   STATUS              RESTARTS   AGE
spark-connect-server-1ab49692f56b85fe-exec-1   0/1     ContainerCreating   0          9m56s
spark-connect-server-1ab49692f56b85fe-exec-2   0/1     ContainerCreating   0          9m56s
root@master-node:~#

I'm not sure why it always creates two pods, or why they stay stuck in the ContainerCreating state.

Describing the pod:

root@master-node:~# kubectl describe pod spark-connect-server-1ab49692f56b85fe-exec-1
Name:             spark-connect-server-1ab49692f56b85fe-exec-1
Namespace:        default
Priority:         0
Service Account:  default
Node:             master-node/172.22.0.80
Start Time:       Mon, 04 Nov 2024 07:55:19 +0330
Labels:           spark-app-name=spark-connect-server
                  spark-app-selector=spark-9939557601604134993bed0957e51c4f

…..

Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Normal   Scheduled               10m                    default-scheduler  Successfully assigned default/spark-connect-server-1ab49692f56b85fe-exec-1 to master-node
  Warning  FailedCreatePodSandBox  10m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4f3673b7b5451dfed45fa2ed5cd230bbaa64ddd997633f8933c052f20a1bfd36" network for pod "spark-connect-server-1ab49692f56b85fe-exec-1": networkPlugin cni failed to set up pod "spark-connect-server-1ab49692f56b85fe-exec-1_default" network: plugin type="flannel" failed (add): loadFlannelSubnetEnv failed: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  10m                    kubelet            
  Normal   SandboxChanged          5m56s (x241 over 10m)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  56s (x473 over 10m)    kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4c6bf46f5379300fed39f569d46c3bfa7e7c845ec11321b2618b3c15adce60ed" network for pod "spark-connect-server-1ab49692f56b85fe-exec-1": networkPlugin cni failed to set up pod "spark-connect-server-1ab49692f56b85fe-exec-1_default" network: plugin type="flannel" failed (add): loadFlannelSubnetEnv failed: open /run/flannel/subnet.env: no such file or directory

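It reports an error related to the flannel namespace. I deleted those resources earlier, so I don't know why this error appears. Resetting and re-initializing the cluster does not fix it.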


Best answer

  1. Check whether Flannel is running: run this command to see whether the Flannel pods are up and running:

    kubectl get pods -n kube-system
    

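    Look for anything with "flannel" in its name. If you do not see any Flannel pods, Flannel may not be installed or may have been removed by accident. Recent Flannel manifests install into their own kube-flannel namespace, so it may be worth checking there as well.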

  2. Restart the Flannel pods: if the Flannel pods exist but seem stuck or unhealthy, try deleting them to trigger a restart:

    kubectl delete pod -n kube-system -l app=flannel
    

    The system should restart Flannel, which may regenerate the subnet.env file.

  3. Check the Flannel ConfigMap: Flannel's settings are stored in a ConfigMap in the kube-system namespace. You can check it with:

    kubectl get configmap -n kube-system
    

    If you do not see anything related to Flannel, re-applying the original Flannel installation manifest may help.

  4. Re-apply the Flannel configuration: re-applying the Flannel manifest can restore the missing file. You can use:

    kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
    
  5. Check node readiness: make sure all nodes are in the Ready state. Network or node problems can sometimes prevent Flannel from setting up networking for pods.
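
As a node-side complement to the steps above, here is a minimal sketch of a shell function that reports whether a subnet.env file exists and contains the expected keys. The function name `check_subnet_env` is hypothetical (it is not part of Flannel or kubectl), and the list of required keys is an assumption taken from the subnet.env contents posted in this thread.

```shell
#!/bin/sh
# check_subnet_env: hypothetical helper, not part of Flannel or kubectl.
# Usage: check_subnet_env [path]   (default: /run/flannel/subnet.env)
check_subnet_env() {
    env_file="${1:-/run/flannel/subnet.env}"

    # The kubelet error in the question is exactly this case: no file at all.
    if [ ! -f "$env_file" ]; then
        echo "missing: $env_file"
        return 1
    fi

    # Assumed set of required keys, taken from the file shown in this thread.
    for key in FLANNEL_NETWORK FLANNEL_SUBNET FLANNEL_MTU FLANNEL_IPMASQ; do
        if ! grep -q "^${key}=" "$env_file"; then
            echo "missing key: $key"
            return 1
        fi
    done

    echo "ok: $env_file"
}
```

Running `check_subnet_env` with no argument on the node named in the pod events (master-node in the question) checks the default path. A "missing" result there matches the `loadFlannelSubnetEnv failed` message in the events.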

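These steps should hopefully get the Flannel network working again so that you can try starting Spark Connect once more. Let me know whether this helps, or if you have more information about what you are seeing!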

This is what worked for me:

vim /run/flannel/subnet.env

FLANNEL_NETWORK=10.240.0.0/16
FLANNEL_SUBNET=10.240.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

Even so, I don't have any flannel application in any namespace. I don't know why I had to do this!
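
The manual edit above can also be scripted. The following is a rough sketch only: `write_subnet_env` is a hypothetical helper, and the values are copied from the file posted in this thread, so on any other cluster they are assumptions that would need to match that cluster's pod network settings.

```shell
#!/bin/sh
# write_subnet_env: hypothetical helper that recreates the workaround file.
# Usage: write_subnet_env [path]   (default: /run/flannel/subnet.env)
# The values below are the ones posted in this thread; treat them as
# assumptions on any other cluster.
write_subnet_env() {
    env_file="${1:-/run/flannel/subnet.env}"

    # Create the parent directory if it does not exist yet.
    mkdir -p "$(dirname "$env_file")" || return 1

    cat > "$env_file" <<'EOF'
FLANNEL_NETWORK=10.240.0.0/16
FLANNEL_SUBNET=10.240.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
EOF
}
```

Calling `write_subnet_env` as root with no argument writes to the default path. This file is normally generated by the Flannel daemon itself, and /run is typically a tmpfs, so a hand-written copy is likely to disappear on reboot. It is a workaround, not a substitute for working out why the flannel CNI plugin is still configured on the node.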