I am trying to run Spark Connect on Kubernetes. The namespaces on my Kubernetes cluster are as follows:
root@master-node:~# kubectl get namespaces
NAME              STATUS   AGE
default           Active   17h
kube-node-lease   Active   17h
kube-public       Active   17h
kube-system       Active   17h
I start Spark Connect on Kubernetes like this:
./spark-3.5.1-bin-hadoop3/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0 --conf spark.ui.port=4041 --driver-memory 8g --master k8s://https://172.22.0.80:6443 --conf spark.kubernetes.container.image=apache/spark-py --conf spark.kubernetes.namespace=default
Getting the pod status:
root@master-node:~# kubectl get pods -n default
NAME                                           READY   STATUS              RESTARTS   AGE
spark-connect-server-1ab49692f56b85fe-exec-1   0/1     ContainerCreating   0          9m56s
spark-connect-server-1ab49692f56b85fe-exec-2   0/1     ContainerCreating   0          9m56s
root@master-node:~#
I am not sure why it always creates two pods, or why they stay in the ContainerCreating state.
Describing the pod:
root@master-node:~# kubectl describe pod spark-connect-server-1ab49692f56b85fe-exec-1
Name: spark-connect-server-1ab49692f56b85fe-exec-1
Namespace: default
Priority: 0
Service Account: default
Node: master-node/172.22.0.80
Start Time: Mon, 04 Nov 2024 07:55:19 +0330
Labels: spark-app-name=spark-connect-server
spark-app-selector=spark-9939557601604134993bed0957e51c4f
…..
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned default/spark-connect-server-1ab49692f56b85fe-exec-1 to master-node
Warning FailedCreatePodSandBox 10m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4f3673b7b5451dfed45fa2ed5cd230bbaa64ddd997633f8933c052f20a1bfd36" network for pod "spark-connect-server-1ab49692f56b85fe-exec-1": networkPlugin cni failed to set up pod "spark-connect-server-1ab49692f56b85fe-exec-1_default" network: plugin type="flannel" failed (add): loadFlannelSubnetEnv failed: open /run/flannel/subnet.env: no such file or directory
Warning FailedCreatePodSandBox 10m kubelet
Normal SandboxChanged 5m56s (x241 over 10m) kubelet Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 56s (x473 over 10m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4c6bf46f5379300fed39f569d46c3bfa7e7c845ec11321b2618b3c15adce60ed" network for pod "spark-connect-server-1ab49692f56b85fe-exec-1": networkPlugin cni failed to set up pod "spark-connect-server-1ab49692f56b85fe-exec-1_default" network: plugin type="flannel" failed (add): loadFlannelSubnetEnv failed: open /run/flannel/subnet.env: no such file or directory
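It reports an error related to the flannel namespace. I removed those resources earlier, and I have no idea why this error still appears. Resetting and re-initializing the cluster does not fix it.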
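Best answer
- Check whether Flannel is running. Run this command to see whether the Flannel pods are up and running:
kubectl get pods -n kube-system
Look for anything with "flannel" in its name. If you don't see any Flannel pods, Flannel may not be installed or may have been removed by accident.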
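- Restart the Flannel pods. If the Flannel pods exist but seem stuck or unhealthy, try deleting them to trigger a restart:
kubectl delete pod -n kube-system -l app=flannel
The system should bring Flannel back up, which may regenerate the subnet.env file.
- Check the Flannel ConfigMap. Flannel's settings are stored in a ConfigMap in the kube-system namespace. You can check it with:
kubectl get configmap -n kube-system
If you don't see anything related to Flannel, re-applying the original Flannel installation manifest may help.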
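- Re-apply the Flannel configuration. Re-applying the Flannel configuration can restore the missing file. You can use:
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
- Check node readiness. Make sure all nodes are in the Ready state. Sometimes network or node problems affect Flannel's ability to set up networking for pods.
These steps should hopefully get Flannel networking working again, so you can try starting Spark Connect once more. Let me know if this helps, or if you have more information about what you're seeing!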
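As a rough sketch, the first and last checks in the list can be scripted. Depending on the Flannel version, the pods may live in a kube-flannel namespace rather than kube-system, so this version is meant to be fed output from all namespaces (-A). The function names flannel_pod_summary and not_ready_nodes are made up for illustration. Both only parse text produced by kubectl with --no-headers and assume its default column layout.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` on stdin.
# Assumed columns with -A: NAMESPACE NAME READY STATUS RESTARTS AGE
flannel_pod_summary() {
    awk '
        $2 ~ /flannel/ {
            found++
            if ($4 != "Running") {
                bad++
                print "not running: " $1 "/" $2 " (" $4 ")"
            }
        }
        END {
            # Exit 2: no flannel pods at all; exit 1: some are unhealthy.
            if (found == 0) { print "no flannel pods found"; exit 2 }
            if (bad > 0) exit 1
            print "all " found " flannel pod(s) running"
        }
    '
}

# Hypothetical helper: reads `kubectl get nodes --no-headers` on stdin.
# Assumed columns: NAME STATUS ROLES AGE VERSION
not_ready_nodes() {
    # Prints every node whose STATUS does not start with "Ready".
    awk '$2 !~ /^Ready/ { print $1 " is " $2 }'
}
```

Usage would look like `kubectl get pods -A --no-headers | flannel_pod_summary` and `kubectl get nodes --no-headers | not_ready_nodes`. An empty result from the second command means every node reports Ready.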
This worked for me:
vim /run/flannel/subnet.env
FLANNEL_NETWORK=10.240.0.0/16
FLANNEL_SUBNET=10.240.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
That said, I don't have a flannel app in any namespace. I have no idea why I had to do this!
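One possible explanation, offered as a guess rather than a diagnosis: the error in the question shows the kubelet calling a CNI plugin of type "flannel", which suggests a flannel CNI config file is still present on the node even though the flannel pods are gone. Such files commonly sit under /etc/cni/net.d, and kubeadm reset does not clean that directory. The plugin then looks for /run/flannel/subnet.env, which the flannel daemon normally writes. Note that /run is usually a tmpfs, so a hand-written file is likely to vanish after a reboot. The sketch below checks for that situation. The function name check_flannel_cni and the default paths are assumptions for illustration.

```shell
# Hypothetical helper: report whether a CNI config directory still mentions
# flannel, and whether the subnet file the flannel plugin reads exists.
# Both default paths are assumptions; pass your own as $1 and $2.
check_flannel_cni() {
    cni_dir=${1:-/etc/cni/net.d}
    subnet_env=${2:-/run/flannel/subnet.env}

    # Any file under the CNI directory that mentions flannel counts.
    if grep -rq flannel "$cni_dir" 2>/dev/null; then
        echo "flannel CNI config present in $cni_dir"
    else
        echo "no flannel CNI config in $cni_dir"
        return 0
    fi

    # The config is there, so the plugin will need the subnet file.
    if [ -f "$subnet_env" ]; then
        echo "$subnet_env exists"
    else
        echo "$subnet_env is missing"
        return 1
    fi
}
```

Running `check_flannel_cni` with no arguments on the affected node uses the default paths. If it reports a flannel config but no flannel pods exist anywhere, either reinstalling Flannel or removing the leftover config (and installing whichever CNI you actually want) is probably a more durable fix than editing subnet.env by hand.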