Kubernetes version: 1.22 · Cloud: bare metal · Installation method: on-premise · Host OS: Red Hat Enterprise Linux 8.10

Good morning. We are having a problem after renewing the certificates of our Kubernetes cluster with the kubeadm certs renew all command. We have an on-premise Kubernetes cluster with 2 master nodes and 6 worker nodes, and after the renewal we lost management access to the cluster. What we did was run the command above to renew the certificates on one of the masters, and then copy the /etc/kubernetes folder containing the certificates to the other master, so that both masters would have the renewed certificates:

Primary master TMT102 (10.164.5.236)

Secondary master TCOLD013 (10.161.169.26)
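The renewal procedure described above can be sketched roughly as follows (hostnames and IPs are from the post; the /etc/kubernetes layout is the standard kubeadm one, and the scp target is an assumption about how the folder was copied):

```shell
# On the primary master (TMT102): renew every kubeadm-managed certificate
kubeadm certs renew all

# Verify the new expiry dates before copying anything
kubeadm certs check-expiration

# Copy the renewed certificates and kubeconfigs to the secondary master
# (10.161.169.26 is TCOLD013; adjust the user/path as needed)
scp -r /etc/kubernetes root@10.161.169.26:/etc/
```

Note that these commands only cover the certificates kubeadm manages; an external etcd (as in this cluster) keeps its own certificates outside /etc/kubernetes.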

However, the problem is that the apiserver pod does not start, failing on both masters with the following error:

I0909 15:46:36.537724       1 server.go:553] external host was not specified, using 10.164.5.236
I0909 15:46:36.538897       1 server.go:161] Version: v1.22.0
I0909 15:46:37.156242       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
I0909 15:46:37.158840       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0909 15:46:37.158879       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0909 15:46:37.161155       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0909 15:46:37.161190       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
Error: context deadline exceeded

If we check the etcd status on both masters, the primary shows the following:

[root@TMT102 jenkinsqa]# systemctl status etcd
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2024-09-06 08:41:51 -04; 3 days ago
     Docs: https://github.com/coreos
 Main PID: 921 (etcd)
    Tasks: 10 (limit: 23184)
   Memory: 70.3M
   CGroup: /system.slice/etcd.service
           └─921 /usr/local/bin/etcd --name TMT102 --cert-file=/etc/etcd/kubernetes.pem --key-file=/etc/etcd/kubernetes-key.pem --peer-cert-file=/etc/e>

Sep 09 11:49:08 TMT102 etcd[921]: health check for peer 38b126bffa9e7ff7 could not connect: x509: certificate has expired or is not yet valid
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36578" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36588" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36590" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36600" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36616" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36632" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36644" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: health check for peer 38b126bffa9e7ff7 could not connect: x509: certificate has expired or is not yet valid
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36648" (error "remote error: tls: bad certificate", ServerName "")

If we check the status of the kubelet, it cannot find the node:

[root@TMT102 jenkinsqa]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Fri 2024-09-06 08:41:52 -04; 3 days ago
     Docs: https://kubernetes.io/docs/
 Main PID: 1055 (kubelet)
    Tasks: 17 (limit: 23184)
   Memory: 117.9M
   CGroup: /system.slice/kubelet.service
           └─1055 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/va>

Sep 09 11:50:02 TMT102 kubelet[1055]: E0909 11:50:02.966092    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.066193    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.167212    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.267684    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.368502    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.468755    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.569086    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.670261    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.771753    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.872367    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"

Because of these errors we are unable to connect to manage the Kubernetes cluster. Could this be because it is an on-premise bare-metal installation?

Best answer

Make sure the certificates on both master nodes have been renewed correctly. You can do this by inspecting the certificates and keys used by the API server and etcd.
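As a concrete way to inspect them, the validity window of each certificate can be checked with kubeadm or openssl. A minimal sketch, assuming the standard kubeadm layout; the external etcd in this cluster keeps its certificates under /etc/etcd (paths taken from the unit file in the question):

```shell
# Summary of all kubeadm-managed certificates and their expiry dates
kubeadm certs check-expiration

# Or inspect an individual certificate directly
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate
openssl x509 -in /etc/etcd/kubernetes.pem -noout -enddate
```

If any of these still print a notAfter date in the past, that certificate was not actually renewed.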

Since you ran into this after renewing the certificates, try restarting the control-plane components to see whether they have picked up the new certificates correctly:

  systemctl restart kube-apiserver
  systemctl restart kube-controller-manager
  systemctl restart kube-scheduler

Make sure the etcd certificates (etcd-server.crt, etcd-server.key, etc.) have been renewed on all master nodes. Check the etcd configuration file (/etc/etcd/etcd.conf) for the correct certificate paths, and make sure etcd is started with the renewed certificates. You can also try restarting etcd:

systemctl restart etcd
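To confirm that the renewed etcd certificates actually chain to the CA the peers trust (the "bad certificate" rejections in the log suggest they do not), something like the following can help. The kubernetes.pem/kubernetes-key.pem names come from the etcd unit shown in the question; the ca.pem path is an assumption:

```shell
# Verify the serving/peer certificate against the cluster CA
openssl verify -CAfile /etc/etcd/ca.pem /etc/etcd/kubernetes.pem

# Check member health using the same certificates etcd itself uses
etcdctl --endpoints=https://10.164.5.236:2379 \
  --cacert=/etc/etcd/ca.pem \
  --cert=/etc/etcd/kubernetes.pem \
  --key=/etc/etcd/kubernetes-key.pem \
  endpoint health
```

Both checks must pass on both masters; a cert copied to only one of them will keep producing the peer rejections shown above.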

Additionally, you can read this for more information on the topic.

If you have already renewed the certificates, or they were renewed automatically, you must restart the kube-apiserver on all master nodes.

On the master node, find the kube-apiserver container with docker:

docker ps | grep -i kube-apiserver

Kill the container with docker kill, wait 10-15 seconds, and it should start working.
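In a standard kubeadm setup the control-plane components run as static pods rather than systemd services, so an alternative to killing the container directly (assuming the default manifest path) is to move the apiserver's manifest out and back, which makes the kubelet recreate the pod:

```shell
# The kubelet stops the static pod when its manifest disappears...
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20

# ...and recreates it, picking up the renewed certificates, once it is back
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
```

This also explains why the systemctl restart kube-apiserver commands above may fail with "unit not found" on a kubeadm-built cluster.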

Regarding the error message "could not connect: x509: certificate has expired or is not yet valid", you can also have a look at this; it is very helpful for resolving the issue.