Kubernetes version: 1.22. Cloud being used: bare metal. Installation method: on-premise. Host OS: Red Hat Enterprise Linux 8.10.
Good morning. We are having a problem after renewing the certificates of our Kubernetes cluster with the kubeadm certs renew all command. We have an on-premise Kubernetes cluster with 2 master nodes and 6 worker nodes, and after the renewal we lost management of the cluster. What we did was run the command above to renew the certificates on one of the masters, then copy the /etc/kubernetes folder containing the certificates to the other master, so that both masters had the renewed certificates:
Renewed primary master TMT102 (10.164.5.236)
Renewed secondary master TCOLD013 (10.161.169.26)
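As a first check after a renewal like this, the expiry date of every certificate can be listed with openssl so the two masters can be compared (kubeadm certs check-expiration prints a similar summary). A minimal sketch, assuming the default kubeadm layout under /etc/kubernetes/pki; the helper name is hypothetical:

```shell
# check_certs DIR: print the notAfter date of every certificate under DIR.
# (Helper name and the /etc/kubernetes/pki path are assumptions, not taken
# from the original post.)
check_certs() {
  for crt in "$1"/*.crt "$1"/*.pem; do
    [ -f "$crt" ] || continue
    printf '%s: ' "$crt"
    openssl x509 -in "$crt" -noout -enddate
  done
}

# On each master, e.g.:
# check_certs /etc/kubernetes/pki
# check_certs /etc/kubernetes/pki/etcd
```

Running this on both masters and diffing the output quickly shows whether the copied /etc/kubernetes folder really carries the renewed dates everywhere.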
However, the problem is that the apiserver pods do not start, and both masters show the following error:
I0909 15:46:36.537724 1 server.go:553] external host was not specified, using 10.164.5.236
I0909 15:46:36.538897 1 server.go:161] Version: v1.22.0
I0909 15:46:37.156242 1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
I0909 15:46:37.158840 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0909 15:46:37.158879 1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0909 15:46:37.161155 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0909 15:46:37.161190 1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
Error: context deadline exceeded
If we check the etcd status on both masters, the primary master shows the following:
[root@TMT102 jenkinsqa]# systemctl status etcd
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2024-09-06 08:41:51 -04; 3 days ago
     Docs: https://github.com/coreos
 Main PID: 921 (etcd)
    Tasks: 10 (limit: 23184)
   Memory: 70.3M
   CGroup: /system.slice/etcd.service
           └─921 /usr/local/bin/etcd --name TMT102 --cert-file=/etc/etcd/kubernetes.pem --key-file=/etc/etcd/kubernetes-key.pem --peer-cert-file=/etc/e>
Sep 09 11:49:08 TMT102 etcd[921]: health check for peer 38b126bffa9e7ff7 could not connect: x509: certificate has expired or is not yet valid
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36578" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36588" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36590" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36600" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36616" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36632" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36644" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: health check for peer 38b126bffa9e7ff7 could not connect: x509: certificate has expired or is not yet valid
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36648" (error "remote error: tls: bad certificate", ServerName "")
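The "certificate has expired or is not yet valid" message above can be confirmed directly with openssl. A minimal sketch: the certificate path is taken from the etcd unit shown above, the CA path is an assumption, and the helper name is hypothetical:

```shell
# cert_ok CERT CA: succeed only if CERT has not expired (-checkend 0) and
# still verifies against CA. (Hypothetical helper; /etc/etcd/ca.pem is an
# assumed path, while /etc/etcd/kubernetes.pem comes from the unit file.)
cert_ok() {
  openssl x509 -in "$1" -noout -checkend 0 >/dev/null &&
  openssl verify -CAfile "$2" "$1" >/dev/null
}

# On each master, e.g.:
# cert_ok /etc/etcd/kubernetes.pem /etc/etcd/ca.pem || echo "etcd cert problem"
```

Note that this etcd is an external, systemd-managed instance with its certificates under /etc/etcd, so kubeadm certs renew all would not have touched them; they likely need to be reissued separately.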
If we check the status of the kubelet, it cannot find the node:
[root@TMT102 jenkinsqa]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2024-09-06 08:41:52 -04; 3 days ago
Docs: https://kubernetes.io/docs/
Main PID: 1055 (kubelet)
Tasks: 17 (limit: 23184)
Memory: 117.9M
CGroup: /system.slice/kubelet.service
└─1055 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/va>
Sep 09 11:50:02 TMT102 kubelet[1055]: E0909 11:50:02.966092 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.066193 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.167212 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.267684 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.368502 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.468755 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.569086 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.670261 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.771753 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.872367 1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
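The "node not found" errors above usually just mean the API server is unreachable, but it is also worth checking the client certificate the kubelet authenticates with. A sketch, assuming a kubeadm-style setup where the certificate is embedded base64-encoded in /etc/kubernetes/kubelet.conf (newer setups may instead reference /var/lib/kubelet/pki/kubelet-client-current.pem by path); the helper name is hypothetical:

```shell
# kubeconfig_cert_expiry FILE: decode the client certificate embedded in a
# kubeconfig and print its subject and expiry date. (Hypothetical helper;
# kubeadm stores the cert under the client-certificate-data key.)
kubeconfig_cert_expiry() {
  grep 'client-certificate-data' "$1" | head -n1 | awk '{print $2}' \
    | base64 -d | openssl x509 -noout -subject -enddate
}

# On a master, e.g.:
# kubeconfig_cert_expiry /etc/kubernetes/kubelet.conf
```

If the subject or expiry looks wrong, the kubelet would be rejected by the API server even once the control plane is back up.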
Because of these errors we cannot connect to manage the Kubernetes cluster. Could this be because it is an on-premise bare-metal installation?
Best Answer
Make sure the certificates have been renewed correctly on both master nodes. You can do this by inspecting the certificates and keys used by the API server and by etcd.
Since you ran into this right after renewing the certificates, try restarting the control-plane components to see whether they pick up the new certificates correctly. The commands below apply if they run as systemd services (on a kubeadm cluster they normally run as static pods instead):
systemctl restart kube-apiserver
systemctl restart kube-controller-manager
systemctl restart kube-scheduler
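On a kubeadm cluster those systemd units typically do not exist: the control plane runs as static pods, and the way to "restart" one is to make the kubelet recreate it, for example by briefly moving its manifest out of the watched directory. A sketch; the helper name and the default /etc/kubernetes/manifests path are assumptions:

```shell
# bounce_manifest FILE HOLD_DIR [SECONDS]: move a static-pod manifest out of
# the kubelet's watched directory, wait for the pod to be torn down, then
# move it back so the kubelet recreates the pod. (Hypothetical helper.)
bounce_manifest() {
  mv "$1" "$2"/ &&
  sleep "${3:-20}" &&
  mv "$2/$(basename "$1")" "$(dirname "$1")"/
}

# On a master (create the hold directory first), e.g.:
# mkdir -p /tmp/hold
# bounce_manifest /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/hold
```

The default 20-second wait gives the kubelet time to notice the manifest is gone and stop the pod before the file is restored.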
Make sure the etcd certificates (etcd-server.crt, etcd-server.key, etc.) have been renewed on all master nodes. Check the etcd configuration file (/etc/etcd/etcd.conf) for the correct certificate paths, and make sure etcd is started with the renewed certificates. You can also try restarting etcd:
systemctl restart etcd
In addition, you can read this for more information on the topic.
If you have already renewed the certificates, or they were renewed automatically, you must restart the kube-apiserver on all master nodes.
On each master, find the kube-apiserver container with docker:
docker ps | grep -i kube-apiserver
Kill the container with docker kill and wait 10-15 seconds; it will start working again.
Regarding the error message "could not connect: x509: certificate has expired or is not yet valid", you can also look into this; it is very helpful for resolving the issue.