Learning | Setting Up a Kubernetes Cluster
Follow the official documentation's installation steps for Debian-based distributions.
Install containerd
Follow the official Ubuntu installation steps.
Modify the containerd configuration
When containerd is installed via apt, the CRI plugin is disabled by default in /etc/containerd/config.toml. Edit that file and then restart containerd.
# /etc/containerd/config.toml
# disabled_plugins = ["cri"]
# change this to:
disabled_plugins = [] # remove "cri"
Restart:
sudo systemctl restart containerd
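After the restart, a quick way to confirm the CRI plugin is enabled is to query the runtime status. This is just a sanity check and assumes crictl is installed (it is also used later in this post):
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock info # should print runtime status/config instead of a CRI-disabled error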
Initialize with kubeadm
Configuration file
# kubeadm-config.yaml
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta4
kubernetesVersion: v1.31.1 # Kubernetes version
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd # use systemd instead of cgroupfs
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.0.0.177 # IP address of this node
  bindPort: 6443 # API server port
nodeRegistration:
  criSocket: /var/run/containerd/containerd.sock # containerd socket (kubeadm warns and prepends "unix://", see the log below)
Run init
sudo kubeadm init --config kubeadm-config.yaml # use the config file above
# or alternatively
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
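Optionally, the control-plane images can be pre-pulled with the same config file first; the preflight output below also hints at this:
sudo kubeadm config images pull --config kubeadm-config.yaml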
Init log:
(base) ubuntu@23-7-31-1539:~$ sudo kubeadm init --config kubeadm-config.yaml
W0919 11:23:05.057648 2924945 initconfiguration.go:126] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/var/run/containerd/containerd.sock". Please update your configuration!
[init] Using Kubernetes version: v1.31.1
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
W0919 11:23:05.195324 2924945 checks.go:846] detected that the sandbox image "registry.k8s.io/pause:3.6" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.k8s.io/pause:3.10" as the CRI sandbox image.
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [23-7-31-1539 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.0.0.177]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [23-7-31-1539 localhost] and IPs [10.0.0.177 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [23-7-31-1539 localhost] and IPs [10.0.0.177 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "super-admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests"
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 504.699086ms
[api-check] Waiting for a healthy API server. This can take up to 4m0s
[api-check] The API server is healthy after 8.001822739s
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node 23-7-31-1539 as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node 23-7-31-1539 as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]
[bootstrap-token] Using token: 2v9nfc.rn3vhrt63xkixat5
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
**mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config**
Alternatively, if you are the root user, you can run:
**export KUBECONFIG=/etc/kubernetes/admin.conf**
You should now deploy a pod network to the cluster.
Run "**kubectl apply -f [podnetwork].yaml**" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
**kubeadm join 10.0.0.177:6443 --token 2v9nfc.rn3vhrt63xkixat5 \
--discovery-token-ca-cert-hash sha256:a5f50d85f8c3d804b378657f43b471cecdd0fa917774aaa7cc588a94cb62016b**
Follow the hints above and run the commands.
Configure kubectl (control-plane node)
# non-root user
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# root user
export KUBECONFIG=/etc/kubernetes/admin.conf
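A quick sanity check that kubectl is now talking to the new control plane:
kubectl cluster-info
kubectl get nodes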
Install the network add-on **(control-plane node; install it after the worker nodes have joined)**
# no sudo needed
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
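You can watch the Flannel DaemonSet pods come up before checking the overall status (the kube-flannel namespace comes from the manifest above):
kubectl -n kube-flannel get pods -w # Ctrl-C to stop watching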
Check the status (run on the control-plane node)
Some of the pods below are in an abnormal state at this point (see the Troubleshooting section).
(base) ubuntu@23-7-31-1539:~$ **kubectl get nodes**
NAME STATUS ROLES AGE VERSION
23-7-31-1539 Ready control-plane 165m v1.31.1
(base) ubuntu@23-7-31-1539:~$ **kubectl get pods --all-namespaces**
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-flannel kube-flannel-ds-9tkl6 0/1 CrashLoopBackOff 6 (43s ago) 6m35s
kube-system coredns-7c65d6cfc9-s9h2g 0/1 ContainerCreating 0 165m
kube-system coredns-7c65d6cfc9-t8zhp 0/1 ContainerCreating 0 165m
kube-system etcd-23-7-31-1539 1/1 Running 0 165m
kube-system kube-apiserver-23-7-31-1539 1/1 Running 0 165m
kube-system kube-controller-manager-23-7-31-1539 1/1 Running 0 165m
kube-system kube-proxy-fx5nj 1/1 Running 0 165m
kube-system kube-scheduler-23-7-31-1539 1/1 Running 0 165m
(base) ubuntu@23-7-31-1539:~$ **kubectl get svc --all-namespaces**
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 166m
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 166m
Join other nodes to the cluster
sudo kubeadm join 10.0.0.177:6443 --token 2v9nfc.rn3vhrt63xkixat5 \
--discovery-token-ca-cert-hash sha256:a5f50d85f8c3d804b378657f43b471cecdd0fa917774aaa7cc588a94cb62016b
Or generate the join command with:
sudo kubeadm token create --print-join-command
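After a worker has joined, verify it from the control-plane node; the new node should appear in the list and become Ready once the network add-on is running on it:
kubectl get nodes -o wide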
Troubleshooting
CoreDNS not Running
On the first kubeadm init I did not pass the --pod-network-cidr=10.244.0.0/16 parameter, so after installing the Pod network add-on, CoreDNS was abnormal and never reached the Running state. To run kubeadm init again, the cluster has to be torn down first. When re-initializing, I did not reset the iptables rules, which left the node in the NotReady state. The complete reset procedure follows.
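To see why the CoreDNS pods are stuck, describing one of them is usually enough; the pod name below is taken from the output earlier in this post, so adjust it to your own:
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system describe pod coredns-7c65d6cfc9-s9h2g # the events show the missing pod network / CNI errors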
Reset
(base) ubuntu@23-7-31-1539:~$ **sudo kubeadm reset**
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W0919 16:30:20.433585 3207403 preflight.go:56] [reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
[reset] Deleted contents of the etcd data directory: /var/lib/etcd
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/super-admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
As prompted, delete the /etc/cni/net.d directory and the $HOME/.kube/config file:
sudo rm -rf /etc/cni/net.d
rm -rf $HOME/.kube/config
Reset the iptables rules. If they are not reset, the node will later report container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized.
# flush all iptables rules
sudo iptables -F
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -X
# or flush only the **kubeadm-created chains**
sudo iptables -t nat -F KUBE-SERVICES
sudo iptables -t nat -F KUBE-NODEPORTS
sudo iptables -t nat -F KUBE-POSTROUTING
sudo iptables -t filter -F KUBE-FIREWALL
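To confirm the KUBE-* chains are gone (or empty) after flushing, list the nat table. This is just a quick check; the exact chain names vary with the kube-proxy mode:
sudo iptables -t nat -L -n | grep KUBE || echo "no KUBE chains left"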
Edit the kubeadm-config.yaml file
Make sure your config file contains the following (especially the ClusterConfiguration section):
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.31.1
networking:
  podSubnet: "10.244.0.0/16" # the pod network CIDR that Flannel expects by default
Re-initialize
sudo kubeadm init --config kubeadm-config.yaml
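After re-initializing, you can confirm that the pod CIDR made it into the cluster configuration; this reads the same kubeadm-config ConfigMap that the reset log above mentions:
kubectl -n kube-system get cm kubeadm-config -o yaml | grep podSubnet # expect 10.244.0.0/16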
etcd keeps restarting (all hosts)
While testing a cluster locally, after kubeadm init finished I found that port 6443 was only reachable intermittently. Running sudo crictl ps -a showed that etcd, kube-apiserver, kube-scheduler, kube-controller-manager, and kube-proxy were all restarting over and over.
Their dependency relationship:
        +--------------------------+
        |           etcd           |
        +--------------------------+
                     ↑
                     |
        +--------------------------+
        |      kube-apiserver      |
        +--------------------------+
                     ↑
        +------------+-------------+
        |                          |
+--------------------------+  +--------------------------+
|  kube-controller-manager |  |      kube-scheduler      |
+--------------------------+  +--------------------------+
        |
        ↓
+--------------------------+
|        kube-proxy        |
+--------------------------+
etcd sits at the bottom of the stack, so the problem was most likely there.
I checked the etcd logs with crictl logs <id>, but there was no obvious error in them. Asking GPT for a long time did not solve it either; eventually I found a blog post which said the problem is caused by cgroups not being set up correctly in containerd's config file /etc/containerd/config.toml. I applied the blog's settings and it still did not work, and applying the official settings on top of the packaged config did not work either.
Finally I found this in the official docs: "If you experience container crash loops after the initial cluster installation or after installing your CNI, the containerd configuration provided with the package might contain incompatible configuration parameters. Consider resetting the containerd configuration with containerd config default > /etc/containerd/config.toml as specified in getting-started.md and then set the configuration parameters listed above accordingly."
Following the official docs, I reset the configuration, then located the SystemdCgroup option and changed it to true.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  ...
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
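If you prefer not to edit the file by hand, the same change can be made with sed. This is a sketch that assumes the regenerated default config contains the line SystemdCgroup = false:
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml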
Restart containerd
sudo systemctl restart containerd
Check that the setting took effect
test@node1:/etc/kubernetes$ sudo crictl info | grep SystemdCgroup
[sudo] password for test:
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
"SystemdCgroup": true # true means the setting is in effect
References:
https://kubernetes.io/zh-cn/docs/setup/production-environment/container-runtimes/#containerd
https://blog.sundayhk.com/post/kube-flannel-always-restarting/
Flannel plugin error (all hosts need this configuration)
test@node1:/etc/kubernetes$ kubectl **get pods --all-namespaces**
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-flannel kube-flannel-ds-bds75 0/1 Error 2 (22s ago) 27s
kube-system coredns-7c65d6cfc9-bhsfd 0/1 Pending 0 45s
kube-system coredns-7c65d6cfc9-x88w5 0/1 Pending 0 45s
kube-system etcd-node1 1/1 Running 65 51s
kube-system kube-apiserver-node1 1/1 Running 97 52s
kube-system kube-controller-manager-node1 1/1 Running 106 51s
kube-system kube-proxy-gxmhs 1/1 Running 0 45s
kube-system kube-scheduler-node1 1/1 Running 97 51s
test@node1:/etc/kubernetes$ **sudo crictl ps -a**
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
WARN[0000] image connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
12fce0592f4e4 f1f31f4a8d91a 17 seconds ago Exited kube-flannel 4 d63150ad4e6fd kube-flannel-ds-bds75
97b6e42125c78 f1f31f4a8d91a About a minute ago Exited install-cni 0 d63150ad4e6fd kube-flannel-ds-bds75
a8c395bc960a1 fd38104bd1952 About a minute ago Exited install-cni-plugin 0 d63150ad4e6fd kube-flannel-ds-bds75
7a040300d828b db596b4c91462 2 minutes ago Running kube-proxy 0 0a3d14b374fad kube-proxy-gxmhs
24939c4c03721 544482eaef8e2 2 minutes ago Running kube-controller-manager 106 809005c11de90 kube-controller-manager-node1
096046a931b16 8bb1f68f41635 2 minutes ago Running kube-scheduler 97 8bd4f5af02265 kube-scheduler-node1
6697a3eb5729b d99a4d575ed24 2 minutes ago Running kube-apiserver 97 f38cc460a0578 kube-apiserver-node1
b012edfe5fa8d 27e3830e14027 2 minutes ago Running etcd 65 5099c6e06b3a5 etcd-node1
test@node1:/etc/kubernetes$ **sudo crictl logs 12fce0592f4e4**
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
I1125 15:53:31.188436 1 main.go:212] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: version:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true ifaceCanReach: subnetFile:/run/flannel/subnet.env publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W1125 15:53:31.188523 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I1125 15:53:31.197120 1 kube.go:139] Waiting 10m0s for node controller to sync
I1125 15:53:31.197161 1 kube.go:469] Starting kube subnet manager
I1125 15:53:32.198266 1 kube.go:146] Node controller sync successful
I1125 15:53:32.198308 1 main.go:232] Created subnet manager: Kubernetes Subnet Manager - node1
I1125 15:53:32.198312 1 main.go:235] Installing signal handlers
I1125 15:53:32.198688 1 main.go:469] Found network config - Backend type: vxlan
E1125 15:53:32.198726 1 main.go:269] Failed to check br_netfilter: stat /proc/sys/net/bridge/bridge-nf-call-iptables: no such file or directory
Asking GPT, the diagnosis was that the br_netfilter module was not loaded and the related kernel parameters were not set.
Fix:
- Check whether the br_netfilter module is loaded
lsmod | grep br_netfilter # no output means the module is not loaded
- Load the br_netfilter module manually (temporary)
sudo modprobe br_netfilter # after running this, check again with the command above
- Make sure br_netfilter is loaded at boot
echo "br_netfilter" | sudo tee -a /etc/modules-load.d/k8s.conf
- Check and enable the related kernel parameters
sudo tee /etc/sysctl.d/99-kubernetes-cri.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
Apply the changes (a read-back check follows after this list):
sudo sysctl --system
- Restart the related services
sudo systemctl restart containerd
sudo systemctl restart kubelet
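To confirm the kernel parameters are actually active, read them back; both should print 1:
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables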
Deploy an application
Create an Nginx Deployment
kubectl create deployment nginx --image=nginx
Verify the status of the Deployment and the Pod
kubectl get deployments
kubectl get pods
Seeing the Nginx Deployment and Pod in the Running state means everything is working.
Expose the Nginx service
kubectl expose deployment nginx --port=80 --type=NodePort
Find the allocated port
(base) ubuntu@23-7-31-1539:~$ **kubectl get svc**
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 83m
nginx NodePort 10.107.90.14 <none> 80:32008/TCP 6s
Note the 32008 in the output: this is the port allocated for external access.
Access the Nginx application
http://<Node-IP>:<NodePort>
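For example, using the node IP and NodePort from this walkthrough (adjust both to your own values):
curl http://10.0.0.177:32008 # should return the default Nginx welcome page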
Delete the application
kubectl delete deployment nginx
kubectl delete svc nginx
Summary
This was my first time deploying Kubernetes. I only tested on a single machine and did not try joining other nodes. I still do not understand many of the underlying principles, so I took quite a few detours while troubleshooting; this post is a record of what I learned.