Recently, I’ve been trying to use Cluster API
to deploy a cluster on vSphere. During this process I ran into many different problems, so in this article I’m going to talk about how to use this tool and how to deal with the problems we might meet.
Environment requirements
Make sure these tools are all installed:
- kubectl
- go
- docker
- clusterctl
- kind
Deployment Steps
1. Configure clusterctl
Create a file named clusterctl.yaml in the .cluster-api folder, which is normally located at $HOME.
For example:
## -- Controller settings -- ##
VSPHERE_USERNAME: "..." # The username used to access the remote vSphere endpoint
VSPHERE_PASSWORD: "..." # The password used to access the remote vSphere endpoint
## -- Required workload cluster default settings -- ##
VSPHERE_SERVER: <vcenter_server_ip> # The vCenter server IP or FQDN
VSPHERE_DATACENTER: <vcenter_data_center> # The vSphere datacenter to deploy the management cluster on
VSPHERE_DATASTORE: <datastore_name> # The vSphere datastore to deploy the management cluster on
VSPHERE_NETWORK: <network_name> # The VM network to deploy the management cluster on
VSPHERE_RESOURCE_POOL: <Path_to_resource_pool> # The vSphere resource pool for your VMs (see the Troubleshooting #1 section for how to get the correct path)
VSPHERE_FOLDER: "" # The VM folder for your VMs. Set to "" to use the root vSphere folder
VSPHERE_TEMPLATE: <template_name> # The VM template to use for your management cluster.
CONTROL_PLANE_ENDPOINT_IP: <control_plane_ip> # The IP that kube-vip is going to use as the control plane endpoint
VSPHERE_TLS_THUMBPRINT: "..." # SHA-1 thumbprint of the vCenter certificate; can be obtained with: openssl x509 -sha1 -fingerprint -in ca.crt -noout
EXP_CLUSTER_RESOURCE_SET: "true" # This enables the ClusterResourceSet feature that we are using for deploying CSI
VSPHERE_SSH_AUTHORIZED_KEY: "ssh-rsa ..." # The public SSH authorized key for all machines; set it to "" if you don't want SSH access to the nodes
VSPHERE_STORAGE_POLICY: "" # This is the vSphere storage policy. Set it to "" if you don't want to use a storage policy.
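If you don’t have the vCenter CA file at hand, the SHA-1 thumbprint can also be fetched directly from the server; a sketch (the server address is a placeholder, and this assumes network access to vCenter on port 443):

```shell
# Fetch the vCenter certificate over TLS and print its SHA-1 fingerprint
echo | openssl s_client -connect <vcenter_server_ip>:443 2>/dev/null \
  | openssl x509 -sha1 -fingerprint -noout
```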
Remember to set these before initializing the management cluster; otherwise we might get an error about the ClusterResourceSet.
(If this error occurs, we need to edit the deployment of capi-controller-manager with kubectl, as shown below.)
#deploy/capi-controller-manager
spec:
containers:
- args:
- --leader-elect
- --metrics-bind-addr=localhost:8080
- --feature-gates=MachinePool=false,ClusterResourceSet=true,ClusterTopology=false
Set ClusterResourceSet to true as shown in the code above.
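The edit itself can be applied with kubectl; this assumes the provider was installed into the default capi-system namespace:

```shell
# Open the deployment for editing and adjust the --feature-gates argument
kubectl -n capi-system edit deployment capi-controller-manager
```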
2. Use kind and clusterctl to create a management cluster
(1) Use kind to create a new Kubernetes cluster (actually a Docker container).
(2) Configure kubeconfig and make sure the current context of kubectl points to the kind cluster.
(3) Use clusterctl to turn the current cluster into a management cluster (remember to specify vsphere as the infrastructure provider).
# Create kind cluster
$ kind create cluster
# Check cluster info; the default context name of this kind cluster is kind-kind
$ kubectl cluster-info --context kind-kind
# Transform current cluster to management cluster
$ clusterctl init --infrastructure vsphere
In this step, clusterctl will install 4 components into the cluster: cluster-api, bootstrap-kubeadm, control-plane-kubeadm, and infrastructure-vsphere. Then we can use the kubectl get pods --all-namespaces command to check the status of the management cluster. If some pods are stuck in ImagePullBackOff status, please check the Troubleshooting #3 section.
3. Download the OVA file and deploy a template on vSphere
It is required that machines provisioned by CAPV have cloud-init, kubeadm, and a container runtime pre-installed. We can use one of the CAPV machine images generated by SIG Cluster Lifecycle as a VM template. Images are available here: ova download.
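As a sketch, the downloaded OVA can be uploaded and marked as a template with govc (the file name, datastore, and resource pool below are placeholders, and this assumes GOVC_URL and credentials are already exported):

```shell
# Import the OVA into the target datastore and resource pool
govc import.ova -ds=<datastore_name> -pool=<resource_pool> \
  -name=ubuntu-2004-kube-v1.20.1 ./ubuntu-2004-kube-v1.20.1.ova
# Mark the resulting VM as a template so CAPV can clone from it
govc vm.markastemplate ubuntu-2004-kube-v1.20.1
```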
4. Deploy workload cluster
(1) Generate a YAML file for the cluster:
Use the command shown below to generate a YAML file for the target cluster.
clusterctl generate cluster <cluster name> --kubernetes-version <kubernetes version> --control-plane-machine-count 1 --worker-machine-count 3 > cluster.yaml
(2) Deploy the target cluster
Deploy the cluster with the command kubectl apply -f cluster.yaml.
After these two steps, we can get info about the cluster with the kubectl get cluster command:
$ kubectl get cluster
NAME PHASE AGE VERSION
eighth Provisioned 23h
If nothing goes wrong (no situation like a wrong vSphere configuration [Troubleshooting #1] or a missing kube-vip image [Troubleshooting #2]), then all VMs (master node and workers) will be created on vSphere.
With the kubectl get machine command, we can check the status of the VMs that have been created for our cluster.
$ kubectl get machine
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
eighth-md-0-7b848c46d8-69m2n eighth eighth-md-0-7b848c46d8-69m2n vsphere://423aff2a-386c-f7e1-ca4a-3d1fcde217c6 Running 23h v1.20.1
eighth-md-0-7b848c46d8-7m6h5 eighth eighth-md-0-7b848c46d8-7m6h5 vsphere://423a2464-d3ee-4c75-96ac-3202609cc2ee Running 23h v1.20.1
eighth-md-0-7b848c46d8-cbnwf eighth eighth-md-0-7b848c46d8-cbnwf vsphere://423a1791-3d80-1f8f-a071-b7b5900f69ea Running 23h v1.20.1
eighth-tf4mg eighth eighth-tf4mg vsphere://423af9cd-4b1e-617c-a778-b4e435c4efd0 Running 23h v1.20.1
And now we can generate a kubeconfig (with the clusterctl get kubeconfig <cluster_name> command) to access this new cluster:
$ clusterctl get kubeconfig eighth > cluster.kubeconfig
$ KUBECONFIG=cluster.kubeconfig kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-74ff55c5b-f9qp7 0/1 Pending 0 3m57s <none> <none> <none> <none>
kube-system coredns-74ff55c5b-qdm4m 0/1 Pending 0 3m57s <none> <none> <none> <none>
kube-system etcd-eighth-tf4mg 1/1 Running 0 3m50s 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-apiserver-eighth-tf4mg 1/1 Running 0 3m50s 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-controller-manager-eighth-tf4mg 1/1 Running 0 3m50s 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-proxy-2njsn 1/1 Running 0 3m57s 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-proxy-5cw4z 1/1 Running 0 22s 10.103.226.235 eighth-md-0-7b848c46d8-69m2n <none> <none>
kube-system kube-proxy-5kknb 1/1 Running 0 7s 10.103.226.236 eighth-md-0-7b848c46d8-7m6h5 <none> <none>
kube-system kube-proxy-mmtzw 1/1 Running 0 78s 10.103.226.234 eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system kube-scheduler-eighth-tf4mg 1/1 Running 0 3m50s 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-vip-eighth-tf4mg 1/1 Running 0 3m49s 10.103.226.233 eighth-tf4mg <none> <none>
kube-system vsphere-cloud-controller-manager-2ldmr 0/1 ImagePullBackOff 0 3m58s 10.103.226.233 eighth-tf4mg <none> <none>
kube-system vsphere-cloud-controller-manager-5fp4w 0/1 ImagePullBackOff 0 23s 10.103.226.235 eighth-md-0-7b848c46d8-69m2n <none> <none>
kube-system vsphere-cloud-controller-manager-6pxj7 0/1 ContainerCreating 0 7s 10.103.226.236 eighth-md-0-7b848c46d8-7m6h5 <none> <none>
kube-system vsphere-cloud-controller-manager-nlg8r 0/1 ImagePullBackOff 0 78s 10.103.226.234 eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system vsphere-csi-controller-5456544dd5-htkvn 0/5 Pending 0 3m59s <none> <none> <none> <none>
kube-system vsphere-csi-node-g4wgk 0/3 ContainerCreating 0 78s <none> eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system vsphere-csi-node-lnt7t 0/3 ContainerCreating 0 3m59s <none> eighth-tf4mg <none> <none>
kube-system vsphere-csi-node-sm2fg 0/3 ContainerCreating 0 23s <none> eighth-md-0-7b848c46d8-69m2n <none> <none>
kube-system vsphere-csi-node-xcs5d 0/3 ContainerCreating 0 7s <none> eighth-md-0-7b848c46d8-7m6h5 <none> <none>
According to the output, there are several problems that need to be solved.
If we get ImagePullBackOff, the Troubleshooting #3 section might be helpful.
Also, coredns pods will be stuck at Pending because CNI is missing in this cluster, and we need to deploy it manually.
5. Deploy CNI
Calico is used in my cluster. We can use this command to deploy it:
#Deploy CNI
$kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
After CNI is successfully deployed, we can check the status of the cluster:
$kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-558995777d-jkkk4 1/1 Running 0 24h 192.168.139.67 eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system calico-node-6gfdj 1/1 Running 0 24h 10.103.226.236 eighth-md-0-7b848c46d8-7m6h5 <none> <none>
kube-system calico-node-92nn8 1/1 Running 0 24h 10.103.226.234 eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system calico-node-jbjvx 1/1 Running 0 24h 10.103.226.233 eighth-tf4mg <none> <none>
kube-system calico-node-k42dc 1/1 Running 0 24h 10.103.226.235 eighth-md-0-7b848c46d8-69m2n <none> <none>
kube-system coredns-74ff55c5b-f9qp7 1/1 Running 0 24h 192.168.139.68 eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system coredns-74ff55c5b-qdm4m 1/1 Running 0 24h 192.168.139.66 eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system etcd-eighth-tf4mg 1/1 Running 0 24h 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-apiserver-eighth-tf4mg 1/1 Running 0 24h 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-controller-manager-eighth-tf4mg 1/1 Running 0 24h 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-proxy-2njsn 1/1 Running 0 24h 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-proxy-5cw4z 1/1 Running 0 24h 10.103.226.235 eighth-md-0-7b848c46d8-69m2n <none> <none>
kube-system kube-proxy-5kknb 1/1 Running 0 24h 10.103.226.236 eighth-md-0-7b848c46d8-7m6h5 <none> <none>
kube-system kube-proxy-mmtzw 1/1 Running 0 24h 10.103.226.234 eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system kube-scheduler-eighth-tf4mg 1/1 Running 0 24h 10.103.226.233 eighth-tf4mg <none> <none>
kube-system kube-vip-eighth-tf4mg 1/1 Running 0 24h 10.103.226.233 eighth-tf4mg <none> <none>
kube-system vsphere-cloud-controller-manager-2ldmr 1/1 Running 0 24h 10.103.226.233 eighth-tf4mg <none> <none>
kube-system vsphere-cloud-controller-manager-5fp4w 1/1 Running 0 24h 10.103.226.235 eighth-md-0-7b848c46d8-69m2n <none> <none>
kube-system vsphere-cloud-controller-manager-6pxj7 1/1 Running 0 24h 10.103.226.236 eighth-md-0-7b848c46d8-7m6h5 <none> <none>
kube-system vsphere-cloud-controller-manager-nlg8r 1/1 Running 0 24h 10.103.226.234 eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system vsphere-csi-controller-5456544dd5-9w49q 5/5 Running 0 22h 192.168.215.194 eighth-md-0-7b848c46d8-69m2n <none> <none>
kube-system vsphere-csi-node-g4wgk 3/3 Running 0 24h 192.168.139.65 eighth-md-0-7b848c46d8-cbnwf <none> <none>
kube-system vsphere-csi-node-lnt7t 3/3 Running 6 24h 192.168.193.193 eighth-tf4mg <none> <none>
kube-system vsphere-csi-node-sm2fg 3/3 Running 1 24h 192.168.215.193 eighth-md-0-7b848c46d8-69m2n <none> <none>
kube-system vsphere-csi-node-xcs5d 3/3 Running 1 24h 192.168.250.1 eighth-md-0-7b848c46d8-7m6h5 <none> <none>
All the pods of this cluster are running, which means a Kubernetes cluster has been successfully deployed on vSphere.
Troubleshooting
#1 No VM is created on vSphere after kubectl apply -f <cluster_yaml_file>
No new VM is created on vSphere after the kubectl apply -f <cluster_yaml_file> command, and the ControlPlane is stuck at WaitingForKubeadmInit:
# Get cluster status
$clusterctl describe cluster test
NAME READY SEVERITY REASON SINCE MESSAGE
/test False Info WaitingForKubeadmInit 109m
├─ClusterInfrastructure - VSphereCluster/test True 109m
├─ControlPlane - KubeadmControlPlane/test False Info WaitingForKubeadmInit 20h
│ └─Machine/test-m7n2j True 21h
└─Workers
└─MachineDeployment/test-md-0 False Warning WaitingForAvailableMachines 21h Minimum availability requires 3 replicas, current 0 available
└─3 Machines... False Info WaitingForControlPlaneAvailable 21h See test-md-0-f67dc55-dkpqx, test-md-0-f67dc55-g4brk, ...
We can check the logs of capv-controller-manager and capi-controller-manager:
#capv-controller-manager
$ kubectl logs deploy/capv-controller-manager -f
...
E1209 10:33:47.637562 1 controller.go:317] controller/vspherevm "msg"="Reconciler error" "error"="failed to reconcile VM: cannot traverse type VirtualMachine" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="VSphereVM"
I1209 10:33:59.999674 1 reflector.go:535] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Watch close - *v1beta1.VSphereMachine total 2 items received
I1209 10:34:12.990178 1 reflector.go:535] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Watch close - *v1beta1.VSphereDeploymentZone total 0 items received
#capi-controller-manager
$ kubectl logs deploy/capi-controller-manager -f
I1209 10:35:44.955263 1 machine_controller_phases.go:282] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="vsphere-quickstart" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:35:44.955316 1 machine_controller_noderef.go:48] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="vsphere-quickstart" "machine"="vsphere-quickstart-sp4j8" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:03.491520 1 machine_controller_phases.go:220] controller/machine "msg"="Bootstrap provider is not ready, requeuing" "cluster"="vsphere-quickstart" "name"="vsphere-quickstart-md-0-c8d556cb-hfc6g" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:03.498654 1 machine_controller_phases.go:282] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="vsphere-quickstart" "name"="vsphere-quickstart-md-0-c8d556cb-hfc6g" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:03.498760 1 machine_controller_noderef.go:48] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="vsphere-quickstart" "machine"="vsphere-quickstart-md-0-c8d556cb-hfc6g" "name"="vsphere-quickstart-md-0-c8d556cb-hfc6g" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:14.966929 1 machine_controller_phases.go:282] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="vsphere-quickstart" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:14.967009 1 machine_controller_noderef.go:48] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="vsphere-quickstart" "machine"="vsphere-quickstart-sp4j8" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
capv-controller-manager printed the error message: failed to reconcile VM: cannot traverse type VirtualMachine.
It might be because the VSPHERE_FOLDER and VSPHERE_RESOURCE_POOL specified in clusterctl.yaml are incorrect.
To get the correct paths of the VM folder and resource pool, we can use the govc CLI (a vSphere command line tool: https://github.com/vmware/govmomi/tree/master/govc).
(1) VSPHERE_RESOURCE_POOL
For example, if the resource pool’s name is “Test” and we want to get the full path of this resource pool:
$govc pool.info -dc=<data_center_name> Test
Name: Test
Path: /<Datacenter>/host/<host_name>/.../Resources/.../Test
CPU Usage: 965MHz (2.5%)
CPU Shares: normal
CPU Reservation: 0MHz (expandable=true)
CPU Limit: -1MHz
Mem Usage: 59816MB (11.7%)
Mem Shares: normal
Mem Reservation: 0MB (expandable=true)
Mem Limit: -1MB
The Path shown in the response is the correct value to fill into the VSPHERE_RESOURCE_POOL field in the clusterctl.yaml file.
(2) VSPHERE_FOLDER
# The govc ls command lists all inventory items in the target vSphere:
$govc ls
/<datacenter>/vm
/<datacenter>/network
/<datacenter>/host
/<datacenter>/datastore
# List items that are stored in the /<datacenter>/vm inventory
$govc ls /<datacenter>/vm
/<datacenter>/vm/Cluster_Conntroller_10.103.64.218
/<datacenter>/vm/test-win10-lan1
...
According to the output above, basically all the VMs are stored in the /<datacenter>/vm inventory, so we can set VSPHERE_FOLDER to "" and vSphere will use the root VM folder directly.
#2 Stuck at WaitingForKubeadmInit after VM is successfully provisioned
$kubectl get machine
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
test-m7n2j test vsphere://423aaa89-02b5-4796-7fdc-0b4619a0d4d6 Provisioned 21h v1.20.1
test-md-0-f67dc55-dkpqx test Pending 21h v1.20.1
test-md-0-f67dc55-g4brk test Pending 21h v1.20.1
test-md-0-f67dc55-x48xx test Pending 21h v1.20.1
The output above tells us that the master node was successfully provisioned (the VM was created on vSphere).
We can SSH into the target VM using the capv account and the SSH key we specified in the clusterctl.yaml file.
# Access the target VM using ssh (make sure you've set VSPHERE_SSH_AUTHORIZED_KEY in clusterctl.yaml)
$ ssh capv@<vm_ip>
We can check the output of cloud-init first:
# check output of cloud-init
$ cat /var/log/cloud-init-output.log|less
...
[2021-12-16 17:42:57] [kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[2021-12-16 17:42:58] [kubeconfig] Writing "admin.conf" kubeconfig file
[2021-12-16 17:42:58] [kubeconfig] Writing "kubelet.conf" kubeconfig file
[2021-12-16 17:42:58] [kubeconfig] Writing "controller-manager.conf" kubeconfig file
[2021-12-16 17:42:58] [kubeconfig] Writing "scheduler.conf" kubeconfig file
[2021-12-16 17:42:58] [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[2021-12-16 17:42:58] [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[2021-12-16 17:42:58] [kubelet-start] Starting the kubelet
[2021-12-16 17:42:58] [control-plane] Using manifest folder "/etc/kubernetes/manifests"
[2021-12-16 17:42:58] [control-plane] Creating static Pod manifest for "kube-apiserver"
[2021-12-16 17:42:58] [control-plane] Creating static Pod manifest for "kube-controller-manager"
[2021-12-16 17:42:58] [control-plane] Creating static Pod manifest for "kube-scheduler"
[2021-12-16 17:42:58] [etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[2021-12-16 17:42:58] [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[2021-12-16 10:09:10] [kubelet-check] Initial timeout of 40s passed.
[2021-12-16 10:12:39]
[2021-12-16 10:12:39] Unfortunately, an error has occurred:
[2021-12-16 10:12:39] timed out waiting for the condition
[2021-12-16 10:12:39]
[2021-12-16 10:12:39] This error is likely caused by:
[2021-12-16 10:12:39] - The kubelet is not running
[2021-12-16 10:12:39] - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
[2021-12-16 10:12:39]
[2021-12-16 10:12:39] If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
[2021-12-16 10:12:39] - 'systemctl status kubelet'
[2021-12-16 10:12:39] - 'journalctl -xeu kubelet'
[2021-12-16 10:12:39]
[2021-12-16 10:12:39] Additionally, a control plane component may have crashed or exited when started by the container runtime.
[2021-12-16 10:12:39] To troubleshoot, list all containers using your preferred container runtimes CLI.
[2021-12-16 10:12:39]
[2021-12-16 10:12:39] Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
[2021-12-16 10:12:39] - 'crictl --runtime-endpoint /var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
[2021-12-16 10:12:39] Once you have found the failing container, you can inspect its logs with:
[2021-12-16 10:12:39] - 'crictl --runtime-endpoint /var/run/containerd/containerd.sock logs CONTAINERID'
[2021-12-16 10:12:39]
[2021-12-16 10:12:39] error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
...
It seems an error occurred while waiting for the kubelet to boot up the control plane. We can check the status and logs of kubelet:
# Check status of kubelet
$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Thu 2021-12-16 17:42:58 UTC; 10h ago
Docs: https://kubernetes.io/docs/home/
Main PID: 1127 (kubelet)
Tasks: 15 (limit: 4915)
CGroup: /system.slice/kubelet.service
└─1127 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cloud-provider=external --container-runtime=remote --container-ru
# If kubelet is running without any error
# use `sudo journalctl -u kubelet` or `sudo journalctl -u kubelet --since <specific_time>`
$sudo journalctl -u kubelet --since "1 day ago"|less
...
Dec 16 10:08:38 fifth-tgq4j kubelet[1127]: E1216 10:08:38.885118 1127 remote_image.go:113] PullImage "ghcr.io/kube-vip/kube-vip:v0.3.5" from image service failed: rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to resolve reference "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to do request: Head "https://ghcr.io/v2/kube-vip/kube-vip/manifests/v0.3.5": dial tcp 20.205.243.164:443: connect: connection refused
Dec 16 10:08:38 fifth-tgq4j kubelet[1127]: E1216 10:08:38.885218 1127 kuberuntime_image.go:51] Pull image "ghcr.io/kube-vip/kube-vip:v0.3.5" failed: rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to resolve reference "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to do request: Head "https://ghcr.io/v2/kube-vip/kube-vip/manifests/v0.3.5": dial tcp 20.205.243.164:443: connect: connection refused
Dec 16 10:08:38 fifth-tgq4j kubelet[1127]: E1216 10:08:38.885452 1127 kuberuntime_manager.go:829] container &Container{Name:kube-vip,Image:ghcr.io/kube-vip/kube-vip:v0.3.5,Command:[],Args:[start],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:vip_arp,Value:true,ValueFrom:nil,},EnvVar{Name:vip_leaderelection,Value:true,ValueFrom:nil,},EnvVar{Name:vip_address,Value:10.103.226.219,ValueFrom:nil,},EnvVar{Name:vip_interface,Value:eth0,ValueFrom:nil,},EnvVar{Name:vip_leaseduration,Value:15,ValueFrom:nil,},EnvVar{Name:vip_renewdeadline,Value:10,ValueFrom:nil,},EnvVar{Name:vip_retryperiod,Value:2,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kubeconfig,ReadOnly:false,MountPath:/etc/kubernetes/admin.conf,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_ADMIN SYS_TIME],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod kube-vip-fifth-tgq4j_kube-system(b31b938f7a5929f365eb5caefef24fa5): ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to resolve reference "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to do request: Head "https://ghcr.io/v2/kube-vip/kube-vip/manifests/v0.3.5": dial tcp 20.205.243.164:443: connect: connection refused
Dec 16 10:08:38 fifth-tgq4j kubelet[1127]: E1216 10:08:38.885518 1127 pod_workers.go:191] Error syncing pod b31b938f7a5929f365eb5caefef24fa5 ("kube-vip-fifth-tgq4j_kube-system(b31b938f7a5929f365eb5caefef24fa5)"), skipping: failed to "StartContainer" for "kube-vip" with ErrImagePull: "rpc error: code = Unknown desc = failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip:v0.3.5\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip:v0.3.5\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip/manifests/v0.3.5\": dial tcp 20.205.243.164:443: connect: connection refused"
Dec 16 10:08:39 fifth-tgq4j kubelet[1127]: E1216 10:08:39.268272 1127 pod_workers.go:191] Error syncing pod b31b938f7a5929f365eb5caefef24fa5 ("kube-vip-fifth-tgq4j_kube-system(b31b938f7a5929f365eb5caefef24fa5)"), skipping: failed to "StartContainer" for "kube-vip" with ImagePullBackOff: "Back-off pulling image \"ghcr.io/kube-vip/kube-vip:v0.3.5\""
...
It seems kubeadm failed to pull the kube-vip image during initialization.
# We can also check the Kubernetes manifests folder, which contains the
# configurations of all static pods that Kubernetes creates during
# initialization
$ ls /etc/kubernetes/manifests
etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml kube-vip.yaml
# List all Kubernetes containers running in cri-o/containerd
$ sudo crictl --runtime-endpoint /var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause
0154378d48ba2 4aa0b4397bbbc 20 hours ago Running kube-scheduler 0 ea07bd6eb04ff
57bd6d75bad96 75c7f71120808 20 hours ago Running kube-apiserver 0 66da2fd70475a
4008afcd0c408 2893d78e47dc3 20 hours ago Running kube-controller-manager 0 f3b477e5c7ae1
...
If it’s because kubeadm failed to pull the kube-vip image during initialization, we might need to upload the image manually
(assuming the image has been successfully downloaded to the local environment):
# Save docker image to file kube-vip (local environment which might have proxy)
$ docker save ghcr.io/kube-vip/kube-vip:v0.3.5 -o kube-vip
# Transfer file to target VM
$ scp kube-vip capv@<ip>:~
# Access to target VM using ssh
$ ssh capv@<vm_ip>
# Ask containerd to load image
$ sudo ctr -n k8s.io image import kube-vip
After the image is successfully uploaded to the target VM, we need to reset and run kubeadm init again with the same config, because kube-proxy and coredns failed to initialize when the earlier kubeadm init was interrupted.
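A sketch of that re-run on the control plane VM; the config path below is an assumption about where the bootstrap provider writes it, so verify it exists on your machine first:

```shell
# Wipe the partially initialized control plane
sudo kubeadm reset -f
# Re-run init with the same config cloud-init used
# (/run/kubeadm/kubeadm.yaml is an assumption; check your VM)
sudo kubeadm init --config /run/kubeadm/kubeadm.yaml
```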
#3 Kubernetes pods stuck at ImagePullBackOff status
If an ImagePullBackOff error is found when we check the status of the cluster with the kubectl get pods --all-namespaces command:
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
capi-kubeadm-bootstrap-system capi-kubeadm-bootstrap-controller-manager-58945b95bf-nr8kv 0/1 ImagePullBackOff 0 3h57m
capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-58fc8f8c7c-jg9wn 0/1 ImagePullBackOff 0 3h56m
capi-system capi-controller-manager-576744d8b7-vlvjt 0/1 ImagePullBackOff 0 8m7s
capv-system capv-controller-manager-6fcb95cd6-pv7m5 0/1 ImagePullBackOff 0 3h28m
cert-manager cert-manager-848f547974-gtvzs 1/1 Running 2 3h57m
...
This is because the cluster failed to pull images of these components. We can try to set up a proxy to resolve it.
If the proxy doesn’t work, then we need to find another source for these images.
To solve the ImagePullBackOff problem:
(1) Check the deployments of these pods and find out which images they need.
(2) Search for the images on Docker Hub and download them.
(3) Use the docker tag <origin> <target> command to change the tags of these images.
(4) Use the kind load docker-image <image:tag> command to load the images into the environment of the management cluster.
Then all the pods of the management cluster will run correctly.
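For example, those steps might look like this (the image name and registries below are hypothetical placeholders; use the images your deployments actually reference):

```shell
# Pull the image from a registry that is reachable from your environment
docker pull <mirror>/cluster-api-controller:v1.0.0
# Re-tag it to the exact name the deployment expects
docker tag <mirror>/cluster-api-controller:v1.0.0 <original_registry>/cluster-api-controller:v1.0.0
# Load it into the kind node so the kubelet finds it locally
kind load docker-image <original_registry>/cluster-api-controller:v1.0.0
```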
#4 ControlPlane stuck at PoweringOn @ Machine/<machine_name>
$ clusterctl describe cluster <cluster_name>
NAME READY SEVERITY REASON SINCE MESSAGE
/<cluster_name>                                                False  Info     PoweringOn @ Machine/<cluster_name>-svpt4  17h  1 of 2 completed
├─ClusterInfrastructure - VSphereCluster/<cluster_name> True 17h
├─ControlPlane - KubeadmControlPlane/<cluster_name> False Info PoweringOn @ Machine/<cluster_name>-svpt4 17h 1 of 2 completed
│ └─Machine/<cluster_name>-svpt4 False Info PoweringOn 17h 1 of 2 completed
└─Workers
└─MachineDeployment/<cluster_name>-md-0 False Warning WaitingForAvailableMachines 17h Minimum availability requires 3 replicas, current 0 available
  └─3 Machines...                                              False  Info     WaitingForControlPlaneAvailable            17h  See <cluster_name>-md-0-7f754dc455-796h5, <cluster_name>-md-0-7f754dc455-7r5kd, ...
We can log in to vCenter and use the web console to access our new VM (Machine/<cluster_name>-svpt4).
(1) If the VSPHERE_TEMPLATE specified in clusterctl.yaml is a CentOS image (centos-*), the console may show errors like:
capv.vm kernel: BAR 13 : failed to assign [io size 0x1000]
capv.vm kernel: BAR 13 : no space for [io size 0x1000]
capv.vm dracut-initqueue[225]: Warning : dracut-initqueue timeout - starting timeout scripts
capv.vm dracut-initqueue[225]: Warning : Could not boot
Since cloud-init hasn’t finished initializing, I cannot log in to this VM, so I haven’t figured out how to deal with this situation; I used Ubuntu instead.
(2) If the VSPHERE_TEMPLATE specified in clusterctl.yaml is an Ubuntu image (ubuntu-*), the console may show errors like:
...
failed to start wait for network to be configured
...
This might be because no DHCP service is set up in this vCenter environment; the created VM is expected to get its own IP address assigned by DHCP.
#5 Fingerprint error / certificate error
If a pod of vsphere-csi-controller is stuck at CrashLoopBackOff and its log shows:
$kubectl logs -n kube-system pod/vsphere-csi-controller-5456544dd5-44h77 vsphere-csi-controller
...
{"level":"error","time":"2021-12-20T11:33:00.238213779Z","caller":"vanilla/controller.go:121","msg":"failed to get vcenter. err=Post https://...:443/sdk: x509: certificate is valid for ..., not ....","TraceId":"d39d14cf-6386-4585-831c-9a5020ab14ab",...
{"level":"error","time":"2021-12-20T11:33:00.238251257Z","caller":"service/service.go:135","msg":"failed to init controller. Error: Post https://...m:443/sdk: x509: certificate is valid for ..., not ...","TraceId":"5e5b850d-4350-4115-8e86-6beb34f2ebad",...
{"level":"info","time":"2021-12-20T11:33:00.238357103Z","caller":"service/service.go:110","msg":"configured: \"csi.vsphere.vmware.com\" with clusterFlavor: \"VANILLA\" and mode: \"controller\"","TraceId":"5e5b850d-4350-4115-8e86-6beb34f2ebad"}
time="2021-12-20T11:33:00Z" level=info msg="removed sock file" path=/var/lib/csi/sockets/pluginproxy/csi.sock
time="2021-12-20T11:33:00Z" level=fatal msg="grpc failed" error="Post https://<vsphere_server_ip>:443/sdk: x509: certificate is valid for ..., not ..."
...
or
$kubectl logs -n kube-system pod/vsphere-csi-controller-5456544dd5-44h77 vsphere-csi-controller
...
{"level":"error","time":"2021-12-21T07:03:22.107974392Z","caller":"service/service.go:135","msg":"failed to init controller. Error: Post <vsphere_server_ip>:443/sdk: x509: certificate signed by unknown authority"...
...
time="2021-12-21T07:03:22Z" level=fatal msg="grpc failed" error="Post https://<vsphere_server_ip>:443/sdk: x509: certificate signed by unknown authority"
Please check whether the VSPHERE_SERVER field and VSPHERE_TLS_THUMBPRINT in clusterctl.yaml match each other.
If they match and you still get an unknown authority error like the one shown above, you can add insecure-flag = true to secret/csi-vsphere-config:
# Original secret/csi-vsphere-config
apiVersion: v1
data:
csi-vsphere.conf: W0dsb2JhbF0KY2x1c3Rlci1pZCA9IZhdWx0L2VpZ2h0aCIKaW5z1cmUtZmxhZyA9IGZhbHNlCnRodW1icHJpbnQgPSAiODA6Mjk6N0I6NkU6OUY6MjU6Nzk6QzM6MM6NEY6NDg6Ng6Nzk6N0Q6NEY6NEE6RjI6REE6REQ6MzUiCgpbVmlydHVhbENlbnRlciAidmNlbnRlci...
kind: Secret
metadata:
...
If you decode the data of csi-vsphere.conf, you will get:
# csi-vsphere.conf
[Global]
cluster-id = "default/eighth"
...
Add insecure-flag = true to the [Global] section:
# csi-vsphere.conf
[Global]
cluster-id = "default/eighth"
insecure-flag = true
...
Update secret/csi-vsphere-config and restart the vsphere-csi-controller pods.
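One way to do that edit end-to-end, assuming the secret lives in the kube-system namespace of the workload cluster (a sketch, not verified against every CAPV version):

```shell
# Decode the current config out of the secret
kubectl -n kube-system get secret csi-vsphere-config \
  -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d > csi-vsphere.conf

# Insert insecure-flag = true right after the [Global] header
sed -i 's/^\[Global\]/[Global]\ninsecure-flag = true/' csi-vsphere.conf

# Re-create the secret with the patched config and apply it in place
kubectl -n kube-system create secret generic csi-vsphere-config \
  --from-file=csi-vsphere.conf --dry-run=client -o yaml | kubectl apply -f -
```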