[Cluster API] Deploy a cluster on vSphere

Recently, I tried using Cluster API to deploy a cluster on vSphere. Along the way I ran into quite a few problems, so in this article I'm going to walk through how to use this tool and how to deal with the problems you might meet.

Environment requirements

Make sure these tools are all installed:

  • kubectl
  • go
  • docker
  • clusterctl
  • kind
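
A quick way to confirm everything is on the PATH is a small shell check, for example:

# Report any tool that is missing from the PATH
$ for t in kubectl go docker clusterctl kind; do command -v "$t" >/dev/null || echo "missing: $t"; done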

Deployment Steps

1. Configure clusterctl

Create a file named clusterctl.yaml in the .cluster-api folder, which is normally located at $HOME.

For example:

## -- Controller settings -- ##
VSPHERE_USERNAME: "..." # The username used to access the remote vSphere endpoint
VSPHERE_PASSWORD: "..." # The password used to access the remote vSphere endpoint

## -- Required workload cluster default settings -- ##
VSPHERE_SERVER: <vcenter_server_ip> # The vCenter server IP or FQDN
VSPHERE_DATACENTER: <vcenter_data_center> # The vSphere datacenter to deploy the management cluster on
VSPHERE_DATASTORE: <datastore_name> # The vSphere datastore to deploy the management cluster on
VSPHERE_NETWORK: <network_name> # The VM network to deploy the management cluster on
VSPHERE_RESOURCE_POOL: <Path_to_resource_pool> # The vSphere resource pool for your VMs (about how to get the correct path of VSPHERE_RESOURCE_POOL, please check the Troubleshooting #1 section)
VSPHERE_FOLDER: "" # The VM folder for your VMs. Set to "" to use the root vSphere folder
VSPHERE_TEMPLATE: <template_name> # The VM template to use for your management cluster.
CONTROL_PLANE_ENDPOINT_IP: <control_plane_ip> # The IP that kube-vip is going to use as a control plane endpoint
VSPHERE_TLS_THUMBPRINT: "..." # SHA-1 thumbprint of the vCenter certificate; can be obtained with: openssl x509 -sha1 -fingerprint -in ca.crt -noout
EXP_CLUSTER_RESOURCE_SET: "true" # This enables the ClusterResourceSet feature that we are using for deploying CSI
VSPHERE_SSH_AUTHORIZED_KEY: "ssh-rsa ..." # The public SSH authorized key on all machines. Set it to "" if you don't want to use SSH to access the nodes
VSPHERE_STORAGE_POLICY: "" # This is the vSphere storage policy. Set it to "" if you don't want to use a storage policy.

Remember to set these before initializing the management cluster; otherwise, we might get an error about the ClusterResourceSet feature. (If this error occurs, we need to edit the deployment of capi-controller-manager with kubectl, as shown below.)

# deploy/capi-controller-manager
spec:
  containers:
  - args:
    - --leader-elect
    - --metrics-bind-addr=localhost:8080
    - --feature-gates=MachinePool=false,ClusterResourceSet=true,ClusterTopology=false

Set ClusterResourceSet to true in the --feature-gates argument, as shown above.
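
If you prefer a non-interactive edit, a kubectl patch along these lines should also work (a sketch: it assumes the deployment lives in the default capi-system namespace and that --feature-gates is the third argument, as in the spec above):

# Replace the --feature-gates argument in place (argument index 2 assumed)
$ kubectl patch deploy capi-controller-manager -n capi-system --type=json \
    -p '[{"op":"replace","path":"/spec/template/spec/containers/0/args/2","value":"--feature-gates=MachinePool=false,ClusterResourceSet=true,ClusterTopology=false"}]'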

2. Use Kind and clusterctl to create a management cluster

(1) Use Kind to create a new Kubernetes cluster (actually a Docker container).

(2) Configure kubeconfig and make sure the current kubectl context points to the Kind cluster.

(3) Use clusterctl to transform the current cluster into a management cluster (remember to specify vsphere as the infrastructure provider).

# Create kind cluster
$ kind create cluster

# Verify the cluster is up; the default context name of this kind cluster is kind-kind
$ kubectl cluster-info --context kind-kind

# Transform current cluster to management cluster
$ clusterctl init --infrastructure vsphere

In this step, clusterctl installs 4 components into the cluster: cluster-api, bootstrap-kubeadm, control-plane-kubeadm and infrastructure-vsphere. Then we can use the kubectl get pods --all-namespaces command to check the status of the management cluster. If some pods are stuck in ImagePullBackOff status, please check the Troubleshooting #3 section.
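
Since the providers (and cert-manager, which clusterctl also installs) run in their own namespaces, one quick way to eyeball all of them at once is a grep over all pods:

# List the pods of all four providers plus cert-manager
$ kubectl get pods -A | grep -E 'capi-|capv-|cert-manager'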

3. Download the OVA file and deploy a template on vSphere

Machines provisioned by CAPV are required to have cloud-init, kubeadm, and a container runtime pre-installed. We can use one of the CAPV machine images generated by SIG Cluster Lifecycle as a VM template; the OVA downloads are linked from the CAPV documentation.
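
If govc is configured (see Troubleshooting #1), importing the OVA and marking it as a template can be scripted roughly as follows; the file name, datacenter, datastore and resource pool here are placeholders:

# Import the downloaded OVA into the target datacenter/datastore
$ govc import.ova -dc=<datacenter> -ds=<datastore> -pool=<resource_pool> ./ubuntu-2004-kube-v1.20.1.ova

# Mark the imported VM as a template so CAPV can clone from it
$ govc vm.markastemplate ubuntu-2004-kube-v1.20.1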

4. Deploy workload cluster

(1) Generate the YAML file of the cluster:

Use the command shown below to generate a YAML file of the target cluster.

clusterctl generate cluster <cluster_name> --kubernetes-version <kubernetes_version> --control-plane-machine-count 1 --worker-machine-count 3 > cluster.yaml
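
For example, to generate the cluster used in this article (named eighth, Kubernetes v1.20.1, one control-plane node and three workers):

$ clusterctl generate cluster eighth --kubernetes-version v1.20.1 --control-plane-machine-count 1 --worker-machine-count 3 > cluster.yaml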

(2) Deploy the target cluster

Deploy the cluster with the kubectl apply -f cluster.yaml command.

After these two steps, we can get info about the cluster with the kubectl get cluster command:

$ kubectl get cluster
NAME     PHASE         AGE   VERSION
eighth   Provisioned   23h

If nothing goes wrong (no situation like a wrong vSphere configuration [Troubleshooting #1] or a missing kube-vip image [Troubleshooting #2]), then all VMs (master node and workers) will be created on vSphere.

With the kubectl get machine command, we can check the status of the VMs that have been created for our cluster.

$ kubectl get machine
NAME                           CLUSTER   NODENAME                       PROVIDERID                                       PHASE     AGE   VERSION
eighth-md-0-7b848c46d8-69m2n   eighth    eighth-md-0-7b848c46d8-69m2n   vsphere://423aff2a-386c-f7e1-ca4a-3d1fcde217c6   Running   23h   v1.20.1
eighth-md-0-7b848c46d8-7m6h5   eighth    eighth-md-0-7b848c46d8-7m6h5   vsphere://423a2464-d3ee-4c75-96ac-3202609cc2ee   Running   23h   v1.20.1
eighth-md-0-7b848c46d8-cbnwf   eighth    eighth-md-0-7b848c46d8-cbnwf   vsphere://423a1791-3d80-1f8f-a071-b7b5900f69ea   Running   23h   v1.20.1
eighth-tf4mg                   eighth    eighth-tf4mg                   vsphere://423af9cd-4b1e-617c-a778-b4e435c4efd0   Running   23h   v1.20.1

And now we can generate a kubeconfig (with the clusterctl get kubeconfig <cluster_name> command) to access this new cluster:

$ clusterctl get kubeconfig eighth > cluster.kubeconfig

$ KUBECONFIG=cluster.kubeconfig kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                      READY   STATUS              RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
kube-system   coredns-74ff55c5b-f9qp7                   0/1     Pending             0          3m57s   <none>           <none>                         <none>           <none>
kube-system   coredns-74ff55c5b-qdm4m                   0/1     Pending             0          3m57s   <none>           <none>                         <none>           <none>
kube-system   etcd-eighth-tf4mg                         1/1     Running             0          3m50s   10.103.226.233   eighth-tf4mg                   <none>           <none>
kube-system   kube-apiserver-eighth-tf4mg               1/1     Running             0          3m50s   10.103.226.233   eighth-tf4mg                   <none>           <none>
kube-system   kube-controller-manager-eighth-tf4mg      1/1     Running             0          3m50s   10.103.226.233   eighth-tf4mg                   <none>           <none>
kube-system   kube-proxy-2njsn                          1/1     Running             0          3m57s   10.103.226.233   eighth-tf4mg                   <none>           <none>
kube-system   kube-proxy-5cw4z                          1/1     Running             0          22s     10.103.226.235   eighth-md-0-7b848c46d8-69m2n   <none>           <none>
kube-system   kube-proxy-5kknb                          1/1     Running             0          7s      10.103.226.236   eighth-md-0-7b848c46d8-7m6h5   <none>           <none>
kube-system   kube-proxy-mmtzw                          1/1     Running             0          78s     10.103.226.234   eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   kube-scheduler-eighth-tf4mg               1/1     Running             0          3m50s   10.103.226.233   eighth-tf4mg                   <none>           <none>
kube-system   kube-vip-eighth-tf4mg                     1/1     Running             0          3m49s   10.103.226.233   eighth-tf4mg                   <none>           <none>
kube-system   vsphere-cloud-controller-manager-2ldmr    0/1     ImagePullBackOff    0          3m58s   10.103.226.233   eighth-tf4mg                   <none>           <none>
kube-system   vsphere-cloud-controller-manager-5fp4w    0/1     ImagePullBackOff    0          23s     10.103.226.235   eighth-md-0-7b848c46d8-69m2n   <none>           <none>
kube-system   vsphere-cloud-controller-manager-6pxj7    0/1     ContainerCreating   0          7s      10.103.226.236   eighth-md-0-7b848c46d8-7m6h5   <none>           <none>
kube-system   vsphere-cloud-controller-manager-nlg8r    0/1     ImagePullBackOff    0          78s     10.103.226.234   eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   vsphere-csi-controller-5456544dd5-htkvn   0/5     Pending             0          3m59s   <none>           <none>                         <none>           <none>
kube-system   vsphere-csi-node-g4wgk                    0/3     ContainerCreating   0          78s     <none>           eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   vsphere-csi-node-lnt7t                    0/3     ContainerCreating   0          3m59s   <none>           eighth-tf4mg                   <none>           <none>
kube-system   vsphere-csi-node-sm2fg                    0/3     ContainerCreating   0          23s     <none>           eighth-md-0-7b848c46d8-69m2n   <none>           <none>
kube-system   vsphere-csi-node-xcs5d                    0/3     ContainerCreating   0          7s      <none>           eighth-md-0-7b848c46d8-7m6h5   <none>           <none>

According to the output, there are several problems that need to be solved.

If we get ImagePullBackOff, the Troubleshooting #3 section might be helpful.

Also, coredns pods will be stuck at Pending because no CNI is installed in this cluster yet, so we need to deploy one manually.

5. Deploy CNI

I use Calico in my cluster. We can deploy it with this command:

# Deploy CNI
$ kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

After CNI is successfully deployed, we can check the status of the cluster:

$ kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE   IP                NODE                           NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-558995777d-jkkk4   1/1     Running   0          24h   192.168.139.67    eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   calico-node-6gfdj                          1/1     Running   0          24h   10.103.226.236    eighth-md-0-7b848c46d8-7m6h5   <none>           <none>
kube-system   calico-node-92nn8                          1/1     Running   0          24h   10.103.226.234    eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   calico-node-jbjvx                          1/1     Running   0          24h   10.103.226.233    eighth-tf4mg                   <none>           <none>
kube-system   calico-node-k42dc                          1/1     Running   0          24h   10.103.226.235    eighth-md-0-7b848c46d8-69m2n   <none>           <none>
kube-system   coredns-74ff55c5b-f9qp7                    1/1     Running   0          24h   192.168.139.68    eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   coredns-74ff55c5b-qdm4m                    1/1     Running   0          24h   192.168.139.66    eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   etcd-eighth-tf4mg                          1/1     Running   0          24h   10.103.226.233    eighth-tf4mg                   <none>           <none>
kube-system   kube-apiserver-eighth-tf4mg                1/1     Running   0          24h   10.103.226.233    eighth-tf4mg                   <none>           <none>
kube-system   kube-controller-manager-eighth-tf4mg       1/1     Running   0          24h   10.103.226.233    eighth-tf4mg                   <none>           <none>
kube-system   kube-proxy-2njsn                           1/1     Running   0          24h   10.103.226.233    eighth-tf4mg                   <none>           <none>
kube-system   kube-proxy-5cw4z                           1/1     Running   0          24h   10.103.226.235    eighth-md-0-7b848c46d8-69m2n   <none>           <none>
kube-system   kube-proxy-5kknb                           1/1     Running   0          24h   10.103.226.236    eighth-md-0-7b848c46d8-7m6h5   <none>           <none>
kube-system   kube-proxy-mmtzw                           1/1     Running   0          24h   10.103.226.234    eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   kube-scheduler-eighth-tf4mg                1/1     Running   0          24h   10.103.226.233    eighth-tf4mg                   <none>           <none>
kube-system   kube-vip-eighth-tf4mg                      1/1     Running   0          24h   10.103.226.233    eighth-tf4mg                   <none>           <none>
kube-system   vsphere-cloud-controller-manager-2ldmr     1/1     Running   0          24h   10.103.226.233    eighth-tf4mg                   <none>           <none>
kube-system   vsphere-cloud-controller-manager-5fp4w     1/1     Running   0          24h   10.103.226.235    eighth-md-0-7b848c46d8-69m2n   <none>           <none>
kube-system   vsphere-cloud-controller-manager-6pxj7     1/1     Running   0          24h   10.103.226.236    eighth-md-0-7b848c46d8-7m6h5   <none>           <none>
kube-system   vsphere-cloud-controller-manager-nlg8r     1/1     Running   0          24h   10.103.226.234    eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   vsphere-csi-controller-5456544dd5-9w49q    5/5     Running   0          22h   192.168.215.194   eighth-md-0-7b848c46d8-69m2n   <none>           <none>
kube-system   vsphere-csi-node-g4wgk                     3/3     Running   0          24h   192.168.139.65    eighth-md-0-7b848c46d8-cbnwf   <none>           <none>
kube-system   vsphere-csi-node-lnt7t                     3/3     Running   6          24h   192.168.193.193   eighth-tf4mg                   <none>           <none>
kube-system   vsphere-csi-node-sm2fg                     3/3     Running   1          24h   192.168.215.193   eighth-md-0-7b848c46d8-69m2n   <none>           <none>
kube-system   vsphere-csi-node-xcs5d                     3/3     Running   1          24h   192.168.250.1     eighth-md-0-7b848c46d8-7m6h5   <none>           <none>

All the pods of this cluster are running, which means a Kubernetes cluster has been successfully deployed on vSphere.

Troubleshooting

#1 No VM is created on vSphere after kubectl apply -f <cluster_yaml_file>

No new VM is created on vSphere after the kubectl apply -f <cluster_yaml_file> command, and the ControlPlane is stuck at WaitingForKubeadmInit:

# Get cluster status
$ clusterctl describe cluster test
NAME                                           READY  SEVERITY  REASON                           SINCE  MESSAGE
/test                                          False  Info      WaitingForKubeadmInit            109m
├─ClusterInfrastructure - VSphereCluster/test  True                                              109m
├─ControlPlane - KubeadmControlPlane/test      False  Info      WaitingForKubeadmInit            20h
│ └─Machine/test-m7n2j                         True                                              21h
└─Workers
  └─MachineDeployment/test-md-0                False  Warning   WaitingForAvailableMachines      21h    Minimum availability requires 3 replicas, current 0 available
    └─3 Machines...                            False  Info      WaitingForControlPlaneAvailable  21h    See test-md-0-f67dc55-dkpqx, test-md-0-f67dc55-g4brk, ...

We can check the logs of capv-controller-manager and capi-controller-manager:

#capv-controller-manager
$ kubectl logs deploy/capv-controller-manager -f
...
E1209 10:33:47.637562       1 controller.go:317] controller/vspherevm "msg"="Reconciler error" "error"="failed to reconcile VM: cannot traverse type VirtualMachine" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="VSphereVM"
I1209 10:33:59.999674       1 reflector.go:535] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:167: Watch close - *v1beta1.VSphereMachine total 2 items received
I1209 10:34:12.990178       1 reflector.go:535] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:167: Watch close - *v1beta1.VSphereDeploymentZone total 0 items received

#capi-controller-manager
$ kubectl logs deploy/capi-controller-manager -f
I1209 10:35:44.955263       1 machine_controller_phases.go:282] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="vsphere-quickstart" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:35:44.955316       1 machine_controller_noderef.go:48] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="vsphere-quickstart" "machine"="vsphere-quickstart-sp4j8" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:03.491520       1 machine_controller_phases.go:220] controller/machine "msg"="Bootstrap provider is not ready, requeuing" "cluster"="vsphere-quickstart" "name"="vsphere-quickstart-md-0-c8d556cb-hfc6g" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:03.498654       1 machine_controller_phases.go:282] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="vsphere-quickstart" "name"="vsphere-quickstart-md-0-c8d556cb-hfc6g" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:03.498760       1 machine_controller_noderef.go:48] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="vsphere-quickstart" "machine"="vsphere-quickstart-md-0-c8d556cb-hfc6g" "name"="vsphere-quickstart-md-0-c8d556cb-hfc6g" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:14.966929       1 machine_controller_phases.go:282] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="vsphere-quickstart" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I1209 10:36:14.967009       1 machine_controller_noderef.go:48] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="vsphere-quickstart" "machine"="vsphere-quickstart-sp4j8" "name"="vsphere-quickstart-sp4j8" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"

capv-controller-manager printed the error message failed to reconcile VM: cannot traverse type VirtualMachine. This might be because the VSPHERE_FOLDER and VSPHERE_RESOURCE_POOL specified in clusterctl.yaml are incorrect.

To get the correct paths of the VM folder and resource pool, we can use the govc CLI (a vSphere command-line tool: https://github.com/vmware/govmomi/tree/master/govc).

(1) VSPHERE_RESOURCE_POOL

For example, if the resource pool’s name is “Test” and we want to get the full path of this resource pool:

$ govc pool.info -dc=<data_center_name> Test

Name:               Test
  Path:             /<Datacenter>/host/<host_name>/.../Resources/.../Test
  CPU Usage:        965MHz (2.5%)
  CPU Shares:       normal
  CPU Reservation:  0MHz (expandable=true)
  CPU Limit:        -1MHz
  Mem Usage:        59816MB (11.7%)
  Mem Shares:       normal
  Mem Reservation:  0MB (expandable=true)
  Mem Limit:        -1MB

The Path shown in the response is the correct value to fill into the VSPHERE_RESOURCE_POOL field in the clusterctl.yaml file.
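
The resulting entry in clusterctl.yaml would then look something like this (path illustrative, mirroring the govc output above):

VSPHERE_RESOURCE_POOL: "/<Datacenter>/host/<host_name>/.../Resources/.../Test"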

(2) VSPHERE_FOLDER

# The govc ls command lists all inventory items in the target vSphere:
$ govc ls

/<datacenter>/vm
/<datacenter>/network
/<datacenter>/host
/<datacenter>/datastore

# List items that are stored in the /<datacenter>/vm inventory
$ govc ls /<datacenter>/vm
/<datacenter>/vm/Cluster_Conntroller_10.103.64.218
/<datacenter>/vm/test-win10-lan1
...

From the output above we can tell that, basically, all the VMs are stored in the /<datacenter>/vm inventory, so we can set VSPHERE_FOLDER to “” and vSphere will directly use /<datacenter>/vm as the folder to deploy VMs.

#2 Stuck at WaitingForKubeadmInit after VM is successfully provisioned

$ kubectl get machine
NAME                      CLUSTER   NODENAME   PROVIDERID                                       PHASE         AGE   VERSION
test-m7n2j                test                 vsphere://423aaa89-02b5-4796-7fdc-0b4619a0d4d6   Provisioned   21h   v1.20.1
test-md-0-f67dc55-dkpqx   test                                                                  Pending       21h   v1.20.1
test-md-0-f67dc55-g4brk   test                                                                  Pending       21h   v1.20.1
test-md-0-f67dc55-x48xx   test                                                                  Pending       21h   v1.20.1

The output above shows that the master node has been successfully provisioned (the VM was created on vSphere).

We can access the target VM over SSH with the capv account and the SSH key we specified in the clusterctl.yaml file.

# Access the target VM using ssh (make sure you've set VSPHERE_SSH_AUTHORIZED_KEY in clusterctl.yaml)
$ ssh capv@<vm_ip>

We can check the output of cloud-init first:

# check the output of cloud-init
$ cat /var/log/cloud-init-output.log | less
...
[2021-12-16 17:42:57] [kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[2021-12-16 17:42:58] [kubeconfig] Writing "admin.conf" kubeconfig file
[2021-12-16 17:42:58] [kubeconfig] Writing "kubelet.conf" kubeconfig file
[2021-12-16 17:42:58] [kubeconfig] Writing "controller-manager.conf" kubeconfig file
[2021-12-16 17:42:58] [kubeconfig] Writing "scheduler.conf" kubeconfig file
[2021-12-16 17:42:58] [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[2021-12-16 17:42:58] [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[2021-12-16 17:42:58] [kubelet-start] Starting the kubelet
[2021-12-16 17:42:58] [control-plane] Using manifest folder "/etc/kubernetes/manifests"
[2021-12-16 17:42:58] [control-plane] Creating static Pod manifest for "kube-apiserver"
[2021-12-16 17:42:58] [control-plane] Creating static Pod manifest for "kube-controller-manager"
[2021-12-16 17:42:58] [control-plane] Creating static Pod manifest for "kube-scheduler"
[2021-12-16 17:42:58] [etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[2021-12-16 17:42:58] [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[2021-12-16 10:09:10] [kubelet-check] Initial timeout of 40s passed.
[2021-12-16 10:12:39]
[2021-12-16 10:12:39]   Unfortunately, an error has occurred:
[2021-12-16 10:12:39]           timed out waiting for the condition
[2021-12-16 10:12:39]
[2021-12-16 10:12:39]   This error is likely caused by:
[2021-12-16 10:12:39]           - The kubelet is not running
[2021-12-16 10:12:39]           - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
[2021-12-16 10:12:39]
[2021-12-16 10:12:39]   If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
[2021-12-16 10:12:39]           - 'systemctl status kubelet'
[2021-12-16 10:12:39]           - 'journalctl -xeu kubelet'
[2021-12-16 10:12:39]
[2021-12-16 10:12:39]   Additionally, a control plane component may have crashed or exited when started by the container runtime.
[2021-12-16 10:12:39]   To troubleshoot, list all containers using your preferred container runtimes CLI.
[2021-12-16 10:12:39]
[2021-12-16 10:12:39]   Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
[2021-12-16 10:12:39]           - 'crictl --runtime-endpoint /var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
[2021-12-16 10:12:39]           Once you have found the failing container, you can inspect its logs with:
[2021-12-16 10:12:39]           - 'crictl --runtime-endpoint /var/run/containerd/containerd.sock logs CONTAINERID'
[2021-12-16 10:12:39]
[2021-12-16 10:12:39] error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
...

It seems an error occurred while waiting for the kubelet to boot up the control plane. Next, we can check the status and logs of the kubelet:

# Check status of kubelet
$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Thu 2021-12-16 17:42:58 UTC; 10h ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 1127 (kubelet)
    Tasks: 15 (limit: 4915)
   CGroup: /system.slice/kubelet.service
           └─1127 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cloud-provider=external --container-runtime=remote --container-ru

# If kubelet is running without any error
# use `sudo journalctl -u kubelet` or `sudo journalctl -u kubelet --since <specific_time>`
$ sudo journalctl -u kubelet --since "1 day ago" | less
...
Dec 16 10:08:38 fifth-tgq4j kubelet[1127]: E1216 10:08:38.885118    1127 remote_image.go:113] PullImage "ghcr.io/kube-vip/kube-vip:v0.3.5" from image service failed: rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to resolve reference "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to do request: Head "https://ghcr.io/v2/kube-vip/kube-vip/manifests/v0.3.5": dial tcp 20.205.243.164:443: connect: connection refused
Dec 16 10:08:38 fifth-tgq4j kubelet[1127]: E1216 10:08:38.885218    1127 kuberuntime_image.go:51] Pull image "ghcr.io/kube-vip/kube-vip:v0.3.5" failed: rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to resolve reference "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to do request: Head "https://ghcr.io/v2/kube-vip/kube-vip/manifests/v0.3.5": dial tcp 20.205.243.164:443: connect: connection refused
Dec 16 10:08:38 fifth-tgq4j kubelet[1127]: E1216 10:08:38.885452    1127 kuberuntime_manager.go:829] container &Container{Name:kube-vip,Image:ghcr.io/kube-vip/kube-vip:v0.3.5,Command:[],Args:[start],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:vip_arp,Value:true,ValueFrom:nil,},EnvVar{Name:vip_leaderelection,Value:true,ValueFrom:nil,},EnvVar{Name:vip_address,Value:10.103.226.219,ValueFrom:nil,},EnvVar{Name:vip_interface,Value:eth0,ValueFrom:nil,},EnvVar{Name:vip_leaseduration,Value:15,ValueFrom:nil,},EnvVar{Name:vip_renewdeadline,Value:10,ValueFrom:nil,},EnvVar{Name:vip_retryperiod,Value:2,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kubeconfig,ReadOnly:false,MountPath:/etc/kubernetes/admin.conf,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_ADMIN SYS_TIME],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod kube-vip-fifth-tgq4j_kube-system(b31b938f7a5929f365eb5caefef24fa5): ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to resolve reference "ghcr.io/kube-vip/kube-vip:v0.3.5": failed to do request: Head "https://ghcr.io/v2/kube-vip/kube-vip/manifests/v0.3.5": dial tcp 20.205.243.164:443: connect: connection refused
Dec 16 10:08:38 fifth-tgq4j kubelet[1127]: E1216 10:08:38.885518    1127 pod_workers.go:191] Error syncing pod b31b938f7a5929f365eb5caefef24fa5 ("kube-vip-fifth-tgq4j_kube-system(b31b938f7a5929f365eb5caefef24fa5)"), skipping: failed to "StartContainer" for "kube-vip" with ErrImagePull: "rpc error: code = Unknown desc = failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip:v0.3.5\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip:v0.3.5\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip/manifests/v0.3.5\": dial tcp 20.205.243.164:443: connect: connection refused"
Dec 16 10:08:39 fifth-tgq4j kubelet[1127]: E1216 10:08:39.268272    1127 pod_workers.go:191] Error syncing pod b31b938f7a5929f365eb5caefef24fa5 ("kube-vip-fifth-tgq4j_kube-system(b31b938f7a5929f365eb5caefef24fa5)"), skipping: failed to "StartContainer" for "kube-vip" with ImagePullBackOff: "Back-off pulling image \"ghcr.io/kube-vip/kube-vip:v0.3.5\""
...

It seems kubeadm failed to pull the kube-vip image during initialization.

# We can also check the Kubernetes manifests folder, which contains the
# configurations of all static pods that the kubelet creates during
# initialization
$ ls /etc/kubernetes/manifests
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml  kube-vip.yaml

# List all Kubernetes containers running in cri-o/containerd
$ sudo crictl --runtime-endpoint /var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause
0154378d48ba2       4aa0b4397bbbc       20 hours ago        Running             kube-scheduler            0                   ea07bd6eb04ff
57bd6d75bad96       75c7f71120808       20 hours ago        Running             kube-apiserver            0                   66da2fd70475a
4008afcd0c408       2893d78e47dc3       20 hours ago        Running             kube-controller-manager   0                   f3b477e5c7ae1
...

If the cause is that kubeadm failed to pull the kube-vip image during initialization, we might need to upload the image manually:

(Assume that the image is successfully downloaded to the local environment)

# Save docker image to file kube-vip (local environment which might have proxy)
$ docker save ghcr.io/kube-vip/kube-vip:v0.3.5 -o kube-vip

# Transfer file to target VM
$ scp kube-vip capv@<ip>:~

# Access to target VM using ssh
$ ssh capv@<vm_ip>

# Ask containerd to load image
$ sudo ctr -n k8s.io image import kube-vip

After the image is successfully uploaded to the target VM, we need to reset and run kubeadm init again with the same config, because kube-proxy and coredns failed to initialize when the earlier kubeadm init was interrupted.
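
A minimal sketch of that reset, assuming the kubeadm config rendered by cloud-init is still at /run/kubeadm/kubeadm.yaml (the location the Cluster API bootstrap data uses on these machine images):

# On the control-plane VM: wipe the failed first attempt
$ sudo kubeadm reset -f

# Re-run init with the same config cloud-init used
$ sudo kubeadm init --config /run/kubeadm/kubeadm.yaml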

#3 Kubernetes pods stuck at ImagePullBackOff status

If an ImagePullBackOff error is found when we check the status of the cluster with the kubectl get pods --all-namespaces command:

$ kubectl get pods --all-namespaces
NAMESPACE                           NAME                                                             READY   STATUS             RESTARTS   AGE
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-58945b95bf-nr8kv       0/1     ImagePullBackOff   0          3h57m
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-58fc8f8c7c-jg9wn   0/1     ImagePullBackOff   0          3h56m
capi-system                         capi-controller-manager-576744d8b7-vlvjt                         0/1     ImagePullBackOff   0          8m7s
capv-system                         capv-controller-manager-6fcb95cd6-pv7m5                          0/1     ImagePullBackOff   0          3h28m
cert-manager                        cert-manager-848f547974-gtvzs                                    1/1     Running            2          3h57m
...

This is because the cluster failed to pull images of these components. We can try to set up a proxy to resolve it.

If the proxy doesn’t work, then we need to find another source for these images.

To solve the ImagePullBackOff problem:

(1) Check the deployments of these pods and find out which images they need.

(2) Search for the images on Docker Hub and download them.

(3) Use the docker tag <origin> <target> command to re-tag these images.

(4) Use the kind load docker-image <image> command to load the images into the management cluster (see the example below).
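
For example, for a single image (registry, image name, and version below are placeholders):

# (2) Pull the image from a registry you can reach
$ docker pull <mirror>/cluster-api-controller:<version>

# (3) Re-tag it to the exact name the deployment references
$ docker tag <mirror>/cluster-api-controller:<version> <original_registry>/cluster-api-controller:<version>

# (4) Load it into the kind node so the kubelet finds it locally
$ kind load docker-image <original_registry>/cluster-api-controller:<version>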

Then all the pods of the management cluster will run correctly.

#4 ControlPlane stuck at PoweringOn @ Machine/<cluster_name>-svpt4

$ clusterctl describe cluster <cluster_name>
NAME                                                     READY  SEVERITY  REASON                                     SINCE  MESSAGE
/<cluster_name>                                          False  Info      PoweringOn @ Machine/<cluster_name>-svpt4  17h    1 of 2 completed
├─ClusterInfrastructure - VSphereCluster/<cluster_name>  True                                                        17h
├─ControlPlane - KubeadmControlPlane/<cluster_name>      False  Info      PoweringOn @ Machine/<cluster_name>-svpt4  17h    1 of 2 completed
│ └─Machine/<cluster_name>-svpt4                         False  Info      PoweringOn                                 17h    1 of 2 completed
└─Workers
  └─MachineDeployment/<cluster_name>-md-0                False  Warning   WaitingForAvailableMachines                17h    Minimum availability requires 3 replicas, current 0 available
    └─3 Machines...                                      False  Info      WaitingForControlPlaneAvailable            17h    See <cluster_name>-md-0-7f754dc455-796h5, <cluster_name>-md-0-7f754dc455-7r5kd, ...

We can log in to vCenter and use the web console to access our new VM (Machine/<cluster_name>-svpt4).

(1) If the VSPHERE_TEMPLATE specified in clusterctl.yaml is a CentOS template (centos-...), then these messages might be shown in the web console:

capv.vm kernel: BAR 13 : failed to assign  [io size 0x1000]

capv.vm kernel: BAR 13 : no space for [io size 0x1000]

capv.vm dracut-initqueue[225]: Warning : dracut-initqueue timeout - starting timeout scripts

capv.vm dracut-initqueue[225]: Warning : Could not boot

Since cloud-init hadn't finished initializing, I couldn't log in to this VM, so I haven't figured out how to deal with this situation; I used Ubuntu instead.

(2) If the VSPHERE_TEMPLATE specified in clusterctl.yaml is an Ubuntu template (ubuntu-...), then these messages might be shown in the web console:

...
failed to start wait for network to be configured
...

This might be because no DHCP service is set up in this vSphere environment; the created VM is expected to get its own IP address from DHCP.

#5 Fingerprint error / certificate error

If a vsphere-csi-controller pod is stuck at CrashLoopBackOff and its log shows:

$ kubectl logs -n kube-system pod/vsphere-csi-controller-5456544dd5-44h77 vsphere-csi-controller
...
{"level":"error","time":"2021-12-20T11:33:00.238213779Z","caller":"vanilla/controller.go:121","msg":"failed to get vcenter. err=Post https://...:443/sdk: x509: certificate is valid for ..., not ....","TraceId":"d39d14cf-6386-4585-831c-9a5020ab14ab",...
{"level":"error","time":"2021-12-20T11:33:00.238251257Z","caller":"service/service.go:135","msg":"failed to init controller. Error: Post https://...m:443/sdk: x509: certificate is valid for ..., not ...","TraceId":"5e5b850d-4350-4115-8e86-6beb34f2ebad",...
{"level":"info","time":"2021-12-20T11:33:00.238357103Z","caller":"service/service.go:110","msg":"configured: \"csi.vsphere.vmware.com\" with clusterFlavor: \"VANILLA\" and mode: \"controller\"","TraceId":"5e5b850d-4350-4115-8e86-6beb34f2ebad"}
time="2021-12-20T11:33:00Z" level=info msg="removed sock file" path=/var/lib/csi/sockets/pluginproxy/csi.sock
time="2021-12-20T11:33:00Z" level=fatal msg="grpc failed" error="Post https://<vsphere_server_ip>:443/sdk: x509: certificate is valid for ..., not ..."
...

or

$ kubectl logs -n kube-system pod/vsphere-csi-controller-5456544dd5-44h77 vsphere-csi-controller
...
{"level":"error","time":"2021-12-21T07:03:22.107974392Z","caller":"service/service.go:135","msg":"failed to init controller. Error: Post <vsphere_server_ip>:443/sdk: x509: certificate signed by unknown authority"...
...
time="2021-12-21T07:03:22Z" level=fatal msg="grpc failed" error="Post https://<vsphere_server_ip>:443/sdk: x509: certificate signed by unknown authority"

Please check whether the VSPHERE_SERVER and VSPHERE_TLS_THUMBPRINT fields in clusterctl.yaml match each other.
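
To see the thumbprint the server actually presents, an openssl pipeline like this one works (govc's about.cert subcommand is an alternative):

# Print the SHA-1 fingerprint of the certificate vCenter serves
$ echo | openssl s_client -connect <vsphere_server_ip>:443 2>/dev/null | openssl x509 -sha1 -fingerprint -noout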

If they match and you still get an unknown authority error like the one shown above, you can add insecure-flag = true to secret/csi-vsphere-config:

# Original secret/csi-vsphere-config
apiVersion: v1
data:
  csi-vsphere.conf: W0dsb2JhbF0KY2x1c3Rlci1pZCA9IZhdWx0L2VpZ2h0aCIKaW5z1cmUtZmxhZyA9IGZhbHNlCnRodW1icHJpbnQgPSAiODA6Mjk6N0I6NkU6OUY6MjU6Nzk6QzM6MM6NEY6NDg6Ng6Nzk6N0Q6NEY6NEE6RjI6REE6REQ6MzUiCgpbVmlydHVhbENlbnRlciAidmNlbnRlci...
kind: Secret
metadata:
...

If you decode the data of csi-vsphere.conf, you will get:

# csi-vsphere.conf
[Global]
    cluster-id = "default/eighth"
...

Add insecure-flag = true to the [Global] section:

# csi-vsphere.conf
[Global]
    cluster-id = "default/eighth"
    insecure-flag = true
...

Update secret/csi-vsphere-config and restart the vsphere-csi-controller pod.
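
A sketch of that last step, assuming the secret and the CSI controller live in kube-system as shown earlier (the app=vsphere-csi-controller label is taken from the default manifests and may differ in your deployment):

# Re-create the secret from the edited config file
$ kubectl -n kube-system create secret generic csi-vsphere-config \
    --from-file=csi-vsphere.conf --dry-run=client -o yaml | kubectl apply -f -

# Delete the controller pod so it restarts with the new config
$ kubectl -n kube-system delete pod -l app=vsphere-csi-controller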