Installation
If you run into problems during installation, for example cloudcore or edgecore fails to start, both start but the edge nodes stay in NotReady, or pods cannot be deployed to edge nodes, this document can help you debug and fix them.
Common knowledge
How to check logs?
The first step in investigating an installation failure, or anything that does not work as expected, is to find and read the relevant logs.
How to check cloudcore logs?
cloudcore in container mode
If you deploy cloudcore in container mode, in other words, cloudcore runs inside a k8s cluster and is managed by k8s directly, you can read its logs with the kubectl logs command. First run kubectl get pod -n kubeedge to get the cloudcore pod NAME, then run kubectl logs cloudcore-f88bbf5bb-kcvf4 -n kubeedge (replace cloudcore-f88bbf5bb-kcvf4 with the actual cloudcore pod name):
# kubectl get pod -n kubeedge
NAME READY STATUS RESTARTS AGE
cloudcore-f88bbf5bb-kcvf4 1/1 Running 0 50m
# kubectl logs cloudcore-f88bbf5bb-kcvf4 -n kubeedge
W1118 02:16:02.810219 1 validation.go:154] TLSTunnelPrivateKeyFile does not exist in /etc/kubeedge/certs/server.key, will load from secret
W1118 02:16:02.810256 1 validation.go:157] TLSTunnelCertFile does not exist in /etc/kubeedge/certs/server.crt, will load from secret
W1118 02:16:02.810263 1 validation.go:160] TLSTunnelCAFile does not exist in /etc/kubeedge/ca/rootCA.crt, will load from secret
I1118 02:16:02.810280 1 server.go:92] Version: v1.12.0
...
...
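When the log is long, it can help to look only at the warning, error, and fatal lines. The sketch below assumes the klog line format visible in the output above, where every line starts with a severity letter (I, W, E, or F) followed by a four-digit date. The function name klog_problems is hypothetical and not part of KubeEdge.

```shell
# klog_problems: read klog-formatted lines on stdin and keep only
# warning (W), error (E) and fatal (F) lines. Illustrative helper name.
klog_problems() {
  grep -E '^[WEF][0-9]{4} ' || true   # an empty result is not treated as a failure
}
```

For example, pipe the pod log through it: kubectl logs cloudcore-f88bbf5bb-kcvf4 -n kubeedge | klog_problems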
cloudcore in binary mode
If you deploy cloudcore in binary mode, in other words, cloudcore runs as a service or process directly on the machine, first check whether cloudcore is managed by systemd. If so, run journalctl -u cloudcore.service -xe to get the cloudcore logs. If not, the logs are written to the file /var/log/kubeedge/cloudcore.log, which you can read with tail or cat.
# journalctl -u cloudcore.service -xe
# tail -f /var/log/kubeedge/cloudcore.log
# cat /var/log/kubeedge/cloudcore.log
How to check edgecore logs?
edgecore is always installed in binary mode, which means it cannot be deployed in a container; it can only run as a process on the machine. First check whether edgecore is managed by systemd. If so, run journalctl -u edgecore.service -xe. Otherwise, the edgecore logs are written to the file /etc/kubeedge/kubeedge/edge/edgecore.log, which you can read with tail or cat.
# journalctl -u edgecore.service -xe
# tail -f /etc/kubeedge/kubeedge/edge/edgecore.log
# cat /etc/kubeedge/kubeedge/edge/edgecore.log
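The choice between journald and the plain log file can be wrapped in a small helper. This is a minimal sketch: the function name component_logs and its argument order are assumptions made for illustration, and it treats the component as systemd-managed whenever a unit file with that name exists.

```shell
# component_logs NAME FILE: print the last 100 log lines of cloudcore or edgecore.
# If systemd knows a unit called NAME.service, read the journal;
# otherwise fall back to the plain log file FILE.
component_logs() {
  name=$1
  file=$2
  if command -v systemctl >/dev/null 2>&1 && systemctl cat "$name.service" >/dev/null 2>&1; then
    journalctl -u "$name.service" -n 100 --no-pager
  else
    tail -n 100 "$file"
  fi
}
```

With the paths given in this document, the calls would be component_logs cloudcore /var/log/kubeedge/cloudcore.log on the cloud side and component_logs edgecore /etc/kubeedge/kubeedge/edge/edgecore.log on the edge node.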
How to update configuration?
How to modify cloudcore configuration?
cloudcore in container mode
If you deploy cloudcore in container mode, the cloudcore configuration file is stored in the cloudcore configmap in the kubeedge namespace and is automatically mounted into the cloudcore pod. You can view it with kubectl get configmap cloudcore -nkubeedge -oyaml. The cloudcore configuration is stored in the data field.
# kubectl get configmap cloudcore -nkubeedge -oyaml
apiVersion: v1
data:
cloudcore.yaml: "apiVersion: cloudcore.config.kubeedge.io/v1alpha2\nkind: CloudCore\nkubeAPIConfig:\n
\ kubeConfig: \"\"\n master: \"\"\nmodules:\n cloudHub:\n advertiseAddress:\n
...
...
To update the cloudcore configuration, run kubectl edit configmap cloudcore -n kubeedge. After you modify the configmap data, the configuration file mounted in the pod at /etc/kubeedge/config/cloudcore.yaml is updated accordingly. You can confirm this by opening a shell in the cloudcore pod with the kubectl exec command.
# kubectl get pod -nkubeedge
NAME READY STATUS RESTARTS AGE
cloudcore-f88bbf5bb-kcvf4 1/1 Running 0 3m29s
# kubectl exec -it cloudcore-f88bbf5bb-kcvf4 -nkubeedge -- sh
/ # cat /etc/kubeedge/config/cloudcore.yaml
apiVersion: cloudcore.config.kubeedge.io/v1alpha2
kind: CloudCore
kubeAPIConfig:
kubeConfig: ""
master: ""
modules:
cloudHub:
advertiseAddress:
- 127.0.0.1
...
...
For the changed configuration to take effect, the cloudcore pod must be restarted. Run the kubectl delete pod command to stop the original cloudcore pod; k8s then creates a new cloudcore pod with the updated configuration.
# kubectl get pod -n kubeedge
NAME READY STATUS RESTARTS AGE
cloudcore-f88bbf5bb-kcvf4 1/1 Running 0 82m
# kubectl delete pod cloudcore-f88bbf5bb-kcvf4 -nkubeedge
pod "cloudcore-f88bbf5bb-kcvf4" deleted
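Deleting the pod returns before the replacement is ready. A minimal sketch that deletes the pod and then waits for the deployment to become available again is shown here; the function name restart_cloudcore_pod is hypothetical, and it assumes the deployment is named cloudcore, as it is when installed with keadm init.

```shell
# restart_cloudcore_pod POD [NAMESPACE]: delete the given cloudcore pod and
# block until the cloudcore deployment reports its replacement as ready,
# or give up after 120 seconds.
restart_cloudcore_pod() {
  pod=$1
  ns=${2:-kubeedge}
  kubectl -n "$ns" delete pod "$pod" &&
    kubectl -n "$ns" rollout status deployment/cloudcore --timeout=120s
}
```

For example: restart_cloudcore_pod cloudcore-f88bbf5bb-kcvf4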
cloudcore in binary mode
If you deploy cloudcore in binary mode, the default configuration file is /etc/kubeedge/config/cloudcore.yaml. You can modify it with vim or vi. If your operating system supports systemd and cloudcore is managed by systemd, run systemctl restart cloudcore to restart cloudcore with the updated configuration. If not, use the pkill command to stop the original cloudcore process, and start it again manually.
How to modify edgecore configuration?
edgecore is always deployed in binary mode, which means it cannot be deployed in a pod. The default configuration file is /etc/kubeedge/config/edgecore.yaml. You can modify it with vim or vi. If your operating system supports systemd and edgecore is managed by systemd, run systemctl restart edgecore to restart edgecore with the updated configuration. If not, use the pkill command to stop the original edgecore process, and start it again manually.
Frequently Asked Questions
This section describes common problems that may occur during installation, together with the recommended solutions.
cloudcore pre-flight check failed
The cloudcore error log contains the following message:
error execution phase preflight: [preflight] Some fatal errors occurred
This is usually caused by files left over from a previous failed installation attempt. Run the keadm reset command to clean up those files.
For other installation errors, follow the steps shown on the screen in the error prompt.
timed out waiting for the condition
When you install cloudcore in container mode with the keadm init command, you may encounter the following problem:
# keadm init
Kubernetes version verification passed, KubeEdge installation will start...
Error: timed out waiting for the condition
Usage:
keadm init [flags]
Examples:
keadm init
- This command will render and install the Charts for Kubeedge cloud component
...
...
execute keadm command failed: timed out waiting for the condition
Because installing the cloudcore components in container mode is the same as deploying any other application in a k8s cluster, all the standard k8s methods for debugging an application apply. The keadm init command deploys all cloudcore related resources in the kubeedge namespace, so you can get an overview of them with the command kubectl get all -n kubeedge
# kubectl get all -nkubeedge
NAME READY STATUS RESTARTS AGE
pod/cloudcore-644d8f55df-sj7xc 0/1 ImagePullBackOff 0 5m58s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/cloudcore ClusterIP 10.96.179.211 <none> 10000/TCP,10001/TCP,10002/TCP,10003/TCP,10004/TCP 5m58s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/cloudcore 0/1 1 0 5m58s
NAME DESIRED CURRENT READY AGE
replicaset.apps/cloudcore-644d8f55df 1 1 0 5m58s
If all the resources have been created, check whether the cloudcore pod is in Running status.
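A quick way to spot pods that need attention is to filter the kubectl get pod output on its STATUS column. The helper below is a minimal sketch; the name not_running is hypothetical, and it assumes the default column layout (NAME, READY, STATUS, ...) with a header line.

```shell
# not_running: read `kubectl get pod` output on stdin, skip the header line,
# and print the name and status of every pod whose STATUS is not "Running".
not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}
```

For example: kubectl get pod -n kubeedge | not_running. An empty result means every listed pod is Running; otherwise continue with the matching subsection below (Pending, ImagePullBackOff, or CrashLoopBackOff).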
cloudcore in pending status
If the cloudcore pod is in Pending status, the cause is often a pod scheduling failure. Run the kubectl describe command to find the reason. For example, the output may look like this:
# kubectl get pod -n kubeedge
NAME READY STATUS RESTARTS AGE
cloudcore-f88bbf5bb-78hzb 0/1 Pending 0 111s
# kubectl describe pod cloudcore-f88bbf5bb-78hzb -nkubeedge
Name: cloudcore-f88bbf5bb-78hzb
Namespace: kubeedge
...
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 55s (x3 over 2m13s) default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.
The reason is: 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.
This often occurs when you have only one control plane node. By default, your cluster will not schedule Pods on the control plane nodes for security reasons. If you want to be able to schedule Pods on the control plane nodes, for example in a single machine Kubernetes cluster, run:
kubectl taint nodes --all node-role.kubernetes.io/control-plane- node-role.kubernetes.io/master-
For more details, please refer to the k8s docs. Afterwards, the cloudcore pod should be scheduled and reach Running status.
# kubectl get pod -n kubeedge
NAME READY STATUS RESTARTS AGE
cloudcore-f88bbf5bb-78hzb 1/1 Running 0 9m1s
For more details about taints and tolerations, refer to the official k8s docs.
cloudcore in ImagePullBackOff status
If the cloudcore pod is in ImagePullBackOff status, the cause is often an image problem. Run the kubectl describe command to find more details. For example, the output may look like this:
# kubectl describe pod cloudcore-644d8f55df-sj7xc -nkubeedge
...
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m39s default-scheduler Successfully assigned kubeedge/cloudcore-644d8f55df-sj7xc to kind-control-plane
Normal Pulling 6m59s (x4 over 8m39s) kubelet Pulling image "kubeedge/cloudcore:v1.12.19"
Warning Failed 6m58s (x4 over 8m37s) kubelet Failed to pull image "kubeedge/cloudcore:v1.12.19": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/kubeedge/cloudcore:v1.12.19": failed to resolve reference "docker.io/kubeedge/cloudcore:v1.12.19": docker.io/kubeedge/cloudcore:v1.12.19: not found
Warning Failed 6m58s (x4 over 8m37s) kubelet Error: ErrImagePull
Warning Failed 6m45s (x6 over 8m37s) kubelet Error: ImagePullBackOff
Normal BackOff 3m36s (x20 over 8m37s) kubelet Back-off pulling image "kubeedge/cloudcore:v1.12.19"
Here we can see why cloudcore didn't start successfully: Failed to pull image "kubeedge/cloudcore:v1.12.19": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/kubeedge/cloudcore:v1.12.19": failed to resolve reference "docker.io/kubeedge/cloudcore:v1.12.19": docker.io/kubeedge/cloudcore:v1.12.19: not found. There is no docker.io/kubeedge/cloudcore:v1.12.19 image. In other words, we specified a kubeedge version that does not exist. Run keadm reset to clear the last installation, or just run kubectl delete ns kubeedge directly, and then run keadm init again with a correct version.
During the cloudcore container installation, keadm pulls the required images from dockerhub by default. keadm also supports pulling the required images in advance: run the keadm config images commands to display or pull all the required images.
# keadm config images pull --kubeedge-version=v1.12.0 --part=cloud
Pulling kubeedge/iptables-manager:v1.12.0 ...
Successfully pulled kubeedge/iptables-manager:v1.12.0
Pulling kubeedge/controller-manager:v1.12.0 ...
Successfully pulled kubeedge/controller-manager:v1.12.0
Pulling kubeedge/admission:v1.12.0 ...
Successfully pulled kubeedge/admission:v1.12.0
Pulling kubeedge/cloudcore:v1.12.0 ...
Successfully pulled kubeedge/cloudcore:v1.12.0
cloudcore in CrashLoopBackOff status
If the cloudcore pod is in CrashLoopBackOff status, the cloudcore container failed to start. Check the cloudcore logs with the kubectl logs command and fix the problem according to the error in the log.
EdgeCore failed to get CA certificate
During installation, edgecore communicates with cloudcore to get the CA certificate. You may encounter errors like the following:
failed to get CA certificate, err: Get "https://192.168.47.128:10002/ca.crt": dial tcp 192.168.47.128:10002: connect: connection refused
failed to get CA certificate, err: Get "https://192.168.47.128:10002/ca.crt": EOF
Error: failed to get CA certificate, err: Get "https://192.168.47.128:10002/ca.crt": dial tcp 192.168.47.128:10002: connect: no route to host
failed to get CA certificate, err: Get "https://192.168.47.128:10002/ca.crt": dial tcp 192.168.47.128:10002: i/o timeout
Troubleshooting steps:
- Is the cloudcore IP address that the edge node is configured to join in the cloudcore advertise-address list?
- Can edgecore connect to the cloudcore IP address? Are the two nodes connected over the physical network? Are there firewall restrictions?
- Did the cloudcore components on the cloud side start successfully, and are the required ports (e.g. 10000 and 10002) listening?
If all of the above checks pass, run the following command to check whether the cloudcore logs contain any errors:
kubectl logs cloudcore-xxxx -n kubeedge
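The connectivity question in the steps above can be checked from the edge node without extra tools. The following is a minimal sketch that relies on bash's /dev/tcp feature and the coreutils timeout command; the function names port_open and check_cloudcore_ports are hypothetical, and the ports are the two examples named above.

```shell
# port_open HOST PORT: succeed if a TCP connection to HOST:PORT can be
# opened within 3 seconds (uses bash's built-in /dev/tcp redirection).
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# check_cloudcore_ports HOST: report whether ports 10000 and 10002 on the
# cloudcore host accept connections from this machine.
check_cloudcore_ports() {
  for port in 10000 10002; do
    if port_open "$1" "$port"; then
      echo "$1:$port reachable"
    else
      echo "$1:$port NOT reachable"
    fi
  done
}
```

For example, check_cloudcore_ports 192.168.47.128 on the edge node. A port that is not reachable points at the network, a firewall, or a cloudcore that is not listening; a reachable port with a failing join points back at the cloudcore logs.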
keadm join failed to pull images on edge nodes
You may encounter the following error when running the keadm join command:
edge node join failed: pull Images failed: xxx
Troubleshooting steps:
- Can the edge node access the Internet? By default, keadm join pulls the kubeedge/installation-package image from dockerhub. If the node has Internet access, fix the problem according to the specific error information.
- If edge nodes cannot access the Internet, there are two ways to avoid this error:
  - Download the images on another machine that can access the Internet, and load them manually onto the edge nodes.
  - Download the images on another machine that can access the Internet, push them to an internal image repository, and then run the keadm join command with the flag --image-repository=xxx to specify the internal image repository.
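The first offline option (download elsewhere, load manually) could look roughly like the sketch below. It assumes docker on the machine with Internet access and containerd on the edge node; the function names and the example image tag are assumptions for illustration, so use the tag that matches your KubeEdge version.

```shell
# save_image IMAGE FILE: run on a machine with Internet access.
# Pulls IMAGE with docker and writes it to the tar archive FILE.
save_image() {
  docker pull "$1" && docker save -o "$2" "$1"
}

# load_image FILE: run on the edge node after copying FILE there.
# Imports the archive into containerd's k8s.io namespace.
load_image() {
  ctr -n k8s.io images import "$1"
}
```

For example, save_image kubeedge/installation-package:v1.12.0 installation-package.tar on the connected machine, copy the archive to the edge node, and then run load_image installation-package.tar there before running keadm join.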
Token related issues
Error: token credentials are in the wrong format
If edgecore doesn't start and the edgecore log contains an error message like the following:
F1121 14:14:39.583329 3644556 certmanager.go:94] Error: token credentials are in the wrong format
This means the edgecore token is not in the correct format. Run keadm gettoken on the cloud side to get the token, and then copy it into the modules.edgeHub.token field.
# cat /etc/kubeedge/config/edgecore.yaml
apiVersion: edgecore.config.kubeedge.io/v1alpha2
kind: EdgeCore
modules:
edgeHub:
...
...
token: "" # --- here is token, which will be used when joining edge nodes to cloud.
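A truncated copy and paste is a common way to end up with a malformed token. The sketch below checks only the overall shape, under the assumption that a token consists of four dot-separated parts (a CA hash followed by a three-part signed token); that layout is an assumption about current cloudcore behavior and is not stated in this document. The function name token_well_formed is hypothetical.

```shell
# token_well_formed TOKEN: succeed if TOKEN has exactly four dot-separated
# parts. This checks the shape only; it cannot tell whether the token is
# the one cloudcore actually issued.
token_well_formed() {
  parts=$(printf '%s\n' "$1" | awk -F. '{ print NF }')
  [ "$parts" -eq 4 ]
}
```

For example, token_well_formed "$(keadm gettoken)" on the cloud side should succeed, while an empty or cut-off value fails.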
Invalid Token
If the edgecore log contains errors like the following
F1121 14:22:42.419103 3646881 certmanager.go:94] Error: failed to get edge certificate from the cloudcore, error: Invalid authorization token
and the cloudcore log contains errors like the following
E1121 06:22:42.418947 1 server.go:104] failed to sign the certificate for edgenode: edge-node, invalid token
This means that an incorrect token was copied into the edgecore configuration. Run keadm gettoken on the cloud side to get the token again, and copy it carefully into the modules.edgeHub.token field.
edgecore failed to pass the certificate verification
Error: failed to get edge certificate from the cloudcore, error: Get "https://192.168.47.128:10002/edge.crt": x509: certificate is valid for 192.168.47.127, not 192.168.47.128
Error: failed to get edge certificate from the cloudcore, error: Get "https://192.168.47.128:10002/edge.crt": x509: cannot validate certificate for 192.168.47.128 because it doesn't contain any IP SANs
cloudcore requires the configuration parameter advertise-address, which must be defined in advance and can contain multiple IP addresses separated by commas. It's recommended to use a load balancer or gateway address to ensure load balancing and high availability of cloudcore. Currently, this address cannot be changed once it's configured. The value of cloudcore-ipport on the edgecore side must be in the cloudcore advertise-address list.
Advertise-address related issues
The most common problem is that the IP address cloudcore exposes is NOT the same as the IP address edgecore uses to connect.
The IP address that cloudcore exposes is configured as follows:
# cat cloudcore.yaml
apiVersion: cloudcore.config.kubeedge.io/v1alpha1
...
...
modules:
cloudHub:
advertiseAddress:
- 192.168.1.251 # ------------- this IP address is what cloudcore expose.
...
...
The IP address that edgecore uses to connect to the cloud is configured as follows:
# cat edgecore.yaml
apiVersion: edgecore.config.kubeedge.io/v1alpha2
kind: EdgeCore
modules:
edgeHub:
enable: true
heartbeat: 15
httpServer: https://192.168.1.251:10002 # ----- this is edgecore used to get ca/certs
messageBurst: 60
messageQPS: 30
projectID: e632aba927ea4ac2b575ec1603d56f10
quic:
enable: false
handshakeTimeout: 30
readDeadline: 15
server: 192.168.1.251:10001 # ----- this is edgecore used to connect cloud
writeDeadline: 15
websocket:
enable: true
handshakeTimeout: 30
readDeadline: 15
server: 192.168.1.251:10000 # ----- this is edgecore used to connect cloud
writeDeadline: 15
...
...
KubeEdge does not currently support updating the cloudcore advertiseAddress, so it's very important to set one or more advertise addresses during installation.
The two IP addresses must be kept the same. Otherwise, edgecore logs will report the following errors
E1118 16:00:52.632311 1947817 ws.go:78] dial websocket error(x509: certificate is valid for 192.168.1.251, not 127.0.0.1), response message:
E1118 16:00:52.632338 1947817 websocket.go:90] Init websocket connection failed x509: certificate is valid for 192.168.1.251, not 127.0.0.1
In short, we must ensure the two IP addresses are the same, regardless of which installation method we choose: keadm init, keadm deprecated init, or running cloudcore manually on the cloud side, and keadm join, keadm deprecated join, or running edgecore manually on the edge side. In other words, we must use the --advertise-address and --cloudcore-ipport flags on the command line.
keadm init --advertise-address=${THE-EXPOSED-IP}
keadm deprecated init --advertise-address=${THE-EXPOSED-IP}
keadm join --cloudcore-ipport=${THE-EXPOSED-IP}:10000
keadm deprecated join --cloudcore-ipport=${THE-EXPOSED-IP}:10000
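On a node that has already been joined, the same rule can be checked against the generated edgecore.yaml. The following is a minimal sketch, not a YAML parser: it assumes the two-space indentation of the generated file, and the function names edgehub_hosts and check_edge_hosts are hypothetical.

```shell
# edgehub_hosts FILE: print the distinct hosts that modules.edgeHub points at
# (httpServer plus the quic and websocket server keys).
edgehub_hosts() {
  awk '
    /^  edgeHub:/       { in_hub = 1; next }   # entering modules.edgeHub
    in_hub && /^  [^ ]/ { in_hub = 0 }         # a sibling module ends the block
    in_hub && /^ +(httpServer|server):/ {
      v = $2                                   # e.g. https://192.168.1.251:10002
      sub(/^https:\/\//, "", v)                # drop the scheme if present
      sub(/:[0-9]+$/, "", v)                   # drop the port
      print v
    }
  ' "$1" | sort -u
}

# check_edge_hosts FILE ADDR...: fail if any edgeHub host is not one of the
# advertise addresses passed as the remaining arguments.
check_edge_hosts() {
  file=$1
  shift
  status=0
  for host in $(edgehub_hosts "$file"); do
    case " $* " in
      *" $host "*) echo "ok: $host" ;;
      *) echo "MISMATCH: $host is not an advertise address"; status=1 ;;
    esac
  done
  return "$status"
}
```

For example, check_edge_hosts /etc/kubeedge/config/edgecore.yaml 192.168.1.251 passes for the configuration shown above and reports a mismatch if edgecore points at 127.0.0.1 instead.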
By default, if we just run keadm init --advertise-address=${THE-EXPOSED-IP} --profile version=v1.12.0 to install cloudcore, keadm creates a cloudcore service with ClusterIP as its ServiceType, and a cloudcore pod that uses hostNetwork, which means that the pod runs in the host network of the node where it is deployed. So ${THE-EXPOSED-IP} should be replaced with the IP address of the k8s node where cloudcore is deployed. You can run kubectl get node -owide to get the k8s node IP addresses. The same IP address should also be used to join edge nodes.
# kubectl get pod -nkubeedge -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cloudcore-f88bbf5bb-blgzv 1/1 Running 0 8m42s 172.18.0.2 kind-control-plane <none> <none>
# kubectl get node -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
kind-control-plane Ready control-plane,master 20m v1.21.1 172.18.0.2 <none> Ubuntu 21.04 4.15.0-169-generic containerd://1.5.2
The above output shows that the IP address of the pod is the same as that of the k8s node: both are 172.18.0.2. This also indicates that the pod is running in the host network of the node. So we should use 172.18.0.2 as ${THE-EXPOSED-IP}, and no other address. Otherwise, edgecore will report errors like the following:
F1121 15:21:15.154526 3671032 certmanager.go:94] Error: failed to get CA certificate, err: Get "https://10.96.179.211:10002/ca.crt": dial tcp 10.96.179.211:10002: i/o timeout
Another important point about cloudcore container mode is how to expose the cloudcore ports to edge nodes. In container mode, a cloudcore service is also created. It's your responsibility to choose a LoadBalancer or change the ServiceType to NodePort to expose the cloudcore service to edge nodes. For more details, please refer to the k8s service docs.
Runtime related issues
unknown service runtime.v1alpha2.ImageService
If you encounter an error like the following when running keadm join:
execute keadm command failed: edge node join failed: pull Images failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService
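This means that the cri plugin is not enabled in your containerd. Check whether the disabled_plugins field in the containerd configuration file /etc/containerd/config.toml contains cri. If it does, edit the configuration file to remove cri from disabled_plugins, and then restart containerd.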
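The check described above can be done with a single grep. This is a minimal sketch that assumes disabled_plugins is written on one line, as in the default containerd configuration; the function name cri_disabled is hypothetical.

```shell
# cri_disabled [FILE]: succeed if the containerd config lists "cri" in
# disabled_plugins. FILE defaults to /etc/containerd/config.toml.
cri_disabled() {
  grep -Eq '^[[:space:]]*disabled_plugins[[:space:]]*=.*"cri"' "${1:-/etc/containerd/config.toml}"
}
```

For example: if cri_disabled; then echo "remove cri from disabled_plugins and restart containerd"; fi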
failed to reserve sandbox name
If you encounter an error like the following when running keadm join:
execute keadm command failed: edge node join failed: copy resources failed: rpc error: code = Unknown desc = failed to reserve sandbox name "edgecore_kubeedge__0": name "edgecore_kubeedge__0" is reserved for ...
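This means that a containerd container or task with the same name is left over on your machine. You can clean it up with the following steps:
- Run ctr -n k8s.io t ls. If there is a leftover task, run ctr -n k8s.io t kill {task id} to clean it up.
- Run ctr -n k8s.io c ls. If there is a leftover container, run ctr -n k8s.io c rm {container id} to clean it up.
- Run systemctl restart containerd.service to restart containerd.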
cni plugin not initialized/cni config uninitialized
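Currently, when using a CRI runtime, you need to install the CNI plugins and create a CNI config before running keadm join. You can follow the steps below.
- Download cni-plugins-{OS}-{ARCH}-{VERSION}.tgz from https://github.com/containernetworking/plugins/releases and extract it to /opt/cni/bin: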
$ mkdir -p /opt/cni/bin
$ tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v1.1.1.tgz
- Create the CNI config:
$ mkdir -p /etc/cni/net.d/
$ cat >/etc/cni/net.d/10-containerd-net.conflist <<EOF
{
"cniVersion": "1.0.0",
"name": "containerd-net",
"plugins": [
{
"type": "bridge",
"bridge": "cni0",
"isGateway": true,
"ipMasq": true,
"promiscMode": true,
"ipam": {
"type": "host-local",
"ranges": [
[{
"subnet": "10.88.0.0/16"
}],
[{
"subnet": "2001:db8:4860::/64"
}]
],
"routes": [
{ "dst": "0.0.0.0/0" },
{ "dst": "::/0" }
]
}
},
{
"type": "portmap",
"capabilities": {"portMappings": true}
}
]
}
EOF
- Restart containerd
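Before running keadm join again, it may be worth confirming that the plugins and the config file ended up where containerd looks for them. The following is a minimal sketch; the function name cni_ready is hypothetical, and the three plugin names are the ones referenced by the config above (bridge, host-local, portmap).

```shell
# cni_ready [BIN_DIR] [CONF_DIR]: succeed if the plugins used by the config
# above are executable in BIN_DIR and CONF_DIR holds at least one CNI config.
cni_ready() {
  bin_dir=${1:-/opt/cni/bin}
  conf_dir=${2:-/etc/cni/net.d}
  for plugin in bridge host-local portmap; do
    if [ ! -x "$bin_dir/$plugin" ]; then
      echo "missing CNI plugin: $bin_dir/$plugin"
      return 1
    fi
  done
  if ! find "$conf_dir" -maxdepth 1 -type f \( -name '*.conflist' -o -name '*.conf' \) 2>/dev/null | grep -q .; then
    echo "no CNI config found in $conf_dir"
    return 1
  fi
  echo "CNI plugins and config found"
}
```

Called without arguments, cni_ready checks the default locations /opt/cni/bin and /etc/cni/net.d used in the steps above.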
cgroup driver mismatch
If you see the following error when installing EdgeCore:
failed to create shim task: OCI runtime create failed: runc create failed: expected cgroupsPath to be of format "slice:prefix:name" for systemd cgroups
or the following appears in the EdgeCore log:
kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
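This means that the cgroup driver of your installed runtime does not match the KubeEdge configuration. By default, KubeEdge is configured with the cgroupfs cgroup driver. You can either change the runtime configuration, set --cgroupdriver=systemd when running keadm join, or modify the following field in the EdgeCore configuration file (edgecore.yaml):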
modules:
edged:
tailoredKubeletConfig:
cgroupDriver: systemd
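When you need the systemd cgroup driver and install certain EdgeCore versions (v1.12.0-1.12.4, v1.13.0-1.13.2, v1.14.0-1.14.2) with keadm join, the OCI runtime create failed error may also occur. In that case, it is recommended to use the latest patch version of the corresponding release.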
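To see which driver the runtime is using before changing either side, you can inspect the runtime configuration. The sketch below assumes containerd with the runc runtime, where the systemd driver is enabled by SystemdCgroup = true in /etc/containerd/config.toml; other runtimes store this setting elsewhere. The function name containerd_cgroup_driver is hypothetical.

```shell
# containerd_cgroup_driver [FILE]: print "systemd" if the containerd config
# enables SystemdCgroup for runc, otherwise print "cgroupfs".
containerd_cgroup_driver() {
  if grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "${1:-/etc/containerd/config.toml}" 2>/dev/null; then
    echo systemd
  else
    echo cgroupfs
  fi
}
```

If this prints systemd, the cgroupDriver field in edgecore.yaml should be systemd as well; if it prints cgroupfs, the KubeEdge default already matches.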