Kubernetes consists of multiple components and can host many applications, so troubleshooting errors or issues in a K8s cluster requires a working knowledge of containers.
In a production environment, you need to manage the containers that run your applications and ensure there is no downtime. Providing consistent support and service means continuously debugging and troubleshooting the whole Kubernetes cluster. In this blog, we are going to learn about Kubernetes troubleshooting.
In this blog, we will discuss:
- What is Kubernetes Troubleshooting?
- Three key points in K8s Troubleshooting
- Troubleshooting Kubernetes Pods
- ImagePullBackOff or ErrImagePull
- CreateContainerConfigError
- CrashLoopBackOff
- OOM kill due to container limit reached
- Troubleshooting Node Not Ready Error
- Troubleshooting Kubernetes Clusters
- Conclusion
What is Kubernetes Troubleshooting?
The process of discovering, diagnosing, and resolving faults in Kubernetes clusters, nodes, pods, or containers is known as troubleshooting. In a broader sense, Kubernetes troubleshooting also encompasses the ongoing management of these faults and the implementation of preventative measures in Kubernetes components.
Three key points in K8s Troubleshooting
Effective troubleshooting in a Kubernetes cluster consists of three components:
1. Understanding:
It might be tough to comprehend what happened and establish the underlying cause of an issue in a Kubernetes system. This usually consists of:
- Examining recent modifications to the afflicted cluster, pod, or node to determine the source of the issue.
- Analyzing YAML configurations, GitHub repositories, and logs from the VMs or bare-metal machines that host the faulty components.
- Examining K8s events and metrics such as disk pressure, memory pressure, and utilization. A mature system should have dashboards that expose the essential metrics for clusters, nodes, pods, and containers over time.
- Comparing similar components to see whether they behave the same way, and examining component dependencies to determine whether they are related to the failure.
To accomplish the aforementioned goals, teams generally employ the following technologies:
- Monitoring.
- Observability.
- Live Debugging.
- Logging.
2. Management:
Each component of a microservices architecture is often created and managed by a distinct team. Because production incidents frequently involve several components, quick issue resolution requires coordination between teams.
Once the problem has been identified, there are three options for resolving it:
Ad hoc solutions: created by teams working on the impacted components using tribal knowledge. Frequently, the engineer who designed the component will have unwritten knowledge of how to troubleshoot and resolve it.
Manual runbooks: a clear, written protocol for dealing with each sort of issue. With a runbook, every member of the team can swiftly address the problem.
Automated runbooks: an automated procedure that is triggered when an issue is detected, implemented as a script, infrastructure as code (IaC) template, or Kubernetes operator. Automating responses to every typical situation can be difficult, but it pays off through reduced downtime and the elimination of human error.
3. Prevention:
Successful teams prioritize prevention above everything else; over time, this decreases the time spent detecting and addressing new issues. Preventing production issues with Kubernetes entails the following steps:
- Developing policies, procedures, and playbooks after each incident to enable effective remediation.
- Investigating whether and how a response to the issue can be automated.
- Determining how to rapidly detect the problem the next time and make the appropriate data available—for example, by instrumenting the key components.
- Assuring that the issue is escalated to the proper teams and that those teams can effectively interact in order to fix it.
Troubleshooting Kubernetes Pods
If you're having problems with a Kubernetes pod and couldn't quickly locate and fix the error, here's how to dig a little deeper. Running kubectl describe pod [name] is the first step in identifying pod issues.
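For example, assuming a pod named my-app-pod in namespace my-namespace (both placeholders), you would run:
$ kubectl describe pod my-app-pod -n my-namespace
Pay particular attention to the Events section at the end of the output; it usually states why the pod is failing or cannot be scheduled.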
In the following sections we are going to troubleshoot some very common pod errors.
ImagePullBackOff or ErrImagePull
This status indicates that a pod could not start because it tried and failed to pull a container image from a registry. The pod will not start because it cannot create one or more of the containers defined in its manifest.
To identify the issue run the below command:
$ kubectl get pods
Check in the output whether the pod's status is ImagePullBackOff or ErrImagePull.
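As a rough illustration (the pod name here is a placeholder), the output would look like this:
NAME         READY   STATUS             RESTARTS   AGE
my-app-pod   0/1     ImagePullBackOff   0          2m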
The cause is usually one of the following:
Incorrect image name or tag: This usually occurs because the image name or tag was misspelled in the pod manifest. Confirm the correct image name using docker pull and update it in the pod manifest.
Authentication failure in the container registry: The pod was unable to authenticate with the registry to pull the image. This might be caused by a problem with the Secret holding the credentials, or because the pod lacks an RBAC role that permits the operation. Check that the pod and node have the necessary permissions and Secrets, then attempt the pull manually using docker pull; a sketch of wiring up registry credentials follows below.
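For example, if the cause is registry authentication, one common fix is to create a docker-registry Secret and reference it from the pod. The registry address, credentials, and names below (regcred, my-app) are placeholders for illustration:
$ kubectl create secret docker-registry regcred \
    --docker-server=my-registry.example.com \
    --docker-username=<username> \
    --docker-password=<password>
Then reference the Secret in the pod manifest:
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: my-app
      image: my-registry.example.com/my-app:1.0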
CreateContainerConfigError
This issue is typically caused by a missing Secret or ConfigMap. Secrets are Kubernetes objects used to hold sensitive data such as database credentials. ConfigMaps store configuration data as key-value pairs and are often shared by multiple pods.
To identify the issue run the below command:
$ kubectl get pods
Check in the output whether the pod's status is CreateContainerConfigError.
To get detailed info about the error, run the below command and look for a message indicating which ConfigMap is missing:
$ kubectl describe pod <name>
Run kubectl get configmap [name] to see whether the ConfigMap is present in the cluster. For example:
$ kubectl get configmap configmap-3
If nothing is returned, the ConfigMap does not exist and must be created; the Kubernetes documentation explains how, and a short example follows below.
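For instance, assuming the missing ConfigMap is named configmap-3 and the keys below are placeholders, it could be created from literal values:
$ kubectl create configmap configmap-3 --from-literal=DB_HOST=db.example.com --from-literal=DB_PORT=5432
or from an existing properties file:
$ kubectl create configmap configmap-3 --from-file=app-config.properties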
Run kubectl get configmap [name] again to confirm that the ConfigMap now exists. If you wish to see the ConfigMap's content in YAML format, add the -o yaml flag.
Once you’ve confirmed that the ConfigMap exists, run kubectl get pods again to ensure that the pod is in the running state.
CrashLoopBackOff
This status means that a container in the pod starts, crashes, and is restarted repeatedly, with Kubernetes waiting a progressively longer back-off delay between restarts. Common triggers include application errors, misconfiguration, and problems with the resources or volumes the container depends on.
To identify the issue run the below command:
$ kubectl get pods
To get detailed info about the error, run the below command:
$ kubectl describe pod <name>
The output will help you determine the root cause of the problem. The following are some of the most prevalent causes; a quick way to inspect the crashed container itself is shown after the list:
Insufficient resources: If a node’s resources are insufficient, you may manually evict pods from the node or scale up your cluster to guarantee more nodes are available for your pods.
Volume mounting: If the issue is with mounting a storage volume, verify that the volume the pod is attempting to mount is correctly defined in the pod manifest and that a storage volume matching that definition is available.
Use of hostPort: If you connect pods to a hostPort, you may be limited to scheduling only one pod per node. Most of the time, you can skip using hostPort and instead use a Service object to communicate with your pod.
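Whatever the cause, it usually helps to look at the crashed container itself: its restart count and the logs of the previous (failed) run. The pod name below is a placeholder:
$ kubectl get pod my-app-pod
$ kubectl logs my-app-pod --previous
The --previous flag shows the logs of the last terminated container instance, which is often where the actual application error appears.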
OOM kill due to container limit reached
This is by far the most basic memory issue that may occur in a pod. You specify a memory limit, and one container attempts to allocate more memory than is permitted, resulting in an error. Typically, this results in a container dying, one pod being unhealthy, and Kubernetes restarting that pod.
Running kubectl describe pod on the affected pod would show something like this:
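This is an illustrative excerpt only (the container name, restart count, and limit are placeholders); the fields to look for are Reason and Exit Code:
Containers:
  my-app:
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Restart Count:  3
    Limits:
      memory:       256Mi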
Exit code 137 is significant because it indicates that the container was terminated by the system because it attempted to utilize more memory than its limit.
Troubleshooting:
If the pod was terminated because a container exceeded its memory limit:
- Determine whether your application truly needs more memory. For example, if the application is a website experiencing increased traffic, it may require more memory than was originally requested. In that case, raise the memory limit for the container in the pod specification (see the sketch after this list).
- If memory usage spikes unexpectedly and does not appear to be connected to application load, the application may be suffering from a memory leak. Debug the application and find the source of the leak. Do not raise the memory limit in this case, since that would only let the application consume more resources on the nodes.
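As a rough sketch of where these values live, memory requests and limits are set per container in the pod specification; the names and values below are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-registry.example.com/my-app:1.0
      resources:
        requests:
          memory: "256Mi"   # what the scheduler reserves when placing the pod
        limits:
          memory: "512Mi"   # exceeding this gets the container OOM killed (exit code 137)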
If the pod was terminated due to a node overcommit:
Overcommitment on a node is possible because pods are scheduled based on their memory request (the minimum value), which can be smaller than their memory limit. As long as the sum of the requests fits within the node's available memory, the pods can be scheduled there even if the sum of their limits exceeds it.
Kubernetes, for example, may execute 10 containers with a memory request value of 1 GB on a node with 10 GB of RAM. However, if these containers have a memory limit of 1.5 GB, some of the pods may utilize more than the minimum capacity, causing the node to run out of memory and force the termination of some of the pods.
You need to determine why Kubernetes decided to terminate the pod with the OOMKilled error and adjust memory requests and limit values to ensure that the node is not overcommitted.
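One way to check whether a node is overcommitted is to compare its allocatable memory with the total requests and limits of the pods running on it; the node name below is a placeholder:
$ kubectl describe node my-node-1
Look at the Allocated resources section of the output: if the memory limits figure is well above 100% of the allocatable memory, the node is overcommitted and at risk of OOM kills under load.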
To completely diagnose and address Kubernetes memory issues, you must monitor your environment, understand the memory behavior of pods and containers relative to their limits, and fine-tune your settings. Without the proper tools, this can be a difficult and time-consuming task.
Learn more about K8s pods.
Troubleshooting Node Not Ready Error
When a worker node shuts down or crashes, all stateful pods on it become unavailable, and the node status changes to NotReady.
If a node has the NotReady state for more than five minutes (the default), Kubernetes changes the status of the pods scheduled on it to Unknown and tries to schedule them on other nodes, where they appear with the status ContainerCreating.
To identify the issue run the below command:
$ kubectl get nodes
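In the output, look for nodes whose STATUS is NotReady. To dig deeper into a particular node (the node name below is a placeholder), describe it and read its Conditions:
$ kubectl describe node my-node-1
The Conditions section lists Ready, MemoryPressure, DiskPressure, and PIDPressure along with a message for each, which often points at the cause (for example, the kubelet stopped posting node status).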
The problem will be resolved if the failing node is able to recover or is restarted by the user. When the failing node recovers and rejoins the cluster, the following happens:
- The pod with the Unknown status is removed, and the volumes on the failing node are disconnected.
- The pod is rescheduled on the new node, its state is changed from Unknown to ContainerCreating, and the necessary volumes are associated.
- Kubernetes employs a five-minute delay (by default), after which the pod will start running on the node and its state will change from ContainerCreating to Running.
If you don’t have time to wait, or if the node fails to recover, you’ll need to assist Kubernetes in rescheduling the stateful pods on another, operational node. There are two ways to accomplish this:
- Using the command kubectl delete node [name], remove the failing node from the cluster.
- Delete stateful pods with status Unknown, using the command kubectl delete pods [pod_name] --grace-period=0 --force -n [namespace]
Troubleshooting Kubernetes Clusters
The first step in diagnosing container difficulties is to gather basic information about the Kubernetes worker nodes and Services that are active in the cluster.
Run kubectl get nodes --show-labels to get a list of worker nodes and their status. The result will look something like this:
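(Illustrative output; the node names, version, and labels are placeholders.)
NAME       STATUS     ROLES           AGE   VERSION   LABELS
master-1   Ready      control-plane   30d   v1.26.1   kubernetes.io/hostname=master-1,node-role.kubernetes.io/control-plane=
worker-1   Ready      <none>          30d   v1.26.1   kubernetes.io/hostname=worker-1
worker-2   NotReady   <none>          30d   v1.26.1   kubernetes.io/hostname=worker-2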
Run the following command to obtain information about the Services that are currently operating on the cluster:
$ kubectl cluster-info
The output will be something like this:
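(Illustrative output; the control plane address below is a placeholder.)
Kubernetes control plane is running at https://10.0.0.1:6443
CoreDNS is running at https://10.0.0.1:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.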
Obtaining Cluster Logs:
To detect deeper issues with your cluster’s nodes, you’ll need access to the nodes’ logs. The table below indicates where to find the logs.
| NODE TYPE | COMPONENT | WHERE TO FIND LOGS |
| --- | --- | --- |
| Master | API Server | /var/log/kube-apiserver.log |
| Master | Scheduler | /var/log/kube-scheduler.log |
| Master | Controller Manager | /var/log/kube-controller-manager.log |
| Worker | Kubelet | /var/log/kubelet.log |
| Worker | Kube Proxy | /var/log/kube-proxy.log |
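Note that the exact locations can vary with how the cluster was installed. On clusters where these components run as systemd services or as static pods, the same logs are often available via journalctl (for example, journalctl -u kubelet on a worker node) or via kubectl logs -n kube-system [component-pod-name].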
Let’s take a look at some frequent cluster failure situations, their consequences, and how they are generally addressed. This is not a comprehensive guide on cluster troubleshooting, but it will assist you in resolving the most prevalent difficulties.
API Server VM Shuts Down or Crashes:
- Impact: You will be unable to launch, terminate, or update pods and services if the API server is unavailable.
- Resolution: Restart the API server virtual machine.
- Prevention: Set the API server VM to restart automatically, and configure high availability for the API server.
Control Plane Service Shuts Down or Crashes:
- Impact: Because services such as the Controller Manager and Scheduler are colocated with the API Server, the impact of any of them shutting down or crashing is the same as the API Server shutting down.
- Resolution: The same as when the API Server VM shuts down.
- Prevention: The same as when the API Server VM shuts off.
API Server Storage Lost:
- Impact: The API Server will not restart after being shut off.
- Resolution: Check that the storage is operational again, manually restore the API Server’s state from backup, and restart it.
- Prevention: Ensure you have a readily available snapshot of the API Server's state. Use dependable storage, such as Amazon Elastic Block Store (EBS), which survives a shutdown of the API Server VM, and prefer highly available storage.
Worker Node Shuts Down:
- Impact: The pods on the node stop running, and the Scheduler will attempt to run them on other available nodes. The cluster's total capacity to run pods is reduced.
- Resolution: Identify the problem on the node, restart it, and register it with the cluster.
- Prevention: Use a replication controller or a Service in front of pods to ensure that node failures do not affect users, and build fault-tolerant applications.
Kubelet Malfunction:
- Impact: If a node’s kubelet crashes, you will be unable to create new pods on that node. Existing pods may or may not be removed, and the node will be designated as unhealthy.
- Resolution: The same as when the Worker Node shuts down.
- Prevention: The same as when a worker node shuts off.
Conclusion
The upkeep of a Kubernetes cluster is a constant task. While Kubernetes aims to make containerized application management easier, keeping the cluster maintained and healthy is itself a crucial job.
In this blog, we learned about the errors and common issues that can affect a K8s cluster, its components, and its resources, and we saw how to troubleshoot and resolve them.
Related/References
- Visit our YouTube channel “K21Academy”
- Certified Kubernetes Administrator (CKA) Certification Exam
- (CKA) Certification: Step By Step Activity Guides/Hands-On Lab Exercise & Learning Path
- Certified Kubernetes Application Developer (CKAD) Certification Exam
- (CKAD) Certification: Step By Step Activity Guides/Hands-On Lab Exercise & Learning Path
- Create AKS Cluster: A Complete Step-by-Step Guide
- Container (Docker) vs Virtual Machines (VM): What Is The Difference?
- How To Setup A Three Node Kubernetes Cluster For CKA: Step By Step
Join FREE Masterclass
To learn about the roles and responsibilities of a Kubernetes administrator, why you should learn Docker and Kubernetes, the job opportunities for Kubernetes administrators in the market, and what to study (including the hands-on labs you must perform) to clear the Certified Kubernetes Administrator (CKA) certification exam, register for our FREE Masterclass.