Upgrading Your Cluster

One of the things I noticed while studying for my CKAD exam is that my test cluster was a bit behind. It’s been up and running for over 200 days at this point and I’m several versions behind. Version 1.17 is out, the exam was based on 1.16, and I’m running 1.13.

NAME                         STATUS   ROLES    AGE    VERSION
runlevl41c.mylabserver.com   Ready    master   207d   v1.13.5
runlevl42c.mylabserver.com   Ready    <none>   207d   v1.13.5
runlevl43c.mylabserver.com   Ready    <none>   207d   v1.13.5

How I Passed the CKAD Exam

Last year, I decided that since I’d been pushing my employer to embrace a cloud-native mentality, I should lead by example and get certified. We had chosen Kubernetes as our base container management platform and ultimately inked a deal with Red Hat to roll out OpenShift throughout the enterprise.

If you’ve read the blog for any length of time, you’ll remember that I was originally pursuing RHEL certification. At the time, even though I was in a development role, I found myself needing more and more Linux skills. My role then shifted to a dedicated focus on cloud technologies, so I abandoned RHEL and began focusing on Kubernetes. Even so, it wasn’t until about November of this year that I really buckled down to prepare.

In the spirit of openness, I’ll confess that I’ve had the benefit of spending the past several months working in a dedicated fashion deploying multiple OpenShift clusters. This has involved a great deal of tweaking, testing, and troubleshooting which has provided invaluable real-world, hands-on experience working with the Kubernetes system. Now, with that out of the way, I thought I’d share my personal take on how I prepared…and passed…the CKAD exam.

Separate, But Equal

One of the things we noticed in our dev cluster at work during the initial stages of our OpenShift deployment is that, while we were deliberately mucking with things, Pods would end up scheduled together on the same node. This didn’t seem prudent since we were purposely killing nodes, and such co-location could have a negative impact on our clients. This is what led me down the path of anti-affinity. I have to admit that I find the entire premise and behavior of the Kubernetes scheduler fascinating.

The goal of this post is to describe how to spread your workload amongst your nodes for greater fault tolerance. We’ll discuss both hard and soft affinities. Unfortunately, I only have two worker nodes in my lab environment so it’ll be a little more challenging to demonstrate, but I think you’ll get the gist.

Let’s say I just run a simple imperative command to create a quick Deployment with four NGINX replicas:

$ kubectl run no-affinity --replicas=4 --image=nginx --labels=app=no-affinity
deployment.apps/no-affinity created

I can see that the scheduler actually did a great job of evenly distributing the workload.

$ kubectl get po -o wide -l app=no-affinity
NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
no-affinity-6dc48758bf-lwdd5   1/1     Running   0          23s   10.244.2.25    runlevl43c.mylabserver.com   <none>           <none>
no-affinity-6dc48758bf-nbgzl   1/1     Running   0          23s   10.244.1.111   runlevl42c.mylabserver.com   <none>           <none>
no-affinity-6dc48758bf-sx6mr   1/1     Running   0          23s   10.244.2.24    runlevl43c.mylabserver.com   <none>           <none>
no-affinity-6dc48758bf-xmq2r   1/1     Running   0          23s   10.244.1.112   runlevl42c.mylabserver.com   <none>           <none>

Let’s presume that I don’t want any identical Pods running on the same node. This is where hard affinity comes into play. Let’s look at a Deployment descriptor which defines hard affinity. Since our goal is to separate our Pods, rather than keep them together, we’re going to use anti-affinity.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: affinity
  name: affinity
spec:
  replicas: 4
  selector:
    matchLabels:
      app: affinity
  template:
    metadata:
      labels:
        app: affinity
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - affinity
            topologyKey: kubernetes.io/hostname
      containers:
      - image: nginx
        name: affinity

In this example, we’re telling the scheduler that Pods whose label matches app=affinity must not be scheduled together. The topologyKey of kubernetes.io/hostname defines "together" as "on the same node". Let’s create the Deployment and see what happens.

$ kubectl apply -f req.yaml

$ kubectl get po -o wide -l app=affinity
NAME                      READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
affinity-dc6c5999-7n6bk   0/1     Pending   0          39s   <none>         <none>                       <none>           <none>
affinity-dc6c5999-brtsx   0/1     Pending   0          39s   <none>         <none>                       <none>           <none>
affinity-dc6c5999-gn4k8   1/1     Running   0          39s   10.244.2.26    runlevl43c.mylabserver.com   <none>           <none>
affinity-dc6c5999-mhjzt   1/1     Running   0          39s   10.244.1.113   runlevl42c.mylabserver.com   <none>           <none>

Notice that now we only have two out of our four declared Pods running. The first two listed are in a Pending state because the scheduler is adhering to our rule that the Pods cannot be scheduled together, and we only have two worker nodes. This probably isn’t what we really want; our aim is to have the scheduler do its best. Let’s change the descriptor so that we tell the scheduler that we prefer this behavior, but that it isn’t required.

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: affinity
  name: affinity
spec:
  replicas: 4
  selector:
    matchLabels:
      app: affinity
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: affinity
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - affinity
      containers:
      - image: nginx
        name: affinity

In this example, we’re now saying that the anti-affinity pattern is preferred, not required. This is known as soft affinity. Note that the weight field [1] is required with preferred scheduling. If we delete the existing Deployment and re-create it with this new descriptor, we should see behavior similar to our initial example.

$ kubectl get po -o wide -l app=affinity
NAME                       READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
affinity-944d8c9f9-498hz   1/1     Running   0          10s   10.244.1.114   runlevl42c.mylabserver.com   <none>           <none>
affinity-944d8c9f9-jljmq   1/1     Running   0          10s   10.244.1.115   runlevl42c.mylabserver.com   <none>           <none>
affinity-944d8c9f9-m582c   1/1     Running   0          10s   10.244.2.27    runlevl43c.mylabserver.com   <none>           <none>
affinity-944d8c9f9-plmbf   1/1     Running   0          10s   10.244.2.28    runlevl43c.mylabserver.com   <none>           <none>

And this is what we see: all four Pods are running, two on each worker. Again, with more worker nodes available, it would be easier to see how the scheduler balances the Pods. Whether or not co-located Pods would be a problem for us in production is TBD. However, I plan on accounting for it in our deployments regardless, to err on the side of safety and resiliency.
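
To make the difference between the hard and soft rules concrete, here’s a rough Python sketch of the placement logic. It is a toy model, not the scheduler’s real algorithm: the place function is a hypothetical name, the node names are shortened versions of my lab workers, and it assumes the only thing that matters is how many matching Pods each node already hosts.

```python
def place(replicas, nodes, required):
    """Place replicas one at a time, keeping matching Pods apart by hostname."""
    per_node = {node: 0 for node in nodes}
    pending = 0
    for _ in range(replicas):
        # Pick the node that currently hosts the fewest matching Pods.
        target = min(nodes, key=lambda n: per_node[n])
        if required and per_node[target] > 0:
            # Hard rule: every node already has a matching Pod, so this one waits.
            pending += 1
        else:
            # Free node, or soft rule: schedule it on the least crowded node.
            per_node[target] += 1
    return per_node, pending


workers = ["runlevl42c", "runlevl43c"]  # shortened lab node names
print(place(4, workers, required=True))   # ({'runlevl42c': 1, 'runlevl43c': 1}, 2)
print(place(4, workers, required=False))  # ({'runlevl42c': 2, 'runlevl43c': 2}, 0)
```

With the hard rule, two replicas land and two stay Pending, which matches the first result. With the soft rule, all four run, spread two per node.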


  1. The weight field in preferredDuringSchedulingIgnoredDuringExecution is in the range 1-100. For each node that meets all of the scheduling requirements (resource request, RequiredDuringScheduling affinity expressions, etc.), the scheduler will compute a sum by iterating through the elements of this field and adding “weight” to the sum if the node matches the corresponding MatchExpressions. This score is then combined with the scores of other priority functions for the node. The node(s) with the highest total score are the most preferred. (Source: Kubernetes documentation.)
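
The weighted sum described in this note can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming each preference is a weight plus a set of labels the node must carry. The preference_score function, the label names, and the node names are all hypothetical, and the real scheduler combines this sum with other scoring functions.

```python
def preference_score(node_labels, preferences):
    """Add up the weight of every preference the node's labels satisfy."""
    total = 0
    for weight, wanted in preferences:
        if all(node_labels.get(key) == value for key, value in wanted.items()):
            total += weight
    return total


# Hypothetical preferences and node labels, purely for illustration.
preferences = [(80, {"disktype": "ssd"}), (20, {"zone": "east"})]
nodes = {
    "node-a": {"disktype": "ssd", "zone": "east"},
    "node-b": {"disktype": "hdd", "zone": "east"},
}

# Rank nodes from most to least preferred by their summed weights.
ranked = sorted(
    ((name, preference_score(labels, preferences)) for name, labels in nodes.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)  # [('node-a', 100), ('node-b', 20)]
```

Here node-a satisfies both preferences and scores 100, while node-b only satisfies the zone preference and scores 20, so node-a is the more preferred of the two.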

Executing Commands Against a Pod

I recently responded to another user in a Slack channel regarding this topic and thought I’d post it here as well. The discussion revolved around when you need to use the -- command demarcation with the kubectl exec command.

Whether you need the -- separator depends on what you’re doing. If you’re not passing any flags (arguments that begin with a dash) to the command, it’s not needed.

Let’s create a new pod based on busybox to test the theory:

$ kubectl run bb --image=busybox --restart=Never --command -- sleep 3600

If we want to list the directories with ls, we can skip the separator:

$ kubectl exec bb ls
bin
dev
etc
home
proc
root
sys
tmp
usr
var

Let’s say we want to know more about the contents so we try to do a long listing. This will generate an error.

$ kubectl exec bb ls -al
Error: unknown shorthand flag: 'a' in -al

However, if we include the --, the command and argument(s) are handled properly and we get the desired results.

$ kubectl exec bb -- ls -al
total 44
drwxr-xr-x    1 root     root          4096 Dec  8 21:17 .
drwxr-xr-x    1 root     root          4096 Dec  8 21:17 ..
-rwxr-xr-x    1 root     root             0 Dec  8 21:17 .dockerenv
drwxr-xr-x    2 root     root         12288 Dec  2 20:12 bin
drwxr-xr-x    5 root     root           360 Dec  8 21:17 dev
drwxr-xr-x    1 root     root          4096 Dec  8 21:17 etc
drwxr-xr-x    2 nobody   nogroup       4096 Dec  2 20:12 home
dr-xr-xr-x  238 root     root             0 Dec  8 21:17 proc
drwx------    2 root     root          4096 Dec  2 20:12 root
dr-xr-xr-x   13 root     root             0 Dec  8 21:16 sys
drwxrwxrwt    2 root     root          4096 Dec  2 20:12 tmp
drwxr-xr-x    3 root     root          4096 Dec  2 20:12 usr
drwxr-xr-x    1 root     root          4096 Dec  8 21:17 var
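
If it helps to see the convention spelled out, here’s a rough Python sketch of how a command-line tool can treat --. It is not how kubectl is implemented. parse_exec and KNOWN_FLAGS are hypothetical names, the flag set is a tiny illustrative subset, and the error message is simplified compared to the real one shown above.

```python
KNOWN_FLAGS = {"-i", "-t"}  # tiny illustrative subset of the tool's own flags


def parse_exec(argv):
    """Split exec-style arguments into (pod, tool_flags, container_command)."""
    head, tail = argv, []
    if "--" in argv:
        cut = argv.index("--")
        # Everything after the first "--" is passed through untouched.
        head, tail = argv[:cut], argv[cut + 1:]

    flags, positionals = [], []
    for token in head:
        if token.startswith("-"):
            # Before "--", anything with a leading dash is treated as the tool's flag.
            if token not in KNOWN_FLAGS:
                raise ValueError(f"unknown flag: {token}")
            flags.append(token)
        else:
            positionals.append(token)

    if not positionals:
        raise ValueError("a pod name is required")
    return positionals[0], flags, positionals[1:] + tail


print(parse_exec(["bb", "ls"]))                # ('bb', [], ['ls'])
print(parse_exec(["bb", "--", "ls", "-al"]))   # ('bb', [], ['ls', '-al'])
try:
    parse_exec(["bb", "ls", "-al"])
except ValueError as err:
    print(err)                                 # unknown flag: -al
```

Without the separator, -al is claimed by the tool itself and rejected. With it, -al is handed to ls inside the container.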

Hopefully this will clear up any confusion.

CKAD Speed Tip

As I prepare for the Certified Kubernetes Application Developer exam, I’m looking for every possible way to shave time off the tasks at hand. One thing that’s annoyed me when trying to move quickly from task to task is the delay in waiting for resources to be killed off.

The reason for the delay is that Pods have a default termination grace period of 30 seconds.

Let’s take a look at the time it takes (your numbers will vary) to kill your average pod. We’ll use the time command to capture the numbers. In this example, I’ve created a simple pod running nginx. I’m also efficient and I have kubectl aliased to k.

$ time k delete po nginx
pod "nginx" deleted
real    0m13.416s
user    0m0.080s
sys    0m0.008s

Now let’s run the same command using the --now flag to tell kubectl that we want to set the grace period to 1 second.

$ time k delete po nginx --now
pod "nginx" deleted
real    0m2.012s
user    0m0.080s
sys    0m0.000s

In this example (and several other tests), I was able to cut 10 or more seconds off each run. Now, if you’re really in a hurry and don’t care about waiting at all, you can take it one step further. Be forewarned (and the command will warn you as well) that Kubernetes won’t wait for confirmation that the pod was terminated, so it may keep running. We can set the grace period to 0 and force the deletion.

$ time k delete po nginx --grace-period=0 --force=true
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "nginx" force deleted
real 0m0.094s
user 0m0.064s
sys 0m0.020s
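
For anyone curious why the grace period matters so much, here’s a small local analogy in Python. It is not Kubernetes code. It just mimics the pattern of asking a process to stop, waiting up to a grace period, and then force-killing it. The stop and spawn helpers are hypothetical names, and the sketch assumes a POSIX system where terminate() sends SIGTERM and kill() sends SIGKILL.

```python
import subprocess
import sys
import time


def stop(proc, grace_seconds):
    """Ask a process to exit, then force-kill it once the grace period runs out."""
    started = time.monotonic()
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace_seconds)  # give it time to shut down cleanly
        outcome = "exited within grace period"
    except subprocess.TimeoutExpired:
        proc.kill()  # grace period is over (SIGKILL on POSIX)
        proc.wait()
        outcome = "force-killed"
    return outcome, time.monotonic() - started


def spawn(ignore_sigterm):
    """Start a child that sleeps for a minute, optionally ignoring SIGTERM."""
    code = "import signal, time\n"
    if ignore_sigterm:
        code += "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    code += "print('ready', flush=True)\ntime.sleep(60)\n"
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()  # block until the child has finished its setup
    return proc


# A well-behaved process exits quickly; a stubborn one uses the whole grace period.
print(stop(spawn(ignore_sigterm=False), grace_seconds=5)[0])
print(stop(spawn(ignore_sigterm=True), grace_seconds=1)[0])
```

A process that shuts down promptly never uses the full grace period, while one that ignores the request holds things up until the timer expires. Shrinking the grace period only changes how long we are willing to wait for the stubborn case.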

 
