Today, I upgraded an EKS cluster from 1.12 to 1.13. While at first the upgrade seemed successful, I quickly realized that new nodes were failing to become ready because the CNI was not initializing. I ran kubectl describe daemonset/aws-node -n kube-system to check the status of the daemonset that runs the aws-node pods (i.e. the CNI), and discovered the following repeated event: Error creating: pods "aws-node-" is forbidden: unable to validate against any pod security policy: . I checked the Pod Security Policies on the cluster (kubectl get psp) and saw only policies created by Prometheus Operator, not the eks.privileged policy I expected based on https://aws.amazon.com/blogs/opensource/using-pod-security-policies-amazon-eks-clusters/, which states: "For clusters that have been upgraded from previous versions, a fully-permissive PSP is automatically created during the upgrade process." As it turns out, that statement is not strictly true: after conferring with AWS support, I learned that the new PSP is not created if any PSPs already exist, as was the case on my cluster.
When upgrading a cluster that has existing Pod Security Policies to EKS 1.13, you must create the eks.privileged PSP and associated RBAC configuration yourself. Conveniently, the YAML to do this is provided in Amazon's documentation at https://docs.aws.amazon.com/eks/latest/userguide/pod-security-policy.html.
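For reference, the manifests on that page look roughly like the following (abridged sketch — consult the linked documentation for the authoritative versions):

```yaml
# Sketch of the eks.privileged PSP plus the RBAC that lets all
# authenticated users (including the aws-node daemonset) use it.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: eks.privileged
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities: ['*']
  volumes: ['*']
  hostNetwork: true
  hostPorts:
    - min: 0
      max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eks:podsecuritypolicy:privileged
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    resourceNames: ['eks.privileged']
    verbs: ['use']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eks:podsecuritypolicy:authenticated
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eks:podsecuritypolicy:privileged
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:authenticated
```

Once applied (kubectl apply -f), new nodes should be able to schedule the aws-node pods and become ready.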
I've been using Kubernetes on AWS (via EKS) in production for a few months now. There are a number of things I've learned that I wish I'd known at the outset:
- Out of the box, no protection is in place to ensure that sufficient resources are available for system-critical processes such as the kubelet. This means that resource-intensive workloads can easily destabilize a node, causing strange and undesirable behavior. The documentation at https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/ explains how to reserve resources for system daemons, but unfortunately offers minimal guidance on how much to allocate.
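On EKS, those reservations are kubelet flags, which can be passed through the AMI's bootstrap script. A sketch, with made-up values — the right numbers depend on your instance size and workloads:

```shell
# Hypothetical cluster name and reservation values, for illustration only.
# --kube-reserved / --system-reserved carve capacity out of what the
# scheduler sees; --eviction-hard evicts pods before the node itself starves.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=cpu=250m,memory=0.5Gi,ephemeral-storage=1Gi \
    --system-reserved=cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi \
    --eviction-hard=memory.available<200Mi,nodefs.available<10%'
```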
- Resources can be allocated to pods in the form of requests and limits, both of which are optional. Requests are used by the scheduler to determine where pods should be launched, while limits are used to trigger CPU throttling or OOM killing. More specifically, resources are allocated to the containers within pods - for complex applications that run multiple containers in a pod, you must allocate resources to each container independently, which can be tricky. It also means that one container in a pod can be killed for exceeding its memory limit while the others keep running. What I really wish I'd known, though, is that you almost always want to specify both a request and a limit, and they should be set to the same values. There are a couple of reasons for this:
- Without specifying requests or limits, you have anarchy: containers will be competing for resources with each other, leading to instability - especially if you haven't allocated resources for system processes. This one is obvious enough.
- The values specified for requests are not manifested inside containers; only limits are. Any runtime smart enough to inspect its cgroup limits to decide its maximum heap size (the JVM, the CLR, etc.) will aim for the limit, blowing past the request. If you overprovision your limits (i.e. set the sum of the limits on a node's pods to more than the node's capacity), this results in containers being killed repeatedly for what at first glance looks like a memory leak. It is, of course, possible to explicitly expose the request value in an environment variable. For platforms like the JVM that let you specify a heap limit, this allows you to effectively use a request size different from the limit.
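Putting both points together, a minimal pod spec might look like this (names and values are illustrative): requests set equal to limits, and the request surfaced to the process via the downward API's resourceFieldRef.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example          # illustrative name
spec:
  containers:
    - name: app
      image: example/app:latest   # illustrative image
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:          # identical to the requests, so the scheduler's
          cpu: 500m      # placement decision matches runtime reality
          memory: 512Mi
      env:
        # Expose the memory request to the process, e.g. so a JVM can size
        # its heap against the request rather than the cgroup limit.
        - name: MEMORY_REQUEST
          valueFrom:
            resourceFieldRef:
              containerName: app
              resource: requests.memory
```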
- EKS will refuse to launch your cluster unless its IAM role has AWS managed policies attached to it, even if you have provided equivalent policies. This can be problematic for organizations which leverage tags for security and isolation of systems within an account.
- There isn't a simple mechanism to gracefully terminate all nodes in a cluster in order to pick up OS patches (e.g. replacement AMIs in an autoscaling group). Today I accomplish this via a somewhat manual process of reconfiguring autoscaling group limits and behaviors, tainting old nodes, manually terminating them, and then reverting the autoscaling changes.
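For what it's worth, the per-node portion of that manual process looks roughly like this (node and instance identifiers are placeholders; repeat per node, then restore the original autoscaling settings):

```shell
# Keep new pods off the old node, then evict its existing pods.
kubectl taint nodes ip-10-0-1-23.ec2.internal rotation=pending:NoSchedule
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-local-data

# Terminate without shrinking the group, so the ASG launches a
# replacement node on the updated AMI.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity
```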
- Many applications and services, including the kubectl utility, are tied to a specific version of Kubernetes. You will need to have a strategy for maintaining configuration for multiple versions of these packages to match every version of Kubernetes you have in play (which will be at least two during an upgrade).
- The stable/prometheus-operator helm chart is independent of kube-prometheus, which is independent of prometheus-operator. Some conflicting and confusing documentation has been updated since I had to figure this out, but if you're not sure what you need, use the stable/prometheus-operator chart (don't even think about trying to mix and match, though).
- There is no great, official, or universally adopted solution to endowing a pod with an IAM role. Amazon is working on this problem, but for now there are a few open source projects which each have serious drawbacks.
- The ALB Ingress Controller leverages the standard Kubernetes ingress configuration to configure hostnames on an ALB. Unfortunately, the rules for hostnames in Kubernetes are more restrictive than the rules on ALBs, which can prevent you from applying the exact configuration you want.
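Concretely, Kubernetes validates ingress hosts as DNS subdomains, so a pattern like api-*.example.com - which an ALB host-header rule can express - is rejected by the API server, while a plain host works fine. A hedged sketch (names and annotations as of the controller I was using):

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: example                      # illustrative name
  annotations:
    kubernetes.io/ingress.class: alb
spec:
  rules:
    - host: app.example.com          # valid: a plain DNS subdomain
      http:
        paths:
          - path: /*
            backend:
              serviceName: app
              servicePort: 80
```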