Outbreak Labs

I can do anything I want to. And so can you.

A warning for users upgrading EKS from 1.12 to 1.13

Today, I upgraded an EKS cluster from 1.12 to 1.13. While the upgrade at first appeared successful, I quickly realized that new nodes were failing to become ready because the CNI was not initializing. I issued kubectl describe daemonset/aws-node -n kube-system to check the status of the daemonset that runs the aws-node (CNI) pods, and discovered the following event, repeated: Error creating: pods "aws-node-" is forbidden: unable to validate against any pod security policy: []. I then checked the Pod Security Policies on the cluster (kubectl get psp) and saw only the policies created by Prometheus Operator - not the eks.privileged policy I expected based on https://aws.amazon.com/blogs/opensource/using-pod-security-policies-amazon-eks-clusters/.

As it turns out, that post's claim that "For clusters that have been upgraded from previous versions, a fully-permissive PSP is automatically created during the upgrade process" is not strictly true. After conferring with AWS support, I learned that the new PSP is not created if any PSPs already exist, as was the case on my cluster.

When upgrading a cluster that has existing Pod Security Policies to EKS 1.13, you must create the eks.privileged PSP and associated RBAC configuration yourself. Conveniently, the YAML to do this is provided in Amazon's documentation at https://docs.aws.amazon.com/eks/latest/userguide/pod-security-policy.html.
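
For reference, this is roughly what that manifest looks like. I've abbreviated it and am reproducing it from memory, so treat the AWS documentation linked above as canonical:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: eks.privileged
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities:
  - '*'
  volumes:
  - '*'
  hostNetwork: true
  hostPorts:
  - min: 0
    max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  readOnlyRootFilesystem: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eks:podsecuritypolicy:privileged
rules:
- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  resourceNames:
  - eks.privileged
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eks:podsecuritypolicy:authenticated
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eks:podsecuritypolicy:privileged
subjects:
# Every authenticated user (which includes all service accounts) may use the policy.
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:authenticated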

Things I wish I'd known about Kubernetes and EKS

I've been using Kubernetes on AWS (via EKS) in production for a few months now. There are a number of things I've learned that I wish I'd known at the outset:

  • Out of the box, no protection is in place to ensure that sufficient resources remain available for system-critical processes such as the kubelet. This means that running resource-intensive workloads can easily destabilize a node, causing strange and undesirable behavior. The documentation at https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/ explains how to reserve resources for system daemons, but unfortunately provides minimal guidance on how much should be allocated (see the first sketch after this list).
  • Resources can be allocated to pods in the form of requests and limits, both of which are optional. Requests are used by the scheduler to decide where pods should be launched, while limits are used to trigger CPU throttling or OOM killing. More precisely, resources are allocated to the containers within pods - for complex applications that run multiple containers in a pod, you must allocate resources to each container independently, which can be tricky. It also means that one container in a pod can be killed for exceeding its memory limit while the others keep running. What I really wish I'd known, though, is that you almost always want to specify both a request and a limit, and they should be set to the same values (see the second sketch after this list). There are a couple of reasons for this:
    • Without requests or limits, you have anarchy: containers compete with each other for resources, leading to instability - especially if you haven't reserved resources for system processes. This one is obvious enough.
    • Only limits are manifested within containers (as cgroup limits); requests are not. Any runtime smart enough to inspect cgroup limits to decide its maximum heap size (the JVM, the CLR, etc.) will aim for the limit, blowing right past the request. If you overprovision your limits (i.e. set the sum of the limits on a node's pods to be greater than the node's capacity), the result is containers being killed repeatedly for what at first glance looks like a memory leak. It is possible, of course, to explicitly expose the request value in an environment variable; for platforms like the JVM that let you specify a heap limit, this can let you effectively use a request different from the limit.
  • EKS will refuse to launch your cluster unless its IAM role has the AWS-managed policies attached, even if you have attached equivalent policies of your own. This can be problematic for organizations that leverage tags for security and isolation of systems within an account.
  • There isn't a simple mechanism to gracefully replace all nodes in a cluster in order to pick up OS patches (e.g. replacement AMIs in an autoscaling group). Today I accomplish this via a somewhat manual process of reconfiguring autoscaling group limits and behaviors, tainting the old nodes, manually terminating them, and then reverting the autoscaling changes (see the third sketch after this list).
  • Many applications and services, including the kubectl utility, are tied to a specific version of Kubernetes. You will need to have a strategy for maintaining configuration for multiple versions of these packages to match every version of Kubernetes you have in play (which will be at least two during an upgrade).
  • The stable/prometheus-operator helm chart is independent of kube-prometheus, which is independent of prometheus-operator. Some conflicting and confusing documentation has been updated since I had to figure this out, but if you're not sure what you need, use the stable/prometheus-operator chart (don't even think about trying to mix and match, though).
  • There is no great, official, or universally adopted solution to endowing a pod with an IAM role. Amazon is working on this problem, but for now there are a few open source projects which each have serious drawbacks.
  • The ALB Ingress Controller leverages the standard Kubernetes ingress configuration to configure hostnames on an ALB. Unfortunately, the rules for hostnames in Kubernetes are more restrictive than the rules on ALBs, which can prevent you from applying the exact configuration you want.
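
To expand on a few of those items: first, a sketch of reserving resources for system daemons on an EKS worker node. The kubelet flags are real, but the cluster name and the reservation values are illustrative placeholders, not tuned recommendations, and the invocation assumes the standard EKS AMI's bootstrap script:

# In the node's user data; tune the reservation values per instance type.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=cpu=250m,memory=512Mi --system-reserved=cpu=250m,memory=256Mi --eviction-hard=memory.available<200Mi,nodefs.available<10%'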
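
Second, a sketch of a pod spec that sets requests equal to limits, and shows how the downward API can expose the request to the application if you do choose to run with a request below the limit. All names and values here are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: example/app:1.0
    resources:
      requests:        # used by the scheduler for placement
        cpu: 500m
        memory: 512Mi
      limits:          # enforced via cgroups; set equal to the requests
        cpu: 500m
        memory: 512Mi
    env:
    # The downward API can expose the memory request so a runtime like the JVM
    # can size its heap from it (e.g. -Xmx) instead of aiming for the limit.
    - name: MEMORY_REQUEST_MB
      valueFrom:
        resourceFieldRef:
          containerName: app
          resource: requests.memory
          divisor: 1Mi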
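
Finally, an outline of the node replacement process described above. The commands are real, but the names and IDs are placeholders and the exact sequence is particular to my setup; treat this as a sketch rather than a recipe:

# Raise the ASG limits so replacement nodes (built from the new AMI) come up first.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-nodes \
  --desired-capacity 8 --max-size 8

# Keep new pods off of an old node, optionally drain it for a gentler eviction, then terminate it.
kubectl taint nodes ip-10-0-0-1.ec2.internal refresh=true:NoSchedule
kubectl drain ip-10-0-0-1.ec2.internal --ignore-daemonsets --delete-local-data
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

# Once every old node has been replaced, revert the autoscaling changes.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-nodes \
  --desired-capacity 4 --max-size 4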

Marathon client library for .NET

Today I've published my C# client library for Marathon, a container orchestration platform for Mesos and DC/OS.  The API coverage is extremely minimal at this time, exposing only the functionality I need in my other projects, but you may find it useful too.

Get the source code on GitHub or download the package on NuGet.

Simplify your configuration with ConfigureAll

I've been using the new Options pattern for configuration in my .NET Core apps. I like it, but I got tired of having to write lines like

services.Configure<FrobOptions>(Configuration.GetSection("Frob"));

every time I added a new configuration class. I've written a library called ConfigureAll to address this inconvenience. With this library, you can add an attribute to your configuration classes indicating their configuration key, like 

[ConfigurationObject("Frob")]
public class FrobOptions
{
    public string TestValue { get; set; }
}


then just call

services.ConfigureAll(this.GetType().GetTypeInfo().Assembly, Configuration);

once, instead of having to configure each type individually. For more information, see the GitHub page: https://github.com/SapientGuardian/SapientGuardian.ConfigureAll
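
Under the hood there isn't much magic: the extension just needs to scan the assembly for the attribute and invoke the generic Configure method for each decorated type. Here's a rough sketch of how that can be done - this is illustrative, not the library's actual source, and the attribute's Key property name is my assumption (see the repo for the real thing):

using System.Linq;
using System.Reflection;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

public static class ConfigureAllSketch
{
    public static IServiceCollection ConfigureAllTypes(
        this IServiceCollection services, Assembly assembly, IConfiguration configuration)
    {
        // The open generic Configure<TOptions>(IServiceCollection, IConfiguration) extension.
        var configure = typeof(OptionsConfigurationServiceCollectionExtensions)
            .GetMethods()
            .Single(m => m.Name == "Configure" && m.GetParameters().Length == 2);

        foreach (var type in assembly.GetTypes())
        {
            var attr = type.GetTypeInfo().GetCustomAttribute<ConfigurationObjectAttribute>();
            if (attr == null) continue;

            // Equivalent to services.Configure<T>(configuration.GetSection(attr.Key)).
            configure.MakeGenericMethod(type)
                     .Invoke(null, new object[] { services, configuration.GetSection(attr.Key) });
        }

        return services;
    }
}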

BaseNEncodings.Net Retargeted to .NET Standard 1.1

I've been a long-time user of the WallF BaseNEncodings.Net library. It's the only library I'm aware of that was explicitly built to RFC 4648. I submitted a pull request to retarget the library to netstandard1.1, but received no response from the author. While I had hoped the project was still maintained, that seems not to be the case. So I have forked the library and republished it. The new repo is https://github.com/SapientGuardian/BaseNEncodings.Net. Happy Encoding!

Live Patching of Cisco Webex to Stop the Popup

The company I work for recently started using Cisco WebEx, an online collaboration application. It's been configured with a rather annoying behavior - upon completion of every meeting, it launches our company web page in my browser. This is, of course, the last thing anyone wants to see at the conclusion of a meeting.

After diplomacy with the IT department failed, my first attempt at stopping this was to patch the WebEx binaries to remove the offending routine. This was successful, but the module was overwritten by some sort of auto-update process after a few days of use.

To avoid having the patch inadvertently removed, I developed a small Windows service that waits for WebEx to be launched and patches it in memory to remove the code that spawns the popup.

I have uploaded the source code to the service at https://github.com/SapientGuardian/WebExPopupKiller for anyone curious as to how to write such a thing.
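
For anyone who just wants the shape of the technique without reading the full project, here is a minimal sketch. Everything specific in it is a placeholder - the process name, the patch offset, and the byte pattern are invented for illustration, and a real service would also track which processes it has already patched:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading;

class PopupPatcher
{
    const uint PROCESS_VM_OPERATION = 0x0008;
    const uint PROCESS_VM_READ = 0x0010;
    const uint PROCESS_VM_WRITE = 0x0020;
    const uint PAGE_EXECUTE_READWRITE = 0x40;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr OpenProcess(uint access, bool inheritHandle, int processId);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool VirtualProtectEx(IntPtr process, IntPtr address, UIntPtr size, uint newProtect, out uint oldProtect);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool WriteProcessMemory(IntPtr process, IntPtr address, byte[] buffer, UIntPtr size, out UIntPtr written);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr handle);

    static void Main()
    {
        while (true)
        {
            // "atmgr" is a stand-in for the WebEx process name.
            foreach (var target in Process.GetProcessesByName("atmgr"))
            {
                // Stand-in offset of the popup-spawning call within the main module.
                IntPtr patchAddress = IntPtr.Add(target.MainModule.BaseAddress, 0x12345);
                byte[] nops = { 0x90, 0x90, 0x90, 0x90, 0x90 }; // x86 NOPs to overwrite the call

                IntPtr handle = OpenProcess(PROCESS_VM_OPERATION | PROCESS_VM_READ | PROCESS_VM_WRITE, false, target.Id);
                if (handle == IntPtr.Zero) continue;

                // Make the page writable, overwrite the call, then restore the original protection.
                if (VirtualProtectEx(handle, patchAddress, (UIntPtr)nops.Length, PAGE_EXECUTE_READWRITE, out uint oldProtect))
                {
                    WriteProcessMemory(handle, patchAddress, nops, (UIntPtr)nops.Length, out _);
                    VirtualProtectEx(handle, patchAddress, (UIntPtr)nops.Length, oldProtect, out _);
                }
                CloseHandle(handle);
            }
            Thread.Sleep(5000); // poll for newly launched WebEx processes
        }
    }
}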

New release and location for .NET Core port of MySQLConnector/Net

Back in October, I released a port of MySQL Connector targeting dnxcore50 (at the time, the moniker for .NET Core). It had just enough functionality to do what I needed, had no running tests, and didn't conform to the expected interfaces of new data providers - so it couldn't slot into any ORMs. I upgraded it to target netstandard 1.5 when RC2 came out, but didn't do much with it beyond that.

Over the past two days, I've received contributions that got CI builds running through AppVeyor, along with a large PR that retargeted the library to netstandard 1.3, which means it can now be built for other platforms too. Builds are now automated and published to NuGet. The open source machine is working!

The repo has been relocated to https://github.com/SapientGuardian/mysql-connector-net-netstandard

LibPacketGremlin is Open Source!

The full source code to LibPacketGremlin is now publicly available, including all of the wireless decryption work that I posted about in 2014. It is also available as a NuGet package, for use in your own software. My hope is that other developers will want to contribute to this project, but even if they don't, having this code out in the open will allow me to release some of the other projects I've been working on that use it as a base.