Outbreak Labs

I can do anything I want to. And so can you.

The Quest, 24 Years Later

In 1996, AOL included a game called The Quest along with the AOL 2.0 CD. It was a point-and-click game filled with tie-ins to AOL, and contained content from Cartoon Network and other brands. The objective of the game was to answer ten "riddles" based on the in-game content and linked demos. The reward for completing the quest was... an hour of AOL. Back in 1996, I was unable to progress beyond the first one or two riddles. My CD was lost to time, along with, it seemed, everyone else's. I began a quest of my own, to find a copy of this game and finish it. I found maybe half a dozen references to the game around the Internet, some of them from people who could barely remember it. I archived every reference I could. I'd largely given up my search, but recently I took another look and found that someone had located the game and put it online. I was ecstatic - I downloaded the game, and set up a Windows 95 virtual machine to play it.

All at once, my memories returned. I activated the interface where you answer the riddles. First one, easy. Second one, I didn't quite understand how you were supposed to know the answer, but I knew where in the game you were supposed to get it, giving me nine possibilities. I tried them one by one, until I found it. The third riddle comes on screen; I instantly know that I never got past this point in 1996. 24 years older, surely I could find the answer now. I started poking through the game, realizing that I was tracing the same steps I'd taken 24 years earlier, looking in the same places. I couldn't find it. This was a "fill in the blank" riddle, given a ridiculous recipe that "serves 13,000". I punched the other ingredients into Google, and found a number of recipes that used them - but none that were the one on screen. I started trying their ingredients as answers, one by one. After many guesses, I arrived at the correct answer: nutmeg. I'd finally made some progress. Riddle number four was a simple yes/no question. I guessed it, and got it correct. More progress! Riddle number five - I had absolutely no idea. I found nothing in the game that seemed like it could be relevant, and nothing on the Internet either. I worried that the answer might be lost in some long-defunct AOL keyword page. It was time for an alternative approach.

I thought I might reverse engineer the game. I took a look through its files and, after some research, identified that the game code was baked into DXR files: an output of Macromedia Director, a logical choice for this type of software back in the 90s. In all this time, nobody has made a decompiler for Director's DXR format, which I found described as bytecode for a stack-based virtual machine. This seemed like a dead end. I browsed some of the text-based resource files, but didn't find any leads there. Finally, I tried searching the DXR binaries for the answers I already had, as plain strings. Success! I located a fragment of game data that contained the answers I'd found so far, and it was obvious that the remaining answers were right beside them. I entered the remaining riddle answers. I could tell that I would never have guessed more than one or two of them. I have no idea how anybody beat this game - maybe nobody has.
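For anyone who wants to try the same trick on another old game, here is a minimal sketch of that kind of string search in Python. It is not the exact tooling used above; it assumes the strings are stored as plain ASCII, and the helper names (find_strings, printable, scan_directory) are made up for this example.

```python
from pathlib import Path


def find_strings(data: bytes, needles: list[str], context: int = 64):
    """Return (needle, offset, surrounding bytes) for each case-insensitive hit."""
    hits = []
    lowered = data.lower()
    for needle in needles:
        pattern = needle.lower().encode("ascii")
        start = 0
        while (pos := lowered.find(pattern, start)) != -1:
            # Keep some bytes on either side so neighbouring strings are visible.
            window = data[max(0, pos - context): pos + len(pattern) + context]
            hits.append((needle, pos, window))
            start = pos + 1
    return hits


def printable(chunk: bytes) -> str:
    """Render bytes for display, replacing non-printable ones with dots."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)


def scan_directory(root: str, needles: list[str], suffix: str = ".dxr") -> None:
    """Print every hit found in files under root whose extension matches suffix."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == suffix:
            for needle, pos, window in find_strings(path.read_bytes(), needles):
                print(f"{path.name} @ {pos:#x} [{needle}]: {printable(window)}")
```

Calling scan_directory with the answers already known (for example "Nutmeg" and "Arthur") prints the bytes around each hit, which is where any neighbouring answers would show up.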

Without further ado, here are the answers, which I cheated to get:

  1. Ramses II
  2. Arthur
  3. Nutmeg
  4. Yes
  5. Fish
  6. Banjo
  7. Joker
  8. Chile
  9. Club KidSoft
  10. Sudan

24 years later, The Quest is complete.

Escaping Grafana Variables

I was parameterizing a Prometheus query in a Grafana dashboard like I've done many times before:


Only this time it didn't work. I hard-coded a sample value for each parameter and discovered that the quantile variable, which has decimal values like 0.5, was causing the query to return no results. Using the Query Inspector, I could see that the query being sent to Prometheus looked like:


As it turns out, Grafana escapes the variables as though they were going to be used in a regular expression. This happened to be the first time I'd used a variable with a period in it, so I hadn't learned of this behavior until now. The fix is the ${variable:raw} format, which inserts the value without escaping. I replaced my query with this:


And it worked perfectly.
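To illustrate the difference between the two forms, here is a rough sketch in Python. It is not Grafana's code: re.escape stands in for Grafana's regex escaping (whose exact rules may differ), and the metric name request_duration_seconds and the render helper are made up for the example.

```python
import re

# Matches either ${name} / ${name:format} or a bare $name.
_VARIABLE = re.compile(r"\$\{(?P<braced>\w+)(?::(?P<fmt>\w+))?\}|\$(?P<plain>\w+)")


def render(template: str, variables: dict[str, str]) -> str:
    """Substitute variables into a query template.

    Plain $name and ${name} are regex-escaped (the assumed default);
    ${name:raw} is inserted exactly as given.
    """
    def replace(match: re.Match) -> str:
        name = match.group("braced") or match.group("plain")
        value = variables[name]
        return value if match.group("fmt") == "raw" else re.escape(value)

    return _VARIABLE.sub(replace, template)


values = {"quantile": "0.5"}
escaped = render('request_duration_seconds{quantile="$quantile"}', values)
raw = render('request_duration_seconds{quantile="${quantile:raw}"}', values)
print(escaped)  # the period arrives with a backslash in front of it
print(raw)      # the period arrives untouched
```

The escaped form only makes sense inside a regular expression, so it goes wrong as soon as the variable is used where a literal value is expected.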

A warning for users upgrading EKS from 1.12 to 1.13

Today, I upgraded an EKS cluster from 1.12 to 1.13. While the upgrade seemed successful at first, I quickly realized that new nodes were failing to become ready because the CNI was not initializing. I ran kubectl describe daemonset/aws-node -n kube-system to check the status of the daemonset that runs the aws-node pods, which provide the CNI, and discovered the following repeated event: Error creating: pods "aws-node-" is forbidden: unable to validate against any pod security policy: []. I checked the Pod Security Policies on the cluster (kubectl get psp) and saw only policies created by Prometheus Operator, not the eks.privileged policy I expected from https://aws.amazon.com/blogs/opensource/using-pod-security-policies-amazon-eks-clusters/. As it turns out, this statement from that post is not strictly true: "For clusters that have been upgraded from previous versions, a fully-permissive PSP is automatically created during the upgrade process." After conferring with AWS support, I learned that the new PSP is not created if any PSPs already exist, as was true in my case.


When upgrading a cluster that has existing Pod Security Policies to EKS 1.13, you must create the eks.privileged PSP and associated RBAC configuration yourself. Conveniently, the YAML to do this is provided in Amazon's documentation at https://docs.aws.amazon.com/eks/latest/userguide/pod-security-policy.html.
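A check for this situation can be sketched in a few lines of Python. This is only an illustration, not an official tool: the function name needs_manual_eks_privileged is made up, and it assumes you feed it the JSON printed by kubectl get psp -o json.

```python
import json


def needs_manual_eks_privileged(psp_list_json: str) -> bool:
    """Return True when PSPs exist but none of them is named eks.privileged.

    That is the situation described above, in which the upgrade does not
    create the policy and you have to apply it yourself.
    """
    items = json.loads(psp_list_json).get("items", [])
    names = {item["metadata"]["name"] for item in items}
    return bool(names) and "eks.privileged" not in names
```

If it returns True, apply the YAML from Amazon's documentation so that new nodes can start the aws-node pods.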

Things I wish I'd known about Kubernetes and EKS

I've been using Kubernetes on AWS (via EKS) in production for a few months now. There are a number of things I've learned that I wish I'd known at the outset:


  • Out of the box, no protection is in place to ensure that sufficient resources are available for system critical processes such as the Kubelet. This means that running resource intensive workloads can easily destabilize a node, causing strange and undesirable behavior. The documentation at https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/ provides information on how to reserve resources for system daemons, but unfortunately provides minimal guidance for how much should be allocated.
  • Resources can be allocated to pods in the form of requests and limits, both of which are optional. Requests are used by the scheduler to determine where pods should be launched, while limits are used to trigger CPU throttling or OOM killing. More specifically, the resources are allocated to containers within pods - for complex applications which run multiple containers within a pod, you must allocate resources to each container independently, which can be tricky. It also means that you can have one container in a pod killed for exceeding its memory limit, while the other remains running. What I really wish I'd known, though, is that you almost always want to specify both a request and a limit, and they should be set to the same values. There are a couple of reasons for this:
    • Without specifying requests or limits, you have anarchy: containers will be competing for resources with each other, leading to instability - especially if you haven't allocated resources for system processes. This one is obvious enough.
    • The values specified for requests are not manifested within containers; only limits are. Runtimes that are smart enough to look at cgroup limits to decide their maximum heap size (the JVM, the CLR, etc.) will aim for the limit, blowing past the request. If you overprovision your limits (i.e. set the sum total of limits on pods to be greater than the capacity of a node), this will result in containers being killed repeatedly for what at first glance looks like a memory leak. It is possible, of course, to explicitly include the request value in an environment variable. For platforms like the JVM which allow you to specify a heap limit, this could allow you to effectively use a request size different from the limit.
  • EKS will refuse to launch your cluster unless its IAM role has AWS managed policies attached to it, even if you have provided equivalent policies. This can be problematic for organizations which leverage tags for security and isolation of systems within an account.
  • There isn't a simple mechanism to gracefully terminate all nodes in a cluster in order to pick up OS patches (e.g. replacement AMIs in an autoscaling group). Today I accomplish this via a somewhat manual process of reconfiguring autoscaling group limits and behaviors, tainting old nodes, manually terminating them, and then reverting the autoscaling changes.
  • Many applications and services, including the kubectl utility, are tied to a specific version of Kubernetes. You will need to have a strategy for maintaining configuration for multiple versions of these packages to match every version of Kubernetes you have in play (which will be at least two during an upgrade).
  • The stable/prometheus-operator helm chart is independent of kube-prometheus, which is independent of prometheus-operator. Some conflicting and confusing documentation has been updated since I had to figure this out, but if you're not sure what you need, use the stable/prometheus-operator chart (don't even think about trying to mix and match, though).
  • There is no great, official, or universally adopted solution to endowing a pod with an IAM role. Amazon is working on this problem, but for now there are a few open source projects which each have serious drawbacks.
  • The ALB Ingress Controller leverages the standard Kubernetes ingress configuration to configure hostnames on an ALB. Unfortunately, the rules for hostnames in Kubernetes are more restrictive than the rules on ALBs, which can prevent you from applying the exact configuration you want.
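
To make the requests-and-limits advice above concrete, here is a small Python sketch. The helper names (guaranteed_resources, memory_overcommit_mib) and the numbers are made up for illustration. The first function builds a container resources stanza whose requests equal its limits; the second shows how far a set of memory limits overshoots what a node can hand out.

```python
def guaranteed_resources(cpu_millicores: int, memory_mib: int) -> dict:
    """Build a container `resources` stanza with requests equal to limits."""
    values = {"cpu": f"{cpu_millicores}m", "memory": f"{memory_mib}Mi"}
    # Copy the dict so the two stanzas can be edited independently later.
    return {"requests": dict(values), "limits": dict(values)}


def memory_overcommit_mib(limits_mib: list[int], allocatable_mib: int) -> int:
    """Return how many MiB the summed limits exceed the node by (0 if they fit)."""
    return max(0, sum(limits_mib) - allocatable_mib)


print(guaranteed_resources(500, 1024))
print(memory_overcommit_mib([2048, 2048, 1024], 4096))
```

When every container in a pod has requests equal to limits for both CPU and memory, Kubernetes places the pod in the Guaranteed QoS class, and the scheduler's view of the node matches what the containers are actually allowed to use.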

Marathon client library for .NET

Today I've published my C# client library for Marathon, a container orchestration platform for Mesos and DC/OS.  The API coverage is extremely minimal at this time, exposing only the functionality I need in my other projects, but you may find it useful too.

Get the source code on GitHub or download the package on NuGet.

Simplify your configuration with ConfigureAll

I've been using the new Options pattern for configuration in my .NET Core apps. I like it, but got tired of having to write lines like


every time I added a new configuration class. I've written a library called ConfigureAll to address this inconvenience. With this library, you can add an attribute to your configuration classes indicating their configuration key, like 

    public class FrobOptions
    {
        public string TestValue { get; set; }
    }

then just call

    services.ConfigureAll(this.GetType().GetTypeInfo().Assembly, Configuration);

once, instead of having to configure each type individually. For more information, see the GitHub page: https://github.com/SapientGuardian/SapientGuardian.ConfigureAll

BaseNEncodings.Net Retargeted to .NET Standard 1.1

I've been a long-time user of the WallF BaseNEncodings.Net library. It's the only library I'm aware of that was explicitly built to RFC 4648. I submitted a pull request to retarget the library to netstandard1.1, but received no response from the author. While I had hoped the project was still maintained, that seems not to be the case. To that end, I have forked the library and republished it. The new repo is https://github.com/SapientGuardian/BaseNEncodings.Net. Happy Encoding!

Live Patching of Cisco Webex to Stop the Popup

The company I work for recently started using Cisco WebEx, an online collaboration application. It's been configured with a rather annoying behavior - upon completion of every meeting, it launches our company web page in my browser. This is, of course, the last thing anyone would want to do after the conclusion of their meeting.

After diplomacy with the IT department failed, my first attempt at stopping this was to patch the WebEx binaries to remove the offending routine. This was successful, but the module was overwritten by some sort of auto-update process after a few days of use.

To avoid having the patch inadvertently removed, I developed a small Windows service that waits for WebEx to be launched and patches it in memory to remove the code that spawns the popup.

I have uploaded the source code to the service at https://github.com/SapientGuardian/WebExPopupKiller for anyone curious as to how to write such a thing.