Toger Blog

Chef Too Big for the Kitchen

While pondering running a full ‘ride-the-wave’ auto-scaling solution in AWS, I looked closely at my Chef installation. The environment is very Chef-heavy, with a fairly generic AMI that runs a slew of Chef recipes to bring each node up to the needs of its particular role. The nodes are in Autoscale groups and get their role designation from the Launch Configuration.

On average a node invoked approximately 55 recipes (as recorded by seen_recipes in the audit cookbook). Several of those recipes pull in resources from (authenticated) remote locations that have very good availability but are not under my direct control. Even ignoring the remote-based recipes, there is still a significant number of moving parts that can be disrupted unexpectedly, such as by other cookbook / role / KV store changes. This is acceptable-if-not-ideal when nodes are generally brought into being under the supervision of an operator who can resolve any issues, or when the odd 1-of-x00 pooled nodes dies and is automatically replaced. The risk is manageable when the environment is perpetually scaled for peak traffic.

However, when critical nodes are riding the wave of capacity, the chance that something will eventually break during scale-up and cause the ‘wave’ to swamp the application becomes 100%. That leaves an operator to fix the problem under a significant time crunch while the application is overwhelmed by traffic, which is hardly a recipe for success. The more likely scenario is breakage at some odd hour of the morning as users wake up, with the application failing before an operator can intervene to keep it alive.

I looked at my Chef construction and realized it was less Infrastructure As Code (IaC) and more like Compile In Production (CIP).

AWS ECS and Docker Exit (137)

I ran into this the other day: my ECS containers were dying off and docker ps showed Exited (137) About a minute ago. Looking at docker inspect I noticed:

"State": {
 "FinishedAt": "2015-09-20T21:38:58.188768082Z",
    "OOMKilled": true
  },

This tells me that ECS / Docker enforced the memory limit for the container, and the out-of-memory killer killed off the contained processes. Raising the ECS memory limit for this container resolved the issue.
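
The hard limit lives in the task definition's container definition, with the memory value expressed in MiB. A minimal sketch of the relevant fragment, where the container name, image, and the 1024 MiB figure are placeholders rather than my actual task definition:

"containerDefinitions": [
  {
    "name": "my-service",
    "image": "my-repo/my-service:latest",
    "memory": 1024,
    "essential": true
  }
]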

Carl’s Jr, SPF and AWS

Carl’s Jr has a nifty nutritional calculator / order planner at http://www.carlsjr.com/menu/nutritional_calculator. It lets you fully customize your meal, then lets you print or email your order to yourself with all the magic words to say to get your meal as planned (subbing cheese, extra / 2x / no onion, etc).

Tonight I used this to pre-assemble a highly customized meal for my family. I triggered it to send me an email (easier to read on my phone at the order window) and anxiously awaited.

No email was received.

I host my own email and use http://rollernet.us for my public incoming MX relays; they are nifty as they have a ton of highly configurable anti-spam features that I can apply ‘at the edge’, which lets my actual mailserver run much leaner since SpamAssassin et al. are resource intensive.

Rollernet logs stated:

Connection from 54.236.226.113 rejected by mail.rollernet.us
From: carlsjr@carlsjr.com
To: my_email  
Reason: SPF fail (Mechanism -all matched)

Oh ho! I temporarily disabled SPF checking (and greylisting) and sent another meal through, and the email header said:

Received-SPF: fail (carlsjr.com: Sender is not authorized by default to use 'carlsjr@carlsjr.com' in 'mfrom' identity (mechanism '-all' matched)) receiver=mail2.rollernet.us; identity=mailfrom; envelope-from="carlsjr@carlsjr.com";
        helo=ip-10-198-0-85.localdomain; client-ip=54.236.168.30

I fetched their SPF record with http://www.kitterman.com/spf/validate.html (though any DNS query tool would work) and received:

v=spf1 ip4:63.168.109.0/24 ip4:67.203.173.0/26 ip4:216.87.35.224/27 mx include:spf.protection.outlook.com -all

This tells me they use outlook.com for their internal email and only allow a few subnets to originate mail from them, and the source IP I saw was not one of them. 54.236.168.30 resolves to ec2-54-236-168-30.compute-1.amazonaws.com, and www.carlsjr.com resolves to what appears to be a CloudFormation-based AWS environment:

$ host www.carlsjr.com
www.carlsjr.com is an alias for CKEMKTPRDLB-20130419-1810707626.us-east-1.elb.amazonaws.com.
CKEMKTPRDLB-20130419-1810707626.us-east-1.elb.amazonaws.com has address 54.236.231.179
CKEMKTPRDLB-20130419-1810707626.us-east-1.elb.amazonaws.com has address 52.1.95.39

I suspect their AWS-based web farm is generating the outgoing mails directly, and they have not accounted for that in their SPF configuration. Using Amazon Simple Email Service (SES) would have accounted for this already (http://docs.aws.amazon.com/ses/latest/DeveloperGuide/authenticate-domain.html). This could also be handled by designating an outgoing mail host in their AWS environment with an Elastic IP attached and adding that IP to their SPF record.
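
For illustration, either fix comes down to one more mechanism in the TXT record. The extra include (if mail went out via SES) or ip4 entry (if they stood up a relay behind an Elastic IP; 203.0.113.25 is just a placeholder) would look something like:

v=spf1 ip4:63.168.109.0/24 ip4:67.203.173.0/26 ip4:216.87.35.224/27 mx include:spf.protection.outlook.com include:amazonses.com -all

v=spf1 ip4:63.168.109.0/24 ip4:67.203.173.0/26 ip4:216.87.35.224/27 ip4:203.0.113.25 mx include:spf.protection.outlook.com -all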

I sent them an email detailing the issue at their corporate email address. I’ll update if I hear back, but I don’t expect it will ever reach anyone who knows what to do with it.

Update: I got an email back stating the issue was being routed ‘to the appropriate department’.

Varnish Cache and req.backend.healthy

An odd issue I ran into the other day: I had a Varnish 3 instance with logic hinging on req.backend.healthy to show a special error page if all the backends were down. That logic inexplicably triggered even though all my backends were up! After much head-scratching I identified the issue: one of my historical VCLs was still loaded and no longer had any healthy backends (due to repeated autoscaling up / down); although the current definition of that director had healthy backends, the historical one did not. Varnish has a habit of not letting go of old VCLs even if you run vcl.discard on them. So req.backend.healthy will show the director as down if any prior definition of that director is down. Since the only way to definitively remove old VCLs from memory is a restart (which flushes the memory cache), this makes req.backend.healthy fairly unreliable.
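
For context, the kind of logic involved looks roughly like this in Varnish 3 VCL (a minimal sketch, not my actual config; app_director is a stand-in name):

sub vcl_recv {
    set req.backend = app_director;
    # If every backend behind the director is sick, short-circuit to an error page
    if (!req.backend.healthy) {
        error 503 "All backends down";
    }
}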

This is in v3 and may not apply to v4 anymore.

2FA and Minecraft Server

Two Factor Authentication (2FA) is an additional layer of protection you can add to your Minecraft server. You should already be relying on SSH keys to access your server, but those keys can be lost or leaked if you use them on untrusted machines. 2FA protects you from hacks resulting from someone gaining access to your password or SSH key.

DuoSecurity is a company that provides a 2FA service that is free for personal use. A mobile push notification is sent via their Android / iOS app whenever a login is attempted, and it requires an affirmative response before the login can proceed. There are also SMS and computer-generated voice call options, but those consume credits that cost money to refill. There is also a cost for going over 10 users, which should not be an issue for administration of a Minecraft server.

I performed a source install to get the latest version; there are also packages for RHEL/CentOS/Debian/Ubuntu on their website.

They offer an SSH-specific installation and a PAM installation that covers all auth on the machine. PAM is likely the more comprehensive solution but is a more involved process, and it requires an SELinux policy update (the server resulting from this series has SELinux enabled). Their website tells you how to install the relevant SELinux policy, but the necessary objects are missing from their download. I've sent them an email and will update if I get clarification. For the record, the error I got was:

make: Entering directory `/home/centos/duo_unix-1.9.14/pam_duo'
checkmodule -M -m -o authlogin_duo.mod authlogin_duo.te
checkmodule:  loading policy configuration from authlogin_duo.te
checkmodule:  unable to open authlogin_duo.te
make: [semodule] Error 1 (ignored)
semodule_package -o authlogin_duo.pp -m authlogin_duo.mod
semodule_package:  Could not open file No such file or directory:  authlogin_duo.mod
make: [semodule] Error 1 (ignored)
semodule -i authlogin_duo.pp
semodule:  Failed on authlogin_duo.pp!
make: [semodule] Error 1 (ignored)
make: Leaving directory `/home/centos/duo_unix-1.9.14/pam_duo'

SSH setup is straightforward. Install the DuoSecurity app on your mobile device. Create an account on their website. Navigate their portal and select Applications, then +New Application. Give it a name of Minecraft SSH. This will result in a window showing an Integration key, a Secret key (requiring a click to show), and an API hostname. Collect those and store them off to the side (securely!).

https://www.duosecurity.com/docs/duounix#instructions describes the compile process:

wget https://dl.duosecurity.com/duo_unix-latest.tar.gz
tar zxf duo_unix-latest.tar.gz
cd duo_unix-1.9.14
./configure --prefix=/usr && make && sudo make install

After installing the binaries, modify /etc/duo/login_duo.conf and fill in the ikey/skey/host values from above. Uncomment pushinfo as well so that more details about the login are sent to you.
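
The file is a small INI-style config; a minimal sketch with placeholder values (the real ikey / skey / host come from the Duo panel):

[duo]
; Integration key, secret key, and API hostname from the Duo admin panel
ikey = DIXXXXXXXXXXXXXXXXXX
skey = (secret key goes here)
host = api-XXXXXXXX.duosecurity.com
; Send command and hostname details along with the push notification
pushinfo = yes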

https://www.duosecurity.com/docs/duounix#centos describes modifying SSH to work with the new system. Keep a separate root-logged-in session open in another window while attempting this so you can back out if you make a mistake. Do not log out of the ‘backup’ window until you’ve thoroughly exercised the system. I will not be responsible if you lock yourself out.

Add ForceCommand /usr/sbin/login_duo to the end of /etc/ssh/sshd_config, then restart sshd (which does not kick out existing sessions) with systemctl restart sshd.service
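
In shell form (assuming the stock CentOS 7 sshd_config location):

echo 'ForceCommand /usr/sbin/login_duo' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart sshd.service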

Now attempt to log in to your account. You will be prompted to enroll in 2FA. Upon logging in again you will be prompted for your 2FA method, which will look like:

$  ssh minecraft.example.com -i ~/.ssh/minecraft.pem -l centos
Duo two-factor login for centos

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-1234
 2. Phone call to XXX-XXX-1234
 3. SMS passcodes to XXX-XXX-1234

I always choose #1 as I do not want to expend credits, though if push failed I could resort to the other two. When I choose #1 I get a popup on my phone asking me to accept / deny the login, and after choosing Accept I am able to log in.

Now I do not have to worry that someone will get ahold of my minecraft.pem and gain access to the server. Further, if I get a login request on my phone and haven't attempted to log in, I know my key has been compromised.

DSL Modem and Power Strips

I had cause to call CenturyLink support recently. As one of the troubleshooting steps they seriously claimed that a power strip is not capable of providing sufficient power to a DSL modem, and that it must be connected directly to the wall to receive sufficient power.

I suspect this was to force me to unplug and powercycle the modem, but I had done that several times already. The idea that a DSL modem draws more power than will flow through a power strip is ridiculous.

Minecraft and Datadog Monitoring

DataDog is a nifty monitoring / statistics-gathering system. It is akin to a combination of Graphite / Grafana, but with a social aspect so that your team can attach discussions to a given point in time. They have a free tier that retains data for a day, which is handy for visualizing the state of the Minecraft server.

Java applications normally expose their statistics via JMX. I did not see anything Minecraft-specific in my stock instance, but Java itself exposes several counters that are informative.

I created my Datadog account, procured my API key, and installed the agent with:

DD_API_KEY=MyAPIKey  bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"

JMX is not enabled by default for Java processes, so I updated my systemd unit file in /etc/systemd/system/minecraft.service to include the JMX configuration:
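
A minimal sketch of that change, assuming the server jar lives in /opt/minecraft and using an arbitrary localhost-only JMX port of 7199:

[Service]
ExecStart=/usr/bin/java -Xmx1024M -Xms1024M \
    -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=7199 \
    -Dcom.sun.management.jmxremote.local.only=true \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false \
    -jar /opt/minecraft/minecraft_server.jar nogui

On the agent side, the Datadog JMX check (jmx.yaml under the agent's conf.d directory in the dd-agent of that era) then needs to point at the same port.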

Minecraft and NewRelic

NewRelic is an exceptionally useful tool for monitoring Java applications, or at least those that deal with web or other transactional workloads. I tried hooking it up to Minecraft and it doesn't report anything. The free version doesn't let me look at the JVM stats (threads and such), so it appears to be a waste. However, they also provide a general Unix server agent that does provide some nifty dashboards. The procedure to install it is:

rpm -Uvh https://yum.newrelic.com/pub/newrelic/el5/x86_64/newrelic-repo-5-3.noarch.rpm
yum install newrelic-sysmond
nrsysmond-config --set license_key=YourLicenseKey
/etc/init.d/

The RPM is EL5-era and doesn't understand SystemD, so I created a unit file:
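
A minimal sketch of such a unit, assuming the stock package layout (daemon at /usr/sbin/nrsysmond, config at /etc/newrelic/nrsysmond.cfg, PID file under /var/run/newrelic/):

[Unit]
Description=New Relic Server Monitor
After=network.target

[Service]
Type=forking
ExecStart=/usr/sbin/nrsysmond -c /etc/newrelic/nrsysmond.cfg
PIDFile=/var/run/newrelic/nrsysmond.pid
Restart=on-failure

[Install]
WantedBy=multi-user.target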

Minecraft and SystemD

In the first installment I launched a basic Minecraft service on CentOS 7. However, a proper service should not be run from the command line; instead it should be controlled by the system service daemon. In years past this would mean writing a ‘SysV init script’, which would try to determine whether the process was running, launch it if not, capture its PID for future reference, and capture its output to a file. CentOS 7 has switched away from that model to one called SystemD, which makes much of that easier. There is some controversy over the SystemD model (is it UNIX-y? Too monolithic? Taking over everything?) but it seems pretty handy for what it needs to do, plus it has some nice security features.

So I will create a Minecraft service definition for SystemD. I used http://0pointer.de/blog/projects/systemd-for-admins-3.html to help me with this. The unit file will look like:
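
A minimal sketch of such a unit, assuming a dedicated minecraft user and an install under /opt/minecraft (memory sizes are placeholders):

[Unit]
Description=Minecraft Server
After=network.target

[Service]
User=minecraft
WorkingDirectory=/opt/minecraft
ExecStart=/usr/bin/java -Xmx1024M -Xms1024M -jar minecraft_server.jar nogui
Restart=on-failure

[Install]
WantedBy=multi-user.target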

Minecraft Server in AWS Done Too Well

A new Server

This series will go through how to host a Minecraft server, and go totally overboard on the configuration / management of it. I'll be integrating a variety of management / monitoring tools that go far beyond the needs of the average ‘friends & family’ server, because it is fun. I'll start with the basics and build up from there.

So, first we need a machine. I'm going to use an Amazon AWS machine for this. I'll be using some AWS-specific features, but I don't think any of them will be critical. In some cases I'll show the non-AWS alternative.